Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis and J. van Leeuwen
1891
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Singapore Tokyo
Arlindo L. Oliveira (Ed.)
Grammatical Inference: Algorithms and Applications 5th International Colloquium, ICGI 2000 Lisbon, Portugal, September 11-13, 2000 Proceedings
Series Editors
Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA
Jörg Siekmann, University of Saarland, Saarbrücken, Germany
Volume Editor
Arlindo L. Oliveira
INESC / IST, R. Alves Redol 9, 1000 Lisbon, Portugal
E-mail: [email protected]

Cataloging-in-Publication Data applied for

Die Deutsche Bibliothek - CIP-Einheitsaufnahme
Grammatical inference: algorithms and applications : 5th international colloquium ; proceedings / ICGI 2000, Lisbon, Portugal, September 11-13, 2000. Arlindo L. Oliveira (ed.). - Berlin ; Heidelberg ; New York ; Barcelona ; Hong Kong ; London ; Milan ; Paris ; Singapore ; Tokyo : Springer, 2000
(Lecture notes in computer science ; Vol. 1891 : Lecture notes in artificial intelligence)
ISBN 3-540-41011-2
CR Subject Classification (1998): I.2, F.4.2-3, I.5.1, I.5.4, J.5, F.2 ISBN 3-540-41011-2 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH c Springer-Verlag Berlin Heidelberg 2000 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin, Stefan Sossna Printed on acid-free paper SPIN 10722523 06/3142 543210
Preface
The Fifth International Colloquium on Grammatical Inference (ICGI-2000) was held in Lisbon on September 11-13th, 2000. ICGI-2000 was the fifth in a series of successful biennial international conferences in the area of grammatical inference. Previous conferences were held in Essex, U.K.; Alicante, Spain; Montpellier, France; and Ames, Iowa, USA. This series of meetings seeks to provide a forum for the presentation and discussion of original research on all aspects of grammatical inference. Grammatical inference, the process of inferring a grammar from given data, is a field that is not only challenging from a purely scientific standpoint but also finds many applications in real-world problems. Despite the fact that grammatical inference addresses problems in a relatively narrow area, it uses techniques from many domains and intersects a number of different disciplines. Researchers in grammatical inference come from fields as diverse as machine learning, theoretical computer science, computational linguistics, pattern recognition and artificial neural networks. From a practical standpoint, applications in areas such as natural language acquisition, computational biology, structural pattern recognition, information retrieval, text processing and adaptive intelligent agents have either been demonstrated or proposed in the literature. ICGI-2000 was held jointly with CoNLL-2000, the Computational Natural Language Learning Workshop, and LLL-2000, the Second Learning Language in Logic Workshop. The technical program included the presentation of 24 accepted papers (out of 35 submitted) as well as joint sessions with CoNLL and LLL. A tutorial program organized by Gabriel Pereira Lopes took place after the meetings and included tutorials by Raymond Mooney, Gregory Grefenstette, Walter Daelemans, António Ribeiro, Joaquim Ferreira da Silva, Gael Dias, Nuno Marques, Vitor Rossio, João Balsa and Alexandre Agostini. The joint realization of these events represents a unique opportunity for researchers in these related fields to interact and exchange ideas. I would like to thank Claire Nédellec, Claire Cardie, Walter Daelemans, Colin de la Higuera and Vasant Honavar for their help in several aspects of the organization; the members of the technical program committee and the reviewers for their careful evaluation of the submissions; the members of the local organizing committee, Ana Teresa Freitas and Ana Fred, for their help in setting up the event; and Ana de Jesus for her invaluable secretarial support.
September 2000
Arlindo Oliveira Technical Program Chair
Technical Program Committee
Pieter Adriaans, Syllogic/University of Amsterdam, The Netherlands
Michael Brent, Johns Hopkins University, USA
Walter Daelemans, Tilburg University, The Netherlands
Pierre Dupont, University de St. Etienne, France
Dominique Estival, Syrinx Speech Systems, Australia
Ana Fred, Lisbon Technical University, Portugal
Jerry Feldman, ICSI, Berkeley, USA
Lee Giles, NEC Research Institute, USA
Colin de la Higuera, EURISE, University de St. Etienne, France
Vasant Honavar, Iowa State University, USA
Laurent Miclet, ENSSAT, France
G. Nagaraja, Indian Institute of Technology, India
Jacques Nicolas, IRISA, France
Arlindo Oliveira, INESC/IST, Portugal
Jose Oncina Carratala, Universidade de Alicante, Spain
Rajesh Parekh, Allstate Research and Planning Center, USA
Lenny Pitt, University of Illinois at Urbana-Champaign, USA
Yasubumi Sakakibara, Tokyo Denki University, Japan
Arun Sharma, University of New South Wales, Australia
Giora Slutzki, Iowa State University, USA
Esko Ukkonen, University of Helsinki, Finland
Stefan Wermter, University of Sunderland, UK
Enrique Vidal, University Politecnica de Valencia, Spain
Thomas Zeugmann, Kyushu University, Japan
Organizing Committee
Conference Chair: Arlindo Oliveira, INESC/IST
Tutorials: Gabriel Pereira Lopes, Universidade Nova de Lisboa
Local Arrangements: Ana Fred, Lisbon Technical University
Social Program: Ana Teresa Freitas, INESC/IST
Secretariat: Ana de Jesus, INESC

Additional Reviewers
Daniel Gildea, Mitch Harris, Satoshi Kobayashi, Eric Martin, Franck Thollard, Takashi Yokomori
Table of Contents
Inference of Finite-State Transducers by Using Regular Grammars and Morphisms . . . . 1
Francisco Casacuberta
Computational Complexity of Problems on Probabilistic Grammars and Transducers . . . . 15
Francisco Casacuberta and Colin de la Higuera

Efficient Ambiguity Detection in C-NFA, a Step Towards the Inference of Non Deterministic Automata . . . . 25
François Coste and Daniel Fredouille

Learning Regular Languages Using Non Deterministic Finite Automata . . . . 39
François Denis, Aurélien Lemay, and Alain Terlutte

Smoothing Probabilistic Automata: An Error-Correcting Approach . . . . 51
Pierre Dupont and Juan-Carlos Amengual

Inferring Subclasses of Contextual Languages . . . . 65
J.D. Emerald, K.G. Subramanian, and D.G. Thomas

Permutations and Control Sets for Learning Non-regular Language Families . . . . 75
Henning Fernau and José M. Sempere

On the Complexity of Consistent Identification of Some Classes of Structure Languages . . . . 89
Christophe Costa Florêncio

Computation of Substring Probabilities in Stochastic Grammars . . . . 103
Ana L. N. Fred

A Comparative Study of Two Algorithms for Automata Identification . . . . 115
P. García, A. Cano, and J. Ruiz

The Induction of Temporal Grammatical Rules from Multivariate Time Series . . . . 127
Gabriela Guimarães

Identification in the Limit with Probability One of Stochastic Deterministic Finite Automata . . . . 141
Colin de la Higuera and Franck Thollard

Iterated Transductions and Efficient Learning from Positive Data: A Unifying View . . . . 157
Satoshi Kobayashi
An Inverse Limit of Context-Free Grammars - A New Approach to Identifiability in the Limit . . . . 171
Pavel Martinek

Synthesizing Context Free Grammars from Sample Strings Based on Inductive CYK Algorithm . . . . 186
Katsuhiko Nakamura and Takashi Ishiwata

Combination of Estimation Algorithms and Grammatical Inference Techniques to Learn Stochastic Context-Free Grammars . . . . 196
Francisco Nevado, Joan-Andreu Sánchez, and José-Miguel Benedí

On the Relationship between Models for Learning in Helpful Environments . . . . 207
Rajesh Parekh and Vasant Honavar

Probabilistic k-Testable Tree Languages . . . . 221
Juan Ramón Rico-Juan, Jorge Calera-Rubio, and Rafael C. Carrasco

Learning Context-Free Grammars from Partially Structured Examples . . . . 229
Yasubumi Sakakibara and Hidenori Muramatsu

Identification of Tree Translation Rules from Examples . . . . 241
Hiroshi Sakamoto, Hiroki Arimura, and Setsuo Arikawa

Counting Extensional Differences in BC-Learning . . . . 256
Frank Stephan and Sebastiaan A. Terwijn

Constructive Learning of Context-Free Languages with a Subpansive Tree . . . . 270
Noriko Sugimoto, Takashi Toyoshima, Shinichi Shimozono, and Kouichi Hirata

A Polynomial Time Learning Algorithm of Simple Deterministic Languages via Membership Queries and a Representative Sample . . . . 284
Yasuhiro Tajima and Etsuji Tomita

Improve the Learning of Subsequential Transducers by Using Alignments and Dictionaries . . . . 298
Juan Miguel Vilar

Author Index . . . . 313
Inference of Finite-State Transducers by Using Regular Grammars and Morphisms

Francisco Casacuberta*

Departamento de Sistemas Informáticos y Computación, Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, 46071 Valencia, Spain.
[email protected]

* This work has been partially funded by the European Union under grant IT-LTR-OS-30268.
Abstract. A technique to infer finite-state transducers is proposed in this work. This technique is based on the formal relations between finitestate transducers and regular grammars. The technique consists of: 1) building a corpus of training strings from the corpus of training pairs; 2) inferring a regular grammar and 3) transforming the grammar into a finite-state transducer. The proposed method was assessed through a series of experiments within the framework of the EuTrans project.
1 Introduction
Formal transducers [8] give rise to an important framework in syntactic pattern recognition [20]. Many tasks in automatic speech recognition can be viewed as simple translations from acoustic sequences to sub-lexical or lexical sequences (acoustic-phonetic decoding) or from acoustic or lexical sequences to sequences of commands to a data-base management system or to a robot (semantic decoding). Another similar application is the recognition of continuous handwritten characters. Other, more complex applications of formal transducers are language translations (e.g. English to Spanish) [21,1] from text to text, from speech to text or speech [1] or from continuous handwritten characters to text, etc.
Regular transductions [2] constitute an important class within the formal translation field. Regular transduction involves regular or finite-state machines to deal with the input and output languages that are defined in a formal translation. Even though these translations are much more limited than other more powerful ones, the computational costs of the algorithms that are needed to deal with them are much lower. One of the main interests in finite-state machines for translation comes from the fact that these machines can be learned automatically from examples [20]. However, there are few techniques that infer finite-state transducers [16,14,19,12]. Nevertheless, there are a number of techniques to infer regular grammars from finite sets of learning strings that have been used successfully in automatic speech recognition [20]. Some of these techniques
are based on results from formal language theory. In particular, complex regular grammars can be built by inferring simple grammars that recognize local languages [9].
A finite-state transducer, or regular syntax-directed translation scheme, T, is a tuple <N, Σ, ∆, R, S>, where N is a finite set of non-terminal symbols or states, S is the initial state, Σ is a finite set of input terminal symbols, ∆ is a finite set of output terminal symbols (Σ ∩ ∆ = ∅) and R is a set of rules of the form A → aB, ωB or A → a, ω for A, B ∈ N, a ∈ Σ, ω ∈ ∆*. A pair (x, y) ∈ Σ* × ∆* is a translation pair if there is a translation form t(x, y) in T:

    t(x, y) : (S, S) → (x1 A1, y1 A1) → (x1 x2 A2, y1 y2 A2) → ... → (x, y)

A regular translation is the set of all translation pairs for some finite-state transducer.
A stochastic finite-state transducer, TP, is a tuple <N, Σ, ∆, R, S, P>, where N, Σ, ∆, R, S are defined as above and P : R → ℝ+ is a function such that (λ is the empty string):

    ∑_{(A→aB,ωB)∈R, B∈N∪{λ}} P(A → aB, ωB) = 1,   ∀A ∈ N
The probability of a translation pair (x, y) ∈ Σ* × ∆* according to TP is defined as:

    Pr_TP(x, y) = ∑_{t(x,y)} Pr_TP(t(x, y))

and the corresponding probability of a translation form is:

    Pr_TP(t(x, y)) = P(S → x1 A1, y1 A1) P(A1 → x2 A2, y2 A2) ... P(A_{|x|−1} → x_{|x|}, y_{|x|})

In the statistical translation framework, given an input string x from Σ*, the probabilistic translation of x in ∆* is the ŷ ∈ ∆* that verifies¹

    ŷ = argmax_{y∈∆*} Pr(y | x) = argmax_{y∈∆*} Pr(x, y)

With a stochastic finite-state transducer TP, the probabilistic translation of x is:

    ŷ = argmax_{y∈∆*} Pr_TP(x, y)

¹ For the purpose of simplicity, we will denote Pr(X = x) as Pr(x) and Pr(X = x | Y = y) as Pr(x | y).
The search for the optimal ŷ in the last equation has been shown to be a difficult computational problem [7]. In practice, the so-called Viterbi probability of a translation is used:

    V_TP(x, y) = max_{t(x,y)} Pr_TP(t(x, y))

An approximate stochastic translation can then be computed as:

    ỹ = argmax_{y∈∆*} V_TP(x, y)
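As an illustration only, the following sketch computes ỹ by maximizing over single translation forms. The transition-list representation and all names are assumptions made for the example, not a construction taken from the paper.

```python
# Sketch: Viterbi-style approximate translation with a stochastic finite-state
# transducer. The representation (a dict mapping a state to its outgoing
# transitions) and all names are illustrative assumptions.

def viterbi_translation(transitions, start, x):
    """Return (best probability, output string) over single translation forms.

    transitions: {state: [(input_symbol, next_state_or_None, output_string, prob)]}
                 next_state None marks a final rule A -> a, w.
    """
    # best[state] = (Viterbi probability, output produced so far)
    best = {start: (1.0, "")}
    for i, a in enumerate(x):
        last = (i == len(x) - 1)
        new_best = {}
        for state, (p, out) in best.items():
            for (sym, nxt, w, q) in transitions.get(state, []):
                if sym != a:
                    continue
                if last != (nxt is None):   # final rules only on the last symbol
                    continue
                cand = (p * q, out + w)
                key = nxt                   # None collects completed derivations
                if key not in new_best or cand[0] > new_best[key][0]:
                    new_best[key] = cand
        best = new_best
    return best.get(None, (0.0, None))
```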
The computation of ỹ can be performed with a polynomial algorithm [6].
Finite-state transducers present properties similar to the ones exhibited by regular languages [2]. One of these properties can be stated through the following theorems:

Theorem 1. T ⊆ Σ* × ∆* is a regular translation if and only if there exist an alphabet Γ, a regular language L ⊂ Γ* and two morphisms hΣ : Γ* → Σ* and h∆ : Γ* → ∆* such that T = {(hΣ(w), h∆(w)) | w ∈ L}.

Other theorems relate regular translations to local languages, which are special cases of regular languages:

Theorem 2. T ⊆ Σ* × ∆* is a regular translation if and only if there exist an alphabet Γ, a regular local language L ⊂ Γ* and two alphabetic morphisms hΣ : Γ* → Σ* and h∆ : Γ* → ∆* such that T = {(hΣ(w), h∆(w)) | w ∈ L}.

A stochastic version of the first theorem was proposed in [15]. However, the probabilistic distributions associated with finite-state transducers are usually obtained by specific re-estimation algorithms once the finite-state transducer has been inferred [6]. In [12], an alternative procedure is proposed to infer stochastic finite-state transducers from a statistical alignment translation model.
In the following section, an inference method for stochastic finite-state transducers, based on the above theorems, is proposed. A very preliminary non-stochastic version was presented in [19]. This first proposal had an important drawback: the methods proposed for building sentences from Γ* did not deal adequately with the dependencies between the words of the input sentences and the words of the corresponding output sentences. In the following section, we propose the use of statistical alignments [5] to overcome this drawback. On the other hand, the method proposed in the next section to learn stochastic finite-state transducers is based on the inference of n-grams [10] from the sentences in Γ*. Consequently, the inference method allows us to infer the structure (states and transitions) together with the associated probabilistic distributions from training pairs.
2 A Method to Infer Finite-State Transducers
The theorems enunciated in the previous section allow us to suggest the following general technique: given a finite sample P of string pairs (x, y) ∈ Σ* × ∆* (a parallel corpus),
1. Each training pair (x, y) is transformed into a string z from an extended alphabet Γ to obtain a sample S of strings (S ⊂ Γ*).
2. A (stochastic) regular grammar G is inferred from S.
3. The symbols (from Γ) of the grammar rules are transformed into input/output symbols (Σ* × ∆*).

This technique is illustrated in the schema of Fig. 1, which is very similar to the one proposed in [9] for the inference of regular grammars.
Fig. 1. The basic schemata for the inference of finite-state transducers. P is a finite sample of training pairs. S is a finite sample of extended strings. G is a grammar inferred from S such that S is a subset of the language generated by the grammar G (L(G)). E is a finite-state transducer whose translation (T (E)) includes the training sample P .
The main problem with this approach is the first step, i.e. transforming a parallel corpus into a string corpus. In general, there are many possible transformations, but the existence of complex alignments among words makes the design of the transformation a difficult problem. On the other hand, the third step could be conditioned by the first one. Consequently, the labelling process of the training pairs must capture the correspondences between words of the input sentences and words of the output sentences, and must allow for the implementation of the inverse labelling for the third step. An interesting way to deal with these problems is the use of the statistical translation framework to align parallel corpora [5].

2.1 Statistical Alignment Translation
The statistical models introduced by Brown et al. [5] are based on the concept of alignment between the components of translation pairs (x, y) ∈ Σ ? × ∆? (statistical alignment models). Formally, an alignment is a function a : {1, ..., |y|} → {0, ..., |x|}. The particular case a(j) = 0 means that the position j in y is not aligned with any position of x. All the possible alignments between y and x are denoted by A(x, y), and, the probability of translating a given x into y by an alignment a is denoted by Pr(y, a | x).
If m and n represent the lengths of y and x, respectively, the well-known Model 2 for Pr(y, a | x) proposed in [5] is

    Pr_M2(y, a | x) = l(m | n) · ∏_{i=1}^{m} t(y_i | x_{a(i)}) · α(a(i) | i; m, n)

where l(m | n) is the probability that the output y is of length m given that the input x is of length n; t(y_i | x_j) is the translation probability of the output symbol y_i given the input symbol x_j; and α(j | i; m, n) is the alignment probability that position i in the output string is aligned to position j in the input string. In the present case, from the above definitions [5],

    Pr(y | x) = ∑_{a∈A(x,y)} Pr(y, a | x) ≈ l(m | n) · ∏_{i=1}^{m} ∑_{j=0}^{n} t(y_i | x_j) · α(j | i; m, n)
The maximum-likelihood estimation of these distributions from a given training set can be found in [5]. Please note that, in the original work [5], Pr_M2 was proposed in the inverse mode, that is, Pr_M2(x, a | y). The main reason was to deal with the problem of modeling well-formed output strings by introducing a robust output language model. In that case, the statistical translation problem was formally established as ŷ = argmax_{y∈∆*} Pr(x | y) · Pr(y).
A variation of Model 2 allows us to obtain the best alignment between the components of a translation pair (x, y) ∈ Σ* × ∆*. In this case, the optimal alignment of (x, y) given the model M2 is

    â = argmax_{a∈A(x,y)} Pr(y, a | x)

and is obtained from a maximisation process (Viterbi alignment) [5]:

    â(i) = argmax_{0≤j≤n} ( t(y_i | x_j) · α(j | i; m, n) )   for 1 ≤ i ≤ m
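The Viterbi alignment above is easy to sketch once the Model 2 tables are available. In the sketch below, the translation table t and the alignment table α are assumed to be given as plain dictionaries (their maximum-likelihood estimation, described in [5], is not reproduced); all names and the interface are illustrative assumptions.

```python
# Sketch of the Viterbi alignment of Model 2: for each output position i,
# pick the input position j maximizing t(y_i | x_j) * alpha(j | i; m, n).
# Position 0 stands for the empty ("null") word, as in the model.

def viterbi_alignment(x, y, t, alpha):
    """x, y: lists of words; t[(f, e)] and alpha[(j, i, m, n)]: probabilities.
    Returns a(1..m) as a list of input positions (0 = not aligned)."""
    m, n = len(y), len(x)
    extended_x = [None] + x                  # x_0 is the "null" word
    a = []
    for i, yi in enumerate(y, start=1):
        best_j, best_p = 0, -1.0
        for j in range(0, n + 1):
            p = t.get((yi, extended_x[j]), 0.0) * alpha.get((j, i, m, n), 0.0)
            if p > best_p:
                best_j, best_p = j, p
        a.append(best_j)
    return a
```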
An example of the alignment between two strings is shown below.

Input: he hecho la reserva de una habitación doble con teléfono y televisión a nombre de Rosario Cabedo .
Output: I have made a reservation for a double room with a telephone and a tv for Rosario Cabedo .

By using the Viterbi alignment with a trained model:

Input: he hecho la reserva de una habitación doble con teléfono y televisión a nombre de Rosario Cabedo .
Aligned Output: I (1) have (1) made (2) a (3) reservation (4) for (5) a (6) double (8) room (7) with (9) a (0) telephone (10) and (11) a (0) tv (12) for (14) Rosario (16) Cabedo (17) . (18)
The graphic representation of the above alignment is presented in Fig. 2.
Fig. 2. A graphical representation of the alignment presented in the example of the text. Note the correspondence between the Spanish “la” and the English “a” and also note that the model does not allow for alignments between sets of two or more input words and one output word.
2.2 Description of the Proposed Technique to Infer Stochastic Finite-State Transducers
The proposed technique to infer finite-state transducers consists of the following three steps:

Step 1: Transformation of training pairs into strings. The first step consists of building a string of certain extended symbols from each training string pair and a statistical alignment between the corresponding input and output strings. The main idea is based on the assignment of each word from y to the corresponding word from x given by the alignment a. Sometimes, however, this assignment violates the sequential order of the words in y. In the above example, assigning the English word "double" to the Spanish word "doble" and the English word "room" to the Spanish word "habitación" implies a reordering of the words "double" and "room", and therefore a loss of important information. In order to prevent this problem, the proposed labelling is as follows.
Let x, y and a be an input string, an output string and an alignment function, respectively; z is the corresponding labelled string if ("+" is a symbol not in Σ or ∆):
- |z| = |x|
- for 1 ≤ i ≤ |z|:
    z_i = x_i + y_j + y_{j+1} + ... + y_{j+l}   if ∃j : a(j) = i, ¬∃j′ < j : a(j′) > a(j), and a(j″) ≤ a(j) for all j″ with j ≤ j″ ≤ j + l;
    z_i = x_i   otherwise.
The labelling procedure thus assigns each word from y to the corresponding word from x given by the alignment a if the output order is not violated; otherwise, the output word is assigned to the first input word that does not violate the output order. The set of extended symbols is a subset of Σ ∪ (Σ + D), where D ⊂ ∆* is the finite set of output substrings that appear in the training data.
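A minimal sketch of this labelling, assuming the strings and the alignment are given as Python lists, is shown below; with the alignment of Section 2.1 it reproduces the example string given next. The interface and names are assumptions made for illustration.

```python
# Sketch of Step 1 (extended symbols of type I): each output word y_j is
# attached to input position a(j) unless that would break the output order,
# in which case it is attached to the current position instead. Unaligned
# words (a(j) = 0) stay with the current position.

def label_type1(x, y, a):
    """x, y: lists of words; a: list with a[j-1] = aligned input position of y_j
    (1-based, 0 = unaligned). Returns the string of extended symbols."""
    buckets = [[] for _ in range(len(x) + 1)]   # buckets[i]: output words for x_i
    current = 1
    for j, word in enumerate(y, start=1):
        pos = a[j - 1]
        if pos == 0 or pos < current:           # keep the output order
            pos = current
        buckets[pos].append(word)
        current = pos
    return " ".join("+".join([xi] + buckets[i])
                    for i, xi in enumerate(x, start=1))
```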
The application of the procedure to the example given in Section 2.1 yields:

he+I+have hecho+made la+a reserva+reservation de+for una+a habitación doble+double+room con+with+a teléfono+telephone y+and+a televisión+tv a nombre+for de Rosario+Rosario Cabedo+Cabedo .+.

The extended symbols obtained by using this procedure are called extended symbols of type I.
Obviously, other assignments are possible. For example, after the application of the above procedure, consecutive isolated input words (without any output symbol) can be joined to the first extended symbol with an assigned output symbol.
Let x, y and a be an input string, an output string and an alignment function, respectively, and let z be the labelled string obtained from the above procedure. A new labelled string can be defined by the substitution of all the substrings z_k z_{k+1} ... z_{k+l} by z_k-z_{k+1}-...-z_{k+l}, for 1 ≤ k, k + l ≤ |z| ("+" and "-" are symbols not in Σ or ∆), whenever:
- z_{k′} ∈ Σ for all k′ with k ≤ k′ < k + l
- z_{k−1}, z_{k+l} ∉ Σ
A procedure to implement this second approach is straightforward from the previous one. The application of this algorithm to the above example gives:

he+I+have hecho+made la+a reserva+reservation de+for una+a habitación-doble+double+room con+with+a teléfono+telephone y+and+a televisión+tv a-nombre+for de-Rosario+Rosario Cabedo+Cabedo .+.

The extended symbols obtained by the last method are called extended symbols of type II. In the example, the differences between the two types of symbols are in "habitación doble+double+room", which becomes "habitación-doble+double+room", "a nombre+for", which becomes "a-nombre+for", and "de Rosario+Rosario", which becomes "de-Rosario+Rosario". In practice, many of these extended symbols define reasonable correspondences between pairs of input and output segments (substrings).

Step 2: Inferring the regular grammar. In the second step of the proposed procedure, a (stochastic) regular grammar is built from the strings produced in the first step. The so-called n-grams are particular cases of stochastic regular grammars that can be inferred from training samples with well-known techniques [18]. These models represent stochastic languages of strings x whose probability is given by

    Pr(x) = ∏_{i=1}^{|x|} Pr(x_i | x_{i−n+1}, ..., x_{i−1})
where x_k = $ if k ≤ 0, and $ is a symbol that is not in the alphabet. These probabilities can be estimated by counting the substrings of length n in a training set. In practice, substrings of length n that have not appeared in the training set can appear in the analysis of a new input string. To deal with this problem, smoothed n-grams are used; they are a type of combination of k-grams for all k ≤ n [10]. Obviously, other grammatical inference techniques could be used. However, in this work, n-grams are used due to the availability of a good public tool-kit to build smoothed (back-off) models [18]. On the other hand, the efficiency of smoothed n-grams has been proven successfully in some areas such as language modeling [18,10].

Step 3: Building the finite-state transducer. The process of transforming a grammar of extended symbols into a finite-state transducer is based on the application of the two morphisms: if a ∈ Σ and b1, b2, ..., bk ∈ ∆,

    hΣ(a+b1+b2+...+bk) = a
    h∆(a+b1+b2+...+bk) = b1 b2 ... bk

The procedure consists in transforming a rule, or transition, of the inferred regular grammar

    A → (a+b1+b2+...+bk) B,   where a ∈ Σ and b1, b2, ..., bk ∈ ∆,

into a transition of the finite-state transducer

    A → aB, b1 b2 ... bk B

This procedure is illustrated in Fig. 3. The probabilities associated with the transitions in the finite-state transducer are the same as those of the original stochastic regular grammar.
Fig. 3. An example of an inferred grammar of extended symbols, and the corresponding finite-state transducer obtained from the morphisms hΣ and h∆ .
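A minimal sketch of this inverse labelling for extended symbols of type I, under an assumed string encoding of the extended symbols ("a+b1+...+bk") and a dictionary representation of transitions, is:

```python
# Sketch of Step 3 for type I symbols: the two morphisms split an extended
# symbol "a+b1+...+bk" into the input symbol a and the output string b1...bk,
# turning a rule A -> (a+b1+...+bk) B into the transition A -> aB, b1...bk B
# with the same probability. Representation and names are illustrative.

def h_sigma(extended_symbol):
    return extended_symbol.split("+")[0]

def h_delta(extended_symbol):
    return extended_symbol.split("+")[1:]

def rule_to_transition(src, extended_symbol, dst, prob):
    return {"from": src, "input": h_sigma(extended_symbol),
            "output": h_delta(extended_symbol), "to": dst, "prob": prob}

# e.g. rule_to_transition("A", "habitacion+room", "B", 0.4) gives the
# transition A --habitacion / room--> B, with the (illustrative) probability
# 0.4 carried over unchanged from the grammar rule.
```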
When extended symbols of type II are used, rules of the form

    A → (a1-a2-...-al+b1+b2+...+bk) B,   with a1, ..., al ∈ Σ and b1, ..., bk ∈ ∆,

can appear in the inferred grammar. In this case, a grammar rule is transformed into a set of transitions of the finite-state transducer:

    A → a1 B1, b1 b2 ... bk B1
    B1 → a2 B2, B2
    ...
    B_{l−2} → a_{l−1} B_{l−1}, B_{l−1}
    B_{l−1} → al B, B

The probability associated with the first transition in the above transformation is the same as that of the original rule in the stochastic regular grammar. The probabilities of the rest of the transitions are set to 1.0.
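Under the same assumed encoding, the type II case ("a1-a2-...-al+b1+...+bk") expands into a chain of transitions in which only the first one emits output and carries the rule probability. The sketch below is illustrative, including the way fresh intermediate states are named.

```python
# Sketch of the type II expansion of a rule over an extended symbol
# "a1-a2-...-al+b1+...+bk" into a chain of transducer transitions.
# Representation and names are illustrative assumptions.

def expand_type2_rule(src, ext_symbol, dst, prob):
    inp, _, out = ext_symbol.partition("+")
    inputs = inp.split("-")                       # a1, ..., al
    outputs = out.split("+") if out else []       # b1, ..., bk
    transitions, prev = [], src
    for i, a in enumerate(inputs):
        nxt = dst if i == len(inputs) - 1 else f"{src}:{ext_symbol}:{i}"
        transitions.append({
            "from": prev, "input": a, "to": nxt,
            "output": outputs if i == 0 else [],  # only the first emits output
            "prob": prob if i == 0 else 1.0,      # and carries the probability
        })
        prev = nxt
    return transitions

# Example: expand_type2_rule("q3", "a-nombre+for", "q4", 0.5) yields two
# transitions: q3 --a / for--> q3:a-nombre+for:0 --nombre / (empty)--> q4.
```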
3 Experimental Results
Two tasks of different levels of difficulty were selected to assess the proposed inference method within the framework of the EuTrans project [11]: a Spanish-English task (EuTrans-I) and an Italian-English task (EuTrans-II). In all the experiments reported in this paper, the approximate stochastic translations of the input test strings were computed, and the word-error rate (WER) of the translations was used as the error criterion. The WER was computed as the minimum number of substitution, insertion and deletion operations that had to be performed in order to convert the string hypothesized by the translation system into a given reference word string [11].
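For reference, this WER is the standard word-level edit distance, usually normalized by the reference length and reported as a percentage. The following sketch makes that computation explicit; the function name and interface are illustrative assumptions.

```python
# Sketch of the word-error-rate computation: the minimum number of
# substitutions, insertions and deletions turning the hypothesis into the
# reference, divided by the reference length (the usual convention).

def word_error_rate(hypothesis, reference):
    h, r = hypothesis.split(), reference.split()
    # d[i][j] = edit distance between h[:i] and r[:j]
    d = [[0] * (len(r) + 1) for _ in range(len(h) + 1)]
    for i in range(len(h) + 1):
        d[i][0] = i
    for j in range(len(r) + 1):
        d[0][j] = j
    for i in range(1, len(h) + 1):
        for j in range(1, len(r) + 1):
            cost = 0 if h[i - 1] == r[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return 100.0 * d[len(h)][len(r)] / max(len(r), 1)
```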
3.1 Results Using the Corpus EuTrans-I
A Spanish-English corpus was generated semi-automatically for the EuTrans-I task which is a subtask of the “Traveller Task” [21]. The domain of the corpus is a human-to-human communication situation at a reception desk of a hotel. A summary of the corpus used in the experiments is given in Table 1. Extended symbols of type I: These first experiments corresponded to the use of extended symbols of type I and smoothed (back-off) n-grams as stochastic regular grammars. The results are presented in Table 2. The smoothed n-grams were built using the CMU Statistical Language Modeling Toolkit [18] and were represented by a stochastic grammar [13]. The number of transitions did not correspond to the number of free parameters of the model, since the weights were computed from a combination of the probability transitions and back-off weights [13]. The number of free parameters was approximately three times the number of states.
Table 1. The EuTrans-I task [1].

                          Spanish    English
Train:  Sentences          10,000
        Words              97,131     99,292
        Vocabulary            686        513
Test:   Sentences           2,996
        Words              35,023     35,590
        Bigram Perplexity     8.6        5.2
Extended symbols of type I from the a-priori segmented training pairs: We were able to segment the parallel training corpus due to the existence of some punctuation marks and some special words [11] (a-priori segmentation). The idea was to apply the statistical alignments only within each pair of segments and not to the entire sentences. The segments were shorter than the whole sentences; therefore, the alignment probability distributions were better estimated than for whole sentences. Extended symbols were built from these alignments. The strings of extended symbols corresponding to the segments of the same original string pair were concatenated. The best result achieved was a WER of 10.2% using smoothed five-grams. In this case, the results were slightly worse than the ones in Table 2. The main reason is that the corpus was generated semi-automatically and the statistical alignments on the whole sentences could capture the relation between words in a training pair quite well.

Table 2. Results with the standard corpus EuTrans-I. The regular models were smoothed n-grams for different values of n. The number of states and transitions of the transducer are also reported.

n-grams    states    transitions    WER
   2        2,911        34,106     13.2
   3       13,309       133,791     10.3
   4       33,245       300,843      9.7
   5       66,655       592,721      9.8

Extended symbols of type II: New experiments were performed in order to test extended symbols of type II. The WER for n = 3 was 23.2%, a result which was worse than that achieved using extended symbols of type I. One possible cause for this result could be the size of the resulting finite-state transducers, which were twice the size of the finite-state transducers obtained using extended symbols of type I. Consequently, the assigned probabilistic distributions were poorly estimated.

Summary of the results with the corpus EuTrans-I: The best result achieved using the proposed technique on EuTrans-I was a WER of 9.7%. This result was achieved by using single extended symbols built from alignments
which were defined from the output to the input strings, and from four-grams as stochastic regular grammars. This result was as good as the ones achieved by other finite-state techniques (a WER of 8.3% using Omega [11], another technique to infer some types of finite-state transducers) under similar experimental conditions. However, a statistical template technique allowed us to achieve a WER of 4.4% [11]. A WER of 13.9% was achieved [11] using a statistical alignment model (similar to the IBM Model 2 used for the alignments).

3.2 Results with the Corpus EuTrans-II
The EuTrans-II task consists of two corpora acquired in the EuTrans project [11]: an Italian-Spanish corpus and an Italian-English corpus, consisting of transcriptions of spoken dialogues within the framework of hotel reception desk person-to-person communications. A summary of the corpus (only from Italian to English) used in the experiments is given in Table 3.

Table 3. The EuTrans-II task.

                          Italian    English
Train:  Sentences           3,038
        Words              55,302     64,176
        Vocabulary          2,459      1,712
Test:   Sentences             300
        Words               6,121      7,243
        Bigram Perplexity      31         25
The same translation procedure and error criterium used for EuTrans-I were used for EuTrans-II. Extended symbols of type I: The first experiment performed with this corpus was similar to the first experiment reported for EuTrans-I. In this case, the best WER achieved was 43.0 % using smoothed bigrams. This result was worse than for EuTrans-I since the task was also more difficult (the perplexity of the first task was 5.2 and the perplexity of the second task was 25). Extended symbols of type I from the a-priori segmented training pairs: In this experiment, the training corpus was previously segmented as for EuTrans-I. The results are reported in Table 4. The results of this experiment were clearly better than the corresponding experiments with non-segmented training data. These experiments showed a behaviour which was opposite to the one for EuTrans-I. One possible reason is that this corpus was more spontaneous than the first one and, consequently, had a higher degree of variability. Moreover, the size of the training data was less than the corresponding data of EuTrans-I.
Table 4. Results with the standard corpus EuTrans-II. The regular models were smoothed n-grams [18] for different values of n. The training set was segmented using some a-priori knowledge. The statistical alignments were constrained to be within each parallel segment.

n-grams    states    transitions    WER
   2        7,988        77,453     27.2
   3       31,157       254,143     28.6
   4       66,507       472,518     28.3
   5      110,197       768,024     28.0
Extended symbols of type II: More experiments were carried out. One of them was designed to test extended symbols of type II and bigrams. The main results were a WER of 48.6% for segmented training and a WER of 77.0% for non-segmented training. In all of these experiments, the results were clearly worse than the results using extended symbols of type I and segmented training data.

Summary of the results using the corpus EuTrans-II: The best result achieved with the proposed technique on EuTrans-II was a WER of 27.2%. This result was achieved by using extended symbols of type I and a-priori segmentation of the training pairs. A smoothed bigram was the best regular model. This result was one of the best among those reported in [17]. The above statistical template technique achieved a WER of 25.1%, and a WER of 61.0% was achieved using the IBM Model 2.
4 Conclusions
The method proposed in this paper to infer stochastic finite-state transducers from stochastic regular grammars allowed us to achieve good results in two real translation problems with different levels of difficulty. The method seemed to work better than others when the training data was scarce. However, when the available training data was sufficient, the technique presented a behaviour which was similar to the other finite-state approaches. The results achieved by this method are due to:
1. the method of transforming training pairs into strings of extended symbols:
   a) it was based on a statistical alignment model;
   b) it preserved the order of the input string and the output string in each training pair.
2. the use of smoothed n-grams trained from extended symbols. These models proved that they can deal adequately with the problem of unseen strings in the training set.
This method could be improved by using more powerful statistical alignment models (for example, the so called IBM Model 3 and 4). Another way of improving this method could be by adding an accurate output language model to recover the possible output syntactic errors that can be produced in the translation process. Acknowledgements. The author wishes to thank the anonymous reviewers for their criticisms and suggestions.
References 1. J. C. Amengual, J. B. Bened´ı, F. Casacuberta, A. Casta˜ no, A. Castellanos, V.M.Jim´enez, D. Llorens, A. Marzal, M. Pastor, F. Prat, E. Vidal and J. M. Vilar: The EuTrans-I speech translation system. To be published in Machine Translation, 2000. 2. J. Berstel: Transductions and context-free languages. B. G. Teubner Stuttgart, 1979. 3. P. Brown, J. Cocke, S. Della Pietra, V. Della Pietra, J. Jelinek, J. Lafferty, R. Mercer and P. Roossin: A statistical approach to machine translation, Computational Linguistics, Vol. 16, N. 2, pp. 79–85, 1990. 4. P.F. Brown, J.C. Lai and R.L. Mercer: Aligning sentences in parallel corpora, 29th Annual Meeting of the ACL, pp. 169–176, 1991. 5. P. Brown, S. Della Pietra, V. Della Pietra and R. Mercer: The mathematics of statistical machine translation: parameter estimation, Computational Linguistics, Vol. 19, N. 2, pp. 263–310, 1993. 6. F. Casacuberta: Maximum mutual information and conditional maximum likelihood estimations of stochastic syntax-directed translation schemes, in: Grammatical inference: learning syntax from sentences, L. Miclet and C. de la Higuera (eds), Lecture Notes in Artificial Intelligence, Vol. 1147, Springer-Verlag, Berlin, pp. 282–291, 1996. 7. F. Casacuberta and C. de la Higuera: Computational complexity of problems on probabilistic grammars and transducers, Proceedings of the 5th International Colloquium on Grammatical Inference. 2000. 8. K.S. Fu: Syntactic pattern recognition and applications, Prentice-Hall, Englewood Cliffs, NJ. 1982. 9. P. Garc´ıa, E. Vidal and F. Casacuberta: Local languages, the succesor method and a step towards a general methodology for the inference of regular grammars, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 9, No. 6, pp. 841– 844, 1987. 10. H. Ney, S. Martin and F. Wessel: Statistical language modeling using leaving-oneout, in Corpus-based methods in language and speech processing, Chap. 6, Kluwer Academic Publishers, 1997 11. Instituto Tecnol´ ogico de Inform´tica, Fondazione Ugo Bordoni, Rheinisch Westf¨ alische Technische Hochschule Aachen Lehrstuhl f¨ ur Informatik VI and Zeres GmbH Bochum: Example-based language translation systems. Second year progress report, EuTransproject, Technical report deliverable D0.1b. Information Technology. Long Term Research Domain. Open scheme. Project Number 32026. 1999. 12. K. Knight and Y. Al-Onaizan: Translation with finite-state devices, Proceedings of the 4th. ANSTA Conference, 1998.
13. D. Llorens: Suavizado general de aut´ omatas finitos, Ph.D. Thesis. Universitat Polit`ecnica de Val`encia. To be published in 2000. 14. E. M¨ akinen: Inferring finite transducers, University of Tampere, Report A-1999-3, 1999. 15. F. Maryanski and M.G. Thomason: Properties of stochastic syntax-directed translation schemata, International Journal of Computer and Information Science, Vol. 8, N. 2, pp. 89–110, 1979. 16. J. Oncina, P. Garc´ıa and E. Vidal: Learning subsequential transducers for pattern recognition tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, pp. 448–454, 1993. 17. Rheinisch Westf¨ alische Technische Hochschule Aachen Lehrstuhl f¨ ur Informatik VI and Instituto Tecnol´ ogico de Inform´ atica: Statistical Modeling Techniques and Results and Search Techniques and Results, EuTransproject, Technical Report Deliverables D3.1a and D3.2a Information Technology. Long Term Research Domain. Open scheme. Project Number 32026, 1999. 18. P.R. Clarkson and R. Rosenfeld: Statistical Language Modeling Using the CMUCambridge Toolkit, Proceedings ESCA Eurospeech, Vol. 5, pp. 2707–2710, 1997. 19. E. Vidal, P. Garc´ıa and E. Segarra: Inductive learning of finite-state transducers for the interpretation of unidimensional objects, Structural Pattern Analysis, R, Mohr, Th.Pavlidis, A. Sanfeliu (eds.), pp. 17–35, World Scientific pub. 1989. 20. E. Vidal, F. Casacuberta and P. Garc´ıa: Grammatical inference and automatic speech recognition, in Speech recognition and coding: new advances and trends, A.Rubio, J. L´ opez (eds.) pp. 174–191, NATO-ASI Vol. F147, Springer-Verlag, 1995. 21. E. Vidal: Finite-state speech-to-speech translation, Proceedings of the International Conference on Acoustic, Speech and Signal Processing. Munich (Germany), Vol. I, pp. 111–114, 1997.
Computational Complexity of Problems on Probabilistic Grammars and Transducers

Francisco Casacuberta¹* and Colin de la Higuera²

¹ Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, 46071 Valencia, Spain. [email protected]
² EURISE, Faculté des Sciences et Techniques, Université de Saint Etienne - Jean Monnet, 42023 Saint Etienne, France. [email protected]

* This work has been partially funded by the European Union and the Spanish CICYT, under grants IT-LTR-OS-30268 and TIC97-0745-C02, respectively.
Abstract. Determinism plays an important role in grammatical inference. In practice, however, ambiguous grammars (and non-deterministic grammars in particular) are used more often than deterministic grammars. Computing the probability of parsing a given string, or its most probable parse, with stochastic regular grammars can be performed in linear time. However, the problem of finding the most probable string has not yet received a satisfactory answer. In this paper we prove that this problem is NP-hard and does not allow for a polynomial time approximation scheme. The result extends to stochastic regular syntax-directed translation schemes.
1 Introduction
As the problem of not having negative evidence arises in practice when wishing to learn grammars, different options for dealing with the issue have been proposed. Restricted classes of deterministic finite-state automata can be identified [1,10], heuristics have been proposed [22] and used for practical problems in speech recognition or pattern recognition [15], and stochastic inference has been proposed as a means to deal with the problem [2,24,21]. Stochastic grammars and automata have been used for some time in the context of speech recognition [20,16]. Algorithms that (heuristically) learn a context-free grammar have been proposed (for a recent survey see [23]), and other algorithms (namely the forward-backward algorithm for hidden Markov models, close to stochastic finite automata, or the inside-outside algorithm for stochastic context-free grammars) that compute probabilities for the rules have been realised [20,14].
But in the general framework of grammatical inference it is important to search for algorithms that not only perform well in practice, but that provably converge to the optimal solution, using only a polynomial amount of time. For the case of stochastic finite automata the problem has been dealt with
by different authors: in [24] stochastic deterministic finite automata are learnt through Bayes minimisation, in [3], through state merging techniques common to classical algorithms for the deterministic finite-state automaton inference problem. Along the same line in [21] acyclic stochastic deterministic finite automata are learnt, proving furthermore that under certain restrictions the inferred automaton is probably approximately correct. Work in the direction of learning this sort of object has been followed these last years, with new algorithms proposed in [27,25]. In a general sense the models that have been inferred are always deterministic. It is not obvious why this should be so as non-deterministic stochastic automata are strictly more powerful than their deterministic counter parts. They can also be of a smaller size and thus be more understandable. One reason may be that in the normal (non-stochastic) paradigm, it can be proved that non deterministic machines can not be identified in polynomial time [17]. In this work we point out that the difference between deterministic and non-deterministic stochastic automata (or regular grammars) is also that some reasonably easy problems in the deterministic case become intractable in the non deterministic case. An appealing feature of stochastic regular grammars is the existence of efficient algorithms for parsing. The probability of generating a given string by a stochastic regular grammar can be computed in linear time with the length of the string. The same holds for the search of the derivation with the highest probability. In spite of the existence of polynomial algorithms for dealing with some problems that involve stochastic regular grammars, there is another important problem which does not have an efficient solution. This is to find the most probable string that can be generated by a stochastic regular grammar. Other useful models which are closely related to stochastic regular grammars are the stochastic regular syntax-directed translation schemes [9,13,26]. Stochastic grammars are adequate models for classification tasks; however, there are many practical situations which do not fit well within the classification framework but can be properly tackled through formal translation [18]. For translation, efficient (linear) algorithms are only known for the computation of the highest probability translation form [4]. In this framework, given an input string, the goal is to find its most probable translation. However, there is no efficient solution for this problem. Under the complexity theory framework [11], we report some results about the difficulty of different computations regarding probabilistic finite state machines.
2 The Most Probable String Problem
The following definition is classical [13].
Definition 1: A stochastic regular grammar (SRG) G is a tuple <N, Σ, R, S, P>, where N is a finite set of non-terminal symbols; Σ is a finite set of terminal symbols; R is a set of rules of the form A → aB or A → a for A, B ∈ N and a ∈ Σ (for simplicity, empty rules are not allowed); S is the starting symbol and P : R → Q+ (the set of the positive rational numbers) is a function such that

    ∑_{a∈Σ, B∈N: (A→aB)∈R} P(A → aB) + ∑_{a∈Σ: (A→a)∈R} P(A → a) = 1,   ∀A ∈ N
Stochastic grammars are probabilistic generators of languages; therefore, the concept of the probability that a string is generated by a SRG can be defined.
Definition 2: Given w ∈ Σ* (the set of finite-length strings over Σ), the probability that a SRG G generates w is defined as

    pr_G(w) = ∑_{d(w)} pr_G(d(w))
where d(w) is a complete derivation of w in G of the form

    S → w1 A1 → w1 w2 A2 → ... → w1 w2 ... w_{|w|−1} A_{|w|−1} → w1 w2 ... w_{|w|−1} w_{|w|} = w

and

    pr_G(d(w)) = P(S → w1 A1) P(A1 → w2 A2) ... P(A_{|w|−1} → w_{|w|})

Some important problems arise with these definitions, namely the computation, for a given string, of its probability (PS) or of its most probable derivation (MPDS), the computation of the most probable derivation (MPD), and the computation of the most probable string (MPS). The PS, MPDS and MPD problems have been widely addressed. The PS and MPDS problems are classical parsing problems, and can be solved in time O(|w||N|²) [5]. The MPD problem can also be dealt with using Dijkstra's algorithm [6] to compute the shortest path in a weighted graph, and requires no more than O(|N|²) time. The MPS problem, although straightforward to state, has not been dealt with. Let us define the associated decision problem as follows:

Problem: Most Probable String (MPS).
Instance: A SRG G and p ∈ Q+.
Question: Is there a string x ∈ Σ* with |x| ≤ |N| such that pr_G(x) ≥ p?

A more restricted problem is the following:

Problem: Restricted Most Probable String (RMPS).
Instance: A SRG G, d ∈ ℕ (the set of natural numbers) with d ≤ |N|, and p ∈ Q+.
Question: Is there a string x ∈ Σ^d such that pr_G(x) ≥ p?

RMPS is not just a special case of MPS. We will prove that both MPS and RMPS are NP-hard.
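The PS computation mentioned above (summing pr_G over all derivations of w) can be sketched as a simple forward dynamic program. The rule-list representation and names below are assumptions made for illustration; the cost is proportional to |w| times the number of rules, the same kind of polynomial parsing cost as the O(|w||N|²) bound cited in the text.

```python
# Sketch of the PS computation for a SRG: pr_G(w) summed over all complete
# derivations, by a forward pass over (position in w, non-terminal).
from collections import defaultdict

def string_probability(rules, start, w):
    """rules: list of (A, a, B, p) with B = None for a final rule A -> a."""
    forward = defaultdict(float)       # forward[A] = prob. of deriving w[:i] and reaching A
    forward[start] = 1.0
    result = 0.0
    for i, a in enumerate(w):
        new_forward = defaultdict(float)
        for (A, sym, B, p) in rules:
            if sym != a or forward[A] == 0.0:
                continue
            if B is None:              # A -> a ends a derivation
                if i == len(w) - 1:
                    result += forward[A] * p
            else:                      # A -> aB
                new_forward[B] += forward[A] * p
        forward = new_forward
    return result
```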
As the probability of any string can be computed in polynomial time¹, both MPS and RMPS are in NP. We prove that MPS and RMPS are NP-hard by reduction from the "satisfiability" problem (SAT) [11]. The proof relies on a technical encoding of a set of clauses. Given an instance of SAT, consisting of 1) a collection v1, ..., vn of n boolean variables and 2) a collection c1, ..., ck of k clauses over the n variables, consider the following SRG G = (N, Σ, R, S, P):
- Σ = {f, t, $, #}
- For 1 ≤ j ≤ k:
  - A_{j,0} ∈ N; the rule S → $A_{j,0} is in R with probability 1/k, and the rules B_n → $ and A_{j,n} → # are in R with an associated probability 1.
  - For 1 ≤ i ≤ n, with an associated probability 1/2:
    * A_{j,i}, B_i ∈ N;
    * the rules B_{i−1} → tB_i and B_{i−1} → fB_i are in R;
    * if v_i appears as a positive literal in c_j, then the rules A_{j,i−1} → fA_{j,i} and A_{j,i−1} → tB_i are in R;
    * if v_i appears as a negative literal in c_j, then the rules A_{j,i−1} → tA_{j,i} and A_{j,i−1} → fB_i are in R;
    * if v_i does not appear in c_j, then the rules A_{j,i−1} → tA_{j,i} and A_{j,i−1} → fA_{j,i} are in R.
    Each of these rules has an associated probability of 1/2.
- For RMPS, fix d = n + 2.
To illustrate this construction, consider an example where one of the clauses is
3
This can be done in O(|w||N |2 ) A SRG can be interpreted by its associated graph. Notice that some states (Aj6 and B1 ) are useless). The fact that the grammar is not in proper normal form is irrelevant. Encoding of p only requires n bits
Computational Complexity of Problems on Probabilistic Grammars f
f 1/2 Aj1 Aj0 t
f 1/2
Aj2
t 1/2
1/2
$
t
1/k
1/2
S
f
.. . Fig. 1.
B 1 1/2
t 1/2
f
1/2 Aj3 t 1/2
f t 1/2
Aj5
f 1/2 t 1/2
Aj6
# 1
t
1/2
B 2 1/2
f Aj4 1/2
19
1/2
f
B 3 1/2 t 1/2
f
B 4 1/2
t 1/2
f
B 5 1/2 t 1/2
B6 $ 1
Part of the SRG corresponding to clause cj = x2 ∨ x ¯3 ∨ x5 with n = 6.
the other hand, if the instance of SAT does not admit any solution, then as the only strings that have non null probability for the associated grammar are of length n + 2 (= d), and at least one clause is not satisfied (for example if the clause j is not satisfied, the corresponding derivation ends in Ajn ), then no string 5. has probability 1/2n Consequently the corresponding optimization problems (finding the most probable string) are NP-hard. More can be said about the NP-optimization problem: Problem Maximum probability of a string (MaxPS). Instance A SRG G , and p ∈ Q+ . Solution A string x ∈ Σ ? Measure prG (x). By reduction from maximum satisfiability (Max-SAT) [12,19], Theorem 2: MaxPS is APX-hard. Maximum satisfiability is the NP-optimization problem corresponding to SAT. It concerns finding a subset of clauses such that there is a truth assignment satisfying each clause in the subset. The associated measure is just the number of clauses. The problem is APX-complete, i.e. it is complete for the class APX. Being APX-complete implies that you can not do better than a constant approximation (a bound of the constant approximation is proposed by Goemans and Williamson [12]) and that no PTAS (polynomial time approximation scheme) is feasible. Proof of Theorem 2: The proof is straight forward and involves the same construction as for the NP-hardness of MPS: Given an instance I of Max-SAT, and a rational , construct an instance f (I, ) of MaxPS as in the proof of theorem 1. Now given a string x on the input alphabet of the associated SRG f (I, ), the following holds: prf (I,) (x) =
c k 2n
⇒ c = g(I, x, ) clauses of I can be satisfied.
20
F. Casacuberta and C. de la Higuera
Finally we have, for any instance I of Max-SAT, any rational and any string x solution to f (I, ): opt(I) opt(f (I, )) = m(f (I, ), x) m(x, g(I, x, )) where opt denotes the optimal result (maximum number of satisfied clauses or maximum probability) and m is the measure function (number of actual satisfied clauses for a given assignment and probability of a given string). It follows that with playing a dummy part the reduction inequation can be obtained [7]: opt(I) opt(f (I, )) ≤⇒ ≤ m(f (I, ), x) m(x, g(I, x, )) All these constructions are polynomial.
3
5
Stochastic Regular Syntax-Directed Translation Scheme
The last problem deals with the search for an optimal translation of a given input string according to a translation scheme [13].
Definition 3: A stochastic regular syntax-directed translation scheme (SRT) E is a tuple <N, Σ, ∆, R, S, P>, where N and S are defined as in SRGs, Σ is a finite set of input terminal symbols, ∆ is a finite set of output terminal symbols (Σ ∩ ∆ = ∅), R is a set of rules of the form A → aB, ωB or A → a, ω for A, B ∈ N, a ∈ Σ, ω ∈ ∆*, and P : R → Q+ is a function such that

    ∑_{a∈Σ, ω∈∆*, B∈N: (A→aB,ωB)∈R} P(A → aB, ωB) + ∑_{a∈Σ, ω∈∆*: (A→a,ω)∈R} P(A → a, ω) = 1,   ∀A ∈ N
For simplicity, empty input rules (A → λB, ωB or A → λ, ω, where λ is the empty string) are not allowed. SRGs and SRTs are closely related, and given a SRT E, the probability of a translation pair (x, y) ∈ Σ* × ∆*, pr_E(x, y), is defined in a way similar to that for SRGs:
Definition 4: The probability of a translation pair (x, y) ∈ Σ* × ∆* according to the scheme E is defined as

    pr_E(x, y) = ∑_{t(x,y)} pr_E(t(x, y))
where t(x, y) is a translation form of (x, y) in E:

    (S, S) → (x1 A1, y1 A1) → (x1 x2 A2, y1 y2 A2) → ... → (x, y)

and the corresponding probability of the translation form is

    pr(t(x, y)) = P(S → x1 A1, y1 A1) P(A1 → x2 A2, y2 A2) ... P(A_{|x|−1} → x_{|x|}, y_{|x|})

The following example is presented to illustrate the above definitions.
Example 1. N = {S, A, B}, Σ = {0, 1}, ∆ = {a, b} and the rules of Table 1.

Table 1. Set of rules and probabilities corresponding to the SRT of Example 1.

Rules (R)        Probabilities (P)
S → 0A, aA       3/10
S → 0B, abB      7/10
A → 1B, aaB      2/7
A → 1A, aaaA     4/7
A → 0, a         1/7
B → 1A, bbbA     2/5
B → 0, aa        3/5

The input string 010 has two possible translations: abbbba and
An interesting question is thus that of computing the most probable translation of a given input string. Formally:
Problem: Most probable translation (MPT).
Instance: A SRT E, x ∈ Σ∗ and p ∈ Q+.
Question: Is there an output string y ∈ Δ∗, |y| ≤ |N|·lmax (lmax is the maximum length of an output string in a rule), such that prE(x, y) ≥ p?
In Example 1, the second translation (aaaaa) has the highest probability; it is therefore the most probable translation of 010. If the translation defined by E from Σ∗ to Δ∗ is not ambiguous (E defines a function from Σ∗ to Δ∗), there is an efficient algorithm that answers the MPT problem in linear time: basically, it performs a parsing of the input with the input grammar. The MPT problem can be reduced from RMPS as follows: given a SRG G = ⟨N, Σ, R, S, P⟩, an integer n and a rational p, construct a SRT E = ⟨N′, Σ′, Δ, R′, S′, P′⟩ with
– N′ = N, Δ = Σ, Σ′ = {$};
– for every rule A → aB ∈ R, a rule A → $B, aB is in R′ with P′(A → $B, aB) = P(A → aB);
– for every rule A → a ∈ R, a rule A → $, a is in R′ with P′(A → $, a) = P(A → a);
– an input string $^n (n ≤ |N|);
– a rational p.
Theorem 3: MPT is NP-complete.
Proof of Theorem 3: From the above reduction it follows that: 1) the construction is polynomial; and 2) $^n has an output string y ∈ Δ∗ such that prG(y) ≥ p if and only if prE($^n, y) ≥ p (the length of y is at most |N|·lmax, since the derivation of $^n has n ≤ |N| steps, each producing at most lmax output symbols). The associated optimization problem, computing the most probable translation, is therefore NP-hard. Without proof (it follows from the previous results and proofs), we give a final result for the associated NP-optimization problem (MaxPT):
Theorem 4: MaxPT is APX-hard.
4 Conclusions
In this paper we have presented computational complexity results regarding parsing problems for stochastic regular grammars and stochastic regular syntax-directed translation schemes. In particular, the problems of searching for the most probable string in a SRG and of searching for the most probable translation of an input string given a SRT are NP-hard, and the associated optimization problems do not admit polynomial approximation schemes. Future work can be conducted in the following direction: we have proved that both NP-optimization problems are APX-hard; do they belong to APX? Such a result would require a polynomial-time algorithm that always meets a given approximation bound.
Acknowledgements. The authors wish to thank the anonymous reviewers for their criticisms and suggestions.
References 1. D. Angluin, Inference of reversible languages. Journal of the ACM, Vol. 29(3), pp. 741–765, 1982. 2. R. Carrasco and J. Oncina, Learning stochastic regular grammars by means of a state merging method, in Grammatical Inference and Applications. Proceedings of ICGI ’94, Lecture Notes in Artificial Intelligence 862, Springer Verlag ed., pp. 139–150, 1994.
3. Carrasco, J. Oncina, Learning deterministic regular grammars from stochastic samples in polynomial time. Informatique Th´eorique et Applications, Vol. 33(1), pp. 1–19, 1999. 4. F.Casacuberta, Maximum mutual information and conditional maximum likelihood estimations of stochastic syntax-directed translation schemes, in: L. Miclet and C. de la Higuera (eds), Grammatical Inference: Learning Syntax from Sentences. Lecture Notes in Artificial Intelligence, Vol 1147, pp. 282–291, Springer-Verlag, 1996. 5. F.Casacuberta, Growth transformations for probabilistic functions of stochastic grammars. International Journal on Pattern Recognition and Artificial Intelligence. Vol. 10, pp. 183–201, 1996. 6. T. Cormen, Ch. Leiserson and R. Rivest, Introduction to algorithms. The MIT Press, 1990. 7. P. Crescenzi and V. Kann, A compendium of NP optimization problems, http://www.nada.kth.se/ viggo/problemlist/compendium.html (1995). 8. K.S. Fu and T.L.Booth, Grammatical inference: introduction and survey. Part I and II, IEEE Transactions on System Man and Cybernetics, Vol. 5, pp. 59–72/409– 23, 1985. 9. K.S. Fu, Syntactic pattern recognition and applications. Prentice-Hall, Englewood Cliffs, NJ. 1982. 10. P. Garc´ıa and E. Vidal, Inference of K-testable languages in the strict sense and applications to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence. Vol. 12(9). pp. 920–925. 1990. 11. M.R. Garey and D.S. Johnson, Computers and intractability: a guide to the theory of NP-completeness W.H. Freeman, San Francisco, 1979. 12. M.X. Goemans and D. O. Williamson, 878-approximation algorithms for MAXCUT and MAX-2SAT. Proc. Twenty sixth Ann. ACM Symposium on Th. of Comp., 422–431, 1994. 13. R. Gonz´ alez and M. Thomason, Syntactic pattern recognition: an introduction. Addison-Wesley, Reading, MA 1978. 14. K. Lari, and S. Young, Applications of stocashtic context-free grammars. Computer Speech and Language. Vol. 5. 237–257. 1991. 15. S. Lucas, E. Vidal, A. Amiri, S. Hanlon and J-C.Amengual, A comparison of syntactic and statistical techniques for off-line OCR. Proceedings of the International Colloquium on Grammatical Inference ICGI-94 (pp. 168–179). Lecture Notes in Artificial Intelligence 862, Springer-Verlag, 1994. 16. H. Ney, Stochastic grammars and Pattern Recognition, in Speech Recognition and Understanding. edited by P. Laface and R. de Mori, Springer-Verlag, pp. 45–360, 1995. 17. C. de la Higuera, Characteristic sets for grammatical inference Machine Learning, 27 pp. 1–14, 1997 18. J. Oncina, P. Garc´ıa and E. Vidal, Learning subsequential transducers for pattern recognition tasks, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 15, pp. 448–458, 1993. 19. C.H. Papadimitriou and M. Yannakakis, Optimisation approximation and complexity classes. Journal Computing System Science, Vol. 43, pp. 425–440, 1991. 20. L. Rabiner and B.H. Juang, Fundamentals of Speech Recognition. Prentice-Hall, 1993. 21. D. Ron, Y. Singer and N. Tishby, On the Learnability and Usage of Acyclic Probabilistic Finite Automata, Proceedings of COLT 1995 , pp. 31–40, 1995.
22. H. Rulot and E. Vidal, Modelling (sub)string-length-based constraints through grammatical inference methods. Devijver and Kittler eds. Sringer-Verlag 1987. 23. Y. Sakakibara, Recent Advances of Grammatical Inference. Theoretical Computer Science Vol. 185, pp. 15–45, 1997. 24. A. Stolcke and S. Omohundro, Inducing Probabilistic Grammars by Bayesian Model Merging, in Grammatical Inference and Applications. Proceedings of ICGI ’94, Lecture Notes in Artificial Intelligence 862, Springer Verlag ed., pp. 106–118, 1994. 25. F. Thollard and P. Dupont and C. de la Higuera, Probabilistic DFA Inference using Kullback-Leibler Divergence and Minimality. ICML2000 (International Colloquium on Machine Learning), Stanford, 2000. 26. E.Vidal, F.Casacuberta and P.Garc´ıa, Syntactic Learning Techniques for Language Modeling and Acoustic-Phonetic Decoding, in: A. Rubio (ed.) New Advances and Trends in Speech Recognition and Coding Chap. 27, NATO-ASI Series SpringerVerlag, pp. 174–191, 1995. 27. M. Young-Lai and F.W. Tompa, Stochastic Grammatical Inference of Text Database Structure, to appear in Machine Learning, 2000.
Efficient Ambiguity Detection in C-NFA
A Step Towards the Inference of Non Deterministic Automata
François Coste and Daniel Fredouille
IRISA/INRIA Rennes, Campus Universitaire de Beaulieu, 35042 Rennes Cedex, France
Phone: +33 2 99 84 71 00  Fax: +33 2 99 84 71 71
{Francois.Coste|Daniel.Fredouille}@irisa.fr
Abstract This work addresses the problem of the inference of non deterministic automata (NFA) from given positive and negative samples. We propose here to consider this problem as a particular case of the inference of unambiguous finite state classifier. We are then able to present an efficient incompatibility NFA detection framework for state merging inference process. key words : regular inference, non deterministic automata, finite state classifier, sequence discrimination
Introduction
This work addresses the problem of the inference of non-deterministic automata (NFA) from given positive and negative samples. This problem has been extensively studied for the inference of deterministic automata (DFA), for which state merging algorithms have been proven efficient [OG92,Lan92,CN97,LPP98,OS98]. Whereas DFA are polynomially identifiable from given data [Gol78,dlH97], this result does not hold for NFA [dlH97]. In contrast, it is well known that there exist languages whose representation by DFA requires an exponential number of states with respect to the NFA representation. Considering the inference of NFA instead of DFA therefore allows smaller solutions to be obtained, which we expect to require fewer samples to be characterized. Few studies have been devoted to the inference of NFA. Yokomori [Yok94] has proposed an algorithm that needs an oracle and can infer, in polynomial time, NFA whose determinization is polynomial. We propose here to consider the inference of compatible NFA as a particular case of the inference of unambiguous finite state classifiers, presented in section 1. A first algorithm for checking the unambiguousness of a C-NFA is given in this section. The second section proposes an incremental version of this algorithm for a state merging inference process, ensuring the compatibility of the corresponding NFA without parsing the sample. We conclude with a first experiment comparing minimum-sized NFA and DFA inference with respect to the size of the training sample.
1 Inference of Unambiguous Finite State Classifier
The purpose of this section is to introduce the inference of finite state classifiers by means of state merging algorithms. Using this representation allows unbiased inference [AS95,Alq97]. We propose here to take advantage of the simultaneous representation of a set of languages for the inference of unambiguous automata.

1.1 Definitions and Notations
Definition 1. A C-class non-deterministic finite state automaton (C-NFA) is defined by a 6-tuple (Q, Q0, Σ, Γ, δ, ρ) where: Q is a finite set of states; Q0 ⊆ Q is the set of initial states; Σ is a finite alphabet of input symbols; Γ is a finite alphabet of C output symbols; δ is the next-state function mapping Q × Σ to 2^Q (if δ maps Q × Σ to Q, the automaton is said to be deterministic and is denoted C-DFA); ρ is the output function mapping Q to 2^Γ. The function realized by a C-NFA is the classification of sequences. The classification function γ, mapping Σ∗ × Q to 2^Γ, is defined by:
γ(q, w) = ∪ { ρ(q′) | q′ ∈ δ(q, w) }
where δ has been extended to sequences in the classical way: ∀q ∈ Q, ∀w ∈ Σ∗, ∀a ∈ Σ ∪ {ε}, δ(q, ε) = {q} and δ(q, wa) = ∪ { δ(q′, a) | q′ ∈ δ(q, w) }. The classification of a sequence w by a C-NFA may then be defined as the set of classifications obtained from the initial states. We also denote by γ this function mapping Σ∗ to 2^Γ:
γ(w) = ∪ { γ(q, w) | q ∈ Q0 }
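A direct transcription of these two definitions, as a Python sketch; the dictionary-based encoding of δ and ρ and the toy two-class automaton are assumptions for illustration only, not structures from the paper.

def delta_star(delta, states, word):
    """Extend the next-state function to sequences: delta(q, eps) = {q},
    delta(q, wa) = union of delta(q', a) for q' in delta(q, w)."""
    current = set(states)
    for a in word:
        current = set().union(*(delta.get((q, a), set()) for q in current)) if current else set()
    return current

def classify(c_nfa, word):
    """gamma(w): union of the outputs rho(q') over all states q' reachable
    from an initial state by reading w."""
    Q0, delta, rho = c_nfa["Q0"], c_nfa["delta"], c_nfa["rho"]
    return set().union(*(rho.get(q, set()) for q in delta_star(delta, Q0, word)))

# Toy 2-class C-NFA: class 'c1' accepts a+, class 'c2' accepts b+.
C_NFA = {
    "Q0": {0},
    "delta": {(0, "a"): {1}, (1, "a"): {1}, (0, "b"): {2}, (2, "b"): {2}},
    "rho": {1: {"c1"}, 2: {"c2"}},
}
print(classify(C_NFA, "aaa"))   # {'c1'}
print(classify(C_NFA, "ab"))    # set(): 'ab' is not classified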
Given a C-NFA M, a sequence w is said to be classified if its classification is defined (i.e., γ(w) ≠ ∅). The set of classified sequences is called the domain of M. The classification over this domain defines a C-tuple of regular languages denoted L(M): L(M) = ⟨Lc(M)⟩c∈Γ where ∀c ∈ Γ, Lc(M) = {w ∈ Σ∗ | c ∈ γ(w)}. A C-NFA thus allows a set of languages to be handled simultaneously. In this paper, we focus on unambiguous C-NFA:
Definition 2. A C-NFA is said to be unambiguous if each sequence is classified in at most one class.
From the definition, it follows that a C-NFA M is unambiguous iff the languages of the C-tuple represented by M are mutually disjoint, i.e.: ∀i, j ∈ Γ, i ≠ j ⇒ Li(M) ∩ Lj(M) = ∅. The unambiguousness property is important for the search of compatible automata from positive and negative samples and for other applications dealing with
discrimination of sequences by finite state machines. The choice of a C-NFA representation of a set of languages, instead of the classical automata representation, allows us to characterize efficiently the disjointness of the recognized languages. We take advantage of this property in the next sections, devoted to the inference of unambiguous C-NFA.

1.2 State Merging Inference
The problem of inferring a C-NFA may be seen as a C-regular inference problem [Cos99]. We assume that a training sample S = hSc ic∈Γ is given such that each Sc is a sample from the target language Lc (M ), i.e. a finite subset of Lc (M ). One classical assumption made in grammatical inference is that the sample is structurally complete with respect to the target machine. Under this assumption, the inference of C-NFA may then be done by means of state merging algorithm, which proceeds by merging states of the Maximal Canonical Automaton, denoted by MCA(S), which is the automaton resulting from the union of the canonical C-NFA for each sequence of S (figure 1 and algorithm 1). When looking for
Figure 1. MCA(S) for S = h{ab}, {aaa, aa}i.
unambiguous C-NFA, the search is pruned as soon as the current automaton is detected to be ambiguous, since all automata obtained by merging states of an ambiguous automaton are ambiguous.

Algorithm 1 Greedy state merging algorithm
  Greedy_SMA(S)
    /* Input: training sample S */
    /* Output: a C-NFA compatible with S */
    A ← Maximal_Canonical_Automaton(S)
    while Choose_States_To_Merge(q1, q2) do
      A′ ← Merge(A, q1, q2)
      if A′ is not ambiguous then
        A ← A′
Detecting ambiguity is simple in the deterministic case. It can be done by checking that no states of different classes have been merged, or even by parsing the automaton with the training set. In the non-deterministic case, parsing may be done by a Viterbi-like procedure. For classical automata, parsing the negative sample is sufficient to ensure compatibility. For non-deterministic C-NFA, compatibility with the samples and unambiguousness should not be confused: even when all the samples are correctly labeled by the automaton, sequences outside the training set may have more than one classification. We propose in the next section a first algorithm to detect the ambiguity of a C-NFA.

1.3 Ambiguity Detection
Only two cases of ambiguity exist. A C-NFA is ambiguous if:
– there exists a state such that its output function returns two different classifications (for C-DFA, this is the unique case of ambiguity); or
– there exist paths labeled by the same sequence w leading to states with defined and different classifications.
We introduce the notation γ1 ≁ γ2 (γ1 incompatible with γ2) for two different and defined classifications γ1 and γ2: γ1 ≁ γ2 ⇔ ((γ1 ≠ γ2) ∧ (γ1 ≠ ∅) ∧ (γ2 ≠ ∅)). Otherwise, the classifications are said to be compatible (denoted γ1 ∼ γ2). It is easy to detect whether the first case holds. For the second case, we need to introduce the notion of an incompatible pair of states. Two states q1 and q2 are incompatible (denoted q1 ≁ q2) if there exists a word whose classifications from these states are incompatible: q1 ≁ q2 ⇔ ∃w ∈ Σ∗, ∃(s1, s2) ∈ δ(q1, w) × δ(q2, w), ρ(s1) ≁ ρ(s2). Otherwise, the states are said to be compatible (denoted q1 ∼ q2). Then, ambiguity detection for a C-NFA reduces to checking whether a state is incompatible with itself or whether two initial states are incompatible. To mark incompatible states, we propose an algorithm (algorithm 2) inspired by the algorithm of Hopcroft and Ullman designed to mark non-equivalent states
for automaton minimization [HU80]¹. Since the automata we consider are not necessarily deterministic, the original algorithm has been changed by inverting the propagation direction of the marking process, which results in an O(n²) time complexity for tree-like automata. This algorithm may be used to construct the set E≁ of incompatible pairs of states and to raise an exception if it detects ambiguity.

Algorithm 2 Incompatible states and C-NFA ambiguity
  Incompatible_States(A = (Σ, Γ, Q, Q0, δ, ρ)):
    /* search of the set of incompatible states of A and ambiguity detection of A */
    E≁ ← ∅   /* set of incompatible pairs of states */
    for all {qi, qj} ∈ Q × Q such that ρ(qi) ≁ ρ(qj) do
      if {qi, qj} ∉ E≁ then
        Set_Incompatible_And_Propagate(qi, qj)
    return E≁

  Set_Incompatible_And_Propagate(q1, q2):
    /* ambiguity detection */
    if (q1 = q2) ∨ (q1 ∈ Q0 ∧ q2 ∈ Q0) then
      throw exception("ambiguous C-NFA")
    /* incompatibility memorization */
    E≁ ← E≁ ∪ {q1, q2}
    /* propagation */
    for all a ∈ Σ, {p1, p2} ∈ δ⁻¹(q1, a) × δ⁻¹(q2, a) do
      if {p1, p2} ∉ E≁ then
        Set_Incompatible_And_Propagate(p1, p2)
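A minimal Python sketch of this marking procedure (same backward-propagation idea; the dictionary encoding and the precomputed reverse transition function are assumptions for illustration, not the paper's data structures):

def incompatible_pairs(c_nfa):
    """Mark incompatible pairs of states by propagating backwards through delta,
    raising ValueError as soon as the C-NFA is detected to be ambiguous."""
    Q, Q0, delta, rho = c_nfa["Q"], c_nfa["Q0"], c_nfa["delta"], c_nfa["rho"]
    # reverse transition function: inv[(q, a)] = set of states p with q in delta(p, a)
    inv = {}
    for (p, a), targets in delta.items():
        for q in targets:
            inv.setdefault((q, a), set()).add(p)
    alphabet = {a for (_, a) in delta}
    incompatible = set()

    def mark(q1, q2):
        if q1 == q2 or (q1 in Q0 and q2 in Q0):
            raise ValueError("ambiguous C-NFA")
        if frozenset((q1, q2)) in incompatible:
            return
        incompatible.add(frozenset((q1, q2)))
        for a in alphabet:                      # backward propagation through inv
            for p1 in inv.get((q1, a), ()):
                for p2 in inv.get((q2, a), ()):
                    mark(p1, p2)

    for q1 in Q:
        for q2 in Q:
            o1, o2 = rho.get(q1, set()), rho.get(q2, set())
            if o1 and o2 and o1 != o2:          # defined and different outputs
                mark(q1, q2)
    return incompatible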
In the worst case, the complexity of algorithm 2 is O(|Σ|n4 ) : O(n2 ) calls of the function Set Incompatible And Propagate, whose body needs O(|Σ|n2 ) steps. However, if we denote by ta the maximal number of incoming transitions with the same symbol in a state, one can refine the complexity result. The complexity of Set Incompatible And Propagate body with respect to ta is O(|Σ|t2a ) which leads to a global complexity of O(|Σ|t2a n2 ). Therefore the complexity lies more practically between O(|Σ|n2 ) and O(|Σ|n4 ) according to the value of ta . In an inference process, this algorithm may be used to determine whether each candidate is unambiguous. In the next section, we propose an incremental version of this algorithm to detect ambiguity in an extension of the classical state merging framework.
¹ The partition refinement algorithm to minimize automata may not be used here, since the state equivalence relation is transitive whereas the compatibility relation is not.
2 Considering Unmergeable States During Inference
We propose here to extend the classical state merging algorithm to consider pairs of unmergeable states (denoted, for two states q1 and q2 of a C-NFA, by q1 ≄ q2). At each step of the inference, instead of always merging the chosen pair of states, the algorithm is allowed to set this pair of states unmergeable. This may be used to guide the search or to prune an entire part of the search space, either because it has already been explored or because it is known that no solution can be found in it.

2.1 Detection of Unmergeable States Due to Ambiguity
During the inference of unambiguous automata, some pairs of states may be detected to have no other choice than being set unmergeable in order to ensure unambiguousness. The first relation that can be used is that two incompatible states are also unmergeable: ∀(q1, q2) ∈ Q × Q, q1 ≁ q2 ⇒ q1 ≄ q2. We can detect more unmergeable states by considering the counterpart of merging for determinization used in the deterministic framework [OG92], that is, by considering pairs of states that are reachable by a common word from the initial states.
Definition 3. Two states q1 and q2 are said to be in relation ∥, denoted by q1 ∥ q2, if they are reachable by a common word from the initial states. More formally, q1 ∥ q2 ⇔ ∃w ∈ Σ∗, q1, q2 ∈ ∪{ δ(q0, w) | q0 ∈ Q0 }.
The algorithm computing relation ∥ is very similar to algorithm 2 for incompatible states: the initial loop over pairs of states with incompatible outputs is replaced by a loop over pairs of initial states, and the backward propagation through δ⁻¹ is replaced by a forward propagation through δ. This algorithm can also detect ambiguity, since it tries to put in relation ∥ two states with incompatible outputs. Thanks to relation ∥, we can detect new unmergeable states with the following implication, illustrated in figure 2: q1 ≁ q2 ∧ q2 ∥ q3 ⇒ q1 ≄ q3.
Relation ∥ also enables earlier ambiguity detection: to detect ambiguity, we can check that no incompatible states have to be set in relation ∥ (or that no states in relation ∥ have to be set incompatible). This property comes from the fact that if two states are in relation ∥ due to a word w1 and are incompatible due to a word w2, then the word w1w2 has an ambiguous classification. Notice also that this detection can replace the one given in section 1.3 (the ambiguity test at the beginning of Set_Incompatible_And_Propagate in algorithm 2), since all initial states are in relation ∥ and every state is in relation ∥ with itself.
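Given the set of incompatible pairs and the set of pairs in relation ∥ (both computed as above), the new unmergeable pairs can be derived directly; a Python sketch under the same hypothetical pair-set encoding:

def unmergeable_pairs(incompatible, parallel):
    """Apply  q1 incompatible with q2  and  q2 || q3   =>   q1, q3 unmergeable.
    Both arguments are sets of frozensets of two states; reflexive pairs q || q
    may be omitted, since incompatible states are already unmergeable."""
    unmergeable = set(incompatible)                  # special case q2 = q3
    for inc in incompatible:
        for par in parallel:
            common = inc & par
            if len(common) == 1:                     # inc = {q1, q2}, par = {q2, q3}
                (shared,) = common
                (q1,) = inc - common
                rest = par - common
                q3 = next(iter(rest)) if rest else shared   # reflexive case q || q
                unmergeable.add(frozenset((q1, q3)))
    return unmergeable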
Figure 2. Illustration of the implication q1 ≁ q2 ∧ q2 ∥ q3 ⇒ q1 ≄ q3: given a relation q2 ∥ q3 induced by a word w1 and an incompatibility q1 ≁ q2 induced by a word w2, the merging of q1 and q3 is not possible since it would entail the acceptance of the word w1w2 in two different classes. Notice that the relation q1 ≁ q2 ⇒ q1 ≄ q2 is the particular case q2 = q3 of this implication, thanks to the fact that every state is in relation ∥ with itself.
To summarize, before computing a merge we can check in some cases if it will lead to ambiguity, but this checking is not always possible (we do not detect all mergings leading to ambiguity, see figure 3). In this case, ambiguity is detected during the merge thanks to the addition of new relation k and 6∼.
Figure 3. Part (a): states q0 and q1 are unmergeable, but this is not detected by our equations (the automaton resulting from the merge, part (b), is ambiguous; for example, in this automaton the word aa is classified both c1 and c2).
We dispose of various relations between states which are useful not only to detect ambiguity, but also to prevent merging of states that leads to ambiguity. We now propose to maintain these relations after each merge during an inference algorithm.
2.2 Incremental Maintenance of Relations
Let E≄(q) (resp. E≁(q), E∥(q)) denote the set of states unmergeable with (resp. incompatible with, in relation ∥ with) state q. At the beginning of an inference algorithm, E≄(q), E≁(q) and E∥(q) have to be initialized. E≁(q) and E∥(q) can be computed with algorithm 2 and its counterpart for states in relation ∥, but E≄(q) must also be updated; for that reason we use the function Initialize (algorithm 3). The function Merge′

Algorithm 3 Initialization of E≁, E∥ and E≄
  Initialize(A = ⟨Σ, Γ, Q, Q0, δ, γ⟩)
    ∀q ∈ Q, E≁(q) = ∅; E∥(q) = ∅; E≄(q) = ∅
    for all {q1, q2} ∈ Q0 × Q0 do
      SetCP(q1, q2)   /* maintain E≄, add a ∥ relation and propagate */
    for all {q1, q2} ∈ Q × Q such that γ(q1) ≁ γ(q2) do
      SetIncompatible(q1, q2)   /* maintain E≄, add an incompatibility and propagate */
(algorithm 4) realizes the merging of two states and updates the sets E∥, E≁ and E≄. This update is realized by propagating the existing relations (incompatibility and ∥) onto the state created by the merging (functions PropagateIncompatibility and PropagateCP¹, algorithm 5). For example, the ambiguity of the automaton of figure 3, part (b), may be detected during the merging thanks to the addition of new relations: the incompatibility q0 ≁ q2 is transformed into q01 ≁ q2 by the merging, then this relation is propagated to q01 ≁ q01 by the function PropagateIncompatibility. At this step an exception is thrown, since it would imply a ∥ relation and an incompatibility between the same states.

Algorithm 4 Merge two states and update E≁, E∥ and E≄
  Merge′(A, q1, q2)
    /* detection of unmergeable states */
    if q1 ∈ E≄(q2) then
      throw exception
    else
      A ← Merge(A, q1, q2)   /* substitute q2 by q1 in A and in E≄, E≁, E∥ */
      for all q′ ∈ E∥(q1) do
        PropagateCP(q′, q1)
      for all q′ ∈ E≁(q1) do
        PropagateIncompatibility(q′, q1)
      return A
¹ CP stands for Common Prefix and corresponds to the ∥ relation. We do not detail the functions SetCP and PropagateCP, which are the counterparts for relation ∥ of the functions SetIncompatible and PropagateIncompatibility shown in algorithm 5.
Algorithm 5 Add a new incompatibility in E≁ and propagate its effects
  SetIncompatible(q1, q2)
    if q1 ∉ E≁(q2) then
      if q1 ∈ E∥(q2) then
        throw exception
      else
        /* add q1 to E≁(q2) and q2 to E≁(q1) */
        E≁(q1) ← E≁(q1) ∪ {q2}; E≁(q2) ← E≁(q2) ∪ {q1}
        /* propagation */
        PropagateIncompatibility(q1, q2)
        /* update the pairs in relation ≄ */
        for all q ∈ E∥(q1) do SetUnmergeable(q2, q)
        for all q ∈ E∥(q2) do SetUnmergeable(q1, q)

  PropagateIncompatibility(q1, q2)
    for all a ∈ Σ, {p1, p2} ∈ δ⁻¹(q1, a) × δ⁻¹(q2, a) do
      SetIncompatible(p1, p2)
Algorithm 6 Add unmergeable states in E≄
  SetUnmergeable(q1, q2)
    if q1 = q2 then
      throw exception
    else
      if q1 ∉ E≄(q2) then
        E≄(q1) ← E≄(q1) ∪ {q2}; E≄(q2) ← E≄(q2) ∪ {q1}
Every time an incompatibility or a ∥ relation between two states has to be added (functions SetIncompatible and SetCP, algorithm 5), two actions are taken: (1) we check that the new relation does not imply the ambiguity of the C-NFA (the test q1 ∈ E∥(q2) at the beginning of SetIncompatible, and its counterpart in SetCP); (2) we compute new unmergeable states using the implication q1 ∥ q2 ∧ q2 ≁ q3 ⇒ q1 ≄ q3 (the two final loops of SetIncompatible, together with algorithm 6). Thanks to this new algorithm, we are able to infer efficiently non-deterministic and unambiguous C-NFAs. The algorithm can also be applied directly to the inference of classical NFA by inferring a 2-NFA and enabling only merges between the states of MCA(⟨S+, S−⟩) created by the positive sample S+. In this framework, the branches created by the negative sample are only used to check the ambiguity of the current C-NFA. To recover the corresponding NFA, we suppress from the C-NFA the part of MCA(⟨S+, S−⟩) corresponding to the negative sample. We present in the next section experiments applying this approach to the inference of classical NFAs.
3 Experiments
We have implemented our algorithm to carry out first experiments in order to test the validity of our approach. Our idea is to compare the amount of information needed to correctly infer a NFA versus its determinized version. We first present the benchmark designed for this experiment and the state merging algorithm we have used, before giving the experimental results.

3.1 Benchmark
We have chosen for this benchmark different non deterministic automata (figure 4) inspired by different papers [Dup96,SY97,DLT00] or specifically designed for the benchmark. We have tried to represent the various processes of determinization. The benchmark contains: a DFA such that no smaller NFA that recognizes the same language is expected to exist (L1) ; a NFA such that its determinization is polynomial (L2) ; NFAs with exponential determinization, representing a finite language (L3) or not (L4, L5) ; a simple NFA common in the DFA literature [Dup96], with a transition added in its transition function (L6). The various properties of these automata are summarized in table 1. A parameter n is set for some of the automata allowing to tune their size, the value chosen for n in the benchmark is indicated in the third column of table 1.
Figure 4. Automata of the benchmark
Table 1. Characteristics of the benchmark's automata. For each language (over Σ = {a, b}) the table gives the value of n used in the benchmark, the size of the NFA and the size of the corresponding DFA. The recoverable entries include: L1, all words whose number of a's minus number of b's is congruent to 0 modulo n (n = 8, NFA and DFA both of size n); the languages Σ∗aΣⁿ, {(b∗a)^((n−1)·x+n·y) | x ∈ N, y ∈ N⁺} and the finite language {w ∈ Σ∗ | w = uav, |w| < n ∧ |v| = ⌊n/2⌋ − 1}; and L6, given directly by its automaton (NFA of size 3, DFA of size 7). The DFA sizes of the other languages grow as (n − 1)² + 2, as 2^(n/2+1) − 1 (n even) or 3·2^⌊n/2⌋ − 1 (n odd), as 2^(n+1), and as 2^n − 1.
Samples for the training and testing sets were generated following a normal distribution for the length and a uniform distribution over the words of a given length. Training and testing sets are disjoint.

3.2 Algorithm
In these experiments, we consider the inference of a minimum-sized non-deterministic automaton. We propose to use the "coloring" scheme, which has been proven efficient for this search in the deterministic case [BF72,CN97,OS98]. We briefly describe the algorithm we have used. A set C of colored states (the states of the target automaton) is maintained. The exploration of the search space is performed by a function choosing at each step a state q of Q − C and calling itself recursively, first after each successful merging between q and a state of C, and second after the promotion of q into C. Adopting a branch-and-bound strategy, the search is pruned if the number of states in C is greater than in the smallest solution found so far. The same heuristic as in [CN97] has been used both for deterministic and for non-deterministic automaton inference: at each step, the state having the maximum number of colored states unmergeable with it is chosen to be colored. This algorithm has been used in the upper-bound framework [BF72,OS98], which means that it tries to find a solution of size one and increments the size until a solution is found. Within this framework, we guarantee that the solution found is of minimum size and structurally complete with respect to the samples.
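The selection heuristic itself is compact; a Python sketch (the set of colored states C and the map from each state to the states unmergeable with it are assumed to be maintained as in section 2):

def next_state_to_color(states, colored, unmergeable):
    """Heuristic of the search: choose the uncolored state having the maximum
    number of colored states unmergeable with it (ties broken arbitrarily)."""
    candidates = [q for q in states if q not in colored]
    if not candidates:
        return None
    return max(candidates, key=lambda q: len(unmergeable.get(q, set()) & colored))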
3.3 Results
The result of algorithm’s runs are given in figure 5 and table 2. We can verify that for all the experiments except one, identification of the NFA requires
a smaller sample than identification of its deterministic version. The only exception is L1, which has been constructed so as to be hard to identify with the non-deterministic approach. For all the other languages, the non-deterministic approach seems clearly better suited to this task when only sparse training data are available. We may interpret this result by applying Occam's razor principle: a smaller automaton compatible with the positive and negative samples is more likely to identify the target language. This result may also be explained by the amount of data needed to ensure structural completeness with respect to the target automaton.

Table 2. Convergence observed: number of samples needed to reach a "stable" 100% recognition rate.
  language   deterministic case   non deterministic case
  L1         166                  278
  L2         372                  23
  L3         > 500                79
  L4         65                   22
  L5         100                  32
  L6         190                  27
Conclusion
We have proposed an algorithm to detect whether a C-NFA is ambiguous. This algorithm may be used incrementally in a state merging inference process, taking into account not only the possible state mergings but also the impossible ones. We have applied this approach to the exact search of a minimal NFA with a saturation strategy. The experimental results are promising and tend to show that less data may be needed to identify the non-deterministic automaton representation of a language than its deterministic representation. However, the main problem for the inference of non-deterministic automata remains the lack of a canonical form. Denis et al. [DLT00] have very recently presented a first response to this problem by constructing a subclass of NFA for which a canonical form can be defined. Their results could be integrated in the state merging framework in order to reduce the search space and to obtain identification results.
Acknowledgments. The authors wish to thank Jacques Nicolas for helpful discussions about this work and Tallur Basavanneppa for valuable comments on the manuscript.
Figure 5. Graphs giving recognition level on testing set and size of automaton found (ordinate) compared to the number of samples in training set (abscissa). Inference of DFAs and NFAs is given on same graphs for each language.
References
[Alq97] Alquézar (R.). – Symbolic and connectionist learning techniques for grammatical inference. – PhD thesis, Universitat Politecnica de Catalunya, March 1997.
[AS95] Alquézar (R.) and Sanfeliu (A.). – Incremental grammatical inference from positive and negative data using unbiased finite state automata. In: Shape, Structure and Pattern Recognition, Proc. Int. Workshop on Structural and Syntactic Pattern Recognition, SSPR'94, Nahariya (Israel), pp. 291–300, 1995.
[BF72] Biermann (A. W.) and Feldman (J. A.). – On the synthesis of finite-state machines from samples of their behaviour. IEEE Transactions on Computers, C-21, 1972, pp. 592–597.
[CN97] Coste (F.) and Nicolas (J.). – Regular inference as a graph coloring problem. In: Workshop on Grammar Inference, Automata Induction, and Language Acquisition (ICML'97), Nashville, TN, USA, July 1997.
[Cos99] Coste (F.). – State merging inference of finite state classifiers. – Technical report INRIA/RR-3695, IRISA, September 1999.
[dlH97] de la Higuera (C.). – Characteristic sets for polynomial grammatical inference. Machine Learning, vol. 27, 1997, pp. 125–138.
[DLT00] Denis (F.), Lemay (A.) and Terlutte (A.). – Apprentissage de langages réguliers à l'aide d'automates non déterministes. In: Conférence d'apprentissage CAp'00, 2000.
[Dup96] Dupont (P.). – Utilisation et apprentissage de modèles de langages pour la reconnaissance de la parole continue. – PhD thesis, Ecole Nationale Supérieure des Télécommunications, 1996.
[Gol78] Gold (E. M.). – Complexity of automaton identification from given data. Information and Control, vol. 37, 1978, pp. 302–320.
[HU80] Hopcroft (J.) and Ullman (J.). – Introduction to Automata Theory, Languages, and Computation. – Reading, MA, Addison-Wesley, 1980.
[Lan92] Lang (K. J.). – Random DFA's can be approximately learned from sparse uniform examples. 5th ACM Workshop on Computational Learning Theory, 1992, pp. 45–52.
[LPP98] Lang (K. J.), Pearlmutter (B. A.) and Price (R. A.). – Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Computer Science, vol. 1433, 1998, pp. 1–12.
[OG92] Oncina (J.) and Garcia (P.). – Inferring regular languages in polynomial update time. Pattern Recognition and Image Analysis, 1992, pp. 49–61.
[OS98] Oliveira (A. L.) and Silva (J. P. M.). – Efficient search techniques for the inference of minimum size finite automata. In: South American Symposium on String Processing and Information Retrieval, 1998.
[SY97] Salomaa (K.) and Yu (S.). – NFA to DFA transformation for finite languages. In: First International Workshop on Implementing Automata, WIA'96, p. 188, 1997.
[Yok94] Yokomori (T.). – Learning non-deterministic finite automata from queries and counterexamples. Machine Intelligence, vol. 13, 1994, pp. 169–189.
Learning Regular Languages Using Non Deterministic Finite Automata⋆
François Denis, Aurélien Lemay, and Alain Terlutte
Bât. M3, GRAPPA-LIFL, Université de Lille I, 59655 Villeneuve d'Ascq Cedex, France
{denis, lemay, terlutte}@lifl.fr
Abstract. We define here the Residual Finite State Automata class (RFSA). This class, included in the Non deterministic Finite Automata class, strictly contains the Deterministic Finite Automata class and shares with it a fundamental property : the existence of a canonical minimal form for any regular language. We also define a notion of characteristic sample SL for a given regular language L and a learning algorithm (DeLeTe). We show that DeLeTe can produce the canonical RFSA of a regular language L from any sample S which contains SL . We think that working on non deterministic automata will allow, in a great amount of cases, to reduce the size of the characteristic sample. This is already true for some languages for which the sample needed by DeLete is far smaller than the one needed by classical algorithms. Key words : regular inference, non deterministic automata.
1 Introduction
Regular language learning is still one of the major thema of the grammatical inference field. This class of language, the simplest in the Chomsky hierarchy, is neither efficiently learnable in the Gold Model [Pit89], nor in the Valiant one [KV94]. Nevertheless, this class is polynomially learnable from given data [Gol78] : one can build for each regular language L a sample SL in polynomial time relatively to the size of the smallest Deterministic Finite Automata (DFA) recognizing L and characterizing L in the way that L can be found back from any sample S containing SL . We could think that this theoritical result would have few consequences since nothing assures that a natural sample contains SL . But some learning algorithms in this last model, like RPNI made by Oncina and Garcia [OG92], are already interesting from an experimental point of view : correctly adapted and with good heuristics, they can turn into powerful algorithms [LPP98]. Can we go further ? The RPNI algorithm calculates a deterministic automaton compatible with the sample in polynomial time. But it is a well known fact that regular languages have representations much cheaper in term of ?
This work was partially supported by “Motricit´e et Cognition : Contrat par objectif r´egion Nord/Pas-de-Calais”
size : the minimal DFA recognizing the language Σ ∗ 0Σ n has O(2n ) states while this language is described here by a regular expression with O(log n) symbols. Looking after a non deterministic finite automaton (NFA) could seem to be a promising way but it has been proved that NFA are not polynomially identifiable from given data [Hig97]. That is nevertheless the way we will explore here. We define a sub-class of NFA, the Residual Finite States Automata (RFSA) that has some good properties of DFA (existence of a canonical representation) and NFA (often concise minimal representation), but also some of their drawbacks. We will show how to associate with a DFA A a characteristic sample SA of polynomial cardinal and we will define an algorithm (DeLeTe) that builds the canonical RFSA from any sample containing SA . The cardinal of SA is, in any cases, of the same order than the cardinal of the sample needed by RPNI, and it may be exponentially smaller in best cases. For example, RPNI needs O(2n ) examples to learn the language Σ ∗ 0Σ n although DeLeTe can learn it from sample of O(n2 ) words. However, this new algorithm is not a learning algorithm from given data, we will show that it is probably not a serious problem in a PAC context (Probably Approximatively Correct [Val84]) where samples are produced using a probability distribution, where the allowed running time depends on the length of the drawn examples and where the returned hypothesis can be an approximation of the target. After some preliminaries on languages, automata and the learning model from given data presented in Sect. 2, we introduce the Residual Finite States Automata in Sect. 3. Then we show how to associate a representative set with any regular language in Sect. 4.1, we present the DeLeTe algorithm in Sect. 4.2 and we demonstrate the main result of this article in Sect. 4.4. Then, we comment this result and argue that it seems promising.
2 Preliminaries

2.1 Languages and Automata
Let Σ be a finite alphabet and Σ ∗ be the set of words built on Σ. We note ε the empty word and |u| the length of a word u of Σ ∗ . We assume that words on Σ ∗ are ordered the following way : u < v iff [|u| < |v| or (|u| = |v| and u is before v in the lexicographical order)]. A language is a subset of Σ ∗ . If L is a language, we note pref (L) = {u ∈ Σ ∗ |∃v ∈ Σ ∗ such that uv ∈ L}. A Non deterministic Finite Automaton (NFA) is a quintuplet A = hΣ, Q, Q0 , F, δi where Q is a finite set of states, Q0 ⊆ Q is the set of initial states, F ⊆ Q is the set of terminal states and δ is a (partial) transition function defined from a subset of Q × Σ to 2Q . We also note δ the extended transition function defined on (a subset of) 2Q × Σ ∗ . A language L is regular if there exists a NFA A = hΣ, Q, Q0 , F, δi such that L = {u ∈ Σ ∗ |δ(Q0 , u) ∩ F 6= ∅}. We note REG the set of regular languages. Let A = hΣ, Q, Q0 , F, δi be a NFA and q a state of A, we note Lq the language Lq = {u ∈ Σ ∗ |δ({q}, u) ∩ F 6= ∅}. An automaton is said Deterministic (DFA) if Q0 contains only one element, and if, for each state q and each letter x, δ(q, x) contains at most one element. A finite automaton A is trimmed if every state is accessible and if, from each state, we can access to a terminal
state. Any non-empty regular language is accepted by a unique (up to isomorphism) minimal deterministic trimmed automaton. If L is a regular language and u is a word of Σ∗, we note u−1L the residual language of L by u, defined by u−1L = {v ∈ Σ∗ | uv ∈ L}. According to the Myhill-Nerode theorem, the set of distinct residual languages of a regular language is finite. Furthermore, if A = ⟨Σ, Q, {q0}, F, δ⟩ is the minimal DFA recognizing L, then u−1L ↦ δ(q0, u) is a bijection from the set of residual languages of L to Q. Let A = ⟨Σ, Q, Q0, F, δ⟩ be a NFA and L the language recognized by A. We define As = ⟨Σ, Q, Q0s, F, δs⟩ where Q0s = {q ∈ Q | Lq ⊆ L} and δs(q, x) = {q′ ∈ Q | Lq′ ⊆ x−1Lq} for every state q and every letter x. As is said to be the saturated of A, and we say that an automaton A is saturated if it is isomorphic to As. One can show that an automaton and its saturated recognize the same language [DLT00b].

2.2 Learning Languages
Our framework is regular language learning from examples. If L is a language defined on the alphabet Σ, an example of L is a pair (u, e) where u ∈ Σ ∗ and e = 1 if u ∈ L (positive example), and e = 0 otherwise (negative example). A sample S of L is a finite set of examples of L. We note S + = {u|(u, 1) ∈ S} and S − = {u|(u, 0) ∈ S}. The size of a sample S (noted ||S||) is the sum of the length of all the words in it.Gold showed that the class of regular languages is polynomially learnable from given data [Gol78]. Goldman and Mathias introduced a learning model with teacher [GM96] that De la Higuera extended to languages and showed equivalent to the learning model from given data [Hig97]. To show that the REG class, represented by DFA, is polynomially learnable from given data is equivalent to show that there exists two algorithms T and L such that for any regular language L with a minimal DFA A : – T takes A as input and gives in polynomal time a sample SL of L (of polynomial size with respect to the size of A) – for each sample S of L, L takes S as input and gives in polynomial time a DFA compatible with S, equivalent with A if S contains SL . The RPNI algorithm ([OG92]) is a learning algorithm for the regular language class in this model. RPNI builds from a sample S the most specific DFA recognizing S + (the prefix tree), then examines if it is possible to merge two states, beginning with the root, while keeping the automaton deterministic and consistant with the sample. The purpose of the characteristic sample is here to avoid that two states are merged when they do not have to be. So, that is not a surprise that RPNI can identify an automaton from any sample containing its characteristic sample : but the main interesting point of this algorithm is that it keeps working in a proper way in “deteriorated” mode, that is when the sample does not contain the characteristic sample. It is here necessary to define heuristics precising whether two states can be merged or not, even when the conditions to do it are not fulfilled.
3 RFSA
Definition 1. A Residual Finite State Automaton (RFSA) is a finite automaton A = hΣ, Q, Q0 , F, δi such that, ∀q ∈ Q, ∃u ∈ Σ ∗ such that Lq = u−1 L. Deterministic automata are obviously RFSA, but some non deterministic finite automata are not (cf. figure 1). 0
Fig. 1. This automaton is not a RFSA as Lq1 = {ε} is not a residual of 0+ .
One of the major interests of the RFSA class is that we can define a notion of canonical RFSA associated with a language.
Definition 2. Let L be a regular language. A residual language u−1L is said to be prime if it is not a union of other residual languages of L, that is, if ∪{v−1L | v−1L ⊊ u−1L} ⊊ u−1L. We say that a residual language is composed if it is not prime.
Example 1. Let L = a∗ + b∗. This language possesses 3 non-empty residuals: L, a∗ and b∗. The first one is composed, the other two are prime.
Definition 3. Let L be a regular language on Σ and let A = ⟨Σ, Q, Q0, F, δ⟩ be the automaton defined by:
– Q is the set of prime non-empty residual languages of L;
– Q0 is the set of prime residual languages included in L;
– F is the set of prime residual languages of L containing ε;
– δ is defined by δ(u−1L, x) = {v−1L ∈ Q | v−1L ⊆ (ux)−1L}.
We say that A is the canonical RFSA of L.
One can show that the canonical RFSA of a regular language L is an RFSA, that it is saturated, that it recognizes L, that it is minimal in the number of states and that any other minimal saturated RFSA recognizing L is isomorph with it [DLT00b]. As a result, the RFSA class shares at least two important properties with DFA : states are defined by residual languages and there exists a canonical element. On the other hand, canonical RFSA of a regular language can be much smaller than the minimal DFA that recognizes this language : that is the case of the language Σ ∗ 0Σ n , often showed in the litterature to illustrate the fact that there can be an exponential gap between the size of the minimal DFA and the size of a minimal NFA recognizing a language.
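The gap can be observed empirically. The following Python sketch approximates each residual u−1L of Σ∗0Σ² (the language of figure 3 below) by its words of length at most a bound B and counts how many distinct residuals are prime under this bounded view; it is only an illustration on a finite approximation, not a decision procedure, and the bounds chosen are arbitrary.

from itertools import product

SIGMA = "01"

def in_L(w):                        # membership in L = Sigma* 0 Sigma^2
    return len(w) >= 3 and w[-3] == "0"

def bounded_residual(u, B=8):
    """Approximate u^-1 L by the set of its words of length <= B."""
    words = ("".join(t) for n in range(B + 1) for t in product(SIGMA, repeat=n))
    return frozenset(v for v in words if in_L(u + v))

def primes_among_residuals(B=8, max_prefix=6):
    prefixes = ["".join(t) for n in range(max_prefix + 1) for t in product(SIGMA, repeat=n)]
    residuals = {bounded_residual(u, B) for u in prefixes}
    primes = []
    for r in residuals:
        strictly_smaller = [s for s in residuals if s < r]        # residuals strictly included in r
        union = frozenset().union(*strictly_smaller) if strictly_smaller else frozenset()
        if union != r:                                            # r is not a union of smaller residuals
            primes.append(r)
    return len(residuals), len(primes)

print(primes_among_residuals())     # expected: 8 distinct residuals, 4 of them prime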
Fig. 2. Canonical RFSA accepting the language a∗ + b∗.
Fig. 3. The canonical RFSA of Σ∗0Σ²: a minimal NFA has 4 states too, although the minimal DFA has 8 states. The 4 prime residual languages are L = Σ∗0Σ², 0−1L = L ∪ Σ², 01−1L = L ∪ Σ and 011−1L = L ∪ {ε}. The other 4 residuals are composed.
4 Learning Regular Languages Using RFSA

4.1 A Characteristic Sample
Let A = (Σ, Q, q0, F, δ) be a minimal trimmed DFA. For every state q of A, we define uq as the smallest word of Σ∗ such that δ(q0, uq) = q. As a consequence, we have uq0 = ε. We assume that Q = {q0, . . . , qn} is ordered using uq; in other words, qi < qj iff uqi < uqj. We note PC(L) = {uq | q ∈ Q} and U(L) = PC(L) ∪ {ux | u ∈ PC(L), x ∈ Σ}.
Definition 4. We say that a sample S is characteristic for the minimal DFA if
– for every state q of Q, there exists a word of the form uqv in S+;
– for every pair of states q and q′′ of Q and every letter x, if q′ = δ(q, x) then
  – if Lq′ \ Lq′′ ≠ ∅ then there exists w such that uqxw ∈ S+ and uq′′w ∈ S−;
  – if Lq′′ \ Lq′ ≠ ∅ then there exists w such that uqxw ∈ S− and uq′′w ∈ S+.
Let S be a sample and let u and v be two words of pref(S+). We note:
– u ≃ v if no word w exists such that uw ∈ S+ and vw ∈ S−, or the opposite;
– u ≺ v if no word w exists such that uw ∈ S+ and vw ∈ S−.
As, in a learning context, we do not know the residual languages but only the learning sample, we will use these relations to estimate the relations between residual languages. We show here that it is reasonable to do so if S is characteristic for the minimal DFA and if u and v are in U(L).
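These two tests translate directly into code; a Python sketch with a hypothetical representation of the sample as two sets of strings S_plus and S_minus:

def suffixes(words):
    """All suffixes of the given words (the only w's that can matter)."""
    return {w[i:] for w in words for i in range(len(w) + 1)}

def precedes(u, v, S_plus, S_minus):
    """u < v (sample-based estimate of u^-1 L included in v^-1 L):
    no suffix w may send u into S+ and v into S-."""
    return not any(u + w in S_plus and v + w in S_minus
                   for w in suffixes(S_plus | S_minus))

def similar(u, v, S_plus, S_minus):
    """u ~ v (sample-based estimate of u^-1 L = v^-1 L)."""
    return precedes(u, v, S_plus, S_minus) and precedes(v, u, S_plus, S_minus)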
Lemma 1. If u, v ∈ U(L) and if S is a sample of L, then u ≃ v ⇐ u−1L = v−1L and u ≺ v ⇐ u−1L ⊆ v−1L. If S is characteristic for the minimal DFA A, then the converses are true.
Proof. The if-parts are straightforward. Let us assume that S is characteristic for the DFA A and that u−1L \ v−1L ≠ ∅. Then u and v are distinct and at least one of these two words is different from ε. Let us assume that u = u1x. We verify that a state q1 exists such that uq1 = u1. There exists a word w such that uw = u1xw ∈ S+ and vw ∈ S−, thus neither u ≃ v nor u ≺ v holds. The other three cases (u−1L \ v−1L ≠ ∅ and v = v1x; v−1L \ u−1L ≠ ∅ and v = v1x; v−1L \ u−1L ≠ ∅ and u = u1x) are treated in the same way.
Let us introduce a new notation:
– if q is a prime state, let vq be the smallest word of Lq such that, for every state q′, Lq′ ⊊ Lq ⇒ vq ∉ Lq′;
– if q is a composed state, there exists a highest index k such that Lq = ∪{Lqi | Lqi ⊊ Lq and i ≤ k} and ∪{Lqi | Lqi ⊊ Lq and i < k} ⊊ Lq. We then define vq as the smallest word of Lq \ ∪{Lqi | Lqi ⊊ Lq, i < k}.
Definition 5. We say that a sample S is characteristic for the canonical RFSA of a regular language L if it is characteristic for the minimal DFA of L and if
– vq0 ∈ S+ and, for every state q such that Lq ⊆ L \ {vq0}, we have uqvq0 ∈ S−;
– for every state q and every letter x, if q′ = δ(q, x) then
  – uqxvq′ ∈ S+ and
  – for every state q′′ such that Lq′′ ⊆ Lq′ \ {vq′}, we have uq′′vq′ ∈ S−.
Let S be a sample and u0, . . . , un ∈ pref(S). We note u0 = ⊕{u1, . . . , un} if ui ≺ u0 for every i = 1 . . . n and if, for every word v, u0v ∈ S+ implies that there exists at least one index i > 0 such that uiv ∉ S−.
Lemma 2. If u0, . . . , un ∈ U(L) and if S is a sample of L, then u0−1L = ∪{ui−1L | i = 1 . . . n} ⇒ u0 = ⊕{u1, . . . , un}.
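The ⊕ test can be implemented on top of the previous sketch (it reuses precedes and suffixes from the code given after the definition of ≺ and ≃; the sample encoding is the same assumption):

def is_oplus(u0, us, S_plus, S_minus):
    """u0 = (+){u1, ..., un}: every ui precedes u0, and every positive
    continuation of u0 is not refused by at least one ui."""
    if not all(precedes(u, u0, S_plus, S_minus) for u in us):
        return False
    for w in suffixes(S_plus):
        if u0 + w in S_plus and all(u + w in S_minus for u in us):
            return False
    return True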
Furthermore, if we assume that S is characteristic for the canonical RFSA and that U = {u1, . . . , un} verifies the following property: ∀v ∈ PC(L), (v ≺ u0 and ∃u ∈ U, v ≤ u) ⇒ v ∈ U, then the converse is true.
Proof. The necessary part of the statement is straightforward. Let us assume that u0 = ⊕{u1, . . . , un}. As S is characteristic for the DFA, we have ui−1L ⊆ u0−1L for each i = 1 . . . n because of the previous lemma. If we had ∪{ui−1L | i = 1 . . . n} ⊊ u0−1L, we would also have ui−1L ⊆ u0−1L \ {v0} (with v0 = vδ(q0,u0)) for every index i, and so uiv0 ∈ S−, which contradicts our hypothesis.
Remarks
– The ≺, ≃ and ⊕ operators have the intended properties as long as we require the right conditions on the working sample and as long as we only use words of U(L). This means that, if we have a characteristic sample, we can use these operators to evaluate the relations between residual languages; that is actually what we do in the learning algorithm presented here.
– Of course, there exist characteristic samples for the minimal DFA of polynomial size with respect to the size of this DFA; we can also observe that the RPNI algorithm can recover this minimal DFA from such a sample.
– There also exist characteristic samples for RFSA whose cardinal is polynomial with respect to the size of the minimal DFA recognizing the language L; however, it may happen that the smallest characteristic sample for the canonical RFSA recognizing a language L contains words of length exponential with respect to the size of the minimal DFA of this language. Let us note p1, . . . , pn the first n prime numbers and let us define, for every index i, the language Li = {ε} ∪ {a^k | pi does not divide k}. Let us introduce n + 1 extra letters {x0, . . . , xn} and let us consider the language L = x0a∗ ∪ ∪{xiLi | i = 1 . . . n}. The residual x0−1L is not the union of the residuals xi−1L, but the first word that can show it is a^(p1···pn), whose length is exponential with respect to the size of the minimal DFA recognizing L.

4.2 The DeLeTe Algorithm
We present here a grammatical inference algorithm that builds a NFA from a sample of a target language L. We show that if this sample is characteristic, the resulting automaton is the canonical RFSA of L. This algorithm is divided in 4 main phases. Suppose that the input sample is characteristic : – the marking phase marks some prefixes of the input sample : each marked word is in P C(L) (lemma 3) and each state of the canonical RFSA of L will correspond to one of those words. – in the saturation phase, we build the prefix tree automaton corresponding to the positive part of the input sample. We establish the ≺ relation between the marked prefixes and their successors. From our hypothesis and lemmas 1 and 2, these relations are correct and the resulting automaton contains the canonical RFSA of L, – in the cleaning phase, we delete un-marked states. Again, we will show that having a good sample allows us to do that without changing the language recognized by the automaton, – in the reduction phase, we delete marked states that are recognized as composed by the algorithm. In these conditions, the resulting automaton is the canonical RFSA of L. If the sample is not characteristic, we can at least show that the resulting automaton is consistant with the sample.
Input : a sample S of a language L (we suppose S + is not empty). We note QS = {u0 , . . . , un } prefixes of S + ordered using the usual order. Marking phase Initialisation : let Q˙ = {ε}(= {u0 }) For i from 1 to n Do If ∃j < i and x ∈ Σ such that ui = uj x and uj ∈ Q˙ Then ˙ < i, uk ≺ ui } Let Ei = {uk ∈ Q|k ˙ If ui 6= ⊕Ei Then Add qi to Q. End For We note QM the set Q˙ obtained at the end of this phase. Saturation phase ˙ by setting ˙ Q˙ 0 , F˙ , δi Initialisation : We create an automaton A˙ = hΣ, Q, + ˙ ˙ ˙ ˙ Q = QS , Q0 = {u0 }, F = S and δ(u, x) = {ux ∈ QS } for every word u ∈ QS ¨ = QM ∪ {ux ∈ pref (S + ) | x ∈ Σ, u ∈ QM } and every x ∈ Σ. We note Q For i from 0 to n Do For j from 0 to n Do ¨ ui ≺ uj Then If i 6= j, ui ∈ QM , uj ∈ Q, If uj = ux Then ˙ x) if it does not imply losing consistance with S − Add ui to δ(u, If uj = ε Then Add ui to Q˙ 0 if it does not imply losing consistance with S − End For End For We call As = hΣ, QS , Q0S , FS , δS i the automaton A˙ obtained after this phase. Cleaning phase ˙ = AS ˙ Q˙ 0 , F˙ , δi Initialisation : We set A˙ = hΣ, Q, ˙ For every states u ∈ Q \ QM , Do Suppress u of Q˙ if it does not imply losing the consistance with S + End For We note AC = hΣ, QN , Q0N , FN , δN i the automaton A˙ obtained after this phase. Reduction phase ˙ = AC ˙ Q˙ 0 , F˙ , δi Initialisation : We take A˙ = hΣ, Q, ˙ For every states ui ∈ Q Do If ∀v such that ui v ∈ S + , ∃uk ∈ Q˙ such that uk ≺ ui , uk 6= ui and uk v 6∈S − Then Suppress ui of Q˙ if it does not imply losing the consistance with S + End For We note AR = hΣ, QR , Q0R , FR , δR i the automaton A˙ obtained after this phase. Output : The automaton AR .
4.3 Example
Let L = Σ∗0Σ. The minimal DFA that recognizes L has 4 states, corresponding to the residual languages L, 0−1L = L ∪ Σ, 00−1L = L ∪ Σ ∪ {ε} and 01−1L = L ∪ {ε}. All of them are prime except 00−1L = L ∪ 0−1L ∪ 01−1L.
Fig. 4. minimal DFA recognizing L
We have PC(L) = {ε, 0, 00, 01} and U(L) = {ε, 0, 1, 00, 01, 000, 001, 010, 011}. Let us study the behaviour of DeLeTe on the following sample: S+ = {00, 01, 000, 100, 0000, 0100, 01000, 01100}, S− = {ε, 0, 1, 10, 010, 011, 0110}. We first mark the words. We can observe, for instance, that 1 is not marked, as we have 1 ≃ ε. At the end of the marking phase, we have QM = {ε, 0, 00, 01} and Q̈ = {ε, 0, 1, 00, 01, 000, 010, 011}. Between the words of QM and Q̈ we get the following non-trivial relations: for every u, ε ≺ u; 000 ≃ 00; 010 ≃ 0; 011 ≃ ε; 1 ≃ ε; 0 ≺ 00; 01 ≺ 00; 0 ≺ 000; 01 ≺ 000. With those relations, we can build the automaton of figure 5 at the end of the saturation phase. At the cleaning phase, we suppress every unmarked state. As the state 00 is composed of all the others, we can suppress it during the reduction phase, and we obtain the automaton of figure 6 at the end of the reduction phase. This automaton is the canonical RFSA of L.
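Re-running the precedes/similar sketches of section 4.1 on this sample confirms the relations listed above mechanically (the encoding of the sample as plain strings is, again, only an assumption for illustration):

S_plus  = {"00", "01", "000", "100", "0000", "0100", "01000", "01100"}
S_minus = {"", "0", "1", "10", "010", "011", "0110"}

# A few of the non-trivial relations listed above:
checks = [
    ("1",   "",   similar),    # 1 ~ eps
    ("011", "",   similar),    # 011 ~ eps
    ("010", "0",  similar),    # 010 ~ 0
    ("000", "00", similar),    # 000 ~ 00
    ("0",   "00", precedes),   # 0 < 00
    ("01",  "00", precedes),   # 01 < 00
]
for u, v, rel in checks:
    assert rel(u, v, S_plus, S_minus), (u, v)
print("all relations confirmed")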
4.4 Results
We suppose in this whole section that S is a characteristic sample for the canonical RFSA.
Lemma 3. For every word u ∈ pref(S+), we have:
– if u ∈ QM then u ∈ PC(L);
– if u ∈ PC(L), then there exist u1, . . . , ul ∈ QM such that ui ≤ u for every index i and such that u = ⊕{u1, . . . , ul}.
Fig. 5. Output automaton at the end of the saturation phase.
Fig. 6. Output automaton of DeLeTe
One can prove both properties at the same time using a recurrence on the length of u (see full proof in [DLT00a]). This lemma does imply two main results : first, as every marked word u is in P C(L), the ≺ and ⊕ relations between those words do correspond to real inclusions and union relations between their representative languages, that is, if u and v are in QM , u ≺ v ⇒ u−1 L ⊆ v −1 L ; second, every prime word of P C(L) is in QM . We can then show that the automaton AS obtained after the saturation phase has an essential property : the language LS recognized by AS is exactly equal to the target language L. We can prove that using the following lemma. Lemma 4. ∀u ∈ QM , u−1 LS = u−1 L. Sketch of proof : (see full proof in [DLT00a]) The proof is mainly based on the fact that, due to the precedent lemma, if we have two words u and v in QM such that u ≺ v, then we have u−1 L ⊆ v −1 L. We first have to show that consistance tests in the algorithm are useless when the sample has the conditions required here, and that we can neglect them in those proofs.
We also have to show that our transition function is correct, that is, if u ∈ δS(v, x), then u⁻¹L ⊆ (vx)⁻¹L. This can be proved using the fact that it holds in the prefix tree and that every change made to the transition function preserves this property. In particular, this implies that ∀u ∈ QM, u⁻¹LS ⊆ u⁻¹L. Furthermore, we can show that the automaton AS contains all the states and transitions of the canonical RFSA of L. So we can conclude that, at least for words u ∈ QM such that u⁻¹L is a prime residual of L, we have u⁻¹L ⊆ u⁻¹LS. It is then not too hard a step to show it for every other state of QM. □

Theorem 1. If we give a characteristic sample for the canonical RFSA of a language L as input to the DeLeTe algorithm, the output automaton is the canonical RFSA of L.

Sketch of proof: (see the full proof in [DLT00a]) We already know that the automaton AS contains the states and the transitions of the canonical RFSA of L, and it is not hard to prove that the cleaning phase and the reduction phase remove from this automaton all the states that are not prime. Furthermore, we can show that all the remaining transitions have the properties required in the canonical RFSA and that the remaining initial states are the initial states of the canonical RFSA. □

Despite this theorem, DeLeTe is not a learning algorithm for regular languages from given data, since it can happen that the smallest characteristic sample for the canonical RFSA of the target language contains a word whose length is exponential with respect to the size of the minimal DFA recognizing it.
5 Remarks and Conclusion
Using DFA to represent regular languages is too strong a constraint, since many regular languages possess shorter representations. That is the case of languages like Σ*0Σ^n, for which the minimal DFA has a number of states exponential in n. So it is impossible for an algorithm like RPNI to obtain good results on those languages from samples of size polynomial in n. Nevertheless, it is quite easy to show that DeLeTe can recover those languages from a sample of cardinality O(n²) and of size O(n³). For instance, one can verify that the sample {01^i01^j, 101^i, 01^k, 1^i | 0 ≤ i, j ≤ n + 1, 0 ≤ k ≤ 2n}, correctly labeled, of size n² + O(n), behaves like a strongly representative sample in the sense that DeLeTe computes exactly the target language from all the examples it contains. The reason for this good behaviour is that those languages have few prime residuals and that they are defined by short words. But DeLeTe is not built merely to succeed on those “academic” languages. The ideas we propose here share the same philosophy as the one that motivated the design of RPNI and that explains its performance. Under ideal conditions, it is an exact learning algorithm; under less ideal conditions (but ones not designed to make it fail), it is still a good algorithm for approximate learning. This
can surely be explained by the fact that, if an incorrect merge has been made, it is because there was no example to forbid it, and that it could probably have been made anyway without important consequences. This remark remains true when examples are distributed according to a probability distribution and when the required performance is both approximate and relative to that distribution, that is, in the PAC framework. This analysis is still basic and we think a deeper analysis remains to be done. The previous remark can be used in our case too. We said in the introduction that the representative sample of a DFA can contain words of exponential length, but in a PAC learning context this is probably not a serious problem. Indeed, those long words are used, in an exact learning context, to make sure that a residual is not replaced by the union of the residuals it contains when it should not be. But if no example forbids that union, it is probably not serious to suppose we could do it. All of this needs to be formalized and made precise: we would like to find an algorithm that, when examples are distributed according to a given probability, gives us an RFSA close to the target relative to this distribution; this remains to be done. We think the ideas presented here are original and promising. The present paper is a first step. It shows that it is relevant to study grammatical inference of regular languages using the RFSA representation. This work will be carried on.
References

[DLT00a] F. Denis, A. Lemay, and A. Terlutte. Learning regular languages using non deterministic finite automata. Technical Report 7, 2000.
[DLT00b] F. Denis, A. Lemay, and A. Terlutte. Les automates finis à états résiduels (AFER). Technical report, ftp://ftp.grappa.univliller3.fr/pub/reports/after.ps.gz, 2000.
[GM96] S.A. Goldman and H.D. Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52(2):255-267, 1996.
[Gol78] E.M. Gold. Complexity of automaton identification from given data. Inform. Control, 37:302-320, 1978.
[Hig97] C. de la Higuera. Characteristic sets for polynomial grammatical inference. Machine Learning, 27:125-137, 1997.
[KV94] M. Kearns and L. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM, 41(1):67-95, 1994.
[LPP98] K.J. Lang, B.A. Pearlmutter, and R.A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. Lecture Notes in Artificial Intelligence, 1433:1-12, 1998.
[OG92] J. Oncina and P. García. Inferring regular languages in polynomial update time. In Pattern Recognition and Image Analysis, pages 49-61, 1992.
[Pit89] L. Pitt. Inductive inference, DFAs, and computational complexity. In Proceedings of the AII-89 Workshop on Analogical and Inductive Inference, Lecture Notes in Artificial Intelligence 397, pages 18-44, Heidelberg, October 1989. Springer-Verlag.
[Val84] L.G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, November 1984.
Smoothing Probabilistic Automata: An Error-Correcting Approach

Pierre Dupont¹ and Juan-Carlos Amengual²

¹ EURISE, Université Jean Monnet, 23, rue P. Michelon, 42023 Saint-Etienne Cedex – France
[email protected]
² Universidad Jaume I de Castellón, Campus de Riu Sec, 12071 Castellón – Spain
[email protected]
Abstract. In this paper we address the issue of smoothing the probability distribution defined by a probabilistic automaton. As inferring a probabilistic automaton is a statistical estimation problem, the usual data sparseness problem arises. We propose here the use of an error correcting technique for smoothing automata. This technique is based on a symbol dependent error model which guarantees that any possible string can be predicted with a non-zero probability. We detail how to define a consistent distribution after extending the original probabilistic automaton with error transitions. We show how to estimate the error model’s free parameters from independent data. Experiments on the ATIS travel information task show a 48 % test set perplexity reduction on new data with respect to a simply smoothed version of the original automaton.
1 Introduction
The goal of learning a probabilistic deterministic finite automaton (PDFA) is to induce a DFA structure from data and estimate its constituent transition probabilities. As the structure itself constrains the probability distribution over the set of possible strings, the inference procedure can be considered to be a single problem of statistical estimation. Several learning algorithms for probabilistic automata have been proposed [16,3,15], but the smoothing issue has not been addressed. In particular, when probabilistic automata are used for modeling real data, as in the case of natural language interfaces, the usual problem of data sparseness arises. In other words only a few strings are actually observed in the training sample and many strings that could be observed receive a zero probability of being generated even after the generalization introduced by the inference algorithm. Smoothing the probability distribution fundamentally requires us to discount a certain probability mass from the seen events and to distribute it over unseen events which would otherwise have a zero probability. Considering that a string with zero probability is a string for which there is no path between the initial
state and an accepting state in a probabilistic automaton1 , error-correcting techniques [2] can be used towards this end. Error-correcting techniques extend automata to allow acceptance of, in principle, any string. Using error correction allows us to compute the probability of accepting the string with minimal error. Several criteria can be used to guide this process. For instance, we can look for the minimal number of editing operations necessary to accept a string. Alternatively, we can search for the accepting path of maximal probability in a probabilistic error-correcting parser. In the latter case, the error model parameters need to be estimated and possibly smoothed as well. Definitions and notations are given in section 2.1. The ALERGIA algorithm, which will be used for PDFA inference, is briefly presented in section 2.2. The criterion for evaluating the quality of a PDFA, that is the perplexity computed on an independent test sample, is detailed in section 2.3. We present in section 3 our baseline smoothing technique using linear interpolation with a unigram model. The formal definition of the proposed error-correcting model and the method for estimating its free parameters are fully described in section 4. Experiments on the ATIS task, a spoken language interface to a travel information database, were performed in order to assess the proposed smoothing techniques. The task is presented in section 5. Finally we show how error-correcting techniques improve the baseline perplexity. These experiments are detailed in section 6.
2 Preliminaries
In this section we detail the formal definition of a probabilistic DFA (PDFA). Next, we review briefly the ALERGIA algorithm which will be used in our experiments to infer PDFA. Finally we present the measure for estimating the quality of PDFA inference and smoothing.
2.1 Definitions
A PDFA is a 5-tuple (Q, Σ, δ, q0, γ) in which Q is a finite set of states, Σ is a finite alphabet, δ is a transition function, i.e. a mapping from Q × Σ to Q, q0 is the initial state, and γ is the next symbol probability function, i.e. a mapping from Q × (Σ ∪ {#}) to [0, 1]. A special symbol #, not belonging to the alphabet Σ, denotes the end of string symbol. Hence γ(q, #) represents the probability of ending the generation process in state q, and q is an accepting state if γ(q, #) > 0. The probability function must satisfy the following constraints:

    γ(q, a) = 0,  if δ(q, a) = ∅, ∀a ∈ Σ
    ∑_{a ∈ Σ ∪ {#}} γ(q, a) = 1,  ∀q ∈ Q
The probability PA(x) of generating a string x = x1 . . . xn from a PDFA A = (Q, Σ, δ, q0, γ) is defined as

    PA(x) = ∏_{i=1}^{n} γ(q^i, xi) · γ(q^{n+1}, #)   if δ(q^i, xi) ≠ ∅, with q^{i+1} = δ(q^i, xi) for 1 ≤ i ≤ n and q^1 = q0
    PA(x) = 0   otherwise

¹ We assume here that no existing transition in an automaton has a zero probability.

The language L(A) generated by a PDFA A is made of all strings with non-zero probability: L(A) = {x | PA(x) > 0}. Our definition of probabilistic automaton is equivalent to a stochastic deterministic regular grammar used as a string generator. Thus, ∑_{x ∈ Σ*} PA(x) = 1. Note that some work on the learning of discrete distributions uses distributions defined on Σ^n (that is, ∑_{x ∈ Σ^n} P(x) = 1, for any n ≥ 1) instead of Σ*.
Let I+ denote a positive sample, i.e. a set of strings belonging to a probabilistic language we are trying to model. Let PTA(I+) denote the prefix tree acceptor built from a positive sample I+. The prefix tree acceptor is an automaton that only accepts the strings in the sample and in which common prefixes are merged together, resulting in a tree-shaped automaton. Let PPTA(I+) denote the probabilistic prefix tree acceptor. It is the probabilistic extension of the PTA(I+) in which each transition has a probability proportional to the number of times it is used while generating, or equivalently parsing, the positive sample.
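A minimal, dictionary-based sketch of these definitions is given below: a PDFA kept as delta[q][a] (next state) and gamma[q][a] (next-symbol probability, with "#" as the end-of-string symbol), a function computing PA(x), and a PPTA builder turning transition counts into relative frequencies. The representation and names are illustrative, not the paper's own.

END = "#"

def string_prob(delta, gamma, q0, x):
    """P_A(x): product of next-symbol probabilities along the unique path."""
    q, p = q0, 1.0
    for a in x:
        if a not in delta.get(q, {}):
            return 0.0
        p *= gamma[q].get(a, 0.0)
        q = delta[q][a]
    return p * gamma[q].get(END, 0.0)

def build_ppta(sample):
    """PPTA(I+): prefix tree acceptor whose transition probabilities are
    proportional to the number of times each transition is used on the sample."""
    delta, counts, passes = {0: {}}, {0: {}}, {0: 0}
    nxt = 1
    for w in sample:
        q = 0
        for a in w + END:
            passes[q] += 1
            counts[q][a] = counts[q].get(a, 0) + 1
            if a == END:
                break
            if a not in delta[q]:
                delta[q][a] = nxt
                delta[nxt], counts[nxt], passes[nxt] = {}, {}, 0
                nxt += 1
            q = delta[q][a]
    gamma = {q: {a: c / passes[q] for a, c in cs.items()} for q, cs in counts.items()}
    return delta, gamma

delta, gamma = build_ppta(["ab", "ab", "ac"])
print(string_prob(delta, gamma, 0, "ab"))   # 2/3 on this tiny sample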
2.2 PDFA Inference
Several inference algorithms for probabilistic automata have been proposed [16,3,15], but only Carrasco and Oncina's ALERGIA algorithm, a stochastic extension of the RPNI algorithm [14], is free from the restriction to the learning of acyclic automata. This algorithm has been applied to information extraction from text [7] or structured documents [17], speech language modeling [5] and probabilistic dialog modeling [10]. The ALERGIA algorithm performs an ordered search in a lattice of automata Lat(PPTA(I+)). This lattice is the set of automata that can be derived from PPTA(I+) by merging some states. The specific merging order, that is, the order in which pairs of states are considered for merging, is explained in detail and fully motivated in [4]. At each step of this algorithm, two states are declared compatible for merging if the probabilities of any of their suffixes are similar within a certain threshold α. This parameter α indirectly controls the level of generalization of the inferred PDFA.
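The text does not spell out the compatibility test. To the best of our recollection, Carrasco and Oncina's formulation compares observed frequencies with a Hoeffding-style bound parameterized by α; the snippet below shows only that kind of test in isolation, with an illustrative function name, and should not be taken as the exact test of [4].

from math import log, sqrt

def compatible(f1, n1, f2, n2, alpha):
    """Accept that two observed frequencies f1/n1 and f2/n2 may come from the
    same underlying probability, at confidence parameter alpha."""
    if n1 == 0 or n2 == 0:
        return True
    bound = sqrt(0.5 * log(2.0 / alpha)) * (1.0 / sqrt(n1) + 1.0 / sqrt(n2))
    return abs(f1 / n1 - f2 / n2) < bound

print(compatible(30, 100, 45, 100, alpha=0.05))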
2.3 Evaluation Criterion
Evaluation of non-probabilistic inference methods is usually based on correct classification rates of new positive and negative data [12]. In the case of PDFA inference, the model quality can no longer be measured by classification error rate, as the fundamental problem has become the estimation of a probability distribution over the set of possible strings.
The quality of a PDFA A = (Q, Σ, δ, q0, γ) can be measured by the per-symbol log-likelihood of the strings belonging to a test sample S, according to the distribution defined by the solution PA(x):

    LL = − (1/‖S‖) ∑_{j=1}^{|S|} ∑_{i=1}^{|x^j|} log P(x_i^j | q^i)

where P(x_i^j | q^i) denotes the probability of generating x_i^j, the i-th symbol of the j-th string in S, given that the generation process was in state q^i. This average log-likelihood is also related to the Kullback-Leibler divergence between an unknown target distribution and the proposed solution, by considering the test sample as the empirical estimate of the unknown distribution (see e.g. [5]). The test sample perplexity PP is most commonly used for evaluating language models of speech applications. It is given by PP = 2^{LL}. The minimal perplexity PP = 1 is reached² when the next symbol x_i^j is always predicted with probability 1 from the current state q^i (i.e. P(x_i^j | q^i) = 1), while PP = |Σ| corresponds to random guessing from an alphabet of size |Σ|.
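A small sketch of this evaluation measure: given, for each test string, the list of per-symbol probabilities produced while parsing it with the (smoothed) model, compute the per-symbol log-likelihood and the derived perplexity PP = 2^LL. Names are illustrative.

from math import log2

def perplexity(per_symbol_probs):
    """per_symbol_probs: one list of P(x_i^j | q^i) values per test string."""
    n_symbols = sum(len(ps) for ps in per_symbol_probs)
    ll = -sum(log2(p) for ps in per_symbol_probs for p in ps) / n_symbols
    return 2.0 ** ll

# Perfect prediction gives PP = 1; uniform guessing over |Σ| symbols gives |Σ|.
print(perplexity([[1.0, 1.0], [1.0]]))    # 1.0
print(perplexity([[0.25] * 4]))           # 4.0 for |Σ| = 4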
3 Interpolation with a Unigram Model
In this section we present the basic smoothing technique which will serve as our reference model for smoothing probabilistic automata. A unigram model is a probabilistic model in which the probability of any symbol a from Σ is independent from its context. It can be simply estimated by computing the frequency C(a) of a in a training sample containing N tokens. The probability P(a) is given by

    P(a) = C(a) / N

and the probability P1(x) of a string x = x1 . . . x|x| is given by

    P1(x) = ∏_{i=1}^{|x|} P(xi)
In general not all symbols are observed in the training sample and the unigram distribution is smoothed according to a discounting parameter d [13]:

    P̂(a) = (C(a) − d) / N   if C(a) > 0
    P̂(a) = D / N0           otherwise                                (1)

where D is the total discounted probability mass

    D = ∑_{a | C(a)>0} d / N

and N0 is the number of unseen symbols in the training sample:

    N0 = ∑_{a | C(a)=0} 1.

² Such a perfectly informed model cannot be constructed in general.
A smoothed unigram model is guaranteed to assign a non-zero probability to any string; this probability will be denoted P̂1(x). It is equivalent to the universal automaton built from the alphabet Σ with transition probabilities defined according to equation (1). If PA(x) denotes the (possibly null) probability assigned to a string x by a PDFA A, a smoothed distribution is obtained by linear interpolation with the smoothed unigram model:

    P̂(x) = β · PA(x) + (1 − β) · P̂1(x),  with 0 ≤ β < 1.

This smoothing technique is very rudimentary but, because it is so simple, it best reflects the quality of the PDFA itself. This smoothed probabilistic distribution serves as our reference model. In the sequel we study whether error-correcting techniques can improve over this reference model, that is, whether a probabilistic model with smaller perplexity on independent data can be obtained.
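A sketch of this reference model: the discounted unigram estimate of equation (1), extended to strings, and the linear interpolation with the PDFA probability. The discount d, the value of β and the toy counts are illustrative choices, not the paper's settings.

def smoothed_unigram(counts, alphabet, d=0.5):
    N = sum(counts.values())
    seen = {a for a in alphabet if counts.get(a, 0) > 0}
    n0 = len(alphabet) - len(seen)
    D = sum(d / N for a in seen)                    # total discounted mass
    return {a: (counts[a] - d) / N if a in seen else D / n0 for a in alphabet}

def unigram_string_prob(p_hat, x):
    p = 1.0
    for a in x:
        p *= p_hat[a]
    return p

def interpolated_prob(p_pdfa_x, p_hat, x, beta=0.5):
    return beta * p_pdfa_x + (1.0 - beta) * unigram_string_prob(p_hat, x)

p_hat = smoothed_unigram({"a": 6, "b": 3}, alphabet={"a", "b", "c"})
print(interpolated_prob(0.0, p_hat, "ac", beta=0.5))  # non-zero even if the PDFA rejects "ac"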
4 Error-Correcting Model
Given A, a PDFA, and its language L(A), error transitions can be added in order to make it possible to accept any possible string from Σ* with a non-zero probability. This error model is fully described in section 4.1. The problem of estimating the error-correcting model free parameters, which are the probabilities of error transitions, is detailed in section 4.2. Once the error model has been estimated from data, there may still be some string which cannot be generated with a non-zero probability. This is due to the fact that some error transitions may not have been seen during the estimation of the error model. Smoothing of the error model is then required, as explained in section 4.3. The adaptation of the original PDFA distribution in order to include the error transition probabilities to build a consistent model is described in section 4.4. Once the error model has been defined and its parameters estimated, the probability P̃(x) of generating any string x from the original PDFA extended with error transitions can be computed. An efficient algorithm to compute a path of maximal probability in any probabilistic automaton A (i.e. an automaton possibly including cycles and possibly non-deterministic) was recently proposed [1]. This algorithm is briefly presented in section 4.5. We use this algorithm here to reestimate iteratively the error model as described in section 4.6.
4.1 Model Definition
Our error model is based on the addition of error transitions to an existing PDFA, resulting in an extended automaton. These error transitions account for the possibility of inserting any symbol at any state, of substituting any existing transition labelled by a symbol a by any other symbol from the alphabet, or of deleting the transition (or equivalently substituting a by the empty string λ). Figure 1 illustrates the addition of error transitions to a PDFA. Initially there are only two transitions from state q labeled by a and b, respectively. The original
automaton is extended with insertion transitions, substitution transitions and deletion transitions. Note that for this example the alphabet is assumed to have two symbols, Σ = {a, b}.

Fig. 1. Addition of error transitions to a PDFA: (a) original automaton; (b) extended automaton
The parameters of the general error model are the following:
– P(λ → a | q), which denotes the probability of inserting symbol a while being in state q;
– P(a → b | q, q′), which denotes the probability of substituting a by b while going from q to q′. In particular, P(a → a | q, q′) denotes the probability of substituting a by a, that is, of taking the original transition labeled by a from state q;
– P(a → λ | q, q′), which denotes the probability of deleting a while going from q to q′.
Estimating an error model consists of estimating the error transition probabilities. In order to minimize the number of free parameters, these probabilities can be made dependent on the symbol but independent of the transitions (or the state) they apply to. The parameters of the symbol dependent error model now become:
– P(λ → a), which denotes the probability of inserting symbol a in any state;
– P(a → b), which denotes the probability of substituting a by b while taking any transition labeled by a;
– P(a → λ), which denotes the probability of deleting a while taking any transition labeled by a.
Alternatively, the error model can be made state dependent instead of symbol dependent. In our case, we adopt a symbol dependent error model, as the alphabet is usually known before the automaton inference process. State independence also allows us to merge several error models as described in section 6.4.
4.2 Estimation of an Error Model
Once a PDFA is given or inferred from a training sample, the parameters of the error model can be estimated on an independent sample. For any string x from this independent sample, the probability of generating the string can be computed. This requires that a consistent probability distribution can be defined for
the extended automaton, as detailed in section 4.4. Note also that after the extension of the original automaton with error transitions, the new automaton is no longer deterministic. Following a Viterbi criterion, the probability of generating x can be approximated by the probability of the most likely path to generate x. An efficient algorithm to compute this path is described in section 4.5. The set of editing operations used while generating the independent sample from the extended automaton can be stored and the associated counts can be computed:
– C(λ, a) denotes the number of insertions of the symbol a;
– C(a, b) denotes the number of substitutions of the symbol a by b. In particular, C(a, a) denotes the number of times the symbol a was asserted, that is, not substituted while parsing the independent sample;
– C(a, λ) denotes the number of deletions of the symbol a;
– C(#) denotes the number of (end of) strings.
As the proposed error model is state independent, several estimates of its parameters for various underlying automata can be computed. Combining these estimates simply amounts to summing the respective error counts. This property will be used in our experiments as explained in section 6.4.
4.3 Smoothing of the Error Counts
Some counts associated with error transitions may be null after estimating the error model. This is the case when some error transitions are never used along any of the most likely paths computed while parsing the independent sample. This problem can be solved by adding real positive values to the error counts. We use four additional parameters εins, εsub, εdel and εnoerr to smooth the error counts:
– Ĉ(λ, a) = C(λ, a) + εins
– Ĉ(a, b) = C(a, b) + εsub if a ≠ b
– Ĉ(a, a) = C(a, a) + εnoerr
– Ĉ(a, λ) = C(a, λ) + εdel

4.4 Definition of the Extended PDFA Distribution
In the original PDFA, P(a | q) = γ(q, a) denotes the probability of generating the symbol a from state q whenever such a transition exists. This transition probability can be estimated from the training sample from which the PDFA was built. The maximum likelihood estimate for γ(q, a) is given by

    γ(q, a) = Cq(a) / Cq

where Cq(a) denotes the number of times the symbol a was generated from state q and Cq denotes the number of times the state q was observed while parsing the training sample. We can assume that the counts Cq(a) and Cq are strictly
positive, as any transition (or state) which would not satisfy this constraint would be initially removed from the original PDFA. The probability distribution of the extended automaton can be defined as follows. The total insertion count Cins is defined as

    Cins = ∑_{a ∈ Σ} Ĉ(λ, a)

and its complementary count C̄ins is defined as

    C̄ins = ∑_{a ∈ Σ} ∑_{b ∈ Σ ∪ {λ}} Ĉ(a, b) + C(#)

Let Pins = Cins / (Cins + C̄ins) denote the probability of inserting any symbol. The probabilities of the transitions from any state q in the extended automaton are computed as follows:
– the probability of inserting a while being in state q:

    P(λ → a | q) = P(λ → a) = Pins · Ĉ(λ, a) / Cins                                          (2)

– the probability of substituting a by b, for any symbol b in the alphabet Σ (including the case where a = b), from state q:

    P(a → b | q) = P(a → b) · γ(q, a) = (1 − Pins) · [ Ĉ(a, b) / ∑_{b ∈ Σ ∪ {λ}} Ĉ(a, b) ] · γ(q, a)    (3)

– the probability of deleting a from state q:

    P(a → λ | q) = P(a → λ) · γ(q, a) = (1 − Pins) · [ Ĉ(a, λ) / ∑_{b ∈ Σ ∪ {λ}} Ĉ(a, b) ] · γ(q, a)    (4)

– the probability of generating the end of string symbol # from state q:

    P(# | q) = (1 − Pins) · γ(q, #)                                                           (5)
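A sketch of equations (2)-(5): turning the smoothed error counts Ĉ(·,·) and the end-of-string count C(#) into the extended-automaton probabilities, on top of the original γ(q, a). The count containers and names below are illustrative.

def error_model(c_ins, c_sub, c_end):
    """c_ins[a] = Ĉ(λ, a); c_sub[a][b] = Ĉ(a, b) with b in Σ ∪ {'λ'}; c_end = C(#)."""
    C_ins = sum(c_ins.values())
    C_bar = sum(sum(row.values()) for row in c_sub.values()) + c_end
    P_ins = C_ins / (C_ins + C_bar)

    def p_insert(a):                           # eq. (2), state independent
        return P_ins * c_ins[a] / C_ins

    def p_edit(a, b, gamma_qa):                # eq. (3) for b in Σ, eq. (4) for b = 'λ'
        return (1.0 - P_ins) * c_sub[a][b] / sum(c_sub[a].values()) * gamma_qa

    def p_end(gamma_q_end):                    # eq. (5)
        return (1.0 - P_ins) * gamma_q_end

    return p_insert, p_edit, p_end

p_ins, p_edit, p_end = error_model(
    c_ins={"a": 1.1, "b": 0.1},
    c_sub={"a": {"a": 9.0, "b": 0.1, "λ": 0.1}, "b": {"a": 0.1, "b": 9.0, "λ": 0.1}},
    c_end=5,
)
print(p_edit("a", "a", gamma_qa=0.5))   # probability of keeping an original a-transition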
4.5 Computation of the Most Likely Path
The general problem of finite-state parsing with no error correction can be formulated as a search for the most likely path, or equivalently the minimum cost path (sums of negative log probabilities rather than products of probabilities are used), through a trellis diagram associated to the PDFA A and the string x to be parsed. This trellis is a directed acyclic multistage graph, where each node q_k^j corresponds to a state q_j in a stage k. The stage k is associated with a symbol x_k of the string to be parsed and every edge of the trellis t_k = (q_k^i, q_{k+1}^j)
stands for a transition between the state q_i in stage k and the state q_j in stage k + 1 (Fig. 2 (a)). Thanks to the acyclic nature of this graph, dynamic programming can be used to solve the search problem, leading to the well-known Viterbi algorithm [6].
Fig. 2. Trellis with: (a) substitution and proper PDFA transitions; (b) insertion transitions; (c) deletion transitions in an acyclic PDFA; (d) deletion transitions in a cyclic PDFA. Every edge is labeled with a symbol of Σ.
The trellis diagram can be extended in a straightforward fashion to parse errors produced by substitution and insertion actions. Efficient error correcting parsing can be implemented because such an extended trellis diagram still has the shape of a directed acyclic multistage graph (Fig. 2 (a),(b)). However, the extension of the trellis diagram to parse errors produced by deletion of one or more (consecutive) symbol(s) in the original string results in a graph form that includes edges between nodes belonging to the same stage k (Fig. 2 (c)). In particular, when the automaton A has cycles dynamic programming can no longer be used as the problem becomes one of finding a minimum cost path through a general directed cyclic graph (Fig. 2 (d)). As noted in [8], we can still take advantage of the fact that most edges, for this kind of graph, still have a left-to-right structure and consider each column as a separate stage like in the Viterbi algorithm. An efficient algorithm for computing the most likely acceptance path that includes error operations in general automata was proposed in [1]. This algorithm can be considered as an extension of the Viterbi algorithm. The main difference lies in the fact that an order has to be defined when parsing deletion transitions (see Fig. 2 (c), (d)) for adequately performing the required computations during local (state) minimizations. In particular, it is based on the definition of a pseudo-topological state ordering, that is an extension to cyclic graphs of the usual topological ordering. This pseudo-topological ordering is computed and efficiently stored in a hash table during a preprocessing stage which detects the backward edges, i.e. those edges which produce cycles in A. This leads to a fixed order for the traversal of the list of nodes (states of the PDFA) at any stage of the parsing process in order to update the cumulated costs whenever required.
Full details of the computation of this state ordering, the resulting parsing algorithm and practical evaluations are presented in [1]. We use this algorithm here to compute the most likely path for generating a string from the extended PDFA.
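The parsing algorithm itself is given in [1]; the sketch below only illustrates the kind of preprocessing mentioned above, namely a depth-first traversal that detects backward edges (those closing cycles) and fixes a state order for the remaining acyclic part. It is a generic DFS, not the exact procedure of [1], and the names are illustrative.

def order_and_back_edges(states, successors, q0):
    color, order, back = {q: 0 for q in states}, [], []   # 0 = new, 1 = open, 2 = done
    def dfs(q):
        color[q] = 1
        for r in successors(q):
            if color[r] == 0:
                dfs(r)
            elif color[r] == 1:
                back.append((q, r))            # edge closing a cycle
        color[q] = 2
        order.append(q)
    dfs(q0)
    order.reverse()                            # topological order once back edges are removed
    return order, back

succ = {0: [1, 2], 1: [2], 2: [0]}
print(order_and_back_edges([0, 1, 2], lambda q: succ.get(q, []), 0))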
4.6 Greedy Reestimation of the Smoothed Distribution
Computing the most likely path using the technique described in section 4.5 is equivalent to computing the path of minimum cumulated cost. For example, the cost Dq(a → b) of substituting a by b from state q is given by Dq(a → b) = − log P(a → b | q). Thus the maximization of a product of probabilities becomes a minimization of additive costs. The initial error model cannot be derived from the probabilistic error model described in section 4.4, as the error counts are initially unknown and the extended (smoothed) PDFA distribution cannot be computed. However, a set of editing costs can be defined a priori, for instance according to the Levenshtein distance [11]: Dq(λ → a) = 1, Dq(a → b) = 1 if a ≠ b, Dq(a → a) = 0 and Dq(a → λ) = 1. Once the initial editing costs are defined, the counts of insertions, substitutions and deletions that minimize the Levenshtein distance criterion on an independent sample can be computed as described in section 4.1. Note that, in this particular case, only the structure of the PDFA is required. A new error model can then be derived from these error counts, and this estimation can be iterated with a true probabilistic error model. This reestimation process is performed until a maximum number of iterations is reached (typically 10) or until the relative change of perplexities computed on two consecutive iterations falls below a certain threshold (typically 1%). During this iterative procedure, the original PDFA distribution can also be reestimated by adding to the original counts, Cq and Cq(a), their values computed on the independent sample and by modifying accordingly the estimate of γ(q, a). This will be referred to as reestimation of non-error transitions.
5 The ATIS Task
The Air Travel Information System (ATIS) corpus [9] was developed under a DARPA speech and natural language program that focussed on developing language interfaces to information retrieval systems. The corpus consists of speakers of American English making information requests such as, “Uh, I’d like to go from, uh, Pittsburgh to Boston next Tuesday, no wait, Wednesday”. Each user was given several goal scenarios to work with, in which he or she had to try to make travel arrangements between multiple cities in North America. A database containing information from the Official Airline Guide was at the heart of the system. Users could ask questions about a wide variety of items in the database, ranging from flight information to aircraft equipment descriptions and even meals served on particular flights. They could speak naturally to the machine, as there was no fixed interaction language or required sequence of events. Spoken language phenomena such as truncated words, hesitations, false starts, and verbal error recovery are common in the corpus. It is commonplace
to find multiple turn interactions (and thus multiple utterances from a user) between the user and machine for solving each scenario.
6 Experiments

6.1 Data Sets
We use the ATIS-2 sub-corpus in the experiments reported here. This portion of the corpus was developed under Wizard-of-Oz conditions in which a human being secretly replaced the speech recognition component of an otherwise fully automated dialogue system. The ATIS-2 collection is officially defined as containing a training set and two evaluation sets. The training set, which we used for inferring PDFAs, contains 13,044 utterances (130,773 tokens). The vocabulary contains 1,294 words. We used the first evaluation set (Feb92, 974 utterances, 10,636 tokens) as a validation set to estimate the baseline perplexity and an error model. The second evaluation set (Nov92, 1,001 utterances, 11,703 tokens) was used as our independent test set. In the context of these experiments, alphabet symbols represent words from the ATIS vocabulary and strings represent utterances.
6.2 Baseline Perplexity
A PDFA is inferred from the training set using the ALERGIA algorithm. The resulting PDFA consists of 414 states and 12,303 transitions. It accepts 55 % (532 strings) of the validation set, illustrating the need for smoothing the PDFA distribution. In particular, the validation set perplexity is infinite without smoothing. Figure 3(a) shows the perplexity obtained after smoothing by interpolating with a unigram model, as explained in section 3. The optimal perplexity (70) is obtained for β equal to 0.5.
6.3 Validation Set Perplexity with Error Model
The initial error model parameters are estimated from training and validation sets by counting the observed editing operation frequencies so as to minimize the Levenshtein distance (see section 4.6). As some error transitions are not observed during this process, the initial error table is then smoothed (see section 4.3). The additional smoothing parameters (εins , εsub , εdel and εnoerr ) are adjusted in order to minimize the perplexity on the last 10 % of the validation set while estimating the error model only on the first 90 % of the validation set. Their optimal values are εins = 0.1, εsub = 0.1, εdel = 0.1 and εnoerr = 0.0. Figure 3(b) shows the perplexity obtained on the validation set during reestimation of the error model. The initial perplexity (41) is achieved after the initial estimation of error parameters, based on the counts of the editing operations which minimize Levenshtein distance. In the first case (type I model), only error transitions are reestimated resulting in a 10% relative perplexity improvement (from 41 to 37). In the second case (type II model), error and non-error transitions probabilities are reestimated. The perplexity obtained after 10 iterations is 28.
Fig. 3. Perplexity results

6.4 Estimating the Error Model by Cross-Validation
In the experiments described in section 6.3, the error model was constructed and reestimated on the validation set. The training set, which represents about 13 times more data, was not used for estimating the error model, as the original automaton is guaranteed to accept all training strings without errors. However, a better estimate of the error model can be obtained using cross-validation. This procedure can be summarized as follows:
– Concatenate training and validation set in a single data set.
– Construct N (typically 10) different partitions of this data set.
– For each partition, infer a PDFA on the first part (typically 90 % of the data set) and estimate an error model on the second part (typically the remaining 10 %) following the greedy procedure described in section 4.6.
– Merge all error models by summing up the error counts obtained on each partition.
Merging of several error models is simple in our case, as these models are symbol dependent but do not depend on the structure of the underlying automaton. Once the error model is estimated by cross-validation, a final reestimation on the validation set can be performed using the original automaton constructed on the training set only.
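A sketch of the merging step: since the error model is symbol dependent but automaton independent, the per-partition estimates can be combined by simply summing their counts. The containers below (operation keys such as ('λ', 'a') for an insertion) are illustrative.

from collections import Counter

def merge_error_counts(per_partition_counts):
    """Sum the edit-operation counts estimated on each cross-validation partition."""
    total = Counter()
    for counts in per_partition_counts:
        total.update(counts)
    return total

print(merge_error_counts([{("λ", "a"): 3, ("a", "a"): 40},
                          {("λ", "a"): 1, ("a", "λ"): 2}]))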
6.5 Independent Test Set Perplexity
Table 1 summarizes the final results computed on an independent test set. The reference model is a PDFA interpolated with a unigram using the optimal interpolation parameter estimated on the validation set (β = 0.5). Type I error model refers to the model obtained after reestimating only the error transition probabilities on the validation set. Type II error model refers to the model obtained after reestimating both error and non-error transition probabilities. In both cases the error model probabilities may be simply computed on the validation set or can be estimated by cross-validation (CV) following the procedure described in section 6.4.
Table 1. Test set perplexity

Model                           Perplexity
Unigram smoothing (β = 0.5)     71
Type I error model              40
Type II error model             41
CV + Type I error model         37
CV + Type II error model        37
The reestimation of non-error transitions does not improve the perplexity of the extended PDFA on an independent test set. The significant perplexity decrease on the validation set, as seen in figure 3(b), is thus a result of overfitting to the validation data. On the other hand, cross-validation allows for up to 10 % relative perplexity reduction. Finally these results show a 48 % relative perplexity reduction as compared to the perplexity obtained by interpolating with a unigram model.
7 Conclusions and Future Work
We have examined the issue of smoothing probabilistic automata by adding error transitions to an original probabilistic automaton structure. The probability distribution of the extended automaton is such that any possible string can be predicted with non-zero probability. We explained how to define a consistent error model and how to estimate its free parameters from independent data. Practical experiments on the ATIS travel information task show a 48 % test set perplexity reduction on new data with respect to a simply smoothed version of the original automaton. These experiments illustrate the risk of overfitting when both the error model and the initial non-error transitions are reestimated. On the other hand, cross-validation allows us to estimate a more reliable error model which results in a significant perplexity reduction on new data. The error model proposed here is symbol dependent but state independent. In particular, the probability of inserting a given symbol a does not depend on where this symbol is inserted. In order to refine the error model without significantly increasing the number of free parameters, the relative weight of error versus non-error transitions could also be estimated for each state. We presented here the error-correcting approach as a method for extending a probabilistic deterministic automaton. Most existing inference algorithms produce deterministic machines which, after extension with error transitions, become non-deterministic. The techniques presented here handle this non-determinism. Thus, smoothing of automata which are non-deterministic from the start is also something we can pursue. Clustering alphabet symbols before PDFA inference was shown to reduce perplexity on new data [5]. Combination of this technique with error correction will also be investigated in the future.
References

1. J.-C. Amengual and E. Vidal. Efficient error-correcting Viterbi parsing. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-20(10), October 1998.
2. J.-C. Amengual, E. Vidal, and J.-M. Benedí. Simplifying language through error-correcting techniques. In International Conference on Spoken Language Processing, pages 841-844, 1996.
3. R. Carrasco and J. Oncina. Learning stochastic regular grammars by means of a state merging method. In Grammatical Inference and Applications, ICGI'94, number 862 in Lecture Notes in Artificial Intelligence, pages 139-150, Alicante, Spain, 1994. Springer Verlag.
4. R. Carrasco and J. Oncina. Learning deterministic regular grammars from stochastic samples in polynomial time. Theoretical Informatics and Applications, 33(1):1-19, 1999.
5. P. Dupont and L. Chase. Using symbol clustering to improve probabilistic automaton inference. In Grammatical Inference, ICGI'98, number 1433 in Lecture Notes in Artificial Intelligence, pages 232-243, Ames, Iowa, 1998. Springer Verlag.
6. G.D. Forney. The Viterbi algorithm. IEEE Proceedings, 3:268-278, 1973.
7. D. Freitag. Using grammatical inference to improve precision in information extraction. In Workshop on Automata Induction, Grammatical Inference, and Language Acquisition, Fourteenth International Conference on Machine Learning, Nashville, Tennessee, 1997.
8. G.W. Hart and A. Bouloutas. Correcting dependent errors in sequences generated by finite-state processes. IEEE Trans. on Information Theory, 39(4):1249-1260, July 1993.
9. L. Hirschman. Multi-site data collection for a spoken language corpus. In Proceedings of DARPA Speech and Natural Language Workshop, pages 7-14, Arden House, NY, 1992.
10. K. Kita, Y. Fukui, M. Nagata, and T. Morimoto. Automatic acquisition of probabilistic dialogue models. In Proceedings of ISSD96, workshop of the International Conference on Spoken Language Processing, pages 196-199, Philadelphia, October 1996.
11. J.B. Kruskal. An overview of sequence comparison. In D. Sankoff and J.B. Kruskal, editors, Time Warps, String Edits, and Macromolecules: the Theory and Practice of Sequence Comparison, pages 1-44. Addison-Wesley, Reading, Massachusetts, 1983.
12. K.J. Lang, B.A. Pearlmutter, and R.A. Price. Results of the Abbadingo One DFA learning competition and a new evidence-driven state merging algorithm. In Grammatical Inference, number 1433 in Lecture Notes in Artificial Intelligence, pages 1-12, Ames, Iowa, 1998. Springer-Verlag.
13. H. Ney, U. Essen, and R. Kneser. On structuring probabilistic dependences in stochastic language modelling. Computer Speech and Language, 8:1-38, 1994.
14. J. Oncina and P. García. Inferring regular languages in polynomial update time. In N. Pérez de la Blanca, A. Sanfeliu, and E. Vidal, editors, Pattern Recognition and Image Analysis, volume 1 of Series in Machine Perception and Artificial Intelligence, pages 49-61. World Scientific, Singapore, 1992.
15. D. Ron, Y. Singer, and N. Tishby. On the learnability and usage of acyclic probabilistic automata. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 31-40, Santa Cruz, CA, 1995. ACM Press.
16. H. Rulot and E. Vidal. An efficient algorithm for the inference of circuit-free automata. In G. Ferraté, T. Pavlidis, A. Sanfeliu, and H. Bunke, editors, Advances in Structural and Syntactic Pattern Recognition, pages 173-184. NATO ASI, Springer-Verlag, 1988.
17. M. Young-Lai and F. Tompa. Stochastic grammatical inference of text database structure. To appear in Machine Learning, 2000.
Inferring Subclasses of Contextual Languages

J.D. Emerald, K.G. Subramanian, and D.G. Thomas
Department of Mathematics, Madras Christian College, Madras - 600 059, INDIA
Abstract. In this paper, we show that the languages generated by internal contextual grammars are not inferable from positive data only. We define two subclasses of internal contextual languages, namely, k-uniform and strictly internal contextual languages which are incomparable classes and provide an algorithm to learn these classes. The algorithm can be used when the rules are applied in a parallel mode. Keywords : internal contextual grammars, strict, k-uniform, identification in the limit from positive data.
1 Introduction
Contextual grammars of Marcus [5] originated in an attempt to translate the central notion of context from the analytical models into the framework of generative grammars. Basically, a contextual grammar gives rise to a language as follows: starting with a given finite set of strings, called axioms, pairs of strings, called contexts, associated to sets of words, called selectors, are added iteratively to the strings already obtained. Among different variations of contextual grammars, we are concerned here with internal contextual grammars [2,3]. It is known that internal contextual grammars with maximal use of selectors are the most appropriate to model the generative capacity of natural languages because they are able to describe all the usual restrictions appearing in such languages. In this paper, we show that the class of internal contextual languages is not inferable from positive data only. Thus, it is natural to look for subclasses of these languages which can be identified in the limit from positive examples only. Motivated by this, two such subclasses are introduced here and inference procedures are provided for learning these classes in this framework. In addition, another variation of internal contextual grammar, which requires a parallel mode of derivation, is also introduced and inference procedure for this class is indicated.
2 Preliminaries and Definitions
For a language L ⊆ Σ*, the length set of L is length(L) = {|x| / x ∈ L}. We recall that for x, y ∈ Σ*, y is said to be a subword of x if x = x1yx2, for x1, x2 ∈ Σ*. The set of all subwords of a string x is denoted by sub(x). If w = xy then x⁻¹w = y and wy⁻¹ = x.

Definition 2.1 [3] An internal contextual grammar (ICG) is a triple G = (Σ, A, P), where Σ is an alphabet, A is a finite set of strings over Σ, and P is a finite set of pairs (z, u$v) where z, u, v are strings over Σ and $ is a reserved symbol not in Σ. The elements of A are called axioms, those in P are called productions. For a production π = (z, u$v), z is called the selector of π and u$v (or the pair (u, v)) is the context of π.
For x, y ∈ Σ*, we define the derivation relation x ⇒ y (with respect to G) if and only if there is a production π ∈ P, π = (z, u$v), such that x = x1zx2 and y = x1uzvx2, for some x1, x2 ∈ Σ*. (The context (u, v) is adjoined to a substring z of x provided this substring is the selector of a production whose context is u$v.) ⇒* denotes the reflexive and transitive closure of the relation ⇒. The language generated by G is L(G) = {x ∈ Σ* / w ⇒* x for some w ∈ A}. We use ICL to denote the family of all languages generated by internal contextual grammars. Clearly, the productions (z, u$v) with u = v = λ have no effect.

Example 2.1 L1 = {a^n b^n a^m b^m / n, m ≥ 1} is an ICL generated by the ICG G1 = ({a, b}, {abab}, {(ab, a$b)}).
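As a small illustration of the derivation relation, the sketch below applies one production to one occurrence of its selector; reproducing Example 2.1, two steps from the axiom abab yield a²b²ab and then a²b²a²b², both members of L1. Function names are illustrative.

def derive_once(x, production, pos):
    """Apply (z, u$v) to the occurrence of z starting at index pos in x."""
    z, u, v = production
    assert x[pos:pos + len(z)] == z, "selector must occur at pos"
    return x[:pos] + u + z + v + x[pos + len(z):]

rule = ("ab", "a", "b")                 # (z, u$v) with u = a, v = b
w = "abab"                              # axiom of G1
w = derive_once(w, rule, w.find("ab"))  # -> "aabbab"
w = derive_once(w, rule, w.rfind("ab")) # -> "aabbaabb"
print(w)                                # a^2 b^2 a^2 b^2, a member of L1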
3 A Negative Result for Identifiability of ICL
We first consider the learnability of the class of internal contextual languages in the limit from positive data and show, using a result of Angluin [1], that C, the class of ICL, is not learnable from positive data only.

Theorem 3.1 [1] If an indexed family of nonempty recursive languages is inferable from positive data, then there exists, on any input i, i ≥ 1, a finite set of strings Ti such that Ti ⊆ Li and for all j ≥ 1, if Ti ⊆ Lj, then Lj is not a proper subset of Li.

Theorem 3.2 The class of internal contextual languages C is not inferable from positive data only.
Proof: We derive a contradiction with Theorem 3.1. Consider the language L = {cbc} ∪ {ca^n ba^n c / n ≥ 1}. L is in C, since it can be generated from the axiom cbc with the production (b, a$a). Let T be any nonempty finite subset of L, and let T′ = T − {cbc}. In fact, let T′ = {ca^{n1}ba^{n1}c, . . . , ca^{np}ba^{np}c} where ni ≥ 1. Consider an internal contextual grammar H with axiom cbc and with productions {(b, a^{n1}$a^{n1}), . . . , (b, a^{np}$a^{np})}. We have T′ ⊆ L(H) ⊆ L, contradicting Theorem 3.1.

Remark: It is observed in [3] that internal contextual grammars are equivalent to pure grammars [6], which generate a class of languages included in the context-sensitive family, and hence ICL ⊂ CS. Koshiba et al. [4] have shown that pure context-free languages are not inferable from positive data only. As a consequence, pure languages are also not inferable from positive data only. In view of this, Theorem 3.2 is of significance, as ICL is a subclass of the family of pure languages.
4 Subclasses of Internal Contextual Grammars
We now define a strictly internal contextual grammar and a k-uniform internal contextual grammar.

Definition 4.1 A strictly internal contextual grammar (SICG) is a 6-tuple G = (Σ, S, C, F, w, P) where
i) Σ is the alphabet;
ii) S, C and F are sets of strings over Σ (i.e., S, C, F ⊆ Σ*) called the Selector, Context and Factor sets respectively, such that for any u, v ∈ S ∪ C ∪ F with u ≠ v, first(u) ≠ first(v) and u is not a subword of v and vice versa, where first(w) denotes the first letter of the string w;
iii) w ∈ (S ∪ F)* is the axiom, such that w = w1w2...wn where wi ∈ S or wi ∈ F, 1 ≤ i ≤ n, and wi ≠ wj for i ≠ j, 1 ≤ i, j ≤ n;
iv) P is a finite set of pairs (z, u$v) called productions, where z ∈ S and u, v ∈ C ∪ {λ}, u ≠ v, such that there is at most one production for each z ∈ S.
The language generated by a strictly internal contextual grammar (SICG) is called a strictly internal contextual language (SICL).

Example 4.1 The grammar G2 = ({a, b, c}, {abca}, {(abca, bb$cc)}) is a SICG generating the language L2 = {(bb)^m abca (cc)^m / m ≥ 0}.
Definition 4.2 An internal contextual grammar is called a k-uniform internal contextual grammar (k-UICG), k ≥ 1, if the elements of the axiom set A are of length mk, for some m ≥ 1, and if in each production (z, u$v),
i) |z| = |u| = |v| = k;
ii) for any w′ ∈ A, |x| = mk for some m ≥ 1, where x = z⁻¹w′ or x = w′z⁻¹ for some z with (z, u$v) in P;
iii) for each selector, there is at most one production;
iv) given any axiom w = w1w2...wm, |wi| = k, 1 ≤ i ≤ m, in A, there is no rule of the form (x, u$v) with either u = wi or v = wi for any i (1 ≤ i ≤ m).
The language generated by a k-UICG is called a k-uniform internal contextual language (k-UICL).

Example 4.2 The grammar G3 = ({a, b, c}, {ab}, {(ab, ba$bc)}) is a 2-UICG generating the language L3 = {(ba)^m ab (bc)^m / m ≥ 0}.

Proposition 4.1 The class of mono-PCF languages is a subclass of the class of internal contextual languages.

Proof: The inclusion can be seen as follows. Every mono-PCF language L is a finite union of languages of the form {xu^n bv^n y / n ≥ 1}, where x, u, v, y are strings and b is a symbol [8]. We can construct an internal contextual grammar to generate the language {xu^n bv^n y / n ≥ 1} as follows: G = (Σ, {xubvy}, {(b, u$v)}). Thus L is an internal contextual language, being a finite union of such languages. Proper inclusion follows from the fact that the internal contextual language L = {a^n b^n a^m b^m / n, m ≥ 1}, generated by the internal contextual grammar G = ({a, b}, {abab}, {(ab, a$b)}), is not a mono-PCF language.

Proposition 4.2 The class of k-uniform internal contextual languages is incomparable with the class of strictly internal contextual languages.
Proof:
i) The language L3 given in Example 4.2 is a 2-uniform internal contextual language, but it can be seen that it cannot be generated by any strictly internal contextual grammar.
ii) The strictly internal contextual language L2 given in Example 4.1 cannot be generated by any k-uniform internal contextual grammar, for the requirement |z| = |u| = |v| = k cannot be met for any rule (z, u$v).
iii) The language L4 = {(ba)^m ab (cb)^m / m ≥ 0}, generated by the grammar G4 = ({a, b, c}, {ab}, {(ab, ba$cb)}), is both a strictly internal contextual language and a 2-uniform internal contextual language.
It is commonly asserted that natural languages are not context-free. This assertion is based on the existence of some restrictions in natural as well as artificial languages outrunning the context-free barrier. Therefore, different ways of adjoining contexts in order to capture these features have been defined. One such approach is considering an internal contextual grammar working in a parallel derivation mode, i.e., contexts are adjoined in parallel [7]. We now define a variation of the two classes considered, by requiring a parallel mode of derivation, i.e., contexts are adjoined simultaneously to all the selectors in a word.

Definition 4.3 Parallel derivation for a k-uniform internal contextual grammar or a strictly internal contextual grammar can be defined in the following manner: x ⇒p y if and only if x = x1z1x2z2...xmzmxm+1 and y = x1u1z1v1x2...xmumzmvmxm+1 where (zi, ui$vi) ∈ P.

Example 4.3 The language L5 = {ba(eb)^m ab(fa)^m db(gb)^m cb(hc)^m ra / m ≥ 0} is generated by the strictly internal contextual grammar G5 = ({a, b, c, d, e, f, g, h, r}, {baabdbcbra}, {(ab, eb$fa), (cb, gb$hc)}), with parallel mode of derivation. In fact, the language generated is context-sensitive.
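A sketch of one parallel derivation step for Example 4.3: both rules are applied simultaneously, each at its selector occurrence in the axiom (assumed unique and non-overlapping here, which holds for this example), giving the m = 1 member of L5. Names and the simplification are illustrative.

def parallel_step(x, productions):
    """Adjoin every context around the (assumed unique, non-overlapping)
    occurrence of its selector, simultaneously."""
    edits = sorted((x.index(z), z, u, v) for z, u, v in productions)
    out, last = [], 0
    for pos, z, u, v in edits:
        out.append(x[last:pos] + u + z + v)
        last = pos + len(z)
    out.append(x[last:])
    return "".join(out)

axiom = "baabdbcbra"
rules = [("ab", "eb", "fa"), ("cb", "gb", "hc")]
print(parallel_step(axiom, rules))      # baebabfadbgbcbhcra  (the m = 1 string of L5)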
5 Identification of Subclasses of Internal Contextual Languages
Since the class of internal contextual languages is not inferable from positive data only, we show that certain subclasses of internal contextual languages, namely the class of strictly internal contextual languages and the class of k-uniform internal contextual languages, are inferable from positive data, using an identification algorithm to learn these languages. For simplicity, we consider the algorithm which learns these grammars with a single axiom and non-empty contexts. The
idea behind the algorithm can be informally described as follows. The sample words are first factored using their factorization and then compared to obtain the context rules using the procedure CONTEXT; i.e., in the case of k-uniform ICL, the sample words are factored into factors of length k, while in the case of strictly ICL, the words are factored according to the subprocedure UPDATE [9], where the first letters of the factors are all different. While comparing the factorized words x and y, the maximum common prefix and suffix factors of the words are removed and new words x1 and y1 are obtained from x and y respectively. x1 and y1 are then compared to find the first occurrence of the maximum prefix factor of the word x1 in y1 (assuming |x1| < |y1|). This maximum prefix is conjectured to be the selector of the rule, and the factors to the left and the same number of factors to the right of this maximum prefix in y are conjectured as the contexts of the rules. The same procedure is repeated with the remaining factors of the words, and the algorithm terminates when the correct conjecture is given as output.

Notations
|w| denotes the number of letters in the string w;
#(w) denotes the number of factors in w, in a given factorization.
Assuming #(x) < #(y), we use the following notations:
common-pref(x, y) denotes the maximum common prefix factors of x and y.
common-suf(x, y) denotes the maximum common suffix factors of x and y.
max-pref(x, y) denotes the first occurrence in y of the maximum prefix factors of x, in considering factor subwords of y from left to right.
remove-pref(u, δ) = u′ if u = δu′.
remove-suf(v, δ) = v′ if v = v′δ.
When max-pref(x, y) is well defined, we can write x = max-pref(x, y)u where u ∈ Σ* and u = (u1)(u2)...(uk), k ≥ 1; and y = v max-pref(x, y)v′ where v, v′ ∈ Σ* and v = (v1)(v2)...(vn), v′ = (v′1)(v′2)...(v′m), n, m ≥ 1.
rem(x) = (u1)(u2)...(uk)
rem-left(y, max-pref(x, y)) = (v1)(v2)...(vn)
rem-right(y, max-pref(x, y)) = (v′1)(v′2)...(v′m)
first(w) is the first letter of w.
factorize(w, T) is a function which factorizes the word w over the set T of factors obtained by calling the subprocedure UPDATE given in [9].

Algorithm A
Input: A positive presentation of a strictly internal contextual language and its alphabet Σ.
Output: a sequence of SICGs for the SICL.
Procedure:
let C = φ {/* C is the set of contexts */} let P = {(a1 , λ$λ), (a2 , λ$λ), ..., (am , λ$λ)} where Σ = {a1 , a2 , ..., am } read the first positive example w1 ; let T1 = {w1 }; let w10 = factorize (w1 , T1 ); let A = {w1 }; let Γ = {w1 }; output G1 = (Σ, P, A) let i = 2 repeat (forever) begin [i-th stage] let Gi−1 = (Σ, P, A) be the (i − 1)th conjectured SICG let a be the unique element in A read the next positive example wi let Ti = UPDATE (Ti−1 , wi ); for n = 1 to i do let wn0 = factorize(wn , Ti ); 0 let a = factorize(a, Ti ); if wi0 ∈ L(Gi−1 ) then output Gi (= Gi−1 ) as the i-th conjecture else begin if |wi | ≤ |a| then begin call CONTEXT (P, wi0 , a0 ); A = {wi } end else for all w ∈ Γi−1 do the following begin if |wi | < |w| then call CONTEXT (P, wi0 , w0 ) else call CONTEXT (P, w0 , wi0 ) end end end let Γi = Γi−1 ∪ {wi } output Gi = (Σ, P, A) as the i-th conjecture; i=i+1 CONTEXT (P, u, v) begin let P = {(a1 , ta1 $t0a1 ), (a2 , ta2 $t0a2 ), ...(am , tam $t0am )} let f = #(u); let g = #(v); if common-pref (u,v) = φ and common-suf (u,v) = φ then call PRE-RULE (u,v) else begin if ω = common-pref (u,v)
71
72
J.D. Emerald, K.G. Subramanian, and D.G. Thomas
    then begin let u = remove-pref(u, ω); v = remove-pref(v, ω); end
    if ω′ = common-suf(u,v) then begin let u = remove-suf(u, ω′); let v = remove-suf(v, ω′); end
    call PRE-RULE(u,v)
  end

PRE-RULE(u,v)
  let max-pref(u,v) = α;
  if α ≠ φ then begin
    let rem(u) = α′; let h = #(α);
    let rem-left(v, α) = x; let j = #(x);
    let β = (v1)(v2)...(v2j+h) where β = xαx′, #(x) = #(x′);
    let β′ = v2j+h+1 ... vg;
    let rem-left(β, α) = δ; let rem-right(β, α) = δ′;
    if α ∉ C then begin
      case (γ = γ′αγ″) for some γ in C:
        replace (γ, tγ$t′γ) by (α, γ′$γ″); let C = C − {γ} ∪ {α}
      case (α = α1γα2) for some γ in C:
        call PRE-RULE(P, γ, α)
      case (γ ≠ γ′αγ″) ∧ (α ≠ α1γα2) for all γ in C:
        let P = P ∪ {(α, δ$δ′)}; let C = C ∪ {α}
    end
    else begin
      if (#(δ) < #(tα)) ∧ (#(δ′) < #(t′α)) (except when δ = δ′ = λ)
        then replace (α, tα$t′α) by (α, δ$δ′) in P
    end
    if (α′ ≠ λ) ∧ (β′ ≠ λ) then
      if common-pref(α′, β′) = φ then call PRE-RULE(P, α′, β′) (assuming |α′| < |β′|)
      else if ω = common-pref(α′, β′) then begin
        let α′ = remove-pref(α′, ω);
        let β′ = remove-pref(β′, ω);
        call PRE-RULE(α′, β′) (assuming |α′| < |β′|)
      end
  end
end
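To make the factor-level notation above concrete, here is a minimal Python sketch of the helper operations. The representation of a factorized word as a list of factor strings, the interpretation of max-pref, and the function names are illustrative assumptions, not the authors' implementation.

# Factor-level helpers from the Notations above.  A factorized word is a
# list of factor strings; these choices are illustrative assumptions.

def common_pref(x, y):
    """Maximum common prefix factors of the factorized words x and y."""
    out = []
    for fx, fy in zip(x, y):
        if fx != fy:
            break
        out.append(fx)
    return out

def common_suf(x, y):
    """Maximum common suffix factors of x and y."""
    return list(reversed(common_pref(list(reversed(x)), list(reversed(y)))))

def remove_pref(u, delta):
    """remove-pref(u, delta) = u' if u = delta u'."""
    assert u[:len(delta)] == delta
    return u[len(delta):]

def remove_suf(v, delta):
    """remove-suf(v, delta) = v' if v = v' delta."""
    if not delta:
        return list(v)
    assert v[-len(delta):] == delta
    return v[:-len(delta)]

def max_pref(x, y):
    """First occurrence in y (as a contiguous factor subword) of the longest
    prefix, in factors, of x; returns the prefix and its start position."""
    for k in range(len(x), 0, -1):
        pref = x[:k]
        for i in range(len(y) - k + 1):
            if y[i:i + k] == pref:
                return pref, i
    return [], -1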
Remark: The above algorithm can also be used to identify a k-uniform internal contextual grammar. A modification required in the algorithm is that k is also given along with the positive presentation as input to the algorithm, and instead of the function factorize(w) we use the function split(w,k), where split(w,k) = (w1)(w2)...(wn), |wi| = k, 1 ≤ i ≤ n. Strictly and k-uniform contextual grammars with derivations in parallel mode can also be inferred using the above algorithm.
Correctness of the Algorithm and Characteristic Sample
The correctness of the algorithm follows from the fact that the specific features of the subclasses considered allow the positive examples to have a unique factorization. Indeed, the factors allow us to infer the rules correctly. Also, it can be seen that the algorithm runs in time polynomial in the sum of the lengths of the examples provided. The correctness of the algorithm A can be seen by considering a characteristic sample for a target strictly internal contextual language. Let L be a strictly internal contextual language. A finite set S is called a characteristic sample of L if and only if L is the smallest SICL containing S. We illustrate the method of forming a characteristic sample with an example. Consider a SICG G = (Σ, S, C, F, w, P) where Σ = {a, b, c, d, e, f, g, h}, S = {ab, dc}, C = {ebd, fc, gb, hcd}, F = {ba, cb}, w = baabcbdc and P = {(ab, ebd$fc), (dc, gb$hcd)}, generating the language L = {ba(ebd)^m ab(fc)^m cb(gb)^n dc(hcd)^n | m, n ≥ 0}. We construct the characteristic sample S by taking a finite number of strings derived from the axiom until each of the rules of the grammar finds its application at most twice in the derivation of these strings. In the grammar considered above, S = {baabcbdc, baebdabfccbdc, baabcbgbdchcd, baebdabfccbgbdchcd, baebdebdabfcfccbdc, baabcbgbgbdchcdhcd, baebdebdabfcfccbgbgbdchcdhcd, baebdabfccbgbgbdchcdhcd, baebdebdabfcfccbgbdchcd}. When the input set of the algorithm contains all the elements of S, the algorithm A converges to a correct SICG for the target language L.
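The characteristic sample S above can be reproduced mechanically: applying each of the two rules at most twice around its selector amounts to taking 0 ≤ m, n ≤ 2 in the closed form of L. The following Python sketch enumerates exactly the nine strings listed; the direct use of the closed form is an illustrative shortcut, not the construction procedure itself.

# Enumerate the characteristic sample S for the example SICG above by
# applying the rule (ab, ebd$fc) m times and (dc, gb$hcd) n times, m, n <= 2.

def sample():
    S = set()
    for m in range(3):
        for n in range(3):
            S.add("ba" + "ebd" * m + "ab" + "fc" * m +
                  "cb" + "gb" * n + "dc" + "hcd" * n)
    return S

if __name__ == "__main__":
    for word in sorted(sample(), key=len):
        print(word)
    # Nine strings, from the axiom baabcbdc up to
    # baebdebdabfcfccbgbgbdchcdhcd for m = n = 2.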
Hence, it is clear from the manner in which the characteristic sample S is formed that the class of SICL is identifiable in the limit from positive data. Similarly, the UICL can also be identified in the limit from positive data by constructing a characteristic sample in a similar manner.
References
[1] D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980), 117-135.
[2] A. Ehrenfeucht, Gh. Paun and G. Rozenberg, Contextual grammars and natural languages, In Handbook of Formal Languages, Springer-Verlag, Vol. 2 (1997), 237-293.
[3] A. Ehrenfeucht, Gh. Paun and G. Rozenberg, On representing recursively enumerable languages by internal contextual languages, Theoretical Computer Science 205 (1998), 61-83.
[4] T. Koshiba, E. Mäkinen and Y. Takada, Inferring pure context-free languages from positive data, Technical Report A-1997-14, University of Tampere, Finland (to appear in Acta Cybernetica).
[5] S. Marcus, Contextual grammars, Rev. Roum. Math. Pures Appl., 14 (10) (1969), 1525-1534.
[6] H.A. Maurer, A. Salomaa and D. Wood, Pure grammars, Information and Control 44 (1980), 47-72.
[7] V. Mitrana, Parallelism in contextual grammars, Fundamenta Informaticae 33 (1998), 281-294.
[8] N. Tanida and T. Yokomori, Inductive inference of monogenic pure context-free languages, Lecture Notes in Artificial Intelligence 872, Springer-Verlag (1994), 560-573.
[9] T. Yokomori, On polynomial-time learnability in the limit of strictly deterministic automata, Machine Learning 19 (1995), 153-179.
Permutations and Control Sets for Learning Non-regular Language Families
Henning Fernau1 and José M. Sempere2
1
2
Wilhelm-Schickard-Institut für Informatik, Universität Tübingen, D-72076 Tübingen, Germany, Email:
[email protected] Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia 46071, Spain, Email:
[email protected]
Abstract. We discuss two versatile methods which can be used to transfer learnability results from one language class to another. We apply these methodologies to three learning paradigms: (1) Learning in the limit, (2) Morphic generator grammar inference, and (3) Query learning.
Keywords: Formal languages, universal grammars, control sets, learning from positive data.
1
Introduction
In this paper, we will present two methods for transferring learnability results from one language class to another by simple preprocessing. We mainly concentrate on the paradigm “learning in the limit from positive data”. It is not hard to see that similar techniques can be used to develop efficient learning algorithms in other paradigms as well. In the case of query learning, this has been done (within the framework of matrix grammars, see below) in [9]. We will detail such extensions at the end. Here, we will focus on the following version of the learning model “identification in the limit” proposed by Gold [18]:
– An inference machine (a “learner”) IM is given the task to infer a language from a certain fixed language class F for which a description formalism (in our case, a grammar formalism) is also fixed.
– To the inference machine IM, a language L ∈ F is presented by giving all the elements of L to IM one by one (maybe, with repetitions), i.e., L = { wi | i ≥ 0 }, and wi is given to IM at time step i.
– After having received wi, IM responds with a hypothesis grammar Gi. Of course, we can see Gi as the result of computing a recursive (i + 1)-ary function fi:
Gi = fi(w0, . . . , wi).     (1)
The inference machine IM is called a learner for F if
1. the process described above always converges in the discrete space of F-grammars, i.e., for all presentations {wi | i ≥ 0} ∈ F, the corresponding grammar sequence G0, G1, G2, . . . converges to a limit grammar G, which means that there is an i0 such that for all i ≥ i0 we find G = Gi0 = Gi;
2. the limit grammar G is independent of the presentation of the language L and, moreover, L ⊆ L(G).
Note that there are a lot of language families known to be identifiable in the limit from positive data. According to Gregor [19], the most prominent examples of identifiable regular language families are:
– k-testable languages [15,17] (see below),
– k-reversible languages [4] and
– terminal distinguishable regular languages [30,31].
Generalizations of these language classes are discussed in [1,10,11,21,32]. Further identifiable language families, especially also non-regular ones, can be found as references in the quoted papers. All these language classes can be learned efficiently, i.e., the time complexity for the computation of the hypothesis function(s) fi in Eq. (1) is only polynomial in the size of its input, which is the total length of the input sample words up to step i.
2
Formal Language Definitions
Notations: Σ k is the set of all words of length k over the alphabet Σ, Σ ≤k = Σ 0 ∪ Σ 1 ∪ . . . ∪ Σ k, and Σ <k = Σ 0 ∪ . . . ∪ Σ k−1.
In order to provide a simple running example, we define:
Definition 2. A language L is k-testable (in the strict sense), or k-TLSS for short, iff L = Ik Σ ∗ ∩ Σ ∗ Fk \ Σ ∗ Tk Σ ∗, where Ik, Fk ⊆ Σ <k and Tk ⊆ Σ k.
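Under the natural reading of Definition 2 — some element of Ik is a prefix of w, some element of Fk is a suffix, and no element of Tk occurs as a factor of w — membership can be checked as in the following Python sketch; the example parameters are invented for illustration.

# Membership test for a k-testable language in the strict sense:
#   L = I_k Sigma* \cap Sigma* F_k \ Sigma* T_k Sigma*

def in_k_tlss(w, I, F, T):
    has_prefix = any(w.startswith(p) for p in I)      # w in I_k Sigma*
    has_suffix = any(w.endswith(s) for s in F)        # w in Sigma* F_k
    no_forbidden = not any(t in w for t in T)         # w not in Sigma* T_k Sigma*
    return has_prefix and has_suffix and no_forbidden

# Assumed example over {a, b} with k = 2: words starting with a,
# ending with b, and never containing the factor bb.
print(in_k_tlss("aab", I={"a"}, F={"b"}, T={"bb"}))   # True
print(in_k_tlss("abb", I={"a"}, F={"b"}, T={"bb"}))   # False (contains bb)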
3
Permutation Families for Learning from Positive Data
In order to present our results, we need some further notions. Let ψn : {1, . . . , n} → {1, . . . , n} be a permutation, i.e., a bijective mapping. A collection Ψ = (ψn | n ≥ 1) of permutations is called a family of permutations. Ψ is called uniformly polynomial-time computable if there is an algorithm Af realizing the partial function f : N×N → N with f (n, m) = ψn (m) in polynomial time and another algorithm Ag realizing the partial function g : N × N → N with g(n, m) = ψn−1 (m) in polynomial time. To every family of permutations Ψ = (ψn | n ≥ 1), there corresponds a family of inverse permutations Ψ −1 = (ψn−1 | n ≥ 1). Example 3. Consider ψn (m) =
2m − 1,        if m ≤ (n + 1)/2
2(n − m + 1),  if m > (n + 1)/2
Both ψn(m) and its inverse are easily seen to be polynomial-time computable. Hence Ψ = (ψn | n ≥ 1) is a polynomial-time computable family of permutations.
It is easy to construct further uniformly polynomial-time computable families of permutations from known ones. This can be done, e.g., by the following two operations (here, let Ψ = (ψn) and Φ = (φn) be uniformly polynomial-time computable families of permutations):
piecewise mixture: Let n0 be fixed. Define Ξ = (ξn) by ξn = ψn, if n < n0, and ξn = φn otherwise.
permutationwise composition: Define Ξ = (ξn) by ξn = ψn ◦ φn with ξn(x) = ψn(φn(x)).
For example, the permutationwise composition of Ψ as defined in Example 3 with itself yields the family Ξ = (ξn), where, e.g., ξ11 can be described by the following table:
x       = 1 2 3 4  5  6  7 8  9 10 11
ψ11(x)  = 1 3 5 7  9 11 10 8  6  4  2
ξ11(x)  = 1 5 9 10 6  2  4 8 11  7  3
Now, let us fix an alphabet Σ and a family of permutations Ψ = (ψn | n ≥ 1). In order to avoid further awkward notations, let us denote by Ψ(w), where w = a1 . . . an ∈ Σ n, the word Ψ(w) = aψn(1) . . . aψn(n). We extend this notation further to languages by setting, for L ⊆ Σ ∗, Ψ(L) = {Ψ(w) | w ∈ L} and to language families (obviously not necessarily restricted to a specific alphabet Σ anymore) by defining Ψ(L) = {Ψ(L) | L ∈ L}.
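A small, purely illustrative Python sketch of Example 3 and of permutationwise composition; for n = 11 it reproduces the table above (the function names are ad hoc).

# psi_n(m) = 2m-1 if m <= (n+1)/2, else 2(n-m+1)  (Example 3).

def psi(n, m):
    return 2 * m - 1 if m <= (n + 1) / 2 else 2 * (n - m + 1)

def psi_inv(n, k):
    # brute-force inverse; enough to illustrate uniform computability
    return next(m for m in range(1, n + 1) if psi(n, m) == k)

def apply_perm(perm, w):
    # Psi(w) = a_{psi_n(1)} ... a_{psi_n(n)} for w = a_1 ... a_n
    n = len(w)
    return "".join(w[perm(n, i) - 1] for i in range(1, n + 1))

def xi(n, m):
    # permutationwise composition: xi_n(x) = psi_n(psi_n(x))
    return psi(n, psi(n, m))

print([psi(11, m) for m in range(1, 12)])  # [1, 3, 5, 7, 9, 11, 10, 8, 6, 4, 2]
print([xi(11, m) for m in range(1, 12)])   # [1, 5, 9, 10, 6, 2, 4, 8, 11, 7, 3]
print(apply_perm(psi, "abcde"))            # 'acedb'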
The following theorem is easily shown:
Theorem 4. If L is a language family for which a polynomial-time learning algorithm in the limit from positive data is known, then Ψ(L) can be learned in polynomial time in the limit from positive data as well, if Ψ is a uniformly polynomial-time computable family of permutations.
Proof. Let AL be the learning algorithm for language family L. The learning algorithm for Ψ(L) uses AL as a subroutine in the sense that it translates its input sequence w1, w2, w3, . . . into an input sequence Ψ −1(w1), Ψ −1(w2), Ψ −1(w3), . . . of algorithm AL. To this input sequence, AL responds with a sequence of grammars G1, G2, G3, . . . generating the languages L1, L2, L3, . . . . Now, the intended learning algorithm just interprets G1, G2, G3, . . . as representing languages Ψ(L1), Ψ(L2), Ψ(L3), . . . . The correctness and efficiency of the described algorithm trivially carry over from AL, since Ψ was assumed to be a uniformly polynomial-time computable family of permutations. ⊓⊔
Of course, it is a bit abstract and unnatural to consider a grammar G of L to represent “suddenly” a language Ψ −1(L) as done in the preceding proof. We will show natural examples of such an interpretation in the following by making use of the concept of control sets.
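The proof of Theorem 4 is constructive, and the wrapper it describes can be sketched in a few lines of Python; learn_base and psi_inverse are placeholder names for the assumed components AL and Ψ −1, and the returned object merely tags the hypothesis as standing for Ψ(L(G)).

# Learner for Psi(L) built from a learner A_L for L (proof of Theorem 4).
# Only the wiring is taken from the proof; the components are placeholders.

def make_learner_for_psi_of_L(learn_base, psi_inverse):
    """learn_base: consumes a list of sample words, returns a grammar G.
    psi_inverse: maps a word w to Psi^{-1}(w)."""
    def learn(sample):
        translated = [psi_inverse(w) for w in sample]   # feed Psi^{-1}(w_i) to A_L
        G = learn_base(translated)                      # A_L's hypothesis grammar
        return ("Psi", G)                               # read G as representing Psi(L(G))
    return learn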
4
Control Sets for Learning from Positive Data
Let G denote some grammar.1 A word w (over G's terminal alphabet) belongs to the language L(G) generated by G iff there is a sequence of rules r1 . . . rm whose sequential application, starting from an axiom of G, yields w. In other words, w is somehow described by a word (called control word or associate word) over the rule set P of G. If G is ambiguous, this description is not unique. On the other hand, one could consider the sublanguage L(G, R) of L(G) consisting of those (terminal) words derivable through G which have control words in R ⊆ P ∗. Here, R is called control set.2 If G is some grammar family and L is some language family, then let
CONTROL(G, L) = {L(G, R) | G ∈ G, R ∈ L}.
Let us fix, for the moment, some grammar family G. A grammar GΣ0 ∈ G is called base grammar for the alphabet Σ if L(GΣ0) = Σ ∗ and, furthermore, every word in Σ ∗ can be derived unambiguously via GΣ0. Let G0 ⊂ G be a collection of base grammars such that for every (with respect to G) possible terminal alphabet Σ, there is exactly one base grammar in G0. Finally, grammar
1 Similar notions can be developed for automata.
2 For many properties of controlled language families, we refer to [20].
subfamily G0 is called universal if CONTROL(G0, REG) = L(G), where L(G) denotes the language family generated by the grammar family G. In general, there are various universal grammar subfamilies. Consider, for example, the following:
Example 5. Both {GΣ0 | Σ is an alphabet}, where GΣ0 = ({S}, Σ, {S → λ} ∪ {S → aS | a ∈ Σ}, S), and, more generally, all families G0,k = {GΣ0,k | Σ is an alphabet}, where GΣ0,k = ({S}, Σ, {S → v | v ∈ Σ <k} ∪ {S → vS | v ∈ Σ k}, S),
are universal for REG, i.e., CONTROL(G0, REG) = REG.
As within the permutation approach, we can use universal grammar families in order to obtain (new) learnable language families through learnable control set classes. More specifically, we can present the following learning algorithm for CONTROL(G0, L), given any learnable language class L:
1. Consider a new input word wj.
2. wj is transformed to the unique control word πj.
3. πj is given to the identification algorithm AL of L.
4. AL outputs hypothesis grammar Gj.
5. The whole algorithm outputs hypothesis language L(GΣ0, L(Gj)), where Σ is the alphabet of symbols contained in w1 . . . wj.
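For the base grammars GΣ0,k of Example 5, step 2 amounts to cutting w into blocks of length k (tape compression, cf. Remark 6 and the proof of Theorem 7 below). A hedged Python sketch; the treatment of the trailing λ-block follows the rule set sketched in Example 5 and is an assumption.

# Control word of w with respect to G^Sigma_{0,k}: blocks of length k,
# the final block shorter (possibly empty, naming the rule S -> lambda).

def control_word(w, k):
    blocks = [w[i:i + k] for i in range(0, len(w), k)]
    if not blocks or len(blocks[-1]) == k:
        blocks.append("")          # terminating rule S -> v with |v| < k (here v = lambda)
    return blocks

print(control_word("abcdefg", 3))  # ['abc', 'def', 'g']
print(control_word("abcdef", 3))   # ['abc', 'def', '']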
The algorithm is efficient if its second step can be performed in polynomial time. We will assume this to be the case in the following. In a certain sense, grammar Gj can be viewed as “representing” L(GΣ0, L(Gj)). More precisely, since G0 is universal, for every Gj there exists a Hj ∈ G such that L(Hj) = L(GΣ0, L(Gj)). If the transformation Gj ↦ Hj can be done efficiently3, the above algorithm could indeed give a hypothesis grammar, namely Hj, in its last step.
Remark 6. As described above, each grammar G ∈ G0 translates a string over, say, Σ into another string over another alphabet. For example, each GΣ0,k (as defined in Example 5) can be viewed as a deterministic generalized sequential machine.4 This view immediately explains two things:
– The transformation of a word w into its control word (with respect to GΣ0,k) can be done in linear time.
– If L is a trio, then CONTROL(G0,k, L) = L for each k ≥ 1 due to the Theorem of Nivat [6,27].
Actually, GΣ0,k realizes some sort of tape compression. We will consider words over ∆ = Σ ≤k as control words for GΣ0,k.
3 4
This is indeed the case for all published learning applications of control languages we know of. For notions like generalized sequential machines, trios, etc. we refer to [6,22,27].
Note that even considering identifiable subclasses of the regular languages may yield new interesting identifiable subclasses in this way.
Theorem 7. We continue Example 5. For all k, ℓ ∈ N, CONTROL(G0,k, ℓ-TLSS) is efficiently identifiable from positive samples.5 Moreover,
1. CONTROL(G0,1, ℓ-TLSS) = ℓ-TLSS ⊊ CONTROL(G0,ℓ, 2-TLSS),
2. CONTROL(G0,k, ℓ-TLSS) ⊊ CONTROL(G0,k, (ℓ + 1)-TLSS),
3. CONTROL(G0,k, ℓ-TLSS) ⊊ CONTROL(G0,k+1, ℓ-TLSS),
4. CONTROL(G0,k, ℓ-TLSS) ⊊ CONTROL(G0,kℓ, 2-TLSS).
Proof. We have only to show the “moreover-part”: 1.
CONTROL(G0,1 , `-TLSS) = `-TLSS is clear by definition. In order to show `-TLSS ⊆ CONTROL(G0,` , 2-TLSS), recall that languages from `-TLSS can “test” prefixes and suffixes of length ` − 1 and forbidden subwords of length `. On the other hand, the input word w ∈ Σ ∗ of the combined learning algorithm will be essentially sliced into parts w = u1 . . . un , where ui ∈ Σ ` and un ∈ Σ <` , if we let words v from Σ ` denote <` the rule S → vS in GΣ denote the rule S → v in 0,` and words v ∈ Σ Σ G0,` . Note that u1 , . . . , un are the input symbols of the word given to the 2-TLSS algorithm. Consider now L = I` Σ ∗ ∩ Σ ∗ F` \ Σ ∗ T` Σ ∗ , where I` , F` ⊆ Σ <` and T` ⊆ Σ ` . Let I`0 = (I` Σ ∗ ∩ Σ `−1 ) ∪ (I` Σ ∗ ∩ Σ ∗ F` ∩ Σ <`−1 ) and F`0 = (Σ ∗ F` ∩ Σ `−1 )∪(I` Σ ∗ ∩Σ ∗ F` ∩Σ <`−1 ). Observe that L = I`0 Σ ∗ ∩Σ ∗ F`0 \Σ ∗ T` Σ ∗ and I`0 ∩ F`0 ∩ Σ <`−1 = L ∩ Σ <`−1 . ˆ = Iˆ2 ∆∗ ∩ ∆∗ Fˆ2 \ ∆∗ Tˆ2 ∆∗ ⊂ ∆∗ , where We have to design a language L ≤` ∆=Σ . Let us first consider “short words” u of length < ` in L. Hence, u has to be in the control set, which is guaranteed when u is both in Iˆ2 and in Fˆ2 . In particular, Iˆ2 ∩ Σ <` = L ∩ Σ <` . In the following, we restrict our discussions to words of length at least `. Now, p is a prefix of length ` − 1 of w iff p is prefix of u1 , a property which can be tested easily by the control language which is from 2-TLSS. More specifically, Iˆ2 ∩ Σ ` = {pa | p ∈ I`0 , a ∈ Σ}. The set of subwords of length ` of w equals the set of subwords of length ` of the language {ui ui+1 | 1 ≤ i < n}, so that forbidden subwords of length ` of w can be tested through forbidden subwords of length 2 of the control language. Finally, let s be a suffix of length ` − 1 of w. This basically means that we have to “allow” all suffixes uv of control words, where u ∈ Σ ` , v ∈ Σ <` and s is suffix of uv. In particular, this means that s is in Fˆ2 . But this is not enough. We have to put all suffixes of s in Fˆ2 and forbid T = {uv | u ∈ Σ ` , v ∈ Σ <` , ∀s ∈ F`0 ∩ Σ `−1 : s 6= v and s is not suffix of uv}. The inclusion is strict since Σ 2`−1 ∈ CONTROL(G0,` , 2-TLSS)\`-TLSS.
5
[17] contains an algorithmic definition of an automata family characterizing k-TLSS.
2. & 3. The inclusions themselves are trivial. L = {w ∈ Σ ∗ | |w| ≤ kℓ + 1} ∉ CONTROL(G0,k, ℓ-TLSS), but L lies in both CONTROL(G0,k, (ℓ + 1)-TLSS) and CONTROL(G0,k+1, ℓ-TLSS).
4. This is a straightforward generalization of the first item. ⊓⊔
Example 8. {aa} ∈ 3-TLSS. This yields the control “word” [aa] of length 1 (codifying the rule S → aa) via G{a}0,3. Obviously, [aa] ∈ 2-TLSS.
5
Putting Things Together
Let Ψ = (ψn) and Φ = (φn) be uniformly polynomial-time computable families of permutations. Let G0 be a universal subfamily of the grammar family G. If the language family L(G) is efficiently identifiable from positive samples only, then the following algorithm is also efficient:
1. Permute an input word wj according to Ψ −1, yielding w′j = Ψ −1(wj).
2. Compute the control word πj of w′j according to a suitable G0 ∈ G0.
3. Permute πj according to Φ−1, yielding π′j = Φ−1(πj).
4. The identification algorithm, given π′j, yields a grammar Gj.
5. The new guess of the whole algorithm is Ψ(L(G0, Φ(L(Gj)))).
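A compact Python sketch of one update step of this combined strategy; all component functions are placeholders for the objects fixed in the text.

# One update step of the five-step combined learner (Section 5).

def combined_step(w_j, psi_inv, control_word, phi_inv, identify):
    w_prime = psi_inv(w_j)            # 1. permute the input by Psi^{-1}
    pi_j = control_word(w_prime)      # 2. control word w.r.t. the base grammar G_0
    pi_prime = phi_inv(pi_j)          # 3. permute the control word by Phi^{-1}
    G_j = identify(pi_prime)          # 4. run the base identification algorithm
    return G_j                        # 5. read G_j as Psi(L(G_0, Phi(L(G_j))))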
Here, we get the problem of which language family L(Ψ, G0, Φ, L(G)) will be identified using such a mixed strategy. In some special cases, we could give characterizations of those language families, and we will focus on those families in the following. To this end, let Φ = ID be the family of identities.
Example 9. Let Ψ be defined as in Example 3 and G0,2 as in Example 5. Then, L(Ψ, G0,2, ID, REG) = ELL. This can be easily seen by observing the following facts:
1. The grammars H0Σ = ({S}, Σ, {S → aSb | a, b ∈ Σ} ∪ {S → x | x ∈ Σ <2}, S) are universal for the even linear languages, see [33,23,24].
2. Control words for H0Σ can be viewed as words over ∆, where ∆ = Σ ≤2. Observe that words over ∆ can be viewed as control words for GΣ0,2 as well.
3. If w ∈ Σ ∗ has the control word π according to H0Σ, then Ψ(w) has the control word π according to GΣ0,2 and vice versa.
Hence, in particular, L(Ψ, G0,2, ID, 2-TLSS) = CONTROL({H0Σ | Σ is an alphabet}, 2-TLSS) is identifiable from positive samples only.
The following observation is an immediate corollary of the definitions. It is interesting, since semilinear properties generally need non-trivial proofs.6
The notion of semilinearity is explained, e.g., in [22]. It is important both from a linguistic point of view and from the standpoint of learning algorithms, cf. [8,35].
Corollary 10. Fix k ≥ 1. Let Ψ and Φ be arbitrary families of permutations. Then, L(Ψ, G0,k, Φ, REG) contains only semilinear languages, where G0,k is defined as in Example 5.
In the following, we restrict the notion of universal grammar family further:
Definition 11. Let us call a subfamily G0 = {GΣ0 | Σ is an alphabet} of linear grammars uniformly described if every universal grammar GΣ0 contains exactly one nonterminal S and the rules of GΣ0 can be characterized by a pair of natural numbers (n, m) such that S → α with α ∈ Σ ∗ is a rule iff |α| < n + m and S → αSβ is a rule iff α ∈ Σ n and β ∈ Σ m.
One can easily prove:
Lemma 12. If G0 is a uniformly described grammar family, then CONTROL(G0, REG) ⊆ LIN.
In the proof of the preceding lemma, the regular control language given by, e.g., a deterministic finite automaton, is simulated in the nonterminals which basically store the state of the automaton. For example, the universal grammar subfamilies presented for REG in Example 5 are uniformly described, as well as the universal grammar subfamily for ELL in Example 9.
Theorem 13. If G0 is a uniformly described universal family of linear grammars, then there exists a uniformly polynomial-time computable family of permutations Ψ such that Ψ(L(G0)) = REG.
Proof. (Sketch) Let GΣ0 ∈ G be described by the number pair (n, m). Let Ψ = (ψℓ | ℓ ∈ N) be recursively defined as:
ψℓ(ν) = ν,                        if ℓ < n + m ∨ ν ≤ n
ψℓ(ℓ − ν + 1) = n + ν,            if ℓ ≥ n + m ∧ ν ≤ m
ψℓ(ν) = n + m + ψℓ−n−m(ν − n),    if ν ∈ (n, ℓ − m)
⊓⊔
Example 14. Considering the rule pattern S → uSv, S → u and S → λ where u, v are terminal placeholders (this corresponds to Example 9), we obtain a grammar family which characterizes ELL. The corresponding family of permutations can be defined recursively according to the preceding proof as:
ψn(m) = 1,                 if m = 1
ψn(m) = 2,                 if m = n
ψn(m) = ψn−2(m − 1) + 2,   if m ∈ (1, n)
By an obvious induction argument, one sees that this family of permutations is the same as that in Example 3.
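The claimed coincidence with Example 3 can also be checked mechanically; a short, illustrative Python sketch (function names are ad hoc):

# Check that the recursion of Example 14 agrees with the closed form of Example 3.

def psi_closed(n, m):
    return 2 * m - 1 if m <= (n + 1) / 2 else 2 * (n - m + 1)

def psi_rec(n, m):
    if m == 1:
        return 1
    if m == n:
        return 2
    return psi_rec(n - 2, m - 1) + 2      # case 1 < m < n

assert all(psi_rec(n, m) == psi_closed(n, m)
           for n in range(1, 30) for m in range(1, n + 1))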
As a simple corollary of our previous discussions, we can state:
Corollary 15. Consider a language family L ⊂ REG for which a polynomial-time learning algorithm in the limit from positive data is known and a uniformly described universal grammar family G0. Let Ψ be the family of permutations defined in the previous theorem. Then Ψ −1(L) ⊂ LIN can be learned in polynomial time in the limit from positive data as well. Moreover, there is a family of linear grammars characterizing Ψ −1(L).
Our observations are easily generalizable towards regular or linear tuple languages, cf. [25] (which are indeed equivalent to regular or linear simple matrix languages, cf. [9,29,34]) or towards regular-like expressions and their automata, see [7]. We give a combination of Definitions 25 and 26 of Brzozowski as well as of those of Păun, incorporating a suitable “even”-condition, in the following:
Definition 16. An even one-sided linear parallel grammar of order n with direction vector δ = δ1 . . . δn ∈ {L, R} n is an (n + 3)-tuple G = (V1, . . . , Vn, Σ, M, S), where {S}, V1, . . . , Vn, Σ are pairwise disjoint alphabets (VN = V1 ∪ . . . ∪ Vn ∪ {S} contains the nonterminals and Σ the terminals), and M is a finite set of matrices of the form
1. (S → A1 . . . An), for Ai ∈ Vi, 1 ≤ i ≤ n, or
2. (A1 → λ, . . . , An−1 → λ, An → xn), for Ai ∈ Vi, xn ∈ Σ
The last equality can be seen by the fact that RL-ELPL grammars can be viewed as an alternative formalization of regularly controlled external contextual grammars, whose relation to linear grammars is exhibited in [28, Section 12.3].
6
Further Inference Methods
6.1
Morphic Generator Grammatical Inference
This methodology has been proposed in [14,16]. Here, the starting point is the well-known fact that every regular language is the image of a 2-testable language under a letter-to-letter morphism [26]. So, the general grammatical inference method as proposed by García et al. [16] is the following one: Given an input sample from an unknown (regular) language, choose a mapping g to transform the sample. Then, an inference method for 2-TLSS is applied to the transformed sample in order to obtain a 2-TLSS language. Finally, a morphism h is selected in order to transform the conjectured language to the original regular one. If g and h are well selected, then the target regular language can be learned from the original positive input sample. Unfortunately, there is no way to characterize the mappings g and h in order to perform the learning task. Furthermore, as a consequence of a previous work by Angluin [3], the general strategy to learn regular languages from only positive samples is not possible. Here, we can prove the following generalization to the method by García et al.:
Theorem 18. Let Ψ, Φ be families of permutations. For every k ≥ 1, every language from L(Ψ, G0,k, Φ, REG) is the image of a language from L(Ψ, G0,k, Φ, 2-TLSS) under a letter-to-letter morphism.
Proof. (Sketch) Consider a language L ∈ L(Ψ, G0,k, Φ, REG), L ⊆ Σ ∗. This means that Ψ −1(L) = L(GΣ0,k, Φ(R)) for a suitable regular language R. By [26], there exists a letter-to-letter morphism h : X × Σ ≤k → Σ ≤k such that R = h(R′) for some R′ ∈ 2-TLSS. Since Φ is a permutation, we have Φ(R) = Φ(h(R′)) = h(Φ(R′)). Every word (x1, α1) . . . (xn, αn) ∈ X × Σ ≤k can be viewed as control word of GX×Σ0,k, considering (X × Σ) ≤k as a subset of X × Σ ≤k. Taking now the natural projection letter-to-letter morphism h′ : X × Σ → Σ, we conclude L = h′(L′), where Ψ −1(L′) = L(GX×Σ0,k, Φ(R′)). ⊓⊔
This suggests the following methodology for identifying languages L ⊆ Σ ∗ from L(Ψ, G0,k, Φ, REG):8
1. Choose a (larger) alphabet Σ′ and a letter-to-letter morphism h : Σ′ → Σ.
2. Choose an easily computable function g : Σ ∗ → Σ′∗ such that h(g(w)) = w for all w ∈ Σ ∗.
3. The input sequence w1, w2, . . . is transformed into the sequence g(w1), g(w2), . . ., which is given to the identification algorithm for L(Ψ, G0,k, Φ, 2-TLSS).
4. The output language sequence L1, L2, . . . hence obtained is interpreted as the language sequence h(L1), h(L2), . . . of languages over Σ.
8 Again, we assume the transformations induced by Ψ, G0,k and Φ to be computable in polynomial time.
6.2
Query Learning
In the query learning model introduced by Angluin [5], the inference machine IM plays an active role (in contrast with Gold's model) in the sense that IM interacts with a teacher T. More precisely, at the beginning of this dialogue, IM is just informed about the terminal alphabet Σ of the language L that IM should learn. IM may ask T the following questions:
Membership query: Is w ∈ L?
Equivalence query: Does the hypothesis grammar G generate L?
Teacher T reacts as follows to the questions:
1. To a membership-query, T answers either “yes” or “no”.
2. To an equivalence-query, T answers either “yes” (here, the learning process may stop, since L has performed its task successfully) or “no, I will show you a counterexample w.”
Since Angluin showed that all regular languages can be learned in polynomial time by using the learner-teacher dialogue just explained, we can immediately infer:
Theorem 19. Let k ≥ 1. If Ψ and Φ are computable in polynomial time, then L(Ψ, G0,k, Φ, REG) can be learned in the query learning model in polynomial time. ⊓⊔
Remark 20. When equivalence queries are not reckoned as oracle calls, they should be computable. Since Ψ(L) = Ψ(L′) iff L = L′, it is not hard to see that equivalence is indeed decidable within the (Ψ, G0,k, Φ)-setting due to the decidability of the equivalence problem for regular languages.
In the query learning model, the complete power of the formalism is not needed:
Theorem 21. Let k ≥ 1. If Ψ and Φ are computable in polynomial time, then L(Ψ, G0,k, Φ, REG) = Ξ(REG) for some polynomial-time computable permutation Ξ.
Proof. By definition, L ∈ L(Ψ, G0,k, Φ, REG) if Ψ −1(L) = L(GΣ0,k, Φ(R)) for some regular language R. Due to Remark 6, L(GΣ0,k, M) = τ(M) for some rational transduction τ. Moreover, for another permutation Φ′, L(GΣ0,k, Φ(R)) = Φ′(τ(R)) = Φ′(R′) for some regular language R′. Setting Ξ = ΨΦ′ yields the theorem. ⊓⊔
This implies that when we are interested in language classes induced by the whole class of regular languages, we can confine ourselves to permutations, which conceptually simplifies all considerations in this case.
7
Conclusions
We presented two quite powerful, general and mutually related mechanisms to define further efficiently learnable language classes from already known ones, namely
1. by using families of permutations and
2. by exploiting control language features.
Such “language class generators” can be quite useful since it has turned out that, in various applications, it is necessary to make a choice of the language class to be learned based on experience or additional knowledge inspired by the application, see, e.g., the discussion in [1]. Therefore, it seems to be good if one could enhance the set of possible choices and use, e.g., known structural information like bracket structures.
Furthermore, since polynomial-time computable permutation families are closed under permutationwise composition, it makes sense to consider language families like the class Ψ −1(Ψ −1(Ψ −1(L))), which is also efficiently inferrable from positive data if L is. Similarly, any class like CONTROL(G0, CONTROL(G0, CONTROL(G0, L))) is efficiently inferrable from positive data if L is. Such hierarchies deserve further studies, as begun in [37,36].
Finally, it would be interesting to develop further general techniques in order to apply known learning (especially, identification) algorithms to other language families. One such technique could consist in a suitable splitting of input words, as exhibited in the case of externally contextual languages in [13] and in the case of parallel communicating grammar systems in [12].
References 1. H. Ahonen. Generating grammars for structured documents using grammatical inference methods. Phd thesis. Also: Report A-1996-4, Department of Computer Science, University of Helsinki, Finland, 1996. 2. V. Amar and G. Putzolu. On a family of linear grammars. Information and Control, 7:283–291, 1964. 3. D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980. 4. D. Angluin. Inference of reversible languages. Journal of the Association for Computing Machinery, 29(3):741–765, 1982. 5. D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87–106, 1987. 6. J. Berstel. Transductions and Context-Free Languages, volume 38 of LAMM. Stuttgart: Teubner, 1979. 7. J. A. Brzozowski. Regular-like expressions for some irregular languages. In IEEE Conf. Record of 9th Ann. Symp. on Switching and Automata Theory SWAT, pages 278–280, 1968.
8. J. Dassow and Gh. P˘ aun. Regulated Rewriting in Formal Language Theory, volume 18 of EATCS Monographs in Theoretical Computer Science. Berlin: Springer, 1989. 9. H. Fernau. Efficient learning of some linear matrix languages. In T. Asano et al., editors, COCOON’99, volume 1627 of LNCS, pages 221–230, 1999. 10. H. Fernau. Learning of terminal distinguishable languages. Technical Report WSI–99–23, Universit¨ at T¨ ubingen (Germany), Wilhelm-Schickard-Institut f¨ ur Informatik, 1999. Short version published in the proceedings of AMAI 2000, see http://rutcor.rutgers.edu/˜amai/AcceptedCont.htm. 11. H. Fernau. k-gram extensions of terminal distinguishable languages. In Proc. International Conference on Pattern Recognition. IEEE/IAPR, 2000. To appear. 12. H. Fernau. PC grammar systems with terminal transmission. In Proc. International Workshop on Grammar Systems, 2000. To appear. 13. H. Fernau and M. Holzer. External contextual and conditional languages. To appear in a book edited by Gh. P˘ aun, 1999. 14. P. Garc´ıa et al. On the use of the morphic generator grammatical inference (MGGI) methodology in automatic speech recognition. International Journal of Pattern Recognition and Artificial Intelligence, 4:667–685, 1990. 15. P. Garc´ıa and E. Vidal. Inference of k-testable languages in the strict sense and applications to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12:920–925, 1990. 16. P. Garc´ıa, E. Vidal, and F. Casacuberta. Local languages, the successor method, and a step towards a general methodology for the inference of regular grammars. IEEE Transactions on Pattern Analysis and Machine Intelligence, 9:841–845, 1987. 17. P. Garc´ıa, E. Vidal, and J. Oncina. Learning locally testable languages in the strict sense. In First International Workshop on Algorithmic Learning Theory ALT’90, pages 325–328, 1990. 18. E. M. Gold. Language identification in the limit. Information and Control, 10:447– 474, 1967. 19. J. Gregor. Data-driven inductive inference of finite-state automata. International Journal of Pattern Recognition and Artificial Intelligence, 8(1):305–322, 1994. 20. S. A. Greibach. Control sets on context-free grammar forms. Journal of Computer and System Sciences, 15:35–98, 1977. 21. T. Head, S. Kobayashi, and T. Yokomori. Locality, reversibility, and beyond: Learning languages from positive data. In ALT’98, volume 1537 of LNCS, pages 191–204, 1998. 22. J. E. Hopcroft and J. D. Ullman. Introduction to Automata Theory, Languages, and Computation. Reading (MA): Addison-Wesley, 1979. 23. E. M¨ akinen. The grammatical inference problem for the Szilard languages of linear grammars. Information Processing Letters, 36:203–206, 1990. 24. E. M¨ akinen. A note on the grammatical inference problem for even linear languages. Fundamenta Informaticae, 25:175–181, 1996. 25. H. A. Maurer and W. Kuich. Tuple languages. In W. D. Itzfeld, editor, Proc. of the ACM International Computing Symposium, pages 882–891. German Chapter of the ACM, 1970. 26. T. Y. Medvedev. On the class of events representable in a finite automaton (translated from russian). In E. F. Moore, editor, Sequential Machines – Selected papers, pages 227–315. Addison–Wesley, 1964. 27. M. Nivat. Transductions des langages de Chomsky. Ann. Inst. Fourier, Grenoble, 18:339–456, 1968.
28. Gh. P˘ aun. Marcus contextual grammar. Studies in Linguistics and Philosophy. Dordrecht: Kluwer Academic Publishers, 1997. 29. Gh. P˘ aun. Linear simple matrix languages. Elektronische Informationsverarbeitung und Kybernetik (jetzt J. Inf. Process. Cybern. EIK), 14:377–384, 1978. 30. V. Radhakrishnan. Grammatical Inference from Positive Data: An Effective Integrated Approach. PhD thesis, Department of Computer Science and Engineering, Indian Institute of Technology, Bombay (India), 1987. 31. V. Radhakrishnan and G. Nagaraja. Inference of regular grammars via skeletons. IEEE Transactions on Systems, Man and Cybernetics, 17(6):982–992, 1987. 32. J. Ruiz, S. Espa˜ na, and P. Garc´ıa. Locally threshold testable languages in strict sense: application to the inference problem. In V. Honvar and G. Slutzki, editors, Grammatical Inference, 4th Intern. Coll. ICGI-98, volume 1433 of LNCS, pages 150–161, 1998. 33. Y. Takada. Grammatical inference of even linear languages based on control sets. Information Processing Letters, 28:193–199, 1988. 34. Y. Takada. Learning even equal matrix languages based on control sets. In A. Nakamura et al., editors, Parallel Image Analysis, ICPIA’92, volume 652 of LNCS, pages 274–289, 1992. 35. Y. Takada. Learning semilinear sets from examples and via queries. Theoretical Computer Science, 104:207–233, 1992. 36. Y. Takada. A hierarchy of language families learnable by regular language learning. Information and Computation, 123:138–145, 1995. 37. Y. Takada. Learning formal languages based on control sets. In K. P. Jantke and S. Lange, editors, Algorithmic Learning for Knowledge-Based Systems, volume 961 of LNCS/LNAI, pages 317–339, 1995.
On the Complexity of Consistent Identification of Some Classes of Structure Languages
Christophe Costa Florêncio1⋆
UiL OTS (Utrecht University), Trans 10, 3512 JK Utrecht, Netherlands
[email protected]
Abstract. In [5,7] ‘discovery procedures’ for CCGs were defined that accept a sequence of structures as input and yield a set of grammars. In [11] it was shown that some of the classes based on these procedures are learnable. The complexity of learning them was still left open. In this paper it is shown that learning some of these classes is NP-hard under certain restrictions. Keywords: identification in the limit, inductive inference, consistent learning, complexity of learning, classical categorial grammar.
1
Identification in the Limit
In the seminal paper [9] the concept of identification in the limit was introduced. In this model of learning a learning function receives an endless stream of sentences from the target language, called a text, and hypothesizes a grammar for the target language at each time-step. A class of languages is called learnable if and only if there exists a learning function such that after a finite number of presented sentences it guesses the right language on every text for every language from that class and does not deviate from this hypothesis. Research within this framework is known as formal learnability theory. In this paper only those aspects of formal learnability theory that are relevant to the proof of NP-hardness will be discussed. See [15] and [10] for a comprehensive overview of the field. In formal learnability theory the set Ω denotes the hypothesis space, which can be any class of finitary objects. Members of Ω are called grammars. The set S denotes the sample space, a recursive subset of Σ ∗ for some fixed finite alphabet Σ. Elements of S are called sentences, subsets of S (which obviously are sets of sentences) are called languages. The function L maps elements of Ω to subsets of S. If G is a grammar in Ω, then L(G) is called the language generated by (associated with) G. L is also called the naming function. The question whether a sentence belongs to a language generated by a grammar is called the universal membership problem. ?
I would like to thank Dick de Jongh and Peter van Emde-Boas for their valuable comments.
A triple ⟨Ω, S, L⟩ satisfying the above conditions is called a grammar system. A class of grammars is denoted G, a class of languages is denoted L. I will adopt notation from [11] and let FL denote a class of structure languages, to be defined in Section 3. The corresponding naming function is FL(G). Learning functions are written as ϕ, their input sequences as σ or τ.
The classes discussed in this paper are all indexed families of recursive languages. These are quite natural properties for linguistically plausible grammar formalisms, and very convenient when dealing with learnability issues. In [3] and [21] characterizations of learnable classes under these restrictions were given, where it was shown that the property known as finite elasticity1 is a sufficient condition for learnability when dealing with such classes.
Definition 1. A class L of languages is said to have infinite elasticity if there exists an infinite sequence ⟨sn⟩n∈N of sentences and an infinite sequence ⟨Ln⟩n∈N of languages in L such that for all n ∈ N, sn ∉ Ln, and {s0, . . . , sn} ⊆ Ln+1. A class L of languages is said to have finite elasticity if it does not have infinite elasticity.
A closely related concept is finite thickness, which implies finite elasticity.
Definition 2. A class L of languages is said to have finite thickness if for each n ∈ N, card{L ∈ L | n ∈ L} is finite.2
1.1
Constraints on Learning Functions
The behaviour of learning functions can be constrained in a number of ways; such a constraint is restrictive if it restricts the space of learnable classes. Only some important constraints relevant to this discussion will be defined here:
Definition 3. Consistent Learning
A learning function ϕ is consistent on G if for any L ∈ L(G) and for any finite sequence ⟨s0, . . . , si⟩ of elements of L, either ϕ(⟨s0, . . . , si⟩) is undefined or {s0, . . . , si} ⊆ L(ϕ(⟨s0, . . . , si⟩)).
Informally, consistency requires that the learning function explains all the data it sees with its conjecture.
Definition 4. Prudent Learning
A learning function ϕ learns G prudently if ϕ learns G and content(ϕ) ⊆ G. Prudent learners only hypothesize grammars that are in their class.
Definition 5. Responsive Learning
A learning function ϕ is responsive on G if for any L ∈ L(G) and for any finite sequence ⟨s0, . . . , si⟩ of elements of L ({s0, . . . , si} ⊆ L), ϕ(⟨s0, . . . , si⟩) is defined.
1 The notions of both finite and infinite elasticity are due to Wright [21]. The original definitions were incorrect, and were later corrected in [14].
2 Note that in recursion theory languages are regarded as sets of natural numbers.
A responsive learning function is always defined, as long as the text is consistent with a language from its class. Given the assumptions mentioned earlier, none of these constraints are restrictive. 1.2
Time Complexity of Learning Functions
In formal learnability theory there are no a priori constraints on the computational resources required by the learning function. In [10] a whole chapter has been devoted to complexity issues in identification, where it is noted that there is a close relationship between the complexity of learning and computational complexity of functionals and operators. Defining the latter is a complex problem and still an active area of research. It is therefore no surprise that only partial attempts have been made at modeling the complexity of the identification process. Some examples are given that are based on bounding the number of mind changes of a learner, or bounding the number of examples required before the onset of convergence. These definitions do not seem to be directly related to any ‘computational’ notion of complexity. Ideally, such a constraint would satisfy some obvious intuitions about what constitutes tractability: for example, in the worst case a learning function should converge to a correct solution in polynomial time with respect to the size of the input. Such definitions are not directly applicable, since the input is not guaranteed to be helpful, for example it can start with an unbounded number of presentations of the same sentence. In full generality there can never be a bound on the number of time-steps before convergence, so such a constraint poses no bounds on computation time whatsoever. It turns out that giving a usable definition of the complexity of learning functions is not exactly easy. In this subsection some proposals and their problems will be discussed, and the choice for one particular definition will be motivated. In [9] a definition of efficiency for learning functions known as text-efficiency is given: a function ϕ identifies L (text-)efficiently just if there exists no other function that, for every language in L, given the same text, converges at the same point as ϕ or at an earlier point, formally this can be simply regarded as a constraint. Although the text-efficiency constraint seems to correspond to a rational learning strategy, by itself it is hardly restrictive. Every learnable class is textefficiently learnable. Also, there is no direct connection between text-efficiency and time complexity. Text-efficiency seems to be of limited interest to the present discussion. Let the complexity of the update-time of some (computable) learning function ϕ be defined as the number of computing steps it takes to learn a language, with respect to |σ|, the size of the input sequence. In [17] it was first noted that requiring the function to run in a time polynomial with respect to |σ| does not constitute a significant constraint, since one can always define a learning function ϕ0 that combines ϕ with a clock so that its amount of computing time is
bounded by a polynomial over |σ|. Obviously, ϕ0 learns the same class as ϕ, and it does so in polynomial update-time3 . The problem here is that without additional constraints on ϕ the ‘burden of computation’ can be shifted from the number of computations the function needs to perform to the amount of input data considered by the function4 . Requiring the function to be consistent already constitutes a significant constraint when used in combination with a complexity restriction (see [4]). See [19] for a discussion of consistent polynomial-time identification. In [1], consistent and conservative learning with polynomial time of updating conjectures was proposed as a reasonable criterion for efficient learning. The consistency and conservatism requirements ensure that the update procedure really takes all input into account. 5 This definition was applied in [13] to analyze the complexity of learning regular term tree languages. There does not seem to be any generally accepted definition of what constitutes a tractable learning function. A serious problem with Angluin’s approach is that it is not generally applicable to learning functions for any given class, since both consistency and (especially) conservatism are restrictive. I will therefore apply only the restrictions of consistency and polynomial update-time, since this seems to be the weakest combination of constraints that is not trivial and has an intuitive relation with standard notions of computational complexity. This definition has drawbacks too: not all learnable classes can be learned by a learning function that is consistent on its class, so even this complexity measure cannot be generally applied. There is also no guarantee that for a class that is learnable by a function consistent on that class characteristic samples (i.e. samples that justify convergence to the right grammar) can be given that are uniformly of a size polynomial in the size of their associated grammar. See [19,20] for discussions of the relation between the consistency constraint and complexity of learning functions.
2
Classical Categorial Grammar and Structure Languages
The classes defined in [5,7] are based on a formalism for (ε-free) context-free languages called classical categorial grammar (CCG)6. In this section the relevant concepts of CCG will be defined. I will adopt notation from [11]. In CCG each symbol in the alphabet Σ gets assigned a finite number of types. Types are constructed from primitive types by the operators \ and /. We let Pr
5 6
To be more precise: in [8] it was shown that any unbounded monotone increasing update boundary is not by itself restrictive. Similar issues seem to be important in the field of computational learning theory (see [12] for an introduction). The notion sample complexity from this field seems closely related to the notions of text- and data-efficiency. There also exists a parallel with our notion of (polynomial) update-time. It is interesting to note that a conservative (and prudent) learner that is consistent on its class is text-efficient (Proposition 8.2.2 A in [16]). Also known as AB languages.
denote the (countably infinite) set of primitive types. The set of types Tp is defined as follows:
Definition 6. The set of types Tp is the smallest set satisfying the following conditions:
1. Pr ⊆ Tp,
2. if A ∈ Tp and B ∈ Tp, then A\B ∈ Tp,
3. if A ∈ Tp and B ∈ Tp, then B/A ∈ Tp.
One member t of Pr is called the distinguished type. In CCG there are only two modes of type combination, backward application, A, A\B ⇒ B, and forward application, B/A, A ⇒ B. In both cases, type A is an argument, the complex type is a functor. Grammars consist of type assignments to symbols, i.e. symbol ↦ T, where symbol ∈ Σ, and T ∈ Tp.
Definition 7. A derivation of B from A1, . . . , An is a binary branching tree that encodes a proof of A1, . . . , An ⇒ B.
Through the notion of derivation the association between grammar and language is defined. All structures contained in some given structure language correspond to a derivation of type t based solely on the type assignments contained in a given grammar. The string language associated with G consists of the strings corresponding to all the structures in its structure language, where the string corresponding to some derivation consists just of the leaves of that derivation. The class of all categorial grammars is denoted CatG, the grammar system under discussion is ⟨CatG, Σ F, FL⟩. The symbol FL is an abbreviation of functor-argument language, which is a structure language for CCG. Structures are of the form symbol, fa(s1,s2) or ba(s1,s2), where symbol ∈ Pr, fa stands for forward application, ba for backward application and s1 and s2 are also structures. We will only be concerned with structure languages in the remainder of this article.
The definition of identification in the limit (Section 1) can be applied in a straightforward way by replacing 'language' with 'structure language'; from a formal point of view this makes no difference. Note that, even though structure languages contain more information than string languages, learning a class of structure languages is not necessarily easier than learning the corresponding class of string languages. This is because the identification criterion for structure languages is stronger than that for string languages: when learning structure languages, a learner must identify grammars that produce the same derivations, not just the same strings. This makes learning such classes hard, from the perspective of both learnability and complexity.
All learning functions in [11] are based on the function GF. This function receives a sample of structures D as input and yields a set of assignments (i.e. a grammar) called the general form as output. It is a homomorphism and runs in linear time. It assigns t to each root node, assigns distinct variables to the argument nodes, and computes types for the functor nodes: if s1 ↦ A, given ba(s1,s2) ⇒ B, s2 ↦ A\B. If s1 ↦ A, given fa(s2,s1) ⇒ B, s2 ↦ B/A.
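A hedged Python sketch of GF as just described; the tuple encoding of structures and types is an assumption made for illustration, not the representation used in [11].

# General form GF(D) for functor-argument structures.  A leaf is a symbol
# (string); internal nodes are ('fa', s1, s2) or ('ba', s1, s2).  Types are
# 't', fresh variables 'x1', 'x2', ..., ('\\', A, B) for A\B, ('/', B, A) for B/A.

from itertools import count

def GF(structures):
    fresh = count(1)
    assignments = {}                       # symbol -> set of types

    def visit(s, B):
        if isinstance(s, str):             # leaf: record assignment symbol |-> B
            assignments.setdefault(s, set()).add(B)
            return
        op, s1, s2 = s
        A = 'x%d' % next(fresh)            # distinct variable for the argument node
        if op == 'ba':                     # A, A\B => B : s1 argument, s2 functor
            visit(s1, A)
            visit(s2, ('\\', A, B))
        elif op == 'fa':                   # B/A, A => B : s1 functor, s2 argument
            visit(s1, ('/', B, A))
            visit(s2, A)

    for s in structures:
        visit(s, 't')                      # each root derives the distinguished type t
    return assignments

# Example: GF({fa(g, w)}) assigns t/x1 to g and x1 to w.
print(GF([('fa', 'g', 'w')]))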
Categorial types can be treated as terms, so natural definitions of substitution and unification apply. A substitution over a grammar is just a substitution over all of the types contained in its assignments. We state without proof that FL(G) ⊆ FL(Θ[G]), see [11] for details. The following proposition and corollary will be convenient for the proof of NP-hardness:
Proposition 8. For every structure s, if s ∈ FL(G), then there exists a substitution Θ such that Θ[GF({s})] ⊆ G.
Proof. (Sketch): Since s ∈ FL(G), G contains a minimal set of type assignments G′ ⊆ G such that G′ admits the derivation of type t corresponding with structure s. Each step in a derivation can take the form of just the following three cases:
– The structure deriving type T is symbol S ∈ Σ.
– The structure deriving type T is fa(s1, s2). Structure s2 derives some type Tnew, structure s1 derives T/Tnew. The type Tnew may be complex.
– The structure deriving type T is ba(s1, s2). Structure s1 derives some type Tnew, structure s2 derives Tnew\T. The type Tnew may be complex.
This inductive definition shows G′ to be equivalent to GF({s}), except that primitive types in GF({s}) may correspond to complex ones in G′. Let Θ be the substitution that maps the complex types in G′ to the corresponding primitive types in GF({s}). Then Θ[GF({s})] = G′. Since G′ ⊆ G, Θ[GF({s})] ⊆ G follows. ⊓⊔
Corollary 9. For every consistent learning function ϕ learning a subclass of CatG and every sequence σ for a language from that subclass there exists a substitution Θ such that Θ[GF(σ)] ⊆ ϕ(σ), if ϕ(σ) is defined.
Thus, if GF(σ) assigns x different types to the same symbol that are pairwise not unifiable, the consistent learning function ϕ(σ) assigns at least x different types to that same symbol.
3
The Classes of Grammars
In the following subsections definitions for the relevant classes will be given. The first two classes are especially important for understanding the proof of NP-hardness. Rigid Grammars: A rigid grammar is a partial function from Σ to Tp. It assigns either zero or one type to each symbol in the alphabet. We write Grigid to denote the class of rigid grammars over Σ. The class {FL(G) | G ∈ Grigid } is denoted FLrigid . This class is learnable with polynomial update-time, by simply unifying all types assigned to the same symbol in the general form. The other classes defined in [5,7] are generalizations of this class.
k-Valued Grammars: A k-valued grammar is a partial function from Σ to Tp. It assigns at most k types to each symbol in the alphabet. We write Gk-valued to denote the class of k-valued grammars over Σ. The class {FL(G) | G ∈ Gk-valued } is denoted FLk-valued . Note that in the special case k = 1, Gk-valued is equivalent to Grigid . The learning function ϕVGk 7 learns Gk-valued from structures. The proof of NP-hardness that we will give applies directly to the classes of k-valued grammars. The proof of this result then applies to some of the following related classes. Note that this class (and the rigid grammars) are the only classes in this paper that have finite elasticity, see [11]. Least-Valued Grammars: A grammar G is called a least-valued grammar if it is least-valued with respect to FL(G). Let L ⊆ Σ F . A grammar G ∈ Gk+1 -valued −Gk-valued is called least-valued with respect to L if L ⊆ FL(G) and there is no G0 ∈ Gk-valued such that L ⊆ FL(G0 ). We write Gleast-valued to denote the class of least-valued grammars over Σ. The class {FL(G) | G ∈ Gleast-valued } is denoted FLleast-valued . The learning function ϕLVG learns Gleast-valued from structures. Optimal Grammars: Another extension of rigid grammars proposed by Buszkowski and Penn is the class of optimal grammars. The algorithm associated with this class, OG, is based on a generalization of unification called optimal unification. We write Goptimal to denote the class of optimal grammars over Σ. The class {FL(G) | G ∈ Goptimal } is denoted FLoptimal . These grammars can be obtained by unifying GF(D) ‘as much as possible’. Thus, from no G ∈ OG a G0 6= G can be obtained by unifying types assigned to the same symbol in G. The class of optimal grammars is not learnable (see [11], Corollary 7.22). It is only mentioned here since it is a superclass of the least cardinality grammars and the minimal grammars. Least Cardinality Grammars: We write Gleast-card to denote the class of least cardinality grammars over Σ. The class {FL(G) | G ∈ Gleast-card } is denoted FLleast-card . If D is a finite set of functor-argument structures, let LCG(D) = {G ∈ OG(D) | ∀G0 ∈ OG(D)(|G| ≤ |G0 |)}. Let L ⊆ Σ F . A grammar G is said to be of least cardinality with respect to L if L ⊆ FL(G) and there is no grammar G0 such that |G0 | < |G| and L ⊆ FL(G0 ). if G ∈ LCG(D), then G is of least cardinality with respect to D. 7
With this function, and the functions defined for the other classes, we will denote arbitrary learning functions that learn these classes, not necessarily the particular functions defined in [11].
A grammar G is called a least cardinality grammar if G is of least cardinality with respect to FL(G). The learning function ϕLCG learns Gleast-card from structures. Minimal Grammars: Like least cardinality grammars, the class of minimal grammars is a subclass of optimal grammars. Hypothesized grammars are required to be minimal according to a certain partial ordering, in addition to being optimal. We write Gminimal to denote the class of minimal grammars over Σ. The class {FL(G) | G ∈ Gminimal } is denoted FLminimal . The following proposition will be useful later on: Proposition 10. ([11], Proposition 7.52) If a grammar G is of least cardinality with respect to L, then G is minimal with respect to L. Whether or not Gminimal is learnable from structures is an open question. Kanazawa conjectures it is learnable (see [11], Section 7.3).
4
The Proof
In order to prove NP-hardness of an algorithmic problem L, it suffices to show that there exists a polynomial-time reduction from an NP-complete problem L′ to L. We will present such a reduction using the vertex-cover problem, a well-known NP-hard problem from the field of operations research.
Definition 11. Let G = (V, E) be an undirected graph, where V is a set of vertices and E is a set of edges, represented as tuples of vertices. A vertex cover of G is a subset V′ ⊆ V such that if (u, v) ∈ E, then u ∈ V′ or v ∈ V′ (or both). That is, each vertex 'covers' its incident edges, and a vertex cover for G is a set of vertices that covers all the edges in E. The size of a vertex cover is the number of vertices in it. The vertex-cover problem is the problem of finding a vertex cover of minimum size (called an optimal vertex cover) in a given graph.
The vertex-cover problem can be restated as a decision problem: does a vertex cover of given size k exist for some given graph?
Proposition 12. The decision problem related to the vertex-cover problem is NP-complete. The vertex-cover problem is NP-hard.
Since the formal proof of Theorem 13 below will be somewhat complex, I will first give an informal sketch of its structure. Let graph Graph be given. Construct an alphabet A and a sample D, that is, a set of structures D = {S0, . . . , Sn}, using A, following some recipe so that this sample represents Graph. A consistent learning function ϕ presented with D can only conjecture grammars whose associated languages contain D. Using Corollary 9 it will be shown that, in order for these grammars to be in ϕ's class, they have to correspond to vertex covers for Graph of at most some given size. Therefore, computing the conjecture after the last element of D is input solves the decision problem related to the vertex-cover problem, which is NP-complete.
Theorem 13. Learning the classes Gk-valued from structures by means of one function that, for each k, is responsive and consistent on its class and learns its class prudently is NP-hard.

Proof. The decision version of the vertex-cover problem can be transformed in polynomial time to the problem of learning a k-valued grammar from structures by means of a learning function consistent on that class. That is, given a bound on the size of the vertex cover, the function will yield a solution, or will be undefined (see footnote 8) if no vertex cover of that size exists. The transformation of the initial graph to an input sample will now be detailed. The initial graph consists of edges, which are numbered 1, . . . , e, and vertices, which are numbered 1, . . . , v.

First, for every edge i in E, we produce a structure fa(a, ba(w, . . . , ba(w, fa(g, w)) . . . )), where the ba(w, ·) part is nested i times. Construct a sample D from all these structures for 1 ≤ i ≤ e. Inclusion of these structures results in the assignment of types Πi = (Wx\ . . . (Wx+i−1\(Ai/Wx+i)) . . . ), for all 1 ≤ i ≤ e, to symbol g in GF(D) (obviously no pair Πi, Πj can be unified unless i = j). Symbol a gets assigned just the types t/Ai for all 1 ≤ i ≤ e. Symbol w gets assigned just some number w of primitive types W1, . . . , Ww in GF(D).

For each vertex j, 1 ≤ j ≤ v, create the structure Ωj = ba(w, . . . ba(w, g) . . . ), with j nested occurrences of ba(w, ·). For each edge i, add the structures ba(w, ba(w, . . . , fa(Ω_fe(i,a), w) . . . )) and ba(w, ba(w, . . . , fa(Ω_fe(i,b), w) . . . )) to D, each with i nested occurrences of ba(w, ·). Here fe(i, a) and fe(i, b) give (indices for) the two vertices incident on edge i (see footnote 9). This results in the assignment of types Γ(i,a) = Wy\(Wy+1\ . . . (Λ_fe(i,a)/Wy+i) . . . ) and Γ(i,b) = Wq\(Wq+1\ . . . (Λ_fe(i,b)/Wq+i) . . . ) to symbol g in GF(D), where Λ_fe(i,a) = (Wz\ . . . (Wz+fe(i,a)−1\t) . . . ) and Λ_fe(i,b) = (Wr\ . . . (Wr+fe(i,b)−1\t) . . . ).

Let max be the size of the desired vertex cover, and let d = 2e − max. Obviously d ≥ 0. The last step in constructing D consists of adding 'filler' types, to pad the number of types assigned to a. These will look like (. . . (Wy\t)/ . . . /Wy+l−1) for a given l. Let Filler denote a (possibly empty) set containing d types.
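The recipe above can be made concrete with a small sketch that emits the structures of D as plain strings. The string encoding and the helper names below are illustrative assumptions, not notation from the paper.

```python
def ba_chain(times, core):
    """Nest `core` inside `times` applications of ba(w, .)."""
    s = core
    for _ in range(times):
        s = f"ba(w, {s})"
    return s

def build_sample(num_edges, edge_ends):
    """edge_ends[i] = (vertex_a, vertex_b) for edge i (1-based indices).

    Returns the list of functor-argument structures (as strings) that make up D,
    following the recipe in the proof: one fa(a, ...) structure per edge, plus
    the two Omega-based structures per edge.
    """
    D = []
    for i in range(1, num_edges + 1):
        # structure forcing type Pi_i on g
        D.append(f"fa(a, {ba_chain(i, 'fa(g, w)')})")
        # Omega_j = ba(w, ... ba(w, g) ...) with j nested ba's
        omega_a = ba_chain(edge_ends[i][0], "g")
        omega_b = ba_chain(edge_ends[i][1], "g")
        # structures forcing Gamma_(i,a) and Gamma_(i,b) on g
        D.append(ba_chain(i, f"fa({omega_a}, w)"))
        D.append(ba_chain(i, f"fa({omega_b}, w)"))
    return D

print(build_sample(1, {1: (1, 2)}))
```

The 'filler' types are not produced by this sketch, since they are added to GF(D) rather than to the structures themselves.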
8 Note that this does not mean that the function is not responsive, since it will only be undefined if the input is not from a language from its class.
9 The fa-label is used here as a separator between the 'outer' and 'inner' ba-labels in the structure. Obviously the corresponding Γ types can only be unified if they have the same number of 'outer' and 'inner' \ operators. This idea can be generalized so that it can be used to encode sequences of arbitrary length of natural numbers in categorial types.
Let G = GF(D):

G:  g ↦ Γ(1,a), Γ(1,b), Π1, . . . , Γ(e,a), Γ(e,b), Πe
    a ↦ t/A1, . . . , t/Ae, Filler
    w ↦ W1, W2, . . .

Suppose this sample D (see footnote 10) is input for ϕVGk, k = 2e. Then, by Corollary 9, for each i, 1 ≤ i ≤ e, the type Πi assigned to g has to unify with one of the two types it can unify with, i.e. either Γ(i,a) or Γ(i,b). For every such unification step a substitution, either {Λ_fe(i,a) ← Ai} or {Λ_fe(i,b) ← Ai}, is obtained. This unification step is intended to correspond to including vertex fe(i, a) or fe(i, b), 1 ≤ i ≤ e, in the vertex cover. Applying these substitutions to G produces grammar G′:

G′: g ↦ Γ(1,a), Γ(1,b), . . . , Γ(e,a), Γ(e,b)
    a ↦ t/Λ_fe(1,fc(1)), . . . , t/Λ_fe(e,fc(e)), Filler
    w ↦ W1, W2, . . .

Here fc(i) is the choice function from i, 1 ≤ i ≤ e, to the vertex (either a or b) chosen to cover edge i in the final vertex cover. In order to obtain a grammar that is k-valued (k = 2e), there cannot be more than 2e types assigned to a. Since there are d = 2e − max 'filler' types assigned to a, the rest of the types can only be at most max types of the form t/Λx (these obviously cannot unify with any type in Filler). Since the Λ-types are pairwise not unifiable, this means that only max different types of the form t/Λx are assigned to a. Note that whether or not all types assigned to w are unified has no consequence for the structure language associated with G′, and that there remain exactly 2e Γ types assigned to g. The grammar G′ is the final output, if ϕVGk(D), k = 2e, is defined. This output can be read as a solution by adding vertex j to the vertex cover for each of the types of the form t/Λj assigned to a. Since the function is responsive and prudent, its being undefined for D implies there is no G with D ∈ FL(G) and G ∈ Gk-valued, which means that a vertex cover of size max does not exist for Graph. Since both the conversion from graph to input sample and the conversion from resulting grammar to set of vertices can be done in polynomial time, the learning function has to be NP-hard. This implies that its update-time is NP-hard, since its total computation time is Σ_{n=1}^{size(D)} (update-time for the nth element of D).
10 We show only GF(D) instead of D itself, since the properties of D that are relevant to this discussion are much more accessible in this form.
Here size(D) is polynomial in the size of the graph, as is the size of each element of D. Any grammar output by such a function that is k-valued, k = 2e, will look like G′. Since such a grammar will correspond to a vertex cover, any function that can learn any of these classes prudently and is responsive and consistent on that class will be able to solve the decision problem related to the vertex-cover problem after a polynomial-time reduction. □

Corollary 14. (Of the proof) Learning Gleast-valued from structures by means of a function that is responsive and consistent on its class and learns its class prudently is NP-hard.

Obviously, exactly the same proof works for learning Gleast-valued, since, because of the introduction of the 'filler' types, there cannot be any grammars obtained from D with k < v, so the least value for k is v.

Corollary 15. (Of the proof) Learning Gleast-card from structures by means of a function that is responsive and consistent on its class and learns its class prudently is NP-hard.

The proof works for learning Gleast-card, since the k-valued grammar obtained by learning Gk-valued is optimal (all symbols have k non-unifiable types assigned; recall the remark concerning symbol w), and all optimal grammars obtainable from D have the same cardinality.

The proof of Theorem 13 cannot be used for Gminimal. However, the relation between Gminimal and Gleast-card provides a different route for proving NP-hardness. Let ϕ be a computable function for a class L that learns L consistently. Then the learning function ϕ′ for a class L′, L ⊆ L′, that learns L′ consistently has a time complexity that is the same as, or worse than, the time complexity of ϕ. From this and Proposition 10 the following proposition is straightforward:

Proposition 16. Learning Gminimal by means of a function that is responsive and consistent on Gminimal and learns Gminimal prudently is NP-hard.

A proof of NP-hardness gives evidence for the intractability of a problem. After such a proof has been given it is natural to ask whether the problem is NP-complete. In order to prove NP-completeness of a problem L that has been shown to be NP-hard, one needs to show that L ∈ NP. This would imply that there exists an algorithm that verifies solutions for L in polynomial time. Normally this is the 'easy' part of an NP-completeness proof. In this case, however, it is not at all clear what such algorithms are supposed to do, let alone whether they exist. Their task, among other things, is checking whether the grammar is consistent with the input sequence, whether it is in the right class, and whether the grammar is justified in giving its conjecture. Obviously the last task is the most problematic.
Checking consistency is polynomial in |D| (since membership is decidable in polynomial time for context-free structure languages), but it is not even clear whether, for all ϕ learning any of the classes under discussion, |D| may be exponential in |G| for some G in ϕ's class. Checking whether a grammar is k-valued, or optimal, can obviously be done in polynomial time, but even checking whether grammar G can be derived from grammar G′ by unification may not be so simple. Defining this criterion and proving existence of a polynomial-time verification algorithm is expected to be much harder than the proof of Theorem 13.

An interesting question is whether there exist (non-trivial) learnable subclasses (see footnote 11) of the classes under discussion for which polynomial-time consistent learning algorithms do exist. A necessary (but not sufficient) condition for such a class would be that vertex-cover problems cannot be recast as learning problems in polynomial time. It is easy to see that this requires a class definition that is not (crucially) based on the number of type assignments in the grammar. It turns out to be possible to give sufficient conditions for such a class, using the following theorem ([13], Theorem 9, where it is attributed to [2] and [18]):

Theorem 17. Let L be a class. If L has finite thickness, and the membership problem and the minimal language problem for L are computable in polynomial time, then L is polynomial time inferable from positive data (see footnote 12).

The minimal language problem is defined as 'Find a grammar from a set that generates a language that is minimal relative to the languages generated by the other grammars in the set'. The complexity of this problem is of the same order as the complexity of the inclusion problem, i.e. 'given grammars G1 and G2, is the language associated with G1 a subset of the language associated with G2?'. The inclusion problem is known to be undecidable for context-free string languages. However, it is decidable for the structure languages associated with CCGs. An algorithm based on algebras is given in [6] that runs in exponential time. In [11] it is shown that the inclusion problem for rigid structure languages is simply unification, so it can run in linear time. It may be the case that such simple functions exist for other subclasses of the context-free structure languages. We are not aware of any complexity results for the inclusion problem for context-free structure languages.

Let order be a function over categorial types that computes their size by counting the number of operators contained in them. Let G^b_k-valued be the subclasses of the classes Gk-valued where the order of all types in any grammar G ∈ G^b_k-valued is bounded by some b. It is easy to see that any class G^b_k-valued has finite thickness. The membership problem is decidable in polynomial time for G^b_k-valued
11 Obviously, consistently learning any superclass of the classes under discussion is an NP-hard problem.
12 Polynomial time inference is defined in [13] as 'there exists a consistently, responsively and conservatively working learning function that has polynomial update time'; this corresponds to the Angluin-style definition of efficient learning.
(since all the classes under discussion are subclasses of context-free grammars). Thus it suffices to show that the inclusion problem is decidable in polynomial time for G^b_k-valued in order to prove that G^b_k-valued is polynomial time inferable from positive data.

In [20] a proof of an NP-hardness result for learning the class of pattern languages was given. This proof has a structure that is totally different from ours; it crucially depends on the complexity of membership testing, which is NP-complete for that class. Note that in [20] it is claimed that any polynomial time learner can be converted to a polynomial time consistent learner. This may seem to be in direct contradiction with our results, but in [20] a number of definitions of consistency are used that differ from ours, and are actually combinations of other constraints. None of these constraints require the learning function to be prudent. Therefore there is no contradiction with our results, since these crucially rely on the prudence constraint.
5 Conclusion and Further Research
In this paper it has been shown that learning any of the classes Gleast-valued, Gleast-card, and Gminimal from structures by means of a learning function that is consistent on its class is NP-hard. The result for the classes Gk-valued is weaker: learning them by means of one function that, for each k, learns Gk-valued and is consistent on its class is NP-hard. It is an open question whether there exist polynomial-time learning functions for Gk-valued for each k separately, although we feel it is unlikely. Showing intractability for k = 2 would imply intractability for all k > 1, since Gk-valued ⊆ G(k+1)-valued.

It is a well-known fact that learning functions for any learnable class without consistency and monotonicity constraints can be transformed to trivial learning functions that have polynomial update-time (see Subsection 1.2). It is an open question whether there exist 'intelligent' inconsistent learning functions that have polynomial update-time for the classes under discussion.

Since the relation between structure language and string language is so clear-cut, it is in general easy to transfer results from one to the other. In [11] some results concerning learnability of classes of structure languages were used to obtain learnability results for the corresponding classes of string languages. It might be possible to do the same with complexity results, i.e. obtain an NP-hardness result for learning Gleast-valued from strings, for example.

The proof of Theorem 13 relies on a subclass of languages that can all be identified with sequences that have a length polynomial in the size of their associated grammars. This is not necessarily true for all languages in the class, so data-complexity issues may make the complexity of learning these classes in practice even worse than Theorem 13 suggests.
References
[1] D. Angluin. Finding common patterns to a set of strings. In Proceedings of the 11th Annual Symposium on Theory of Computing, pages 130–141, 1979.
[2] D. Angluin. Finding patterns common to a set of strings. Journal of Computer System Sciences, 21:46–62, 1980. [3] D. Angluin. Inductive inference of formal languages from positive data. Information and Control, 45:117–135, 1980. [4] J. Barzdin. Inductive inference of automata, functions and programs. In Proceedings International Congress of Math., pages 455–460, Vancouver, 1974. [5] W. Buszkowski. Discovery procedures for categorial grammars. In E. Klein and J. van Benthem, editors, Categories, Polymorphism and Unification. University of Amsterdam, 1987. [6] W. Buszkowski. Solvable problems for classical categorial grammars. Bulletin of the Polish Academy of Sciences: Mathematics, 34:373–382, 1987. [7] W. Buszkowski and G. Penn. Categorial grammars determined from linguistic data by unification. Studia Logica, 49:431–454, 1990. [8] R. Daley and C. Smith. On the complexity of inductive inference. Information and Control, 69:12–40, 1986. [9] E. M. Gold. Language identification in the limit. Information and Control, 10:447– 474, 1967. [10] Sanjay Jain, Daniel Osherson, James Royer, and Arun Sharma. Systems that Learn: An Introduction to Learning Theory. The MIT Press, Cambridge, MA., second edition, 1999. [11] M. Kanazawa. Learnable Classes of Categorial Grammars. CSLI Publications, Stanford University, 1998. [12] Michael J. Kearns and Umesh V. Vazirani. An Introduction to Computational Learning Theory. Cambridge, Mass.: MIT Press, 1994. [13] Satoshi Matsumoto, Yukiko Hayashi, and Takayoshi Shoudai. Polynomial time inductive inference of regular term tree languages from positive data. ALT 97, pages 212–227, 1997. [14] T. Motoki, T. Shinohara, and K. Wright. The correct definition of finite elasticity: Corrigendum to identification of unions. In The Fourth Workshop on Computational Learning Theory. San Mateo, Calif.: Morgan Kaufmann, 1991. [15] D. N. Osherson, D. de Jongh, E. Martin, and S. Weinstein. Formal learning theory. In Handbook of Logic and Language. (J. van Benthem and A. ter Meulen, editors.), Elsevier Science Publishers, 1996. [16] D. N. Osherson, M. Stob, and S. Weinstein. Systems that Learn: An Introduction to Learning Theory for Cognitive and Computer Scientists. MIT Press, Cambridge, MA., 1986. [17] L. Pitt. Inductive inference, dfas, and computational complexity. In K. P. Jantke, editor, Proceedings of International Workshop on Analogical and Inductive Inference, number 397 in Lecture Notes in Computer Science, pages 18–44, 1989. [18] T. Shinohara. Studies on Inductive Inference from Positive Data. PhD thesis, Kyushu University, 1986. [19] Werner Stein. Consistent polynominal identification in the limit. In Algorithmic Learning Theory (ALT), volume 1501 of Lecture Notes in Computer Science, pages 424–438, Berlin, 1998. Springer-Verlag. [20] R. Wiehagen and T. Zeugmann. Learning and consistency. In K. P. Jantke and S. Lange, editors, Algorithmic Learning for Knowledge-Based Systems, Lecture Notes in Artificial Intelligence 961, pages 1–24. Springer-Verlag, 1995. [21] Keith Wright. Identification of unions of languages drawn from an identifiable class. In The 1989 Workshop on Computational Learning Theory, pages 328–333. San Mateo, Calif.: Morgan Kaufmann, 1989.
Computation of Substring Probabilities in Stochastic Grammars
Ana L. N. Fred
Instituto de Telecomunicações, Instituto Superior Técnico, IST-Torre Norte, Av. Rovisco Pais, 1049-001 Lisboa, Portugal
[email protected]
Abstract. The computation of the probability of string generation according to stochastic grammars, given only some of the symbols that compose the string, underlies pattern recognition problems concerning prediction and/or recognition based on partial observations. This paper presents algorithms for the computation of substring probabilities in stochastic regular languages. Situations covered include prefix, suffix and island probabilities. The computational time complexity of the algorithms is analyzed.
1 Introduction
The computation of the probability of string generation according to stochastic grammars, given only some of the symbols that compose the string, underlies some pattern recognition problems such as prediction and recognition of patterns based on partial observations. Examples of this type of problem are illustrated in [2,3,4], in the context of automatic speech understanding, and in [5,6], concerning the prediction of a particular physiological state based on the syntactic analysis of electro-encephalographic signals. Another potential application, in the area of image understanding, is the recognition of partially occluded objects based on their string contour descriptions.

Expressions for the computation of substring probabilities according to stochastic context-free grammars written in Chomsky Normal Form have been proposed [1,2,3]. In this paper, algorithms for the computation of substring probabilities for regular-type languages expressed by stochastic grammars of the form

σ → Fi ,   Fi → α ,   Fi → α Fj ,   α ∈ Σ* ,   σ, Fi, Fj ∈ VN

(σ representing the grammar start symbol and VN, Σ corresponding to the non-terminal and terminal symbol sets, respectively) are described. This type of grammar arises, for instance, in the process of grammatical inference based on Crespi-Reghizzi's method [7] when structural samples assume the form [. . . [dc[f gb[e[cd[ab]]]] (meaning some sort of temporal alignment of sub-patterns).
The algorithms presented next explore the particular structure of these grammars, being essentially dynamic programming methods. Situations described include the recognition of fixed length strings (probability of exact recognition – section 3.1, highest probability interpretation – section 3.2) and arbitrary length strings (prefix – section 3.3, suffix – section 3.5 and island probabilities – section 3.4). The computational complexity of the methods is analyzed in terms of worst case time complexity. Minor and obvious modifications of the above algorithms enable the computation of probabilities according to the standard form of regular grammars.
2 Notation and Definitions
Let G = (VN, Σ, Rs, σ) be a stochastic context-free grammar, where VN is the finite set of non-terminal symbols or syntactical categories; Σ is the set of terminal symbols (vocabulary); Rs is a finite set of productions of the form pi : A → α, with A ∈ VN and α ∈ (Σ ∪ VN)*, the star representing any combination of symbols in the set, and pi being the rule probability; and σ ∈ VN is the start symbol. When rules take the particular form A → aB or A → a, with A, B ∈ VN and a ∈ Σ, the grammar is designated as finite-state or regular. Throughout the text the symbols A, G and H will be used to represent non-terminal symbols. The following definitions are also useful:
– w1 . . . wn denotes a finite sequence of terminal symbols;
– H ⇒* α denotes the derivations of α from H through the application of an arbitrary number of rules;
– CT(γ) denotes the number of terminal symbols in γ (with repetitions);
– CN(γ) denotes the number of non-terminal symbols in γ (with repetitions);
– n_min^(H,G) = min{ n , max_γ {CT(γ) : H → γG} }.
The following graphical notation is used to represent derivation trees:
– Arcs are associated with the direct application of rules; for instance, the rule H → w1 w2 G is represented by arcs from H to its children w1, w2 and G.
– A triangle represents the derivation trees having the top non-terminal symbol as root and leading to the string on the base of the triangle; for instance, a triangle with root H over the base w1 . . . wn Σ* stands for H ⇒* w1 . . . wn Σ*.
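Before turning to the algorithms, it helps to fix a concrete data structure for grammars of this regular-type form. The representation below is an illustrative assumption used by the sketches in the next section (it is not from the paper): for each non-terminal, rules are stored as (probability, terminal string, next non-terminal or None), plus a list of start rules σ → G.

```python
# A stochastic grammar of the form  sigma -> F_i,  F_i -> alpha,  F_i -> alpha F_j.
# rules[H] = list of (prob, alpha, G) with alpha a non-empty tuple of terminals and
# G either a non-terminal name or None (for terminating rules H -> alpha).
# "start" = list of (prob, G) for the rules sigma -> G.

example_grammar = {
    "start": [(1.0, "F1")],
    "rules": {
        "F1": [(0.6, ("a",), "F2"), (0.4, ("a", "b"), None)],
        "F2": [(0.7, ("b",), None), (0.3, ("b",), "F1")],
    },
}
```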
3 Algorithms Description
3.1 Probability of Derivation of Exact String – Pr(H ⇒* wi . . . wi+n)
Let Pr(H ⇒* wi . . . wi+n) be the probability of all derivation trees having H as root and generating exactly wi . . . wi+n. According to the type of grammars considered, the computation of this probability can be obtained as follows (see figure 1):
Pr(σ ⇒* wi . . . wi+n) = Σ_G Pr(σ → G) Pr(G ⇒* wi . . . wi+n)    (1)

Pr(H ⇒* wi . . . wi+n), H ≠ σ
  = Pr(H → wi . . . wi+n) + Σ_G Σ_{k=1}^{n_min^(H,G)} Pr(H → wi . . . wi+k−1 G) Pr(G ⇒* wi+k . . . wi+n)    (2)
Fig. 1. Terms of the expression 2.
This can be summarized by the iterative algorithm:
1. For all non-terminal symbols H ∈ VN − {σ}, determine Pr(H → wi+n).
2. For k = 1, . . . , n and for all non-terminal symbols H ∈ VN − {σ}, compute:

Pr(H ⇒* wi+n−k . . . wi+n) = Pr(H → wi+n−k . . . wi+n)
  + Σ_G Σ_{j=1}^{n_min^(H,G)} Pr(H → wi+n−k . . . wi+n−k+j−1 G) Pr(G ⇒* wi+n−k+j . . . wi+n)    (3)

3. Pr(σ ⇒* wi . . . wi+n) = Σ_G Pr(σ → G) Pr(G ⇒* wi . . . wi+n)    (4)
This corresponds to filling in the matrix of figure 2, column by column, from right to left. In this figure, each element corresponds to the probability of all derivation trees for the substring on the top of the column, with root at the non-terminal symbol indicated on the left of the row, i.e. Pr(Hi ⇒* wj . . . wi+n).
Fig. 2. Array representing the syntactic recognition of strings according to the algorithm.
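As an illustration only (not the authors' implementation), the three steps above can be transcribed into a short dynamic program over the grammar representation assumed at the end of section 2; the table P[H][j] plays the role of the array of figure 2 and is filled column by column from the right.

```python
def exact_string_probability(grammar, w):
    """Pr(sigma =>* w) for a grammar in the regular-type form of section 1.

    grammar = {"start": [(p, G), ...], "rules": {H: [(p, alpha, G_or_None), ...]}}
    with alpha a non-empty tuple of terminals; w is a sequence of terminals.
    """
    n = len(w)
    rules = grammar["rules"]
    # P[H][j] = Pr(H =>* w[j:]); P[H][n] = 0 since no rule derives the empty string
    P = {H: [0.0] * (n + 1) for H in rules}
    for j in range(n - 1, -1, -1):
        suffix = tuple(w[j:])
        for H, hrules in rules.items():
            total = 0.0
            for p, alpha, G in hrules:
                k = len(alpha)
                if G is None:
                    if alpha == suffix:              # direct rule H -> alpha
                        total += p
                elif 0 < k <= n - j and alpha == suffix[:k]:
                    total += p * P[G][j + k]         # H -> alpha G, G derives the rest
            P[H][j] = total
    return sum(p * P[G][0] for p, G in grammar["start"])

# With the example grammar sketched in section 2,
# Pr(sigma =>* "ab") = 0.4 + 0.6 * 0.7 = 0.82.
```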
Notice that the calculations in step 2 are based only on direct rules and on access to previously computed probabilities in columns to the right of the position under calculation.

Computational complexity analysis. For strings of length n, the computational cycle is repeated n times, all non-terminal symbols being considered. Associating with the terms in equation 3
A: Pr(H → wi+n−k . . . wi+n)
B: Σ_G Σ_{j=1}^{n_min^(H,G)} Pr(H → wi+n−k . . . wi+n−k+j−1 G) Pr(G ⇒* wi+n−k+j . . . wi+n)
the worst-case time complexity is of the order

O( (|VN| − 1) δ  +  Σ_{k=1}^{n−1} (|VN| − 1)(α + β)  +  θ )  =  O(|VN| n)

where δ corresponds to step 1, the summation corresponds to step 2 (with α the cost of computing A and β the cost of computing B), and θ corresponds to step 3.
3.2 Maximum Probability of Derivation – Pm(H ⇒* wi . . . wi+n)
Let Pm(H ⇒* w1 . . . wn) denote the highest probability over all derivation trees having H as root and producing exactly w1 . . . wn, and define the matrix Mu as follows:

Mu[i, j] = Pm(Hi ⇒* wj . . . wn),   i = 1, . . . , |VN|,   j = 1, . . . , n    (5)

Observing that

Pm(σ ⇒* w1 . . . wn) = max_G { Pm(G ⇒* w1 . . . wn) }    (6)

and

Pm(H ⇒* w1 . . . wn), H ≠ σ = max { Pr(H → w1 . . . wn) , max_{i,G} { Pr(H → w1 . . . wi G) Pm(G ⇒* wi+1 . . . wn) } }    (7)

the following algorithm computes the desired probability:
1. For i = 1, . . . , |VN|, Hi ≠ σ:
   Mu[i, n] = Pr(Hi → wn) if (Hi → wn) ∈ Rs, and 0 otherwise.
2. For j = n − 1, . . . , 1 and for i = 1, . . . , |VN|, Hi ≠ σ:
   Mu[i, j] = max { Pr(Hi → wj . . . wn) , max_{k>j, l} { Pr(Hi → wj . . . wk Hl) Mu[l, k + 1] } }
3. For i : Hi = σ:
   Mu[i, 1] = max_j { Pr(σ → Hj) Mu[j, 1] }
This algorithm corresponds to the computation of an array similar to the one in figure 2, but where probabilities refer to maximum values of single derivations instead of total probability of derivation when ambiguous grammars are considered. Based on the similarity between this algorithm and the one developed in section 3.1 it is straightforward to conclude that it runs in O (|VN |n) time.
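The maximum-probability computation only replaces the sums of the previous sketch by maxima; the following is again an illustrative sketch over the same assumed grammar representation.

```python
def max_derivation_probability(grammar, w):
    """Pm(sigma =>* w): probability of the single most probable derivation tree."""
    n = len(w)
    rules = grammar["rules"]
    M = {H: [0.0] * (n + 1) for H in rules}
    for j in range(n - 1, -1, -1):
        suffix = tuple(w[j:])
        for H, hrules in rules.items():
            best = 0.0
            for p, alpha, G in hrules:
                k = len(alpha)
                if G is None and alpha == suffix:
                    best = max(best, p)                     # direct rule H -> alpha
                elif G is not None and 0 < k <= n - j and alpha == suffix[:k]:
                    best = max(best, p * M[G][j + k])       # continue with G
            M[H][j] = best
    return max((p * M[G][0] for p, G in grammar["start"]), default=0.0)
```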
3.3 Prefix Probability – Pr(H ⇒* wi . . . wi+n Σ*)
The probability of all derivation trees having H as root and generating arbitrary-length strings with the prefix wi . . . wi+n – Pr(H ⇒* wi . . . wi+n Σ*) – can be expressed as follows:
Pr(σ ⇒* wi . . . wi+n Σ*) = Σ_G Pr(σ → G) Pr(G ⇒* wi . . . wi+n Σ*)    (8)

Pr(H ⇒* wi . . . wi+n Σ*), H ≠ σ
  = Pr(H → wi . . . wi+n)
  + Σ_G Σ_{k=1}^{n_min^(H,G)} Pr(H → wi . . . wi+k−1 G) Pr(G ⇒* wi+k . . . wi+n Σ*)
  + Σ_G Pr(H → wi . . . wi+n G)
  + Σ_G Σ_{k∈N+} Pr(H → wi . . . wi+n v1 . . . vk G)
  = Pr(H → wi . . . wi+n)
  + Σ_G Σ_{k=1}^{n_min^(H,G)} Pr(H → wi . . . wi+k−1 G) Pr(G ⇒* wi+k . . . wi+n Σ*)
  + Σ_G Σ_{k∈N0+} Pr(H → wi . . . wi+n v1 . . . vk G)    (9)
H
H
H
HH . . . HH
∗ w i
A1
H
wi ...wi+n−1
H
wi+1 ...wi+n Σ ∗
A1
wi ...wi+n
H
wi+n
H
HH
H∗ Σ
A1
H∗H
wi ...wi+n Σ
H
Σ∗
H
A1
H
Σ∗
H
Fig. 3. Terms of the expression 9.
The previous expression suggest the following iterative algorithm: 1. For all nonterminal symbols H ∈ VN − {σ}, determine X Pr(H → wi+n v1 . . . vk ) Pr(H → wi+n ) + k∈N +
2. For k = 1, . . . , n and for all nonterminal symbols H ∈ VN − {σ}, compute ∗
Pr(H ⇒ wi+n−k . . . wi+n Σ ∗ )
Computation of Substring Probabilities in Stochastic Grammars
109
= Pr(H → wi+n−k . . . wi+n ) + +
(H,G) X X kmin
G
Pr(H → wi+n−k . . . wi+n−k+j−1 G) ×
j=1 ∗
+
X X
×Pr(G ⇒ wi+n−k+j . . . wi+n Σ ∗ ) + Pr(H → wi+n−k . . . wi+n v1 . . . vj G)
(10)
G j∈N + 0
3. ∗
Pr(σ ⇒ wi . . . wi+n Σ ∗ ) =
X
∗
Pr(σ → G)Pr(G ⇒ wi . . . wi+n Σ ∗ )
(11)
G
This algorithm, like the one of section 3.1, corresponds to filling in a parsing matrix similar to the one in figure 2. The similarity with the algorithm in section 3.1 leads to the conclusion that this algorithm has O (|VN |n) time complexity. 3.4
∗
Island Probability – Pr (H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) ∗
The island probability consists of Pr (H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ), the probability of all derivation trees with root in H that generate arbitrary length sequences containing the subsequence wi . . . wi+n . Let X def Pr(H → γG) , γ ∈ Σ ∗ (12) PR (H → G) = γ
= probability of rewriting H by sequences with the nonterminal symbol G as suffix One can write ∗
Pr(σ ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) =
X
∗
Pr(σ → G)Pr(G ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) (13)
G
∗
Pr(H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) , H 6= σ X ∗ PR (H → G)Pr(G ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) + = G
+
(H,G) X nmin X
G
+
X G
∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n Σ ∗ ) +
j=1 ∗
PR (H → wi . . . wi+n G) Pr(G ⇒ Σ ∗ ) + {z } | =1
110
A.L.N. Fred
+ +
XX G
j,k
G
j,k
XX
∗
Pr(H → v1 . . . vk wi . . . wi+n z1 . . . zj G) Pr(G ⇒ Σ ∗ ) + {z } | =1
Pr(H → v1 . . . vk wi . . . wi+n z1 . . . zj )
(14)
max
For strings sufficiently long ( n > G {CT (γ) : G → γ}) the last three terms do not exist, therefore we will ignore them from now on. H
H
H
H
H
HH HH HH HH ∗ ∗ ∗ ∗ ∗ Σ w ...w Σ w ...w Σ w ...w Σ Σ wi ...wi+n Σ ∗ Σ∗ i i i+n G i i+n i+k G G G H H H H H H H H Σ ∗ wi ...wi+n
Σ∗
wi+k+1 ...wi+n Σ ∗
Σ∗
Σ∗
Fig. 4. Terms of the expression 14.
∗
Pr(H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) , H 6= σ X ∗ = PR (H → G)Pr(G ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) + G
+
(H,G) X X nmin
G
∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n Σ ∗ ) (15)
j=1
Recursively applying the above expression and after some manipulation (details can be found in [6]) one obtains: ∗
Pr(H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) , H 6= σ X X X ∗ QR (H ⇒ A) PR (A → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n Σ ∗ ) = A
+
X X
G j∈N + ∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n Σ ∗ )
(16)
G j∈N +
with ∗
QR (H ⇒ G) = Pr(H ⇒ Σ ∗ G) = PR (H → G) + X X + PR (H → A)PR (A → G) + PR (H → A1 )PR (A1 → A2 ) × A
A1 A2
×PR (A2 → G) + . . . QR (H ⇒ G) obeys the equation X PR (H → A)QR (A ⇒ G) + PR (H → G) QR (H ⇒ G) = A
(17) (18)
Computation of Substring Probabilities in Stochastic Grammars
111
Defining the matrices PR [H, G] = PR (H → G) QR [H, G] = QR (H ⇒ G)
(19) (20)
QR = PR [I − PR ]−1
(21)
Q is given by [6]:
The algorithm can thus be described as: 1. Off-line computation of QR = PR [I − PR ]−1 . ∗ ∗ 2. On-line computation of Pr(G ⇒ wn Σ ∗ ) , Pr(G ⇒ wn−1 wn Σ ∗ ) , . . . , Pr ∗ (G ⇒ w2 . . . wn Σ ∗ ) for all nonterminal symbols G ∈ VN − {σ} using the algorithm in section 3.3. 3. For all nonterminal symbols H ∈ VN − {σ} compute ∗
Pr(H ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) X X X = QR (H ⇒ A) PR (A → wi . . . wi+j−1 G) × A
G j∈N + ∗
+
X X
×Pr(G ⇒ wi+j . . . wi+n Σ ∗ ) + ∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n Σ ∗ )
G j∈N +
4. ∗
Pr(σ ⇒ Σ ∗ wi . . . wi+n Σ ∗ ) =
X
∗
Pr(σ → G)Pr(G ⇒ Σ ∗ wi . . . wi+n Σ ∗ )
G
The required on-line computations have the following time complexity:
δ = O (|VN | − 1)(n − 1)α + (|VN | − 1)|VN |β + |{z} | {z } | {z } step 4 step 2 step 3 = max(O (|VN |n) , O |VN |2 3.5
∗
Suffix Probability – Pr(H ⇒ Σ ∗ wi . . . wi+n ) ∗
Let Pr (H ⇒ Σ ∗ wi . . . wi+n ) be the probability of all derivation trees with root in H generating arbitrary length strings having wi . . . wi+n as a suffix. This probability can be derived as follows: ∗
Pr(σ ⇒ Σ ∗ wi . . . wi+n ) =
X G
∗
Pr(σ → G)Pr(G ⇒ Σ ∗ wi . . . wi+n )
(22)
112
A.L.N. Fred ∗
Pr(H ⇒ Σ ∗ wi . . . wi+n ) , H 6= σ X ∗ = PR (H → G)Pr(G ⇒ Σ ∗ wi . . . wi+n ) + G
+
(H,G) X X nmin
j=1
G
+
X
∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n ) +
PR (H → wi . . . wi+n )
(23)
G
H
H
H
Σ ∗ wi ...wi+n
wi+k+1 ...wi+n
HH HH Σ ∗ wi ...wi+k G Σ ∗ wi ...wi+n Σ∗ G HH HH Fig. 5. Terms of the expression 23. max
For strings sufficiently long ( n > G {CT (γ) : G → γ}) the third term does not exist, so it will be ignored henceforth. Recursively applying the resulting expression to the second part of the first term one obtains: ∗
Pr(H ⇒ Σ ∗ wi . . . wi+n ) , H 6= σ X X ∗ = PR (H → wi . . . wi+j−1 A1 )Pr(A1 ⇒ wi+j . . . wi+n ) + A1 j∈N +
X X
+
PR (H → A1 )PR (A1 → wi . . . wi+j−1 A2 ).
A1 ,A2 j∈N +
.Pr(A2 ⇒ wi+j . . . wi+n ) + +... + X + PR (H → A1 )PR (A1 → A2 ) . . . A1 ,...,Ak ∗
(24) . . . PR (Ak−1 → wi . . . wi+j−1 Ak )Pr(Ak ⇒ wi+j . . . wi+n ) + . . . X X X ∗ = QR (H ⇒ A) PR (A → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n ) + A
+
X X
G j∈N + ∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n )
G j∈N +
The algorithm is then: 1. Off-line computation of QR = PR [I − PR ]−1 .
(25)
Computation of Substring Probabilities in Stochastic Grammars ∗
∗
113 ∗
2. On-line computation of Pr(G ⇒ wn ) , Pr(G ⇒ wn−1 wn ) , . . . , Pr(G ⇒ w2 . . . wn ) for all non-terminal symbols G ∈ VN − {σ} using the algorithm in section 3.1. 3. For all non-terminal symbols H ∈ VN − {σ} compute ∗
Pr(H ⇒ Σ ∗ wi . . . wi+n ) X X X ∗ = QR (H ⇒ A) PR (A → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n ) A
+
X X
G j∈N + ∗
PR (H → wi . . . wi+j−1 G)Pr(G ⇒ wi+j . . . wi+n )
G j∈N +
4. ∗
Pr(σ ⇒ Σ ∗ wi . . . wi+n ) =
X
∗
Pr(σ → G)Pr(G ⇒ Σ ∗ wi . . . wi+n )
G
On-line computations have the time complexity:
δ = O (|VN |n) O (|VN | − 1)(n − 1)α + (|VN | − 1)β + |{z} | {z } | {z } step 4 step 2 step 3
4
Conclusions
This paper described several algorithms for the computation of substring probabilities according to grammars written in the form σ → Fi
,
Fi → α
,
Fi → αFj
,
α ∈ Σ∗
,
σ, Fi , Fj ∈ VN
Table 1 summarizes the probabilities considered here, and the order of complexity of the associated algorithms. Table 1. Summary of proposed algorithms for the computation of sub-string probabilities. Algorithm Expression time Complexity ∗ Fixed length Pr(H ⇒ w1 . . . wn ) O(|VN |n) ∗ strings Pm (H ⇒ w1 . . . wn ) O(|VN |n) ∗ Arbitrary Pr(H ⇒ w1 . . . wn Σ ∗ ) O(|VN |n) ∗ length Pr(H ⇒ Σ ∗ w1 . . . wn Σ ∗ ) max(O (|VN |n) , O |VN |2 ∗ strings Pr(H ⇒ Σ ∗ w1 . . . wn ) O(|VN |n)
114
A.L.N. Fred
More general algorithms for the computation of sub-string probabilities according to stochastic context-free grammars, written in Chomsky Normal Form, can be found in [1,2,3]. However, the later have O(n3 ) time complexity [2]. The herein proposed algorithms, exhibiting linear time complexity in string’s length, represent a computationally appealing alternative to be used whenever the application at hand can adequately be modeled by the types of grammars described above. Examples of application of the algorithms described in this paper can be found in [5,6,8,9].
References 1. F. Jelinek, J. D. Lafferty and R. L. Mercer. Basic Methods of Probabilistic Context Free Grammars. In Speech Recognition and Understanding. Recent Advances, pages 345–360. Springer-Verlag, 1992. 2. A. Corazza, R. De Mori, R. Gretter, and G. Satta. Computation of Probabilities for an Island-Driven Parser. IEEE Trans. Pattern Anal. Machine Intell., vol. 13, No. 9, pages 936–949, 1991. 3. A. Corazza, R. De Mori, and G. Satta. Computation of Upper-Bounds for Stochastic Context-Free Languages. In proceedings AAAI-92, pages 344–349, 1992. 4. A. Corazza, R. De Mori, R. Gretter, and G. Satta. Some Recent results on Stochastic Language Modelling. In Advances in Structural and Syntactic Pattern Recognition, World-Scientific, pages 163–183, 1992. 5. A. L. N. Fred, A. C. Rosa, and J. M. N. Leit˜ ao. Predicting REM in sleep EEG using a structural approach. In E. S. Gelsema and L. N. Kanal, editors, Pattern Recognition in Practice IV, pages 107 – 117. Elsevier Science Publishers, 1994. 6. A. L. N. Fred, Structural Pattern Recognition: Applications in Automatic Sleep Analysis, PhD Thesis, Technical University of Lisbon, 1994. 7. K. S. Fu and T. L. Booth. Grammatical inference: Introduction and survey – part I and II. IEEE Trans. Pattern Anal. Machine Intell., PAMI-8:343–359, May 1986. 8. Ana L. N. Fred and T. Paiva. Sleep Dynamics and Sleep Disorders: a Syntactic Approach to Hypnogram Classification. Em 10th Nordic-Baltic Conference on Biomedical Engineering and 1st International Conference on Bioelectromagnetism, pp 395–396, Tampere, Finlˆ andia Junho, 1996. 9. A. L. N. Fred, J. S. Marques, P. M. Jorge, Hidden Markov Models vs Syntactic Modeling in Object Recognition, Proc. Intl. Conference on Image Processing,ICIP’97, 1997.
A Comparative Study of Two Algorithms for Automata Identification P. Garc´ıa, A. Cano, and J. Ruiz Depto. de Sistemas Inform´ aticos y Computaci´ on. Universidad Polit´ecnica de Valencia. Valencia (Spain).
[email protected] [email protected] [email protected]
Abstract. We describe in this paper the experiments done to compare two algorithms that identify the family of regular languages in the limit, the algorithm of Trakhenbrot and Barzdin/Gold by one hand and the RP N I/Lang algorithm by the other. As a previous step, for a better comparison, we formulate the algorithm of Gold as a merging states in the prefix tree acceptor scheme.
1
Introduction
Finite automata identification from samples of their behavior is an old problem in automata theory and is a central topic in the discipline known as Grammatical Inference, connected to the field of Pattern Recognition. This problem can be formally established as a decision problem as follows: Given an integer n and two disjoint sets of words D+ and D− over a finite alphabet, do there exist a deterministic finite state automaton (DF A) consistent with D+ and D− and having a number of states less than or equal to n? Gold [2] has proved that this problem is N P -complete. Nevertheless, besides the (general) enumeration algorithm, if the sets D+ and D− are somehow representative, there are two algorithms that solve the considered problem in deterministic polynomial time. The first one (1973) is from Trakhtenbrot and Barzdin [6], and will be denoted as T B. The authors describe it as a contraction procedure in a finite tree that represents a uniformly complete data set (a set that contains all the words up to a certain length). If the given data set comprises a certain characteristic set of an unknown regular language, the algorithm finds the smallest DF A that recognizes the language. The characteristic input set required for T B algorithm to converge is of polynomial size in relation with the number of states of the canonical acceptor of the target language. In 1978, Gold [2] rediscovers T B algorithm and applies it to the discipline of grammatical inference (uniformly complete samples are not required). He also specifies the way to obtain indistinguishable states using the so called state characterization matrices. If the input data set does not contain the characteristic A.L. Oliveira (Ed.): ICGI 2000, LNAI 1891, pp. 115–126, 2000. c Springer-Verlag Berlin Heidelberg 2000
116
P. Garc´ıa, A. Cano, and J. Ruiz
set mentioned above, the algorithm guarantees consistency at the cost of outputting the prefix tree acceptor (PTA) of the positive sample. A complete proof of the equivalence of the TB and Gold algorithms would enlarge this paper too much. On the other hand, in 1992, Oncina and García [5] proposed the RPNI (Regular Positive and Negative Inference) algorithm for the inference of regular languages from complete presentation. In the same year, Lang [4] described the TB algorithm (in a way that, if one does not use a uniformly complete sample as input, does not necessarily coincide with TB) and modified it to guarantee consistency in the case of a non-complete input sample, obtaining in this way an algorithm that behaves exactly as RPNI does.

The aim of the present paper is to show the experimental results obtained in order to compare the behavior of both algorithms (TB/Gold on the one hand and RPNI/Lang on the other). As a previous step, we formulate both algorithms in a homogeneous way. The RPNI algorithm is based on merging states in the prefix tree acceptor of the sample, while Gold used the so-called state characterization matrices; but as both algorithms are somehow based on Nerode's theorem, one might expect a similar behavior. The experiments show that the RPNI/Lang algorithm reaches convergence more rapidly than the TB/Gold algorithm does.
2
Definitions and Notation
We suppose that the reader is familiar with the rudiments of formal languages. For further details the reader is referred to [3]. Let Σ be a finite alphabet and let Σ ∗ be the free monoid generated by Σ with concatenation as the internal law and λ as neutral element. A language L over Σ is a subset of Σ ∗ . The elements of L are called words. Given x ∈ Σ ∗ , if x = uv with u, v ∈ Σ ∗ , then u (resp. v) is called prefix (resp. suffix ) of x. Pr(L) (resp. Suf(L)) denotes the set of prefixes (suffixes) of L. The right quotient of a language L by a word u is denoted by u−1 L, that is, u−1 L = {v ∈ Σ ∗ : uv ∈ L}. The right quotient of two languages ∗ is L−1 1 L2 = {v ∈ Σ : ∃u ∈ L1 , uv ∈ L2 }. A deterministic finite automaton (DF A) is a 5-tuple A = (Q, Σ, δ, q0 , F ) where Q is the (finite) set of states, Σ is a finite alphabet, q0 ∈ Q is the initial state, F ⊆ Q is the set of final states and δ is a partial function that maps Q × Σ to Q which can be extended to words by doing δ(q, λ) = q and δ(q, xa) = δ(δ(q, x), a), ∀q ∈ Q, ∀x ∈ Σ ∗ , ∀a ∈ Σ. A word x is accepted by A if δ(q0 , x) ∈ F. The set of words accepted by A is denoted by L(A). A Moore machine is a 6-tuple M = (Q, Σ, Γ, δ, q0 , Φ), where Σ (resp. Γ ) is the input (resp. output) alphabet, δ is a partial function that maps Q × Σ in Q and Φ is a function that maps Q in Γ called output function. The behavior of M is given by the partial function tM : Σ ∗ → Γ defined as tM (x) = Φ(δ(q0 , x)), for every x ∈ Σ ∗ such that δ(q0 , x) is defined. Given two finite sets of words D+ and D− , we define the (D+ , D− )-prefix Moore machine (P T M (D+ , D− )) as the Moore machine having Γ = {0, 1, ↑},
A Comparative Study of Two Algorithms for Automata Identification
117
Q = Pr(D+ ∪ D− ), q0 = λ and δ(u, a) = ua if u, ua ∈ Q and a ∈ Σ. For every state u, the value of the output function associated to u is 1, 0 or ↑ (undefined) depending whether u belongs to D+ , to D− or to the complementary set of D+ ∪ D− . A Moore machine M = (Q, Σ, {0, 1, ↑}, δ, q0 , Φ) is consistent with (D+ , D− ) if ∀x ∈ D+ we have Φ(x) = 1 and ∀x ∈ D− we have Φ(x) = 0.
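The prefix Moore machine can be built directly from this definition. The sketch below is illustrative (not the authors' code): states are represented by their prefixes, Φ by 1, 0 or None for ↑, and the data are those of Example 1 later in the paper.

```python
def prefixes(word):
    return [word[:i] for i in range(len(word) + 1)]

def build_ptm(positive, negative):
    """PTM(D+, D-): states are prefixes of D+ u D-, delta(u, a) = ua when ua is a state."""
    states = set()
    for w in set(positive) | set(negative):
        states.update(prefixes(w))
    alphabet = {a for u in states for a in u}
    delta = {(u, a): u + a for u in states for a in alphabet if u + a in states}
    phi = {u: 1 if u in positive else 0 if u in negative else None for u in states}
    return states, alphabet, delta, phi

D_plus = {"abb", "bb", "bba", "bbb", "babb"}
D_minus = {"", "a", "ba", "aba", "bab"}        # "" stands for the empty word lambda
states, alphabet, delta, phi = build_ptm(D_plus, D_minus)
print(phi["abb"], phi["a"], phi["ab"])         # 1 0 None
```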
3 3.1
A Description of the Algorithms to Be Compared TB/Gold Algorithm
We are going to see the algorithm proposed by Gold [2], based on the so called state characterization matrix. A state characterization matrix over an alphabet Σ is a triple (S, E, T ) where S, E are finite subsets of Σ ∗ and T : (S ∪ SΣ)E → {0, 1, ↑}. The elements of S are called states, and those of E are called experiments. The data (D+ and D− ) contained in a state characterization matrix is:
x ∈ D+ if ∃u ∈ (S ∪ SΣ) ∃v ∈ E, x = uv ∧ T (uv) = 1 x ∈ D− if ∃u ∈ (S ∪ SΣ) ∃v ∈ E, x = uv ∧ T (uv) = 0
For the rest of the words we set T (uv) =↑. Every element u of (S∪SΣ) defines a row which will be called row(u). Given u, v ∈ (S ∪ SΣ), we say that row(u) is obviously different from row(v), and we write row(u) row(v), if there exists an experiment e ∈ E such that T (ue), T (ve) ∈ {0, 1} and T (ue) 6= T (ve). A state characterization matrix is called closed if neither row belonging to SΣ − S is obviously different from all the rows of S. Gold algorithm [2] was initially established using Mealey machines. Here we use Moore machines to present it, as they are, to our opinion, the best way to present RP N I algorithm also (Moore machines associate the outputs to the states). Doing it this way the comparisons between both algorithms can be seen in a more clearly way. Remark. The choice that we make here of the set E does not appear as necessary in [2], where it is only established that E must be suffix-complete (∀x ∈ E, Suf(x) ⊆ E) and that it must be large enough to contain every word belonging to D+ ∪ D− . The algorithm begins by setting S = {λ} and E as the right quotient of the set of suffixes of the sample by the alphabet. We have to maintain two sets of words (each word defines a row), the set S and the set SΣ − S . At every step we have to move from the set SΣ − S to the set S one of the rows which is obviously different from the rows of every word of S. We have then to update the set SΣ − S and this operation has to be done until neither word can be moved from SΣ − S (we say then that the matrix is closed ). The output of the algorithm is an automaton obtained from the state characterization matrix. If this automaton is not consistent with the data, the algorithm outputs the Moore machine P T M (D+ , D− ).
118
P. Garc´ıa, A. Cano, and J. Ruiz
Algorithm 1 Gold's Algorithm
Gold(D+, D−)
  S = {λ}; E = Suf(Σ−1(D+ ∪ D−));
  Build the table (S, E, T);
  While there exists s′ ∈ (SΣ − S) such that row(s′) is obviously different from row(s), ∀s ∈ S
    Choose any s′;   (*)
    S = S ∪ {s′};
    Update (S, E, T)
  End While;
  Q = S; q0 = λ;
  For all s ∈ S
    Φ(s) = T(s);
    For all a ∈ Σ
      If sa ∈ S Then δ(s, a) = sa
      Else δ(s, a) = any s′ ∈ S such that row(sa) is not obviously different from row(s′)   (**)
    End For all
  End For all;
  M = (Q, Σ, {0, 1, ↑}, δ, q0, Φ);
  If M is consistent with (D+, D−) Then Return(M)
  Else Return(PTM(D+, D−))
End
There are exactly two places where the algorithm may be non deterministic. The first one1 is when there are several rows from SΣ − S that can be moved to S. The second is when we are building the output automaton and there are several obviously different rows (states) where the transition can be assigned2 . The solution we adopted for both situations is to choose the smallest row in lexicographic order. Example 1. Let D+ = {abb, bb, bba, bbb, babb} and D− = {λ, a, ba, aba, bab}. The initial state characterization matrix is: abb bb b λ ba a ab E S λ 1 1 ↑0 0 0 ↑ SΣ − S a ↑ 1 ↑ 0 0 ↑ ↑ b 1 1 1↑ 1 0 0
1 2
See the part marked with (*) in the algorithm Gold(D+ , D− ). See the part marked with (**) in the algorithm Gold(D+ , D− ).
A Comparative Study of Two Algorithms for Automata Identification
119
Applying the algorithm until the matrix is closed we obtain: E
λ S b bb a SΣ − S ba bba bbb
abb bb b λ ba a ab 1 1 ↑0 0 0 ↑ 1 1 1↑ 1 0 0 ↑ ↑ 11 ↑ 1 ↑ ↑ 1 ↑0 0 ↑ ↑ ↑ 1 00 ↑ ↑ ↑ ↑ ↑ ↑1 ↑ ↑ ↑ ↑ ↑ ↑1 ↑ ↑ ↑
The Moore machine resulting from running the algorithm with the above example is depicted in figure 1 (a). The corresponding automaton used for pattern recognition is shown in figure 1(b). In this automaton the states in which the output is undefined are considered as non final. a
a b
b
0 a
1
a,b (a)
b
b
a
a,b (b)
Fig. 1. (a) Moore machine output by Gold algorithm on input positive and negative samples of example 1. (b) Automaton inferred under the same conditions for clasification tasks.
3.2
The Algorithm RP N I/Lang
The algorithm RP N I/Lang [6] has been described and widely used in pattern recognition tasks, see for example [4] and [5]. It is also somehow based on Nerode theorem. The algorithm merges states on the prefix Moore machine of the positive and negative samples and outputs an automaton which is consistent with those samples. After trying to merge two states it has to group together the necessary ones to make a deterministic automaton. If any of the groupings gives an automaton which is not consistent with the data, it has to forget the current merging and try another one. An upper bound of the running time of this algorithm is mn2 , where m is the size of the initial prefix tree and n is the size of the final automaton ([4]). A detailed and revised description of the algorithm can be found in [1].
4
A Merging States Version of T B/Gold Algorithm
Differences between Gold and RP N I algorithms start at the very beginning, so, in order to compare them it can be useful to re-write one of the algorithms in
120
P. Garc´ıa, A. Cano, and J. Ruiz
a similar way as the other is written. We will present the TB/Gold algorithm as a merging-states procedure on PTM(D+, D−). Given a Moore machine M = (Q, Σ, {0, 1, ↑}, δ, q0, Φ), we say that p, q ∈ Q are distinguishable in M if there exists a word x ∈ Σ* with Φ(δ(p, x)), Φ(δ(q, x)) ∈ {0, 1} and Φ(δ(p, x)) ≠ Φ(δ(q, x)). Otherwise we say that p and q are non-distinguishable in M. We next describe the procedure distinguishable and the TB/Gold algorithm. In the procedure distinguishable we make use of the function compatible, which assigns to a pair of states the value False if one represents a positive sample and the other a negative one, and the value True in the rest of the cases. We also make use of the functions defined for lists: First, which returns the first element of a list, and Rest, which returns the list without the first element.

Algorithm 2 function distinguishable: given p, q ∈ Q, returns True if ∃x ∈ Σ* : Φ(δ(p, x)), Φ(δ(q, x)) ∈ {0, 1} and Φ(δ(p, x)) ≠ Φ(δ(q, x))
distinguishable(u, v, PTM(D+, D−))
  list = {(u, v)};
  While list ≠ ∅ do
    (p, q) = First(list); list = Rest(list);
    If Not(compatible(p, q)) Then Return(True)   // p, q ∈ Q are not compatible if Φ(p), Φ(q) ∈ {0, 1} ∧ Φ(p) ≠ Φ(q) //
    Else For every a ∈ Σ
      If ∃δ(p, a) ∧ ∃δ(q, a) Then Append (δ(p, a), δ(q, a)) to list
      EndIf
    EndIf
  End While
  Return(False)
End.
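A Python rendering of Algorithm 2 is a plain breadth-first traversal of pairs of states. It assumes the PTM is given as a transition dictionary delta, an output map phi (1, 0 or None for ↑) and an alphabet; this is an illustration, not the authors' implementation. No visited set is needed because the prefix machine is a tree.

```python
from collections import deque

def distinguishable(p, q, delta, phi, alphabet):
    """True iff some word x leads p and q to states with defined and different outputs."""
    queue = deque([(p, q)])
    while queue:
        u, v = queue.popleft()
        if phi[u] is not None and phi[v] is not None and phi[u] != phi[v]:
            return True                              # u and v are not compatible
        for a in alphabet:
            if (u, a) in delta and (v, a) in delta:
                queue.append((delta[(u, a)], delta[(v, a)]))
    return False

# On the PTM built from D+ = {"bbbb"}, D- = {"b"}, distinguishable("", "b", ...)
# returns False, matching Example 2 below.
```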
Example 2. Let D+ = {bbbb} and D− = {b} as in the above example. The function distinguishable({λ}, {b}, P T M (D+ ,D− )) returns the result F alse. If one wishes that this algorithm returns complete automata as the original Gold algorithm does, we only have to send the lacking transitions to the initial state. With this new approach to Gold algorithm, some differences related to the behavior of both -RP N I and Gold- algorithms can be seen. The main one is that at every step Gold only uses the prefix tree acceptor and the states to be merged, so he does not make use of some information that RP N I uses, as this last algorithm considers all the merging done at every step as correct. By other hand, Gold algorithm returns a hypothesis of size less than or equal to the size of the target automaton, as the number of rows of S is bounded by the number of states of the target automaton, which determines the number of equivalence classes in the Nerode equivalence.
A Comparative Study of Two Algorithms for Automata Identification
121
Algorithm 3 TB/Gold algorithm as a merging-states procedure on PTM(D+, D−)
TB/Gold(D+, D−)
  M0 = PTM(D+, D−);   // M0 = (Q0, Σ, {0, 1, ↑}, δ, q0, Φ0) //
  S = {λ};
  While there exists s′ ∈ SΣ − S such that ∀s ∈ S, distinguishable(s, s′, M0)
    choose s′; S = S ∪ {s′};
  End While;
  Q = S; q0 = λ;
  For s ∈ S
    Φ(s) = Φ0(s);
    For a ∈ Σ
      If sa ∈ S Then δ(s, a) = sa
      Else δ(s, a) = any s′ such that Not(distinguishable(sa, s′, M0));
    End For
  End For;
  M = (Q, Σ, {0, 1, ↑}, δ, q0, Φ);
  If M is consistent with (D+, D−) Return(M) Else Return(M0);
End.
Less important is the fact that Gold algorithm always returns complete automata, while this is not necessarily true in case of RP N I algorithm. 4.1
An Example of the Merging States Scheme of Gold Algorithm
Figure 2 represents the (D+ , D− )-Prefix Moore machine with D+ = {aa, aaa, aaab, aaba, aabba, ab, abab, abb, abba, abbb} and D− = {λ, a, aab, b, bb}. The states to be merged at every step, and the evolution of S and SΣ can be seen in the following table. S SΣ list {λ} {a, b} {λ, a} {b, aa, ab} {(b, λ)} {λ, a, aa} {b, ab, aaa, aab} {λ, a, aa, ab} {b, aaa, aab, aba, abb} {(b, λ), (aaa, ab), (aab, a), (aba, a), (abb, ab)} According to the algorithm, the states of the resulting automaton are {λ, a, aa, ab}. The transition function of the automaton is δ(λ, a) = a, δ(a, b) = λ, δ(a, a) = aa, δ(a, b) = ab, δ(aa, b) = a, δ(ab, b) = ab and δ(ab, a) = a. The set of final states is F = {aa, ab}. The automaton is depicted in figure 3.
122
P. Garc´ıa, A. Cano, and J. Ruiz b
1
1
a 1
1
a
b 0
a
a
b
1
0
b
a
b
1
a
1
a
0
1
b
b
1
b
0
1
b 0
Fig. 2. (D+ , D− )-Prefix Moore machine with D+ = {aa, aaa, aaab, aaba, aabba, ab, abab, abb, abba, abbb} and D− = {λ, a, aab, b, bb}.
b
aa
b λ
a
a b
a
a
a ab
b Fig. 3. Output automaton of the execution of the merging states scheme of Gold algorithm for the given sample.
5
Experimental Results
We have compared the new version of T B/Gold algorithm (merging in the prefix tree acceptor) and RP N I/Lang algorithm. Description of the experiments: – We work with minimal automata having 8, 16, 32, 64 and 128 states, the alphabet is Σ = {a, b}. We obtain them beginning with larger automata, we then minimize them and discard the automata which do not have the required size. This method is inspired in [4], Although Lang permits some flexibility in the number of states. – For the learning process we use randomly generated strings of length less than or equal to 21 over Σ ∗ . The number of them is shown in the figures that describe the results of the experiments.
A Comparative Study of Two Algorithms for Automata Identification
123
– The comparison of the automata is done using all the words of length less than or equal to 15 not used in the learning process. When Gold algorithm obtains an automaton which is not consistent with the sample we use it instead of the prefix tree acceptor of the sample. Other way, as we use for testing different words as for learning, the error rate for that automaton would be 100%. We recall that the objective of the experiment is to measure how close the hypothesis and the target automata are. – We have done 1000 experiments for each different size of the automata. The following figures show the mean (and one of them the typical deviation) of the error rate (percentage of words not correctly classified) and of the representation coefficient (size of the obtained hypothesis divided by the size of the target automaton). 5.1
Error Rate
80 64E 32E 16E 8
70 60 50 40 30 20 10 0 0
500
1000
1500
2000
2500
3000
3500
4000
4500
Fig. 4. Error rate obtained running Gold algorithm with target automata of size from 8 to 64 states as the number of sample used for inference varies.
Figures 4 and 5 show respectively the variation of the error rate (percentage of words not correctly classified) obtained by Gold and RP N I algorithms for different sizes of target automata when the number of samples used for inference also varies. As one may hope, for a given inference sample size, if the size of the target automaton increases so does the error rate. One may observe that the error rates of RP N I algorithm are better than those of Gold algorithm. In figure 6 we compare the behavior of both algorithms for the case of automata of size 16 (for the rest of the cases, the situation remains equivalent so we do not depict them). Figures 7 and 8 show the variation of the mean and of the typical deviation of the error rates of the algorithms subject of study for automata of size 16 as the number of samples used for inference vary. We can observe that the typical deviation obtained with RP N I algorithm decreases to zero more rapidly than that obtained with Gold algorithm.
124
P. Garc´ıa, A. Cano, and J. Ruiz 80 64E 32E 16E 8E
70 60 50 40 30 20 10 0 0
500
1000
1500
2000
2500
3000
3500
4000
4500
Fig. 5. Error rate obtained running RP N I algorithm with target automata of size from 8 to 64 states as the number of sample used for inference varies. 60 tbrpni gold
50 40 30 20 10 0 0
500
1000
1500
2000
2500
Fig. 6. A comparison of the error rate average obtained by Gold and RP N I algorithms when the number of samples used automata of size 16 varies from 100 to 2000. 60 Gold 16 E
50 40 30 20 10 0 0
500
1000
1500
2000
2500
Fig. 7. Error Diagram (mean and typical deviation) for Gold algorithm for automata of 16 states as the number of samples used for inference vary from 100 to 2000.
5.2
Representation Coefficient
Figures 9 and 10 show the representation coefficient (size of the hypothesis divided by the size of the target automaton) of the automata obtained when running Gold and RP N I algorithms respectively. We can see that both algorithms converge to identify the number of states of the automata correctly, although Gold algorithm obtains increasing values as the number of words used in the inference becomes larger (we recall that the size of the hypothesis obtained using Gold algorithm is less than or equal to that of the target automaton) while RP N I obtains decreasing values.
A Comparative Study of Two Algorithms for Automata Identification
125
50 TBRPNI 16E
40 30 20 10 0 0
500
1000
1500
2000
2500
Fig. 8. Error Diagram (mean and typical deviation) for RP N I algorithm for automata of 16 states as the number of samples used for inference vary from 100 to 2000.
3 64 st 32 st 16 st 8 st
2.5 2 1.5 1 0.5 0 0
500
1000
1500
2000
2500
3000
3500
4000
4500
Fig. 9. Average of the representation coefficient of the automata inferred using Gold algorithm.
3 64 st 32 st 16 st 8 st
2.5 2 1.5 1 0.5 0 0
500
1000
1500
2000
2500
3000
3500
4000
Fig. 10. Average of the representation coefficient of the automata inferred using RP N I algorithm.
6
Conclusions
We have revised Trakhtenbrot and Barzdin (T B), Gold, RP N I and Lang algorithms. As it can be seen in [1], the first two ones are in fact the same, while the first description that Lang does of T B algorithm agrees with it only in case of a uniformly complete sample. The extension he gives to obtain consistent hypothesis is in fact RP N I algorithm.
126
P. Garc´ıa, A. Cano, and J. Ruiz
We have also formulated T B/Gold algorithm as a procedure of merging states in the prefix Moore machine of the sample for a better understanding of the comparison with RP N I. By other hand we have observed that the implementation of this new version works much faster than the original one. This fact might be deduced from theoretical estimations, but the use of upper bounds to estimate temporal complexities in the worst case do not allow us to be more precise about this fact. We have observed that RP N I algorithm generally obtains better results than Gold algorithm.
References
1. García, P., Cano, A., Ruiz, J.: Estudio comparativo de dos algoritmos de identificación de autómatas. Internal Report DSIC-II/1/00, Univ. Politécnica de Valencia (2000) (in Spanish)
2. Gold, E.M.: Complexity of Automaton Identification from Given Data. Information and Control 37 (1978) 302-320
3. Hopcroft, J., Ullman, J.: Introduction to Automata Theory, Languages and Computation. Addison-Wesley (1979)
4. Lang, K.J.: Random DFA's can be Approximately Learned from Sparse Uniform Examples. In: Proceedings of the Fifth Annual ACM Workshop on Computational Learning Theory (1992) 45-52
5. Oncina, J., García, P.: Inferring Regular Languages in Polynomial Updated Time. In: Pérez de la Blanca, Sanfeliú, Vidal (Eds.): Pattern Recognition and Image Analysis. World Scientific (1992)
6. Trakhtenbrot, B., Barzdin, Ya.: Finite Automata: Behavior and Synthesis. North Holland Publishing Company (1973)
The Induction of Temporal Grammatical Rules from Multivariate Time Series G. Guimarães Center of Artificial Intelligence (CENTRIA), Universidade Nova de Lisboa, Portugal guimas@uevora.pt
Abstract. In this paper the induction of temporal grammatical rules from multivariate time series is presented in the context of temporal data mining. This includes the use of unsupervised neural networks for the detection of the most significant temporal patterns in multivariate time series, as well as the use of machine learning algorithms for the generation of a rule-based description of primitive patterns. The main idea lies in introducing several abstraction levels for the pattern discovery process. The results of the previous step are then used to induce temporal grammatical rules at the different abstraction levels. This approach was successfully applied to a problem in medicine, sleep apnea. Keywords: Temporal Grammar Induction, Unsupervised Neural Networks, Machine Learning, Sleep Apnea
1. Introduction
The induction of grammatical rules includes two steps: the induction of structural patterns from data, and the representation of those patterns by a formal grammar. Usually, a set of strings defined over a specific alphabet is used as the set of examples for the induction process. For multivariate time series, however, no such set of strings exists, which means that the time series first have to be transformed into a string-based representation. Moreover, this transformation may include the discovery of inherent patterns in the time series using unsupervised methods. Self-Organizing Maps (SOMs), as proposed by Kohonen [15], are well suited for this task. They are appropriate for processing temporal data, for instance in speech recognition [3] and EEG monitoring [13], as well as for clustering high-dimensional data [14, 23]. The latter assumes an adequate visualization of the network structures. SOMs have already been used for rule inference in the context of recurrent neural networks applied to univariate time series [9]. A symbolic encoding is obtained from one-dimensional SOMs, making the training of the recurrent neural network and the extraction of the symbolic knowledge easier. The discovery of yet unknown temporal patterns in the time series can be regarded as a knowledge discovery process in multivariate time series [4]. Knowledge discovery in databases (KDD) is an interactive, multi-disciplinary approach and
comprises several steps, such as data selection, preprocessing and transformation, data mining, and knowledge interpretation [8]. The main step of the whole process is data mining, a non-trivial process for the extraction of implicit, previously unknown and potentially useful knowledge from large data sets [1]. However, data mining does not only focus on the discovery of inherent structures in the data; it also assumes a transformation of the discovered patterns into a knowledge representation form that is intelligible for human beings. Machine learning (ML) algorithms as well as rule induction algorithms are appropriate for this task. An advantage of inducing temporal grammatical rules for multivariate time series lies in the interpretation of the results of the neural networks, which are usually regarded as black boxes. In addition, parsing of the discovered patterns becomes possible. Other approaches propose incremental learning for the extraction of concepts in the context of data mining [7]. The main idea of the approach proposed here lies in structuring this highly complex problem by introducing several abstraction levels. At each level a higher abstraction of the patterns in the multivariate time series is obtained. Temporal grammatical rules are induced at all abstraction levels. Section 2 gives a brief description of the method that introduces the several abstraction levels. In Section 3 the structure of the temporal grammatical rules as well as the algorithms for the generation and induction process are described. Section 4 presents an application of the method in medicine. Conclusions are presented in Section 5.
2. The Discovery of Temporal Patterns in Multivariate Time Series with Unsupervised Neural Networks
The recently developed method for Temporal Knowledge Conversion (TCon) [11] introduces several abstraction levels to perform a successive and step-wise conversion of temporal patterns in multivariate time series into a temporal knowledge representation. Fig. 1 shows the main steps of the method. Multivariate time series Z = x⃗(t_1), ..., x⃗(t_n) with x⃗(t_i) ∈ R^m, m > 1, sampled at equal time steps t_1, ..., t_n and gathered from signals of complex processes are the input of the system. The results of the method are the discovered temporal patterns as well as a linguistic description of the patterns, in the form of temporal grammatical rules (TG-rules).
Features: First, pre-processing and feature extraction for all time series is a prerequisite for further processing [6]. For the feature extraction, one or more time series x⃗_S(t_i) = (x_{j_1}(t_i), ..., x_{j_s}(t_i))^T ∈ R^s may be selected from the multivariate time series Z, with j_k ∈ S, k = 1, ..., s, S ⊂ {1, ..., m}, s = |S|. A feature m_S(t_i, l) = f(x⃗_S(t_i), ..., x⃗_S(t_{i+l})) is then the value of a function f: R^{s×l} → R at time t_i, with i ∈ {1, ..., n − l}, computed from selection S.
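The following sketch illustrates this kind of windowed feature extraction. The selection S, the window length l and the aggregation function below are placeholders chosen for the example, not values prescribed by the method.

```python
import numpy as np

def extract_feature(Z, S, l, f):
    """Windowed feature extraction for a multivariate series.

    Z: array of shape (n, m), one row per time step.
    S: list of selected signal indices (the selection S).
    l: window length.
    f: function mapping an (l, |S|) window to a scalar feature value.
    Returns the feature series m_S(t_i, l) for i = 0, ..., n - l.
    """
    n = Z.shape[0]
    X_S = Z[:, S]                      # restrict to the selected signals
    return np.array([f(X_S[i:i + l]) for i in range(n - l + 1)])

# Example: mean amplitude of two (hypothetical) respiration channels
# over a one-second window of a 25 Hz recording.
Z = np.random.rand(250, 5)             # toy data standing in for real signals
feature = extract_feature(Z, S=[0, 1], l=25, f=lambda w: float(w.mean()))
```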
Fig. 1. Abstraction levels and steps of the method for Temporal Knowledge Conversion (TCon): feature extraction, primitive patterns (one SOM per feature selection), successions, events, sequences and temporal patterns, together with the rule templates generated at each level.
Primitive Patterns: Elementary patterns p_j, j = 1, ..., K, called primitive pattern classes, are discovered in the time series using selections of features as attributes. An element of a primitive pattern class is a primitive pattern p_j(t_i) that belongs to a given primitive pattern class p_j and is associated with a given time point t_i. For this task, the use of SOMs is particularly adequate. Together with a special visualization technique, called the U-Matrix method [23], the main patterns in different feature selections may be detected. Regions on a U-Matrix that do not correspond to a specific primitive pattern class are associated with a special group, named tacet. We are now able to classify the whole time series with primitive patterns and tacets; the result will be called a primitive pattern (PP-)channel. Instead of analyzing all time series simultaneously, several selections of features are made. Consequently, several SOMs are learned and several PP-channels are generated. Each PP-channel is represented by a sequence of strings, which is used to induce the temporal grammatical rules at the next level. In order to obtain an intelligible description of the
discovered primitive pattern classes, ML algorithms are used to generate a rule-based description of the classes based on the features.
Successions: In order to consider temporal relations among primitive patterns, succeeding identical primitive patterns p_j(t_i), ..., p_j(t_{i+k}), i = 1, ..., n−k+1, j ∈ {1, ..., K}, obtained from each SOM are identified as successions. Each succession s_j(a, e) is associated with a given primitive pattern class; it has a starting point a := t_i, an end point e := t_{i+l}, and a duration l = e − a. Since each primitive pattern is represented by its bestmatch on a U-Matrix, trajectories of succeeding primitive patterns (bestmatches) on a U-Matrix are used for the identification of successions [10]. As several feature selections are made, successions from different PP-channels may occur more or less simultaneously.
Events: Instead of using detailed interval overlapping relationships [2] among the successions at different PP-channels, a vague simultaneity is preferred for pattern discovery problems [17]. Events e_j(l_{ji}) are identified with the repeated occurrence of more or less simultaneous successions of the same types s_v(a_{v_{ji}}, e_{v_{ji}}), ..., s_r(a_{r_{ji}}, e_{r_{ji}}), v, ..., r ∈ {1, ..., K}, at different PP-channels. Each event e_j(l_{ji}) belongs to an event class e_j, j = 1, ..., Q; A_{ji} = max(a_{v_{ji}}, ..., a_{r_{ji}}) is its starting point, E_{ji} = max(e_{v_{ji}}, ..., e_{r_{ji}}) its end point, and l_{ji} = E_{ji} − A_{ji} the duration of event e_j(l_{ji}). In order to reduce the number of event classes, similar event classes are summarized into one event class. To this end, the significance of each event class (the frequency of occurrence of the event class) is calculated: conditional probabilities are calculated among primitive patterns at different PP-channels, and histograms over these probabilities enable a distinction between significant event classes (very frequent ones) and less significant event classes (less frequent ones). Rare events are omitted, in the sense that they are regarded as delays between events, named event tacets. In order to join event classes with different significance levels, less significant event classes are associated to significant event classes; for this, similarities are calculated among them by counting the number of equal types of successions in the event classes. Each event class is thus described by one significant event class and, possibly, one or more less significant event classes. At this level, the whole multivariate time series is described by a sequence of events, i.e. strings, F = e_{1_i}, ..., e_{n_i}, with i ∈ {1, ..., M} and M the number of event classes. The identification of events with SOMs presumes the use of extended hierarchical SOMs [10]. For each event a temporal grammatical rule is generated.
Sequences: Subsequences of event types e_h, ..., e_k that occur more than once in F are identified as a sequence sq_j(lmin_j, lmax_j) = e_{h_j}(lmin_{h_j}, lmax_{h_j}), ..., e_{k_j}(lmin_{k_j}, lmax_{k_j}), j = 1, ..., P, with P the number of different sequence types. A sequence has a minimal duration lmin_j and a maximal duration lmax_j, and the event types of sequence sq_j also have a minimal duration lmin_{i_j} and a maximal duration lmax_{i_j}. This means that sequences are repeated subsequences of the same type of events at different time points t_i. For the identification of sequences, probabilistic automata are used; they describe transition probabilities between events. Paths through such a probabilistic automaton permit the
identification of subsequences of events. Furthermore, longer transitions between succeeding events may also be used for the identification of sequences. Event tacets with a longer duration are used to determine starting and/or end events of sequences, if they are interpreted as some kind of delay. For each sequence a temporal grammatical rule is generated.
Temporal Patterns: Finally, small variations of events in different sequence classes lead to the identification of similar sequence classes. Similar sequence classes sq_i(lmin_i, lmax_i) ∨ ... ∨ sq_v(lmin_v, lmax_v) are subsumed under a temporal pattern tp_j(lmin_j, lmax_j), where lmin_j = min(lmin_i, ..., lmin_v) and lmax_j = max(lmax_i, ..., lmax_v). Temporal patterns are abstract descriptions of the main temporal structures in multivariate time series. String exchange algorithms are suitable for the identification of temporal patterns. For each temporal pattern a temporal grammatical rule is generated.
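A minimal sketch of the succession step described above: consecutive identical primitive-pattern labels in one PP-channel are collapsed into successions with a start point, an end point and a duration. The channel string below is an invented example.

```python
def successions(pp_channel):
    """Collapse runs of identical primitive-pattern labels into successions.

    pp_channel: sequence of labels, one per time point (tacets included).
    Returns a list of (label, start, end, duration) tuples with duration = end - start.
    """
    runs, start = [], 0
    for i in range(1, len(pp_channel) + 1):
        if i == len(pp_channel) or pp_channel[i] != pp_channel[start]:
            runs.append((pp_channel[start], start, i - 1, i - 1 - start))
            start = i
    return runs

# Example PP-channel: 'A2' = no airflow without snoring, 'T' = tacet.
print(successions(["A2", "A2", "A2", "T", "T", "A2"]))
# [('A2', 0, 2, 2), ('T', 3, 4, 1), ('A2', 5, 5, 0)]
```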
3. The Induction of Temporal Grammatical Rules
In the context of temporal data mining, the induction of rules from the discovered temporal patterns is of extreme relevance for the interpretation of the results obtained from the neural networks. There exist different knowledge representation formalisms in AI, such as language, frames, semantic nets, and predicate logic. In particular, language can be considered one of the most widespread and natural forms of knowledge representation. Thus, language in the form of temporal grammatical rules (TG-rules) will be considered here as the most appropriate knowledge representation form for multivariate time series. Formal language models are widely used in syntactic or linguistic pattern classification systems, where patterns are represented as strings and a set of rules is learned for generating the strings belonging to the same class. Such classification systems have applications, for instance, in speech recognition, the discovery of patterns in biosequences, the interpretation of ECGs, and the recognition of seismic signals. Language learning or acquisition, usually referred to as grammatical inference, concerns the acquisition of the syntax and the semantics of a target language. Until now, most emphasis has been placed on learning the syntax of a language. For this, an appropriate class of grammars has to be chosen. The Chomsky hierarchy of formal grammars (regular, context-free and context-sensitive grammars) is often used to model the target grammar. Since every finite language is regular, and a context-free language can often be approximated by a regular grammar, the inference of regular grammars is of significant practical relevance [18]. When dealing with multivariate time series, however, the use of extended context-free grammars, called definite clause grammars (DCGs), seems more advantageous [12]. Conditions and tests are introduced into the rules of a context-free grammar [20], such that temporal conditions, for example the duration, may be specified. Furthermore, a Prolog interpreter can then be used as a parser.
Next, the use of machine learning (ML) algorithms at the first abstraction level will be presented for the generation of intelligible names of the primitive patterns, as well as the induction of TG-rules at the next higher abstraction levels.
Rules for primitive pattern classes: The main idea of this step lies in using ML algorithms for generating an intelligible description of the discovered primitive
patterns. This description (intelligible names of the primitive patterns) will be used at the next level for the generation of the TG-rules. ML algorithms are suitable for this task, since a rule-based description or a description in the form of decision trees is generated. At this level, no adaptation of the ML algorithms is needed, since no temporal relations among the primitive patterns are considered. For example, consider the rule generation algorithm sig* [22], which generates rules based on the most significant features of each class. For each primitive pattern class a rule is generated with its most significant features:
A primitive pattern is a 'primitive_pattern_name' if
  'feature i' ∈ [min_i, max_i]
  and 'feature j' ∈ [min_j, max_j]
  and ...
  and 'feature k' ∈ [min_k, max_k]
TG-rules for event classes: Each event is described by several similar event classes (one significant and possibly one or more less significant ones). Therefore, the tokens 'more or less simultaneous' and 'or' are introduced at this level.
An event is a 'event_name' if
  'succession i1' and ... and 'succession in'
  is more or less simultaneous
  'succession j1' and ... and 'succession jm'
  is more or less simultaneous
  ...
  is more or less simultaneous
  'succession k1' and ... and 'succession kn'
On the right-hand side of the rules, intelligible names for the successions, obtained from the sig* rules, are used instead of simple characters. As event classes may appear in different sequences at the next level, no duration is specified. A sketch of the rule construction follows the algorithm below.
Algorithm for the generation of TG-rules for events: For each event class j
• Identify significant and less significant event classes, and event tacets
• Calculate the similarity between significant and less significant event classes
• Associate less significant event classes to a significant event class
• Generate a rule with
  (a) Left side of the rule: name of the event
  (b) Right side of the rule:
    • For each PP-channel, generate alternatives between successions belonging to similar event classes with the token 'or'
    • Concatenate the generated alternatives using the token 'is more or less simultaneous'
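A small sketch of how such an event rule could be assembled as text from already-grouped successions per PP-channel. The data and helper names are invented for illustration and are not taken from the paper.

```python
def event_rule(event_name, channel_alternatives):
    """Build the text of a TG-rule for an event class.

    channel_alternatives: one list of succession names per PP-channel;
    names within a channel are joined with 'or', channels are joined
    with 'is more or less simultaneous'.
    """
    parts = []
    for names in channel_alternatives:
        alt = names[0] if len(names) == 1 else "(" + " or ".join(names) + ")"
        parts.append(alt)
    body = "\n  is more or less simultaneous\n  ".join(parts)
    return f"An event is a '{event_name}' if\n  {body}"

print(event_rule(
    "no airflow and no respiratory movements without snoring",
    [["'no airflow without snoring'"],
     ["'no ribcage and abdominal movements without snoring'", "'tacets'"]]))
```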
TG-rules for sequence classes: For each sequence class a grammatical rule is inferred. Therefore, the tokens ´followed by´ and ´followed after … by´ are used. They enable a distinction between immediately following events and events following after
an event tacet. It is emphasized that the same event class may appear in different sequence classes. On the right-hand side of the rules, intelligible names for the event classes obtained from the last step are used instead of simple characters. At this level no name is generated for the sequence classes.
A sequence is a 'sequence_name' [min, max] if
  'Event_i': 'name of event i' [min_i, max_i]
  followed by [followed after [mint_j, maxt_j] by] 'Event_j': 'name of event j' [min_j, max_j]
  followed by [followed after [mint_k, maxt_k] by] 'Event_k': 'name of event k' [min_k, max_k]
  ...
  followed by [followed after [mint_l, maxt_l] by] 'Event_l': 'name of event l' [min_l, max_l]
Algorithm for the generation of TG-rules for sequences: For each sequence class j
• Identify the event classes that belong to the sequence class
• For each event class belonging to the sequence class, calculate the minimal and maximal duration
• For each event tacet between succeeding event classes belonging to the sequence class, calculate the minimal and maximal duration
• Generate a rule with
  (a) Left side of the rule: number of the sequence, and also specify the minimal and maximal duration
  (b) Right side of the rule:
    • If event classes follow immediately, use the token 'followed by'
    • If event classes follow after a break, use the token 'followed after [min, max] by'
    • For each event class and event tacet, specify the minimal and maximal duration
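In the same spirit, a sequence rule could be rendered from a list of events with durations and optional tacet gaps. Again, this is only an illustrative sketch with invented data, not the method's implementation.

```python
def sequence_rule(name, duration, events):
    """Render a TG-rule for a sequence class.

    events: list of (event_name, (min_dur, max_dur), gap) where gap is
    None for immediately following events or (min, max) for an event tacet.
    """
    lines = [f"A sequence is a '{name}' [{duration[0]} sec, {duration[1]} sec] if"]
    for i, (ev, (lo, hi), gap) in enumerate(events):
        prefix = "" if i == 0 else (
            "followed by " if gap is None
            else f"followed after [{gap[0]} sec, {gap[1]} sec] by ")
        lines.append(f"  {prefix}'{ev}' [{lo} sec, {hi} sec]")
    return "\n".join(lines)

print(sequence_rule("Sequence1", (40, 64), [
    ("Event1", (13, 18), None),
    ("Event2", (20, 39), None),
    ("Event3", (6, 12), (0.5, 5)),
]))
```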
TG-rules for temporal patterns: Similar sequences are subsumed under a temporal pattern. Therefore, the token 'or' is used. If a temporal pattern entails just one sequence, this step may be omitted.
A temporal pattern is a 'temporal_pattern_name' [min, max] if
  'Sequence_i' [min_i, max_i]
  or 'Sequence_j' [min_j, max_j]
  or ...
  or 'Sequence_k' [min_k, max_k]
Algorithm for the generation of TG-rules for temporal patterns: For each temporal pattern j
• Identify the sequence classes that belong to the temporal pattern
• Calculate the minimal and maximal duration of the temporal pattern
• For each sequence class belonging to the temporal pattern, calculate the minimal and maximal duration
• Generate a rule with
  (a) Left side of the rule: number of the temporal pattern, and also specify the minimal and maximal duration
  (b) Right side of the rule:
    • Generate alternatives between sequences belonging to the temporal pattern using the token 'or'
    • For each sequence class, specify the minimal and maximal duration
The complexity of each TG-rule at all abstraction levels may differ strongly, as will be shown in the next section with the application.
4. An Application in Medicine
The method was applied to a sleep disorder with a high prevalence, called sleep-related breathing disorders (SRBDs). For the diagnosis of SRBDs the temporal dynamics of physiological parameters such as sleep-related signals (EEG, EOG, EMG), signals concerning the respiration (airflow, ribcage and abdominal movements, oxygen saturation, snoring) and circulation-related signals (ECG, blood pressure) have to be recorded and evaluated. Since the main aim was the identification of different types of sleep-related breathing disorders, mainly apnea and hypopnea, only the signals concerning the respiration had to be considered [21]. The severity of the disorder is calculated by counting the number of apneas and hypopneas per hour of sleep, named the respiratory disturbance index (RDI). If the RDI exceeds 40 events per hour of sleep, the patient has to be referred to therapy. A visual classification of the different types of SRBDs based on such a recording is usually made by technical assistants. An automatic identification of SRBDs is a very hard task, since all signals have to be analyzed simultaneously. In addition, quite different patterns for the same SRBD may occur, even for the same patient during the same night, and a strong variation of the duration of each event may occur as well [19]. The different kinds of SRBDs are identified through the signals 'airflow', 'ribcage movements', 'abdominal movements', 'snoring' and 'oxygen saturation', where a distinction between amplitude-related and phase-related disturbances is made. Concerning the amplitude-related disturbances, disturbances with 50% as well as disturbances with 10-20% of the baseline signal amplitude may occur. Phase-related disturbances are characterized by a lag between 'ribcage movements' and 'abdominal movements'. An interruption of 'snoring' is present at most SRBDs, as well as a drop in 'oxygen saturation'. For this experiment, data sampled at 25 Hz have been used from three patients having the most frequent SRBDs. One patient exhibited multiple sleep disorders.
Features: The feature extraction made for SRBDs was based on a characterization of SRBDs typical in medicine that distinguishes between phase- and amplitude-related disturbances [21]. Two feature selections have been used as input for each SOM. Six primitive pattern classes have been detected from the U-Matrix with features concerning mainly the respiratory flow. Nine primitive pattern classes have been detected from the U-Matrix with features mainly related to respiratory
movements. Altogether, six sequence classes, four temporal patterns and six event classes have been discovered (see Fig. 2).
Fig. 2. Temporal patterns with the corresponding sequences and events for all SRBDs
In order to evaluate the plausibility of all temporal patterns, TG-rules at all levels have been generated from the results of the SOMs and were presented to a medical expert. See Fig. 3 for successions, events and sequences from one sleep disorder.
Primitive patterns: For all 15 primitive patterns a rule-based description was generated using the machine learning algorithm sig* [22]. For each primitive pattern class a rule is generated with its most significant features (see Example 1). The values of the features have to be interpreted as follows: values close to one mean that this feature occurs with a high probability, while values close to zero mean that this feature probably will not occur. This is due to the fact that a normalization of all features was performed. As the sig* algorithm generates rules with the most significant features for each primitive pattern, the naming of the primitive patterns is straightforward. The primitive pattern 'A2' was named 'no airflow without snoring', since the feature 'no airflow' has values close to one and 'snoring intensity' has low values. The primitive pattern 'B3' was named 'no ribcage and abdominal movements without snoring', since the features 'no ribcage movements' and 'no abdominal movements' have high values and 'snoring intensity' has zero value. These semi-automatically generated names of the primitive patterns will be used further on for the description of the TG-rules at the next higher level.
Example 1: Consider primitive patterns 'A2' and 'B3' that have been detected from different U-Matrices. The following sig* rules have been generated:
A primitive pattern is a ‘A2’ if ‘no airflow’ ∈ [0.951, 1] and ‘reduced airflow’ = 0 and ‘snoring intensity’ ∈ [0, 0.241]
A primitive pattern is a ‘B3’ if ‘no ribcage movements’ ∈ [0.772, 1] and ‘no abdominal movements’ ∈ [0.641, 1] and ‘reduced ribcage movements’ = 0 and ‘snoring intensity’ = 0
Events: For each of the six events one TG-rule was generated. These rules are understandable and interpretable, since the names of the primitive patterns, i.e. successions, have been used in the rules. The generation of names for the events is then straightforward (see Example 2). The occurrence of tacets in the rules means that small interruptions may occur in successions, or that a succession from one PP-channel occurs simultaneously with irrelevant information on the other channel. The name of an event contains just the names of the most frequently occurring successions; names of rare successions may be dismissed. This follows the idea of information reduction for the generation of well-interpretable rules. If needed, details may then be consulted at the lower abstraction levels.
Example 2: Grammatical rules for 'Event1' and 'Event3':
An event is a 'Event1': 'no airflow and no chest and abdomen wall movements without snoring' if
  'no airflow without snoring'
  is more or less simultaneous
  ('no ribcage and abdominal movements without snoring' or 'tacets')
An event is a 'Event3': 'strong breathing with snoring' if
  ('strong airflow with snoring' or 'reduced airflow with snoring' or 'tacets')
  is more or less simultaneous
  'strong ribcage and abdominal movements'
The following names for all events have been derived from the rules:
• Event1: 'reduced airflow with snoring'
• Event2: 'no airflow and no respiratory movements without snoring'
• Event3: 'no airflow and reduced ribcage movements and no abdominal movements without snoring'
• Event4: 'no until reduced airflow and reduced parallel and lagged respiratory movements without snoring'
• Event5: 'strong breathing with snoring'
• Event6: 'reduced airflow and lagged respiratory movements without snoring'
Sequences and temporal patterns: Altogether, six sequences from four different temporal patterns have been identified (see Fig. 2). For each sequence a rule was generated, as shown in Example 3. The names obtained at the lower level for the events have been used; they describe the main statistical properties of the discovered temporal patterns. The rules contain information about the range of the duration of the corresponding sequence. For each event in the rule, a range of the duration of the event class is also given, such that its plausibility could be checked by a domain expert. At this level, no names were given to the sequences and temporal patterns.
Example 3: Grammatical rule for 'Sequence1'
A sequence is a 'Sequence1' [40 sec, 64 sec] if
  'Event1': 'no airflow and no chest and abdomen wall movements without snoring' [13 sec, 18 sec]
  followed by 'Event2': 'no airflow and reduced chest and no abdomen wall movements without snoring' [20 sec, 39 sec]
  followed after [0.5 sec, 5 sec] by 'Event3': 'strong breathing with snoring' [6 sec, 12 sec]
A structured and complete evaluation of the discovered temporal knowledge at the different abstraction levels was made using a questionnaire. All events and temporal patterns presented to the medical expert described the main properties of SRBDs, for instance hyperpnea, obstructive snoring, obstructive apnea or hypopnea. All four discovered temporal patterns described the domain knowledge very well. In order to evaluate the TG-rules of the events, the successions in the events have been examined within the questionnaire. The expert could give the following answers: yes (successions must occur in the event), maybe (successions may occur in the event), don't know (no statement can be made) and no (successions must not occur in the event). Altogether, 56.25% of the more or less occurring successions were correctly associated to the events (yes); this corresponds to events that cover 80.16% of the whole time period. For 37.5% of the more or less occurring successions the expert could not give an exact answer (don't know); this corresponds to events covering 18.21% of the whole time period and is mainly due to the occurrence of tacets in the rules, which have no meaning in medicine. For 6.25% the expert answered maybe, corresponding to 1.63% of the whole time period. It is remarkable that none of the events was incorrectly described (answer: no). An evaluation of the rules at this level led to an overall sensitivity of 0.762 and a specificity of 0.758. 'Event5' was correctly identified as a special event, called 'hyperpnea'; SRBDs always end with a 'hyperpnea'. In some cases the duration of 'Event5' was too short; the durations of all other events were in a valid range. The interpretation of the rules was straightforward, since all rules were regarded as well understandable descriptions, due to the tokens used in the grammatical rules, such as 'more or less simultaneous', 'or', 'followed by' and 'followed after ... by', and due to the intelligible names generated for the primitive patterns and events. A restriction was made concerning the term 'more or less simultaneous', which in the opinion of the
expert should be renamed 'simultaneous', since simultaneity is assumed in medicine. For one temporal pattern even previously unknown knowledge was discovered. This temporal pattern was named by the expert as mixed obstructive apnea, distinguished into a mixed obstructive apnea with an interruption and snoring, having a central and an obstructive part, and a mixed obstructive apnea without an interruption and without snoring, ending in a hypoventilation.
Fig. 3. Multivariate time series (airflow, ribcage movements, abdominal movements, snoring) with the corresponding successions and events for a patient with SRBDs
5. Conclusion
This paper presents an approach for the induction of temporal grammatical rules from multivariate time series with unsupervised neural networks. It was demonstrated that Self-Organizing Maps (SOMs) [15] are appropriate for this task. The main idea lies in the introduction of several abstraction levels, such that a step-wise and successive discovery of temporal patterns becomes feasible. The results of the neural networks at the different abstraction levels are used to induce temporal grammatical rules. If no temporal relations have to be considered, for instance for the generation of a rule-based description of elementary patterns in time series, machine learning (ML) algorithms can be used straightforwardly. The main advantage of this approach lies in the generation of a description for multivariate time series at different accuracy levels, which permits a structured interpretation of the final results.
Previous approaches for the generation of a syntactical description of signals in the form of grammars or automata used a pre-classification of the signals and are limited to univariate time series, such as ECGs [16] or carotid pulse waves [5]. The main
patterns in the time series are pre-defined, for instance by classifications of P-waves or the QRS complex of an ECG signal, or by simple waveform operations such as local minimum or negative slope. A strongly related approach that also uses SOMs, in combination with recurrent neural networks, for the generation of automata is presented in [9]. That approach was used to predict daily foreign exchange rates; one-dimensional SOMs are used to extract elementary patterns from the time series. However, it is limited to univariate time series. The method presented here was applied successfully to sleep-related breathing disorders (SRBDs). All events and temporal patterns presented to the medical expert described the main properties of SRBDs, for instance hyperpnea, obstructive snoring, obstructive apnea or hypopnea. An evaluation of the rules at a lower abstraction level led to an overall sensitivity of 0.762 and a specificity of 0.758.
Acknowledgments I would like to thank Prof. Dr. J. H. Peter and Dr. T. Penzel, Medizinische Poliklinik, Philipps University of Marburg for providing the data.
References
1. Adrians, P., Zantinge, D.: Data Mining. Addison-Wesley (1996)
2. Allen, J.: Towards a General Theory of Action and Time. Artificial Intelligence 23 (1984) 123-154
3. Behme, H., Brandt, W.D., Strube, H.W.: Speech Recognition by Hierarchical Segment Classification. In: S. Gielen, B. Kappen (Eds.): Proc. Intl. Conf. on Artificial Neural Networks (ICANN 93), Amsterdam, Springer Verlag, London (1993) 416-419
4. Berndt, D.J., Clifford, J.: Finding Patterns in Time Series: A Dynamic Programming Approach. In: Fayyad, U.M. et al. (Eds.): Advances in Knowledge Discovery and Data Mining. AAAI Press, The MIT Press, London (1996) 229-248
5. Bezdek, J.C.: Hybrid modeling in pattern recognition and control. Knowledge-Based Systems 8(6) (1995) 359-371
6. Bishop, C.M.: Neural Networks for Pattern Recognition. Oxford, Clarendon Press (1995)
7. Case, J., Jain, S., Lange, S., Zeugmann, T.: Learning Concepts Incrementally with Bounded Data Mining. In: Workshop on Automated Induction, Grammatical Inference, and Language Induction, The 14th Intl. Conf. on Machine Learning (ICML-97), July 12, Nashville, Tennessee (1997)
8. Fayyad, U.M., Piatetsky-Shapiro, G., Smyth, P., Uthurusamy, R. (Eds.): Advances in Knowledge Discovery and Data Mining. AAAI Press, The MIT Press, London (1996)
9. Giles, C.L., Lawrence, S., Tsoi, A.C.: Rule Inference for Financial Prediction using Recurrent Neural Networks. In: Proceedings of the IEEE/IAFE Conf. on Computational Intelligence for Financial Engineering (CIFEr), IEEE, Piscataway, NJ (1997) 253-259
10. Guimarães, G.: Temporal Knowledge Discovery for Multivariate Time Series with Enhanced Self-Organizing Maps. To be publ. in: IEEE-INNS-ENNS Intl. Joint Conf. on Neural Networks (IJCNN'2000), Como, Italy, 24-27 July (2000)
11. Guimarães, G., Ultsch, A.: A Method for Temporal Knowledge Conversion. In: Procs. of IDA99, The Third Symposium on Intelligent Data Analysis, August 9-11, Amsterdam, Netherlands, Lecture Notes in Computer Science, Springer Verlag (1999) 369-380
12. Guimarães, G., Ultsch, A.: A Symbolic Description for Patterns using Definitive Clause Grammars. In: R. Klar, O. Opitz (Eds.): Classification and Knowledge Organization, Proc. of the 20th Annual Conference of the Gesellschaft für Klassifikation, March 6-8 1996, Univ. of Freiburg (1997) 105-111
13. Kaski, S., Joutsiniemi, S.L.: Monitoring EEG Signal with the Self-Organizing Map. In: S. Gielen, B. Kappen (Eds.): Intl. Conf. on Artificial Neural Networks (ICANN 93), Amsterdam, Springer Verlag, London (1993) 974-977
14. Kaski, S., Kohonen, T.: Exploratory Data Analysis by Self-Organizing Map: Structures of Welfare and Poverty in the World. In: A.P.N. Refenes et al. (Eds.): Neural Networks in Financial Engineering. Proc. of the Intl. Conf. on Neural Networks in the Capital Markets, London, Singapore (1996) 498-507
15. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43 (1982) 59-69
16. Koski, A., Juhola, M., Meriste, M.: Syntactic recognition of ECG signals by attributed finite automata. Pattern Recognition 28(12) (1995) 1927-1940
17. Kowalski, R., Sergot, M.: A Logic-based Calculus of Events. New Generation Computing 4 (1986) 67-95
18. Parekh, R., Honavar, V.: Grammar Inference, Automata Induction, and Language Acquisition. In: Dale, Moisl, Somers (Eds.): Handbook of Natural Language Processing, New York, Marcel Dekker (1998)
19. Penzel, T., Peter, J.H.: Design of an Ambulatory Sleep Apnea Recorder. In: H.T. Nagle, W.J. Tompkins (Eds.): Case Studies in Medical Instrument Design, IEEE, New York (1992) 171-179
20. Pereira, F.C.N., Warren, D.: Definite Clause Grammars for Language Analysis - A Survey of the Formalism and a Comparison with Augmented Transition Networks. Artificial Intelligence 13 (1980) 231-278
21. Peter, J.H., Becker, H., Brandenburg, U., Cassel, W., Conradt, R., Hochban, W., Knaack, L., Mayer, G., Penzel, T.: Investigation and diagnosis of sleep apnoea syndrome. In: McNicholas, W.T. (Ed.): Respiratory Disorders during Sleep. European Respiratory Society Journals, Sheffield (1998) 106-143
22. Ultsch, A.: Knowledge Extraction from Self-organizing Neural Networks. In: O. Opitz, B. Lausen, R. Klar (Eds.): Information and Classification, Berlin, Springer (1987) 301-306
23. Ultsch, A., Siemon, H.P.: Kohonen's Self-Organizing Neural Networks for Exploratory Data Analysis. In: Proc. Intl. Neural Network Conf. INNC90, Paris, Kluwer Academic (1990) 305-308
Identification in the Limit with Probability One of Stochastic Deterministic Finite Automata Colin de la Higuera and Franck Thollard EURISE, Université de Saint-Etienne, France www.univ-st-etienne.fr/eurise/cdlh.html www.univ-st-etienne.fr/eurise/thollard.html
Abstract. The current formal proof that stochastic deterministic finite automata can be identified in the limit with probability one makes use of a simplified state-merging algorithm. We prove in this paper that the Alergia algorithm, and its extensions, which may use some blue fringe type of ordering, can also identify distributions generated by stochastic deterministic finite automata. We also give a new algorithm enabling us to identify the actual probabilities, even though in practice, the number of examples needed can still be overwhelming.
Keywords: identification with probability one, grammatical inference, polynomial learning, stochastic deterministic finite automaton.
1 Introduction
Inference of deterministic finite automata (dfa) or of regular grammars is a favourite subject of grammatical inference. It is well known [Gold 67 & 78] that the class cannot be identified in the limit from text, i.e. from positive examples only. As the problem of not having negative evidence arises in practice, different options as to how to deal with the issue have been proposed. Restricted classes of dfa can be identified [Angluin 82, Garcia and Vidal 90], heuristics have been proposed and used for practical problems in speech recognition or pattern recognition [Lucas et al. 94], and stochastic inference has been proposed as a means to deal with the problem [Carrasco and Oncina 94, Stolcke and Ohomundro 94, Ron et al. 95]. Stochastic grammars and automata have been used for some time in the context of speech recognition [Rabiner and Juang 93, Ney 95]. Algorithms that (heuristically) learn a context-free grammar have been proposed (for a recent survey see [Sakakibara 97]), and other algorithms that compute probabilities for the rules (namely the forward-backward algorithm for Hidden Markov Models, which are close to stochastic finite automata, or the inside-outside algorithm for stochastic context-free grammars) have been realised [Rabiner and Juang 93]. But in the general framework of grammatical inference it is important to search for algorithms that not only perform well in practice, but that provably converge to the optimal solution, using only a polynomial amount of time. For the case of stochastic finite automata the problem has been dealt with by different authors: Stolcke and Ohomundro [94] learn stochastic deterministic finite automata through Bayes minimisation, Carrasco and Oncina [94] through state merging techniques common to classical algorithms for the dfa inference problem.
Along the same lines, Ron et al. [95] learn acyclic stochastic automata, proving furthermore that under certain restrictions the inferred automaton is Probably Approximately Correct. The popular Alergia algorithm was proposed in 1994, but the formal proof of its convergence was not established. Heuristic improvements to Alergia have been added in [Young-Lai and Tompa 99], and more convincing arguments towards identification in the limit with probability one have been given in [Carrasco and Oncina 99], but for a less powerful version of Alergia, rlips (which does not use the accumulation of data to test compatibility between states). We prove in this paper that a wide family of algorithms (including Alergia) identifies stochastic automata in the limit with probability one; any prefix-preserving ordering of the exploration of the states is valid. We also provide a new algorithm for the identification of the probabilities, which empirically proves to require less data than previous ones and which, even when it does not identify, returns a satisfying estimation.
2 Preliminaries
The definitions and notations of this section are those of (Carrasco and Oncina 99). Strings will be denoted by letters u, v, ..., z and symbols by letters a, b, c. An alphabet is a finite non-empty set of distinct symbols. For a given alphabet Σ, the set of all finite strings of symbols from Σ is denoted by Σ*. The empty string is denoted by λ. A language L over Σ is a subset of Σ*. Given L, L̄ is the complement of L in Σ*. ℚ is the set of rational numbers. A stochastic deterministic finite automaton (sdfa) A = <Σ, Q, q_0, γ_A, δ> consists of a finite alphabet Σ of terminal symbols, a finite set Q of states with q_0 the initial state, a total transition function δ: Q×Σ → Q (so the automaton is complete; some transitions may have probability 0, all such null-probability transitions leading to some non-accepting state) and a probability function γ_A: Q×(Σ∪{λ}) → ℚ∩[0,1] (for practical purposes, only rational probabilities are used), such that
∀q ∈ Q, Σ_{a ∈ Σ∪{λ}} γ_A(q, a) = 1.
We define recursively
δ(q_i, λ) = q_i
δ(q_i, a·w) = δ(δ(q_i, a), w)
and the probability for a string to be generated by A starting at q_i is defined recursively by
P_A(q_i, λ) = γ_A(q_i, λ)
P_A(q_i, aw) = γ_A(q_i, a) · P_A(δ(q_i, a), w).
In case the sdfa contains no useless states (a state q is useless if it accepts no string, i.e. ∀w ∈ Σ*, P_A(q, w) = 0), it generates a distribution over Σ*. We can then also define recursively
P_A(q_i, awΣ*) = γ_A(q_i, a) · P_A(δ(q_i, a), wΣ*).
So P_A(q_i, Σ*) = 1 and a distribution is defined over Σ* from each state. The class of stochastic deterministic regular languages (sdrl) consists of all languages generated by stochastic deterministic finite automata. If L is generated by some sdfa A we will denote P_L(w|u) = P_A(q_0, uw) / P_A(q_0, uΣ*), and thus P_L(w) = P_L(w|λ) = P_A(q_0, w).
Let x ∈ Σ*; we denote by x̄ the class of x for an equivalence relation ≡_R over Σ*. We write x̄ = ȳ iff x ≡_R y. Relation ≡_R is a left congruence if ∀a ∈ Σ, x̄ = ȳ ⇒ the class of xa equals the class of ya. We denote by |≡_R| the number of classes (the index) of the relation. To simplify, the class x̄ shall alternatively be considered as the set of all strings equivalent to x: x̄ = {y ∈ Σ* / y ≡_R x}. Given an sdrl L, the equivalence relation ≡_L is defined as follows: x ≡_L y ⇔ ∀w ∈ Σ*, P_L(w|x) = P_L(w|y). Relation ≡_R is compatible with L iff it is finer than relation ≡_L, i.e. ∀x, y ∈ Σ*, x ≡_R y ⇒ x ≡_L y. A set of strings X is compatible with L if ∀x, y ∈ X, x ≡_L y. If X is compatible with L, P_L(w|X) = P_L(w|x) ∀x ∈ X.
From [Carrasco and Oncina 99] it follows that any sdrl L admits a minimal canonical automaton whose states are the equivalence classes of ≡_L; the transition function maps (x̄, a) to the class of xa, the initial state is λ̄, and γ_A(x̄, a) = P_L(xaΣ* | xΣ*). Two sdfa A and A′ are equivalent if they define the same probability distribution over Σ*, i.e. ∀w ∈ Σ*, P_A(q_0, w) = P_{A′}(q′_0, w).
A stochastic sample of a language L is an infinite sequence of strings generated according to the probability distribution P_L. We denote by S_n the sequence of the n first strings (not necessarily different). The number of occurrences in S_n of some x is denoted c_n(x), and ∀X ⊂ Σ*, c_n(X) = Σ_{x ∈ X} c_n(x).
We denote by t_n the set of all prefixes of strings in S_n. One can notice [Carrasco and Oncina 99] that |t_n| grows at most linearly with n, with probability one. Let f be a function N → X. If f(n) ≠ x ∈ X for only a finite number of values of n, we shall write "f(n) = x for a co-finite number of values of n".
Property: f(n) = x for a co-finite number of values of n iff ∃k ∈ N : ∀n ≥ k, f(n) = x. The proof is straightforward.
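To make the recursive definitions of δ and P_A above concrete, here is a small sketch of an sdfa as a Python structure. The example automaton and its probabilities are invented for illustration and are not taken from the paper.

```python
class SDFA:
    """Stochastic deterministic finite automaton A = <Sigma, Q, q0, gamma, delta>."""

    def __init__(self, delta, gamma, q0=0):
        self.delta = delta      # delta[(q, a)] -> next state
        self.gamma = gamma      # gamma[(q, a)] and gamma[(q, None)] for lambda
        self.q0 = q0

    def prob(self, w, q=None):
        """P_A(q, w): probability that A generates exactly w starting from state q."""
        q = self.q0 if q is None else q
        if not w:
            return self.gamma.get((q, None), 0.0)
        a, rest = w[0], w[1:]
        nxt = self.delta.get((q, a))
        if nxt is None:
            return 0.0
        return self.gamma.get((q, a), 0.0) * self.prob(rest, nxt)

# A two-state example over Sigma = {a, b}: from q0, 'a' loops with prob 1/2,
# 'b' moves to q1 with prob 1/4, stopping has prob 1/4; q1 only stops.
A = SDFA(delta={(0, "a"): 0, (0, "b"): 1},
         gamma={(0, "a"): 0.5, (0, "b"): 0.25, (0, None): 0.25, (1, None): 1.0})
print(A.prob("aab"))   # 0.5 * 0.5 * 0.25 * 1.0 = 0.0625
```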
We aim to prove in this paper that sdfa can be identified in the limit with probability one and using polynomial time by some algorithm A, namely:
Definition (polynomial identification in the limit with probability one). Given any sdrl L and some infinite sequence drawn according to L, with S_n denoting the subsequence of the n first elements and A(n) the distribution resulting from running algorithm A over S_n,
Prob[ ∃n ∈ N : ∀k ≥ n, A(k) = L ] = 1.
Moreover, A runs in time polynomial in |≡_L|, |Σ|, log n and |t_n|.
3 Identifying the Structure
Using an oracle (equiv) that allows testing equivalence between two classes, a generic algorithm for constructing the structure of the automaton is the following:
Algorithm 1 asi (automaton structure identification)
Input: ≡_R = I (the identity relation)
Output: the automaton structure (≡_R)
Red ← {λ̄}; Blue ← {ā, ∀a ∈ Σ};
while Blue ≠ ∅ do
  chose x̄ in Blue;
  if ∃ȳ in Red such that equiv(x̄, ȳ)
    then merge(x̄, ȳ); Blue ← Blue ∪ {x̄a, ∀a ∈ Σ} \ Red
    else Red ← Red ∪ {x̄}; Blue ← Blue ∪ {x̄a, ∀a ∈ Σ} \ {x̄}
end-while
Algorithm 2 merge
Input: x̄, ȳ, ≡_R
Output: ≡_R, the smallest left congruence containing ≡_R ∪ {(x, y)}.
Algorithm 3 equiv
Input: x̄, ȳ
Output: x ≡_L y, boolean
Equiv is obviously at this point a call to an oracle; function merge may seem difficult to implement, yet with the asi ordering it corresponds to the usual cascade merging [Oncina and Garcia 94, de la Higuera et al. 96]. Algorithm asi is written following the blue-fringe algorithm rules [Lang et al. 98]. The Red set contains established classes/states, the Blue set the border classes/states, i.e. those that can be accessed by reading one symbol from a Red state, but are not in Red. Such an algorithm is independent of the order in which prefixes are explored: function "chose" can be based on any a priori ordering (breadth first will lead to Alergia [Carrasco and Oncina 94]), or on evidence: x̄ can be the class for which most information is available [de la Higuera et al. 96, Lang et al. 98].
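A compact sketch of the red-blue loop above over prefix strings; equiv and merge are left as parameters, since in asi they are an oracle call and a cascade merge respectively. This is an illustration under simplifying assumptions, not the authors' code.

```python
def asi(alphabet, equiv, merge, choose):
    """Generic red-blue structure identification over prefix strings.

    equiv(x, y): oracle (or estimate) that x and y are equivalent classes;
    merge(x, y): record that x is merged into the red class y;
    choose(blue): pick the next blue prefix (any a priori ordering is valid).
    """
    red = {""}
    blue = set(alphabet)
    while blue:
        x = choose(blue)
        blue.discard(x)
        target = next((y for y in red if equiv(x, y)), None)
        if target is not None:
            merge(x, target)
        else:
            red.add(x)
            # In this string-level simplification, only promoted (red) prefixes
            # are expanded; the paper's class-level version also updates Blue
            # after a merge.
            blue |= {x + a for a in alphabet} - red
    return red

# Breadth-first ordering (the Alergia ordering) as the choose function:
# asi({"a", "b"}, equiv, merge,
#     choose=lambda blue: min(blue, key=lambda w: (len(w), w)))
```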
Proof of the convergence of asi. At each iteration a new edge or a new state is added to Red, or (exclusively) a state is suppressed from Blue. A new state is created only if no equivalent state exists. The invariant is that ≡_R is compatible with ≡_L at all times. The number of steps is thus at most (|Σ|+1)·|≡_L|.
In practice the oracle equiv has to be simulated; the estimation will be based on the learning sample.
Algorithm 4 asifs (automaton structure identification from a sample) is algorithm 1 (asi) where equiv(x̄, ȳ) is replaced by compatible_n.
Algorithm 5 compatible_n
Input: x̄, ȳ, ≡_R, S_n, α a function N → ]0, 0.5] (whose values shall be denoted α_n)
Output: Boolean, an estimation of x ≡_L y
for each z ∈ Σ* such that c_n(x̄zΣ*) ≠ 0 or c_n(ȳzΣ*) ≠ 0
  if different(c_n(x̄z), c_n(x̄zΣ*), c_n(ȳz), c_n(ȳzΣ*), α_n) then return false;
  for each a ∈ Σ
    if different(c_n(x̄zaΣ*), c_n(x̄zΣ*), c_n(ȳzaΣ*), c_n(ȳzΣ*), α_n) then return false
  end_for
end_for
return true
Algorithm 6 different
Input: f1, n1, f2, n2, α
Output: Boolean
return |f1/n1 − f2/n2| < (√(1/n1) + √(1/n2)) · √((1/2)·log(2/α))
Remark. The test used in algorithm different is based on Hoeffding's test (the additive form of the Chernoff bounds). Let f/n be the observed frequency of a Bernoulli variable with probability p; then with probability at least 1−α we have
|p − f/n| < √((1/(2n))·log(2/α)).
From the previous result it follows that, with probability at least (1−α)², if f1/n1 and f2/n2 are observed frequencies of the same Bernoulli variable (but on independent draws!)
then
|f1/n1 − f2/n2| < (√(1/n1) + √(1/n2)) · √((1/2)·log(2/α)).
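A small sketch of this Hoeffding-style proximity check on two observed frequencies; it only implements the bound above and is not the authors' code.

```python
import math

def hoeffding_close(f1, n1, f2, n2, alpha):
    """True iff |f1/n1 - f2/n2| is below the combined Hoeffding bound,
    i.e. the two observed frequencies are statistically indistinguishable
    at confidence parameter alpha."""
    bound = (math.sqrt(1.0 / n1) + math.sqrt(1.0 / n2)) * math.sqrt(0.5 * math.log(2.0 / alpha))
    return abs(f1 / n1 - f2 / n2) < bound

# Example: 48 successes out of 100 draws vs 260 out of 500.
print(hoeffding_close(48, 100, 260, 500, alpha=0.05))
```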
Proof of the identification in the limit with probability 1 of algorithm 4 asifs. There are two possible errors:
• the optimistic error: two states are wrongly merged;
• the pessimistic error: some necessary merge is not made.
We want to measure the event "there are no errors": although a pessimistic error can still lead to a correct automaton (equivalent to the target, but with more states), this event will not be taken into account, so the results are safe. Take some different test. It is proven in [Carrasco and Oncina 99] that the test is correct with probability at least (1−α_n)². This is only true if the drawing protocol is rigorously followed: a Bernoulli variable, and independent draws. Clearly this is not the case here: the strings we are using could have been used before in some other test, whose result can influence the new test. Let us take a simple example to make this point clear. Let the goal be the stochastic language {aa, ab, ba, bba} with P_L(aa) = P_L(ab) = P_L(ba) = P_L(bba) = 0.25. Take a large sample and apply algorithm asifs. After two steps the classes in Red are λ̄ and ā, so we are interested in class b̄ (from Blue). After compatibility between λ̄ and b̄ fails, compatibility between ā and b̄ is tested: different(c_n(āaΣ*), c_n(āΣ*), c_n(b̄aΣ*), c_n(b̄Σ*), α_n) returns true. What about the result of different(c_n(ābΣ*), c_n(āΣ*), c_n(b̄bΣ*), c_n(b̄Σ*), α_n)?
*
*
*
Lemma 1. ∀z∈Σ*, ∀a∈Σ, X and Y⊂Σ* compatible with L, and Sn a finite stochastic sample of L. * * * * If different(cn(XzaΣ ), cn(XzΣ ), cn(YzaΣ ), cn(YzΣ ), αn ) = [PL(aXz)=PL (a/Yz)] or * * different(cn(Xz), cn(XzΣ ), cn(Yz), cn(YzΣ ), αn) = [PL(λ/Xz)=PL (λ/Yz)] then with probability at least 1-αn ∀ β>0, ∀ x∈Σ*, ∀a∈Σ∪{λ},
PL (a x) −
cn ( xa) < cn ( xΣ* )
2 1 log β 2cn ( xΣ* )
If we admit the lemma (which merely tells us that the Hoeffding test is likely to be valid also with correct prior knowledge) all the tests can be considered (as a worse case) as independent and the probability that some test different fails is less than 2αn. Also by the above lemma the probability that one of them succeeds for the wrong
Identification in the Limit with Probability
147
reasons (and thus may affect the rest of the computation) is also less than 2αn. For a test compatiblen at most 2tn tests different are performed. The probability that one of them fails or is wrongly successful is thus less than 8αntn. We remember that the size tn cannot (with probability one) rise faster than n. 1 By choosing αn < 2 +ε , ε>0, the number of values of n for which the test n compatiblen fails is finite (with probability one): using the Borel-Cantelli lemma 8nα n <∞, giving the result. [Feller 50],
∑ n
The number of tests compatiblen is finite (at most Q.(Q+1)), where Q is the set of states of the target automaton), and independent from n. The previous result is valid for a finite number of tests; with probability 1 all tests will give a correct result, but for a finite number of values of n. Proof of lemma 1. If different(cn(XzaΣ ), cn(XzΣ ), cn(YzaΣ ), cn(YzΣ ), αn)) = PL(aXz)=PL(a/Yz)4 then one of the following cases holds: • Either the observed frequencies are close and correspond to a same probability • Or they are different and correspond to different probabilities. Take the first case, we have two sub-cases: both frequencies are close to the probability, or both are far away (out of the confidence interval for the Hoeffding test) but close one to the other. (1 − α n ) 2 The first sub-case is clearly more probable than the second one: (1 − α n ) 2 + Kα n 2 *
against
*
*
*
Kα 2
where K is smaller than 1 (K is the probability that the (1 − α n ) 2 + Kα n 2 frequencies are close to each other whilst being far from the probability). A (very loose) bound is
Kα 2 (1 − α n ) 2 + Kα n 2
<αn, so the probability of being in the wrong case
(correct different test for the wrong reasons) is at least 1-αn. In this case the probability that some other frequency related to this one is close to the probability is necessarily higher than the probability with no knowledge, which is bounded by the Hoeffding test. The second case is solved in the same way.
4 Identifying a Probability The question of identifying a set of rational probabilities has not been that much addressed in the literature. Angluin gives some arguments in an unpublished technical report [Angluin 88], without entering the algorithmic implications of distribution 4
The proof for the case different(cn(Xz), cn(XzΣ ), cn(Yz), cn(YzΣ ), αn ) = PL(λ/Xz) = PL (λ/Yz) is equivalent. *
*
148
C. de la Higuera and F. Thollard
identifications. Carrasco also addresses the problem in his phD dissertation [Carrasco 97] but uses an enumeration technique to identify each probability. It can be claimed that under Gold’s result that you cannot do better than enumeration, the only hope is to obtain a faster algorithm, but it is unclear that Gold’s result holds for identifications of distributions. It would nevertheless be of interest to have good algorithms to identify the probabilities of a sdfa: • identified probabilities are simpler than the estimations. Automata have the advantage of being readable. This advantage can be lost with edges labelled 14273/28545 (instead of a more logical 1/2). • identifying the probabilities will allow the automaton to be robust to incrementation. Adding a new string to the training set would not necessarily lead to changing all the probabilities. We give in this section an algorithm that identifies any individual probability (with probability one) and which is faster than enumeration. Definitions : Stern-Brocot tree [Stern 1858, Brocot 1860, Graham et al. 89] Two integers n and m are prime (denoted n⊥m), if 1 is their smallest common divider. m is in normal form (simple fraction) if n⊥m. A fraction n m m′ m m′ and be 2 simple fractions, the median of and is the fraction Let ′ n n n n′ m + m′ 5 . n + n′ m m′ m m + m ′ m′ < , then < < Remark : if ′ n n n n + n ′ n′ The Stern-Brocot tree allows us to construct and represent all simple fractions in the following way: 0 1 and 6, and order them in set PR. a) Consider the two simple fractions 1 0 b) Iterate : construct all the medians of any two consecutive fractions from PR and add them to PR 0 +1 1 = At step two we construct 1+ 0 1 3 1 2 3 1 2 Then and , followed by , , and etc... 2 1 1 3 3 2
5 Notice that the median corresponds to the fraction obtained by multiplying each fraction by its weight (its denominator) and dividing by the sum of these weights.
6 This fraction is solely used for the construction.
The construction can be seen as the building of a binary tree, rooted between the fractions 0/1 and 1/0. Its first levels contain 1/1; then 1/2 and 2/1; then 1/3, 2/3, 3/2 and 3/1; then 1/4, 2/5, 3/5, 3/4, 4/3, 5/3, 5/2 and 4/1.
m′/n′ and m/n are on the same branch of the Stern-Brocot tree if m′/n′ can be obtained from m/n by applying the median computation in one or more steps; in this case we will say that m/n is simpler than m′/n′, and write m/n ∠ m′/n′.
x/y is the closest common ancestor of m/n and m′/n′ if:
(i) x/y ∠ m/n,
(ii) x/y ∠ m′/n′,
(iii) x′/y′ ∠ m/n and x′/y′ ∠ m′/n′ ⇒ x′/y′ ∠ x/y.
We denote by SB(a,b) the set of all values on the branch leading to a/b in the Stern-Brocot tree: x/y ∈ SB(a,b) ⇔ x/y ∠ a/b. SB(a,b) can be computed in polynomial time (in O(b)):

Algorithm 7 Compute_SB
Input: a and b, two integers {a ≤ b, b ≠ 0}
Output: SB, a set of fractions belonging to the branch leading to a/b

x1/y1 ← 0/1; x2/y2 ← 1/0; SB ← { }; x/y ← (x1 + x2)/(y1 + y2);
If b ≠ 1 and a ≠ 0 then
While x/y ≠ a/b do
SB ← SB ∪ {x/y};
if a·y < b·x then x2/y2 ← x/y
else x1/y1 ← x/y;
x/y ← (x1 + x2)/(y1 + y2)
end_while
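The branch computation is easy to reproduce. The following Python sketch is our own illustration (the function and variable names are not from the paper); it follows the binary-search scheme of Algorithm 7 with exact integer arithmetic.

from fractions import Fraction

def compute_sb(a, b):
    # Fractions on the Stern-Brocot branch leading to a/b (its strict ancestors),
    # listed from the simplest (1/1) down towards a/b.
    assert 0 < a <= b
    target = Fraction(a, b)
    (x1, y1), (x2, y2) = (0, 1), (1, 0)      # current left and right bounds
    branch = []
    x, y = x1 + x2, y1 + y2                  # median of the two bounds
    while Fraction(x, y) != target:
        branch.append(Fraction(x, y))
        if a * y < b * x:                    # a/b lies to the left of x/y
            x2, y2 = x, y
        else:                                # a/b lies to the right of x/y
            x1, y1 = x, y
        x, y = x1 + x2, y1 + y2
    return branch

# For instance, compute_sb(5, 8) yields [1/1, 1/2, 2/3, 3/5], the ancestors of 5/8.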
Given some Bernoulli variable x with (simple) probability a/b, the result after n draws is denoted c_n(x).

Lemma 2. With probability 1, for a co-finite number of values of n, a/b ∠ c_n(x)/n.
Proof. Suppose that a/b ∠ c_n(x)/n does not hold and that c_n(x)/n ≠ a/b; in each of the three possible cases we get |c_n(x)/n − a/b| ≥ 1/b².

Case 1: c_n(x)/n ∠ a/b and c_n(x)/n ≠ a/b: the normal form of c_n(x)/n then has a denominator at most b, so |c_n(x)/n − a/b| ≥ 1/b².

Suppose now that neither fraction is simpler than the other, and let x/y be their closest common ancestor.

Case 2: c_n(x)/n < x/y < a/b: then bx < ay, hence a/b − x/y = (ay − bx)/(by) ≥ 1/(by) ≥ 1/b² (b ≥ y, since x/y ∠ a/b), and a/b − c_n(x)/n > a/b − x/y ≥ 1/b².

Case 3: a/b < x/y < c_n(x)/n: then ay < bx, hence x/y − a/b = (bx − ay)/(by) ≥ 1/(by) ≥ 1/b² (b ≥ y, since x/y ∠ a/b), and in the same way c_n(x)/n − a/b ≥ 1/b².

The law of the iterated logarithm tells us that, with probability one, for a co-finite number of values of n, |c_n(x)/n − a/b| < √(log log n / n). Take n such that √(log log n / n) < 1/b² to get the result in all 3 cases. (Remark: √(log log n / n) is a decreasing function of n, for n large enough.)
Corollary 1. With probability 1, for a co-finite number of values of n, a/b belongs to SB(c_n(x), n).
This allows us to consider the following algorithm:

Algorithm 8 Compute_Probability
Input: m and n, two integers {m ≤ n, n ≠ 0}
Output: a simple probability a/b for m/n

Compute SB(m, n);
return a/b, the simplest element of SB(m, n) such that |m/n − a/b| < √(log log n / n).
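As an illustration, here is a minimal Python rendering of Algorithm 8 (again our own sketch, with names that are not from the paper); it walks the Stern-Brocot branch from the simplest ancestor downwards and returns the first fraction within the iterated-logarithm threshold. We use natural logarithms, since the paper does not fix the base.

import math
from fractions import Fraction

def compute_probability(m, n):
    # Return the simplest fraction on the Stern-Brocot branch of m/n whose
    # distance to m/n is below sqrt(log log n / n); otherwise return m/n itself.
    assert 0 < m <= n and n >= 3
    target = Fraction(m, n)
    threshold = math.sqrt(math.log(math.log(n)) / n)
    (x1, y1), (x2, y2) = (0, 1), (1, 0)
    x, y = 1, 1                                   # root of the Stern-Brocot tree
    while Fraction(x, y) != target:
        if abs(Fraction(x, y) - target) < threshold:
            return Fraction(x, y)                 # simplest acceptable ancestor
        if m * y < n * x:                         # m/n lies to the left of x/y
            x2, y2 = x, y
        else:
            x1, y1 = x, y
        x, y = x1 + x2, y1 + y2
    return target

# compute_probability(3, 8) returns Fraction(1, 2): with few draws the threshold
# is large, so a very simple fraction is preferred. compute_probability(2500, 4000)
# returns Fraction(5, 8).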
Proposition 1. Let a′/b′ be a fraction simpler than a/b. With probability one, a′/b′ can only be a candidate for a finite number of values of n when calling Compute_Probability(c_n(x), n).

Proof. Suppose this is false; then for an infinite number of values of n, |c_n(x)/n − a′/b′| < √(log log n / n). But for a co-finite number of values of n, |c_n(x)/n − a/b| < √(log log n / n). Then |a/b − a′/b′| < 2·√(log log n / n) for an infinite number of values of n, which is impossible.
Theorem 1. With probability 1, for a co-finite number of values of n, Compute_Probability(c_n(x), n) = a/b.

Proof: the number of fractions simpler than a/b is finite. Each one can be proposed only a finite number of times. The result then follows from the fact that a/b belongs to SB(c_n(x), n) a co-finite number of times.
5 Identifying an Sdfa

From section 3 it follows that the relation ≡L can be identified in the limit with probability one. The relation gives us the structure of the target sdfa. It remains to show that all the probabilities can be identified simultaneously.
Identification in the Limit with Probability
153
Algorithm 9 Identify_Probabilities
Input: ≡L, Sn
Output: an sdfa (≡L, PL)

For each transition (x̄, a, ȳ) do
Ploc(x̄, a) ← Compute_Probability(c_n(x̄aΣ*), c_n(x̄Σ*));
End_for
For each state x̄ do
Ploc(x̄, λ) ← 1 − Σ_{a∈Σ} Ploc(x̄, a);
If Σ_{a∈Σ} Ploc(x̄, a) ≤ 1
then for each transition (x̄, a, ȳ) do
PL(x̄, a) ← Ploc(x̄, a); PL(x̄, λ) ← Ploc(x̄, λ)
else for each transition (x̄, a, ȳ) do
PL(x̄, a) ← c_n(x̄aΣ*)/c_n(x̄Σ*); PL(x̄, λ) ← c_n(x̄)/c_n(x̄Σ*)
End_for

Algorithm Identify_Probabilities checks whether the probabilities are consistent (whether they can sum up to 1). If that is the case the Stern-Brocot fractions are used; if not, the empirical distributions are used.

Corollary 2. Let L be a sdrl; given ≡L, L can be polynomially identified in the limit with probability 1.

Proof. Each individual probability can be identified with probability 1. The number of these probabilities is finite (and polynomial in the size of the target). An error is possible only for a finite number of draws. All algorithms are polynomial.
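The consistency check and the fallback to the empirical estimate are easy to state in code. The sketch below is our own illustration with hypothetical variable names: for one state it takes the counts c_n(x̄aΣ*), c_n(x̄Σ*) and c_n(x̄) together with the already-identified fractions Ploc(x̄, a), and applies the rule of Algorithm 9.

from fractions import Fraction

def assign_state_probabilities(ploc, count_xa, count_x_all, count_x_exact):
    # ploc:          dict a -> Fraction returned by Compute_Probability
    # count_xa:      dict a -> c_n(x a Sigma*)
    # count_x_all:   c_n(x Sigma*)
    # count_x_exact: c_n(x), number of strings ending in this state
    total = sum(ploc.values(), Fraction(0))
    if total <= 1:                                   # consistent: keep the identified fractions
        p = dict(ploc)
        p["lambda"] = 1 - total
    else:                                            # inconsistent: fall back to empirical estimates
        p = {a: Fraction(c, count_x_all) for a, c in count_xa.items()}
        p["lambda"] = Fraction(count_x_exact, count_x_all)
    return p

# Example: ploc = {"a": Fraction(1, 2), "b": Fraction(1, 3)} sums to 5/6 <= 1,
# so the identified fractions are kept and p["lambda"] = 1/6.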
6 Experimental Work and Discussion

The purpose of this article is to prove that Alergia, and a blue-fringe extension of Alergia, are well founded and converge (under the identification in the limit with probability one paradigm), so only simple experiments have taken place. The best blue-fringe strategy, and the choice of a better different test, remain open problems. We nevertheless wished to find out whether the Stern-Brocot technique could be successfully employed. Whilst the idea of simpler probabilities, and of probabilities that resist incrementing the learning sample, is attractive, does it stand up to testing?
We have used here two fractions proposed by [Carrasco and Oncina 99]: 5/8 and 31/50. Results are presented in Figures 1 and 2. To measure the distance between the returned result and the correct one we use the Kullback-Leibler divergence. Let p1 and p2 be two probabilities. The Kullback-Leibler divergence of p2 with respect to p1 is d_KL(p1, p2) = p1·log(p1/p2).

Algorithm 8 (Compute_Probability) identifies the correct probability quickly, for a good value of λ. When identification is not obtained, the approximation is in the range of the empirical estimation. Clearly the choice of λ is important: a value of λ close to 1 makes it possible to find a simple probability fast, but the error can be large. Choosing λ small (0.01), identification takes place later, but the approximation is better. Identification of 31/50 is not achieved within 100 000 throws.

A third technique was also tested (ML in Figures 1 and 2), intended to use the Stern-Brocot tree without the iterated logarithm, based on a maximum-likelihood criterion. The algorithm finds, on the Stern-Brocot branch leading to the observed frequency, the fraction f maximizing the product P_{a/b}(f)·Pr(f), with P_{a/b}(f) = C(b,a)·f^a·(1 − f)^(b−a) and Pr(f) depending on the length of the encoding of f by its Stern-Brocot path, i.e. 2^(−prof(f)), where prof(f) is the length of this path. Algorithmically the logarithm of P_{a/b}(f)·Pr(f) is maximized, so the binomial coefficient C(b,a) becomes a constant and does not need to be computed.
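A small Python sketch of this maximum-likelihood selection is given below; it is our own illustration (function names ours, and the exact depth convention for prof(f) is an assumption). It scores every fraction on the Stern-Brocot branch of the observed frequency by a·log f + (b − a)·log(1 − f) − prof(f)·log 2 and keeps the best.

import math
from fractions import Fraction

def ml_probability(a, b):
    # Pick, on the Stern-Brocot branch of a/b, the fraction maximizing the log
    # of f^a (1-f)^(b-a) * 2^(-prof(f)); the binomial coefficient is constant.
    assert 0 < a < b
    target = Fraction(a, b)
    (x1, y1), (x2, y2) = (0, 1), (1, 0)
    x, y = 1, 1
    best, best_score, depth = target, -math.inf, 0
    while True:
        f = Fraction(x, y)
        if 0 < f < 1:
            score = a * math.log(f) + (b - a) * math.log(1 - f) - depth * math.log(2)
            if score > best_score:
                best, best_score = f, score
        if f == target:
            return best
        if a * y < b * x:
            x2, y2 = x, y
        else:
            x1, y1 = x, y
        x, y = x1 + x2, y1 + y2
        depth += 1

# ml_probability(5, 8) returns Fraction(1, 2): with only 8 draws the prior
# strongly favours simple fractions.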
[Figure: Kullback-Leibler divergence (KL, logarithmic scale) against the number of draws n, for the estimation, iterated logarithm and ML methods.]
Fig. 1. Identifying 5/8.
[Figure: Kullback-Leibler divergence (KL, logarithmic scale) against the number of draws n, for the estimation, iterated logarithm and ML methods.]
Fig. 2. Identifying 31/50.
References

D. Angluin, "Inference of reversible languages", Journal of the ACM 29 (3), 741-765, 1982.
D. Angluin, "Identifying Languages from Stochastic Examples", Yale University technical report, YALEU/DCS/RR-614, March 1988.
A. Brocot, "Calcul des rouages par approximation, nouvelle méthode", Revue Chronométrique 6, 186-194, 1860.
R. Carrasco and J. Oncina, "Learning Stochastic Regular Grammars by Means of a State Merging Method", in Grammatical Inference and Applications, Proceedings of ICGI '94, Lecture Notes in Artificial Intelligence 862, Springer Verlag, 139-150, 1994.
R. Carrasco, "Inferencia de lenguajes racionales estocasticos", PhD thesis, Universidad de Alicante, 1997.
R. Carrasco and J. Oncina, "Learning deterministic regular grammars from stochastic samples in polynomial time", Informatique Théorique et Applications 33/1, 1-19, 1999.
W. Feller, "An introduction to probability theory and its applications", John Wiley and Sons, New York, 1950.
P. García and E. Vidal, "Inference of K-testable Languages in the Strict Sense and Applications to Syntactic Pattern Recognition", IEEE Transactions on Pattern Analysis and Machine Intelligence 12/9, 920-925, 1990.
M. E. Gold, "Language Identification in the Limit", Information and Control 10 (5), 447-474, 1967.
M. E. Gold, "Complexity of automaton identification from given data", Information and Control 37, 302-320, 1978.
R. Graham, D. Knuth and O. Patashnik, "Concrete Mathematics", Addison Wesley, 1989.
C. de la Higuera, J. Oncina and E. Vidal, "Identification of DFA: data-dependent versus data-independent algorithms", in Grammatical Inference: Learning Syntax from Sentences, Proceedings of ICGI '96, Lecture Notes in Artificial Intelligence 1147, Springer Verlag, 313-325, 1996.
K. Lang, B.A. Pearlmutter and R.A. Price, "Results of the Abbadingo One DFA Learning Competition and a New Evidence-Driven State Merging Algorithm", in Grammatical Inference, Proceedings of ICGI '98, Lecture Notes in Artificial Intelligence 1433, Springer Verlag, 1-12, 1998.
S. Lucas, E. Vidal, A. Amiri, S. Hanlon and J-C. Amengual, "A comparison of syntactic and statistical techniques for off-line OCR", Proceedings of the International Colloquium on Grammatical Inference ICGI-94, Lecture Notes in Artificial Intelligence 862, Springer-Verlag, 168-179, 1994.
H. Ney, "Stochastic grammars and Pattern Recognition", in Speech Recognition and Understanding, edited by P. Laface and R. de Mori, Springer-Verlag, 45-360, 1995.
Y. Sakakibara, "Recent Advances of Grammatical Inference", Theoretical Computer Science 185, 15-45, 1997.
M.A. Stern, "Über eine zahlentheoretische Funktion", Journal für die reine und angewandte Mathematik 55, 193-220, 1858.
A. Stolcke and S. Omohundro, "Inducing Probabilistic Grammars by Bayesian Model Merging", in Grammatical Inference and Applications, Proceedings of ICGI '94, Lecture Notes in Artificial Intelligence 862, Springer Verlag, 106-118, 1994.
L. Rabiner and B.H. Juang, "Fundamentals of Speech Recognition", Prentice-Hall, 1993.
D. Ron, Y. Singer and N. Tishby, "On the Learnability and Usage of Acyclic Probabilistic Finite Automata", Proceedings of COLT 1995, 31-40, 1995.
M. Young-Lai and F.W. Tompa, "Stochastic Grammatical Inference of Text Database Structure", to appear in Machine Learning, 2000.
Iterated Transductions and Efficient Learning from Positive Data: A Unifying View Satoshi Kobayashi Dept. of Information Sciences, Tokyo Denki Univ. Ishizaka, Hatoyama-machi, Hiki-gun, Saitama 350-0394, JAPAN e-mail:[email protected]
Abstract. In this paper, we will not focus on a learnability result for some specific language class, but on giving a set of language classes each of which is efficiently learnable in the limit from positive data. Furthermore, the set contains, as examples, the class of k-reversible languages and the class of k-locally testable languages in the strict sense. This paper also proposes a framework for defining language classes based on iterated transductions. We believe that the framework is quite adequate for theoretically investigating the classes of languages which are efficiently learnable from positive data.
1 Introduction
Learning in the limit is one of the mathematical frameworks proposed by Gold in order to analyze computationally the limiting behavior of various learning tasks ([Gol67]). In particular, this paper is concerned with learning from positive data, where the learning device receives only positive information about the target languages ([Ang80]). From a practical point of view, it is important to study, for a given target language class C, whether there exists an efficient learning device which learns C. The class k-REV of k-reversible languages ([Ang82]) and the class k-LT S of k-locally testable languages in the strict sense ([MP71][LR80]) are important and famous examples of such efficiently learnable language classes. These language classes have had much impact on applications of learning theory ([Gar90][YIK94][YK98]) and on the theoretical study of other efficiently learnable language classes ([Yok95][KMT95]). Some of these works are related to subclasses of regular languages. However, further research on efficient learnability from positive data for subclasses of regular languages remains to be done. In particular, it is a challenging research topic to give a unifying view of such efficiently learnable subfamilies of regular languages. In this paper, we will not focus on a learnability result for some specific language class, but on giving a set of language classes each of which is efficiently learnable in the limit from positive data. Furthermore, the set contains k-REV and k-SLT as example language classes.
This paper also proposes a framework for defining language classes based on iterated transductions. It is not a new idea to use iterated transductions to define language classes. For example, a phrase structure grammar can be regarded as such an example, because we can view each application of a production rule r : X → Y as an application of a transduction f that replaces an occurrence of X by Y, and the language can be defined as the intersection of the set of terminal strings with a fix-point of the transduction starting from the start symbol S. Thus, a transduction generates a language, and a set of transductions generates a language class. There are also other works which propose frameworks for defining language classes based on iterated transductions ([MMP98][Woo76]); these focus on the generative capability of the systems. The framework in this paper, on the other hand, is used for the analysis of efficient learnability from positive data. Furthermore, it uses only a fixed transduction for defining a language class, and the transduction does not start from a start symbol, but may start from any finite set of terminal strings. We believe that the proposed framework is quite adequate for theoretically investigating the classes of languages which are efficiently learnable from positive data. This sort of definition can also be found in the theory of splicing systems ([Hea87] [Pix96]) and in other studies focusing on the iterated application of transductions inspired by molecular reactions ([KMPR]). In fact, this paper is partially motivated by the closure results for regular languages under some molecular reactions discussed in those works. As a mechanism for defining a transduction over strings, we will introduce a system, called an f-pattern system, which is closely related to the elementary formal systems (EFS) proposed by [Smu61]. Although elementary formal systems are well studied in the context of learnability from positive data ([ASY92] [Shi94]), those results depend on the restriction that substitution by the empty string is not allowed. In this paper, we follow Smullyan's original definition of EFS, where substitution by the empty string is allowed. In particular, we will introduce the class of f-pattern systems of type I, and investigate the closure property of regular languages under the transductions defined by them. Since the closure result is based on a constructive proof, it can be applied to the learning problem of the proposed class of languages. After giving some fundamental definitions and notations, in Section 3 we propose a framework for defining language classes using the concept of transductions, and show its relation to learning from positive data. Furthermore, we will introduce f-pattern systems of type I and consider the class of transductions defined by them. Section 4 gives a closure result for regular languages under the proposed transduction, the proof of which is constructive and can thus be used to show the efficient learnability results of the proposed language classes in Section 5. Finally, we conclude in Section 6.
2 Preliminaries
Let Σ be a finite alphabet, and Σ ∗ be the set of all strings over Σ. By we denote the empty string. For a string x, by | x | we denote the length of x. A string x is a prefix of y if xz = y for some z ∈ Σ ∗ . A string x is a suffix of y if zx = y for some z ∈ Σ ∗ . By prfk (x) we denote the prefix of x of length k if | x |≥ k (In case that | x |< k, it denotes x itself). By sufk (x) we denote the suffix of x of length k if | x |≥ k (In case that | x |< k, it denotes x itself). A subset of Σ ∗ is called a language over Σ and a set of languages over Σ is called a language class over Σ. By REG, we denote the class of regular languages. An infinite sequence w1 , w2 , ... of strings in a language L is called a positive presentation of L if {wi | i ≥ 1} = L. Let Γ be a finite alphabet disjoint from Σ. We assume that there exists some (possibly partial) function which maps a string over Γ to a language over Σ. The string r which is mapped to a language L is called a representation of L, or we say that r represents L. The language represented by r is denoted by Lr . A set R of strings over Γ is called a representation class for a language class C if (1) C = {Lr | r ∈ R} and (2) there exists a recursive function f : R × Σ ∗ → {0, 1} such that f (r, w) = 1 if and only if w ∈ Lr . A learning machine is a computational device which accepts a string over Σ from time to time and outputs a representation of a language from time to time. We say that a learning machine M learns L ∈ C in the limit from positive data using R if for any input of positive presentations of L, M outputs an infinite sequence g1 , g2 , ... of representations in R which converges to some g such that Lg = L. The language class C is said to be learnable in the limit from positive data using R if there exists M such that M learns every language in C in the limit from positive data using R. A learning machine M learns L ∈ C in the limit from positive data using R with the conjectures updated in polynomial time if for any positive presentation w1 , w2 , ... of L, 1. 2. 3. 4.
M outputs a conjecture gi in R whenever it receives the ith string wi , the sequence g1 , g2 , ... converges to some g with Lg = L, if gi 6= gi+1 then wi+1 6∈Lgi , the time T (i) for updating conjectures from gi−1 to gi (possibly gi−1 = gi ) at each stage i is bounded by some polynomial function with respect to i | wj |. Σj=1
Pitt discusses various problems which arise when defining efficient learnability in the limit ([Pit89]). In this paper, we followed the above definition, which has been implicitly used for obtaining various results on the learnability from positive data ([Ang82] [Gar90][HKY98][KMT95][Shi83][YK98]). Let C be a language class which has a representation class R. A finite subset F of a language L ∈ C is called a characteristic sample of L with respect to C
if for any L0 ∈ C, F ⊆ L0 implies L ⊆ L0 . A language class C is said to have characteristic samples if for any language L in C, there exists a characteristic sample of L with respect to C. Theorem 1. [Kob96] A language class C with a representation class R is learnable in the limit from positive data using R if C has characteristic samples. u t A labeled directed graph (LDG) G is a triple G = (S, Σ, E), where S is a finite set of vertices, Σ is a finite alphabet, and E is a subset of S ×(Σ ∪{})×S whose elements are called edges. For two LDGs G1 = (S1 , Σ, E1 ) and G2 = (S2 , Σ, E2 ), we define: G1 ∪ G2 = (S1 ∪ S2 , Σ, E1 ∪ E2 ). A sequence p1 e1 p2 e2 · · · en−1 pn with pi ∈ S (1 ≤ i ≤ n) and ei ∈ E (1 ≤ i ≤ n − 1) is said to be a path from p1 to pn of G with a string w = a1 · · · an−1 if for any i with 1 ≤ i ≤ n − 1, ei = (pi , ai , pi+1 ) ∈ E holds. For a path p, we write l(p) = w, if p is a path with a string w. A finite automaton is a triple A = (G, I, F ), where G is an LDG, and I and F are subsets of the vertex set of G. A path from an element of I to an element of F is called an accepting path of A. A language accepted by A is the set of strings defined as follows: L(A) = {w ∈ Σ ∗ | there is an accepting path of A with w} Note that in our definition, a finite automaton is allowed to have nondeterministic transitions and -moves. For two automaton A1 = (G1 , I1 , F1 ) and A2 = (G2 , I2 , F2 ), we define: A1 ∪ A2 = (G1 ∪ G2 , I1 ∪ I2 , F1 ∪ F2 ). An arrow with label w = a1 · · · an (ai ∈ Σ, i = 1, ..., n ), written arw(w), is an isolated component of an automaton defined by arw(w) = (G, ∅, {vn }), where G = (S, Σ, E), S = {v0 , ..., vn }, E = {(vi−1 , ai , vi ) | 1 ≤ i ≤ n}. Since the set of initial vertices are empty, there is no accepting path in arw(w). This component will be used to obtain closure results in section 4. Let L be a regular language. A vertex v of a finite automaton is said to have k-suffix property if for any paths p1 , p2 from an initial vertex to v, sufk (l(p1 )) = sufk (l(p2 )) holds. A finite automaton has k-suffix property if each of its vertices has k-suffix property. Then, it is not difficult to see that for any nonnegative integer k, there exists a finite automaton A such that L = L(A) and A has k-suffix property. For an automaton A = ((S, Σ, E), I, F ) containing -edges and a vertex q in A, by (q), we denote the set of vertices q 0 such that (q, , q 0 ) ∈ E, i.e. (q) is the set of vertices adjacent from q by an -edge.
3 Language Classes Defined by Iterated Transductions
A transduction γ of n-dimension over Σ ∗ is a subset of (Σ ∗ )n+1 . Let γ be a transduction of n-dimension over Σ ∗ and L be a language over Σ. Then, γ(L) is defined to be the set: γ(L) = {w | ∃ w1 , ..., wn ∈ L such that (w, w1 , w2 , ..., wn ) ∈ γ} Furthermore, we define inductively: γ 0 (L) = L, γ i (L) = γ(γ i−1 (L)) ∪ γ i−1 (L) [ γ i (L). γ ∗ (L) =
for i ≥ 1,
i≥0
For each transduction γ over Σ ∗ , we associate the language class : D(γ) = {γ ∗ (F ) | F is a finite subset of Σ ∗ } The following proposition is straightforward. Proposition 2. For any transduction γ, D(γ) has characteristic samples, thus learnable in the limit from positive data. t u Let Σ be a finite alphabet, V be a countable set of variables, and F be a countable set of function symbols such that each element f of F is associated with a function fˆ : Σ ∗ → Σ ∗ . By F(V ), we denote the set {f (X) | f ∈ F, X ∈ V }. We can regard V and F(V ) as countable alphabets. Thus, in the sequel, we often regard elements X ∈ V and f (X) ∈ F(V ) as single letters. An f-pattern is a non-empty string over Σ ∪ V ∪ F(V ). A substitution θ is a mapping from V to (V ∪ Σ)∗ . In case that θ maps every element of V to an element of Σ ∗ , we say that θ is a ground substitution. By [p1 : X1 , ..., pk : Xk ], where pi ∈ (V ∪ Σ)∗ and Xi ∈ V (1 ≤ i ≤ k), we denote a substitution that maps Xi to pi for i = 1, ..., k and maps every other element of X ∈ V to itself. For an f-pattern p and a substitution θ, we define: θ(X) if p is a variable X c if p is a symbol c ∈ Σ pθ ≡def ˆ f (θ(X)) if p is of the form f (X) for some X ∈ V p1 θ · p2 θ if p is of the form p1 p2 for some f-patterns p1 , p2 A rule of the form r : p0 ← p1 , ..., pn , where pi (i = 0, ..., n) are f-patterns, is called an f-pattern rule. The f-pattern p0 is called the head of r, and the segment p1 , ..., pn is called the body of r. A finite set P of f-pattern rules is called an fpattern system. For an f-pattern rule r : p0 ← p1 , ..., pn , we define a transduction: γr = {(p0 θ, ..., pn θ) ∈ (Σ ∗ )n+1 | θ is a ground substitution.}
For an f-pattern system P , we define γP = ∪r∈P γr . The f-pattern system has a close relation to the elementary formal system (EFS) introduced by Smullyan ([Smu61]). Actually, the f-pattern system could be regarded as a subclass of EFS. EFS is well studied in the context of learnability from positive data ([ASY92] [Shi94]), and some of those results are dependent on the restriction that the substitution by empty string should not be used 1 . In this paper, we follow Smullyan’s original definition of EFS, where the substitution by empty string is allowed. Furthermore, the proposed system is used to give a unifying view for important and efficiently learnable language classes proposed so far. Ex. 1 A string w over Σ is a constant relative to a language L if, whenever uwv and swt are in L, both uwt and swv are also in L. A language L is k-locally testable in the strict sense (k-LT S), for a non-negative integer k, if every string of length k in Σ ∗ is a constant relative to L. We say that L is locally testable in the strict sense (LT S) if it is k-LT S for some k. We can define the language class k-LT S using the following f-pattern system Pk : Pk = {XwW ← XwY, ZwW | w ∈ Σ ∗ , | w |= k}, (X, Y, Z, W ∈ V ). It is clear that D(γPk ) ⊆ k-LT S holds. On the other hand, it is known that kLT S has characteristic samples([Gar90]). Therefore, it is also straightforward to see that for every L ∈ k-LT S, there exists a finite set F such that L = γP∗ k (F ). Thus, we have D(γPk ) = k-LT S. t u Ex. 2 We take the characterization of the concept of a k-reversible language given by Angluin (Theorem 14 in [Ang82]) as the basis of our definition, but we prefer to give her definition in a slightly different form, which is given in [HKY98]. A string w over Σ is a semiconstant relative to a language L if, whenever uwv, swt, and uwt are in L, swv is also in L. This allows the following equivalent of Angluin’s definition. A regular language L is k-reversible (k-REV ), for a nonnegative integer k, if every string of length k in Σ ∗ is a semiconstant relative to L. We say that L is reversible if it is k-reversible for some k. Let us consider the following f-pattern system Qk : Qk = {ZwY ← XwY, ZwW, XwW | w ∈ Σ ∗ , | w |= k}, (X, Y, Z, W ∈ V ). It will be shown in the next section that D(γQk ) ⊆ REG holds in a general setting2 . Therefore, D(γQk ) ⊆ k-REV holds. Furthermore, it is known that kREV has characteristic samples([Ang82]). Therefore, it is also straightforward to ∗ (F ). see that for every L ∈ k-REV , there exists a finite set F such that L = γQ k t u Thus, we have D(γQk ) = k-REV . 1 2
In [Shi83], the learnability of the extended version of pattern languages is studied. This fact D(γQk ) ⊆ REG was also proved in [KY97].
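To make the definition of D(γ_P) concrete for the system P_k of Ex. 1, the following Python sketch (our own illustration, not from the paper) computes a length-bounded portion of γ*_{P_k}(F): the rule XwW ← XwY, ZwW says that whenever xwy and zwv belong to the language, so does xwv. The length bound only serves to keep the fixpoint computation finite.

def lts_closure(sample, k, max_len):
    # Bounded fixpoint of the k-LTS rule XwW <- XwY, ZwW (|w| = k):
    # from x+w+y and z+w+v in the language, derive x+w+v.
    lang = set(sample)
    changed = True
    while changed:
        changed = False
        new = set()
        for s1 in lang:
            for i in range(len(s1) - k + 1):
                x, w = s1[:i], s1[i:i + k]
                for s2 in lang:
                    for j in range(len(s2) - k + 1):
                        if s2[j:j + k] == w:
                            cand = x + w + s2[j + k:]
                            if len(cand) <= max_len and cand not in lang:
                                new.add(cand)
        if new:
            lang |= new
            changed = True
    return lang

# lts_closure({"aba"}, 1, 7) yields {"a", "aba", "ababa", "abababa"}.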
Ex. 3 Let us consider the following f-pattern system P over Σ = {a, b}: P = {XW ← XabY, ZbaW } ∪ {XW ← XacY, ZcaW }, (X, Y, Z, W ∈ V ). It is straightforward to see that γP∗ ({baaab, caaab, caaac}) = baaaa∗ b+caaaa∗ b+ caaaa∗ c, which is not reversible. Therefore, the class D(γP ) contains a nonreversible language. t u Ex. 4 Let us consider the following f-pattern system Q over Σ = {a, b}: Q = {XZ ← XY R , Y Z} where X R represents the reversal of X. (More precisely, we should write X R as f (X) using a function symbol f which is associated with the function reversing ∗ ({aba}) = (ab∗ a)∗ . From the the string.) Then, it is straightforward to see γQ result in the section 4, we will see that D(γQ ) is a subclass of REG. t u Ex. 5 The use of function symbols in an f-pattern often increases the generative capability of the system. Let us consider the following f-pattern system R over Σ = {a, b}: R = {XaaX R ← XX R } ∪ {XbbX R ← XX R }. ∗ ({}) = {wwR | w ∈ Σ ∗ }, which is not It is straightforward to see that γR t u regular. Therefore, the class D(γR ) contains a non-regular language.
An occurrence of a variable X in an f-pattern rule is said to be pure if it is not contained in any function symbols. The variable X in an f-pattern rule r is said to be completely pure in r if every occurrence of X in r is pure and there exists exactly one pure occurrence of X in the head of r, and exactly one pure occurrence of X in the body of r, respectively. A variable X of a rule r is said to be generative if X appears in the head of r. We say that X is nongenerative if it is not generative. For a function fˆ with f ∈ F and a language L, we define: fˆ−1 (L) = {w ∈ Σ ∗ | fˆ(w) ∈ L}. A function symbol f ∈ F with its associated function fˆ is said to be efficient if there exists an algorithm which for every input of a finite automaton A, computes a finite automaton A0 with L(A0 ) = fˆ−1 (L(A)) in polynomial time with respect to the number of vertices of A. For instance, a function taking the reverse of a given string, and a function taking a homomorphic image of a given string are efficient. An f-pattern rule r : p0 ← p1 , ..., pn is said to be of type I, if every function symbol f in r is efficient and p0 is either:
(a) of the form w for some w ∈ Σ ∗ , or (b) of the form XwY with w ∈ Σ ∗ X 6= Y ∈ V , where X and Y are completely pure in r, Xw is a prefix of some f-pattern in the body of r, and wY is a suffix of some f-pattern in the body of r. (In this case, the string w is called the center of r.) An f-pattern system P is said to be of type I if every f-pattern rule of P is of type I and every center of the rules of the form (b) has the same length. Note that all the examples in Ex.1, Ex.2, Ex.3, and Ex.4 are of type I.
4 Closure Results
In this section, we will prove the following theorem. Theorem 3. Let P be an f-pattern system of type I. Then, for any regular t u language L, γP∗ (L) is regular. The proof of this theorem is based on an effective and efficient construction of a finite automaton accepting γP∗ (L) for a given automaton which accepts L. As will be seen later in this section, the condition of efficiency of function symbols is required only for the efficient construction of the automaton. Let k be the length of centers of f-pattern rules in P . Let Pa and Pb be the set of all f-pattern rules in P of the form (a) and (b), respectively. We will describe bellow an algorithm CLS which for a given automaton A = (G, {q0 }, F ) with k-suffix property, constructs an automaton accepting γP∗ (L(A)). The algorithm CLS constructs a sequence Ai = (Gi , {q0 , q00 }, F0 ) of finite automata, where A0 is defined as follows: A0 = (G0 , {q0 }, F0 ) = A
∪ (∪_{w←p1,...,pn ∈ Pa} arw(w)).
Note that every arrow component has disjoint set of vertices from each other and from A, thus isolated in the automaton A0 . The initial and final set of vertices do not depend on i, thus we only have to show how to construct Gi+1 from Gi . Furthermore, as will be shown in the sequel, the vertex set of Gi does not depend on i. For the construction of Gi+1 from Gi , CLS chooses any f-pattern rule r in P and checks whether or not the rule r is applicable. We will define bellow the notion that r is applicable. For every f-pattern pl = x1 · · · xm (xj ∈ Σ ∪ V ∪ F(V )) in the body of r, we will assign a sequence v0 , v1 , ..., vm of vertices of Ai such that v0 ∈ I and vm ∈ F . Here we say that the occurrence xj is associated with a vertex pair (vj−1 , vj ). We also sometimes say that the occurrence xj is assigned to the vertex vj . For each occurrence Y (Y is either of the form c ∈ Σ, X ∈ V or f (X) ∈ F(V )), we will associate a finite automaton (Gi , p, q), where (p, q) is a vertex pair associated with the occurrence Y . When we determine the choice of the vertices for all f-patterns of the body of r, we perform the following two procedures:
1. For every occurrence c ∈ Σ in the body, we check whether the associated automaton of the occurrence c accepts c or not. If all of those automata accept their corresponding symbols c, we answer yes, otherwise, no. 2. For each variable X in r, let OX be the set of all occurrences of the form, X, f (X), in the body of r. For the convenience, we will regard a pure occurrence X as an occurrence id(X) with the identity function id. Note that id is efficient. For each occurrence o ∈ OX of the form f (X), let Ao be the automaton associated with the occurrence o, and construct a new automaton A0o accepting a language fˆ−1 (L(Ao )). These automata A0o can be constructed in polynomial time with respect to the number of vertices of automaton Ai (which is equivalent to the number of vertices of A), since f is efficient. We take the intersection of all these automata and denote it by AX . If this intersection L(AX ) is not empty for every variable X in r, we answer yes, otherwise, no. If both of the above two procedures answer yes, the choice of the vertices for r is said to be successful. If there exists some successful choice of the vertices for r, then r is said to be applicable. The following Lemma holds. Lemma 4. An f-pattern rule p0 ← p1 , ..., pn is applicable to an automaton A if and only if there exists a substitution θ such that pi θ ∈ L(A) for every i = 1, ..., n. Proof. Assume that there exists a substitution θ such that pi θ ∈ L(A) for every i = 1, ..., n. Then, for every i = 1, ..., n, there exists an accepting path ti of A such that l(ti ) = pi θ. Let us write pi = Z1 Z2 · · · Zm , where Zj ∈ Σ ∪ V ∪ F(V ). For every j = 1, ..., m, let vj be the last vertex of the subpath of ti , which can be reached by the transition of the label (Z1 Z2 · · · Zj )θ. Furthermore, we define v0 as the first vertex of the path ti . We associate each occurrence Zj with a vertex pair (vj−1 , vj ). Then, this choice of sequence of vertices ensures that r is applicable. Conversely, assume that r : p0 ← p1 , ..., pn is applicable to A. Then, we can choose a sequence of vertices for each f-pattern pi (i = 1, ..., n), which ensures that r is applicable. Then, consider a substitution θ that maps a variable X in the body of r to an element in L(AX ), where AX is the automaton constructed during the second procedure above. It is straightforward to verify that pi θ ∈ L(A) holds for every i = 1, ..., n. t u It can be checked in polynomial time whether a rule r is applicable or not, since the number of choices of the vertices is bounded by O(ns ), where n is the number of vertices of A and s is the sum of the lengths of all f-patterns in the body of r. Note that s is a constant. The algorithm CLS checks whether or not there exists an applicable f-pattern rule. Whenever it finds an applicable rule p0 ← p1 , ..., pn , CLS adds an -edge to Ai according to the choice of vertices which ensures the applicability of r. There are two cases: (1) In case that p0 = w for some w ∈ Σ ∗ , CLS adds a new -edge (q00 , , q), where q is the first vertex of the arrow component arw(w),
(2) In case that p0 = XwY for some w ∈ Σ ∗ and X, Y ∈ V , CLS adds a set E 0 of new -edges defined as follows: Let us consider the unique occurrence of Xw in the body of r. Let c be the occurrence of the last letter of Xw. Let v be the second component of the vertex pair associated with c. Furthermore, let v 0 be the first component of the vertex pair which is associated with the unique occurrence of Y in the body of r. In case of v 0 6= q00 , we define E 0 = {(v, , v 0 )}. In case of v 0 = q00 , we define E 0 = {(v, , v 00 ) | v 00 ∈ (v 0 )}. In both cases, all the new edges in E 0 will be added to Ai . The addition of these -edges are continued repeatedly until there is no more new -edges that can be added to Ai . Then, the final product will be the automaton Ai+1 . Note that no edges of the form (q, a, q00 ) (a ∈ {} ∪ Σ) is added during the construction of Ai+1 . Therefore, the automaton Ai+1 does not have any edges going to q00 . The sequence A0 , A1 , ... is finite since there is no increase of vertices, thus the number of -edges can be added only finitely many times. Assume that the sequence stops at AN . Then, CLS outputs AN . The algorithm CLS runs in polynomial time with respect to n, since the number of -edges should be bounded by n2 . Lemma 5. Every automaton Ai (i = 1, ..., N ) satisfies the k-suffix property. Proof. Since A0 = A has k-suffix property and the addition of -edges at each stage does not destroy the k-suffix property, the claim holds. t u Theorem 3 is obtained directly by the following two lemmas. Lemma 6. γP∗ (L(A)) ⊆ L(AN ) holds. Proof. We will prove γPi (L(A)) ⊆ L(AN ) by induction on i. In case of i = 0, the claim holds clearly. Suppose that the claim holds in case of i ≤ k − 1, and consider the case of i = k. Then, we have: γPk (L(A)) = γP (γPk−1 (L(A))) ∪ γPk−1 (L(A)). By the induction hypothesis, γPk−1 (L(A)) ⊆ L(AN ) holds. Therefore, it suffices to show γP (L(AN )) ⊆ L(AN ). Let z be any element in γP (L(AN )). Then, there exist an f-pattern rule r : p0 ← p1 , ..., pn and a substitution θ such that p0 θ = z and pj θ ∈ L(AN ) for every j = 1, ..., n. By Lemma 4, r is applicable to AN . Let us consider the choice of vertices ensuring the applicability of r, which is described in the proof of if-part of Lemma 4. In case that p0 = w for some w ∈ Σ ∗ , z = w holds. Since r is applicable, there is an -edge between q00 and the first vertex of the arrow component arw(w). Therefore, AN should accept w = z. Let us consider the case that p0 = XwY for some w ∈ Σ ∗ and X, Y ∈ V . Since r is of type I, there exist the unique occurrences of Xw and Y in the body of r. Let (v1 , v2 ) be the vertex pair associated with the last letter of Xw,
and (v10 , v20 ) be the vertex pair associated with the occurrence Y . Then, by the definition of the algorithm CLS, AN has some -edges from v2 to either v10 or (v10 ). Therefore, AN should accept the string (Xw)θ · · Y θ = p0 θ = z. This completes the proof. t u Lemma 7. L(AN ) ⊆ γP∗ (L(A)) holds. Proof. We will introduce a complexity measure for a path of AN . We say that an -edge is at level i if it is added during the construction of Ai . A path p in AN is of complexity (n1 , ..., nN ) if and only if the number of -edges at level i appears ni times in the path p. For any complexities d1 and d2 , we write d1 < d2 if d1 lexicographically precedes d2 . We can prove the following claim based on the induction on the complexity of the path p. [Claim:] For any accepting path p of AN , l(p) is in γP∗ (L(A)). Proof of the claim : In case that the complexity of p is (0, ..., 0), the path p uses only edges of A0 = A, thus, we have the claim. Suppose that the claim holds in case that the complexity of the path is less than d, and consider the path p of complexity d. Let us choose the last occurrence of -edges in the path p, and denote it by e = (v, , v 0 ). Let i be the level of e. Then, there exists an applicable rule r : p0 ← p1 , ..., pn which introduced the edge e during the construction of Ai . By Lemma 4, there is a substitution θ such that pj θ ∈ L(Ai−1 ) for every j = 1, ..., n. Therefore, for every j = 1, ..., n, there exists an accepting path tj of Ai−1 such that l(tj ) = pj θ. Note that the complexity of tj is less than d for every j = 1, ..., n. Thus, by induction hypothesis, pj θ ∈ γP∗ (L(A)) holds for every j = 1, ..., n. In case that p0 = w for some w ∈ Σ ∗ , the -edge e should start from q00 and go to the first vertex of the arrow component arw(w). Since e is the last occurrence, the path p can not go out from arw(w). Since AN has no edges going to the vertex q00 , the path p should start from q00 . Therefore, l(p) should be w. Since pj θ ∈ γP∗ (L(A)) holds for every j = 1, ..., n, we have w = p0 θ ∈ γP∗ (L(A)). Let us consider the case that p0 = XwY for some w ∈ Σ ∗ and X, Y ∈ V . For the convenience, we write the path p as follows:
I ∋ v₀ –s→ v –ε→ v′ –y→ v″ ∈ F,
where s, y ∈ Σ ∗ and the -edge v → v 0 is corresponding to the last occurrence e. Without loss of generality, we can assume that p1 = Xwp01 and p2 = p02 wY for some f-patterns p01 and p02 . Thus, we can write the paths t1 and t2 as (Xθ)·w
p0 θ
1 t1 : I 3 u0 −→ v −→ u∈F
(p0 θ)·w
Yθ
2 v 0 −→ v 00 ∈ F t2 : I 3 u00 −→
Note that by Lemma 5, all the paths from a vertex of I to v should have a label with suffix w. Thus, we can write s = s0 w for some s0 ∈ Σ ∗ . Let us consider the
168
S. Kobayashi
new paths t01 , t02 by splicing the path p and the paths t1 , t2 : s0 w
p0 θ
1 u∈F t01 : I 3 v0 −→ v −→
(p0 θ)·w
y
2 v 0 −→ v 00 ∈ F t02 : I 3 u00 −→
Then, it is easy to verify that the complexities of t01 and t02 are less than d. Therefore, by the induction hypothesis, l(t01 ) = s0 w(p01 θ) ∈ γP∗ (L(A)) and l(t02 ) = (p02 θ)wy ∈ γP∗ (L(A)) hold. Let θ0 be a substitution obtained by modifying the map θ at the variables X and Y so that θ0 (X) = s0 and θ0 (Y ) = y. Then, we have p1 θ0 = s0 w(p01 θ0 ) = s0 w(p01 θ), p2 θ0 = (p02 θ0 )wy = (p02 θ)wy, p3 θ0 = p3 θ, ..., pn θ0 = pn θ. Furthermore, we have (XwY )θ0 = s0 wy. By applying the rule r to the strings p1 θ0 , p2 θ0 , ..., pn θ0 in γP∗ (L(A)) with the substitution θ0 , we obtain that (XwY )θ0 = s0 wy = sy = l(p) is in γP∗ (L(A)). This completes the proof of the claim. t u
5
Learnability Results
Finally, we will give the main theorem in this paper. Theorem 8. Let P be an f-pattern system of type I. Then, D(γP ) is learnable in the limit from positive data using the automata representation class with the conjectures updated in polynomial time. Proof. For a given set P OS of strings given so far, the learning machine M checks whether the previous conjecture of M accepts every element in P OS or not. In case that it does not accept some element, M constructs a prefixtree automaton A accepting P OS. Then, M runs CLS for A and outputs the automaton accepting γP∗ (P OS). It is straightforward to see that at the stage that P OS contains a characteristic sample of the target language, the conjectures of M will converge to the correct representation. From the construction of M , it is clear that M runs with the conjectures updated in polynomial time. t u
6
Conclusions
This paper presented an approach to giving a unifying view of the language classes which are efficiently learnable in the limit from positive data. We believe that the proposed method for defining language classes using iterated transductions would give a new light on the efficient learnability from positive data. However, there are some problems which should be discussed in the future works. One of the important problems is that the framework proposed in this paper does not focus on the notion of characteristic samples, which is caused from the fact that the language class is defined by a fixpoint starting from a finite set.
Iterated Transductions and Efficient Learning from Positive Data
169
Thus, the learnability from positive data is immediate from the definition. It is sometimes more natural to define a language class as D0 (γ, C) = {L ∈ C | L = γ ∗ (L)}, where r is a transduction and C is a language class. For example, based on the original characterizations, k-SLT could be naturally defined by ∗ D0 (γPk , 2Σ ), and k-REV could also be defined by D0 (γQk , REG). In this new framework, the question arises what type of transduction γ and a language class C make D0 (γ, C) have characteristic sample property. Another open problem is the investigation of the efficient learnability beyond the class of regular languages, which might be a more challenging topic in the future works.
References [Ang80] D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980) 117-135. [Ang82] D. Angluin, Inference of reversible languages, Journal of the ACM 29 (1982) 741-765. [ASY92] S. Arikawa, T. Shinohara and A. Yamamoto. Learning Elementary Formal Systems. Theoretical Computer Science, 95 (1992) 97-113. [Gar90] Pedro Garc´ıa and Enrique Vidal, Inference of k-Testable Languages in the Strict Sense and Application to Syntactic Pattern Recognition. IEEE Trans. on Pattern Analysis and Machine Intelligence, 12 (1990) 920-925. [Gol67] E. Mark Gold, Language identification in the limit, Information and Control, 10 (1967) 447-474. [Hea87] Head, T, Formal language theory and DNA: an analysis of the generative capacity of specific recombinant behaviors, Bulletin of Mathematical Biology, 49 (1987) 737-759. [HKY98] Tom Head, Satoshi Kobayashi and Takashi Yokomori, Locality, Reversibility, and Beyond: Learning Languages from Positive Data, in Proc. of 9th International Workshop on Algorithmic Learning Theory, (Lecture Notes in Artificial Intelligence 1501), Springer-Verlag, pp.191-204, 1998. [Kob96] S. Kobayashi, Approximate Identification, Finite Elasticity and Lattice Structure of Hypothesis Space, Technical Report, CSIM 96-04, Dept. of Compt. Sci. and Inform. Math., Univ. of Electro-Communications (1996) [KMPR] Satoshi Kobayashi, Victor Mitrana, Gheorghe P˘ aun and Grzegorz Rozenberg, Formal Properties of PA-Matching, Theoretical Computer Science, accepted for publication. [KY97] S. Kobayashi and T. Yokomori, Learning Approximately Regular Languages with Reversible Languages, Theoretical Computer Science, 174 (1997) 251-257. [KMT95] T. Koshiba, E. M¨ akinen and Y. Takada, Learning Strongly Deterministic Even Linear Languages from Positive Examples, in Proc. of Workshop on Algorithmic Learning Theory’95, (Lecture Notes in Artificial Intelligence 997), SpringerVerlag, pp.41-54, 1995. [LR80] De Luca, A. and A. Restivo, A characterization of strictly locally testable languages and its application to subsemigroups of a free semigroup, Information and Control, 44 (1980) 300-319. [MMP98] Vincenzo Manca, Carlos Mart´in-Vide, Gheorghe P˘ an, Iterated GSM Mappings: A Collapsing Hierarchy, Turku Centre for Computer Science, Technical Report No. 206, October 1998.
170
S. Kobayashi
[MP71] R. McNaughton and S. Papert, Counter-Free Automata, MIT Press, Cambridge, Massachusetts (1971). [Pit89] L. Pitt, Inductive Inference, DFAs, and Computational Complexity. Lecture Notes in Artificial Intelligence, 397, pp.18-44, 1989. [Pix96] D. Pixton, Regularity of splicing languages, Discrete Applied Mathematics 69, pp.101-124, 1996. [Sch75] M. P. Schutzenberger, Sur certaines operations de fermeture dans les languages rationnels, Symposium Mathematicum, 15 (1975) 245-253. [Shi83] T. Shinohara. Polynomial-time Inference of Extended Regular Pattern Languages. In Proc. of RIMS Symp. on Software Sci. and Engin., LNCS 147, pp.115-127, 1983. [Shi94] T. Shinohara. Rich Classes Inferable from Positive Data : Length Bounded Elementary Formal Systems. Information and Computation, 108, pp.175-186, 1994 [Smu61] Raymond M. Smullyan, Theory of Formal Systems, Annals of Mathematics Studies, 47, revised edition, Princeton University Press, 1961. [Woo76] Derick Wood, Iterated a-NGSM Maps and Γ Systems, Information and Control, 32, pp.1-26, 1976. [YIK94] T. Yokomori, N. Ishida, and S. Kobayashi, Learning local languages and its application to protein alpha-chain identification, In Proc. of 27th Hawaii Intern. Conf. on System Sciences, IEEE Press, 113-122 (1994). [Yok95] T. Yokomori, On polynomial-time learnability in the limit of strictly deterministic automata, Machine Learning, 19 (1995) 153-179. [YK98] T. Yokomori and S. Kobayashi, Learning local languages and its application to DNA sequence analysis, IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol.20, No.10, pp.1067-1079, 1998.
An Inverse Limit of Context-Free Grammars — A New Approach to Identifiability in the Limit Pavel Martinek Department of Mathematics Mendel University of Agriculture and Forestry Zemˇedˇelsk´ a3 613 00 Brno, Czech Republic E-mail: [email protected]
Abstract. The class of context-free grammars is studied. We transform results reached in [10] into terminology which meets Category theory. If a context-free language L is given, our approach enables to describe lucidly the whole structure of all context-free grammars (with the same set of terminals) in Chomsky normal form generating L.
1
Introduction
The grammatical inference problem is the problem of constructing a grammar to a given formal language which is known only incompletely by means of a finite set of its words. Many inference algorithms have been introduced for various classes of formal languages so far. Most of them have dealt with languages based on Chomsky hierarchy or derived from it (cf., e.g., [4], [13], and [15]). In last years, many researchers focus on learning grammars (or languages) from positive examples (e.g., [9], [12] or [14]), i.e., the learner receives only strings of the learned language and no others. The other possibility demands knowledge of both positive and negative strings of the learned language (e.g., [3], [5], and [10]). This paper follows [10], where the context-free grammars were inferred in the limit in the sense of [7], i.e., a sequence of successively inferred grammars (identified in a certain way) was proved to be constant except for finitely many starting members. We present a way of how to describe the inference by means of some basic notions from Category theory. This seems to describe the process more lucidly. Here, all context-free grammars (with the same set of terminals) in Chomsky normal form generating a given language constitute a compact formation of certain algebraic structures.
2
Preliminaries
We denote by N the set of all positive integers. For any set X, we denote by X ∗ the set of all finite strings over X (including the empty string Λ) provided with the binary operation of concatenation. We use X + to denote the set X ∗ − A.L. Oliveira (Ed.): ICGI 2000, LNAI 1891, pp. 171–185, 2000. c Springer-Verlag Berlin Heidelberg 2000
172
P. Martinek
{Λ}. If x ∈ X ∗ , then there exist a non-negative integer m and some elements x1 , x2 , . . . , xm ∈ X such that x = x1 x2 · · · xm . We put |x| = m. If xi = x ∈ X for i ∈ {1, 2, . . . , m}, we write xm instead of x1 x2 · · · xm . Let V, N be non-empty disjoint sets and V be finite. Any L ⊆ V ∗ is called a language. We put U = V ∪ N . If R ⊆ N × U ∗ and S ∈ N , then the ordered quadruple G = (N, V, R, S) is called a generalized context-free grammar . (The symbols N, V, R, and S denote, respectively, the set of nonterminals, the set of terminals, the set of productions and the start symbol.) If, moreover, the sets N and R are finite, G is called a context-free grammar . A generalized context-free grammar G = (N, V, R, S) is said to be in Chomsky normal form if all productions in R are of the form (A, BC) or (A, a). Here A, B and C are nonterminals and a is a terminal. As usually, we write s ⇒ t (R) for s, t ∈ U ∗ , if there exist (y, x) ∈ R and u, v ∈ U ∗ such that s = uyv, t = uxv. If s, t ∈ U ∗ , m is a non-negative integer and s0 , s1 , . . . , sm are strings in U ∗ such that s0 = s, sm = t and si−1 ⇒ si (R) for each i ∈ {1, . . . , m}, then the finite sequence (si )m i=0 is called an s-derivation ∗ of t in R of length m. We set s ⇒ t (R) for s, t ∈ U ∗ , if there exists an s∗ derivation of t in R. We put L(G) = {w ∈ V ∗ ; S ⇒ w (R)} and L(G) is said to be the language generated by G. A language L ⊆ V ∗ which is generated by a context-free grammar is called a context-free language. A language L ⊆ V + is said to be Λ-free. It is well known (cf. Theorem 4.5. in [6]) that any Λ-free context-free language can be generated by a grammar in Chomsky normal form. + + We denote by 2V the system of all subsets of V + , i.e., 2V = {X; X ⊆ V + }. + For all N ⊆ 2V , we put NN = {Q1 Q2 ; Q1 ∈ N and Q2 ∈ N }. Furthermore, we suppose the set of terminals V to be the same in the whole + paper (with the exception of examples) and satisfying the condition V ∩2V = ∅. + Then the set 2V can play a role of a set of nonterminals. (Clearly, the demand + V ∩ 2V = ∅ is not satisfied automatically because generally, a non-empty finite set V can have elements of various origin, e.g., a, b, {ab} etc.)
3
Generalized Context-Free Grammars with Nonterminals Formed by Sets of Terminal Strings
The fact, that a nonterminal in a context-free grammar can be considered to be the set of all terminal strings generated from it as from a start symbol, is often used in various papers (e.g. [2] or [11]). This model was applied in [10] to solve the grammatical inference problem for context-free languages. We will repeat used notions and main results in this section. The transformation of nonterminals to sets of terminal strings can be described in the following way.
An Inverse Limit of Context-Free Grammars
173
Let G = (N, V, R, S) be a context-free grammar in Chomsky normal form. We put ∗
τ (Q) = {w ∈ V ∗ ; Q ⇒ w (R)} for all Q ∈ N , τ (N ) = {τ (Q); Q ∈ N }, τ (R) = {(τ (Q), τ (Q0 )τ (Q00 )); Q, Q0 , Q00 ∈ N and (Q, Q0 Q00 ) ∈ R} ∪ {(τ (Q), a); Q ∈ N, a ∈ V and (Q, a) ∈ R}, and finally, τ (G) = (τ (N ), V, τ (R), τ (S)). (Notice that the grammar τ (G) is also context-free and in Chomsky normal form.) Example 1. Let N = {A, B, S}, V = {a, b} and R = {(S, BA), (S, a), (A, SB), (B, b)}. Apparently, G = (N, V, R, S) is a context-free grammar in Chomsky normal form and L(G) = {bn abn ; n ≥ 0}. Furthermore, τ (A) = {bn abn+1 ; n ≥ 0}, τ (B) = {b}, τ (S) = {bn abn ; n ≥ 0}, τ (N ) = {{bn abn+1 ; n ≥ 0}, {b}, {bn abn ; n ≥ 0}}, τ (R) = {({bn abn ; n ≥ 0}, {b}{bn abn+1 ; n ≥ 0}), ({bn abn ; n ≥ 0}, a), ({bn abn+1 ; n ≥ 0}, {bn abn ; n ≥ 0}{b}), ({b}, b)}. Now, the only possible τ (S)-derivations of terminal strings in τ (R) are the following ones: τ (S) = {bn abn ; n ≥ 0} ⇒ a, ∗ τ (S) = {bn abn ; n ≥ 0} ⇒ {b}{bn abn+1 ; n ≥ 0} ⇒ {b}{bn abn ; n ≥ 0}{b} ⇒ ∗ {b}k {bn abn ; n ≥ 0}{b}k ⇒ bk abk for any k ∈ N. Thus, L(τ (G)) = {bn abn ; n ≥ 0} = L(G).
2
The result of Example 1 can be also obtained from the following lemma. Lemma 1. If G is a context-free grammar in Chomsky normal form then L(G) = L(τ (G)). Proof. See Lemma 3.1 in [10].
2
Focussing on productions in a context-free grammar τ (G) (which is in Chomsky normal form), we can see some perspicuous properties, namely (i) if the second component of a production is formed by a terminal symbol then it is an element of the first component, (ii) if the second component of a production is formed by a pair of nonterminals then their concatenation forms a subset of the first component.
174
P. Martinek
After all, these properties follow directly from the definition of τ . Now, we will change our point of view and we will examine sets of pro+ ductions which involve nonterminals from the set 2V and satisfy the above properties (i),(ii) generally. +
First, for all Q1 , Q2 ∈ 2V , we put C(Q1 Q2 ) = {x ∈ V ∗ ; there exist x1 ∈ Q1 and x2 ∈ Q2 such that x = x1 x2 }. (This notation is introduced to distinguish strictly between a nonterminal string Q1 Q2 and a set obtained by the concatenation of the sets Q1 and Q2 .) Moreover, we put: +
R(1) = {(Q, Q0 Q00 ); Q, Q0 , Q00 ∈ 2V , C(Q0 Q00 ) ⊆ Q}, + R(2) = {(Q, w); Q ∈ 2V , w ∈ Q ∩ V }, R = R(1) ∪ R(2). +
Clearly, each ordered quadruple (N, V, R, X) with N ⊆ 2V , R ⊆ R ∩ (N × (NN ∪V )) and X ∈ N represents a generalized context-free grammar in Chomsky normal form. Example 2. Let G = (N, V, R, S) be a context-free grammar such that N = {A, B, S} with A = {bn abn−1 ; n ≥ 1}, B = {b}, S = {bn abn ; n ≥ 0}, V = {a, b}, R = {(S, AB), (S, a), (A, BS), (B, b)}. + Obviously, A, B, S ∈ 2V . Moreover, it is easy to verify that R ⊆ R: Since a ∈ S ∩ V and b ∈ B ∩ V , we obtain (S, a) ∈ R and (B, b) ∈ R respectively. As to the production (S, AB), we have C(AB) = C({bn abn−1 ; n ≥ 1}{b}) = {bn abn ; n ≥ 1} ⊆ {bn abn ; n ≥ 0} = S which implies (S, AB) ∈ R. Similarly, we get (A, BS) ∈ R. Now, the only possible S-derivations of terminal strings in R are the following ones: S ⇒ a, ∗ ∗ S ⇒ AB ⇒ BSB ⇒ B k SB k ⇒ bk abk for any k ∈ N. n n Thus, L(G) = {b ab ; n ≥ 0} = S. 2 The fact, that the whole set of productions R ensures strong generative power, is confirmed in the following assertion. +
Lemma 2. Any L ⊆ V ∗ is a Λ-free language iff L = L(2V , V, R, L). Proof. See Theorem 5.1 in [10].
2
An Inverse Limit of Context-Free Grammars
175
Since generative abilities of the generalized context-free grammars + (N, V, R, X) with N ⊆ 2V and R ⊆ R are too large, an appropriate narrowing (which leads to context-free grammars) can be ensured by means of the following notions. If n ∈ N then we put +
= {x ∈ X; |x| ≤ n} for all X ∈ 2V , n x = x for all x ∈ V , + V ∪V, n (x1 · · · xp ) = n (x1 ) · · · n (xp ) for all p ∈ N, x1 , . . . , xp ∈ 2 + V , n N = {n Q; Q ∈ N } for all N ⊆ 2 + + + R = {( y, x); (y, x) ∈ R} for all R ⊆ 2V × ((2V )(2V ) ∪ V ). n n n nX
+
+
+
Lemma 3. Let R ⊆ 2V × ((2V )(2V ) ∪ V ) and let m, n, k ∈ N, m ≤ n ≤ k. + + If Q ∈ 2V , N ⊆ 2V , R1 ⊆ m R and R2 ⊆ k R are arbitrary then: (i) (ii) (iii)
n (m Q)
= m Q ⊆ Q and n (k Q) = n Q ⊆ Q. + + ( N ) = m N ⊆ 2V and n (k N ) = n N ⊆ 2V . n m n (R1 ) = R1 ⊆ m R and n (R2 ) ⊆ n (k R) = n R.
Proof. This lemma follows directly from the above definitions.
□
Now we define precisely the systems of generalized context-free grammars, and the mappings among them, that we intend to deal with.
Let X ∈ 2^{V+} and n, m ∈ N with n ≥ m. We put
G(X) = {(N, V, R, X); N ⊆ 2^{V+}, X ∈ N, R ⊆ ℛ},
Gn(X) = {(N, V, R, nX); N ⊆ n(2^{V+}), nX ∈ N, R ⊆ nℛ},
and we define the mappings:
Φn : G(X) → Gn(X) such that Φn((N, V, R, X)) = (nN, V, nR, nX) for all (N, V, R, X) ∈ G(X),
Φn,m : Gn(X) → Gm(X) such that Φn,m((N, V, R, nX)) = (mN, V, mR, mX) for all (N, V, R, nX) ∈ Gn(X).
Example 3. Let G = (N, V, R, S) be the same context-free grammar as in Example 2, i.e., N = {A, B, S} with A = {b^n ab^{n−1}; n ≥ 1}, B = {b}, S = {b^n ab^n; n ≥ 0}, V = {a, b}, R = {(S, AB), (S, a), (A, BS), (B, b)}. According to the previous definitions, we obtain
3A = {ba}, 3B = {b}, 3S = {a, bab},
and
3R = {({a, bab}, {ba}{b}), ({a, bab}, a), ({ba}, {b}{a, bab}), ({b}, b)}.
The only 3S-derivations of terminal strings in 3R are the following ones:
3S = {a, bab} ⇒ a,
3S = {a, bab} ⇒ {ba}{b} ⇒ {b}{a, bab}{b} ⇒∗ {b}^k {a, bab}{b}^k ⇒∗ b^k ab^k for any k ∈ N.
Hence, L(Φ3(G)) = L(3N, V, 3R, 3S) = {b^n ab^n; n ≥ 0} = L(G).
□
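Because the nonterminals of a generalized grammar are sets of strings, the truncation nX and the mapping Φn can be checked mechanically on small examples. The paper gives no code for this; the following Python sketch is our own illustration — all function names are ours, and the infinite nonterminals of Example 2 are replaced by finite approximations that are exact up to the lengths involved. It reproduces the set 3S and the four productions of 3R listed in Example 3.

# Sketch (ours, not from the paper): n-truncation of set-nonterminals and of
# productions, applied to the grammar of Example 2 with n = 3 as in Example 3.

def trunc_set(X, n):
    """nX = {x in X ; |x| <= n} for a finite set of strings X."""
    return frozenset(x for x in X if len(x) <= n)

def trunc_rhs(rhs, n):
    """Truncate each component of a right-hand side; terminals stay as they are."""
    return tuple(c if isinstance(c, str) else trunc_set(c, n) for c in rhs)

def phi(N, R, S, n):
    """Phi_n applied to a generalized grammar (N, V, R, S) with finite nonterminals."""
    nN = {trunc_set(Q, n) for Q in N}
    nR = {(trunc_set(lhs, n), trunc_rhs(rhs, n)) for (lhs, rhs) in R}
    return nN, nR, trunc_set(S, n)

# Finite approximations of A, B, S from Example 2 (exact for strings up to length 7).
A = frozenset("b" * k + "a" + "b" * (k - 1) for k in range(1, 4))
B = frozenset({"b"})
S = frozenset("b" * k + "a" + "b" * k for k in range(0, 4))
R = {(S, (A, B)), (S, ("a",)), (A, (B, S)), (B, ("b",))}

N3, R3, S3 = phi({A, B, S}, R, S, 3)
print(sorted(S3))   # ['a', 'bab']  -- this is 3S from Example 3
print(len(R3))      # 4 productions, as listed in Example 3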
In Example 3, the generative ability of the considered context-free grammar was not influenced by the mapping Φ3. Generally, as a reformulation of Theorem 4.1 from [10], we have:
Lemma 4. Let L ∈ 2^{V+} and n ∈ N. If G = (N, V, R, L) ∈ G(L) then
(i) L(G) = L(Φn (G)) if card N = card (n N ), (ii) L(G) ⊆ L(Φn (G)) otherwise. Assertion (i) of the previous lemma indicates that the equations L(G) = L(Φn (G)) can be satisfied for more than one number n. An example is given in the following lemma. Lemma 5. Let G = (N, V, R, S) be a context-free grammar in Chomsky normal form. Then: (i) τ (G) ∈ G(L(G)), (ii) there exists n0 ∈ N such that L(G) = L(Φn (τ (G))) for all integers n ≥ n0 . Proof. This represents a connection of these assertions of [10]: Theorem 4.2 and Lemma 5.4. 2 Finally, we present the assertion which characterizes the class of all Λ-free context-free languages. Theorem 6. L ⊆ V + is a context-free language if and only if there exist n0 ∈ N and a sequence of grammars (Gn )n≥n0 such that Gn ∈ Gn (L), L = L(Gn ) and Φn,n0(Gn ) = Gn0 for all integers n ≥ n0 . Proof. See Theorem 5.3 in [10].
□
Although Theorem 6 is of an existential nature, it makes it possible to construct the “correct” context-free grammar G for any Λ-free context-free language L in the limit. The way to construct G is based on the fact that the system Gn(L) is finite for each n ∈ N. So, we can successively investigate context-free grammars
from Gn(L). Each of the grammars we are interested in is required to generate exactly the strings of L and no others. Since we do not actually know all the strings of L, we have to confine our requirement to generating the strings of mL, where m is a positive integer. We transform these considerations into some definitions concerning sets of productions. Let L ⊆ V+, m, n ∈ N, P ⊆ nℛ.
(i) If mL(n(2^{V+}), V, P, nL) = mL then P is called an (n, m)-satisfying set of productions.
(ii) If L(n(2^{V+}), V, P, nL) = L then P is called an n-satisfying set of productions.
Example 4. Let L = {b^n ab^n; n ≥ 0}. We will examine the following sets of productions:
P1 = {({a}, a)},
P2 = {({a}, {a, b}{b}), ({a, b}, {b}{a}), ({a}, a), ({b}, b)}.
It is easy to verify that both P1 and P2 are subsets of 1ℛ: Since {b}, {a, b}, {a, ab, b²} ∈ 2^{V+} and C({a, b}{b}) = {ab, b²} ⊆ {a, ab, b²}, we have ({a, ab, b²}, {a, b}{b}) ∈ ℛ. By the definition of 1ℛ, we obtain (1{a, ab, b²}, 1{a, b} 1{b}) = ({a}, {a, b}{b}) ∈ 1ℛ. Analogously, we can prove that all the remaining productions of P1 and P2 are also from the set 1ℛ.
Since in the set P1 only the derivation 1L = {a} ⇒ a can be formed and 2L = {a}, we get 2L(1(2^{V+}), V, P1, 1L) = 2({a}) = {a} = 2L. So, P1 is a (1, 2)-satisfying set of productions. However, 3L(1(2^{V+}), V, P1, 1L) = 3({a}) = {a} ≠ {a, bab} = 3L. Therefore, P1 is not a (1, 3)-satisfying set of productions. Consequently, P1 is not a 1-satisfying set of productions.
In the set P2, either the derivation 1L = {a} ⇒ a or the derivation 1L = {a} ⇒ {a, b}{b} ⇒ {b}{a}{b} ⇒∗ {b}^k {a}{b}^k ⇒∗ b^k ab^k for each k ∈ N can be formed. Thus, L(1(2^{V+}), V, P2, 1L) = {b^n ab^n; n ≥ 0} = L. Hence, P2 is a 1-satisfying set of productions.
An example of a 3-satisfying set of productions is given by the set 3R from Example 3. □
The definition of an n-satisfying set of productions and Theorem 6 imply the following assertion.
Corollary 7. Let L ⊆ V+. Then L is context-free if and only if there exist n ∈ N and P ⊆ nℛ such that P is an n-satisfying set of productions.
A connection among various n-satisfying and (n, m)-satisfying sets of productions is given in the following assertion.
Theorem 8. Let n ∈ N and L be a context-free language. Then there exists m1 ∈ N such that for all integers m ≥ m1 , each (n, m)-satisfying set of productions is n-satisfying. Proof. See Theorem 6.1 in [10].
□
Summarizing, if L is a Λ-free context-free language then Theorem 6 ensures that, starting with an index n large enough, there exists a context-free grammar Gn = (n(2^{V+}), V, Pn, nL) ∈ Gn(L) generating the language L. By the definition of an n-satisfying set of productions, the set Pn is n-satisfying. Finally, by Theorem 8, Pn can be found as an (n, m)-satisfying set of productions with an m exceeding some m1 ∈ N — this requires knowledge of nL and mL, which contain only finitely many strings of L. This process can be repeated for Gn+1, Gn+2, . . . If we find a sequence of grammars (Gn+i)i∈N such that Φn+i,n(Gn+i) = Gn then, regarding Theorem 6 and the “right” index n, the context-free grammar Gn generates L. So, since each grammar Gn+i can be “identified” with Gn by means of the mapping Φn+i,n, we have a solution of grammatical inference for Λ-free context-free languages in the limit.
The following sections are devoted to a more detailed description of the structure of (generalized) context-free grammars from G(L) and Gn(L). This will also lead to some generalizations of Theorem 6.
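The identification procedure just outlined reduces to repeatedly testing candidate production sets for the (n, m)-satisfying property, which needs only the finite sets nL and mL. As a purely illustrative aid — not part of the paper; the encoding of productions and all names are ours — the following Python sketch performs this test by bounded derivation and reproduces the claims about P1 and P2 from Example 4.

# Sketch (ours): testing whether a finite set of productions over set-nonterminals is
# (n, m)-satisfying.  Nonterminals are frozensets of strings; productions are (lhs, rhs)
# with rhs either a 1-tuple (terminal,) or a 2-tuple of nonterminals.
from functools import lru_cache

def derivable(P, start, m):
    """All terminal strings of length <= m derivable from `start` using P."""
    rules = {}
    for lhs, rhs in P:
        rules.setdefault(lhs, []).append(rhs)

    @lru_cache(maxsize=None)
    def gen(Q, bound):
        out = set()
        if bound <= 0:
            return frozenset()
        for rhs in rules.get(Q, []):
            if len(rhs) == 1:                    # production (Q, a) with a terminal
                if len(rhs[0]) <= bound:
                    out.add(rhs[0])
            else:                                # production (Q, Q1 Q2)
                Q1, Q2 = rhs
                for u in gen(Q1, bound - 1):
                    for v in gen(Q2, bound - len(u)):
                        if len(u) + len(v) <= bound:
                            out.add(u + v)
        return frozenset(out)

    return gen(start, m)

def nm_satisfying(P, L_up_to_m, start, m):
    """P is (n, m)-satisfying iff it derives exactly the strings of L of length <= m."""
    return derivable(P, start, m) == L_up_to_m

# Example 4: L = { b^n a b^n ; n >= 0 }, so 1L = {"a"}.
L_m = lambda m: frozenset("b"*k + "a" + "b"*k for k in range(0, (m + 1) // 2) if 2*k + 1 <= m)
a, b, ab = frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})
P1 = {(a, ("a",))}
P2 = {(a, (ab, b)), (ab, (b, a)), (a, ("a",)), (b, ("b",))}
print(nm_satisfying(P1, L_m(2), a, 2))   # True  -- P1 is (1, 2)-satisfying
print(nm_satisfying(P1, L_m(3), a, 3))   # False -- but not (1, 3)-satisfying
print(nm_satisfying(P2, L_m(7), a, 7))   # True  -- consistent with P2 being 1-satisfying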
4
Inverse Systems of Sets
In this section we introduce some basic notions from Category Theory, based on the approach used in [1] and [8]. We put P = {(n, m) ∈ N × N; n ≥ m}.
Let (An)n∈N be a sequence of sets and (ϕn,m)(n,m)∈P be a sequence of mappings satisfying the following conditions:
(i) ϕn,m maps the set An into Am for all (n, m) ∈ P.
(ii) ϕn,l = ϕm,l ◦ ϕn,m for all n, m, l ∈ N, n ≥ m ≥ l.
(iii) ϕn,n = id_{An} for all n ∈ N.
Then, the ordered pair ((An)n∈N, (ϕn,m)(n,m)∈P) is called an inverse system of sets.
Let 𝒜 = ((An)n∈N, (ϕn,m)(n,m)∈P) be an inverse system of sets and (A, (ϕn)n∈N) be an ordered pair, where A is a set and (ϕn)n∈N is a sequence of mappings satisfying the following conditions:
(i) ϕn maps the set A into An for all n ∈ N.
(ii) ϕm = ϕn,m ◦ ϕn for all (n, m) ∈ P.
Then, the ordered pair (A, (ϕn)n∈N) is called a source of the inverse system 𝒜.
Remark. Each inverse system of sets has at least one source, namely (∅, (∅n )n∈N ), where ∅n denotes the empty mapping of the empty set into An . A source (A, (ϕn )n∈N ) of an inverse system of sets A = ((An )n∈N ,(ϕn,m )(n,m)∈P ) is called an inverse limit of A if for any source (A0 , (ϕ0n )n∈N ) of A, there exists exactly one mapping ϕ of the set A0 into A such that ϕ0n = ϕn b ϕ for all n ∈ N. Theorem 9. Each inverse system of sets has an inverse limit. Proof. It is well known that the theorem holds (cf. the construction described in [8] on p. 131). 2 An inverse system of sets ((Bn )n∈N ,(ϕn,m|Bn )(n,m)∈P ) is called a subsystem of the inverse system ((An )n∈N ,(ϕn,m )(n,m)∈P ) if (i) Bn ⊆ An for all n ∈ N, (ii) ϕn,m|Bn is (as usually) a restriction of the mapping ϕn,m to the set Bn for all (n, m) ∈ P. Theorem 10. Let A be an inverse system of sets having an inverse limit (A, (ϕn )n∈N ). Then each subsystem A0 of the inverse system A has an inverse e n∈N ) where A e ⊆ A. e (ϕn|A) limit (A, Proof. By Theorem 9, the inverse system A0 has an inverse limit, we denote it by (A0 , (ϕ0n )n∈N ). This inverse limit is a source of A because it satisfies its definition. By the definition of inverse limit of A, namely (A, (ϕn )n∈N ), there exists exactly one mapping ϕ of the set A0 into A such that ϕ0n = ϕn b ϕ for all e then A e ⊆ A and (A, e (ϕn |A) e n∈N ) is a source of A0 n ∈ N. If we put ϕ(A0 ) = A 0 0 0 e because ϕn (A) = ϕn (ϕ(A )) = ϕn (A ) for all n ∈ N. Now, by the definition of inverse limit of A0 , namely (A0 , (ϕ0n )n∈N ), there exists exactly one mapping ψ of e = ϕ0n b ψ for all n ∈ N. Since ϕ0n = ϕn b ϕ, we e into A0 such that ϕn|A the set A e = ϕn b ϕ b ψ for all n ∈ N. Hence, ψ = ϕ−1 which implies that the have ϕn |A e n∈N ) is an inverse limit of A0 . e (ϕn|A) 2 mapping ψ is a bijection and (A,
5
Inverse Systems of Sets Containing Context-Free Grammars
An application of the notions from the previous section to the generalized context-free grammars will now be introduced. In an inverse system ((An)n∈N, (ϕn,m)(n,m)∈P) with an inverse limit (A, (ϕn)n∈N), the role of An, A, ϕn,m, and ϕn should be played, respectively, by Gn(L), G(L), Φn,m, and Φn for some L ⊆ V+. To reach this goal gradually, we need the following lemmas.
Lemma 11. Let (Qn)n∈N be a sequence of sets satisfying the following conditions:
(i) Qn ∈ n(2^{V+}) for all n ∈ N,
(ii) m(Qn) = Qm for all (n, m) ∈ P.
Then there exists exactly one set Q ∈ 2^{V+} such that nQ = Qn for all n ∈ N.
Proof. If Qn = ∅ for all n ∈ N then clearly, Q =S∅ satisfies the lemma. So, assume that Qn 6= ∅ for some n ∈ N. We put Q = (Qn ; n ∈ N). Consider an arbitrary n ∈ N. + + a) Since Qn ⊆ Q ∈ 2V and Qn ∈ n (2V ) by assumption (i), we get Qn = n (Qn ) ⊆ n Q. S b) Let w ∈ n Q = n ( (Qk ; k ∈ N)) be arbitrary. Then |w| ≤ n and there exists m ∈ N such that w ∈ Qm . Consequently, w ∈ n (Qm ). If n ≤ m then n (Qm ) = Qn by assumption (ii) and w ∈ Qn . If n > m then Qm = m (Qn ) ⊆ Qn by assumption (ii) and Lemma 3. Hence, w ∈ Qn . Both cases lead to the inclusion n Q ⊆ Qn . Parts a) and b) imply n Q = Qn . + c) Let Q0 ∈ 2V be such that k Q0 = Qk for all k ∈ N. Furthermore, let w ∈ (Q0 − Q) ∪ (Q − Q0 ) be arbitrary. If we put j = |w| then w ∈ j (Q0 − Q) ∪ 0 0 0 j (Q − Q ) = (j Q − j Q) ∪ (j Q − j Q ) = (Qj − Qj ) ∪ (Qj − Qj ) = ∅. This means 0 that Q = Q and the lemma holds. 2 Lemma 12. Let (Nn )n∈N be a sequence of sets satisfying the following conditions: +
(i) Nn ⊆ n (2V ) for all n ∈ N, (ii) m (Nn ) = Nm for all (n, m) ∈ P. Then there exists exactly one set N ⊆ 2V
+
such that n N = Nn for all n ∈ N.
Proof. Consider the set A of all sequences a = (Qn (a))n∈N such that Qn (a) ∈ Nn for all n ∈ N, and m (Qn (a)) = Qm (a) for all (n, m) ∈ P. By Lemma 11, for each + a ∈ A, there exists exactly one set Q(a) ∈ 2V such that n Q(a) = Qn (a) for all + n ∈ N. We put N = {Q(a); a ∈ A}. Obviously, N ⊆ 2V . Consider an arbitrary n ∈ N. Then, n N = {n Q(a); a ∈ A} = {Qn (a); a ∈ A} ⊆ Nn . Now, let Q[n] ∈ Nn be arbitrary. For all m ∈ N, m ≤ n, if we denote m Q[n] = Q[m] then Q[m] ∈ m (Nn ) = Nm by assumption (ii). Since for all integers k ≥ n, we have Nn = n (Nk ), there exists a sequence (Q[k])k≥n such that Q[k] ∈ Nk and j Q[k] = Q[j] for all j, k ∈ N, n ≤ j ≤ k. Hence, we get + a sequence a0 = (Q[k])k∈N ∈ A. By Lemma 11, there exists a set Q(a0 ) ∈ 2V such that k Q(a0 ) = Q[k] for all k ∈ N. Therefore, Q[n] = n Q(a0 ) ∈ n N by the definition of N . Hence, Nn ⊆ n N . Altogether, Nn = n N . + Now, let N 0 ⊆ 2V be such that k N 0 = Nk for all k ∈ N. If N 0 6= N then there + exist j ∈ N and Q ∈ j (2V ) such that Q ∈ (N 0 −N )∪(N −N 0 ). Hence, Q = j Q ∈ 0 0 0 0 j (N − N ) ∪ j (N − N ) = (j N − j N ) ∪ (j N − j N ) = (Nj − Nj ) ∪ (Nj − Nj ) = ∅. This is a contradiction to the existence of Q. Therefore, N = N 0 and the lemma holds. 2
+
Lemma 13. Let X ∈ 2V and (Gn )n∈N be a sequence of context-free grammars satisfying the following conditions: (i) Gn ∈ Gn (X) for all n ∈ N, (ii) Φn,m (Gn ) = Gm for all (n, m) ∈ P. Then there exists exactly one generalized context-free grammar G ∈ G(X) such that Φn (G) = Gn for all n ∈ N. Proof. Let Gn = (Nn , V, Rn , n X) for all n ∈ N. Assumption (i) implies that + Nn ⊆ n (2V ) and Rn ⊆ n R for all n ∈ N. Assumption (ii) implies that for all (n, m) ∈ P, Nm = m (Nn ) and Rm = m (Rn ) because (Nm , V, Rm , m X) = Φn,m ((Nn , V, Rn , n X)) = (m (Nn ), V, m (Rn ), m (n X)). By Lemma 12, there ex+ ists exactly one set N ⊆ 2V such that n N = Nn for all n ∈ N. We put R = {(y, x) ∈ (N × (N N ∪ V )) ∩ R; (n y, n x) ∈ Rn for all n ∈ N}. By the definition of n R, we obtain n R ⊆ Rn for all n ∈ N. Now, consider an arbitrary n ∈ N and (yn , xn ) ∈ Rn . Since by already proved equations, Rm = m (Rn ) and Rn = n (Rl ) for all m, l ∈ N, m ≤ n ≤ l, there exists a sequence ((yk , xk )k∈N ) such that (yk , xk ) ∈ Rk for all k ∈ N and (j (yk ), j (xk )) = (yj , xj ) for all k, j ∈ N, k ≥ j. Since Rk ⊆ k R, we have for + + all k ∈ N, either xk = x for some x ∈ V or xk ∈ (k (2V ))(k (2V )). Lemma 11 implies that there exists exactly one production (y, x) ∈ R ∩ (N × (N N ∪ V )) such that (k y, k x) = (yk , xk ) for all k ∈ N. Hence, (y, x) ∈ R and (yn , xn ) = (n y, n x) ∈ n R. Therefore, Rn ⊆ n R. Altogether, n R = Rn for all n ∈ N. The definition of R and the fact, that N is the only set satisfying the condition n N = Nn for all n ∈ N, imply that R is the only set of productions from R such that n R = Rn for all n ∈ N. Summarizing, if we put G = (N, V, R, X) then G is the only generalized context-free grammar such that G ∈ G(X) and Φn (G) = Gn for all n ∈ N. 2 Theorem 14. ((Gn (X))n∈N ,(Φn,m )(n,m)∈P ) is an inverse system of sets with +
an inverse limit (G(X), (Φn)n∈N) for all X ∈ 2^{V+}.
Proof. Let X ∈ 2V be arbitrary. a) By the definition, Φn,n = idGn (X) for all n ∈ N and Φn,m maps the set Gn (X) into Gm (X) for all (n, m) ∈ P. If n, m, l ∈ N, n ≥ m ≥ l, and G = (N, V, R, n X) ∈ Gn (X) are arbitrary then by the definition and Lemma 3, Φm,l (Φn,m (G)) = Φm,l ((m N , V, m R, m X)) = (l N , V, l R, l X) = Φn,l (G). Hence, Φn,l = Φm,l b Φn,m and ((Gn (X))n∈N ,(Φn,m )(n,m)∈P ) is an inverse system of sets. b) By the definition, Φn is a mapping of the set G(X) into Gn (X) for all n ∈ N. Furthermore, if (n, m) ∈ P and G = (N, V, R, X) ∈ G(X) are arbitrary then by the definition and Lemma 3, Φn,m (Φn (G)) = Φn,m ((n N , V, n R, n X)) = (m N , V, m R, m X) = Φm (G). Hence, Φm = Φn,m b Φn and (G(X), (Φn )n∈N ) is a source of the inverse system ((Gn (X))n∈N ,(Φn,m )(n,m)∈P ).
c) Let (H, (Ψn )n∈N ) be an arbitrary source of the inverse system. Consider an arbitrary H ∈ H. By the definition of source, Ψn (H) ∈ Gn (X) for all n ∈ N. Moreover, for all (n, m) ∈ P, we have Ψm (H) = Φn,m (Ψn (H)). By Lemma 13 applied to the sequence (Ψn (H))n∈N , there exists exactly one generalized contextfree grammar G(H) ∈ G(X) such that Φn (G(H)) = Ψn (H) for all n ∈ N. We put Φ(H) = G(H). Then for all n ∈ N, Φn (Φ(H)) = Φn (G(H)) = Ψn (H). Thus, Φ is a mapping of the set H into G(X) such that Φn b Φ = Ψn for all n ∈ N. Now, let Ψ be an arbitrary mapping of the set H into G(X) satisfying the condition Φn b Ψ = Ψn for all n ∈ N. Consider an arbitrary H ∈ H. Then, we have Φn (Ψ (H)) = Ψn (H) = Φn (Φ(H)) for all n ∈ N. Consequently, Lemma 13 applied to the sequence (Ψn (H))n∈N implies that Ψ (H) = Φ(H). This means that Ψ = Φ. Hence, (G(X), (Φn )n∈N ) is an inverse limit of the inverse system ((Gn (X))n∈N ,(Φn,m )(n,m)∈P ) and the theorem is proved. 2 As a consequence of Theorem 6, we get the following assertion. +
Theorem 15. A language L ∈ 2^{V+} is context-free if and only if there exist n0 ∈ N and a subsystem (({Gn})n∈N, (Φn,m|{Gn})(n,m)∈P) of the inverse system ((Gn(L))n∈N, (Φn,m)(n,m)∈P) such that L = L(Gn) for all integers n ≥ n0.
Proof. This follows from Theorems 6 and 14.
□
Theorem 15 deals with an inverse system (({Gn})n∈N, (Φn,m|{Gn})(n,m)∈P) in which the sets of context-free grammars are one-element sets. The theorem can be generalized if we consider subsets of Gn(L) richer than one-element sets. The generalization requires the following lemma.
Lemma 16. Let L ∈ 2V . If A = (({Gn })n∈N ,(Φn,m |{Gn })(n,m)∈P ) is a subsystem of the inverse system ((Gn (L))n∈N ,(Φn,m )(n,m)∈P ) such that L ⊆ L(Gn ) for all n ∈ N, then A has an inverse limit ({G}, (Φn |{G})n∈N ) such that G ∈ G(L) and L = L(G). Proof. By Theorems 10 and 14, the inverse system A has an inverse limit e (Φn| G) e n∈N ) such that Ge ⊆ G(L). Lemma 13 implies that there exists exactly (G, one generalized context-free grammar G ∈ G(L) such that Φn (G) = Gn ∈ {Gn } for all n ∈ N. Hence, Ge = {G}. + We denote G = (N, V, R, L). Since G ∈ G(L), we have N ⊆ 2V and R ⊆ R. + By the definition of L(G) and Lemma 2, we obtain L(G) ⊆ L(2V , V, R, L) = L. Now, consider an arbitrary string w ∈ L. The equations Φn (G) = Gn imply Gn = (n N , V, n R, n L) for all n ∈ N. Since w ∈ L ⊆ L(G1 ), there exists an 1 L-derivation of w in 1 R. Obviously, each derivation in 1 R ⊆ 1 R can be considered consisting of two successive derivations, where the first one uses productions from 1 R ∩ 1 R(1) solely and derives a nonterminal string, the second one uses only productions from 1 R ∩ 1 R(2) and changes the nonterminals to terminals. Thus, if we denote |w| = l then there
exist nonterminals Q1 , . . . , Ql ∈ 1 N and terminals w1 , . . . , wl ∈ V such that w = w1 · · · wl and the 1 L-derivation of w can be divided into two successive derivations: ∗
a) 1 L ⇒ Q1 · · · Ql (1 R ∩ 1 R(1)), ∗ b) Q1 · · · Ql ⇒ w1 · · · wl (1 R ∩ 1 R(2)). Each production used in the 1 L-derivation of the nonterminal string Q1 · · · Ql is of the form (Q, Q0 Q00 ) with Q, Q0 , Q00 ∈ 1 N , i.e., during the deriving, the length of the nonterminal string (starting with 1 L and finishing with Q1 · · · Ql ) is continually increasing. In the set 1 N , the number of all nonterminals is finite likewise the number of all nonterminal permutations up to the length l. Though the 1 L-derivation of Q1 · · · Ql in 1 R need not be unique, the finite number of all productions in 1 R limits these derivations to a finite number. The previous considerations imply that there exist only finitely many 1 Lderivations of w in 1 R, where each of them is distinct from the others as to the used nonterminals, productions, and the order of their using. Now, let n ∈ N be arbitrary. Since L ⊆ L(Gn ), there exists an n L-derivation of w in n R, e.g., (si (n))2l−1 i=0 , where s0 (n) = n L, sl−1 (n) = Q1 · · · Ql , s2l−1 (n) = w, Q1 , . . . , Ql ∈ n N , si (n) ∈ (n N ∪ V )+ for all i ∈ {0, . . . , 2l − 1} and si−1 (n) ⇒ si (n) ({(yi (n), xi (n))}) with (yi (n), xi (n)) ∈ n R for all i ∈ {1, . . . , 2l − 1}. Considering an arbitrary m ∈ N, m ≤ n, by definitions and Lemma 3, we obtain: + + m (si (n)) ∈ m (n N ∪ V ) = (m N ∪ V ) for all i ∈ {0, . . . , 2l − 1}, m (s0 (n)) = m (n L) = m L, m (s2l−1 (n)) = m (w) = w, and (m (yi (n)), m (xi (n))) ∈ m (n R) = m R for all i ∈ {1, . . . , 2l − 1}. Hence, (m (si (n)))2l−1 i=0 is an m L-derivation of w in m R. The considerations also imply that for each n ∈ N and for each n L-derivation of w in n R, there exists the corresponding 1 L-derivation of w in 1 R. Since the number of all 1 L-derivations of w in 1 R is finite, obviously, there exist sequences 2l−1 ((si (n))2l−1 i=0 )n∈N , (({(yi (n), xi (n))})i=1 )n∈N such that for each n ∈ N, (i) (si (n))2l−1 i=0 is an n L-derivation of w in n R which uses successively productions from the sequence ({(yi (n), xi (n))})2l−1 i=1 , i.e., si−1 (n) ⇒ si (n) ({(yi (n), xi (n))}) for all i ∈ {1, . . . , 2l − 1}. (ii) m (si (n)) = si (m) for all m ∈ N, m ≤ n, and i ∈ {0, . . . , 2l − 1}. (iii) (m (yi (n)), m (xi (n))) = (yi (m), xi (m)) for all m ∈ N, m ≤ n, and i ∈ {1, . . . , 2l − 1}. Since for each i ∈ {0, . . . , 2l − 1}, the string si (n) has a fixed length independent of the index n, condition (ii) and repeatedly used Lemma 11 imply the existence of a string si ∈ (N ∪ V )+ such that n (s(i)) = si (n) for all n ∈ N. Namely, for i = 0, we get n (s0 ) = n L. Therefore, s0 = L by Lemma 11. Obviously, s2l−1 = w. Similarly, for each i ∈ {1, . . . , 2l − 1}, by condition (iii), Lemma 11, and Lemma 13 applied to the sequence ((n N, V, {(yi (n), xi (n))}, n L))n∈N , we obtain
exactly one generalized context-free grammar (N, V, {(y, x)}, L) ∈ G(L) such that {(y, x)} ⊆ R and (n y, n x) = (yi (n), xi (n)) for all n ∈ N. Hence, (si )2l−1 i=0 represents an L-derivation of w in R. Thus, w ∈ L(G) and L ⊆ L(G). Altogether, L = L(G) and the lemma is proved. 2 +
Theorem 17. Let L ∈ 2^{V+}. Then, L is a context-free language if and only if the inverse system ((Gn(L))n∈N, (Φn,m)(n,m)∈P) has an inverse subsystem ((ℋn)n∈N, (Φn,m|ℋn)(n,m)∈P) with an inverse limit (ℋ, (Φn|ℋ)n∈N) satisfying the following conditions:
(i) ℋ ⊆ G(L),
(ii) there exists a context-free grammar H ∈ ℋ,
(iii) L(H) = L for all H ∈ ℋ,
(iv) if G is a context-free grammar in Chomsky normal form such that V is its set of terminals and L = L(G), then τ(G) ∈ ℋ.
Proof. Conditions (ii) and (iii) imply that L is a context-free language. In the other direction, let L be a context-free language. We denote by Y the system of all context-free grammars in Chomsky normal form which generate L e = {τ (G); G ∈ Y }. and whose sets of terminals are equal to the set V . We put H Since L is a Λ-free context-free language, the system Y is non-empty. Therefore, e Then, there exists G ∈ Y such that τ (G) = e= H 6 ∅. Consider an arbitrary H ∈ H. H. Since G ∈ Y , we have L(G) = L. The definition of τ and Lemma 5 imply that H = τ (G) is a context-free grammar from G(L). Moreover, L = L(G) = e satisfies conditions (i)–(iv) L(τ (G)) = L(H) by Lemma 1. Thus, the system H of the theorem required of H. e the definition of Φn imFor each n ∈ N, we put Hn = {Φn (H); H ∈ H}; plies Hn ⊆ Gn (L). Considering an arbitrary (n, m) ∈ P and Hn ∈ Hn , by the definitions of Φn , Φn,m and Lemma 3, we obtain Φn,m (Hn ) = Φn,m (Φn (H)) = Φm (Φn (H)) = Φm (H) = Hm ∈ Hm , where H denotes the context-free grammar e for which Φn (H) = Hn . Analogously, for all n, m, l ∈ N, n ≥ m ≥ l, and from H, Hn ∈ Hn , we get Φm,l (Φn,m (Hn )) = Φm,l (Φm (Hn )) = Φl (Φm (Hn )) = Φl (Hn ) = Φn,l (Hn ). Hence, regarding Theorem 14, ((Hn )n∈N ,(Φn,m | Hn )(n,m)∈P ) forms an inverse subsystem of the inverse system ((Gn (L))n∈N ,(Φn,m )(n,m)∈P ). e (Φn | H) e n∈N ) is a source of the The definition of (Hn )n∈N implies that (H, inverse system ((Hn )n∈N ,(Φn,m| Hn )(n,m)∈P ). By Theorems 10 and 14, ((Hn )n∈N ,(Φn,m | Hn )(n,m)∈P ) has an inverse limit (H, (Φn | H)n∈N ) where H ⊆ G(L). So, the set H satisfies condition (i) of the theorem. Consider an arbitrary e ∈ H. e The definition of source and Lemma 13 imply H e ∈ H. Hence, H e ⊆ H. H Now, consider an arbitrary H ∈ H. By the definition of inverse limit (H, (Φn| H)n∈N ), we have Φn (H) ∈ Hn ⊆ Gn (L) for all n ∈ N. By the definition e such that Φn (H) = Φn (H(n)). of Hn , for each n ∈ N, there exists H(n) ∈ H e Since the set H satisfies condition (iii) of the theorem, L(H(n)) = L for all n ∈ N. By Lemma 4, we get L = L(H(n)) ⊆ L(Φn (H(n))) = L(Φn (H)) for all n ∈ N. If we regard the definition of inverse limit (H, (Φn | H)n∈N ) then
we have proved that (({Φn (H)})n∈N ,(Φn,m| {Φn (H)})(n,m)∈P ) is a subsystem of the inverse system ((Gn (L))n∈N ,(Φn,m )(n,m)∈P ) such that L ⊆ L(Φn (H)) for all n ∈ N. By Lemma 16, we obtain L = L(H). So, the set H satisfies condition (iii). e ⊆H The remaining conditions (i.e., (ii) and (iv)) are also satisfied because H e and H satisfies them. 2
References 1. Cohn, P. M.: Universal Algebra, Harper & Row, New York 1965. 2. Dr´ aˇsil, M.: A grammatical inference for C-finite languages, Arch. Math. Brno, vol. 25, no. 3, 1989, 163–174. 3. Dupont, P.: Regular grammatical inference from positive and negative samples by genetic search: the GIG method, Proc. of the Internat. Coll. on Grammatical Inference (ICGI-94), Lecture Notes in Artificial Intelligence, Vol. 862, Springer-Verlag, 1994, 236–245. 4. Fu, K. S., Taylor, L. B.: Grammatical inference: Introduction and survey, IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-5, no. 1, 1975, 95–111. 5. de la Higuera, C., Oncina, J., Vidal, E.: Identification of DFA: Data-dependent versus data-independent algorithms, Proc. 3rd Internat. Coll. on Grammatical Inference (ICGI-96), Lecture Notes in Artificial Intelligence, Vol. 1147, Springer, Berlin, 1996, 313–325. 6. Hopcroft, J. E., Ullmann, J. D.: Formal Languages and their Relation to Automata, Addison-Wesley, Reading 1969. 7. Gold, E. M.: Language identification in the limit, Inform. and Control, Vol. 10 (1967), 447–474. 8. Gr¨ atzer, G.: Universal Algebra, Van Nostrand, Princeton, New Jersey 1968. 9. Koshiba, T., M¨ akinen, E., Takada, Y.: Learning deterministic even linear languages from positive examples, Theoret. Comput. Sci. 185 (1997), 63–79. 10. Martinek, P.: On a Construction of Context-free Grammars, Fundamenta Informaticae, to appear. 11. Novotn´ y, M.: On some constructions of grammars for linear languages, Intern. J. Computer Math., vol. 17 (1985), 65–77. 12. Sakakibara, Y.: Efficient learning of context-free grammars from positive structural examples, Inform. Comput. 97 (1992), 23–60. 13. Sakakibara, Y.: Recent advances of grammatical inference, Theoret. Comput. Sci. 185 (1997), 15–45. 14. Sempere, J. M., Fos, A.: Learning linear grammars from structural information, Proc. 3rd Internat. Coll. on Grammatical Inference (ICGI-96), Lecture Notes in Artificial Intelligence, Vol. 1147, Springer, Berlin 1996, 126–133. 15. Tanatsugu, K.: A grammatical inference for harmonic linear languages, Intern. J. of Comput. and Inform. Sci., vol. 13, no. 5, 1984, 413–423.
Synthesizing Context Free Grammars from Sample Strings Based on Inductive CYK Algorithm
Katsuhiko Nakamura¹ and Takashi Ishiwata²
¹ Department of Computers and Systems Engineering, Tokyo Denki University, Hatoyama-machi, Saitama-ken, 350-0394 Japan. [email protected]
² COM Software Co., Ltd., 1-12-6 Kudankita, Chiyoda-ku, Tokyo, 102-0073 Japan. [email protected]
Abstract. This paper describes a method of synthesizing context free grammars from positive and negative sample strings, which is implemented in a grammatical inference system called Synapse. The method is based on incremental learning for positive samples and a rule generation method by “inductive CYK algorithm,” which generates minimal production rules required for parsing positive samples. Synapse can generate unambiguous grammars as well as ambiguous grammars. Some experiments showed that Synapse can synthesize several simple context free grammars in considerably short time.
1
Introduction
Among themes on machine learning studies, inductive inference of context free grammars is one of the most fundamental and important subjects. A reason of this is that since the properties of the context free languages have been studied for many years, we can make use of the results for investigating and evaluating the inductive inference. Another reason is that the grammatical inference is closely related to, and can be applied to, other machine learning approaches such as regular grammatical inference and inductive logic programming. In this paper, we present a method of synthesizing both ambiguous and unambiguous context free grammars (CFG’s) from positive and negative sample strings, which is implemented in an inductive grammar inference system called Synapse (Synthesis by Analyzing Positive String Examples). Synapse can generate both ambiguous grammars and unambiguous grammars. The inductive inference methods of this system are outlined as follows. 1. At first the system has no production rules. For a given positive sample string, it generates minimum production rules which derive this string. Then, it checks that the rules do not derive any given negative samples. This process continues until the system finds a rule set which derives all the positive samples and none of the negative samples. A.L. Oliveira (Ed.): ICGI 2000, LNAI 1891, pp. 186–195, 2000. c Springer-Verlag Berlin Heidelberg 2000
2. For generating production rules, the system uses “inductive CYK algorithm,” which generates sets of production rules required for parsing positive samples.
3. The inductive inference is based on incremental search, or iterative deepening, in the sense that the rule sets are searched for within given limits on the numbers of nonterminal symbols and rules. When the search fails, the system iterates the search with larger limits.
The CYK (Cocke, Younger and Kasami) algorithm is well known for solving the membership problem of CFLs [4]. This algorithm uses a 2-dimensional table of sets of nonterminal symbols for efficient bottom-up analysis. The time required for determining the membership of a string of length n is O(n³). The inductive CYK algorithm is similar to the usual CYK algorithm, except that when the rule set does not derive a string, it adds production rules so that the parsing always succeeds.
There has been a good deal of research on inductive inference of formal grammars. The works on context free grammars include approaches based on inductive logic programming and genetic algorithms [6]. Sakakibara and Kondo [7] show a method based on both a genetic algorithm and the CYK algorithm. Their use of the CYK algorithm is different from our approach in that possible tables of symbol sets are generated and tested in every generation of the GA process.
Some heuristics in the Synapse system are suggested by those employed in the program for solving the firing squad synchronization problem by Balzer [1]. The essential heuristics are as follows.
1. The system generates a production rule of the form A → BC only when the pair BC of symbols appears in testing the positive samples by the inductive CYK algorithm.
2. The system first produces a rule set for short positive sample strings, and then produces rule sets for longer sample strings by adding production rules.
To describe the search process simply, we represent the procedure of Synapse as a nondeterministic program [3] in a Pascal-like language, which is extended to include additional data structures and control statements. The Synapse system, written in the C language, is an implementation of the nondeterministic program.
2 Context Free Languages and CYK Algorithm
2.1 Context Free Grammar
A context free grammar (CFG) is a system G = (N, T, P, S), where N and T are finite sets of nonterminal symbols and terminal symbols, respectively; P is a finite set of production rules of the form A → β with A ∈ N, β ∈ (N ∪ T )∗ ; and S ∈ N is a starting symbol. A syntax tree for a CFG G = (N, T, P, S) is the labeled ordered tree satisfying the following conditions.
1. All the nonterminal nodes have labels of nonterminal symbols, and all the terminal nodes have labels of terminal symbols.
2. For each nonterminal node with a label A, the child nodes have labels α1, · · · , αn in the left-to-right order, if and only if A → α1 · · · αn ∈ P.
The string of terminal labels from left to right is called the result of this syntax tree. We write B ⇒*G w if there is a syntax tree such that the root has the label B and the result is w ∈ T∗. The language L(G) of G is defined by
L(G) = {w ∈ T∗ | S ⇒*G w}.
A CFG is ambiguous if there is a string w such that there are two or more different syntax trees with the result w and the root labeled by S. A CFG is weakly ambiguous if there is a string w and a nonterminal symbol B such that there are two or more different syntax trees with the result w and the root labeled by B. Every ambiguous CFG is weakly ambiguous, but the converse is not necessarily true.
2.2
Revised Chomsky Normal Form
Any CFG G can be transformed to Chomsky normal form such that all the productions have the forms A → BC and A → a, where A, B and C are nonterminal symbols, and a is a terminal symbol. To represent a grammar with fewer production rules, we use the revised Chomsky normal form
A → α1α2, α1, α2 ∈ N ∪ T.
No one-letter string can be derived from a revised Chomsky normal form grammar. We can, however, restrict the CFG to have this revised normal form without loss of generality, since the class of languages defined by such grammars includes all the CFLs of strings with two or more symbols.
2.3
CYK Algorithm
Figure 1 shows the CYK algorithm in a Pascal-like language. This algorithm differs from the original one, such as in [4], in that it works on the revised Chomsky normal form. We use the operation apply(P, S, T), defined by
apply(P, S, T) = {A | (A → BC) ∈ P, B ∈ S, C ∈ T},
to find the sets of nonterminal symbols derived from sets of terminal and nonterminal symbols in the bottom-up derivation. We can check the weak ambiguity of the grammar by testing whether the two sets T[i, j] and apply(P, T[i, k], T[i + k, j − k]) contain a common nonterminal symbol before the operation T[i, j] ← T[i, j] ∪ apply(P, T[i, k], T[i + k, j − k]).
function cyk(w : string, P : set of rule): boolean;
% This is a comment.
% cyk(w, P) returns true, if S derives w.
begin
  var T : array [1..n, 1..n] of set of symbol;
  var i, j, k: integer;
  % Initialization of the array. It is assumed that w = a1 · · · an.
  for i ← 1 until n do T[i, 1] ← {ai};
  % Bottom-up parsing
  for j ← 2 until n do
    for i ← 1 until n − j + 1 do begin
      T[i, j] ← ∅;
      for k ← 1 until j − 1 do
        T[i, j] ← T[i, j] ∪ apply(P, T[i, k], T[i + k, j − k])
    end;
  return(S ∈ T[1, n])
end

Fig. 1. CYK Algorithm for revised Chomsky normal form
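For readers who want to experiment, a direct Python transliteration of Figure 1 may be helpful. This is our own sketch, not the authors' C implementation; the rule encoding and the example grammar for a^n b^n (2 nonterminals and 3 rules, consistent with the counts reported later in Table 2) are our choices.

# Sketch (ours): CYK membership test for a grammar in the revised Chomsky normal form.
# Rules are pairs (A, (X, Y)) where A is a nonterminal and X, Y may each be a
# nonterminal or a terminal character.
def apply_rules(P, S1, S2):
    return {A for (A, (X, Y)) in P if X in S1 and Y in S2}

def cyk(w, P, start="S"):
    n = len(w)
    if n < 2:                      # no one-letter string is derivable in this form
        return False
    # T[i][j] holds the symbols deriving the substring of length j starting at i
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i in range(n):
        T[i][1] = {w[i]}           # a terminal "derives" itself
    for j in range(2, n + 1):
        for i in range(0, n - j + 1):
            for k in range(1, j):
                T[i][j] |= apply_rules(P, T[i][k], T[i + k][j - k])
    return start in T[0][n]

# A grammar for a^n b^n in revised Chomsky normal form: S -> ab | aC, C -> Sb.
P = {("S", ("a", "b")), ("S", ("a", "C")), ("C", ("S", "b"))}
print(cyk("aaabbb", P))   # True
print(cyk("aabbb", P))    # False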
3
Inductive CYK Algorithm and Synapse
In this section, we represent procedures employed in Synapse system as nondeterministic programs. We assume that all the production rules are of the revised Chomsky normal form unless otherwise specified. 3.1
Nondeterministic Programs
The nondeterministic programs include the following two additional operations. 1. Nondeterministic branch: This operation is specified by the statement of the form nondeterministic goto label ; This statement represents the choice point of two possible processes, to execute “goto label ” or to skip this statement. 2. Failure terminal by “failure” statement: When the control reaches to this terminal, no solution is obtained on the selected process. We consider the usual end terminal of a procedure as a success terminal. A nondeterministic function, or procedure, has a set of possible values, or possible results, respectively. We can transform the nondeterministic programs into usual deterministic programs to find either one or all the results by depthfirst search such that the control backtracks to the last choice point whenever it reaches to the failure terminal. The backtracking also occurs at any choice point when all the processes are found to be failed. In the case of search for all
solutions, backtracking also occurs at the success terminal. The nondeterministic branches are ordered so that a preceded process is tried first in the deterministic execution of the transformed program. For the nondeterministic goto statement, the goto operation is executed first, and this statement has no effect at the backtracking. The Pascal-like language is further extended to include the control statement of the form “for each X ∈ S do (statement)” for any ordered set S. This statement specifies |S| iteration of the inner statement with each value X ∈ S. 3.2
Inductive CYK Algorithm
Inductive CYK algorithm is an extension of CYK algorithm to have a function to add production rules when the production rules do not derive a given string. Figure 2 shows a nondeterministic program for inductive CYK algorithm. The function of the form inductive_cyk(w, P, K, R, Rmax ) returns a set of rules in the revised Chomsky normal form. We can obtain a program for producing unambiguous grammars by adding a statement that causes failure in the case that weak ambiguity is detected. The statement for this check if T [i, j] ∩ apply(P, {B}, {C}) 6= ∅ then failure; should be placed just before the statement T [i, j] ← T [i, j] ∪ apply(P, {B}, {C}). The following two propositions show the correctness and completeness of the inductive CYK algorithm for finding rule sets. Proposition 1. For any string w ∈ T T + , any set P of rules, and any positive integers K, R and Rmax with R ≤ Rmax , if the nondeterministic function ∗ inductive_cyk(w, P, K, R, Rmax ) has a possible value Q, then S ⇒G w for the CFG G = (N, T, Q, S) with |N | ≤ K and |Q| = R ≤ Rmax , where N is the set nonterminal symbols in P. For any sets P and Q of rules, P is a variant of Q, if all the rules in P can be converted to the rules in Q, and all the rules in P to those in Q by replacing the nonterminal symbols other than the starting symbol S. Proposition 2. For any string w ∈ T + and any revised Chomsky normal form ∗ CFG G = (N, T, P, S), if S ⇒G w and all the rules in P are used in the execution of cyk(w, P ), then there are integers K ≥ |N | and Rmax ≥ |P | such that a variant of P is a posssible value of inductive_cyk(w, ∅, K, 0, Rmax ). Intuitively, Proposition 2 says that execution of inductive_cyk(w, ∅, K, 0, Rmax ) nondeterministically returns all the revised Chomsky normal form rule sets for deriving w as possible values. We can prove these two propositions by mathematical induction on the number of rules.
function inductive_cyk(w : string, P : set of rule, K, R, Rmax : integer): set of rule;
% K is the number of nonterminal symbols.
% R is the number of rules. Rmax is the maximum number of rules.
begin
  var T : array [1..n, 1..n] of set of symbol;
  var i, j, k: integer; NK : set of symbol;
  NK ← {K possible nonterminal symbols};
  % Initialization of the array. It is assumed that w = a1 · · · an.
  for i ← 1 until n do T[i, 1] ← {ai};
  for j ← 2 until n do
    for i ← 1 until n − j + 1 do begin
      T[i, j] ← ∅;
      for k ← 1 until j − 1 do begin
        for each B ∈ T[i, k] do
          for each C ∈ T[i + k, j − k] do begin
            for each A ∈ NK do
              if (A → BC) ∉ P then begin
                nondeterministic goto brake;
                if R < Rmax then R ← R + 1 else failure;
                P ← P ∪ {(A → BC)};
              brake: end;
            T[i, j] ← T[i, j] ∪ apply(P, {B}, {C})
          end
      end
    end;
  if S ∈ T[1, n] then return(P) else failure
end
Fig. 2. Inductive CYK Algorithm
3.3
The Procedure of Synapse
Figure 3 shows a nondeterministic program for the top-level procedure of Synapse. The procedure call synapse(SP , SN , Kmax , Rlimit ) starts the process of Synapse system for sequences SP and SN of positive and negative sample strings, respectively, and the maximum numbers Kmax and Rlimit of nonterminal symbols and rules, respectively. The following two proposition show the correctness and completeness of the procedure of Synapse for finding CFG. Proposition 3. For any two sets SP , SN ⊂ T T + and any integers K and R, if the nondeterministic procedure synapse(SP , SN , K, R) has a possible result P , then SP ⊆ L(G) and SN ∩ L(G) = ∅ for a CFG G = (N, T, P, S) with K = |N |. Proposition 4. For any CFL L ⊆ T T + , there are two sets SP , SN ⊆ T T + with SP ⊆ L and SN ∩ L 6= ∅ and positive integers K and R such that
procedure synapse(SP, SN : list of strings, Kmax, Rlimit : integer);
% SP is a sequence of positive sample strings w1, w2, · · · , wn.
% SN is a sequence of negative sample strings v1, v2, · · · , vm.
% Kmax is the maximum number of nonterminal symbols.
% Rlimit is the limit of the number of rules in the search.
begin
  var P : set of rule;
  var K, R, Rmax, i, j: integer;
  for K ← 1 until Kmax do
    for Rmax ← 1 until Rlimit do begin
      P ← ∅; R ← 0;
      for i ← 1 until n do begin
        P ← inductive_cyk(wi, P, K, R, Rmax);
        for j ← 1 until m do
          if cyk(vj, P) then failure;
      end;
      Output the rules P;
      return % For finding all solutions, replace “return” by “failure.”
    end;
  Print “No grammar is found.”;
end

Fig. 3. Nondeterministic Program for Synapse
1. There is a possible result P of synapse(SP , SN , K, R); 2. L = L(G) for the CFG G = (N, T, P, S) with |N | ≤ K and |P | ≤ R, where N is the set of nonterminal symbols in P ; and 3. P is minimum, i.e. there is no other revised Chomsky normal form CFG G0 that satisfies L = L(G0 ) and has less rules than |P |. We can prove these two propositions by mathematical induction on the number of positive samples from Propositions 1 and 2.
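Propositions 3 and 4 describe Synapse as a sound and complete generate-and-test search over rule sets bounded by the numbers K of nonterminals and R of rules. The following Python sketch is our own, deliberately crude illustration of that search structure: it enumerates candidate rule sets directly (instead of using inductive CYK) with iterative deepening over K and R, and returns the first set that derives every positive sample and no negative one. All names, samples, and bounds are ours; the real system prunes far more aggressively and works incrementally.

# Sketch (ours): brute-force generate-and-test synthesis in the revised Chomsky
# normal form, with iterative deepening over K (nonterminals) and R (rules).
from itertools import combinations, product

def apply_rules(P, S1, S2):
    return {A for (A, (X, Y)) in P if X in S1 and Y in S2}

def cyk(w, P, start="S"):
    n = len(w)
    if n < 2:
        return False
    T = [[set() for _ in range(n + 1)] for _ in range(n)]
    for i in range(n):
        T[i][1] = {w[i]}
    for j in range(2, n + 1):
        for i in range(n - j + 1):
            for k in range(1, j):
                T[i][j] |= apply_rules(P, T[i][k], T[i + k][j - k])
    return start in T[0][n]

def brute_force_synthesis(positives, negatives, terminals, Kmax=2, Rmax=4):
    for K in range(1, Kmax + 1):
        nonterminals = ["S", "C", "D", "E"][:K]
        symbols = nonterminals + list(terminals)
        candidates = [(A, (X, Y)) for A in nonterminals
                      for X, Y in product(symbols, repeat=2)]
        for R in range(1, Rmax + 1):
            for rules in combinations(candidates, R):
                P = set(rules)
                if all(cyk(w, P) for w in positives) and not any(cyk(v, P) for v in negatives):
                    return P
    return None

positives = ["ab", "aabb", "aaabbb"]
negatives = [w for n in range(1, 5) for w in map("".join, product("ab", repeat=n))
             if w not in {"ab", "aabb"}]
print(brute_force_synthesis(positives, negatives, "ab"))
# Returns a 3-rule grammar consistent with the finite samples,
# e.g. S -> ab, S -> aC, C -> Sb.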
4
Synapse System and Performance Results
Synapse system is an implementation of the procedures in the previous section. The program, written in the C language, is quite short, approximately 400 lines. Table 1 shows ambiguous grammars synthesized by Synapse together with the numbers of nonterminal symbols and rules and the computation time in seconds. Table 2 shows the data for synthesizing unambiguous grammars. In these tables, #N and #Rules denote the numbers of nonterminal symbols and production rules, respectively, and #a(w) denotes the number of a's in the string w. In all the experiments, both positive and negative samples are strings in {a, b}∗ with a length not longer than seven. The positive samples are given to
Table 1. Computation time for ambiguous grammars

Language                              #N   #Rules   Time (sec.)
a^m b^n (1 ≤ m ≤ n)                   2    4        1
aa∗bb∗                                2    6        6
Parenthesis language                  2    4        1
#a(w) = #b(w), w ∈ {a, b}∗            2    7        2

Table 2. Computation time for unambiguous grammars

Language                              #N   #Rules   Time (sec.)
a^n b^n                               2    3        1
a^m b^n (1 ≤ m ≤ n)                   3    6        15
aa∗bb∗                                3    6        38
Parenthesis language                  3    6        308
the system in the order of their length. The negative samples are all the strings other than the positive samples. We used Windows version Visual C++ compiler and Intel Pentium II processor with 400 MHz clock. The positive samples for the parenthesis language are the following string. {ab, aabb, abab, aaabbb, aababb, aabbab, abaabb, ababab} For this input, we obtained the following two sets rules for ambiguous CFG’s, where S is the starting symbol. – S → SS, S → ab, S → aC, C → Sb – S → SS, S → ab, S → Cb, D → aS For unambiguous CFG’s, we obtained the rule sets including the followings. – S → aC, S → ab, S → aD, D → Sb, C → bS, C → DS – S → aC, S → ab, S → aD, D → Sb, C → bS, D → DS – S → aC, S → ab, S → aD, D → Sb, C → bS, C → SC Synapse could not synthesize unambiguous grammar for the language {w ∈ {a, b}| #a (w) = #b (w)} within possible run time. As far as the author knows that it is an open problem whether this language is inherently ambiguous, i.e. there is no unambiguous CFG for this language. Grammatical inference by Synapse is efficient when the positive sample strings are given in the order of their length, since the system generates few rules for each positive samples. When a longer positive sample string is given first, the system generates several rules at one time and requires testing a large number of possible sets of rules. Table 3 shows the computation time to synthesize unambiguous grammars of the language {am bn | 1 ≤ m ≤ n} for seven orders of positive strings. This shows that the computation time increases as the longer string is given first. We obtained the same rule sets for all the orders of samples.
Table 3. Computation time for various orders of positive samples

Order of positive samples                                        Time (sec)
ab, abb, abbb, aabb, abbbb, aabbb, abbbbb, · · · , aaabbbb        15
abb, abbb, aabb, abbbb, aabbb, abbbbb, · · · , aaabbbb, ab        46
abbb, aabb, abbbb, aabbb, abbbbb, · · · , aaabbbb, ab, abb        122
abbbb, aabbb, abbbbb, · · · , aaabbbb, ab, abb, abbb, aabb        445
abbbbb, aabbbb, · · · , aaabbbb, ab, abb, · · · , aabbb           598
abbbbbb, aabbbbb, aaabbbb, ab, abb, · · · , aabbbb, aaabbb        894
5
Concluding Remarks
We presented a method of synthesizing context free grammars from positive and negative sample strings and the performance results of Synapse system, which is an implementation of the method. Our grammatical inference is based on inductive CYK algorithm to generate all the sets of effective production rules for deriving a positive sample string. The system employs incremental learning for positive samples, in the sense that for each positive sample the system generates rules and adds them to the rule set, and that the positive samples are given to the system in order as in [5]. On the other hand, this approach might not be strictly called incremental learning: the system checks all the negative samples each time it finds a rule set for a positive sample. The reason why we chose this approach is that checking negative samples requires less time than generating rules for positive samples. Synapse system can synthesize several context free grammars, including unambiguous grammars, from positive and negative samples in rather short time. As a direct and practical application, we would use this system for synthesizing and searching for CFG’s with any given conditions in basic research of formal languages. The future problems and/or open problems include: – improving the efficiency of system; – how to determine the sets of positive and negative samples for more efficient syntheses; – further theoretical analysis of the efficiency of the system; and – application of the method to other machine learning such as inductive logic programming based on iterative deepening by Bratko [2]. Acknowledgement. The author thanks Professor Yasubumi Sakakibara for his encouragement and valuable discussions.
References 1. Balzer, Robert, An 8-State Minimal Time Solution to the Firing Squad Synchronization Problem, Information and Control 10 (1967) 22–42 .
2. Bratko, Ivan, Refining Complete Hypothesis in ILP, Proceedings of 9th International Workshop ILP ’99 (eds. S. Dzeroski and D. Flach), Springer-Verlag LNAI 1634 (1999) 44–55. 3. Floyd, R. W., Nondeterministic Algorithms, Jour. of ACM 14, No. 4 (1967) 636– 644. 4. Hopcroft, John E. and Ullman, Jeffrey E., Introduction to Automata Theory, Languages, and Computation, Addison-Wesley (1979). 5. Parekh, Rajesh and Honavor, Vasant, An Incremental Interactive Algorithm for Regular Grammar Inference, Third International Colloquium, ICGI-96 (1996) 222– 237. 6. Sakakibara, Yasubumi, Recent Advances of Grammatical Inference, Theoretical Computer Science 185, i1997) 15–45. 7. Sakakibara, Yasubumi and Kondo, Mitsuhiro, GA-Based Learning of Context-Free Grammars Using Tabular Representations, Proc. 16th International Conference of Machine Learning (1999) 354–360.
Combination of Estimation Algorithms and Grammatical Inference Techniques to Learn Stochastic Context-Free Grammars*
Francisco Nevado¹, Joan-Andreu Sánchez², and José-Miguel Benedí²
¹ Instituto Tecnológico de Informática, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia (Spain). e-mail: [email protected]
² Depto. Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Camino de Vera s/n, 46022 Valencia (Spain). e-mail: {jandreu,jbenedi}@dsic.upv.es
Abstract. Some of the most widely-known methods to obtain Stochastic Context-Free Grammars (SCFGs) are based on estimation algorithms. All of these algorithms maximize a certain criterion function from a training sample by using gradient descendent techniques. In this optimization process, the obtaining of the initial SCFGs is an important factor, given that it affects the convergence process and the maximum which can be achieved. Here, we show experimentally how the results can be improved in cases when structural information about the task is inductively incorporated into the initial SCFGs. In this work, we present a stochastic version of the well-known Sakakibara algorithm in order to learn these initial SCFGs. Finally, an experimental study on part of the Wall Street Journal corpus was carried out.
1
Introduction
Over the last decade there has been an increasing interest in Stochastic ContextFree Grammars (SCFGs) for use in different tasks within the framework of Syntactic Pattern Recognition and Computational Linguistics [5,9,10,4]. The reasons for this are twofold: first, they are able to model the long term dependencies established between the different units of a string; and second, they incorporate stochastic information that allows for an adequate modeling of the variability phenomena that are always present in complex problems. However, although the SCFGs have been successfully used on limited-domain tasks of low perplexity, the general-purpose SCFGs work poorly on large vocabulary tasks. One of the main obstacles to using these models is the learning of SCFGs for ?
This work has been partially supported by the European Union under contract EUTRANS (ESPRIT LTR-30268) and by the Spanish CICYT under contract (TIC98/0423-C06)
complex real tasks. In this work, we explore the combination of estimation algorithms and grammatical inference techniques to learn SCFGs. One of the most widely-known methods for estimating SCFGs is the InsideOutside (IO) algorithm [6]. Unfortunately, the application of this algorithm presents important problems which are accentuated in real tasks: the time complexity per iteration and the large number of iterations that are necessary to converge. An alternative to the IO algorithm is an algorithm based on the Viterbi score (VS algorithm) [9,16]. The convergence of the VS algorithm is faster than the IO algorithm; however, the SCFGs obtained are, in general, not as well learned. Recently, another possibility for estimating SCFGs, which is somewhere between the IO and VS algorithms, has been proposed. This alternative considers only a certain subset of derivations in the estimation process. In order to select this subset of derivations, two alternatives were considered: from structural information content in a bracketed corpus [10,1] and from statistical information content in the k−best derivations [14]. In the first alternative, modifications of the IO and VS algorithms which learn SCFGs from a bracketed corpus were defined [10,1]. In the second alternative, a new algorithm for the estimation of the probability distributions of SCFGs from the k−best derivations was proposed [14]. All of these estimation algorithms are based on gradient descendent techniques and it is well-known that their behavior depends on the appropriate choice of the initial grammar. The usual method for obtaining this initial grammar is a heuristic initialization based on an ergodic model [6,10,15]. We explore the possibility of using methods of grammatical inference in order to obtain good initial models from both structural and probabilistic points of view. In this work, we present a stochastic version of the Sakakibara algorithm [12] that allows us to obtain SCFGs from the structural information of a bracketed corpus. Experiments with a part of the Wall Street Journal processed in the Penn Treebank project [8] were carried out in order to compare these algorithms with previous results [15]. In the following section, some considerations related to the estimation algorithms are presented together with the notation used. Next, the stochastic version of the Sakakibara algorithm is also presented. Finally, the experiments illustrating the behaviour of this algorithm are reported.
2
Estimation of the Probabilities of a SCFG
An important problem related to SCFGs is how to estimate the probabilities of the rules from a set of samples. In this section, we present a unified formal framework to describe these estimation algorithms. First of all, we introduce some notation. We then set up the problem of the estimation of the probabilities of a SCFG, and, finally, we describe some algorithms to estimate the probabilities of a SCFG.
A Context-Free Grammar (CFG) G is a four-tuple (N, Σ, P, S), where N is a finite set of non-terminal symbols, Σ is a finite set of terminal symbols (N ∩ Σ = ∅), P is a finite set of rules of the form A → α where A ∈ N and α ∈ (N ∪ Σ)+ (we only consider grammars with no empty rules), and S is the initial symbol (S ∈ N). A CFG in Chomsky Normal Form (CNF) is a CFG in which the rules are of the form A → BC or A → a (A, B, C ∈ N and a ∈ Σ). A left-derivation of x ∈ Σ+ in G is a sequence of rules dx = (p1, p2, . . . , pm), m ≥ 1, such that S ⇒^{p1} α1 ⇒^{p2} α2 . . . ⇒^{pm} x, where αi ∈ (N ∪ Σ)+, 1 ≤ i ≤ m − 1, and pi rewrites the left-most non-terminal of αi−1. A Stochastic Context-Free Grammar (SCFG) Gs is defined as a pair (G, q) where G is a CFG and q : P → ]0, 1] is a probability function of rule application such that ∀A ∈ N: Σ_{α∈(N∪Σ)+} q(A → α) = 1.
Let dx be a left-derivation (derivation from now on) of the string x. We define the probability of the derivation dx of the string x, Pr(x, dx | Gs), as the product of the probability application function of all the rules used in the derivation dx. We define the probability of the string x as:
Pr(x | Gs) = Σ_{∀dx} Pr(x, dx | Gs),
and the probability of the best derivation of the string x as:
P̂r(x | Gs) = max_{∀dx} Pr(x, dx | Gs).
Let Δx be a finite set of different derivations of the string x. Analogously, we define the probability of the string x with respect to Δx as:
Pr(x, Δx | Gs) = Σ_{dx∈Δx} Pr(x, dx | Gs),     (1)
and the probability of the best derivation of the string x from a set of derivations Δx as:
P̂r(x, Δx | Gs) = max_{dx∈Δx} Pr(x, dx | Gs).
We also define the best derivation, dbx , as the argument which maximizes this function. The problem of estimating the probabilities of a SCFG from a set of strings can be formulated as an optimization problem in order to approximate a stochastic distribution defined over the training set. To handle this problem, it is necessary to define: a framework to carry out the optimization process and an objective function to be optimized. This function depends on the training set and is defined in terms of the probabilities of the rules. In this work, we have considered the framework of Growth Transformations [2] in order to optimize the objective function. This is a gradient descendent technique which only guarantees that a local maximum is achieved.
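The quantities Pr(x | Gs) and P̂r(x | Gs) can be computed for a CNF SCFG by the usual inside and Viterbi dynamic-programming recursions. The paper gives no code; the following Python sketch is ours, and the toy grammar and probabilities in it are made up purely for illustration.

# Sketch (ours): Pr(x | Gs) and the best-derivation probability for a SCFG in CNF.
# Rules are given as {A: [(rhs, prob), ...]} with rhs either a terminal string "a"
# or a pair of nonterminals ("B", "C").
from collections import defaultdict

def inside_and_viterbi(x, rules, start="S"):
    n = len(x)
    inside = defaultdict(float)   # inside[(i, j, A)]  = sum over derivations
    best = defaultdict(float)     # best[(i, j, A)]    = max over derivations
    for i, a in enumerate(x):
        for A, rhss in rules.items():
            for rhs, p in rhss:
                if rhs == a:
                    inside[(i, i + 1, A)] += p
                    best[(i, i + 1, A)] = max(best[(i, i + 1, A)], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for A, rhss in rules.items():
                for rhs, p in rhss:
                    if isinstance(rhs, tuple):
                        B, C = rhs
                        for k in range(i + 1, j):
                            inside[(i, j, A)] += p * inside[(i, k, B)] * inside[(k, j, C)]
                            cand = p * best[(i, k, B)] * best[(k, j, C)]
                            best[(i, j, A)] = max(best[(i, j, A)], cand)
    return inside[(0, n, start)], best[(0, n, start)]

# Toy SCFG for {b^n a b^n}:  S -> B A (0.5) | a (0.5),  A -> S B (1.0),  B -> b (1.0)
rules = {"S": [(("B", "A"), 0.5), ("a", 0.5)],
         "A": [(("S", "B"), 1.0)],
         "B": [("b", 1.0)]}
print(inside_and_viterbi("bab", rules))
# (0.25, 0.25): "bab" has a single derivation of probability 0.5 * 1 * 1 * 0.5 * 1.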
In reference to the function to be optimized, we will consider the following function:
Pr(Ω, ΔΩ | Gs) = Π_{x∈Ω} Pr(x, Δx | Gs),     (2)
where Ω is a multiset of strings. It is important to note that expression (2) defines a family of functions which depend on the set Δx. It can be seen that expression (2) coincides with the likelihood of the sample when Δx has the maximum number of derivations of each string x ∈ Ω. Expression (2) coincides with the likelihood of the best parse of the sample when Δx has only the best derivation of each string x ∈ Ω.
The transformation to be applied to the probabilities of the rules can be formulated in a unified expression. Given an initial SCFG Gs and a finite training sample Ω, the following function can be used to modify the probabilities (∀(A → α) ∈ P):
q′(A → α) = [ Σ_{x∈Ω} (1 / Pr(x, Δx | Gs)) Σ_{dx∈Δx} N(A → α, dx) Pr(x, dx | Gs) ] / [ Σ_{x∈Ω} (1 / Pr(x, Δx | Gs)) Σ_{dx∈Δx} N(A, dx) Pr(x, dx | Gs) ].     (3)
The expression N(A → α, dx) represents the number of times that the rule A → α has been used in the derivation dx, and N(A, dx) is the number of times that the non-terminal A has been derived in dx. This transformation attempts to improve the function Pr(Ω, ΔΩ | Gs) of expression (2), guaranteeing that Pr(Ω, ΔΩ | G′s) ≥ Pr(Ω, ΔΩ | Gs). It can be proven that this transformation guarantees that the estimated models are consistent [13], that is, they generate stochastic languages. Therefore, algorithms based on this transformation are appropriate for use in a stochastic framework.
When Δx has the total number of derivations of each x, transformation (3) coincides with the IO algorithm, while when Δx has only the best derivation over all possible derivations, transformation (3) coincides with the VS algorithm. Based on this transformation, new estimation algorithms can be defined in which only a subset of derivations, Δx, is used in the estimation process. This Δx can be chosen according to stochastic criteria [14], or according to structural criteria [10,1].
When stochastic criteria are considered, one possible way of constructing the subset of derivations can be to select the k-most probable derivations of each string in the sample [14]. The idea is the same as the one which is applied in the VS algorithm and so we call it the kVS algorithm. This algorithm considers more information than the VS algorithm, and the models are, in general, better estimated [14]. There exists an efficient algorithm to compute the k-most probable derivations of a string which is based on a Dynamic Programming scheme. The time complexity to obtain the best derivation of a string x is O(|x|³|P|). The time complexity to obtain each new derivation is, in practice, approximately proportional to the number of rules of the previous derivation times a logarithmic factor [14].
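Once the derivation sets Δx are fixed, transformation (3) becomes a simple counting update. The following Python sketch is our own illustration (the rule encoding and the toy numbers are made up): it applies one step of (3) given, for each training string, an explicit list of derivations. With all derivations this corresponds to an IO-style step, with the single best derivation to a VS step, and with the k best ones to the kVS step.

# Sketch (ours): one application of transformation (3) from explicit derivation sets.
from collections import defaultdict
from math import prod

def reestimate(samples, q):
    """samples: list of Delta_x (one per string x), each a list of derivations,
    each derivation a list of rules; q: dict rule -> probability."""
    num = defaultdict(float)   # numerator of (3), indexed by rule (A, alpha)
    den = defaultdict(float)   # denominator of (3), indexed by nonterminal A
    for delta_x in samples:
        probs = [prod(q[r] for r in dx) for dx in delta_x]   # Pr(x, dx | Gs)
        pr_x = sum(probs)                                     # Pr(x, Delta_x | Gs)
        if pr_x == 0.0:
            continue
        for dx, p in zip(delta_x, probs):
            for rule in dx:                 # iterating occurrences counts N(A->alpha, dx)
                num[rule] += p / pr_x
                den[rule[0]] += p / pr_x
    return {rule: num[rule] / den[rule[0]] for rule in q if den[rule[0]] > 0.0}

# Toy example: CNF rules for {b^n a b^n} with arbitrary initial probabilities.
q = {("S", ("B", "A")): 0.7, ("S", "a"): 0.3, ("A", ("S", "B")): 1.0, ("B", "b"): 1.0}
# Two training strings, each with its (single) derivation given as a list of rules.
samples = [
    [[("S", "a")]],                                                               # x = "a"
    [[("S", ("B", "A")), ("B", "b"), ("A", ("S", "B")), ("S", "a"), ("B", "b")]], # x = "bab"
]
print(reestimate(samples, q))
# S -> BA gets 1/3 and S -> a gets 2/3 here; the A and B rules stay at 1.0.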
To study the time complexity of the proposed estimation algorithm, we assume that the SCFG is in CNF. The set of the k most probable derivations, for small values of k, is calculated for each string x in the sample with a time complexity of approximately O(|x|³|P|) [14]. Therefore, the time complexity of the algorithm per iteration is O(|Ω|n³|P|), where n is the length of the longest string in the sample.

When structural criteria are considered, the set of derivations used in the estimation process is chosen according to the structural information contained in the training sample [10,1]. This structural information is typically represented in the sample by parentheses. In the estimation process, only those derivations which are compatible with these parentheses are considered as appropriate derivations. A modification of the IO algorithm which takes advantage of the bracketing of a training sample was presented in [10]. In this IOb algorithm, only those partial derivations which are compatible with the bracketing are considered in the estimation process. Another estimation algorithm which considers the structural information in the sample, and which is based on the VS algorithm, was defined in [1]: the VSb algorithm. The selection of the best derivation is made from those derivations which are compatible with the bracketing defined on the sample. The time complexity of the IOb and VSb algorithms per iteration is O(|Ω|n³|P|). Finally, a new estimation algorithm which combines both structural and stochastic information to choose the set of derivations is also proposed: the kVSb algorithm. This algorithm considers only the k best derivations which are compatible with the structural information in the sample. The time complexity of this algorithm is the same as that of the kVS algorithm.

As we have stated, all of these algorithms are based on a gradient descent technique, and, therefore, the choice of the initial grammar is a fundamental aspect, since it affects both the maximum achieved and the convergence process. All of the algorithms reach a local maximum, and this maximum depends on the initial grammar. Given the normal form used in the grammars, the initial grammar is typically constructed in a heuristic fashion from a set of terminals and a set of non-terminals. The most common way is to construct an ergodic model with the maximum number of rules which can be formed with a given number of non-terminals and a given number of terminals [6]. Then, randomly generated initial probabilities are attached to the rules. We conjecture that the results can be improved in some cases when structural information about the task is inductively incorporated into the initial SCFGs. In the following section, we explore the use of grammatical inference techniques to obtain these initial SCFGs from a training sample.
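To make the ergodic initialization concrete, the sketch below builds a CNF SCFG with every rule A → BC and A → a that can be formed from given non-terminal and terminal sets, and attaches random probabilities normalized per left-hand side. This is an illustrative sketch of the construction described above, not the authors' code.

    import random

    def ergodic_cnf_scfg(non_terminals, terminals, seed=0):
        """Maximum-size CNF grammar with random probabilities, normalized per non-terminal."""
        rng = random.Random(seed)
        rules = {}
        for a in non_terminals:
            rhs_list = [(b, c) for b in non_terminals for c in non_terminals]
            rhs_list += [(t,) for t in terminals]
            weights = [rng.random() for _ in rhs_list]
            total = sum(weights)
            for rhs, w in zip(rhs_list, weights):
                rules[(a, rhs)] = w / total
        return rules

    # With 14 non-terminals and 45 terminals this yields 14*(14*14 + 45) = 3,374 rules,
    # matching the size of the ergodic model used in the experiments below.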
3 Learning the Structure of a SCFG
We consider the possibility of obtaining SCFGs by using grammatical inference techniques in order to get both structural and stochastic information from the
training sample. Taking into account that we have a bracketed corpus available, we present a stochastic version of the Sakakibara algorithm [12,7].

As described in [12], the Sakakibara algorithm infers the minimum reversible context-free grammar which is consistent with the input structural sample. The reversible context-free grammars are a normal form for general context-free grammars, so we are not restricted to a subclass of context-free languages. A context-free grammar G is said to be reversible if:
1. it is invertible, that is, A → α and B → α in P implies A = B, and
2. it is reset-free, that is, A → αBβ and A → αCβ in P implies B = C.

The input to the algorithm is a structural training sample, that is, a set of strings with a syntactic tree associated to each string. The syntactic tree is typically represented by brackets. In the initialization step, the algorithm creates a context-free rule for every internal node of the trees in the sample, where the words are the terminals and every internal node is given a non-terminal identifier. Then a merging process is carried out in order to join the non-terminals that do not satisfy the invertibility and reset-freeness conditions. First, the invertibility condition is tested: if two rules have the same right-hand side, then, by the invertibility condition, the non-terminals of their left-hand sides are joined into a new non-terminal. Second, the reset-freeness condition is tested: if two rules have the same left-hand side, and the symbols of their right-hand sides are the same except for one pair of non-terminals, then the discordant pair of non-terminals must be merged. Finally, the two conditions are tested repeatedly until they do not cause any merge. The obtained context-free grammar identifies a reversible language that includes the sample [12]. The Sakakibara algorithm has a time complexity of O(n²) [7], with n being the number of internal tree nodes of the structural sample.

In order to use the obtained grammar as the initial grammar for the estimation algorithms described in Section 2, a stochastic version of the Sakakibara algorithm has been proposed. This new version allows us to obtain the initial probabilities of the rules, which are calculated from the frequency of appearance of the subtrees in the structural sample. To compute the probabilities of the rules, the algorithm proceeds as follows (a sketch is given below). First, when the initial rules are created from the structural data, a counter initialized to one is attached to every rule. Second, in the merging process, if two non-terminals have to be merged (causing several rules to become the same rule), the counters of these rules are added, and this new value is assigned to the resulting rule. Once the merging process has finished, a normalization of the frequencies is carried out for every set of rules with the same left-hand side.

The stochastic version of the Sakakibara algorithm obtains a SCFG in general format, but the estimation algorithms need the SCFG in CNF. We transform the SCFG in general format to a SCFG in CNF [3], keeping the probability distribution of the initial grammar over the training sample.
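The following sketch captures the counting-and-merging idea just described. It follows the textual description only (one rule per internal node with a counter, union-find merging driven by the invertibility and reset-freeness tests, counter addition when rules collapse, and a final per-left-hand-side normalization); it is an illustrative approximation, not the authors' implementation, and it omits the final conversion to CNF. The convention that non-terminal names start with 'N' is an assumption made for this sketch.

    from collections import defaultdict

    def stochastic_sakakibara(rules_with_counts):
        """rules_with_counts: dict {(lhs, rhs): count} built from the treebank,
        one rule per internal node (rhs is a tuple of terminals/non-terminals),
        counts accumulated over the sample starting from one per occurrence."""
        parent = {}

        def find(a):
            parent.setdefault(a, a)
            while parent[a] != a:
                parent[a] = parent[parent[a]]
                a = parent[a]
            return a

        def union(a, b):
            parent[find(a)] = find(b)

        def is_nt(sym):
            return sym.startswith('N')          # naming convention assumed here

        def canonical(rules):
            merged = defaultdict(int)
            for (lhs, rhs), c in rules.items():
                lhs = find(lhs)
                rhs = tuple(find(s) if is_nt(s) else s for s in rhs)
                merged[(lhs, rhs)] += c         # counters of collapsed rules add up
            return merged

        rules = canonical(rules_with_counts)
        changed = True
        while changed:
            changed = False
            # invertibility: same right-hand side => merge the left-hand sides
            by_rhs = defaultdict(set)
            for (lhs, rhs) in rules:
                by_rhs[rhs].add(lhs)
            for lhss in by_rhs.values():
                lhss = list(lhss)
                for other in lhss[1:]:
                    if find(other) != find(lhss[0]):
                        union(other, lhss[0]); changed = True
            # reset-freeness: same lhs, right-hand sides differing in one non-terminal pair
            rule_list = list(rules)
            for i, (l1, r1) in enumerate(rule_list):
                for (l2, r2) in rule_list[i + 1:]:
                    if l1 == l2 and len(r1) == len(r2):
                        diff = [(a, b) for a, b in zip(r1, r2) if a != b]
                        if (len(diff) == 1 and is_nt(diff[0][0]) and is_nt(diff[0][1])
                                and find(diff[0][0]) != find(diff[0][1])):
                            union(diff[0][0], diff[0][1]); changed = True
            rules = canonical(rules)
        # normalization of the counters for every set of rules with the same left-hand side
        totals = defaultdict(int)
        for (lhs, _), c in rules.items():
            totals[lhs] += c
        return {rule: c / totals[rule[0]] for rule, c in rules.items()}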
Table 1. Characteristics of the data sets defined for the experiments when the sentences with more than 15 POStags were removed.

    Data set   # sentences   Av. length   Std. deviation
    Training   9,933         10.67        3.46
    Test       2,295         10.51        3.55
4 Experiments with the Penn Treebank Corpus
Both the estimation algorithms described in Section 2 and the grammatical inference technique to obtain the initial SCFGs described in Section 3 were tested in order to compare their performance. In this section, we describe these experiments.

The corpus used in the experiments was the part of the Wall Street Journal corpus which had been processed in the Penn Treebank project¹ [8]. This corpus consists of English texts collected from the Wall Street Journal from editions of the late eighties. It contains approximately one million words. This corpus was automatically labelled, analyzed and manually checked as described in [8]. There are two kinds of labelling: a part-of-speech (POStag) labelling and a syntactic labelling. The size of the vocabulary is greater than 25,000 different words, the POStag vocabulary is composed of 45 labels², and the syntactic vocabulary is composed of 14 labels. Given the time complexity of the algorithms to be used, we decided to work with only the POStag labelling, since the vocabulary of the original corpus was too large for the experiments to be carried out. The corpus was divided into sentences according to the bracketing.

We took advantage of SCFGs estimated in a previous work [14]. These SCFGs were estimated using sentences which had less than 15 POStags. Therefore, in this work we assumed such a restriction. For the experiments, the corpus was divided into a training corpus (directories 00 to 19) and a test corpus (directories 20 to 24). The characteristics of these sets can be seen in Table 1. The perplexity per word was used to evaluate the goodness of the obtained models. The test set perplexity³ with 3-grams and linear interpolation was 9.63.

This corpus was used to obtain a SCFG in three different ways. First, we considered the estimation methods described in Section 2 from an initial ergodic grammar. Then, we considered the inference method explained in Section 3. Finally, we explored the methods described in Section 2 from an initial model obtained in the way described in Section 3.
¹ Release 2 of this data set can be obtained from the Linguistic Data Consortium with Catalogue number LDC94T4B (http://www.ldc.upenn.edu/ldc/noframe.html).
² There are 48 labels defined in [8]; however, three do not appear in the corpus.
³ The values were computed with the software tool described in [11] (Release 2.04 is available at http://svr-www.eng.cam.ac.uk/~prc14/toolkit.html).
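For reference, the test-set perplexity per word reported throughout this section can be computed from per-sentence log-probabilities as in the short sketch below. This is only the standard definition made explicit; it is not the tool of [11].

    import math

    def perplexity_per_word(log_probs, lengths):
        """log_probs: natural-log probabilities of the test sentences under the model.
        lengths: number of words (here, POStags) in each sentence."""
        total_log_prob = sum(log_probs)
        total_words = sum(lengths)
        return math.exp(-total_log_prob / total_words)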
SCFG Estimated from an Initial Ergodic Model

In the first experiment, an initial ergodic SCFG in CNF was constructed. This SCFG had the maximum number of rules (3,374 rules) which can be composed with 45 terminal symbols (the number of POStags) and 14 non-terminal symbols (the number of syntactic labels). The probabilities were randomly generated and three seeds were tested. Given that the results were similar, only one of the seeds is reported. Table 2 shows the perplexity of the final model for the algorithms considered (VS, kVS, IOb, VSb, kVSb). We also report the number of iterations to convergence and the number of rules in the final model.

Table 2. Results of the estimation algorithms with an initial ergodic grammar.

                   VS     kVS (k = 16)   IOb    VSb    kVSb (k = 16)
    Perplexity     21.59  19.01          13.38  21.82  19.15
    # Iterations   40     67             100    32     50
    # Rules        193    176            660    256    248
It can be seen that the IOb algorithm achieved better results than all the other estimation algorithms. However, it is important to note that the number of iterations that this algorithm needed to reach convergence was much larger than the number of iterations needed by the other algorithms. In the estimation process, the VS, kVS, VSb and kVSb algorithms assign null probability to the rules that do not participate in the parsing of a string, and these rules are not considered in the following iterations. Therefore, the final models were smaller than the models obtained with the IOb algorithm. In this last algorithm, some rules disappeared due to probability underflow. In addition, it can be seen that the perplexity achieved by the VS (kVS) and VSb (kVSb) algorithms was practically the same. Finally, it should be noted that the kVS (kVSb) algorithm improved the results obtained by the VS (VSb) algorithm even for small values of k.

SCFG Obtained by the Sakakibara Algorithm

In the second experiment, a SCFG in general format was obtained with the method described in Section 3. This grammar had 73 non-terminals and 1,752 rules. After transforming it into CNF, this SCFG was evaluated; Table 3 shows its perplexity, together with the number of rules and the number of non-terminals of this model. It can be seen that this result improved the results obtained with some of the estimation algorithms which began with an initial ergodic model (see Table 2).
Table 3. Results with the SCFG obtained with the stochastic Sakakibara algorithm.

    Perplexity   # Rules   # Non-terminals
    18.97        5,263     3,585
Due to the characteristics of this bracketed corpus, in the final grammar most of the non-terminals had only one rule and most of the structural information was contained in the initial symbol. On the other hand, it is also important to note that the number of rules in the final grammar in CNF was much larger than the number of rules obtained using the previous estimation methods.

SCFG Estimated from an Initial Grammar Obtained with the Sakakibara Algorithm

In the third experiment, the SCFG obtained by the Sakakibara algorithm was used as the initial grammar by the estimation algorithms mentioned in Section 2. The results obtained after the estimation process can be seen in Table 4.

Table 4. Results of the different algorithms with an initial grammar obtained using a stochastic version of the Sakakibara algorithm.

                   VS     kVS (k = 16)   IOb    VSb    kVSb (k = 16)
    Perplexity     18.81  17.93          17.62  19.09  17.76
    # Iterations   15     7              15     7      14
    # Rules        3,532  3,880          4,331  4,181  4,163
First, we compare these results with the results shown in Table 2. It can be seen that the perplexity of the grammars estimated using the VS, kVS, VSb and kVSb algorithms decreased with respect to the experiments using the ergodic initial model. This improvement was even greater for the algorithms which use less stochastic information, that is, the VS and VSb algorithms. This is an important result because no information about the task was used in this initial model; conversely, in the ergodic model, the number of non-terminals was chosen according to heuristic knowledge about the task. However, the perplexity obtained with the IOb algorithm did not decrease in the same way. This may be due to the characteristics of the initial model. As mentioned above, in the grammar obtained by the Sakakibara algorithm, most of the non-terminals had only one rule and the initial symbol had most of the rules. Therefore, the estimation was concentrated on this non-terminal, leaving little margin to the estimation process.

Next, we compare these results with the results shown in Table 3. It can be observed that these models improved the perplexity only a little; in the case of the VSb algorithm, the perplexity even increased. As before, this may be due to the characteristics of the initial model. It is also important to note that the size of the final models was much larger than that obtained in the experiments with the ergodic initial model. However, the number of iterations needed to converge was significantly smaller.
5 Conclusions
In this work, we have studied the problem of the initialization of estimation methods for SCFGs. We have seen that most of the estimation methods yield better models when structural information is incorporated into the initial model. The improvement is especially important for estimation methods which are very sensitive to the initialization. However, more robust estimation algorithms, like the IOb algorithm, can be negatively affected if the initialization method is not flexible enough.

The evaluation measure used in the experiments was the test set perplexity. Other methods for evaluating the structural information contained in the final models should be considered. On the other hand, the properties of the initial grammar obtained by the inference method should be studied in order to assess its influence on the final results.
References

1. F. Amaya, J.M. Benedí, and J.A. Sánchez. Learning of stochastic context-free grammars from bracketed corpora by means of reestimation algorithms. In M.I. Torres and A. Sanfeliu, editors, Proc. VIII Spanish Symposium on Pattern Recognition and Image Analysis, pages 119–126, Bilbao, España, May 1999. AERFAI.
2. L.E. Baum. An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes. Inequalities, 3:1–8, 1972.
3. J.M. Benedí and J.A. Sánchez. Stochastic context-free grammars in general form to Chomsky normal form. Technical Report DSIC-II/13/00, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, 2000.
4. S.F. Chen. Bayesian Grammar Induction for Language Modeling. Ph.D. dissertation, Harvard University, 1996.
5. F. Jelinek and J.D. Lafferty. Computation of the probability of initial substring generation by stochastic context-free grammars. Computational Linguistics, 17(3):315–323, 1991.
6. K. Lari and S.J. Young. The estimation of stochastic context-free grammars using the inside-outside algorithm. Computer, Speech and Language, 4:35–56, 1990.
7. E. Mäkinen. On the structural grammatical inference problem for some classes of context-free grammars. Information Processing Letters, 42:1–5, April 1992.
8. M.P. Marcus, B. Santorini, and M.A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.
9. H. Ney. Stochastic grammars and pattern recognition. In P. Laface and R. De Mori, editors, Speech Recognition and Understanding. Recent Advances, pages 319–344. Springer-Verlag, 1992.
10. F. Pereira and Y. Schabes. Inside-outside reestimation from partially bracketed corpora. In Proceedings of the 30th Annual Meeting of the Association for Computational Linguistics, pages 128–135. University of Delaware, 1992.
11. R. Rosenfeld. The CMU statistical language modeling toolkit and its use in the 1994 ARPA CSR evaluation. In ARPA Spoken Language Technology Workshop, Austin, Texas, USA, 1995.
12. Y. Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97:23–60, 1992.
13. J.A. Sánchez and J.M. Benedí. Consistency of stochastic context-free grammars from probabilistic estimation based on growth transformation. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(9):1052–1055, 1997.
14. J.A. Sánchez and J.M. Benedí. Estimation of the probability distributions of stochastic context-free grammars from the k-best derivations. In 5th International Conference on Spoken Language Processing, pages 2495–2498, Sydney, Australia, 1998.
15. J.A. Sánchez and J.M. Benedí. Learning of stochastic context-free grammars by means of estimation algorithms. In Proc. EUROSPEECH'99, volume 4, pages 1799–1802, Budapest, Hungary, 1999.
16. J.A. Sánchez, J.M. Benedí, and F. Casacuberta. Comparison between the inside-outside algorithm and the Viterbi algorithm for stochastic context-free grammars. In P. Perner, P. Wang, and A. Rosenfeld, editors, Advances in Structural and Syntactical Pattern Recognition, pages 50–59. Springer-Verlag, 1996.
On the Relationship between Models for Learning in Helpful Environments

Rajesh Parekh¹ and Vasant Honavar²

¹ Blue Martini Software, 2600 Campus Drive, San Mateo, CA 94403, USA
[email protected]
² Department of Computer Science, Iowa State University, Ames, IA 50011, USA
[email protected]
http://www.cs.iastate.edu/~honavar/aigroup.html
Abstract. The PAC and other equivalent learning models are widely accepted models for polynomial learnability of concept classes. However, negative results abound in the PAC learning framework (concept classes such as deterministic finite state automata (DFA) are not efficiently learnable in the PAC model). The PAC model’s requirement of learnability under all conceivable distributions could be considered too stringent a restriction for practical applications. Several models for learning in more helpful environments have been proposed in the literature including: learning from example based queries [2], online learning allowing a bounded number of mistakes [14], learning with the help of teaching sets [7], learning from characteristic sets [5], and learning from simple examples [12,4]. Several concept classes that are not learnable in the standard PAC model have been shown to be learnable in these models. In this paper we identify the relationships between these different learning models. We also address the issue of unnatural collusion between the teacher and the learner that can potentially trivialize the task of learning in helpful environments. Keywords: Models of learning, Query learning, Mistake bounded learning, PAC learning, teaching sets, characteristic samples, DFA learning.
1 Introduction
Valiant's PAC learning model [20] provided a framework for extensive research on the computational complexity of various learning tasks. A concept class is said to be polynomially learnable if there exists an algorithm that can find a hypothesis approximating any concept in the class, when given a polynomial number of labeled examples and polynomially bounded computational resources. Further, the algorithm is expected to run in time that is polynomial in the parameters measuring the complexity of the target concept, size of the input
to the algorithm, and the accuracy of the resulting approximation. The specific assumptions and criteria used to define polynomial learnability have led to several variations of the basic PAC model. A unifying framework for proving the equivalence of these different models was presented in [8]. Despite the PAC model’s acceptance as a standard model of polynomial learning, several negative results about PAC learning have been proven (for instance, even elementary concept classes such as DFA cannot be efficiently PAC learned [19,11]). Perhaps, the main reason for these negative results is the model’s requirement that the concept class must be learnable under any arbitrary (but fixed) probability distribution. It is conceivable that most practical learning scenarios do not place such stringent restrictions on the learnability of concept classes. On the contrary, practical learning scenarios feature helpful learning environments (for example, a knowledgeable teacher might guide the learner by answering queries or by carefully selecting training examples that would enable the learner to learn quickly and efficiently). Several models for learning in helpful environments have been proposed in the literature. These include: learning from example based queries [2,7], online learning allowing a bounded number of mistakes [14]1 , learning with the help of teaching sets [7], learning from characteristic sets [5], and learning from simple examples [12,4]. A variety of concept classes whose learnability in the standard PAC model is unknown are shown to be learnable in the above models (see section 2 for the results on learning DFA). In this paper, we study the relationships between these different models. Some of these relationships have been identified by earlier research whereas others are new. Fig. 1 gives a schematic representation of the relationships. The rest of this paper is organized as follows: Section 2 provides an overview of the different learning models. Section 3 proves the relationships outlined in Fig. 1. Section 4 addresses the issue of collusion in the models for learning in helpful environments. Section 5 concludes with a summary and some directions for future research.
Fig. 1. Relationship between different learning models. (The diagram relates learning from example based queries, mistake bounded learning with access to membership queries, semi-polynomial T/L teachability, polynomial identifiability from characteristic samples, PACS learning, and simple-PAC learning; the attributions shown are Goldman & Mathias, de la Higuera, and Castro & Guijarro.)
¹ We consider a variant of the mistake bounded learning model where the learner has access to a membership oracle or a teacher who answers membership queries.
2 Models for Learning in Helpful Environments
2.1 Preliminaries
Let Σ denote a finite alphabet. If n ≥ 1 denotes the number of attributes then the set Σⁿ is referred to as the sample space X. If the learning domain involves examples of varying lengths then the sample space is denoted as X = ⋃_{n≥1} Σⁿ.
A concept class C is defined as C ⊆ 2^X. An individual concept c ∈ C is thus a subset of X. A concept is usually associated with a classification function c : X → {0, 1} such that c(x) = 1 if an example x belongs to the concept and c(x) = 0 otherwise. The tuple (x, c(x)) represents a labeled example of c. If S is a set of labeled examples then ||S|| denotes the size of S (i.e., the sum of the lengths of the individual examples in S). A representation R assigns a name to each concept in C and is defined as a function R : C → {0, 1}*. Let r = R(c) be the representation of a concept c. |r| (the length of the string r) denotes the size of the concept c.

Let D be an arbitrary (but fixed) probability distribution defined over X. A concept class C is said to be probably approximately correctly (PAC) learnable if there exists a (possibly randomized) algorithm A such that on input of any parameters ε and δ, for any concept c ∈ C with corresponding representation r, and for any probability distribution D over X, if A draws a set S of labeled examples of c, then A produces an approximation ĉ of c such that with probability ≥ 1 − δ, Pr_D({x | x ∈ X and c(x) ≠ ĉ(x)}) ≤ ε. The run time of A is required to be polynomial in 1/ε, 1/δ, |r|, and ||S||. If the algorithm A is such that any concept in C is learned exactly, i.e., with probability ≥ 1 − δ, Pr_D({x | x ∈ X and c(x) ≠ ĉ(x)}) = 0, then C is said to be probably exactly learnable².

Kolmogorov complexity is a machine-independent notion of simplicity of objects. Objects that have regularity in their structure (i.e., objects that can be easily compressed) have low Kolmogorov complexity. For any string α ∈ {0, 1}*, the prefix Kolmogorov complexity of α relative to a Turing machine φ is defined as K_φ(α) = min{|π| : φ(π) = α}, where π ∈ {0, 1}* is a program input to the Turing machine. The Optimality Theorem for Kolmogorov complexity guarantees that for any prefix Turing machine φ there exists a constant c_φ such that for any string α, K_ψ(α) ≤ K_φ(α) + c_φ, where ψ is the universal Turing machine. Further, by the Invariance Theorem it can be shown that for any two universal Turing machines ψ₁ and ψ₂ there is a constant η ∈ N (where N is the set of natural numbers) such that for all strings α, |K_ψ₁(α) − K_ψ₂(α)| ≤ η. Thus, fixing a single universal Turing machine U we denote K(α) = K_U(α). The Kolmogorov complexity of a string is bounded by its length, i.e., K(α) ≤ |α| + K(|α|) + ζ, where ζ is a constant independent of α. The conditional Kolmogorov complexity of any string α given β is defined as K_φ(α | β) = min{|π| : φ(⟨π, β⟩) = α}, where
² Note that in this case A takes in only δ as a parameter and is expected to run in time polynomial in 1/δ, |r|, and ||S||.
π ∈ {0, 1}* is a program and ⟨·, ·⟩ is a standard pairing function. Fixing a single universal Turing machine U we denote the conditional Kolmogorov complexity by K(α|β) = K_U(α|β).

The Solomonoff–Levin universal distribution m is a universal enumerable probability distribution in that it multiplicatively dominates all enumerable probability distributions. Formally, ∀i ∈ N⁺ ∃c > 0 ∀x ∈ N [c·m(x) ≥ P_i(x)], where P₁, P₂, . . . is an enumeration of all enumerable probability distributions and N is the set of natural numbers. It can be shown that m(x) = 2^(−K(x)+O(1)). Thus, under m, simple objects (or objects with low Kolmogorov complexity) have a high probability, and complex or random objects have a low probability. Given a string r ∈ Σ*, the universal distribution conditional on the knowledge of r, m_r, is defined as m_r(α) = 2^(−K(α|r)+O(1)) [4]. Further, ∀r ∈ Σ*, Σ_α m_r(α) < 1. The interested reader is referred to [13] for a thorough treatment of Kolmogorov complexity, the universal distribution, and related topics.
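Kolmogorov complexity is uncomputable, so K cannot be evaluated directly; when one wants to experiment with "simplicity"-based distributions in practice, an off-the-shelf compressor is sometimes used as a crude upper bound on K. The following sketch is only an illustration of that generic idea and is not part of the models discussed in this paper; the conditional variant shown is a heuristic stand-in, not a true conditional Kolmogorov complexity.

    import zlib

    def compressed_length(s: bytes) -> int:
        """Crude upper bound on K(s): length in bits of the zlib-compressed string."""
        return 8 * len(zlib.compress(s, 9))

    def conditional_compressed_length(alpha: bytes, r: bytes) -> int:
        """Rough stand-in for K(alpha | r): extra bits needed to encode alpha
        once r is already available (heuristic, for illustration only)."""
        return max(0, 8 * len(zlib.compress(r + alpha, 9)) - 8 * len(zlib.compress(r, 9)))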
2.2 Learning from Example Based Queries
A variety of concept classes are known to be learnable in deterministic polynomial time when the learner is allowed access to a teacher (or an oracle) that answers example based queries [2]. Example based queries include equivalence, membership, subset, superset, disjointedness, exhaustive, justifying assignment, and partial equivalence queries. A membership query is of the form "does x ∈ c?", where x ∈ X is an example and c ∈ C is the target concept. The teacher's response is yes or no depending on whether c(x) = 1 or not. For all other types of queries the input is the learner's hypothesis ĉ and the teacher's response is either a yes or a counterexample x ∈ X. Thus, an equivalence query is of the form "∀x ∈ X, is c(x) = ĉ(x)?". The teacher's response is either yes or an example x such that c(x) ≠ ĉ(x).

Definition 1. (Due to Goldman and Mathias [6]) An example based query is any query of the form ∀(x₁, x₂, . . . , x_k) ∈ X^k, does φ_r(x₁, x₂, . . . , x_k) = 1?, where r is the target concept and k is a constant. φ may use the instances (x₁, . . . , x_k) to compute additional instances on which to perform membership queries. The teacher's response to example based queries is either yes or a counterexample consisting of (x₁, x₂, . . . , x_k) ∈ X^k (along with the correct classification corresponding to each of the x_i's) for which φ_r(x₁, x₂, . . . , x_k) = 0, together with the labeled examples for which membership queries were made in order to evaluate φ_r.

Definition 2. A concept class C is said to be polynomially learnable from example based queries iff there exist polynomials p₁() and p₂(), and an algorithm A, such that for any concept c ∈ C with representation r, A returns a representation r̂ of a concept ĉ that is equivalent to c when it is allowed to pose a number of example based queries bounded by p₁(|r|) and see a set of examples S (including counterexamples returned by the example based queries) of size at most p₂(|r|).
The L* algorithm is a method for exactly learning DFA (in polynomial time) from membership and equivalence queries [1].
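To make the query interface concrete, the sketch below implements a toy teacher for a fixed target DFA that answers membership and equivalence queries, the two query types used by L*. The DFA encoding (a transition dictionary and a set of accepting states) and the bounded counterexample search are assumptions made for this illustration; this is not the L* algorithm itself.

    from itertools import product

    class DFATeacher:
        """Answers membership and equivalence queries for a fixed target DFA."""

        def __init__(self, alphabet, transitions, start, accepting, max_len=8):
            self.alphabet = alphabet            # e.g. ['a', 'b']
            self.transitions = transitions      # dict: (state, symbol) -> state
            self.start = start
            self.accepting = accepting          # set of accepting states
            self.max_len = max_len              # search bound for counterexamples

        def member(self, word):
            """Membership query: is the word in the target language?"""
            state = self.start
            for symbol in word:
                state = self.transitions[(state, symbol)]
            return state in self.accepting

        def equivalent(self, hypothesis_member):
            """Equivalence query against a hypothesis given as a membership function.
            Returns (True, None) or (False, counterexample).  The bounded search is a
            simplification; a real teacher would compare the two automata exactly."""
            for n in range(self.max_len + 1):
                for word in product(self.alphabet, repeat=n):
                    w = ''.join(word)
                    if self.member(w) != hypothesis_member(w):
                        return False, w
            return True, None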
2.3 Mistake Bounded Learning
Littlestone's mistake bounded learning model deals with the online learning scenario. Instead of presenting the learner with a set of labeled examples, the online model presents one example at a time and asks the learner to predict the class of each example it receives. After making this prediction, the learner is told whether its prediction was correct. The learner uses this information to improve its hypothesis. The mistake bounded model considers bounding the number of (explicit) prediction errors made by the learner in the worst case while learning (to predict) a target concept. Several concept classes are known to be learnable with the help of membership queries in addition to other example based queries (such as equivalence queries, subset queries, etc.). We consider an augmented mistake bounded learning model where the learner has access to a teacher who answers membership queries.

Definition 3. A concept class C is polynomially learnable in the augmented mistake bounded model iff there exist polynomials p₁() and p₂(), a teacher T capable of answering membership queries, and an online learning algorithm A such that for any concept c with representation r, A learns a representation r̂ of a concept ĉ that is equivalent to c when it is allowed to make at most p₁(|r|) prediction errors on the sequence of examples it sees and pose (if required) at most p₂(|r|) membership queries.

DFA are known to be exactly learnable in this augmented mistake bounded model for online learning with membership queries (see [18] for a description of the Incremental ID algorithm). We show the relationship of this augmented mistake bounded model to the model of learning from example based queries. Note that this result subsumes Littlestone's result depicting the relationship between mistake bounded learning and learning by posing a bounded number of equivalence queries [14].
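The online protocol of Definition 3 can be pictured as the driver loop below: the learner predicts a label for each incoming example, is told the true label, and may call a membership oracle while updating its hypothesis. The learner interface (predict/update plus a membership callable) is a hypothetical one chosen for this sketch.

    def run_mistake_bounded(learner, teacher_member, stream):
        """Drive an online learner over a stream of (example, true_label) pairs.

        learner must provide predict(x) and update(x, true_label, member_query),
        where member_query is a callable the learner may use to pose membership
        queries (hypothetical interface, assumed for this sketch).
        Returns the number of prediction mistakes made."""
        mistakes = 0
        for x, true_label in stream:
            if learner.predict(x) != true_label:
                mistakes += 1
            learner.update(x, true_label, teacher_member)
        return mistakes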
2.4 Learning from Teaching and Characteristic Sets
Goldman and Mathias have developed a teaching model for efficient learning of target concepts [7]. Their model takes into account the quantity of information that a good teacher must provide to the learner. An additional player called the adversary is introduced in this model to ensure that there is no unnatural collusion whereby the teacher directly gives the learner an encoding of the target concept. Definition 4. (Due to de la Higuera [9]) A concept class C is semi-polynomially T/L teachable iff there exist polynomials p1 () and p2 (), a teacher T , and a learner L, such that for any adversary ADV
and any concept c with representation r that is selected by ADV, after the following teaching session the learner returns the representation r̂ of a concept ĉ that is equivalent to c:
• ADV gives r to T.
• T computes a teaching set S of size at most p₁(|r|).
• ADV adds correctly labeled examples to this set.
• The learner uses the augmented set S and outputs r̂ in time p₂(||S||).

In this model, a concept class for which the computation of both the teacher and the learner takes polynomial time and the learner always learns the target concept is called polynomially T/L teachable. Without the restrictive assumption that the teacher's computations be performed in polynomial time, the concept class is said to be semi-polynomially T/L teachable.

While studying the identification of languages in the limit, Gold proposed a model for learning from given data [5]. In this model, the learner, when presented with a set of examples S, must return a representation of a concept consistent with S. Further, the model postulates that there exists a characteristic set of examples for each language such that the learning algorithm, upon seeing the characteristic set, must output a representation equivalent to that of the target concept. This condition should be monotonic in that even if correctly labeled examples are added to the characteristic set, the algorithm would still infer the same language. This leads to the model for polynomial identifiability of concept classes from characteristic sets. It is based on the availability of a polynomial sized characteristic set for any concept in the concept class and an algorithm which, when given a superset of a characteristic set, is guaranteed to return, in polynomial time, a representation of the target concept.

Definition 5. (Due to de la Higuera [9]) A concept class C is polynomially identifiable from characteristic sets iff there exist two polynomials p₁() and p₂() and an algorithm A such that:
• Given any set S of labeled examples, A returns in time p₁(||S||) a representation r of a concept c ∈ C such that c is consistent with S.
• For every concept c ∈ C with corresponding representation r there exists a characteristic set S_c such that ||S_c|| = p₂(|r|) and if A is provided with a set S ⊇ S_c then A returns a representation r̂ of a concept ĉ that is equivalent to c.

The framework of the RPNI algorithm for learning DFA identifies a practical notion of a teaching or characteristic set of a DFA and demonstrates how DFA can be exactly learned (in polynomial time) from a training set that includes a characteristic set of the target DFA as a subset [15].
2.5 Learning from Simple Examples
The standard PAC model's requirement of learnability under all conceivable distributions is often considered too stringent for practical learning scenarios. Li and Vitányi have proposed a simple-PAC learning model for efficiently learning simple concepts. A concept class is said to be simple-PAC learnable if it is PAC
learnable under the class of simple distributions [12]. A distribution is simple if it is multiplicatively dominated by some enumerable distribution. The class of simple distributions includes a variety of distributions (such as all computable distributions). Further, the simple distribution independent learning theorem says that a concept class is learnable under the universal distribution m iff it is learnable under the entire class of simple distributions, provided the examples are drawn according to the universal distribution [12]. Thus, the simple-PAC learning model is sufficiently general. Concept classes such as log n-term DNF and simple k-reversible DFA are learnable under the simple-PAC model whereas their PAC learnability in the standard sense is unknown [12].

Denis et al proposed a learning model (called the PACS model) where examples are drawn at random according to the universal distribution conditional on the knowledge of the target concept [4]. Under this model, examples with low conditional Kolmogorov complexity given a representation r of the target concept are called simple examples. Specifically, for a concept with representation r, the set S^r_sim = {α | K(α|r) ≤ µ lg(|r|)} (where µ is a constant) is the set of simple examples for that concept. Further, S^r_sim,rep is used to denote the set of simple and representative examples of r. The PACS model restricts the underlying distribution to m_r (where m_r(α) = 2^(−K(α|r)+O(1))). The learnability of logarithmic Kolmogorov complexity DFA in the simple-PAC model and that of the entire class of DFA in the PACS model are shown in [16,17].
3 Relationships between the Learning Models
In this section we show the relationships between the different models for learning in helpful environments (see Fig. 1).

Theorem 1. A concept class C is learnable in deterministic polynomial time using example-based queries iff it is learnable in the augmented mistake bounded framework with a polynomial mistake bound.

Proof: We prove this result by showing that an algorithm using example based queries can be simulated using a mistake bounded learning algorithm and vice-versa. A similar strategy was used to show the equivalence of the mistake bounded learning model with the model for learning by posing a bounded number of equivalence queries [14].

Let A be an algorithm for learning C from example based queries. We derive a mistake bounded learning algorithm B as follows:

Algorithm B
1. simulate A until it outputs a hypothesis ĉ_0 as its query
2. use ĉ_0 as the initial hypothesis
   let i = 0
3. for each observed example x do
       predict ĉ_i(x)
       if ĉ_i(x) ≠ c(x)  (where c(x) is the correct classification of x) then
           return x as a counterexample in response to A's query
           let the next query output by A be the updated hypothesis ĉ_{i+1}
           let i = i + 1
       end if
   end for
Note that A might make use of additional membership queries to assist in the computation of its hypotheses ĉ_i. The number of membership queries and other example based queries posed by A is polynomially bounded (by the definition of polynomial learning from example based queries). The number of mistakes made by the algorithm B is thus polynomially bounded.

Let B be a mistake bounded learning algorithm for C (i.e., B makes at most a polynomial number of mistakes and possibly uses a polynomial number of membership queries to learn any concept c ∈ C). We derive an algorithm A for learning C from example based queries as follows:

Algorithm A
1. let i = 0
   let ĉ_0 be the initial hypothesis of B
2. repeat
       use ĉ_i to pose an example based query
       if the teacher's response is yes then
           output ĉ_i and halt
       else
           present the counterexample x to B
           B predicts ĉ_i(x) (which is ≠ c(x) since x is a counterexample)
           give c(x) to B
           let ĉ_{i+1} be the next hypothesis of B
           let i = i + 1
       end if
   until eternity
Note that B may pose a polynomial number of membership queries during the computation of its hypotheses ĉ_i. Further, since B makes a polynomial number of mistakes it is clear that A poses at most a polynomial number of example based queries. This proves the theorem. □

Theorem 2. (Due to Goldman and Mathias [7]) Any concept class C learnable in deterministic polynomial time using example-based queries is semi-polynomially T/L teachable.

Proof: (The result is proved by showing how a teaching set is constructed by simulating the query based learning algorithm. The teaching set captures all the counterexamples and the additional instances, if any, that are generated during the evaluation of example based queries. The learner then simulates the
execution of the query based algorithm. However, instead of posing queries to a teacher, the learner evaluates the responses to example based queries using the labeled instances that appear in the teaching set.) □

Theorem 3. (Due to de la Higuera [9]) A concept class C is semi-polynomially T/L teachable iff it is polynomially identifiable from characteristic sets.

Proof: (This result is proved by identifying the characteristic set with the teaching set.) □

Lemma 1. Let c ∈ C be a concept with corresponding representation r. If there exists a characteristic set S_c for c and a polynomial p₁() such that S_c can be computed from r and ||S_c|| = p₁(|r|), then each example in S_c is simple in the sense that ∀α ∈ S_c, K(α|r) ≤ µ lg(|r|), where µ is a constant.

Proof: Fix an ordering of the elements of S_c and define an index to identify the individual elements. Since ||S_c|| = p₁(|r|), an index that is O(lg(p₁(|r|))) = O(lg(|r|)) = µ lg(|r|) bits long is sufficient to uniquely identify each element of S_c³. Since S_c can be computed from r we can construct a Turing machine that, given r, reads as input an index of length µ lg(|r|) and outputs the corresponding string of S_c. Thus, ∀α ∈ S_c, K(α|r) ≤ µ lg(|r|), where µ is a constant independent of α. □

Lemma 2. (Due to Denis et al [4]) Suppose that a sample S is drawn according to m_r. For an integer l ≥ |r| and 0 < δ ≤ 1, if |S| ≥ l^µ (ln(2) + ln(l^µ) + ln(1/δ)) then with probability greater than 1 − δ, S^r_sim ⊆ S.

Proof:
Claim 1: ∀α ∈ S^r_sim, m_r(α) ≥ l^(−µ):
    m_r(α) ≥ 2^(−K(α|r)) ≥ 2^(−µ lg|r|) ≥ |r|^(−µ) ≥ l^(−µ)

Claim 2: |S^r_sim| ≤ 2 l^µ:
    |S^r_sim| ≤ |{α ∈ {0,1}* | K(α|r) ≤ µ lg(|r|)}|
             ≤ |{α ∈ {0,1}* | K(α|r) ≤ µ lg(l)}|
             ≤ |{β ∈ {0,1}* | |β| ≤ µ lg(l)}|
             ≤ 2^(µ lg(l)+1)
             ≤ 2 l^µ

³ Note that if the sum of the lengths of the examples belonging to a set is k then clearly, the number of examples in that set is at most k + 1.
Claim 3: If |S| ≥ l^µ (ln(2) + ln(l^µ) + ln(1/δ)) then Pr(S^r_sim ⊆ S) ≥ 1 − δ:
    Pr(α ∈ S^r_sim is not sampled in one random draw) ≤ (1 − l^(−µ))            (by Claim 1)
    Pr(α ∈ S^r_sim is not sampled in |S| random draws) ≤ (1 − l^(−µ))^|S|
    Pr(some α ∈ S^r_sim is not sampled in |S| random draws) ≤ 2 l^µ (1 − l^(−µ))^|S|    (by Claim 2)
    Pr(S^r_sim ⊄ S) ≤ 2 l^µ (1 − l^(−µ))^|S|

We would like this probability to be less than δ:
    2 l^µ (1 − l^(−µ))^|S| ≤ δ
    2 l^µ (e^(−l^(−µ)))^|S| ≤ δ,   since 1 − x ≤ e^(−x) if x ≥ 0
    ln(2) + ln(l^µ) − |S| l^(−µ) ≤ ln(δ)
    |S| ≥ l^µ (ln(2) + ln(l^µ) + ln(1/δ))

Thus, Pr(S^r_sim ⊆ S) ≥ 1 − δ. □
Corollary 1. Suppose that a sample S is drawn according to m_r. For an integer l ≥ |r| and 0 < δ ≤ 1, if |S| ≥ l^µ (ln(2) + ln(l^µ) + ln(1/δ)) then with probability greater than 1 − δ, S^r_sim,rep ⊆ S.

Proof: Follows from Lemma 2 since S^r_sim,rep ⊆ S^r_sim. □

Theorem 4. Any concept class that is semi-polynomially T/L teachable (or equivalently polynomially identifiable from characteristic sets) is probably exactly learnable in the PACS model.

Proof: Lemma 1 shows that if there exists a polynomial sized teaching (characteristic) set S_c of examples for a concept c then the individual examples belonging to the teaching set are simple (in that they have logarithmic Kolmogorov complexity). Lemma 2 shows that a polynomial sized sample S drawn according to the universal distribution m_r is sufficient to include all simple examples with a high probability. Further, Corollary 1 shows that with high probability S_c ⊆ S (we equate S_c with S^r_sim,rep). Since the concept class C is semi-polynomially T/L teachable, there exists an algorithm A that in polynomial time exactly learns any concept c ∈ C from any set of examples that includes S_c as a subset. The PACS learning algorithm can be formulated as follows. Draw a polynomial sized sample S according to m_r and use it as the training set for algorithm A. Thus, C is probably exactly learnable in the PACS model. □

Theorem 5. (Due to Castro and Guijarro [3]) If a concept class C is learnable in the PACS model then the concept class log K(C) = {c ∈ C | R(c) = r and K(r) ≤ κ lg(|r|), where κ is a constant} (i.e., the set of concepts whose corresponding representations have logarithmic Kolmogorov complexity) is learnable in the simple-PAC model.

Proof: (This result is proved by showing the relationship between the universal distributions m and m_r.) □
4 Collusion and Learning in Helpful Environments
Learning models that involve interaction between a knowledgeable teacher (an oracle) and a learner are vulnerable to unnatural collusion wherein the teacher passes information about the representation of the target concept as part of the training set [10,7]. The teacher and learner can agree a priori on some suitable binary encoding of concepts. The teacher can then pass the representation of the target concept r to the learner as a suitably labeled example. In the event that the target concept cannot be suitably encoded as a single labeled example, the teacher can break the representation r into smaller groups and pass these groups as appropriately labeled examples to the learner. For example, an encoding of the target concept could be passed via the counterexamples (in the case of learning from example based queries) or via the first few examples of a teaching set (in the case of learning from teaching sets or characteristic samples). The learner can thus quickly discover the target concept without even considering the labels of the training examples!

The teaching model due to Jackson and Tomkins [10] prevents this coding of the target concept by requiring that the learner must still succeed if the teacher is replaced by an adversary (who does not code the target concept as the teacher above). Further, they argue that in their model the learner can stop only when it is convinced that there is only one concept consistent with the information received from the teacher, i.e., the teacher does not tell the learner when to stop. Otherwise learning would be trivialized in that the teacher passes groups of n bits to the learner (as training examples) and, when a sufficient number of bits have been passed to the learner to reconstruct the representation r of the target concept, the teacher tells the learner to stop. Goldman and Mathias' work on polynomial teachability [7] shows that an adversary whose task is to embed the training set (also called the teaching set) provided by the teacher into a larger set of correctly labeled examples is sufficient to prevent this type of collusion.

Another (perhaps more subtle) form of collusion is possible in the models for learning in helpful environments. For simplicity let us assume that the target representation can be encoded using a single labeled example. Consider the polynomial teachability model. The adversary augments the teaching set with correctly labeled examples. Assuming that the augmented set is suitably shuffled, the learner cannot directly identify the target without even considering the class labels. However, the learner can decode each of the labeled examples in a fixed order (say lexicographic order). For each example that represents a valid concept (in C), the learner checks whether the decoded concept is consistent with the teaching set and outputs the first concept that passes this consistency test. Note that a suitably formulated teaching set can ensure that one and only one concept is consistent with it. Here, the learner is provided with an encoding of the target concept but must perform some computation (the consistency check) in order to identify the target. However, this method of identifying the target concept is potentially easier and thus more attractive than the typical algorithms that learn from a given teaching set (for instance the RPNI algorithm for learning DFA [15]).
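The decode-and-check trick just described can be summarized in a few lines: the colluding learner tries to decode every labeled example as a concept representation and returns the first decoded concept consistent with the whole teaching set. The decode and consistency functions are placeholders assumed for this sketch.

    def collusive_learner(teaching_set, decode, is_consistent):
        """teaching_set: list of (example, label) pairs.
        decode(example) -> a concept representation, or None if the example does
            not encode a valid concept (placeholder for an agreed-upon encoding).
        is_consistent(concept, teaching_set) -> bool (the only real computation)."""
        for example, _ in sorted(teaching_set):     # e.g. lexicographic order
            concept = decode(example)
            if concept is not None and is_consistent(concept, teaching_set):
                return concept
        return None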
The other models for learning in helpful environments are also vulnerable to this form of collusion. In the PACS learning model the target concept r is itself a simple example (since K(r|r) is very small). Thus, r has a very high probability of being drawn under m_r. By using the decoding and consistency-check trick illustrated above, the learner can efficiently identify the target. Counterexamples can be used to formulate a collusive learning scheme in the model for learning from example based queries. Here the teacher can encode the target concept in the counterexample it provides to the learner. The learner can attempt to decode the counterexample. If the counterexample does not represent a valid concept then the learning algorithm continues its normal execution. However, if the counterexample represents a valid concept then the learner can pose an equivalence query to determine whether the decoded concept is the target concept. If the teacher replies yes then the learner outputs the target and halts. Otherwise it takes the counterexample and repeats the above process. This method is potentially more efficient in terms of computation time. From Theorem 1 we know that if there exists a deterministic polynomial time algorithm for learning a concept class using example based queries then it is easy to construct an algorithm for learning the concept class in the augmented mistake bounded learning framework. Thus, the collusive learning strategy for learning from example based queries can be used to design a strategy for learning in the augmented mistake bounded learning framework.

It is clear that the frameworks for learning in helpful environments admit unnatural collusion. Any learnability result within models that admit collusion can be criticized on the grounds that the learning algorithm might be collusive. One method of avoiding collusive learning is to tighten the learning framework suitably. Collusion cannot take place if the representation of the target concept cannot be directly encoded as part of the training set or if the learner cannot efficiently decode the training examples and identify the one that is consistent with the training set. In the event that the learning framework cannot be suitably tightened to avoid collusion, one might provide a learning algorithm that does not rely on collusion between the teacher and the learner. For instance, the L* algorithm for learning DFA from membership and equivalence queries [1], the IID algorithm for incremental learning of DFA using membership queries [18], and the RPNI algorithm for learning DFA from characteristic samples [15] are examples of non-collusive algorithms in learning frameworks that admit collusion. Obtaining a general answer to the question of collusion in learning would require the development of much more precise definitions of collusion and collusion-free learning than are currently available. A detailed exploration of these issues is clearly of interest.
5 Summary
We have presented above the inter-relationships between different models for learning in helpful environments. The PACS model for learning from simple examples naturally extends the results obtained for the deterministic learning
models (example based queries, mistake bounded learning, polynomial teachability, and polynomial identifiability from characteristic sets) to a probabilistic learning framework. This work opens up several interesting questions that remain to be answered. For instance, does PACS learnability imply learnability from example based queries or polynomial teachability? Or does there exist a concept class that is PACS learnable but is not learnable from example based queries or is not polynomially teachable? Similarly, does polynomial teachability imply learnability in the mistake bounded model? We have also addressed the important issue of collusion as it relates to the models for learning in helpful environments. We have shown how the models studied in this paper admit multiple learning algorithms including some seemingly collusive ones. Additional research is required to suitably address the issues of collusion and collusion-free learning. Acknowledgements. This work was supported in part by grants from the National Science Foundation (9409580, 9982341) to Vasant Honavar. Rajesh Parekh would like to thank the Allstate Research and Planning Center for the support he has received while conducting this research.
References

1. D. Angluin. Learning regular sets from queries and counterexamples. Information and Computation, 75:87–106, 1987.
2. D. Angluin. Queries and concept learning. Machine Learning, 2(4):319–342, 1988.
3. J. Castro and D. Guijarro. Query, PACS and simple-PAC learning. Technical Report LSI-98-2-R, Universitat Politècnica de Catalunya, Spain, 1998.
4. F. Denis, C. D'Halluin, and R. Gilleron. PAC learning with simple examples. STACS'96 – Proceedings of the 13th Annual Symposium on the Theoretical Aspects of Computer Science, pages 231–242, 1996.
5. E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37(3):302–320, 1978.
6. S. Goldman and H. Mathias. Teaching a smarter learner. In Proceedings of the Workshop on Computational Learning Theory (COLT'93), pages 67–76. ACM Press, 1993.
7. S. Goldman and H. Mathias. Teaching a smarter learner. Journal of Computer and System Sciences, 52:255–267, 1996.
8. D. Haussler, M. Kearns, N. Littlestone, and M. Warmuth. Equivalence of models for polynomial learnability. Information and Computation, 95:129–161, 1991.
9. Colin de la Higuera. Characteristic sets for polynomial grammatical inference. In L. Miclet and C. de la Higuera, editors, Proceedings of the Third ICGI-96, Lecture Notes in Artificial Intelligence 1147, pages 59–71, Montpellier, France, 1996.
10. J. Jackson and A. Tomkins. A computational model of teaching. In Proceedings of the Workshop on Computational Learning Theory (COLT'92), pages 319–326. ACM Press, 1992.
11. M. Kearns and L. G. Valiant. Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing, pages 433–444, New York, 1989.
12. M. Li and P. Vitányi. Learning simple concepts under simple distributions. SIAM Journal of Computing, 20(5):911–935, 1991.
13. M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity and its Applications, 2nd edition. Springer Verlag, New York, 1997.
14. N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
15. J. Oncina and P. García. Inferring regular languages in polynomial update time. In N. Pérez et al., editors, Pattern Recognition and Image Analysis, pages 49–61. World Scientific, 1992.
16. R. Parekh and V. Honavar. Simple DFA are polynomially probably exactly learnable from simple examples. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML'99), pages 298–306, Bled, Slovenia, 1999.
17. R. G. Parekh and V. G. Honavar. Learning DFA from simple examples. In Proceedings of the Eighth International Workshop on Algorithmic Learning Theory (ALT'97), Lecture Notes in Artificial Intelligence 1316, pages 116–131, Sendai, Japan, 1997. Also presented at the Workshop on Grammar Inference, Automata Induction, and Language Acquisition (ICML'97), Nashville, TN, July 12, 1997.
18. R. G. Parekh, C. Nichitiu, and V. G. Honavar. A polynomial time incremental algorithm for regular grammar inference. In V. Honavar and G. Slutzki, editors, Proceedings of the Fourth ICGI-98, Lecture Notes in Artificial Intelligence 1433, pages 37–49, Ames, IA, 1998.
19. L. Pitt and M. K. Warmuth. Reductions among prediction problems: on the difficulty of predicting automata. In Proceedings of the 3rd IEEE Conference on Structure in Complexity Theory, pages 60–69, 1988.
20. L. Valiant. A theory of the learnable. Communications of the ACM, 27:1134–1142, 1984.
Probabilistic k-Testable Tree Languages

Juan Ramón Rico-Juan, Jorge Calera-Rubio, and Rafael C. Carrasco*

Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant (Spain)
{juanra, calera, carrasco}@dlsi.ua.es
Abstract. In this paper, we present a natural generalization of k-gram models for stochastic tree languages based on the k-testable class. In this class of models, frequencies are estimated for a probabilistic regular tree grammar which is bottom-up deterministic. One of the advantages of this approach is that the model can be updated in an incremental fashion. This method is an alternative to costly learning algorithms (such as inside-outside-based methods) or algorithms that require larger samples (such as many state merging/splitting methods).
1 Introduction
Stochastic models based on k-grams have been widely used in natural language modeling [BPd+92,NEK95], speech recognition [Jel98] and data compression [Rub76]. Indeed, any stochastic model can be used to predict the next symbol in a sequence and, therefore, such models are a suitable component in arithmetic data compression [CT91] algorithms. In classification problems, the need for stochastic models often arises when the Bayes' decision rule for minimum error rate is applied: given a sequence S = s₁s₂. . . of observations, the stochastic model M that maximizes P(M|S) also maximizes P(S|M)P(M). Therefore, a model P(S|M) for the generation of sequences is needed. If the stochastic model is based on conditional probabilities, that is, P(S = s₁s₂. . .s_t|M) = p_M(s₁) p_M(s₂|s₁) · · · p_M(s_t|s₁s₂. . .s_{t−1}), and their dependence is assumed to be restricted to the immediately preceding context (in particular, the last k − 1 words: p_M(s_t|s₁. . .s_{t−1}) = p_M(s_t|s_{t−k+1}. . .s_{t−1})), the resulting Markov chain model [Chu67] is known as a k-gram model.

From a theoretical point of view, k-gram models can be regarded as the extension of locally testable languages [GV90,Yok95] when probabilities are incorporated into the model. Informally, a string language L is locally testable if every string w can be recognized as a string in L just by looking at all the substrings in w of length at most k. Previous work [Knu93,Gar93] has generalized to tree languages the identification algorithms for locally testable string languages. Trees are a more natural
representation of the input when hierarchical relations are established among the pattern components. In particular, stochastic tree grammars have been widely used to tackle ambiguity in natural language parsing [Cha93,Sto95]. In treebank-based models, the grammar is inferred from a collection of hand-parsed sentences and the probabilities are estimated from the number of expansions of each type in the sample. However, grammars that are built in this way from a given data set usually suffer from extreme overgeneralization [Cha96]. The approach we follow in this paper may allow for a reduction in this overgeneralization, as it takes into account the context in which a rule is applied.
2 Trees and Tree Recognizers
Given an alphabet, that is, a finite set of symbols Σ = {σ_1, ..., σ_|Σ|}, the set Σ^T of Σ-trees is defined through the context-free grammar G = (Σ′, {T, F}, T, R), where the alphabet Σ′ includes Σ and the left and right parentheses, and whose set of rules R contains:
– T → σ(F) for all σ ∈ Σ
– F → ε | T F
where ε represents the empty string. For brevity, we will write σ instead of σ(). The depth of a tree of the type t = σ is depth(σ) = 0, while the depth of a tree of the type t = σ(t_1 ... t_m) is

  depth(σ(t_1 ... t_m)) = 1 + max_{1≤j≤m} depth(t_j)    (1)
For instance, the Σ-tree a(b(a(bc))c) belongs to {a, b, c}^T and its depth is 3. Its graphical representation is depicted in Fig. 1.
A deterministic finite-state tree automaton (DTA) is defined as a four-tuple A = (Q, Σ, Δ, F), where Q = {q_1, ..., q_|Q|} is a finite set of states, Σ = {σ_1, ..., σ_|Σ|} is an alphabet, F ⊆ Q is the subset of accepting states and Δ = {δ_0, δ_1, ..., δ_M} is a collection of transition functions of the form δ_m : Σ × Q^m → Q. For all trees t ∈ Σ^T, the result δ(t) ∈ Q of the operation of A on t is

  δ(t) = δ_0(σ)                               if t = σ ∈ Σ
  δ(t) = δ_m(σ, δ(t_1), ..., δ(t_m))          if t = σ(t_1 ... t_m) with m > 0    (2)

The language recognized by the automaton A is the subset of Σ^T

  L(A) = {t ∈ Σ^T : δ(t) ∈ F}.    (3)
For instance, if Σ = {a, b, c} and Δ contains the transitions δ_0(b) = q_1, δ_0(c) = q_2, δ_2(a, q_1, q_2) = q_2 and δ_1(b, q_2) = q_1, the result of the operation of A on the tree t = a(b(a(bc))c), plotted in Fig. 1, is δ(t) = δ_2(a, δ(b(a(bc))), δ(c)). Recursively, one gets δ(c) = q_2 and δ(b(a(bc))) = q_1. Then, δ(t) = δ_2(a, q_1, q_2) = q_2. By convention, undefined transitions lead to absorption states, that is, to unaccepted trees.

Fig. 1. A representation of the tree t = a(b(a(bc))c).

Stochastic tree automata generate a probability distribution over the trees in Σ^T. A stochastic DTA incorporates a probability for every transition in the automaton, with the normalization that the probabilities of transitions leading to the same state q ∈ Q must add up to one. In other words, there is a collection of functions P = {p_0, p_1, p_2, ..., p_M} of the type p_m : Σ × Q^m → [0, 1] such that they satisfy, for all q ∈ Q,

  Σ_{σ∈Σ} Σ_{m=0}^{M} Σ_{q_1,...,q_m ∈ Q : δ_m(σ,q_1,...,q_m)=q} p_m(σ, q_1, ..., q_m) = 1    (4)
In addition to these probabilities, every stochastic deterministic tree automaton A = (Q, Σ, Δ, P, r) provides a function r : Q → [0, 1] which, for every q ∈ Q, gives the probability that a tree satisfies δ(t) = q and replaces, in the definition of the DTA, the subset of accepting states. Then, the probability of a tree t in the language generated by the stochastic DTA A is given by the product of the probabilities of all the transitions used when t is processed by A, times r(δ(t)):

  p(t|A) = r(δ(t)) π(t)    (5)

with π(t) recursively given by

  π(t) = p_0(σ)                                          if t = σ ∈ Σ
  π(t) = p_m(σ, δ(t_1), ..., δ(t_m)) π(t_1) ··· π(t_m)   if t = σ(t_1 ... t_m) with m > 0    (6)
The equations (5–6) define a probability distribution p(t|A) which is consistent if

  Σ_{t∈Σ^T} p(t|A) = 1.    (7)
As put forward in [Wet80] and shown in [CR86], context-free grammars whose probabilities are estimated from random samples are always consistent. It is easy to show [Sak92] that the language recognized by a DTA can also be generated by a regular tree grammar. In the following, the probabilities of the DTA will be extracted from random samples and, therefore, consistency is always preserved.
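To make the definitions above concrete, the following Python sketch (not from the paper) evaluates δ(t) and π(t) bottom-up for the example DTA given in the text (δ_0(b) = q_1, δ_0(c) = q_2, δ_2(a, q_1, q_2) = q_2, δ_1(b, q_2) = q_1). Trees are encoded as nested tuples; the transition probabilities and the function r are arbitrary placeholder values chosen only so that the example runs.

    # Trees are nested tuples: ('a', ('b', ('a', ('b',), ('c',))), ('c',)) encodes a(b(a(bc))c).
    delta = {('b',): 'q1', ('c',): 'q2',          # delta_0
             ('a', 'q1', 'q2'): 'q2',             # delta_2
             ('b', 'q2'): 'q1'}                   # delta_1
    p = {('b',): 0.5, ('c',): 0.7,                # placeholder transition probabilities
         ('a', 'q1', 'q2'): 0.3, ('b', 'q2'): 0.5}
    r = {'q1': 0.0, 'q2': 1.0}                    # placeholder root probabilities

    def evaluate(t):
        """Return (delta(t), pi(t)) computed bottom-up as in equations (2) and (6)."""
        label, children = t[0], t[1:]
        states, pis = zip(*(evaluate(c) for c in children)) if children else ((), ())
        key = (label,) + tuple(states)
        prod = 1.0
        for x in pis:
            prod *= x
        return delta[key], p[key] * prod

    t = ('a', ('b', ('a', ('b',), ('c',))), ('c',))
    state, pi = evaluate(t)
    print(state, r[state] * pi)   # delta(t) = q2, and p(t|A) = r(q2) * pi(t) by equation (5)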
3 Locally Testable Tree Languages
For all k > 0 and for all trees t = σ(t_1 ... t_m) ∈ Σ^T, the k-root of t is a tree in Σ^T defined as

  r_k(σ(t_1 ... t_m)) = σ                                   if k = 1
  r_k(σ(t_1 ... t_m)) = σ(r_{k−1}(t_1) ... r_{k−1}(t_m))    otherwise    (8)

Note that in case m = 0, that is t = σ ∈ Σ, then r_k(σ) = σ. On the other hand, the set f_k(t) of k-forks and the set s_k(t) of k-subtrees are defined as follows:

  f_k(σ(t_1 ... t_m)) = ∅                                                if depth(σ(t_1 ... t_m)) < k − 1
  f_k(σ(t_1 ... t_m)) = ∪_{j=1}^{m} f_k(t_j) ∪ r_k(σ(t_1 ... t_m))       otherwise    (9)

  s_k(σ(t_1 ... t_m)) = ∪_{j=1}^{m} s_k(t_j) ∪ ∅                  if depth(σ(t_1 ... t_m)) > k − 1
  s_k(σ(t_1 ... t_m)) = ∪_{j=1}^{m} s_k(t_j) ∪ σ(t_1 ... t_m)     otherwise    (10)

In the particular case t = σ ∈ Σ, s_k(t) = f_1(t) = σ and f_k(t) = ∅ for all k > 1. For instance, if t = a(b(a(bc))c) then one gets r_2(t) = {a(bc)}, f_2(t) = {a(bc), b(a)} and s_2(t) = {a(bc), b, c}. Note that these definitions coincide with those in [Knu93] except for the meaning of k.

A tree language T is a strictly k-testable language (with k ≥ 2) if there exist finite subsets R, F, S ⊆ Σ^T such that

  t ∈ T ⇔ r_{k−1}(t) ⊆ R ∧ f_k(t) ⊆ F ∧ s_{k−1}(t) ⊆ S.    (11)
In such a case, it is straightforward [Knu93,Gar93] to build a DTA A = (Q, Σ, Δ, F) that recognizes T. For this purpose, it suffices to take:
– Q = R ∪ r_{k−1}(F) ∪ S;
– F = R;
– δ_m(σ, t_1, ..., t_m) = σ(t_1 ... t_m) for all σ(t_1 ... t_m) ∈ S;
– δ_m(σ, t_1, ..., t_m) = r_{k−1}(σ(t_1 ... t_m)) for all σ(t_1 ... t_m) ∈ F.

If one assumes that the tree language L is k-testable, the DTA recognizing L can be identified from positive samples [Knu93,Gar93], that is, from sets made of examples of trees in the language. Given a positive sample S, the procedure to obtain the DTA essentially builds the automaton A using r_{k−1}(S), f_k(S) and s_{k−1}(S) instead of R, F and S respectively in the above definitions for Q, F and Δ.
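The following sketch (our own illustration, using the same nested-tuple encoding as the earlier example) implements definitions (8)–(10); for t = a(b(a(bc))c) it reproduces the sets r_2(t), f_2(t) and s_2(t) quoted above.

    def depth(t):
        return 0 if len(t) == 1 else 1 + max(depth(c) for c in t[1:])

    def root(k, t):                      # k-root, equation (8)
        if k == 1 or len(t) == 1:
            return (t[0],)
        return (t[0],) + tuple(root(k - 1, c) for c in t[1:])

    def forks(k, t):                     # k-forks, equation (9)
        if depth(t) < k - 1:
            return set()
        out = {root(k, t)}
        for c in t[1:]:
            out |= forks(k, c)
        return out

    def subtrees(k, t):                  # k-subtrees, equation (10)
        out = set() if depth(t) > k - 1 else {t}
        for c in t[1:]:
            out |= subtrees(k, c)
        return out

    t = ('a', ('b', ('a', ('b',), ('c',))), ('c',))
    print(root(2, t))        # a(bc), i.e. ('a', ('b',), ('c',))
    print(forks(2, t))       # {a(bc), b(a)}
    print(subtrees(2, t))    # {a(bc), b, c}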
4 Estimating Transition Probabilities
A stochastic sample S = {τ_1, τ_2, ..., τ_|S|} consists of a sequence of trees generated according to a given probability distribution. If our model is a stochastic DTA, the distribution is p(t|A) as given by equations (5–6). Again, the assumption that the underlying transition scheme (that is, the states Q and the collection of transition functions Δ) corresponds to a k-testable DTA allows one to infer a stochastic DTA from a sample in a simple way. For this purpose, one should note that the likelihood of the stochastic sample S,

  ∏_{i=1}^{n} p(τ_i|A),    (12)

is maximized [NEK95] if the automaton A assigns to every tree τ in the sample a probability equal to the relative frequency of τ in S. In other words, every transition in Δ is assigned a probability which coincides with the relative number of times the rule is used when the trees in the sample are parsed. Summarizing, given a stochastic sample S = {τ_1, τ_2, ..., τ_|S|}, the set of states is

  Q = r_{k−1}(S) ∪ r_{k−1}(f_k(S)) ∪ s_{k−1}(S);    (13)

the subset of accepting states is

  F = r_{k−1}(S);    (14)

the probabilities r(t) are estimated from S as

  r(t) = (1/|S|) Σ_{i=1}^{|S|} δ_{t, r_{k−1}(τ_i)},    (15)

with δ_{t,τ} = 1 if t = τ and zero otherwise; and, finally, the transition probabilities P are estimated as

  p_m(σ, t_1, ..., t_m) = Σ_{i=1}^{|S|} C(σ(t_1, ..., t_m), τ_i) / Σ_{i=1}^{|S|} C(r_{k−1}(σ(t_1, ..., t_m)), τ_i)    (16)
where C(t, τ) counts the number of forks t in τ, that is,

  C(t, σ(t_1 ... t_m)) = δ_{t, r_{k−1}(σ(t_1 ... t_m))} + Σ_{j=1}^{m} C(t, t_j)    (17)

It is useful to store the above probabilities r and p as the quotient of two terms, as given by equations (15) and (16). In this way, if a new sample S′ is provided, the automaton A can be easily updated to account for the additional information. For this incremental update, it suffices to increment each term in the equations with the sums obtained for the new sample.
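The following sketch illustrates one way to organize the counts behind equations (15)–(17) so that the incremental update described above amounts to accumulating further counts. It reuses the depth/root/forks helpers of the earlier sketch, handles only the fork-based transitions of equation (16) (ignoring the subtree states of equation (13)), and reflects our reading of the estimation formulas rather than the authors' implementation.

    from collections import Counter

    def count_forks(k, t):
        """Multiset of k-fork occurrences in t."""
        c = Counter()
        if depth(t) >= k - 1:
            c[root(k, t)] += 1
        for child in t[1:]:
            c += count_forks(k, child)
        return c

    class KTestableModel:
        """Counts behind equations (15)-(16); probabilities are quotients of counts,
        so update() can be called again with a new sample (incremental update)."""
        def __init__(self, k):
            self.k = k
            self.root_count = Counter()   # numerator of r(t): how often t = r_{k-1}(tau)
            self.fork_count = Counter()   # numerator of p_m:  occurrences of each k-fork
            self.state_count = Counter()  # denominator of p_m: forks grouped by their (k-1)-root
            self.n = 0                    # denominator of r(t): sample size |S|

        def update(self, sample):
            for tau in sample:
                self.n += 1
                self.root_count[root(self.k - 1, tau)] += 1
                for fork, c in count_forks(self.k, tau).items():
                    self.fork_count[fork] += c
                    self.state_count[root(self.k - 1, fork)] += c

        def r(self, t):
            return self.root_count[t] / self.n

        def p(self, fork):
            return self.fork_count[fork] / self.state_count[root(self.k - 1, fork)]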
5 Approximating Stochastic DTA by k-Testable Automata
From the construction, it is obvious that all stochastic k-testable languages can be generated by a stochastic DTA. However, as is also the case with string languages [SS94], the reciprocal is not always true. The best approximate k-testable model can be obtained in the following way, based upon the results in [CRC98]. Assume that we are given a stochastic DTA A = (Q, Σ, Δ, P, r). For any value of k, we obtain a k-testable stochastic DTA A′ = (Q′, Σ, Δ′, P′, r′) whose probabilities are given by

  r′(j) = Σ_{i∈Q} r(i) η_{ij}    (18)

for all j ∈ Q′ and

  p′_m(σ, j_1, ..., j_m) = Σ_{i_1,...,i_m ∈ Q} C_{δ(σ,i_1,...,i_m)} p_m(σ, i_1, ..., i_m) η_{i_1 j_1} ··· η_{i_m j_m} / Σ_{i∈Q} C_i    (19)

where C_i is the expected number of nodes of type i in a tree and η_{ij} represents the probability that a node i expands as a subtree t such that r_{k−1}(t) = j. All these coefficients can be easily computed [CRC98] using iterative procedures. In particular, C_i is given by

  C_i^{[n+1]} = r(i) + Σ_{j∈Q} Λ_{ij} C_j^{[n]}    (20)

with C_i^{[0]} = 0 and

  Λ_{ij} = Σ_{m=1}^{M} Σ_{σ∈Σ} Σ_{j_1,j_2,...,j_m ∈ Q : δ_m(σ,j_1,...,j_m)=j} p_m(σ, j_1, j_2, ..., j_m) (δ_{i j_1} + ... + δ_{i j_m})    (21)

The coefficients η_{ij} are computed as

  η_{ij}^{[n+1]} = Σ_{m=0}^{M} Σ_{σ∈Σ} Σ_{i_1,...,i_m ∈ Q : δ_m(σ,i_1,...,i_m)=i} Σ_{j_1,...,j_m ∈ Q′ : δ′_m(σ,j_1,...,j_m)=j} p_m(σ, i_1, ..., i_m) η_{i_1 j_1}^{[n]} η_{i_2 j_2}^{[n]} ··· η_{i_m j_m}^{[n]}    (22)

starting with η_{ij}^{[0]} = 0 (note that the terms with m = 0 do not necessarily vanish with this seed). Obviously, the cross entropy between the exact model A and the approximate one A′ can also be computed following the method described in [CRC98].
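As an illustration of the iterative procedure in equation (20), the following sketch (using numpy, with made-up values for r and Λ chosen only so that the loop runs) iterates C^{[n+1]} = r + Λ C^{[n]} from C^{[0]} = 0 until convergence.

    import numpy as np

    def expected_node_counts(r, Lam, iters=100, tol=1e-12):
        """Iterate equation (20): C^{[n+1]} = r + Lam @ C^{[n]}, starting from C^{[0]} = 0."""
        C = np.zeros_like(r)
        for _ in range(iters):
            C_next = r + Lam @ C
            if np.max(np.abs(C_next - C)) < tol:
                return C_next
            C = C_next
        return C

    # Toy two-state example with made-up values (not taken from the paper).
    r = np.array([0.6, 0.4])
    Lam = np.array([[0.2, 0.1],
                    [0.3, 0.0]])
    print(expected_node_counts(r, Lam))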
6 Conclusion
We propose a probabilistic extension of k-testable tree languages that can also be regarded as a generalization of k-grams for tree languages. This model can be updated incrementally and allows for a smaller degree of generalization than tree grammars directly obtained from the sample (a case that corresponds to k = 2). On the other hand, this model may work with medium-sized samples, where state merging methods such as that in [COCR00] output models that are too simple, with an insufficient number of states.
References

[BPd+92] Peter F. Brown, Vincent J. Della Pietra, Peter V. deSouza, Jenifer C. Lai, and Robert L. Mercer. Class-based n-gram models of natural language. Computational Linguistics, 18(4):467–479, 1992.
[Cha93] Eugene Charniak. Statistical Language Learning. MIT Press, 1993.
[Cha96] Eugene Charniak. Tree-bank grammars. In Proceedings of the Thirteenth National Conference on Artificial Intelligence and the Eighth Innovative Applications of Artificial Intelligence Conference, pages 1031–1036, Menlo Park, 1996. AAAI Press/MIT Press.
[Chu67] K. L. Chung. Markov Chains with Stationary Transition Probabilities. Springer, Berlin, 2nd edition, 1967.
[COCR00] Rafael C. Carrasco, Jose Oncina, and Jorge Calera-Rubio. Stochastic inference of regular tree languages. Machine Learning, to appear, 2000.
[CR86] R. Chaudhuri and A. N. V. Rao. Approximating grammar probabilities: Solution of a conjecture. Journal of the ACM, 33(4):702–705, 1986.
[CRC98] Jorge Calera-Rubio and Rafael C. Carrasco. Computing the relative entropy between regular tree languages. Information Processing Letters, 68(6):283–289, 1998.
[CT91] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley Series in Telecommunications. John Wiley & Sons, New York, NY, USA, 1991.
[Gar93] Pedro García. Learning k-testable tree sets from positive data. Technical Report DSIC-ii-1993-46, DSIC, Universidad Politécnica de Valencia, 1993.
[GV90] Pedro García and Enrique Vidal. Inference of k-testable languages in the strict sense and application to syntactic pattern recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(9):920–925, September 1990.
[Jel98] Frederick Jelinek. Statistical Methods for Speech Recognition. The MIT Press, Cambridge, Massachusetts, 1998.
[Knu93] Timo Knuutila. Inference of k-testable tree languages. In H. Bunke, editor, Advances in Structural and Syntactic Pattern Recognition (Proc. Intl. Workshop on Structural and Syntactic Pattern Recognition, Bern, Switzerland). World Scientific, August 1993.
[NEK95] H. Ney, U. Essen, and R. Kneser. On the estimation of small probabilities by leaving-one-out. IEEE Transactions on Pattern Analysis and Machine Intelligence, 17(12):1202–1212, 1995.
[Rub76] Frank Rubin. Experiments in text file compression. Communications of the ACM, 19(11):617–623, 1976.
[Sak92] Yasubumi Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97(1):23–60, March 1992.
[SS94] Andreas Stolcke and Jonathan Segal. Precise n-gram probabilities from stochastic context-free grammars. Technical Report TR-94-007, International Computer Science Institute, Berkeley, CA, January 1994.
[Sto95] Andreas Stolcke. An efficient context-free parsing algorithm that computes prefix probabilities. Computational Linguistics, 21(2):165–201, 1995.
[Wet80] C. S. Wetherell. Probabilistic languages: A review and some open questions. ACM Computing Surveys, 12(4):361–379, December 1980.
[Yok95] Takashi Yokomori. On polynomial-time learnability in the limit of strictly deterministic automata. Machine Learning, 19:153–179, 1995.
Learning Context-Free Grammars from Partially Structured Examples

Yasubumi Sakakibara and Hidenori Muramatsu
Department of Information Sciences, Tokyo Denki University, Hatoyama, Hiki-gun, Saitama 350-0394, Japan
Email: [email protected]
Abstract. In this paper, we consider the problem of inductively learning context-free grammars from partially structured examples. A structured example is represented by a string with some parentheses inserted to indicate the shape of the derivation tree of a grammar. We show that the partially structured examples contribute to improving the efficiency of the learning algorithm. We employ the GA-based learning algorithm for context-free grammars using tabular representations which Sakakibara and Kondo have proposed previously [7], and present an algorithm to eliminate unnecessary nonterminals and production rules using the partially structured examples at the initial stage of the GA-based learning algorithm. We also show that our learning algorithm from partially structured examples can identify a context-free grammar having the intended structure and is more flexible and applicable than the learning methods from completely structured examples [5].
1 Introduction
Inductive learning of formal languages inherently contains computationally hard problems. For example, identification of minimum-state finite automata from positive and negative examples is known to be NP-hard [3]. Nevertheless, one advantage in learning finite automata is that once we construct a so-called prefix-tree automaton for the positive examples (that is, a specific deterministic finite automaton which accepts exactly the positive examples), the learning problem can be reduced to the problem of merging states in the prefix-tree automaton. Dupont [2] has applied the genetic algorithm to solve the partitioning problem for the set of states in the prefix-tree automaton.
In this paper, we study the problem of inductively learning context-free grammars from positive and negative examples. This is a more difficult learning problem than learning finite automata. The reason for the hardness is that the problem of learning context-free grammars from examples has two specific aspects: determining the grammatical structure (topology) of the unknown grammar, and identifying nonterminals in the grammar. The first problem is especially hard because the number of all possible grammatical structures to be considered for a given positive example grows exponentially with the length of the positive example. Thus, the hypothesis space of context-free grammars is very large (too large) to search
a correct context-free grammar consistent with the given examples. Sakakibara [5] has shown that if information on the grammatical structure of the unknown context-free grammar to be learned is available to the learning algorithm, there exists an efficient algorithm for learning context-free grammars from only positive examples. On the other hand, to overcome the hardness of learning context-free grammars from examples without structural information available, Sakakibara and Kondo [7] have proposed a hypothesis representation method using a table which is similar to the parse table used in the Cocke-Younger-Kasami parsing algorithm (CYK algorithm) for context-free grammars in Chomsky normal form. By employing this representation method, the problem of learning context-free grammars from examples can be reduced to the partitioning problem of nonterminals.
In this paper, we propose a compromise between these two approaches that enjoys both advantages. While it is impractical and difficult to assume that completely structured examples are given, it may be reasonable to assume that some partially structured examples are available. A completely structured string is a string with parentheses inserted to indicate the shape of the derivation tree of a grammar, or equivalently an unlabelled derivation tree of the grammar. A partially structured string is defined to be a completely structured string with some pairs of left and right parentheses missing. We show that the partially structured examples contribute to significantly improving the efficiency of the learning algorithm. While Sakakibara and Kondo's hypothesis representation method efficiently represents an exponential number of possible grammatical structures, the learning algorithm using the genetic algorithm still takes a large amount of time. We present an algorithm that eliminates unnecessary nonterminals and production rules based on the given partially structured string and reduces the learning problem size at the initial stage of the GA-based learning algorithm. We also show that our learning algorithm from partially structured examples can identify a grammar having the intended structure, that is, structurally equivalent to the unknown grammar, and is more flexible and applicable than Sakakibara's learning algorithm from completely structured examples [5].
2 Preliminaries
A context-free grammar (CFG) is defined by a quadruple G = (N, Σ, P, S), where N is an alphabet of nonterminal symbols, Σ is an alphabet of terminal symbols such that N ∩ Σ = ∅, P is a finite set of production rules of the form A → α for A ∈ N and α ∈ (N ∪ Σ)∗ , and S is a special nonterminal called the start symbol . A derivation is a rewriting of a string in (N ∪ Σ)∗ using the production rules of the CFG G. In each step of the derivation, a nonterminal from the current string is chosen and replaced with the right-hand side of a production rule for that nonterminal. This replacement process is repeated until the string consists
of terminal symbols only. If a derivation begins with a nonterminal A and derives a string α ∈ (N ∪ Σ)∗, we write A ⇒ α. The language generated by a CFG G is denoted L(G), that is, L(G) = {w | S ⇒ w, w ∈ Σ∗}. Two CFGs G and G′ are said to be equivalent if and only if L(G) = L(G′). A CFG G = (N, Σ, P, S) is in Chomsky normal form if each production rule is of the form A → BC or A → a, where A, B, C ∈ N and a ∈ Σ. In the following, we fix a terminal alphabet Σ, and without loss of generality, we only consider context-free grammars without any ε-production rules and any useless production rules, where an ε-production rule is of the form A → ε (where ε denotes the empty string of length 0) and a production rule A → α is useless if there is no derivation S ⇒ βAγ ⇒ βαγ ⇒ w for any β, γ ∈ (N ∪ Σ)∗ and any w ∈ Σ∗.
We assume the unknown (target) CFG, denoted G∗, to be learned. A positive example of G∗ is a string in L(G∗) and a negative example of G∗ is a string not in L(G∗). A representative sample of G∗ is defined to be a finite subset of L(G∗) that exercises every production rule in G∗, that is, every production is used at least once to generate the subset. A string with grammatical structure, called a structured string or a structural description (of a string), is a string with some parentheses inserted to indicate the shape of the derivation tree of a grammar, or equivalently an unlabelled derivation tree of the grammar, that is, a derivation tree whose internal nodes have no labels.
3 Learning Algorithm Using Tabular Representation
For learning context-free grammars or context-free languages, we usually use CFGs themselves for representing hypotheses. However, the hypothesis space of CFGs is very large (too large) to search for a correct CFG consistent with the given examples. This is because the problem of learning CFGs from examples has two specific aspects:
1. determining the grammatical structure (topology) of the unknown grammar, and
2. identifying nonterminals in the grammar.
The first problem is especially hard. For example, assume that a given positive example is the string "aabb". All possible grammatical structures for this string are shown in Figure 1 when we consider CFGs in Chomsky normal form. Thus, the number of all possible grammatical structures to be considered for a string of length n grows exponentially with n.
To overcome this hardness, Sakakibara and Kondo [7] have proposed a representation method using a table which is similar to the parse table used in the Cocke-Younger-Kasami parsing algorithm (CYK algorithm) [1] for CFGs in Chomsky normal form.
Fig. 1. All possible grammatical structures for the string “aabb”.
3.1 Tabular Representation
Given a positive example w = a_1 a_2 ··· a_n, the tabular representation for w is the triangular table T(w) where each element, denoted t_{i,j}, for 1 ≤ i ≤ n and 2 ≤ j ≤ n − i + 1, contains the set {X_{i,j,k_1}, ..., X_{i,j,k_{j−1}}} of j − 1 distinct nonterminals. For j = 1, t_{i,1} is the singleton set {X_{i,1,1}}. (See Figure 2.)

Fig. 2. The tabular representation for w.

The primitive CFG G(T(w)) = (N, Σ, P, S) derived from the tabular representation T(w) is defined as follows:
N = {X_{i,j,k} | 1 ≤ i ≤ n, 1 ≤ j ≤ n − i + 1, 1 ≤ k < j}
P = {X_{i,j,k} → X_{i,k,l_1} X_{i+k,j−k,l_2} | 1 ≤ i ≤ n, 1 ≤ j ≤ n − i + 1, 1 ≤ k < j, 1 ≤ l_1 < k, 1 ≤ l_2 < j − k}
    ∪ {X_{i,1,1} → a_i | 1 ≤ i ≤ n}
    ∪ {S → X_{1,n,k} | 1 ≤ k ≤ n − 1}
Thus, for each nonterminal X_{i,j,k} at entry t_{i,j}, we have the production rules X_{i,j,k} → X_{i,k,l_1} X_{i+k,j−k,l_2} whose right-hand sides consist of nonterminals at entries t_{i,k} and t_{i+k,j−k}. The number of nonterminals contained in G(T(w)) is at most n³ and the size (the number of production rules) of the primitive CFG G(T(w)) is very roughly O(n⁵), that is, polynomial in n.
This primitive CFG G(T(w)) only generates the string w but can generate all possible grammatical structures on w. For example, when the positive example "aabb" is given, the tabular representation T(aabb) and the primitive CFG G(T(aabb)) are as shown in Figure 3. While the size of G(T(aabb)) is polynomial in the length of the given example, this CFG G(T(aabb)) can generate all possible grammatical structures for aabb, as shown in Figure 4.
The tabular representation T(aabb):

  j=4:  {X1,4,1, X1,4,2, X1,4,3}
  j=3:  {X1,3,1, X1,3,2}   {X2,3,1, X2,3,2}
  j=2:  {X1,2,1}   {X2,2,1}   {X3,2,1}
  j=1:  {X1,1,1}   {X2,1,1}   {X3,1,1}   {X4,1,1}
         i=1 (a)   i=2 (a)    i=3 (b)    i=4 (b)

The primitive CFG G(T(aabb)):

  P = { S → X1,4,1,  S → X1,4,2,  S → X1,4,3,
        X1,4,1 → X1,1,1 X2,3,1,  X1,4,1 → X1,1,1 X2,3,2,  X1,4,2 → X1,2,1 X3,2,1,
        X1,4,3 → X1,3,1 X4,1,1,  X1,4,3 → X1,3,2 X4,1,1,  X1,3,1 → X1,1,1 X2,2,1,
        X1,3,2 → X1,2,1 X3,1,1,  X2,3,1 → X2,1,1 X3,2,1,  X2,3,2 → X2,2,1 X4,1,1,
        X1,2,1 → X1,1,1 X2,1,1,  X2,2,1 → X2,1,1 X3,1,1,  X3,2,1 → X3,1,1 X4,1,1,
        X1,1,1 → a,  X2,1,1 → a,  X3,1,1 → b,  X4,1,1 → b }

Fig. 3. The tabular representation T(aabb) and the derived primitive CFG G(T(aabb)).
By employing the tabular representation, the problem of learning context-free grammars from examples can be reduced to the partitioning problem of nonterminals.
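A small sketch of the construction above (our own code, not the authors'): it enumerates the nonterminals of T(w) and the production set of the primitive CFG G(T(w)); for w = "aabb" it yields the 19 productions listed in Fig. 3.

    def primitive_cfg(w):
        """Build the production set of the primitive CFG G(T(w))."""
        n = len(w)
        def cell(i, j):
            # Nonterminal names at entry t_{i,j}: j-1 entries, but a single entry when j = 1.
            return [('X', i, j, k) for k in range(1, max(j, 2))]
        P = set()
        for k in range(1, n):                       # start rules S -> X_{1,n,k}
            P.add(('S', (('X', 1, n, k),)))
        for i in range(1, n + 1):                   # terminal rules X_{i,1,1} -> a_i
            P.add((('X', i, 1, 1), (w[i - 1],)))
        for i in range(1, n + 1):
            for j in range(2, n - i + 2):
                for A in cell(i, j):
                    k = A[3]                        # split position encoded in the third index
                    for B in cell(i, k):
                        for C in cell(i + k, j - k):
                            P.add((A, (B, C)))
        return P

    print(len(primitive_cfg("aabb")))   # 19 productions, as in Fig. 3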
Fig. 4. All possible parse trees of G(T (aabb)) for “aabb”.
3.2 Partitioning Algorithm Using Genetic Algorithm
We use a genetic algorithm to solve the partitioning problem for the set of nonterminals {Xi,j,k | 1 ≤ i ≤ n, 1 ≤ j ≤ n − i + 1, 1 ≤ k < j}. This partitioning problem contains the problem of finding minimum-state finite automata consistent with the given examples and hence it is NP-hard. Genetic algorithms have been well studied for the partitioning problems and we take a successful
approach by Von Laszewski for partitioning n elements into k categories (blocks) [4].
A partition of some set X is a set of pairwise disjoint nonempty subsets of X whose union is X. If π is a partition of X, then for any element x ∈ X there is a unique element of π containing x, which we denote K(x, π) and call the block of π containing x. Let G = (N, Σ, P, S) be a CFG and π be a partition of the set N of nonterminals. The CFG G/π = (N′, Σ, P′, S′) induced by π from G is defined as follows:
N′ = π (the set of blocks of π),
P′ = {K(A, π) → K(B, π) K(C, π) | A → BC ∈ P} ∪ {K(A, π) → a | A → a ∈ P},
S′ = K(S, π).
Now, the learning algorithm TBL for CFGs is summarized as follows: we assume that a finite sample U of positive and negative examples which contains a representative sample of the unknown CFG G∗ is given. Let U+ denote the set of positive examples in U and U− denote the set of negative examples.
The Learning Algorithm TBL:
1. Construct the tabular representation T(w) for each positive example w in U+;
2. Derive the primitive CFG G(T(w)) for each w in U+;
3. Take the union of those primitive CFGs, that is, G(T(U+)) = ∪_{w∈U+} G(T(w));
4. Find a smallest partition π_s such that G(T(U+))/π_s is consistent with U, that is, consistent with the positive examples U+ and the negative examples U−;
5. Output the resulting CFG G(T(U+))/π_s.
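The core operation of step 4 is building the induced grammar G(T(U+))/π for a candidate partition π. A minimal sketch follows, with our own encoding of blocks and productions and a made-up toy grammar for the usage example.

    def block(x, partition):
        """K(x, pi): the unique block of the partition containing x."""
        for b in partition:
            if x in b:
                return b
        raise ValueError(f"{x} not covered by the partition")

    def induced_grammar(productions, start, partition):
        """G/pi: map every nonterminal of a CNF grammar to its block.
        Productions are pairs (A, rhs) with rhs either a pair (B, C) or a terminal string."""
        new_P = set()
        for A, rhs in productions:
            if isinstance(rhs, tuple):                      # A -> B C
                new_P.add((block(A, partition),
                           (block(rhs[0], partition), block(rhs[1], partition))))
            else:                                           # A -> a
                new_P.add((block(A, partition), rhs))
        return new_P, block(start, partition)

    # Toy example: merge X111 and X211 (both deriving 'a') into one block.
    P = {("X111", ("X211", "X311")), ("X111", "a"), ("X211", "a"), ("X311", "b")}
    pi = [frozenset({"X111", "X211"}), frozenset({"X311"})]
    print(induced_grammar(P, "X111", pi))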
4 Learning Algorithm to Incorporate Partially Structured Examples
While the tabular representation method efficiently represents an exponential number of possible grammatical structures, the learning algorithm using genetic algorithm still takes a large amount of time. On the other hand, while it is impractical and difficult to assume completely structured examples given, it may be reasonable to assume some partially structured examples available. In this section, we propose an approach to make use of some partial information about the grammatical structure of the given examples and present the learning algorithm to incorporate partially structured examples, and we show that the partially structured examples contribute to improving the efficiency of the learning algorithm.
The learning algorithm from partially structured examples consists of three steps:
1. construct the tabular representation T(w) and the primitive CFG G(T(w)) for the given positive examples w,
2. eliminate unnecessary nonterminals and production rules from G(T(w)) based on the given partially structured examples,
3. merge distinct nonterminals so as to be consistent with the given positive and negative examples, using the genetic algorithm.

4.1 Partially Structured Example
We assume a CFG G. A (completely) structured string is a string with parentheses inserted to indicate the shape of the derivation tree of G, or equivalently an unlabelled derivation tree of G, that is, a derivation tree whose internal nodes have no labels. A partially structured string is defined to be a completely structured string but some pairs of left and right parentheses missing. For example, assume a CFG G = ({S, A, B}, {a, b}, P = {S → Ab, A → aB, B → ab}, S). For a string “aabb”, the completely structured string is “(a(ab)) b”, and all partially structured strings are “a(ab)b”, “(aab)b”, and “(a(ab))b”. We also define the notion of grammatical structures (or structured strings) fit to the given partially structured string. If a structured string w0 is obtained by inserting some pairs of parentheses into a partially structured string w, we say w0 is fit to w. For example, the grammatical structures fit to a partially structured string “a(ab)b” are shown in Figure 5.
Fig. 5. (upper:) two grammatical structures fit to a partially structured string “a(ab)b”, and (lower:) three grammatical structures not fit to “a(ab)b”.
4.2 Eliminating Unnecessary Nonterminals and Production Rules
As we have seen, the primitive CFG G(T(w)) can generate all possible grammatical structures on w. Now, given a partially structured string w, we do not need any nonterminals in the primitive CFG G(T(w)) which form a grammatical structure not fit to the partial structure of w. The algorithm to eliminate such unnecessary nonterminals consists of the following two steps:
1. eliminate unnecessary nonterminals and production rules using the algorithm shown in Figure 6, based on the given partially structured examples,
2. eliminate nonterminals which become useless as a result of the eliminations of unnecessary nonterminals at Step 1.
Let w = p_0 a_1 p_1 a_2 ··· p_{n−1} a_n p_n be a partially structured string input to the algorithm ELM in Figure 6, where a_i ∈ Σ and p_i ∈ {(, )}∗, that is, p_i is a sequence of inserted parentheses. For example, let w = "a(ab)b" be a partially structured string. In this case, a_1 = 'a', a_2 = 'a', a_3 = 'b', a_4 = 'b', and p_0 = ε (that is, no parenthesis is inserted here), p_1 = '(', p_2 = ε, p_3 = ')', p_4 = ε.
The main task of the algorithm ELM in Figure 6 is to check the correct correspondence of the inserted parentheses in the substring p_{i−1} a_i p_i ··· p_{i+j−2} a_{i+j−1} p_{i+j−1} for each nonterminal X_{i,j,k} (1 ≤ i ≤ n − 1, 2 ≤ j ≤ n − i + 1, 1 ≤ k < j). While scanning the substring p_{i−1} a_i p_i ··· p_{i+j−2} a_{i+j−1} p_{i+j−1} from left to right, when the number of right parentheses seen so far becomes larger than the number of left parentheses, it immediately eliminates the nonterminal X_{i,j,k}. At the end of scanning the substring, if there is a left parenthesis which has no corresponding right parenthesis, it also eliminates the nonterminal X_{i,j,k}.
Second, we eliminate nonterminals which become useless as a result of the eliminations of unnecessary nonterminals by the algorithm ELM. Let G = (N, Σ, P, S) be a CFG. A nonterminal symbol X is useful if there is a derivation S ⇒ αXβ ⇒ w for some α, β ∈ (N ∪ Σ)∗ and w ∈ Σ∗. Otherwise, X is useless. The useful nonterminal symbols can be found effectively by the following two iterative algorithms:
– Iterative algorithm to find the set V of nonterminals X such that X ⇒ w for some w ∈ Σ∗:
  1. Every nonterminal X with a production rule X → w (w ∈ Σ∗) in P is clearly added to V.
  2. If X → B_1 B_2 ··· B_n is a production rule and each B_i (1 ≤ i ≤ n) is either a terminal or a nonterminal already placed in V, then X is added to V.
– Iterative algorithm to find the set V of nonterminals X such that S ⇒ αXβ:
  1. S is clearly added to V.
  2. If X is already placed in V and X → B_1 B_2 ··· B_n is a production rule, then all nonterminals of B_1 B_2 ··· B_n are added to V.
For the partially structured string "a(ab)b", the successive uses of these two algorithms eliminate five nonterminals in the tabular representation T(aabb), as shown in Figure 7, and eliminate eight production rules in the primitive CFG G(T(aabb)), as shown in Figure 8.
– Input: a partially structured string w = p_0 a_1 p_1 a_2 ··· p_{n−1} a_n p_n where a_i ∈ Σ, p_i ∈ {(, )}∗.
– Procedure:
    for m = 1 to n do {
        leftp[m]  = the number of '(' in p_{m−1};
        rightp[m] = the number of ')' in p_m;
    };
    for each X_{i,j,k} (1 ≤ i ≤ n − 1, 2 ≤ j ≤ n − i + 1, 1 ≤ k < j) do {
        flag = 0;
        Xb = leftp[i];
        for m = 1 to j − 1 do {
            Xb = Xb − rightp[i + m − 1];
            if Xb = 0 then flag = 1;
            if Xb < 0 then delete X_{i,j,k} and break;
            Xb = Xb + leftp[i + m];
        };
        Xb = Xb − rightp[i + j − 1];
        if flag = 1 and Xb > 0 then delete X_{i,j,k};
        if flag = 0 and Xb − leftp[i] > 0 then delete X_{i,j,k};
    };
Fig. 6. Algorithm ELM to eliminate unnecessary nonterminals.

Fig. 7. Unnecessary nonterminals in the tabular representation T(aabb) given the partially structured string "a(ab)b".
Fig. 8. Unnecessary production rules in the primitive CFG G(T(aabb)) given the partially structured string "a(ab)b".
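For experimentation, the ELM procedure of Fig. 6 can be transliterated directly into Python. The sketch below (our own code, not the authors' implementation) returns the (i, j) spans, 2 ≤ j, whose nonterminals X_{i,j,k} survive the check, using the 1-based indexing of the figure.

    def parse_partial(w, alphabet):
        """Split a partially structured string into symbols a_1..a_n and parenthesis runs p_0..p_n."""
        symbols, runs, current = [], [], ""
        for ch in w:
            if ch in "()":
                current += ch
            else:
                assert ch in alphabet
                runs.append(current)
                symbols.append(ch)
                current = ""
        runs.append(current)
        return symbols, runs          # len(runs) == len(symbols) + 1

    def elm_surviving_spans(w, alphabet):
        """Transliteration of algorithm ELM (Fig. 6); returns the (i, j) spans kept."""
        symbols, p = parse_partial(w, alphabet)
        n = len(symbols)
        leftp = [0] + [p[m - 1].count("(") for m in range(1, n + 1)]   # leftp[m], 1-based
        rightp = [0] + [p[m].count(")") for m in range(1, n + 1)]      # rightp[m], 1-based
        kept = set()
        for i in range(1, n):                       # 1 <= i <= n-1
            for j in range(2, n - i + 2):           # 2 <= j <= n-i+1
                flag, Xb, deleted = 0, leftp[i], False
                for m in range(1, j):
                    Xb -= rightp[i + m - 1]
                    if Xb == 0:
                        flag = 1
                    if Xb < 0:
                        deleted = True
                        break
                    Xb += leftp[i + m]
                if not deleted:
                    Xb -= rightp[i + j - 1]
                    if flag == 1 and Xb > 0:
                        deleted = True
                    if flag == 0 and Xb - leftp[i] > 0:
                        deleted = True
                if not deleted:
                    kept.add((i, j))
        return kept

    print(sorted(elm_surviving_spans("a(ab)b", {"a", "b"})))
    # [(1, 3), (1, 4), (2, 2), (2, 3)] -- spans (1, 2) and (3, 4) are eliminated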
5 Experimental Results
The experiments we have done are designed to see how several different partially structured examples improve the efficiency of the learning algorithm. Let L∗ = {a^m b^m c^n | m, n ≥ 1} over Σ = {a, b, c} be the unknown context-free language to be learned, and let the unknown CFG G∗ = (N, Σ, P, S) used to give the partially structured examples be the following:
N = {S, A, B, C, X, Y, Z},
P = { S → X Z, S → X C, X → Y B, Y → A X, X → A B, Z → C Z, Z → C C, A → a, B → b, C → c }.
According to the unknown CFG G∗, we give seven different partially structured examples of the string "aabbccc", from the unstructured one to the completely structured one, as follows:
1. aabbccc
2. (aabb)ccc
3. (aabb)(ccc)
4. (a(ab)b)ccc
5. (a(ab)b)(ccc)
6. (a(ab)b)(c(cc))
7. ((a(ab))b)(c(cc))
Giving these different partially structured examples as input to the learning algorithm, we see that more structured examples contribute to a greater improvement in efficiency. The summary of our experiments is shown in Figure 9. We give all positive examples of length up to 10 and all negative examples of length up to 20, together with one of the partially structured examples 1 to 7. The learning algorithm outputs a correct CFG as the best individual in a population of size 100.
  No.  partially structured example   #Np   #Pp   GAsteps
   1   aabbccc                         61   139      1200
   2   (aabb)ccc                       27    39      1000
   3   (aabb)(ccc)                     22    29       760
   4   (a(ab)b)ccc                     22    28       440
   5   (a(ab)b)(ccc)                   17    20       400
   6   (a(ab)b)(c(cc))                 15    16       390
   7   ((a(ab))b)(c(cc))               13    13       150

#Np: the number of nonterminals in the primitive CFG G(T(aabbccc))
#Pp: the number of production rules in G(T(aabbccc))
GAsteps: the number of generation steps (iterations) of the genetic algorithm needed to converge to a correct CFG

Fig. 9. Experimental results.
We clearly see that, given more structured examples, the number of nonterminals and the number of production rules in the primitive CFG G(T(aabbccc)), as well as the number of generation steps the genetic algorithm needs to converge to a correct CFG, all decrease significantly. Interestingly, two CFGs with different grammatical structures have been learned:
– In the case of the structured examples 3, 5, 6 and 7, the learned CFG is structurally equivalent to the unknown CFG G∗, as follows:
N = {⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨104⟩, ⟨113⟩, ⟨121⟩, ⟨148⟩},
P = { S → ⟨104⟩, ⟨104⟩ → ⟨148⟩ ⟨121⟩, ⟨104⟩ → ⟨148⟩ ⟨3⟩, ⟨148⟩ → ⟨113⟩ ⟨2⟩, ⟨148⟩ → ⟨1⟩ ⟨2⟩, ⟨113⟩ → ⟨1⟩ ⟨148⟩, ⟨121⟩ → ⟨3⟩ ⟨121⟩, ⟨121⟩ → ⟨3⟩ ⟨3⟩, ⟨1⟩ → a, ⟨2⟩ → b, ⟨3⟩ → c }.
– In the case of the structured examples 1, 2 and 4, the learned CFG has a smaller number of nonterminals than the unknown CFG G∗, as follows:
N = {⟨1⟩, ⟨2⟩, ⟨3⟩, ⟨112⟩, ⟨116⟩, ⟨140⟩},
P = { S → ⟨140⟩, ⟨140⟩ → ⟨140⟩ ⟨3⟩, ⟨140⟩ → ⟨116⟩ ⟨3⟩, ⟨116⟩ → ⟨112⟩ ⟨2⟩, ⟨116⟩ → ⟨1⟩ ⟨2⟩, ⟨112⟩ → ⟨1⟩ ⟨116⟩, ⟨1⟩ → a, ⟨2⟩ → b, ⟨3⟩ → c }.
6 Conclusions
We have proposed an algorithm for learning context-free grammars from partially structured examples. A partially structured string has been defined to be a completely structured string but some pairs of left and right parentheses missing. We have shown via experiments that the partially structured examples contribute to improving the efficiency of the learning algorithm. While it is impractical and difficult to assume completely structured examples given, it may be reasonable to assume some partially structured examples available. In this sense, our learning algorithm from partially structured examples is more flexible and applicable than the learning algorithm from completely structured examples. Further, our learning algorithm can also identify a grammar having the intended structure, that is, structurally equivalent to the unknown grammar. In practice, there are many partially bracketed sentences available in the databases for natural language processing. Our future work is to apply our learning algorithm to such real databases. Another interesting and important future work is to investigate the relation between the degree of structural information (that is, how well structured the given example is) and the number of necessary unstructured examples and further to theoretically analyse the possibility that the structured examples could decrease the number of unstructured examples (especially, the number of negative examples) required for correct learning.
References
1. A. V. Aho and J. D. Ullman. The Theory of Parsing, Translation and Compiling, Vol. I: Parsing. Prentice Hall, 1972.
2. P. Dupont. Regular grammatical inference from positive and negative samples by genetic search: the GIG method. In Proceedings of the Second International Colloquium on Grammatical Inference (ICGI-94), LNAI 862, Springer-Verlag, 1994, 236–245.
3. E. M. Gold. Complexity of automaton identification from given data. Information and Control, 37, 1978, 302–320.
4. Z. Michalewicz. Genetic Algorithms + Data Structures = Evolution Programs. Springer-Verlag, 1996.
5. Y. Sakakibara. Efficient learning of context-free grammars from positive structural examples. Information and Computation, 97, 1992, 23–60.
6. Y. Sakakibara. Recent advances of grammatical inference. Theoretical Computer Science, 185, 1997, 15–45.
7. Y. Sakakibara and M. Kondo. GA-based learning of context-free grammars using tabular representations. In Proceedings of the 16th International Conference on Machine Learning (ICML-99), Bled, Slovenia, 1999.
Identification of Tree Translation Rules from Examples

Hiroshi Sakamoto, Hiroki Arimura, and Setsuo Arikawa
Department of Informatics, Kyushu University, Hakozaki 6-10-1, Higashi-ku, Fukuoka-shi 812-8581, Japan
{hiroshi, arim, arikawa}@i.kyushu-u.ac.jp
Abstract. Two models for simple translation between ordered trees are introduced. In the first, the output is obtained from the input by renaming labels and deleting nodes; several decision problems on this translation are proved to be tractable or intractable. The second is a term rewriting system, called k-variable linear translation. The efficient learnability of this system using membership and equivalence queries is shown.
1 Introduction
The HTML documents currently distributed on the network can be regarded as a very large text database. These markup texts are structured by many tags that have their own predefined meanings. Although computers cannot understand the meaning of languages, they can perform complicated rendering using these structures. Recently, a new markup language called XML [4] has been recommended by the World Wide Web Consortium, and it is expected to enable more intelligent information exchange, like update or delete operations, by remote operation.
A markup text is expressed by a rooted ordered tree. The root is the unique node which denotes the whole document, the other internal nodes are labeled by tags, and the leaves are labeled by the contents of the document or the attributes of the tags. For example, the source file of this paper, considered as a markup text, has an internal node labeled by the tag \section, and this node has some children, e.g., the leaf labeled by Introduction and the contents of this section. An ordered tree is one in which, for every internal node, the order of its children is defined.
Thus, the objects considered in this paper are translations between input and output trees. The aim is to produce classes of appropriate translations and analyze their learning complexity. For example, an XML document is translated to HTML for browsing. The same document may be translated to another XML document in a different format for data exchange. These translations are described by the language XSLT, which was recommended by the W3C in 1999 [5]. This language is very powerful because regular expressions and recursion are allowed in it, and thus it seems hard to learn this language from given examples alone. Thus, we introduce more restricted classes of tree translations.
There are two types of data exchange models considered in this paper. One is extraction and the other is reconstruction. Extraction is a very simple translation related to second-order matching [9]. This translation is expressed by T → t such that a small tree t is obtained only by (1) renaming labels of T or (2) deleting nodes of T. This model is suitable for situations in which a user takes out specific entries from a very large table as a small table, or renames a specific tag without changing the structure of the document.
[Figure: extraction]
On the other hand, reconstruction is more complicated. It is characterized by term rewriting f → g for terms f and g with variables. In this model, we can perform more complicated transformations of trees, such as exchanging any two subtrees of an input tree and renaming labels depending on the ancestors or descendants of the current node. For example, it is possible to change the order of title and author in a digital book card. Such a translation cannot be defined by an erasing homomorphism, because a homomorphism must preserve the order of any two nodes. However, an erasing homomorphism also cannot always be defined by term rewriting, e.g., deleting any subsection tag and making its children become the children of the parent of the tag. This operation is called embedding in graph theory. Thus, neither model properly contains the other.
[Figure: reconstruction by the rule f(x, h(y, z)) → g(y, z)]
This paper is organized as follows. In the next section, we give the formal definitions of the erasing homomorphism and the term rewriting system. We also define the problem of identifying translation rules from given pairs of trees as examples with respect to both models. In particular, a single example is given in the case of
the erasing homomorphism; that is, this problem is a decision problem. In the case of the term rewriting system, we define the class of k-variable linear translation systems, in which any rule contains at most k variables and no variable appears in a term twice. These restrictions guarantee the termination and confluence of the rewriting system.
In Section 3, the complexity of the announced decision problem is considered. We first deal with a subclass of erasing homomorphisms, called erasing isomorphisms. This problem contains the tree inclusion problem [11] as a special case. The first result is that the hardness of this problem is not spoiled even if the given trees are of depth 1, that is, strings. It is open whether this problem is in P, but we show that a nontrivial subproblem is in P. Moreover, we prove the NP-completeness of the erasing homomorphism problem under the restriction that either the given trees are strings or the output tree is labeled by a single alphabet.
In Section 4, the learning problem of linear translation systems from membership and equivalence queries [1] is considered. The hypothesis space is the class of translation systems and the target class is the class of k-variable linear translation systems. A counterexample for a hypothesis is an ordered pair (t, t′) of trees such that exactly one of the hypothesis and the target can translate t to t′. We present a learning algorithm based on the theory of [2,3], and we show that our algorithm identifies each target using at most O(m) equivalence queries and at most O(kn^{2k}) membership queries, where m is the number of rules of the target and n is the number of nodes of the counterexamples.
2 Tree Translation Models
For a finite set A, we denote by A∗ the set of all finite sequences over A and by A+ the set A∗ \ {λ}, where λ is the null sequence. Let card(A) denote the cardinality of A. An alphabet Σ is a finite set of symbols. An alphabet is used for the set of labels for trees in the first model and for the ranked alphabet to define terms in the second model.

2.1 Erasing Homomorphisms
We first introduce a very simple class of tree translations, called erasing homomorphisms, as a formal model of data extraction from semi-structured data. In Section 3, we will study the identification problem for erasing isomorphisms from given examples.
A tree is a connected, acyclic, directed graph. A rooted tree is a tree in which one of the vertices is distinguished from the others and is called the root. We refer to a vertex of a rooted tree as a node of the tree. An ordered tree is a rooted tree in which the children of each node are ordered. That is, if a node has k children, then we can designate them as the first child, the second child, and so on up to the k-th child. node(T) denotes the set of nodes of a tree T, and |T| denotes card(node(T)). Let ℓ(i) denote the label of a node i ∈ node(T). An alphabet Σ is a set of symbols.
A labeled tree T is considered as a pair of a graph Sk and a mapping h from node(Sk) to Σ such that h(i) = A iff ℓ(i) = A for each i ∈ node(Sk) = node(T). We call the tree Sk a skeleton of T, and the skeleton of a tree T is denoted by Sk(T). Let T/n be the subtree of T whose root is n ∈ node(T). Let λ denote the unique null symbol not in Σ.
We define two operations on a tree T. One is renaming, denoted by a → b, which replaces all labels a in T by b. The other is deleting, denoted by a → λ, which removes every node n with ℓ(n) = a in T and makes the children of n become the children of the parent of n. Let S = {a → b | a ∈ Σ, b ∈ Σ ∪ {λ}} be a set of operations. Then, we write T →_S T′ iff T′ is obtained by applying all operations in S to T simultaneously.
Definition 1. Let (T, P) be a pair of trees over Σ. Then, the problem of erasing homomorphism is to decide whether there exists a set S of operations such that T →_S P. The input tree T is called the target and P the pattern. This problem is denoted by EHP(T, P).
Definition 2. The problem of erasing isomorphism, denoted by EIP(T, P), is to decide whether T →_S P for some S such that if a → c, b → c ∈ S, then either a = b or c = λ, that is, two different symbols are never renamed to the same symbol.
There are further restrictions of these problems. EHP(T, P)_k and EIP(T, P)_k denote the problems in which the depth of the input tree T is bounded by k, where the depth of T is the length of the longest path of T. In particular, a string is a tree of depth 1. When we consider the restriction that any two nodes of a pattern tree P are labeled by distinct symbols, this problem is a special case of EIP(T, P). Moreover, this problem is equivalent to the tree inclusion problem [11], which is decidable in O(|T| · |P|) time.
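As an illustration of the two operations, the following sketch (our own code, with trees encoded as nested tuples and None standing for λ) applies a set S of renamings and deletions simultaneously to a tree.

    LAMBDA = None   # stands for the null symbol of a deleting operation a -> lambda

    def apply_ops(tree, S):
        """Apply all renaming/deleting operations in S (a dict label -> label or LAMBDA)
        simultaneously to an ordered tree given as a nested tuple (label, child1, ...).
        Returns a list of trees: a deleted node is replaced by its (rewritten) children."""
        label, children = tree[0], tree[1:]
        new_children = []
        for c in children:
            new_children.extend(apply_ops(c, S))
        target = S.get(label, label)
        if target is LAMBDA:                 # deleting: children are promoted to the parent
            return new_children
        return [(target,) + tuple(new_children)]

    # T = a(b(a(bc))c); rename a -> x, delete b.
    T = ('a', ('b', ('a', ('b',), ('c',))), ('c',))
    S = {'a': 'x', 'b': LAMBDA}
    print(apply_ops(T, S))   # [('x', ('x', ('c',)), ('c',))]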
2.2 Tree Translation Systems
In this subsection, we introduce a formal model of data reconstruction for semistructured data, called tree translation systems, which is more expressive than the class of erasing isomorphism in the last subsection. In Section 4, we will consider the identification problem for tree translation systems in an interactive setting. First, we introduce the class of ranked trees whose node label is ranked and the out-degree of a node is bounded by the rank of its node label, where we do not allow any operations such as deletion and insertion that may change the out-degree of a node. Let Σ = ∪n≥0 Σn be a finite ranked alphabet of function symbols, where for each f ∈ Σ, a nonnegative integer arity(f ) ≥ 0, called arity, is associated. We assume that Σ contains at least one symbol of arity zero. Let X be a countable set of variables disjoint with Σ, where we assume that each x ∈ X has arity zero. Definition 3. We denote by T (Σ, X) the set of all labeled, rooted, ordered trees t such that
– Each node v of t is labeled with a symbol in Σ ∪ X, denoted by t(v).
– If t(v) is a function symbol f ∈ Σ of arity k ≥ 0 then v has exactly k children.
– If t(v) is a variable x ∈ X then v is a leaf.
We call each element t ∈ T(Σ, X) a pattern tree (pattern for short). A pattern tree is also called a first-order term in formal logic. We often write T by omitting Σ and X if they are clearly understood from the context. For a pattern t, we denote the set of variables appearing in t by var(t) ⊆ X and define the number of the nodes of t as size(t). A pattern t is said to be a ground pattern if it contains no variables.
A position in a pattern t is simply a node of t. For a pattern t, we denote by occ(t) the set of all positions in t. If there is a downward path from a position α to another position β in t, we say that α is above β or β is below α, and write α ≥ β. If α ≥ β but β ≥ α does not hold, then we say that α is strictly above β, β is strictly below α, and write α > β. For a pattern t and a position α ∈ occ(t), the subpattern appearing at α, denoted by t/α, is the labeled subtree of t whose root is located at α.
A pattern t is called linear if any variable x ∈ X appears in t at most once. A pattern t is of k-variable if var(t) = {x_1, ..., x_k}. For k ≥ 0, we use the notation t[x_1, ..., x_k] to indicate that the pattern t is a k-variable linear pattern with mutually distinct variables x_1, ..., x_k ∈ X, where the order of the variables in t is arbitrary. For a k-variable linear pattern t[x_1, ..., x_k] and a sequence of patterns s_1, ..., s_k, we define t[s_1, ..., s_k] as the term obtained from t by replacing the occurrence of x_i with the pattern s_i for every 1 ≤ i ≤ k. Now, we introduce tree translation systems.
Definition 4. A tree translation rule (rule for short) is an ordered pair (p, q) ∈ T × T such that var(p) ⊇ var(q). We also write (p → q) for the rule (p, q). A tree translation system (TT) is a set H of translation rules.
Definition 5. A translation rule C = (p, q) is of k-variable if card(var(C)) ≤ k, and linear if both p and q are linear. For every k ≥ 0, we denote by LR(k) and LTT(k) the classes of all k-variable linear translation rules and all k-variable linear tree translation systems, respectively. We also denote by LTT = ∪_{k≥0} LTT(k) the class of all linear tree translation systems.
Definition 6. Let H ∈ LTT be a linear translation system. The translation relation defined by H, with the set M(H) ⊆ T × T, is defined recursively as follows.
– Identity: For every pattern p ∈ T, (p, p) ∈ M(H).
– Congruence: If f ∈ Σ is a function symbol of arity k ≥ 0 and (p_i, q_i) ∈ M(H) for every i, then (f(p_1, ..., p_k), f(q_1, ..., q_k)) ∈ M(H).
– Application: If (p[x_1, ..., x_k], q[x_1, ..., x_k]) ∈ H is a k-variable linear rule and (p_i, q_i) ∈ M(H) for every i, then (p[p_1, ..., p_k], q[q_1, ..., q_k]) ∈ M(H), where note that p and q are k-variable linear terms.
If C ∈ M(H) then we say that the rule C is derived by H. The definition of the meaning M(H) above corresponds to the computation of a top-down tree transducer [7] or to a special case of the term rewriting relation [6] where only top-down rewriting is allowed.
Lemma 1. Given a pair C ∈ T × T and H ∈ LTT(k), the problem of deciding the membership C ∈ M(H) can be computed in O(mn⁵) time, where m = card(H) and n = size(C).
Proof. By using dynamic programming, we can compute all position pairs π ∈ occ(C) such that C/π ∈ M(H) with the time complexity stated above. ⊓⊔
In what follows, we will normally denote patterns by the letters p, q, s and t, translation rules by the capital letters C, D, and translation systems by the capital letter H, possibly subscripted. Now, we extend the notions of size, the set of variables, positions, and subpatterns to rules as follows. For a rule C = (p, q), we define size(C) = size(p) + size(q) and var(C) = var(p) ∪ var(q). A position pair in a rule C = (p_1, p_2) is any pair (α_1, α_2) such that α_i ∈ occ(p_i) for every i = 1, 2. We denote the set of position pairs in C by occ(C) = occ(p_1) × occ(p_2). For position pairs π = (α_1, α_2) and τ = (β_1, β_2), we extend ≥ by π ≥ τ iff α_i ≥ β_i for every i = 1, 2. The strict order > is defined by π > τ iff π ≥ τ but τ ≥ π does not hold. For a rule C = (p_1, p_2) and a position pair π = (α_1, α_2), the subrule at π is the rule C/π = (p_1/α_1, p_2/α_2). For a rule C, a subrule D of C at π is said to be smaller than a subrule E of C at τ if π ≤ τ holds.
Definition 7. Let C = (p[x_1, ..., x_k], q[x_1, ..., x_k]) be a k-variable linear rule and let D ∈ T × T be a rule. If there exist (p_i, q_i) ∈ M(H∗) for every i such that D = (p[p_1, ..., p_k], q[q_1, ..., q_k]), then we say that C covers D relative to H∗ and write C H∗ D.
3 Deciding Translation Rules by Single Examples
In this section, we consider the erasing homomorphism and erasing isomorphism problems, denoted by EHP(T, P) and EIP(T, P), respectively. First, we study the complexity of EIP(T, P) and show that a subclass is in P. Next, we prove the NP-hardness of the more general problem EHP(T, P). Recall that EIP(T, P)_1 is the problem in which T and P are both strings. The following result tells us that EIP(T, P) ∈ P iff EIP(T, P)_1 ∈ P.
Theorem 1. EIP(T, P) is polynomial-time reducible to EIP(T, P)_1.
Proof. Let node(T) = {0, 1, ..., n}, where 0 is the root, and let {ℓ(0), ..., ℓ(n)} ∩ {a_0, ..., a_n} = ∅. The reduction is as follows. Compute strings t(i) from T recursively as follows.
1. For each leaf node i of T , let t(i) = (ai A)2 ai = ai Aai Aai for A = `(i). 2. For each internal node i, let t(i) = (ai A)2 · t(ji ) · · · t(jk ) · ai for A = `(i), where ji , . . . , jk are the children of i in the left order. Similarly, compute the string p(i) from the pattern P . The (t(0), p(0)) is computable in log-space. Since Σ ∩ {a0 , . . . , an } = ∅, we note that the construction of t(0) defines a one-to-one mapping from the set of trees over Σ to a subset of strings over Σ ∩ {a0 , . . . , an }. Then, we show that T → P iff t(0) → p(0). Suppose that T →S P for some S and let divide S into SΣ = {a → b | a, b ∈ Σ} and Sλ = {a → λ | a ∈ Σ}. There exists a tree T 0 such that T →Sλ T 0 and T 0 →SΣ P . Let S 0 = {a → λ | A → λ ∈ Sλ , t(0) has aAaA}. Then, there exists a string 0 t such that t(0) →Sλ ∪S 0 t0 . Since node i of T is labeled by A iff t(0) contains (ai A)2 , we have that T →Sλ T 0 iff t(0) →Sλ ∪S 0 t0 . Thus, t0 = t0 (0), where t0 (0) is the string from T 0 by the above reduction. Moreover, t0 (0) →SΣ p(0). Thus, T →S P implies t(0) →Sλ ∪S 0 ∪SΣ p(0). Finally, we show the converse. Suppose t(0) →S p(0). By the one-to-one correspondence between T and t(0) (P and p(0)), it is clear that if S = {a → b | b 6= λ}, then T →S P . Assume that S = {a → b ∈ S | b 6= λ} ∪ Sλ and t(0) →Sλ t0 . Note the following facts: (1) only the t(i) contains ai Aai A as its prefix for each i = 0, . . . , n. (2) the string t(0) contains no square of a symbol. Thus, if A → λ ∈ Sλ , then ai → λ ∈ Sλ . The application of A → λ and ai → λ corresponds to delete the node i of T . Let Sλ0 = {A → λλ | A ∈ Σ}. Then, t(0) →Sλ t0 (0) implies T 0 →Sλ0 T 0 . Hence T →S 0 P for S 0 = {a → b ∈ S | a ∈ Σ}. Therefore, T → P iff t(0) → p(0). t u One of aims of this section is to show the complexity of EIP (T, P ). By the theorem 1, we can reduce the EIP (T, P ) to EIP (T, P )1 . Thus, we consider only the problem EIP (T, P )1 and write EIP (T, P ) instead of EIP (T, P )1 . Next, we show that there is a subclass of EIP (T, P ) to be in P. Let w = a1 a2 · · · an ∈ Σ n . The i-th symbol of w is denoted by w[i]. The substring ai ai+1 · · · aj of w is denoted by w[i, j]. An occurrence of a string α on w is a number i such that w[i, i + |α| − 1] = α. The number of occurrences of α in w is denoted by ](α, w). The set of all occurrences of α in w is denoted by occ(α, w). Let w, α, β be strings. There exists an overlap of α and β on w if there exist occurrences i and j of α and β on w such that i < j < |α| + i − 1 or j < i < |β| + j − 1. If a string is of the form AαA for some A ∈ Σ and ](A, α) = 0, then we call the string an interval of A. A string w ∈ Σ ∗ is called k-interval free if w contains an overlap of at most (k − 1) intervals. A string w ∈ Σ ∗ is said to have a split if there exists an 1 ≤ i ≤ |w| − 1 such that w[i] does not contained in any interval appearing in w. Example 1. Let us consider some example for the notions defined in the above. The string ABBCA is an interval of A but ABACA is not. The string ABCADB contains no split because each symbol is contained in an interval of A or B. On the other
hand, ACADBBB has a split. The following are examples of an overlap of 3 intervals and of a 3-interval free string.
(Figure: the string A B C B A C contains an overlap of 3 intervals; the string A A B B C A C A is 3-interval free.)
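The definitions of occurrences, intervals, and splits above can be checked mechanically on small examples. The following Python sketch is purely illustrative (it is not part of the paper and omits the k-interval-freeness test); it uses 0-based positions while the text counts from 1.

```python
# Illustrative sketch: naive checks for occurrences, intervals and splits,
# as defined above (0-based positions; the paper counts from 1).

def occurrences(alpha, w):
    """occ(alpha, w): all positions i with w[i : i + len(alpha)] == alpha."""
    return [i for i in range(len(w) - len(alpha) + 1) if w[i:i + len(alpha)] == alpha]

def intervals(w):
    """All (i, j) such that w[i..j] is an interval A alpha A of the symbol w[i]."""
    result = []
    for i, a in enumerate(w):
        for j in range(i + 1, len(w)):
            if w[j] == a:                 # nearest next occurrence of a:
                result.append((i, j))     # no 'a' strictly inside, so an interval
                break
    return result

def has_split(w):
    """True if some position (except the last) lies in no interval of w."""
    covered = set()
    for i, j in intervals(w):
        covered.update(range(i, j + 1))
    return any(p not in covered for p in range(len(w) - 1))

# Example 1 revisited:
print(has_split("ABCADB"), has_split("ACADBBB"))   # expected: False True
```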
Definition 8. The problem EIP (T, P ) is denoted EIP (T, P )k if T and P are both k-interval free. Lemma 2. If T and P are 3-interval free and have no split, then EIP (T, P )3 is decidable in polynomial time in |T | and |P |. Proof. For given T , define hT ii recursively as follows. 1. Let hT i1 = occ(T [1], T ). 2. Let hT ii+1 = occ(a, T ) such that an interval of ai contains a and an interval of a contains ai , where hT ii = occ(ai , T ). The sequence hT i1 , . . . , hT ii is uniquely decided because if both occ(a, T ) and occ(a0 , T ) satisfy the condition for hT ii+1 , then either an a contained in intervals of ai and a0 or an a0 contained in intervals of ai and a. This is a contradiction for that T is 3-interval tree. All symbols appearing in T at least two times are classified into three categories. First is hT ii . We can compute the set hT i of hT ii and hP i of hP ii0 in O(|T |2 + |P |2 ) time. Second is [hT ii ] which is a set of occ(a, T ) 6∈ hT i such that at least two intervals of ai contains a for some occ(ai , T ) ∈ hT ii . The time to compute the sets is in O(|T | + |P |) time. Third is [hT ii ]j which is the sequence of occ(a, T ) 6∈ hT i such that ](a, T ) ≥ 2 and exactly one interval of ai contains a for some occ(ai , T ) ∈ hT ii . The time to compute the sequences is in O(|T | + |P |) time. Let T →S P and a → b ∈ S. Then, since T and P are 3-interval free, occ(a, T ) ∈ hT i iff occ(b, P ) ∈ hP i, occ(a, T ) ∈ [hT ii ] iff occ(b, P ) ∈ [hP ii0 ], and occ(a, T ) ∈ [hT ii ]j iff occ(b, P ) ∈ [hP ii0 ]j 0 . A correct matching for occ(a, T ) and occ(b, P ) is computed as follows. Let n = |hP i|.
Step1: For hT i, hP i, and i = 1, . . . , n, check whether there exists a k such that; (1-a) The length of hT ik+i is equal to that of hP ii . (1-b) The number of ak+i+1 in j-th interval of ak+i is equal to that of bi+1 in j-th interval of bi , where occ(ak+i+1 , T ), occ(ak+i , T ) ∈ hT i, occ(bi+1 , P ), occ(ai , P ) ∈ hP i. For all i = 1, . . . , n, occ(ak+i , T ) ∈ hT i, and occ(bi , P ) ∈ hP i, the number of ak+i+1 in j-th interval of ai is equal to the number of bi+1 in j-th interval of bi , where 1 ≤ j ≤ n − 1. This check is done in O(n(|T | + |P |)) = O(|T |2 ). If there is no such a k, then answer “no” and terminate. If a k is found, then make the matching S = {ak+i → bi | i = 1, . . . , n} and go to the next stage. Step2: If ak+i → bi ∈ S, then for each occ(b, P ) ∈ [hP ii ], find an occ(a, T ) ∈ [hT ik+i ] such that the number of b in the j-th interval of bi is equal to the number of a in the j-th interval of ak+i . This check is done in (|T | + |P |). If there is no occ(a, T ) for an occ(b, P ), then let S = ∅ and go to the condition (1) to find the next k, and add all the a → b to S otherwise. Step3: Let [hP ii ]j = occ(b1 , P ), . . . , occ(bm , P ) and occ(a1 , T ), . . . , occ(am , T ) a subsequence of [hT ik+i ]j . If |occ(a` , T )| = |occ(b` , P )|, then add all a` → b` to S. This check is done in O(|T | + |P |). For any other a in T such that ](a, T ) ≥ 2, let a → λ ∈ S. Now the remained matching is a → b for ](a, T ) = ](b, P ) = 1. This check is done in O(|T | + |P |). t u The total time to check whether T → P is O(|T |2 + |P |2 ). Theorem 2. EIP (T, P )3 ∈ P. Proof. Each string w is represented by a concatenation of w1 , . . . , wn such that w = αw1 β · · · wn γ, where all wi have no split, and for every symbol a in α, β, . . . , γ, ](a, w) = 1. Let T = αt1 βt2 · · · tn γ and P = α0 p1 β 0 p2 · · · pm γ 0 . All symbols in αβ · · · γ appear in T exactly once and all symbols in α0 β 0 · · · γ 0 appear in P exactly once. Thus, we have that T →S P iff tji →Si pi , where j1 , . . . , jm is a subsequence of 1, . . . , n, and |α| ≥ |α0 |, . . . , |γ| ≥ |γ 0 |. t u Theorem 3. The EHP (T, P ) is NP-complete even if P is labeled by a single alphabet. Proof. The general problem is clearly in NP, then we prove only the hardness, that is 3-SAT is log-space reducible to this problem. 3-SAT is the problem to decide whether there exists a truth assignment for a given 3-CNF over the set X = {x1 , . . . , xn } of variables. A 3-CNF is a Boolean formula of the form C = V m xi1 ∨ x ˜ i2 ∨ x ˜i3 ), where x ˜ is a literal of a variable x. i=1 Ci and Ci = (˜ In this proof we use a special notation t(t1 , . . . , tk ) for a graph which denotes that the edges (t, t1 ) . . . , (t, tk ) are defined and the order t1 < · · · < tk is fixed. The reduction is as follows.
(1) P is the tree defined by the graph (Vp , Ep ) such that Ep = Ep0 ∪Ep1 ∪Ep2 , Ep1 = {t0 (s1 ), s1 (s2 ), s2 (a1 , . . . , an )}, and Ep0 = {r(t0 , . . . , tm )}, Ep2 = {ti (p(i,1) , . . . , p(i,5) ) | i = 1, . . . , m}, where V is the set of all v and v 0 for (v, v 0 ) ∈ Ep and the root is r. For any node v ∈ VP , let `(v) = c. (2) T is the tree defined by the graph (Vt , Et ) such that Et = Et0 ∪ Et1 ∪ Et2 , Et1 = {t0 (s1 ), s1 (s2 ), s2 (a1 (b1 ), . . . , an (bn ))}, and Et0 = {r(t0 , . . . , tm )}, Et2 = {ti (t(i,0) , . . . , t(i,4) ), t(i,j) (t(i,j,1) , t(i,j,2) ) | i = 1, . . . , m, j = 1, 2, 3}, where Vt is the set of all v and v 0 for (v, v 0 ) ∈ Et and the root is r. For each i = 1, . . . , n, let `(ai ) = Ai and `(bi ) = Bi . For each i = 1, . . . , n, and j = 0, 4, let `(t(i,j) ) = c(i,j) . For each i = 1, . . . , n and j = 1, 2, 3, let `(t(i,j) ) = Bk , `(t(i,j,1) ) = `(t(i,j,2) ) = Ak if the j-th literal of Ci is positive, and let `(t(i,j) ) = Ak , `(t(i,j,1) ) = `(t(i,j,2) ) = Bk if the j-th literal of Ci is negative. Any other node is labeled by c. The depth of each node b1 , . . . , bn of T is 5 and the depth of any other node is at most 3. On the other hand, the depth of each node a1 , . . . , an is 4 and the depth of any other node is 2. Thus, if T → P , then Et1 → Ep1 and Et2 → Ep2 . If Et1 →S Ep1 , then either Ai → c, Bi → λ ∈ S or Bi → c, Ai → λ ∈ S for all i = 1, . . . , n. We can consider the corresponding (Ai → c, Bi → λ ∈ S) ⇔ xi = 1 and (Bi → c, Ai → λ ∈ S) ⇔ ¬xi = 1. We denote the operation corresponding to a truth assignment f by Sf . Note that only the matching Sf decide whether Et2 also appear in Et1 . Et2 → Ep2 because all labels in V m Suppose that the CNF C = i=1 Ci is satisfiable by f . Then, for each clause Ci of C, at least one literal x˜ij in Ci satisfies that x˜ij is positive and assigned 1, or x˜ij is negative and assigned 0. These are corresponding (Ai → c, Bi → λ), or (Bi → c, Ai → λ), respectively. This operation deletes the children of t( i, j) for at least one of j = 1, 2, 3. It follows that T /ti has 5, 6 or 7 children. In these cases, T /ti → P/ti by one of (c(i,0) → c, c(i,4) → c), (c(i,0) → c, c(i,4) → λ), and (c(i,0) → λ, c(i,4) → λ). Thus, we have T →Sf P . Conversely, if the CNF C is unsatisfiable, then for any assignment f , at least one Ci is not satisfiable, that is x˜ij is positive and assigned 0, or x˜ij is negative and assigned 1. It follows that all leaves of T /ti are not deleted by Sf . This implies that T /ti has 8 children. Thus, it cannot be true that T /ti → P/ti . Hence C is satisfiable iff T → P . The proof is completed. t u Theorem 4. The EHP (T, P ) is NP-complete even if T is a string. Proof. This problem Vmis also reducible from 3-SAT defined in the above. A 3-CNF xi1 ∨ x ˜i2 ∨ x ˜i3 ). The reduction is as follows. is of the form C = i=1 Ci and Ci = (˜ (1) T = T1 · T2 , T1 = A2 x1 ¬x1 · · · A2 xn ¬xn A2 , and T2 = α1 · · · αm . (2) P = P1 · P2 , P1 = A2 x1 · · · A2 xn A2 , and P2 = β1 · · · βm , where xi1 ui ˜ xi2 vi ˜ xi3 A2 , βi = xi1 xi2 xi3 A2 iff Ci = (˜ xi1 ∨ x ˜i2 ∨ x ˜i3 ). αi = ˜
Let T → P. Then either A → A or A → λ. Assume the latter. Then |T′| = 2n + 5m and |P| = 3n + 5m + 2, where T →s T′ for s = {A → λ}. Thus, T ↛ P in this case. Since A → A is thus forced, if T → P, then at least T1 → P1 and T2 → P2. Moreover, for each i = 1, . . . , n, exactly one of xi and ¬xi of T1 must be mapped to xi of P1. We can consider the correspondence between assignments f and rule sets Sf such that the variable xi is assigned 1 by f if xi → xi, ¬xi → λ ∈ Sf, and xi is assigned 0 by f if xi → λ, ¬xi → xi ∈ Sf. Suppose that the CNF C is satisfiable by an assignment f. Then for each clause Ci, at least one of x̃i1, x̃i2 and x̃i3 is assigned 1. It follows that at least one of x̃i1, x̃i2 and x̃i3 of αi is not deleted by Sf. Since ui → a and vi → b can be defined for any symbols a and b, it holds that αi →Sf ∪{ui→a, vi→b} βi. Thus, T → P. Conversely, suppose that C is not satisfied by any assignment f. Then for at least one αi, |α′i| = 2, where αi →Sf α′i. It follows that T ↛ P. Hence, it is proved that C is satisfiable iff T → P. □
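As an illustration only, the reduction used in the proof of Theorem 4 can be written down directly; the encoding of clauses as triples of signed variable indices (for example, (x1 ∨ ¬x2 ∨ x3) as (1, −2, 3)) is our own convention, not the paper's.

```python
# Illustrative sketch: build the strings T = T1·T2 and P = P1·P2 of the proof
# of Theorem 4 from a 3-CNF over variables x1, ..., xn (strings as token lists).

def theorem4_reduction(n, clauses):
    """clauses: list of triples of nonzero ints; +i stands for x_i, -i for ¬x_i."""
    lit = lambda j: f"x{j}" if j > 0 else f"~x{-j}"
    T1, P1 = [], []
    for i in range(1, n + 1):
        T1 += ["A", "A", f"x{i}", f"~x{i}"]        # A^2 x_i ¬x_i
        P1 += ["A", "A", f"x{i}"]                  # A^2 x_i
    T1 += ["A", "A"]                               # trailing A^2
    P1 += ["A", "A"]
    T2, P2 = [], []
    for i, (l1, l2, l3) in enumerate(clauses, start=1):
        T2 += [lit(l1), f"u{i}", lit(l2), f"v{i}", lit(l3), "A", "A"]   # alpha_i
        P2 += [f"x{abs(l1)}", f"x{abs(l2)}", f"x{abs(l3)}", "A", "A"]   # beta_i
    return T1 + T2, P1 + P2

# Example: the formula (x1 ∨ ¬x2 ∨ x3) ∧ (¬x1 ∨ x2 ∨ x3)
T, P = theorem4_reduction(3, [(1, -2, 3), (-1, 2, 3)])
```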
4 Identification of Tree Translations Using Queries
In this section, we show that there exists a polynomial time algorithm that exactly identifies any translation system in LTT(k) using equivalence and membership queries.
4.1 The Learning Problem
Our problem is to identify an unknown tree translation system H∗ from examples, that is, from ordered pairs that are either derived or not derived by H∗. As a formal model, we employ a variant of the exact learning model by Angluin [1] called learning from entailment [2,3,8,10], which is tailored to translation systems. Let H be a class of translation systems to be learned, called the hypothesis space, and let LR be the set of all ordered pairs, called the domain of learning. In our learning framework, the meaning or the concept represented by H ∈ H is the set M(H). If M(P) = M(Q) then we define P ≡ Q and say that P and Q are equivalent. A learning algorithm A is an algorithm that collects information about H∗ using the following types of queries. In this paper, we assume that the alphabet Σ is given to A in advance and that the maximum arity of symbols in Σ is constant. Definition 9. An equivalence query (EQ) proposes any translation system H ∈ H. If H ≡ H∗ then the answer to the query is "yes". Otherwise the answer is "no", and A receives some translation C ∈ LR as a counterexample such that either C ∈ M(H∗)\M(H) or C ∈ M(H)\M(H∗). A counterexample is positive if C ∈ M(H∗) and negative if C ∉ M(H∗). A membership query (MQ) proposes any translation C ∈ LR. The answer to the membership query is "yes" if C ∈ M(H∗), and "no" otherwise.
Algorithm Learn LTT(k);
Input: positive integer k;
Given: the equivalence and the membership queries for the target set H∗ ∈ LTT(k);
Output: a set H of linear tree translations equivalent to H∗;
begin
  H := ∅;
  until EQ(H) returns "yes" do begin
    Let E be a counterexample returned by the equivalence query;
    D := Shrink(E, H);      /* (See Definition 11) */
    H := H ∪ Expand(D, k);  /* (See Definition 12) */
  end /* main loop */
  return H;
Fig. 1. A learning algorithm for k-variable linear tree translations using equivalence and membership queries
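For illustration only, the control flow of Figure 1 can be rendered in Python with the oracles and the two subprocedures supplied as functions; the parameter names below are ours, not the paper's.

```python
# Illustrative sketch of the main loop of Learn LTT(k) in Figure 1.

def learn_ltt(k, eq_query, shrink, expand):
    """eq_query(H) returns (True, None) if H is equivalent to the target H*,
    and (False, E) with a counterexample E otherwise;
    shrink(E, H) returns a smallest positive sub-counterexample (Definition 11);
    expand(D, k) returns the LTT(k)-expansion of D (Definition 12)."""
    H = set()                       # initial hypothesis: the empty rule set
    while True:
        ok, E = eq_query(H)         # equivalence query EQ(H)
        if ok:
            return H                # H is equivalent to the target
        D = shrink(E, H)            # minimise the counterexample (uses MQs)
        H = H | expand(D, k)        # add all consistent k-variable rules (uses MQs)
```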
Definition 10. The goal of A is exact identification in polynomial time. A must halt and output a rewriting system H ∈ H such that H∗ ≡ H, where at any stage in learning, the running time, and thus the number of queries, must be bounded by a polynomial poly(m, n) in the size m of H∗ and the size n of the longest counterexample returned by equivalence queries so far. Although this setting may at first seem unnatural, it is known that exact learnability with equivalence queries implies polynomial time PAC-learnability [12] and polynomial time online learnability [1] under a mild condition on the class of target hypotheses, whether or not additional membership queries are allowed [1].
4.2 The Learning Algorithm
Figure 1 gives our learning algorithm Learn LTT(k), which uses equivalence and membership queries to identify a k-variable linear tree translation system H that is equivalent to the target H∗, but may be polynomially larger than H∗. In the algorithm, we denote by Shrink(E, H) and Expand(D, k) the procedures that return a smallest positive sub-counterexample of the rule E and the set of k-variable linear rules called the LTT(k)-expansion of D, respectively. Definition 11. A rule D is called a smallest positive sub-counterexample if D is in M(H∗)\M(H) and there exists no subrule D′ of D strictly smaller than D such that D′ ∈ M(H∗)\M(H). If D is a subrule of some rule E then we call D a smallest positive sub-counterexample of E. Definition 12. For a positive counterexample D of H∗ w.r.t. H and k ≥ 0, we define the LTT(k)-expansion of D by Expand(D, k) = { C ∈ LR(k) | C H∗ D, C ∈ M(H∗) }.
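To illustrate Definition 11 (and anticipating the proof of Lemma 3), a Shrink procedure can be sketched as below. The helpers subrules, mem_query and in_hypothesis are assumptions of this sketch: an enumerator of the proper subrules E/π, the membership oracle, and the polynomial-time test of Lemma 1, respectively.

```python
# Illustrative sketch: compute a smallest positive sub-counterexample of E
# by descending to strictly smaller subrules that are still in M(H*) \ M(H).

def shrink(E, H, subrules, mem_query, in_hypothesis):
    D = E
    changed = True
    while changed:
        changed = False
        for Dp in subrules(D):                       # proper subrules of D
            if mem_query(Dp) and not in_hypothesis(Dp, H):
                D = Dp                               # still a positive counterexample
                changed = True
                break
    return D      # no strictly smaller positive sub-counterexample exists
```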
The following lemmas state that these procedures Shrink and Expand work in polynomial time. Lemma 3. Given a positive counterexample E of H∗ w.r.t. H, a smallest positive sub-counterexample of E can be computed in polynomial time in m and n using O(n^2) membership queries, where m = card(H) and n = size(E). Proof. For all pairs π ∈ occ(E), we can check whether the subrule D = E/π satisfies D ∈ M(H∗) using a membership query, and whether D ∈ M(H) in O(mn^5) time by Lemma 1. □ Lemma 4. Let D be a positive counterexample of H∗ w.r.t. H. Then, (i) the cardinality of the set Expand(D, k) is bounded by O(n^{2k}). Furthermore, (ii) Expand(D, k) can be computed in polynomial time using O(kn^{2k}) membership queries. Now, we prove the correctness of the algorithm LEARN LTT. Let Σ be a fixed ranked alphabet and H∗ be the target translation system over Σ. In what follows, let H0, H1, . . . , Hn, . . . and E0, E1, . . . , En, . . . (n ≥ 0), respectively, be the sequence of hypotheses asked in the equivalence queries by the algorithm and the sequence of counterexamples returned by the queries. H0 is the initial hypothesis ∅, and at each stage n ≥ 1, LEARN LTT makes an equivalence query EQ(Hn−1), receives a counterexample En, and produces the next hypothesis Hn from En and Hn−1. For an unknown rule C ∈ H∗, if C ∉ M(H) then we say that C is missing w.r.t. H, and otherwise we say that C is covered by H. Lemma 5. Let D be any smallest positive sub-counterexample of H∗ w.r.t. H. Then, there exists some missing rule C ∈ H∗ w.r.t. H such that C H∗ D. Proof. Since D ∈ M(H∗), we know that there exist some rule C = (l, r) ∈ H∗ and some (pi, qi) ∈ M(H∗) for every i such that D = (l[p1, . . . , pk], r[q1, . . . , qk]), and thus C H∗ D. Note that each (pi, qi) ∈ M(H∗) is a subrule strictly smaller than D. Since D is a smallest positive sub-counterexample, we can see that (pi, qi) ∈ M(H) for all i. On the other hand, if C ∈ M(H) then the contradiction D ∈ M(H) is derived by the definition of the set M(·). Hence the result follows. □ Lemma 6. If C H∗ D and C ∈ M(H∗) for some C ∈ LR(k), then C ∈ Expand(D, k). Lemma 7. For every n ≥ 0, M(Hn) ⊆ M(H∗), and furthermore, the counterexample En is positive. Proof. By construction, it is easy to see that Expand(D, k) ⊆ M(H∗), and thus every rule added to H is a member of M(H∗). By induction on the construction of M(H), we can show that H ⊆ M(H∗) implies M(H) ⊆ M(H∗). This proves the lemma.
Lemma 8. For every n ≥ 0, let cn ≥ 0 be the number of missing rules in H∗ w.r.t. Hn. Then c0 = m > c1 > · · · > cn > · · ·, where m = card(H∗). Proof. First suppose that cn = 0. Then Hn covers all rules in H∗, and thus M(Hn) ⊇ M(H∗). Since M(Hn) ⊆ M(H∗) by Lemma 7, this implies that Hn ≡ H∗ and the algorithm terminates. On the other hand, suppose that cn > 0. Since there is some missing rule C in H∗, we know Hn ≢ H∗, and thus a positive counterexample E is given to the algorithm. Then by Lemma 3 a smallest positive sub-counterexample D is obtained. By Lemma 5 and Lemma 6, Expand(D, k) contains some missing rule C ∈ H∗. Thus, cn−1 > cn holds. □ Theorem 5. The algorithm LEARN LTT of Figure 1 exactly identifies any translation system H∗ in LTT(k) using O(m) equivalence queries and O(kn^{2k}) membership queries. Proof. By construction of the algorithm, if it terminates then H = Hn ≡ H∗. Hence, the correctness of the algorithm immediately follows from Lemma 8. The time complexity and the query complexity follow from Lemma 3 and Lemma 4 (ii). Hence the theorem is proved. □
5 Conclusion
In this paper, we consider the extraction and reconstruction problems for semistructured data. We first show that a nontrivial subproblem of the erasing isomorphism problem is in P and that the erasing homomorphism problem is NP-complete. It is an open question whether there is a complexity gap between the isomorphism and the homomorphism problem. Next, we show that if we allow a learner additional information obtained by active queries, then the class of k-variable linear translation systems is polynomial-time identifiable using equivalence and membership queries w.r.t. the translation relation.
References 1. D. Angluin, “Queries and concept learning,” Machine Learning, vol.2, pp.319–342, 1988. 2. H. Arimura, “Learning Acyclic First-order Horn Sentences From Entailment,” Proc. 7th Int. Workshop on Algorithmic Learning Theory (LNAI 1316), pp.432– 445, 1997. 3. H. Arimura, H. Ishizaka, and T. Shinohara, “Learning unions of tree patterns using queries,” Theoretical Computer Science, vol.185, pp.47–62, 1997. 4. T. Bray, J. Paoli, C. M. Sperberg-McQueen, “Extensible Markup Language (XML) Version 1.0,” W3C Recommendation 1998. http://www.w3.org/TR/REC-xml 5. J. Clark (ed.), “XSL Transformations (XSLT) Version 1.0,” W3C Recommendation 1999. http://www.w3.org/TR/xslt 6. N. Dershowitz, J.-P. Jouannaud, “Rewrite Systems”, Chapter 6, Formal Models and Semantics, Handbook of Theoretical Computer Science, Vol. B, Elsevier, 1990.
7. F. Drewes, “Computation by Tree Transductions”, Ph D. Thesis, University of Bremen, Department of Mathematics and Informatics, February 1996. 8. M. Frazier and L. Pitt, “Learning from entailment: an application to propositional Horn sentences”, Proc. 10th Int. Conf. Machine Learning, pp.120–127, 1993. 9. K. Hirata, K. Yamada, and M. Harao, “Tractable and Intractable Second-Order Matching Problems,” Proc. 5th Annual International Computing and Combinatorics Conference, LNCS 1627, pp.432–441, 1999. 10. R. Khardon, “Learning function-free Horn expressions,” Proc. COLT’98, pp.154– 165, 1998. 11. P. Kilpelainen and H. Mannila, “Ordered and unordered tree inclusion,” SIAM J. Comput., pp.340–356, 1995. 12. L. G. Valiant, “A theory of learnable,” Commun. ACM, vol.27, pp.1134–1142, 1984.
Counting Extensional Differences in BC-Learning
Frank Stephan¹ and Sebastiaan A. Terwijn²
¹ Mathematisches Institut, Universität Heidelberg, Im Neuenheimer Feld 294, 69120 Heidelberg, Germany, [email protected]
² Vrije Universiteit Amsterdam, Department of Mathematics and Computer Science, De Boelelaan 1081a, 1081 HV Amsterdam, The Netherlands, [email protected]
Abstract. Let BC be the model of behaviourally correct function learning as introduced by Bārzdins [4] and Case and Smith [8]. We introduce a mind change hierarchy for BC, counting the number of extensional differences in the hypotheses of a learner. We compare the resulting models BCn to models from the literature and discuss confidence, team learning, and finitely defective hypotheses. Among other things, we prove that there is a tradeoff between the number of semantic mind changes and the number of anomalies in the hypotheses. We also discuss consequences for language learning. In particular we show that, in contrast to the case of function learning, the family of classes that are confidently BC-learnable from text is not closed under finite unions. Keywords. Models of grammar induction, inductive inference, behaviourally correct learning.
1 Introduction
Gold [10] introduced an abstract model of learning computable functions, where a learner receives increasing amounts of data about an unknown function and outputs a sequence of hypothesis that has to converge to a single explanation, i.e. a program, for the function at hand. This concept of explanatory or Ex-learning has been widely studied [8,10,11,15]. A recurring theme is the question how often the learner can change its hypothesis, and how conscious it is of this process: does the learner know when it has converged and how fast does the learner see when new data requires the hypothesis to be changed. Gold [10] already observed that a learner who knows when the correct hypothesis has been found is quite restricted: such a learner can wait until it has the correct hypothesis and then output a single but correct guess. Therefore such a learner can never learn a dense class of functions, which requires to be able to withdraw and change to a new hypothesis at arbitrary late time points, as in the model Ex. ? ??
Supported by the Heisenberg program of the German Science Foundation (DFG), grant no. Ste 967/1–1 Supported by a Marie Curie fellowship of the European Union under grant no. ERBFMBI-CT98-3248.
Another well-studied paradigm is the model BC of behaviourally correct learning [4,8]. The difference with the Ex-model lies in the notion of convergence: Whereas in Ex the syntax of the hypotheses of the learner is required to converge, i.e. convergence is intensional, in BC the semantics of the hypotheses should converge, i.e. convergence is extensional. B¯ arzdins [4] showed that behaviourally correct learners can identify classes on which no Ex-learner succeeds. BC-learners are quite powerful: Steel [8] noticed that the concept of syntactic convergence to an almost everywhere correct hypothesis can be covered by a error-free BClearner. Furthermore, Harrington [8] showed that a further generalization of BClearners, namely those which almost always output finite variants of the function to be learned, can identify all recursive functions. There are many models of learning in which the number of changes in hypothesis, also called mind changes, is counted. Previous studies focussed mainly on intermediate notions employing syntactic convergence. In particular B¯ arzdins and Freivalds [5] initiated the analysis of Ex-learning with a bound on the number of mind changes. Freivalds and Smith [9] generalized this concept by using recursive ordinals which are counted down recursively at every mind change. In Section 3 we introduce the models BCn where the BC-learner may make at most n semantic mind changes on any function to be learned. It is shown that the classes BCn form a proper hierarchy that is incomparable to Ex-learning. Ambainis, Jain, and Sharma [1] showed that a class of functions is Exlearnable with an ordinal number of mind changes if and only if it can be learned by a machine which converges on every function to some hypothesis, even on the nonrecursive ones. Following Osherson, Stob, and Weinstein, we call a learner that converges on all functions confident. This notion can be generalized to BC: A BC-learner is confident if it converges semantically on every function. Rather than defining ordinal mind change bounds for BC (which is possible, but not carried out in the current version of this paper) we can take instead the characterization of ConfEx as an alternative starting point and study ConfBC. This we do in Section 4. Here we show among other things that the result that all classes Exn are in confident Ex also holds in the case of semantic convergence: Every BCn -learnable class has a confident BC-learner. In Section 5 we complicate the discussion by considering hypotheses which are finitely defective. The more noticeable difference with the Ex case is that here there is a tradeoff between anomalies and mind changes. We prove that BC1 , the first nontrivial level of the BCn hierarchy (since BC0 coincides with Ex0 ), is not contained in OEx∗ , a learning criterion from Case and Smith [8]. This improves a result from [8]. Finally, in Section 6 we discuss consequences for grammatical inference. In [10] Gold also introduced a model of learning recursively enumerable sets (in this context also called languages), which is more general than the model of learning recursive functions. The negative results obtained in the previous sections for function identification immediately imply their counterparts for language identification. In this section we discuss the positive counterparts. In contrast to the case of function learning we show that the family of classes that are confidently
BC-learnable from text is not closed under finite unions. We do this by constructing a certain class of finite thickness that also shows that a result from [20] is optimal. The results in this paper are part of a tradition in learning theory as documented in [11]. This tradition may be seen as a general computability theoretic background for themes in learning theory. In our view (a view put forth by Angluin), this tradition in the style of Gold relates to more practical paradigms such as Valiants pac-learning model (see e.g. [12]) as recursion theory relates to complexity theory. In our opinion, a full understanding of various phenomena in learning requires a broad development of the area, just as a deep understanding of computability needs a broad development of the theory of recursive functions.
2 Preliminaries and Notation
We recall the following definitions. For the notation used we refer to the end of this section. A function M from finite sequences of natural numbers to ω Ex-identifies a recursive function f if k = limn→∞ M(fn) exists and is a code for f, i.e. ϕk = f. M BC-identifies f if for almost every n, M(fn) is a code for f, i.e. ϕM(fn) = f. Ex and BC denote the families of classes that are identifiable by a recursive Ex and BC learner, respectively. In the literature on inductive inference, it is customary to allow a learner to initially output the symbol "?", which does not count as a numerical hypothesis. This is relevant when counting the number of mind changes that a learner makes on given input data. We say that a learner M makes a mind change on f at n if M(fn−) is not equal to "?" and M(fn−) ≠ M(fn). A class of recursive functions C is in Exm if there is a recursive learner that identifies every f ∈ C by making at most m mind changes on f. We will also consider team learning [16,18]. Recall that for a learning criterion C, a class A is in [n, m]C if there is a team consisting of m learners such that for every f ∈ A at least n of these learners C-identify f. We will use the following notation. For a function f, fn denotes the string f(0)f(1)f(2) . . . f(n−1). Our recursion theoretic notation is standard and follows Odifreddi [14] and Soare [19]. So ϕe is the e-th partial recursive function, and ϕe,s(x) is the result of running ϕe for s steps on input x. ω is the set of natural numbers. ⟨·, ·⟩ denotes a standard pairing function. For a string σ, |σ| is the length of σ. σ− = σ(0)σ(1) . . . σ(|σ| − 2), that is, σ with the last bit left off. The disjoint union of two functions f and g is denoted by f ⊕ g.
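Purely as an illustration of the definitions of this section (not part of the paper), the number of syntactic mind changes of a learner on a finite prefix can be simulated as follows, modelling the learner as a Python callable from tuples of values to an integer hypothesis or the symbol '?'.

```python
# Illustrative sketch: count the syntactic mind changes of a learner M on
# the prefix f(0), ..., f(n-1), following the definition above.

def syntactic_mind_changes(M, prefix):
    changes = 0
    for n in range(1, len(prefix) + 1):
        prev = M(tuple(prefix[:n - 1]))      # M(f_n^-): one value less
        curr = M(tuple(prefix[:n]))          # M(f_n)
        if prev != '?' and prev != curr:     # a mind change at n
            changes += 1
    return changes

# Toy learner that "codes" the last value it has seen:
toy_learner = lambda seq: '?' if not seq else seq[-1]
print(syntactic_mind_changes(toy_learner, [3, 3, 2, 2, 1]))   # prints 2
```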
3 Semantic Mind Changes
It is clear that the notion of mind change as defined above is not useful for the study of the model of BC-learning, since in this model the inductive inference machine does not have to converge to a particular code of a function but may infinitely often output a different code, as long as in the limit these codes denote the same recursive function. In other words, in the limit the outputs of the
function may differ syntactically, but semantically they must be the same. This brings us to define a notion of mind change for BC-learning as follows. Definition 3.1. A machine M BCn-identifies a recursive function f (or: M BC-identifies f with at most n semantic mind changes) if M BC-identifies f such that the cardinality of the set {k : M(fk) ≠ ? ∧ ϕM(fk) ≠ ϕM(fk+1)} is at most n. BCn denotes the family of classes that can be recursively BCn-identified. That is, the machine M is allowed only n semantic mind changes, i.e. changes of output from e0 to e1 such that ϕe0 ≠ ϕe1. Here, as in the case of Exn, an initial sequence of empty hypotheses "?" is allowed. In the following, when we speak about mind changes it will depend on the model under consideration what we mean: if the model is defined using the basic model Ex we will always mean 'mind change' in the previously defined, syntactic, sense, and if the model is a variant of BC we will always use the semantic meaning of the word mind change. We now state the basic properties of the model BCn and show how it relates to the other models. Theorem 3.2. 1. BC0 = Ex0. 2. Exn ⊂ BCn for n ≥ 1. 3. For every n it holds that Exn+1 ⊈ BCn. 4. Ex ⊈ ⋃n BCn. 5. BC1 is not contained in Ex. Proof. 1. Ex0 ⊆ BC0 by definition. To Ex0-identify a BC0-class, simply output the first hypothesis of the BC0-machine that is unequal to "?". Since the BC0-learner is not permitted to change this hypothesis semantically, it is correct. 2. follows from 5. Items 3 and 4 can be proven by a well-known argument that is used in Theorem 5.2 to obtain a more general result. Item 5 will be proved in Theorem 5.7.
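Whether two hypotheses are semantically different is of course undecidable in general, so Definition 3.1 cannot be checked effectively. Purely as an illustration, the sketch below counts only the semantic mind changes that are witnessed within a fixed simulation budget; the interpreter run(e, x, steps) and the bounds are assumptions of the sketch, not part of the paper.

```python
# Illustrative sketch: a lower bound on the number of semantic mind changes
# of a learner M on a prefix, by bounded simulation of consecutive hypotheses.
# run(e, x, steps) is an assumed interpreter returning the output of program e
# on x if it halts within 'steps' steps, and None otherwise.

def witnessed_semantic_changes(M, prefix, run, steps=1000, max_arg=50):
    changes = 0
    for k in range(1, len(prefix)):
        e0, e1 = M(tuple(prefix[:k])), M(tuple(prefix[:k + 1]))
        if e0 == '?' or e1 == '?':
            continue
        for x in range(max_arg):                 # look for a witnessed difference
            v0, v1 = run(e0, x, steps), run(e1, x, steps)
            if v0 is not None and v1 is not None and v0 != v1:
                changes += 1
                break
    return changes    # only a lower bound: unwitnessed differences are missed
```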
4 Confidence
The notion of confidence was defined by Osherson, Stob, and Weinstein, in the first edition of [11], for set-learners. We can define confidence for function-learners in the following analogous way: Definition 4.1. An Ex-learner is called confident if it converges on every function. (This is in general not the same as only requiring convergence on all recursive functions, see Ambainis, Freivalds, and Smith [2], and Sharma, Stephan, and Ventsov [17].) A BC-learner is called confident if it BC-converges on every function. We denote by ConfEx the family of classes that are learnable by a recursive and confident Ex-learner, and by ConfBC the family of classes that are learnable by a recursive and confident BC-learner.
Ambainis, Jain, and Sharma [1] showed that a class is confidently Ex-learnable if and only if it can be Ex-learned with a countable ordinal number of mind changes. In particular, every class that is Ex-learnable with a constant number of mind changes is also confidently Ex-learnable. The next result is the corresponding one for BC: every class BCn is in ConfBC. It needs a new proof technique since the semantic mind changes cannot be directly detected and counted down as in the case of Ex-learning. In particular, on inputs not corresponding to functions in S the learner does not directly know whether it has already made n semantic mind changes or not. Theorem 4.2. ⋃n BCn is included in ConfBC. Proof. Let M be a BCn-learner. We transform M into a confident BC-learner M′ that identifies at least the functions that M identifies. In order to do this, we try to keep score of how many semantic mind changes M has made. By a standard argument [11, Proposition 4.22] we may assume without loss of generality that M is total. As soon as we think that there have been at least n semantic changes we stick to the last hypothesis of M′, and we follow M's hypotheses otherwise. Call a pair (σ, x) a candidate for a mind change at stage s if ϕM(σ−),s(x) ≠ ϕM(σ),s(x). (By convention the inequality also holds if one of the sides is undefined while the other is not.) Let a witness be a set of n candidates for a mind change (σ, x) where all σ's are different. At stage s (where s is the length of the sequence of incoming data) we try to find a witness, if none was found already at a previous stage. Every time a witness turns out to be invalid, at stage s say, we choose as new hypothesis that of M on s, and say that the witness is abandoned. Now if M indeed makes n mind changes then after a finite number of stages we will have found a witness that is never abandoned at a later stage. On the other hand, if M makes less than n mind changes, every witness is abandoned at a later stage, and hence the set of hypotheses of M′ is unbounded in that of M. Hence, M′ learns all the functions that M BCn-learns, and if M makes n or more semantic mind changes M′ eventually gets stuck at one hypothesis. Hence, since M is total, M′ is confident. The following theorem shows that the inclusion in Theorem 4.2 is strict. Theorem 4.3. ConfEx is not contained in ⋃n BCn. Proof. Let D be the class of all nonincreasing functions. A standard diagonalization shows that D ∉ BCn for any n. On the other hand, D ∈ ConfEx: since any f ∈ D can step down at most f(0) times, we can identify D by a confident learner that on any input σ makes sure that no more than σ(0) syntactic changes have been made. Blum and Blum [6] showed that Ex is not closed under finite unions. That the same holds for BC was proved by Smith [18]. In contrast to this result, the confident version of BC is closed under finite unions, as is the confident version of Ex [1,17].
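As a concrete illustration of the last step in the proof of Theorem 4.3 (a sketch under our own assumptions, not the authors' construction in detail): a confident learner for the class D of nonincreasing functions only changes its committed guess when the data step down, which can happen at most σ(0) times. The helper make_code, turning "agree with this prefix and stay constant afterwards" into a program code, is assumed.

```python
# Illustrative sketch: a confident learner for the class D of nonincreasing
# functions.  It commits to the prefix up to the last step-down seen, and
# never performs more than prefix[0] syntactic changes, so it converges on
# every function, not only on members of D.

def confident_learner_for_D(prefix, make_code):
    if not prefix:
        return '?'
    guess_point = 0                  # last position where the guess was updated
    changes = 0
    for n in range(1, len(prefix)):
        if prefix[n] < prefix[n - 1] and changes < prefix[0]:
            guess_point = n          # a step-down: recommit the guess here
            changes += 1
    return make_code(tuple(prefix[:guess_point + 1]))   # constant after this prefix
```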
Theorem 4.4. ConfBC is closed under finite unions. Proof. Given two confident BC-learners M0 and M1 we define a new confident BC-learner M. Given a recursive function f, define M(fn) to be (a code of) the following program. M(fn) searches for the least ⟨i, s⟩, with i ∈ {0, 1} and s ∈ ω, for which
(∀x < n)[ϕMi(fn),s(x) = fn(x)].     (1)
If such a ⟨i, s⟩ is found, M(fn) on input y simulates the computation Mi(fn)(y). Otherwise ϕM(fn)(y) is undefined. This concludes the definition of M. Now it is clear that if f is BC-identified by both M0 and M1 then it is also BC-identified by M. If f is identified by one and not the other, then by confidence let m be so large that both Mi have converged on f, i.e. ∀s ≥ m (ϕMi(fs) = ϕMi(fm)) for i ∈ {0, 1}. Then after stage m one of the two learners always outputs semantically the same wrong output, hence for n large enough this output can never satisfy (1). Therefore, for large enough n, M always selects the right hypothesis of the two. Recall the notion of team learning from the introduction. The previous result can be seen as a result on team learning: in the proof of Theorem 4.4 we showed that two confident BC-learners can be replaced by one. By induction we see that a team of confident BC-learners can be replaced by one confident learner BC-identifying at least the same functions.
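The combined hypothesis in the proof of Theorem 4.4 can be illustrated as follows; the sketch replaces the unbounded search over pairs ⟨i, s⟩ by a large bound, and run(e, x, steps) is the same assumed bounded interpreter as before, so this is only an approximation of the construction.

```python
# Illustrative sketch: the program coded by M(f_n) in the proof of Theorem 4.4,
# given the data prefix f_n and the two hypotheses e0 = M0(f_n), e1 = M1(f_n).

def combined_hypothesis(prefix, e0, e1, run, y, max_search=10**6):
    """Value of the combined program on input y (None stands for 'undefined')."""
    n = len(prefix)
    for code in range(max_search):              # dovetail over pairs <i, s>
        i, s = code % 2, code // 2
        e = (e0, e1)[i]
        # condition (1): the i-th hypothesis agrees with f_n on all x < n within s steps
        if all(run(e, x, s) == prefix[x] for x in range(n)):
            return run(e, y, max_search)        # then simulate that hypothesis on y
    return None
```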
5 Anomalous Hypotheses
In this section we discuss learning with a finite number of anomalies. In both the Ex and the BC case it is known that allowing final hypotheses that are defective at a finite number of inputs, either by being undefined or by giving the wrong answer, increases the number of classes that can be effectively learned. For partial functions ϕ and ψ, let ϕ =∗ ψ denote that for almost every x, ϕ(x) = ψ(x). (As usual, we take ϕ(x) = ψ(x) to mean that if one of ϕ(x),ψ(x) is undefined, then the other one is too.) Similarly, ϕ =n ψ means that ϕ = ψ with the possibility of at most n exceptions. Now Ex∗ and Exn are defined as Ex with =∗ and =n instead of =. Similarly for BC∗ and BCn . For example M BCn -identifies a function f if for almost every k, ϕM (fk ) =n f . We define BCnm as follows. Definition 5.1. A learner M BCnm -identifies a function f whenever M BCn identifies f with at most m semantic mind changes. BCnm denotes the family of classes that can be recursively BCnm -identified. (There is at least one other (nonequivalent) way of defining BCnm , where one also counts semantic mind changes modulo finite differences. That is, one could define a BCnm -learner as a learner where ϕM (fk ) =n f for almost all k and where
there are at most m numbers k with M(fk) ≠ ? and ϕM(fk) ≠n ϕM(fk+1). But this definition is mathematically less elegant. For example, the relation "=n" is not transitive, and so it might happen that ϕM(fk) =n ϕM(fk+1) and ϕM(fk+1) =n ϕM(fk+2) while ϕM(fk) ≠n ϕM(fk+2). Furthermore, there would be nontrivial collapses like BC10 = BC20 with respect to the alternative definition.) J. Steel [15] noticed that Ex∗ ⊆ BC. The next result shows that a smaller bound on the number of mind changes cannot be compensated by permitting errors and using semantic instead of syntactic mind changes. Note that the result provides the omitted proofs of the third and fourth items of Theorem 3.2. Theorem 5.2. For every n it holds that Exn+1 ⊈ BC∗n. Furthermore, Ex ⊈ ⋃n BC∗n. Proof. The family D from the proof of Theorem 4.3 is also outside BC∗n for every n; thus D witnesses that Ex is not contained in ⋃n BC∗n. Furthermore, the families Dn = {f ∈ D : f(0) ≤ n} are in Exn but not in BC∗m for all m < n. In Blum and Blum [6, p152] it is stated that Ex∗ ⊈ Ex. Bārzdins [4] proved that BC ⊈ Ex. Case and Smith [8, Theorem 2.3] proved that the class S1 = {f : ϕf(0) =1 f} is in Ex1 − Ex. Clearly S1 ∈ BC1, so it follows immediately that BC1 ⊈ Ex. Case and Smith and Harrington [8, Theorem 3.1] proved that the class {f : ∀∞x (ϕf(x) = f)} is in BC − Ex∗. From this proof actually follows the stronger statement that the smaller class X = {f : ∃n ∀i (i ≤ n → ϕf(i) = ϕf(0) ∧ i > n → ϕf(i) = f)} is in BC − Ex∗. Since X is clearly in BC1, this gives us Theorem 5.3. BC1 is not included in Ex∗. Theorem 5.3 will be improved in Theorem 5.7. The following result shows that in the BC model there is a tradeoff between mind changes and anomalies. Note that this is different in the Ex model, where there is no such tradeoff. Namely, Case and Smith [8] proved that Ex10 is not contained in Ex. Tradeoff results for a different notion of mind change in the context of vacillatory function identification were studied in Case, Jain, and Sharma [7]. Theorem 5.4. For all n and m it holds that BCnm is included in BCn(m+1)+m. For n > 0 the inclusion is strict. Furthermore, the bound n(m+1)+m is optimal. Proof. Proof of the inclusion: Let M be a BCnm-learner. We will try to overcome anomalies by hard-wiring bits of the input data, in such a way as to make the least possible number of semantic changes. Hard-wiring all values of the input data can already make this number recursively unbounded when the first hypotheses of M are wrong, so we have to be more careful. Since we know that the "final" hypotheses of M are faulted at at most n places, we never patch more than n inputs. That is, we transform every hypothesis M(σ) into a hypothesis
M 0 (σ) that implements the following algorithm. Compute the longest τ σ such that there are at most n places x ∈ dom(τ ) with either ϕM (τ ),s (x) ↑ or ϕM (τ ),s (x) ↓6= τ (x), where s = |σ|. Then let τ (x) if x ∈ dom(τ ), ϕM 0 (σ) (x) = ϕM (τ ) (x) if x ∈ / dom(τ ). So the algorithm has two ingredients: delaying and patching. It is easy to verify that every mind change is either caused by patching some x with τ (x) which has been incorrect before or by following an original mind change of M . Between two (delayed) semantic mind changes of M there are at most n places at which M 0 causes a mind change by patching one input. So patching may induce up to n mind changes between two delayed ones plus n mind changes before the first (delayed) mind change of M and n mind changes after the last (delayed) mind change of M . Together with the up to m original mind changes of M this gives altogether at most n(m + 1) + m mind changes. Furthermore the last hypothesis of M agrees with the function to be learned on all but up to n places. These at most n places are repaired by patching. So whenever M BCnm -learns a function f , M 0 BCn(m+1)+m -learns the same function f . Proof of the strictness of the inclusion when n > 0: This follows immediately from Theorem 5.2. Proof of the optimality of the bound: We prove that Exnm is not included in BCn(m+1)+m−1 . Consider the class S of functions that are zero at all but up to n(m + 1) + m inputs. Then S 6∈BCn(m+1)+m−1 by a standard diagonalization, making use of the fact that a nonzero value of a function in S cannot be predicted. On the other hand, S ∈ Exnm because an Exnm -learner can output a new guess after having seen n + 1 nonzero values. In this way, with m mind changes the Exnm -learner can handle (n + 1)(m + 1) − 1 nonzero values. Hence S ∈ Exnm . Next we consider teams. First we prove that BCn ⊆ [1, n + 1]Ex and that BCn 6⊆ [1, n]Ex∗ . Theorem 5.5. BCn is strictly included in [1, n + 1]Ex for every n. Proof. Let M witness that S ∈ BCn and let Sk be the subclass of those functions in S where M makes exactly k semantic mind changes. Clearly S = S0 ∪ S1 ∪ . . . ∪ Sn . For each class Sk there is an Ex-learner Mk : The machine Mk searches for the least tuple (σ0 , x0 , σ1 , x1 , . . . , σk−1 , xk−1 , σk ) which is a candidate for witnessing k semantic mind changes. Mk computes at every fm f an m-th approximation to this tuple and outputs M (σk ) for this approximation. The search conditions for this tuple to witness the k semantic mind changes are the following three. – σ0 ≺ σ1 ≺ . . . ≺ σk f where f is the function to be learned, – M (σh ) 6= ? for every h ≤ k,
– ϕM (σh ) (xh ) 6= ϕM (σh+1 ) (xh ) (i.e. either exactly one of the values is undefined or both are defined but different) for every h < k. Note that for the learner M0 the first and the third condition are void so that the only search condition is to find some σ0 f with M (σ0 ) 6= ?. The last condition can only be verified in the limit, so it might happen that either a correct tuple needs some time until it qualifies or that some incorrect tuple is considered to be a candidate until it is disqualified. If f ∈ Sk then there exists such tuples and Mk converges to the least one of them. It follows that Mk (fm ) converges to M (σk ) for the σk of this least tuple. The candidates for the mind changes are then correct. So M makes k mind changes before seeing σk and no mind change later. So M (σk ) is indeed a program for f and Mk is an Ex-learner for Sk . It follows that the team M0 , M1 , . . . , Mn infers the whole class S with respect to the criterion [1, n + 1]Ex. The strictness of the inclusion follows from (the proof of) Theorem 4.3 showing that the class D is in ConfEx and thus in [1, n + 1]Ex, but not in BCn . Theorem 5.6. BCn is not included in [1, n]Ex∗ for every n. Proof. Let S1 = {f : ϕf (0) =1 f }. (See also the discussion preceding Theorem 5.3.) Let S = {f1 ⊕ · · · ⊕ fn : fi ∈ S 1 }. It follows from Kummer and Stephan [13, Theorem 8.2] that S 6∈[1, n]Ex, whereas it is easy to see (by combining the codes of the fi ) that S ∈ Exn0 ⊆ BCn . To obtain a result for [1, n]Ex∗ , define the cylindrification Cyl(S) = {f : (∃g ∈ S)(∀x, y)[f (hx, yi) = g(x)]}. Since for any class A it holds that Cyl(A) ∈ [1, n]Ex if and only if Cyl(A) ∈ [1, n]Ex∗ , and Cyl(A) ∈ [1, n]Ex implies A ∈ [1, n]Ex, it follows that Cyl(S) 6∈[1, n]Ex∗ . However, the BCn -algorithm for S easily transfers to Cyl(S). Case and Smith [8] introduced the notion of OEx∗ -learning where the learner outputs finitely many indices such that among these indices is also one which coincides with the function f to be learned at all but finitely many inputs. Since Ex∗ ⊆ OEx∗ , the next result improves Theorem 5.3 by showing that BC1 is not even contained in OEx∗ . Theorem 5.7. BC1 is not contained in OEx∗ . Proof. The class Cyl(S1 ), the cylindrification of the class S1 (see the proof of Theorem 5.6), is in BC1 . Suppose for a contradiction that Cyl(S1 ) is in OEx∗ , and that M is a total OEx∗ -learner for it. Now a family of partial functions ψe is constructed, using for each ψe a marker me ; after each step s the domain of ψe is {0, 1, . . . , s} − {me,s } where me,s is the marker position after step s. The intention of construction for ψe is to show that there is a function fe ∈ Cyl(S 1 ) which is an extension of the function hx, yi 7→ψe (x) and which is not OEx∗ -learned by M . – In step 1 define ψe (0) = e and place me on the position 1, that is, let me,1 = 1.
– In step s + 1, s ≥ 1, first compute for all a, b ≤ s + 1 the strings σa,b such that the domain of σa,b is the longest interval {0, 1, . . . , ub } where all pairs hx, yi ≤ ub satisfy x < b and ψe (x) if x 6= me,s , σa,b (hx, yi) = a if x = me,s . – Then check whether there is a value a ≤ s + 1 such that M outputs on some input σ with σa,me,s ≺ σ σa,s+1 a new guess which has not been seen before. – If so, then let ψe (me,s ) = a and move the marker to the next still undefined position of ψe : me,s+1 = s + 1. – If not, then let ψe (s + 1) = 0 and let the marker stay where it is: me,s+1 = me,s . If the marker moves infinitely often then ψe is total; otherwise ψe is defined at all inputs except the end-position me,∞ of the marker me . By the Recursion Theorem there is an index e with ϕe = ψe ; fix such index e and note that all extensions of ψe are in S 1 . If the marker me moves infinitely often, then ψe is total and the function fe given by fe (hx, yi) = ψe (x) is in Cyl(S 1 ). It follows from the construction that M outputs infinitely many different guesses on fe . So M does not OEx∗ -learn fe which gives the desired contradiction for this case. So it remains to look at the scenario when me moves only finitely often and remains on the end-position me,∞ . Now define the functions ψe (x) if x 6= me,∞ , fe,a (hx, yi) = a if x = me,∞ . M shows on all these functions the same behaviour in the sense that it outputs the same finite set E of indices — since otherwise there would be an a permitting a new output outside E and the marker would move again. Furthermore all functions fe,a are in Cyl(S 1 ) and they differ on infinitely many values. So only finitely many of them have a program in E which computes them at almost all places and one can choose a so that the resulting function fe = fe,a is not computed by any of the indices in E at all but finitely many places. So in both cases there is a function fe ∈ Cyl(S 1 ) which is not learned by M under the criterion OEx∗ and it follows that Cyl(S 1 ) is a witness for the non-inclusion BC1 6⊆OEx∗ . Recall the notion of confidence from Definition 4.1. A class is in ConfEx∗ if it is Ex∗ -identified by a learner that converges on every function. Since every Ex∗m -learner can easily be converted into a ConfEx∗ -learner we have the inclusion [1, n]Ex∗m ⊆ [1, n]ConfEx∗ . Furthermore, every ConfEx∗ -learner outputs on every function only finitely many indices, so a team of n ConfEx∗ -learners in total also outputs on every function finitely many indices. Thus it follows that [1, n]ConfEx∗ ⊆ OEx∗ . As a consequence, BC1 is not contained inSany of the just mentioned criteria. Smith [18, Theorem 3.8] proved that BC 6⊆ n [1, n]Ex∗ . This may be compared to the following corollary.
Corollary 5.8. BC1 is neither a subclass of ⋃n,m [1, n]Ex∗m nor a subclass of ⋃n [1, n]ConfEx∗.
Note that it makes sense to consider teams in the case of learning with finitely many errors, since teams of ConfEx∗-learners have more power than single ConfEx∗-learners: the class containing the functions which are zero almost everywhere and the functions which are self-describing is learnable by a [1, 2]Ex∗0 team but not by a single Ex∗-learner [8, Theorem 2.13]. We also remark that the proof of Theorem 4.3 shows that in fact ConfEx is not included in ⋃n BC∗n. Finally, we note that since the classes [a, b]BC0 and [a, b]Ex0 are the same and the exact relation between the classes [a, b]Ex0 is still unknown, the same holds for the classes [a, b]BCn. Nevertheless many results have already been obtained for the inclusion relation of [a, b]Ex0. For a list of references see [11, p 219].
6 Grammar Induction
In this section we make some remarks on grammatical inference. In the previous sections we have been concerned with the inductive inference of computable functions. Here we consider the more general paradigm of learning recursively enumerable sets, or, when we think of the code of an r.e. set as a grammar generating the set, the learning of grammars from pieces of text. The set learning analogs of the models Ex and BC that we studied in the previous sections are defined as follows (we use the notation of [11]): Definition 6.1. Let L be an r.e. set. A text t for L is a (not necessarily recursive) enumeration of the elements of L. The initial segment of length n of t is denoted by tn. A learner M TxtEx-learns L if for every text t for L, limn→∞ M(tn) = e exists and We = L. M TxtBC-learns L if for every text t for L, WM(tn) = L for almost every n. A machine M TxtBCn-learns L (or: M TxtBC-identifies L with at most n semantic mind changes) if M TxtBC-learns L such that the cardinality of the set {k : M(tk) ≠ ? ∧ WM(tk) ≠ WM(tk+1)} is at most n. A class L of r.e. sets is in TxtEx [TxtBC, TxtBCn] when there is a recursive learner that TxtEx-learns [TxtBC-learns, TxtBCn-learns] every L ∈ L. Variants of these classes, such as the analog TxtBCn of BCn, are defined in the obvious way. The definition of confidence for language-learners is as follows: Definition 6.2. A TxtEx-learner is confident if it converges on every text. A TxtBC-learner is confident if it TxtBC-converges on every text. We denote by ConfTxtBC the classes that are TxtBC-learnable by a confident learner. First we note that a negative result on function learning immediately yields a corresponding negative result for language learning, since the latter is a more
general setting. (We can embed the first into the second by interpreting the graph of a recursive function as a simple kind of r.e. set.) Thus, the Theorems 3.2, 4.3, 5.2, 5.3, 5.6, and 5.7 all hold for the corresponding models of language learning. Theorem 4.2 remains valid for language learning, with the same proof. The following simple result shows that Theorem 5.4 does not transfer. Theorem 6.3. (See [11, page 145, 147]) TxtBC10 is not contained in TxtBC. Proof. Consider the class X = {We : We =1 ω}. X ∈ TxtBC10 since it is identified by the learner that always outputs a code for ω. On the other hand, it follows from Angluin’s characterization of (unbounded) identifiability [11, Theorem 3.26] that X is not learnable by any learner (even when nonrecursive learners are allowed). In particular X is not TxtBC-learnable. Finally, it is easy to see that the idea for the proof of Theorem 5.5 can be used to show that this result also holds for language learning. The only thing we have left open is the corresponding statement of Theorem 4.4, which we take care of now. We want to show that Theorem 4.4 does not hold for language learning. For this we use the following result, which is interesting in itself. First a definition: Definition 6.4. Let L be a collection of r.e. sets. (i) (Angluin [3]) L has finite thickness if for every finite D 6= ∅ the collection {L ∈ L : D ⊆ L} is finite. (ii) L is finite-to-1 enumerable if there is a recursive function f such that L = {Wf (i) : i ∈ ω} and for every member L ∈ L there are at most finitely many i such that L = Wf (i) . (Note that this finite number may depend on L.) Similarly, L is 1-1-enumerable if it has an enumeration in which every set has only one code. Theorem 6.5. There exists a uniformly r.e. collection L that has finite thickness and that is not in TxtBC. Proof. The proof is an adaptation of the proof of Theorem 3.1 in Terwijn [20] (which showed that there is a 1-1-enumerable identifiable collection of recursive sets that is not in TxtBC). The collection L contains for every e a subclass Le such that the e-th partial recursive function ϕe does not TxtBC-learn Le . To separate the strategies for different e we let the elements of Le be subsets of ω [e] . The classes Le are uniformly enumerated as follows. Le will contain Le,0 = ω [e] , a certain diagonal set Le,1 , and sets Le,j , j > 1, such that one of the following cases holds: – ϕe does not TxtBC-learn Le,1 . In this case every Le,i , i > 1, will be either empty or equal to Le,0 . – ϕe does not TxtBC-learn a Le,j with j > 1. In this case every Le,i with 1 < i < j will equal Le,0 and all Le,i with i > j will be empty. The construction of Le is now as follows. We use auxiliary variables xe,j and σj .
Initialization: Let σ0 be the empty string, Le,0 = ω [e] , Le,j = ∅ for all j > 0. In subsequent stages we may add elements to these sets. Stage j. For all i with 1 < i < j, let Le,i = Le,0 . Let Le,j = Le,0 − {xe,0 , xe,1 , . . . , xe,j−1 }. Search for a number xe,j in Le,j and an extension σj of σj−1 such that the range of σj contains only elements from ω [e] − {xe,0 , . . . , xe,j }, ϕ(σj ) is defined, and the set Wϕ(σj ) generated by it contains xe,j . If these are found, add the range of σj to Le,1 . This completes the construction of the Le . Now there are two possibilities: – The construction of Le is completed at every stage j. Then the union of all the σj constitute a text for Le,1 , but ϕe infinitely often outputs an hypothesis that contains a non-element of Le,1 . Hence ϕe does not TxtBC-identify Le,1 . – Stage j in the construction is not completed for some j. In this case xe,j is not found, and the learner ϕ does not overgeneralize on any text for Le,j starting with σj−1 . Hence ϕ does not TxtBC-identify Le,j . Note that every Le has finite thickness since it contains at most the sets Le,0 , Le,1 , and possibly some Le,j . Terwijn [20, Theorem 5.3] showed that a finite-to-1 enumerable collection that has finite thickness is in TxtBC. Theorem 6.5 shows that the hypothesis of finiteto-1 enumerability is necessary. Now we use the proof of Theorem 6.5 to show that the analog of Theorem 4.4 fails for language learning. Theorem 6.6. There are classes C0 and C1 in ConfTxtBC1 such that C0 ∪ C1 is not in TxtBC. Hence neither ConfTxtBC nor TxtBCn , n ≥ 1, is closed under finite unions. Proof. Let L be the collection from the proof of Theorem 6.5. This collection contains for every e a set Le,1 . Let C0 be the collection consisting of all these Le,1 ’s, plus the empty set. Clearly C0 is in ConfTxtBC1 . We now prove that also C1 = L − C0 is in ConfTxtBC1 . Since by Theorem 6.5 C0 ∪ C1 = L is not in TxtBC the theorem follows. We define a confident recursive TxtBC-learner M for C1 . We use the notation of the proof of Theorem 6.5. Given a piece of text σ, M follows the definition of Le,1 for |σ| steps in order to find the first “gap” xe,0 . If xe,0 is not found, M outputs ω [e] as a guess. If xe,0 is found and is in the range of σ, we know that σ is not part of a text for Le,1 , so M can safely output ω [e] . Otherwise, let M (σ) be the program that searches for |σ| steps for as many gaps xe,i as possible. If after |σ| steps xe,0 , . . . , xe,l are found, M (σ) starts to enumerate Le,0 − {xe,0 , . . . , xe,l }. If, however, in the course of this enumeration another gap xe,l+1 is found, M knows its guess is wrong and starts to enumerate all of Le,0 . Now if there is indeed an infinite number of gaps xe,i , then M (σ) is always a code for Le,0 . If there is only a finite number of gaps xe,0 , . . . , xe,l , then M (σ) is almost always a code for Le,0 − {xe,0 , . . . , xe,l }. Note that in this last case there is also at most one semantic mind change. So M is confident, and it TxtBC1 -identifies C1 .
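A rough sketch of the learner M defined in this proof is given below, for illustration only. It relies on two assumed helpers that are not part of the paper: find_gaps(e, steps), which simulates the construction of L_{e,1} for the given number of steps and returns the gaps x_{e,0}, …, x_{e,l} found so far, and a Cantor pairing function used to represent ω^[e] as the e-th column.

```python
# Illustrative sketch (assumptions as stated in the text above) of the
# confident TxtBC1-learner M for the class C1 in the proof of Theorem 6.6.
import itertools

def pair(x, y):
    return (x + y) * (x + y + 1) // 2 + y          # standard Cantor pairing

def M(sigma, e, find_gaps):
    """sigma: a finite piece of text (a tuple of numbers); returns a hypothesis,
    represented here as a generator enumerating the guessed r.e. set."""
    gaps = find_gaps(e, len(sigma))                 # run the construction |sigma| steps
    if not gaps or gaps[0] in sigma:
        # no gap found, or sigma already contains x_{e,0}: guess all of omega^[e]
        return (pair(e, y) for y in itertools.count())
    banned = set(gaps)
    # guess omega^[e] minus the gaps found so far; the program described in the
    # proof would additionally fall back to all of omega^[e] if another gap appears
    return (pair(e, y) for y in itertools.count() if pair(e, y) not in banned)
```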
We note without proof that, in analogy to Theorem 6.6, there are two classes in TxtEx0 whose union is not in TxtEx. However, in Theorem 6.6 one cannot get TxtBC0 instead of TxtBC1 since the union of two classes in ConfTxtEx is in ConfTxtBC, and every TxtExn -learnable class is ConfTxtEx-learnable. Acknowledgments. We thank William Gasarch for helpful discussions.
References 1. A. Ambainis, S. Jain, and A. Sharma, Ordinal mind change complexity of language identification, Proceedings of the Third European Conference on Computational Learning Theory, Springer Lecture Notes in A. I. 1208 (1997) 301–316. 2. A. Ambainis, R. Freivalds, and C. H. Smith, Inductive Inference with Procrastination: Back to Definitions, Fundamenta Informaticae 40 (1999) 1–16. 3. D. Angluin, Inductive inference of formal languages from positive data, Information and Control 45 (1980) 117–135. 4. J. M. B¯ arzdins, Two theorems on the limiting synthesis of functions, Theory of Algorithms and Programs (Latvian State University) 1 (1974) 82-88 (in Russian). 5. J. M. B¯ arzdins and R. Freivalds, On the prediction of general recursive functions, Soviet Mathematics Doklady 13 (1972) 1224-1228. 6. L. Blum and M. Blum, Toward a mathematical theory of inductive inference, Information and Control 28 (1975) 125-155. 7. J. Case, S. Jain, and A. Sharma, Complexity issues for vacillatory function identification, Information and Computation 116(2) (1995) 174–192. 8. J. Case and C. Smith, Comparison of identification criteria for machine inductive inference, Theoretical Computer Science 25 (1983) 193-220. 9. R. Freivalds and C. Smith, On the role of procrastination in machine learning, Information and Computation 107 (1993), 237–271. 10. E. M. Gold, Language identification in the limit, Information and Control 10 (1967) 447-474. 11. S. Jain, D. Osherson, J. S. Royer, and A. Sharma, Systems that learn, An introduction to learning theory, second edition, MIT Press, 1999. 12. M. J. Kearns, U. V. Vazirani, An introduction to computational learning theory, MIT Press, 1994. 13. M. Kummer and F. Stephan, On the structure of degrees of inferability, Journal of Computer and System Sciences 52 (1996) 214–238. 14. P. Odifreddi, Classical Recursion Theory, North-Holland, Amsterdam, 1989. 15. P. Odifreddi, Inductive inference of total functions, in: S. B. Cooper, T. A. Slaman, S. S. Wainer (eds.), Computability, Enumerability, Unsolvability. Directions in Recursion Theory, London Math. Soc. Lecture Note Series 224 (1996) 259–288. 16. D. N. Osherson, M. Stob, and S. Weinstein, Aggregating inductive expertise, Information and Computation 70(1) (1986) 69–95. 17. A. Sharma, F. Stephan, Y. Ventsov, Generalized notions of mind change complexity, Proceedings of the Tenth Conference on Computational Learning Theory (COLT’97), Nashville (1997) 96–108. 18. C. Smith, The power of pluralism for automatic program synthesis, J. ACM 29, Vol. 4 (1982) 1144-1165. 19. R. I. Soare, Recursively enumerable sets and degrees, Springer-Verlag, 1987. 20. S. A. Terwijn, Extensional set learning, Proceedings of The Twelfth Annual Conference on Computational Learning Theory (COLT ’99), Santa Cruz (1999) 243–248.
Constructive Learning of Context-Free Languages with a Subpansive Tree*

Noriko Sugimoto1, Takashi Toyoshima2, Shinichi Shimozono1, and Kouichi Hirata1

1 Department of Artificial Intelligence, 2 Department of Human Sciences, Kyushu Institute of Technology, Kawazu 680-4, Iizuka 820-8502, Japan
{sugimoto,sin,hirata}@ai.kyutech.ac.jp, [email protected]
Abstract. A subpansive tree is a rooted tree that gives a partial order of nonterminal symbols of a context-free grammar. We formalize subpansive trees as background knowledge of CFGs, and investigate query learning of CFGs with the help of subpansive trees. We show that a restricted class of CFGs, which we call hierarchical CFGs, is efficiently learnable in this setting, while it is unlikely to be polynomial-time predictable.
1 Introduction
Language acquisition is one of the central interests to both theoretical computer science and linguistics. In computational learning theory, Angluin [2] showed that the regular languages are efficiently learnable, proposing a polynomial-time algorithm on finite automata with terminal membership queries and terminal equivalence queries. Since then, the learnability of context-free classes has come to be the next research topic, and a number of studies have been reported under different acquisition scenarios: Angluin [1] allowed nonterminal membership queries for learning a k-bounded context-free grammar (CFG); Sakakibara [25,26] and Sakamoto [27] proposed learning algorithms from structured examples with structural queries; Ishizaka [17] dropped nonterminal membership queries for a simple deterministic CFG, but with equivalence queries extended to ask about a grammar from the outside of the target class. For natural language acquisition by humans, on the other hand, children are not always corrected when they produce a "wrong" utterance, or even told when they produce or hear a "wrong" utterance; let alone what is wrong about it. That is, there is no consistently reliable teacher, and structural information is scarcely given. Thus, it is usually held that humans are endowed with some

* The research reported here is partially supported by Grants-in-Aid for Encouragement of Young Scientists, Japan Society for the Promotion of Science, awarded to the last three authors (nos. 12710285, 12780286, 11780284, respectively). The standard disclaimer applies.
kind of “innate” grammatical knowledge that enables them to acquire a target grammar without negative examples [6,23]. This “innate” grammatical knowledge of humans must be abstract and adaptive enough to develop into concrete particular grammars [9,11], of more than 4,000 languages, by modest estimate, known in the world. In this paper, we introduce the concept of a subpansive tree that formalizes a background knowledge to characterize CFGs. A subpansive tree is a rooted tree that represents a partial order on nonterminal symbols of a target CFG. This partial order on nonterminals models a part of the “innate” grammatical knowledge of humans. In natural languages, sentences are not directly made up from words but internally structured; words constitute phrases, and phrases constitute sentences [10,18]. In these structures, we can observe categorical inheritance to a phrase from its constituent head (a part of speech), so that a noun phrase is projected from a noun, for example. This endocentricity [4] is the origin of the context-free base [7], and the opposite exocentricity and discontinuous constituency/dependency are the principal sources that natural languages are considered not to be properly included in the context-free class [8]. The subpansive tree encodes the categorical inheritance found in phrase structures of natural languages, and helps to derive productions of a target CFG (‘expansion rules’ to generate sentences) by guessing a nonterminal symbol into which a substring can be intrajected as ‘subpansion.’ A relation from an ancestor nonterminal A to its descendant nonterminal B implies that from an abstract nonterminal A, specific strings containing B will be obtained in the succeeding productions. This use of a subpansive tree captures an intuition about the importance of structural information for grammatical inference that is different from the one employed in [25,26,27]. Rather than to use specific structural information from examples and queries, the subpansive tree is given as an abstract background knowledge about structures, from which a particular grammar — instructions to build specific structures — develops. Human children are not directly exposed to structural information of sentences, but learn to divide them into phrases and down to words; they lean to structure a string of words into phrases, and up to a sentence, i.e., terminal string into nonterminal symbols, and up to the root. Yet, a string of a context-free language (CFL) can be structurally ambiguous. For instance, a string of terminal symbols abc can have either of the following CFG structures: S
[Figure: two parse trees assigning the terminal string abc two different constituent structures over the nonterminals S, A, and B.]
Here, the set of symbols is exactly the same, but the structures, and hence the CFGs (productions) that generate them, are distinct. This point is already made in [25], but this very ambiguity is what must be resolved by humans through
learning a particular grammar, not to be pre-solved as examples or hints; that is, what it means to learn a grammar is to learn how to structure strings of terminal symbols. Since the number of parts of speech is quite limited and the categorical inheritance readily determines what phrase can expand into what kind of strings of words in natural languages, we take, unlike Ishizaka’s model [17], the set of nonterminal symbols is given, and as a preliminary study, we adopt Angluin’s [1] nonterminal membership queries and terminal equivalence queries, to see the merits and demerits of the subpansive tree in learning CFGs, though nonterminal membership queries are known to be rather powerful. In the following, we first define a subpansive tree of a CFG, and discuss a few fundamental issues on relations between CFGs and subpansive trees that require a mild restriction on productions. Next, we study the learnability of CFGs to which a subpansive tree requires that for every production, all nonterminals in the righthand-side are children of the lefthand-side. We call this class of CFGs the hierarchical CFGs. In this setting, we design a learning algorithm using terminal equivalence and nonterminal membership queries. This algorithm builds a hypothesis constructively, by generalizing the most specific productions. Furthermore, we show that for any size of DNF formulas, there exists a subpansive tree such that if CFGs with a subpansive tree are predictable in polynomial time, then so is DNF formulas. This implies that the class of hierarchical CFGs is unlikely to be polynomial-time predictable.
2 Preliminaries
An alphabet is a non-empty finite set of symbols. For an alphabet X, the set of all finite strings formed from symbols in X is denoted by X ∗ . The empty string is denoted by ε, and X + denotes the set X ∗ − {ε} of non-empty strings. A language over X is a subset of X ∗ . Let Σ and N be alphabets that are mutually disjoint Σ ∩N = ∅. A production A → α on N and Σ is an association form a nonterminal A ∈ N to a string α ∈ (N ∪Σ)∗ . A context-free grammar (CFG, for short) is a 4-tuple (Σ, N, P, S), where S ∈ N is the distinguished start symbol and P is a finite set of productions on N and Σ. Symbols in N are said to be nonterminals, while symbols in Σ are called terminals. Let α and β be strings in (Σ ∪ N )∗ . We say that β is derived from α in one step with G, and denote α ⇒G β, if there exists a production X → χ in P such that, for some α1 , α2 ∈ (Σ ∪ N )∗ , α = α1 Xα2 and β = α1 χα2 . That is, β is obtained from α by replacing one occurrence of A by α. We extend the relation ⇒G to the reflexive and transitive closure ⇒∗G . Let G = (Σ, N, P, S) be a CFG, and A a nonterminal in N . The language LG (A) of A is the set {w ∈ Σ ∗ | A ⇒∗G w}. The language L(G) of G just refers to LG (S). A language L is called a context-free language (CFL, for short) if there exists a CFG G that identifies L = L(G). A CFG G = (Σ, N, P, S) is said to be reduced if every A ∈ N satisfies the following conditions: 1. there exists a string w ∈ Σ ∗ such that A ⇒∗G w,
Constructive Learning of Context-Free Languages
273
2. there exist α, β ∈ (Σ ∪ N )∗ such that S ⇒∗G αAβ, and 3. LG (A) = 6 {ε}. Throughout this paper, every CFG is assumed to be reduced. Next, we prepare the models of learning. Let G be a family of CFGs and L the family of CFLs corresponding to G, that is, L = {L(G) | G ∈ G}. Then, in the context of identifying a grammar G ∈ G through members and nonmembers of L = L(G), the family G (or L) is said to be the target concept, and the grammar G (language L(G) ) is said to be the target grammar (target language, respectively). As far as we concern in this paper, sets of nonterminals of the same size are identified. Also, learning algorithms will deal with only a family GΣ,N of CFGs specified by some fixed Σ and N . Let G = (Σ, N, P, S) be a target CFG in a family GΣ,N , and L the corresponding language L = L(G). A positive example of G (or, equivalently, L) is a string w ∈ L, and a negative example of G (or L) is a string w 6∈L. To obtain these examples or to ensure the target grammar, learning algorithms are allowed to issue queries of the following types to the teacher: For a pair (w, A) ∈ Σ ∗ × N , a nonterminal membership query NM G (w, A) asks whether w is in LG (A). For a grammar G0 ∈ G, an equivalence query EQ G (G0 ) asks whether L(G0 ) = L(G): If the empty string is replied, then the answer is ‘yes;’ Otherwise, if a non-empty string w is replied, then the answer is ‘no,’ and w is either a positive counterexample in L(G) − L(G0 ) or a negative counterexample in L(G0 ) − L(G). A family G of CFGs is said to be polynomial-time learnable via equivalence and nonterminal membership queries if there exists a learning algorithm that use both equivalence and nonterminal membership queries and exactly identifies each G ∈ G in polynomial-time with the maximum length of counterexamples. In the same manner, a family of CFGs is polynomial-time learnable via equivalence queries alone if it can be identified only with equivalence queries in polynomialtime.
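To make the learning model concrete, the following is a small sketch, not taken from the paper, of a teacher answering the two kinds of queries for a fixed target CFG. The class name Teacher, the grammar encoding (a dict from nonterminals to lists of right-hand-side tuples), the ε-freeness restriction, and the length-bounded approximation of the equivalence query are all assumptions made for illustration.

```python
class Teacher:
    """Oracle for a fixed, epsilon-free target CFG (an assumption made to keep
    the membership test simple).  Productions: dict nonterminal -> list of
    right-hand sides, each a tuple of symbols."""

    def __init__(self, nonterminals, terminals, productions, start):
        self.N, self.T, self.P, self.S = nonterminals, terminals, productions, start

    def _derives(self, symbol, w, seen):
        # Does `symbol` derive exactly the string w?
        if symbol in self.T:
            return w == symbol
        key = (symbol, w)
        if key in seen:                      # cut unit-production cycles
            return False
        seen = seen | {key}
        return any(self._split(rhs, w, seen) for rhs in self.P.get(symbol, []))

    def _split(self, rhs, w, seen):
        # Partition w among the symbols of rhs, each part non-empty.
        if not rhs:
            return w == ""
        if len(rhs) == 1:
            return self._derives(rhs[0], w, seen)
        head, rest = rhs[0], rhs[1:]
        return any(self._derives(head, w[:i], seen) and self._split(rest, w[i:], seen)
                   for i in range(1, len(w) - len(rest) + 1))

    def nm(self, w, A):
        """Nonterminal membership query NM(w, A): is w in L_G(A)?"""
        return self._derives(A, w, frozenset())

    def eq(self, accepts, max_len=8):
        """Approximate equivalence query: compare the hypothesis (given as an
        accepts(word) predicate) with the target on all words up to max_len and
        return a counterexample, or None if no difference is found."""
        from itertools import product
        for n in range(1, max_len + 1):
            for tup in product(sorted(self.T), repeat=n):
                w = "".join(tup)
                if self.nm(w, self.S) != accepts(w):
                    return w
        return None
```

A real teacher would answer the equivalence query exactly; the bounded enumeration above only stands in for it in experiments.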
3 A Context-Free Grammar and a Subpansive Tree
In this section, we introduce the concept of subpansive tree. In the learning process, the subpansive tree of a CFG works as background knowledge. Let L and L0 be languages over Σ. Then, L0 is said to be a component of L if there exist two strings u, v ∈ Σ ∗ such that w ∈ L0 implies uwv ∈ L. Definition 1. Let G = (Σ, N, P, S) be a CFG, and let T = (N, E) be a rooted tree. Then, T is a subpansive tree of G if it satisfies the following conditions: 1. S is a root of T , and 2. if A ∈ N is an ancestor of B ∈ N on T , then LG (B) is a component of LG (A). Example 1. Consider the following CFG G: G = ({S, A0 , A1 }, {0, 1, a}, {S → 0A0 | 1A1 , A0 → aa | aA1 , A1 → a | aA0 }, S).
Let Ti (i = 1, 2, 3) be the trees as follows:

[Figure: three rooted trees T1, T2, T3 over the nonterminals {S, A0, A1}, each with root S — one in which A0 and A1 are both children of S, and the two chains in which A0 and A1 appear below S in either order.]
For each of the nonterminals S, A0 and A1, it holds that LG(S) = {0a^{2n} | n ≥ 1} ∪ {1a^{2n−1} | n ≥ 1}, LG(A0) = {a^{2n} | n ≥ 1}, and LG(A1) = {a^{2n−1} | n ≥ 1}. Then the following is clear: {0w | w ∈ LG(A0)} ⊆ LG(S), {1w | w ∈ LG(A1)} ⊆ LG(S), {aw | w ∈ LG(A0)} ⊆ LG(A1), {aw | w ∈ LG(A1)} ⊆ LG(A0). Hence, every Ti is a subpansive tree of G.

As for the relationship between a CFG and a tree, the following proposition holds:

Proposition 1. Let G be a CFG and T a tree. Then, it is undecidable to determine whether T is a subpansive tree of G.

Proof. Deciding whether L(G′) ⊆ L(G″) for two given CFGs G′ and G″ is undecidable (cf. [16]). It can be trivially reduced to asking whether a language LG(B) is a component of LG(A) for some nonterminals A and B on T. □

However, in the following case we can easily find out that a tree T is a subpansive tree of a CFG G.

Proposition 2. Let G = (Σ, N, P, S) be a CFG and T = (N, E) a tree. If for every edge (A, B) ∈ E of T a production from A to a string that contains B exists in G, then T is a subpansive tree of G.

Proof. Straightforward from the definition of a subpansive tree. □
By Proposition 2, we can design the algorithm SubpansiveTree to construct one of the subpansive trees of a given CFG as Figure 1. For the CFG G given in Example 1, the algorithm SubpansiveTree constructs just the tree T1 .
4 A Hierarchical CFG
In order to discuss how to learn a CFG in terms of a subpansive tree, we introduce an appropriate class of CFGs, which we call hierarchical CFGs, or HCFGs, by analogy with logic programming [21].
Algorithm SubpansiveTree
Input: A CFG G = (Σ, N, P, S).
Output: A subpansive tree (N, E) of G.

  E := ∅; V := {S};
  while V ≠ N do
    for each A ∈ V do
      if there exists a production A → α1Bα2 ∈ P such that B ∉ V then
        add (A, B) to E and add B to V;
  end /* while */
  output (N, E);

Fig. 1. An algorithm to construct a subpansive tree from a CFG
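The procedure of Fig. 1 translates almost line by line into executable code. The sketch below is an illustrative transcription under an assumed data layout (productions as (lhs, rhs) pairs); the run on the grammar of Example 1 happens to attach both A0 and A1 directly to S, which is one of the subpansive trees listed above.

```python
def subpansive_tree(nonterminals, productions, start):
    """Sketch of the SubpansiveTree procedure of Fig. 1; `productions` is an
    iterable of (lhs, rhs) pairs, rhs a tuple of terminal/nonterminal symbols
    (an assumed encoding, not fixed by the paper)."""
    edges = set()
    visited = {start}
    while visited != set(nonterminals):
        progressed = False
        for lhs, rhs in productions:
            if lhs in visited:
                for sym in rhs:
                    if sym in nonterminals and sym not in visited:
                        edges.add((lhs, sym))      # lhs becomes the parent of sym
                        visited.add(sym)
                        progressed = True
        if not progressed:       # guard for non-reduced grammars
            break
    return edges

# Example 1: S -> 0A0 | 1A1, A0 -> aa | aA1, A1 -> a | aA0
P = [("S", ("0", "A0")), ("S", ("1", "A1")),
     ("A0", ("a", "a")), ("A0", ("a", "A1")),
     ("A1", ("a",)), ("A1", ("a", "A0"))]
print(subpansive_tree({"S", "A0", "A1"}, P, "S"))
# {('S', 'A0'), ('S', 'A1')}
```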
Definition 2. A CFG G = (Σ, N, P, S) is said to be hierarchical or an HCFG if there exists a mapping (called a level mapping [21]) f : N → {1, . . . , |N |} such that, for each production A → α ∈ P and every nonterminal B occurring in α, A 6= B implies f (A) > f (B). The class of all HCFGs is denoted by HCFG. Proposition 3. For a CFG G = (Σ, N, P, S), the problem of determining whether G is hierarchical is decidable. Proof. By Definition 2, there exists a mapping f : N → {1, . . . , |N |}. Since the number of all mappings from N to {1, . . . , |N |} is finite, the problem is decidable. t u We associate an HCFG G with a subpansive tree by the level mapping of G. Let T be a tree (N, E) and f be a mapping from N to {1, . . . , |N |}. Then, T is leveled by f if f (A) > f (B) holds for each (A, B) ∈ E. Then, it is obvious the following proposition. Proposition 4. Let G be an HCFG (N, Σ, P, S) and T be a tree (N, E). Then, there exists a level mapping f of G such that T is leveled by f iff T is a subpansive tree of G. An HCFG has the following useful properties for learning languages constructively. Proposition 5. Let G = (Σ, N, P, S) be a HCFG, T be a subpansive tree for G, and k be the depth of T . Then, for any w ∈ L(G), there exists a parse tree of w whose depth is at most k + |w| + 1. Proof. Let TG (w), or simply T (w), be a parse tree of w in G. For an internal node A of T (w), the yield of A is defined as a concatenation of the leaves in left-to-right order on the sub-tree of T (w) whose root is A. For every parent-child relationship (A, B) in T (w), if A and B are labeled by same nonterminal and the length of A’s yield is same as one of B’s yield,
then the tree obtained by replacing the sub-tree of T (w) rooted by A with one rooted by B is also a parse tree of w. Therefore, without loss of generality, we can assume that, if A and B have a parent-child relationship in T (w), then the relationship is also preserved in T or the length of B’s yield is less than one of A’s yield. Under this assumption, the statement can be easily proved by the induction on the length of w. t u By Proposition 5, if the maximum number of nonterminals occurring in the righthand side of each production is bounded by some constant, then each parse tree of w ∈ L(G) is computed in polynomial time. Consider the language family {L(G) | G ∈ HCFG}. The following proposition claims that the language family contains all regular languages. Proposition 6. For each regular expression E, there exists a HCFG G such that L(G) is equivalent to the language LE represented by E. Proof. We prove the statement by the induction on the length of E. If E = φ, then let P be an empty set. If E = a for some a ∈ Σ ∪ {}, then let P be the set {S → a}. Furthermore, let G be a CFG (Σ, {S}, P, S). Let E1 , E2 be regular expressions, and let Gi = (Σ, Ni , Pi , Si ) be a CFG representing LEi . Without loss of generality, we can assume that N1 ∩ N2 = ∅ and S ∈ / N1 ∪ N2 . Then, we construct P0 as follows: 1. If E = (E1 + E2 ), then let P0 be {S → S1 | S2 }; 2. If E = (E1 E2 ), then let P0 be {S → S1 S2 }; 3. If E = (E1∗ ), then let P0 be {S → ε | SS1 }. Furthermore, let G be a CFG (Σ, N, P, S), where N = {S} ∪ N1 ∪ N2 and P = P0 ∪ P1 ∪ P2 . It is easy to show that the set L(G) is equal to LE , and that t u if both G1 and G2 are in HCFG, so is G. The language family {L(G) | G ∈ HCFG} properly contains all regular languages: Consider the language L = {an bn | n ≥ 1}. Then, L is not regular but there exists an HCFG G = ({a, b}, {S}, {S → aSb | ab}, S) such that L = L(G). On the other hand, the language family {L(G) | G ∈ HCFG} is contained by the language family {L(G) | G is a CFG }. However, the properness remains open.
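Proposition 3 argues decidability by exhausting the finitely many candidate mappings. In practice one can decide the same question by checking that the "A uses B" relation (for B ≠ A) is acyclic, since a level mapping exists exactly when that relation admits a topological order. The sketch below is such a check under an assumed encoding of productions; it is not the proof's own procedure.

```python
def is_hierarchical(nonterminals, productions):
    """Decide whether a level mapping in the sense of Definition 2 exists, by
    detecting cycles (through distinct nonterminals) in the usage relation.
    `productions` is assumed to be an iterable of (lhs, rhs) pairs."""
    uses = {A: set() for A in nonterminals}
    for lhs, rhs in productions:
        for sym in rhs:
            if sym in nonterminals and sym != lhs:   # self-loops are allowed
                uses[lhs].add(sym)

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {A: WHITE for A in nonterminals}

    def dfs(a):
        colour[a] = GREY
        for b in uses[a]:
            if colour[b] == GREY:        # cycle among distinct nonterminals
                return False
            if colour[b] == WHITE and not dfs(b):
                return False
        colour[a] = BLACK
        return True

    return all(dfs(A) for A in nonterminals if colour[A] == WHITE)
```

When the check succeeds, a concrete level mapping can be read off a topological order of the usage relation.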
5 Generalization with a Subpansive Tree
In this section, we introduce the generalization under a subpansive tree, which will be essential for the learning algorithm of CFGs discussed in Section 6. We introduce three kinds of generalizations; generalization on strings over (Σ ∪ N )∗ , on productions, and on the set of productions. Definition 3. Let G = (Σ, N, P, S) be a CFG and T = (N, E) be a subpansive tree of G. Let α, β be non-empty strings in (N ∪Σ)+ . Then, α is more general than β under T , denoted by α T β, if there exist β1 , β3 ∈ (N ∪ Σ)∗ , β2 ∈ (N ∪ Σ)+ and A ∈ N satisfy the following:
1. α = β1Aβ3,
2. β = β1β2β3, and
3. every nonterminal in β2 is either A itself or a descendant of A on T.
The relation ⪰T is extended to its reflexive and transitive closure ⪰*T.

Definition 4. Let G = (Σ, N, P, S) be a CFG, and let p = A → α and q = A → β be productions on Σ and N from a nonterminal A ∈ N. Then, p is more general than q under T, denoted by p ⪰T q, if α ⪰T β.

In the above definition, p and q are not necessarily elements of P for a CFG G = (Σ, N, P, S).

Definition 5. Let Π be the set of all productions on N and Σ, and let π1 and π2 be finite subsets of Π. Then, π1 is more general than π2 under T, denoted by π1 ⪰T π2, if for every p2 ∈ π2 there exists p1 ∈ π1 that is more general than p2 under T.

In order to associate the generalization in Definition 5 with a subpansive tree, we introduce the following two mappings ΓT and ΓT^n over 2^Π, motivated by the TP operator well known in logic programming (cf. [21]).

Definition 6. Let G be a CFG and T a subpansive tree of G. Also let Π be the set of all productions on G. Then, the mapping ΓT : 2^Π → 2^Π is defined as ΓT(π) = {q ∈ Π | q ⪰T p for some p ∈ π} for π ⊆ Π. Furthermore, for a finite set π ⊆ Π and a non-negative integer n, ΓT^n(π) is defined by ΓT^0(π) = π and ΓT^n(π) = ΓT(ΓT^{n−1}(π)) for n ≥ 1.

Now, we investigate some properties of ΓT. In the remainder of this section, the notation G = (Σ, N, P, S) always denotes a reduced HCFG, T is a subpansive tree of G, Π is the set of all productions on G, and π is a finite subset of Π.

Lemma 1. For each p ∈ Π, the size of ΓT({p}) is bounded by a polynomial in |N| and |p|.

Proof. Let p be a production A → α. By the definition of ΓT, if B → β ∈ ΓT({p}), then B = A and β ⪰T α. Since β is obtained by replacing a substring of α with one nonterminal, there are at most |α| · (|α| + 1) · |N| candidates for β, which is a polynomial in |N| and |p|. □

Lemma 2. Let m be the maximum length of p ∈ π. Then, the size of ΓT^n(π) (n ≥ 1) is bounded by a polynomial in |N|, m and |π|.

Proof. It follows from Lemma 1. □
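Definition 3 and Lemma 1 suggest a direct enumeration of the one-step generalisations of a single production: every candidate is obtained by replacing a non-empty segment of the right-hand side with a nonterminal that dominates, in the subpansive tree, all nonterminals of that segment, so there are at most |α| · (|α| + 1) · |N| of them. The following sketch assumes the tree is given as a child map and productions as (lhs, rhs) tuples; it is an illustration, not the paper's definition.

```python
def one_step_generalisations(prod, tree_children, nonterminals):
    """Enumerate the productions that are more general than `prod` in one
    step of Definition 3, under an assumed data layout."""
    lhs, rhs = prod

    def descendants(x):                       # all strict descendants of x in T
        out, stack = set(), [x]
        while stack:
            y = stack.pop()
            for c in tree_children.get(y, ()):
                if c not in out:
                    out.add(c)
                    stack.append(c)
        return out

    result = set()
    for X in nonterminals:
        allowed = {X} | descendants(X)
        for i in range(len(rhs)):
            for j in range(i + 1, len(rhs) + 1):      # non-empty segment rhs[i:j]
                segment = rhs[i:j]
                if all(s not in nonterminals or s in allowed for s in segment):
                    result.add((lhs, rhs[:i] + (X,) + rhs[j:]))
    return result
```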
Lemma 3. Let m be the maximum length of p ∈ π. Then, there exists a polynomial poly such that ΓTn (π) = ΓTn+1 (π) for each n ≥ poly(m, |N |, |π|). Proof. Let p be a production A → α ∈ Π. Then, the number of all productions q satisfying q T p is bounded by |α| · (|α| + 1) · |N |. Furthermore, since p T q and q T r implies p T r, for any non-negative integer n, the size of the set ΓTn ({p}) is bounded by |α| · (|α| + 1) · |N |. Then, the size of ΓTn (π) is bounded by m · (m + 1) · |N | · |π| for each n ≥ 0. Furthermore, by the definition of ΓTn , ΓTn is monotonic with respect to n, that is, 1. ΓTi (π) ⊇ ΓTj (π) for i ≥ j, and 2. if ΓTi (π) = ΓTi+1 (π) for some i, then ΓTj (π) = ΓTi (π) for each j (j ≥ i). Hence, we can set poly(m, |N |, |π|) to m · (m + 1) · |N | · |π|.
t u
Lemma 4. Let p be a production A → α ∈ P . Then, there exist w ∈ LG (A) and non-negative integer n such that p ∈ ΓTn ({A → w}). Proof. Since G is reduced, there exists a string w ∈ Σ + such that A ⇒G α ⇒∗G w, α = u0 A1 u1 A2 · · · Am um , w = u0 v1 u1 v2 · · · vm um , and for each i (i = 1, 2, . . . , m) vi ∈ LG (Ai ) − {ε}. Since the string α is more general than w, it holds that (A → α) T (A → w). Hence, it holds that p ∈ ΓTn ({A → w}). u t Since the properties of the mapping ΓTn given in the above lemmas hold in the class of all CFGs, it is useful to discuss learnability of CFGs in the framework of identification in the limit [14]. However, it is not suitable to the polynomial time learning, because the size of the set ΓTn (P ) and the length of each production in it are monotonically increasing on n and the length of the given positive example. On the other hand, for an HCFG G, the minimal generalizations of G can be easily computed. Let G be an HCFG (Σ, N, P, S) and T be a subpansive tree (N, E) of G. Then, for each A ∈ N , TA denotes a subtree (NA , EA ) of T satisfying the following conditions: 1. EA = ∅ if A is a leaf of T ; EA = {(A, B) ∈ E | B ∈ N } otherwise. 2. NA is the set consisting of A and all children of A. Then, the next lemma holds. Lemma 5. Let G = (Σ, N, P, S) be an HCFG and T be a subpansive tree of G. Then, for each p = A → α ∈ P , there exists w ∈ LG (A) and a non-negative integer n ≤ |w| such that p ∈ ΓTnA ({A → w}). The above lemma tells that it is enough to consider the generalization on the subtree TA of the subpansive tree for an HCFG. In the following section, we propose the procedure Generalize which finds the minimal generalizations by using nonterminal membership queries.
6 Learnability of HCFGs with a Subpansive Tree
In this section, we discuss learnability of HCFGs in our setting: the subpansive tree of a target HCFG is given as a background knowledge, and two types of queries, nonterminal membership and equivalence queries are available. Then, for a tree T , let HCFG[T ] denote the family {G ∈ HCFG | T is a subpansive tree of G}. Theorem 1. Let T be a tree. Then, HCFG[T ] is polynomial-time learnable via equivalence and nonterminal membership queries. We give the learning algorithm Learn HCFG in the following proof, which is a constructive algorithm, because the algorithm makes the most specific hypothesis, and repeats generalizing it until the target CFG is obtained. Proof. Consider the algorithm Learn HCFG as Figure 2, which mainly consists of two procedures, Generalize and Diagnose. Note that in the first construction of P nonterminal membership queries NM G∗ (ε, A) for A ∈ N are invoked. Learn HCFG Input : A subpansive tree T = (N, E) of a target CFG G∗ Output : A CFG G = (Σ, N, P, S) such that L(G) = L(G∗ ) P := {A → ε | ε ∈ LG∗ (A)}; G := (Σ, N, P, S); while EQ G∗ (G) replies w do begin if w ∈ L(G) − L(G∗ ) then /* w: negative counterexample */ P := Diagnose(w, P ); else /* w: positive counterexample */ G := Generalize(w, G, T ); end /* while */ output G;
Fig. 2. The procedure Learn HCFG
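The outer loop of Learn_HCFG in Fig. 2 can be summarised as follows. This is only a control-flow sketch: the teacher object and the procedures diagnose, generalize and accepts are passed in as callables whose behaviour is assumed to match the descriptions in the text, and the grammar encoding is an assumption of the sketch.

```python
def learn_hcfg(teacher, diagnose, generalize, accepts, sigma, nonterminals, tree, start):
    """Control-flow sketch of Learn_HCFG (Fig. 2).  `teacher` is assumed to
    offer nm(word, A) -> bool and eq(grammar) -> counterexample or None."""
    # Most specific start: A -> epsilon exactly for those A with epsilon in L(A).
    productions = {(A, ()) for A in nonterminals if teacher.nm("", A)}
    grammar = (sigma, nonterminals, productions, start)
    while True:
        w = teacher.eq(grammar)
        if w is None:                                  # hypothesis accepted
            return grammar
        if accepts(grammar, w) and not teacher.nm(w, start):
            productions = diagnose(w, productions)     # negative counterexample
        else:
            grammar = generalize(w, grammar, tree)     # positive counterexample
            productions = grammar[2]
        grammar = (sigma, nonterminals, productions, start)
```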
The procedure Diagnose was proposed by Angluin [1] originally for the class of k-bounded CFG in which the right-hand side of each production has at most k nonterminals for a fixed integer k. In our algorithm, the procedure Diagnose finds a false production, for example, A → α in a hypothesis P , and replaces it with the productions that associate A to strings derived from α in one step. In these productions, the number of occurrences of nonterminals in right-hand side is bounded by some constant depending on the target G∗ . By Proposition 5, Diagnose finds a false production in polynomial time, and the number of new productions is also bounded by a polynomial. On the other hand, in order to design the procedure Generalize in Figure 3, we prepare two notions, crr G (W, A) and red P (Q).
Let G = (Σ, N, P, S) be a CFG, W a finite subset of Σ + and A a nonterminal in N . Also let P0 be a set {A → w | w ∈ W }. Then, the correct generalizations of W with respect to A, denoted by crr G (W, A), is the set R satisfying the following conditions: 1. R ∗T P0 , 2. ΓTA (R) = R, and 3. for each A → u0 B1 u1 · · · um−1 Bm um in R, there exists u0 v1 u1 · · · um−1 vm um in W such that vi ∈ LG (Bi ) for each 1 ≤ i ≤ m. Let P and Q be sets of productions. Then, the reduced set red P (Q) of Q with P is a subset R of Q which satisfies the following: (1) for any production A → α in Q, A ⇒∗R∪P α, and (2) for any R0 ⊂ R, there is A → α in Q such that A 6⇒∗R0 ∪P α. Here, ⇒P for a set of productions P implicitly means ⇒G for a CFG whose set of productions is P .
Generalize Input : w ∈ Σ + ; a CFG G0 = (Σ, N, P0 , S); a tree T ; a target CFG G∗ ; /* T is a subpansive tree of G∗ */ Output : A CFG G = (Σ, N, P, S) such that L(G) ⊇ L(G0 ) ∪ {w}. P := P0 ; for each A ∈ N do begin WA := ∅; for each substring u of w do if NM G∗ (u, A) replies yes then WA := WA ∪ {u}; end /* for each */ mark := {A ∈ N | A is a leaf of T }; for each A ∈ mark do P := P ∪ red P (crr G (WA , A)); while mark 6= N do begin if all children of A ∈ (N − mark ) are in mark then P := P ∪ red P (crr G (WA , A)); mark := mark ∪ {A}; end /* while */ output G; Fig. 3. The procedure Generalize
In the procedure Generalize in Figure 3 we deal with crr and red directly. Note that we can check the condition 3 in crr by using nonterminal membership queries. In the while loop the enumeration of nonterminals is executed in a bottomup manner, and the number of iterations of the loop is bounded by the depth of the given subpansive tree T . Furthermore, by Lemma 1, 2, and 3, the set red P (crr G (WA , A)) can be computed in polynomial time with |P | and |N |. Therefore, the amount time of running Generalize is bounded by a polynomial with
|G0|, |w| and |T|. Since the size of P is bounded by a polynomial at each step of the iteration in the learning algorithm Learn_HCFG, it terminates in polynomial time. □

On the other hand, let DNFn be the family of DNF formulas over n Boolean variables. Then, we can show the following theorem.

Theorem 2. For each n ≥ 0, there exists a subpansive tree such that, if HCFG is polynomial-time predictable, then so is DNFn.

Proof. Let d = t1 ∨ · · · ∨ tm be a DNF formula over the n Boolean variables {x1, . . . , xn}. Then, let T be the tree

  T = ({S, X1, . . . , Xn}, ∪_{1≤i≤n} {(S, Xi)}).

We show the statement by a prediction-preserving reduction [24]. The word transformation f is the identity function, that is, f(e) = e. The representation transformation g is defined as follows. First, for each term ti in d, h(ti) is a production S → w1 · · · wn, where wj = 1 if ti contains xj, wj = 0 if ti contains the negation of xj, and wj = Xj otherwise. Furthermore, G0 is the following set of productions:

  G0 = ∪_{j=1}^{n} {Xj → 0 | 1}.

Then, for each d ∈ DNFn, the representation transformation g is defined as

  g(d) = ∪_{i=1}^{m} h(ti) ∪ G0.

Note here that T is a subpansive tree of g(d), and the size of g(d) is at most |d|(n + 1) + 4n. For e ∈ {0, 1}^n, it holds that e satisfies d ⟺ there exists an i (1 ≤ i ≤ m) such that e satisfies ti ⟺ there exists an i (1 ≤ i ≤ m) such that e ∈ L({h(ti)} ∪ G0) ⟺ f(e) ∈ L(g(d)). Hence, there exists a subpansive tree T for which predicting DNFn reduces to predicting HCFG. By the properties of prediction-preserving reductions given by Pitt and Warmuth [24], we can conclude the statement. □

Hence, we conjecture that HCFG is not polynomial-time predictable. According to [3,20,24], we may also conjecture that HCFG is not polynomial-time learnable via equivalence queries alone.
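The representation transformation g in this proof is easy to state operationally. The sketch below assumes a term is encoded as a dict from variable indices to their sign; it returns the productions of g(d), whose subpansive tree is the star with root S and leaves X1, . . . , Xn.

```python
def dnf_to_hcfg(terms, n):
    """Sketch of the transformation g of Theorem 2.  `terms` is a list of
    dicts {j: True/False} recording which literals a term contains
    (an assumed encoding, not the paper's)."""
    prods = []
    for term in terms:
        rhs = []
        for j in range(1, n + 1):
            if j in term:
                rhs.append("1" if term[j] else "0")   # xj forces 1, negated xj forces 0
            else:
                rhs.append(f"X{j}")                    # free variable
        prods.append(("S", tuple(rhs)))
    for j in range(1, n + 1):                          # G0: Xj -> 0 | 1
        prods.append((f"X{j}", ("0",)))
        prods.append((f"X{j}", ("1",)))
    return prods

# d = x1 & not x3 over three variables:
print(dnf_to_hcfg([{1: True, 3: False}], 3))
# [('S', ('1', 'X2', '0')), ('X1', ('0',)), ('X1', ('1',)), ...]
```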
7 Concluding Remarks
We have proposed a new scenario for learning a CFL with a subpansive tree. The subpansive tree characterizes a language by sublanguages, modeling a background knowledge in natural language acquisition. Through the level mapping technique, we have defined the subclass of CFGs that we named hierarchical CFGs, and associated with a subpansive tree. We have shown that all regular languages are properly contained in the languages generated by the class of hierarchical CFGs, and that the class of hierarchical CFGs is efficiently learnable by nonterminal membership queries and equivalence queries. We also have shown that the class of hierarchical CFGs seems hard to predict in polynomial time. Consider, finally, a subpansive tree that lacks some nodes, an incomplete subpansive tree. An incomplete subpansive tree is constructed from a proper subset of nonterminals of a given complete subpansive tree. This is comparable to a natural language in which only some parts of speech are fixed. Arguably, any natural language has contentive categories, such as noun, verb, adjective, and adposition. These reflect epistemological cognitive categories that roughly depict entities, actions, properties, and relations. In an influential hypothesis of linguistic theory [11,13], a factor of cross-linguistic variation is attributed to other minor categories, such as determiner and complementizer, which may be absent in some languages. Thus, grammatical inference with incomplete subpansive trees comes closer to natural language acquisition, and if it can succeed, we may also find a useful application in computer science, such as extracting from a program metafunctions and/or subfunctions that are recurrently used but unknown. Yet, to derive a complete subpansive tree from incomplete ones, we have to consider all the complete subpansive trees that are possible, the number of which is proportional to exponential of nonterminals of the incomplete subpansive tree. This seems rather formidable, but one of our future research topics is to discuss the learnability of languages with incomplete subpansive trees and its implications. Another is to restrict the membership queries to terminals, and drop the equivalence queries all together. After all, humans, even adults, are unconscious about the precise structure of a given sentence, and they cannot completely describe their grammar (target) or know how much a child has developed his or hers (hypothesis); let alone evaluating the differences and giving counterexamples, though the child may be told what can be said and what cannot. In the meantime, it also pays to investigate whether or not the class of hierarchical CFGs is a proper subclass of CFGs, a task we are not yet able to cover in this preliminary study.
References
1. D. Angluin: Learning k-bounded context-free grammars, Technical Report YALEU/DCS/RR-557, Yale University (1987).
2. D. Angluin: Learning regular sets from queries and counterexamples, Information and Computation 75, 87–106 (1987). 3. D. Angluin: Query and concept learning, Machine Learning 2, 319–342 (1988). 4. L. Bloomfield: Language, Holt, Rinehart & Winston, (1933). 5. P. Berman and R. Roods: Learning one-counter languages in polynomial time, Proc. 28th IEEE Symposium on Foundation of Computer Science, 61–67 (1987). 6. S. Crain and D. Lillo-Martin: Introduction to linguistic theory and natural language acquisition, Blackwell (1999). 7. N. Chomsky: Syntactic structures, Mouton (1957). 8. N. Chomsky: Formal properties of grammars, Handbook of mathematical psychology (eds.) R. D. Luce, R. R. Bush and E. Galanter, J. Wiley & Sons, 323-418 (1963). 9. N. Chomsky: Aspects of the theory of syntax , MIT Press (1965). 10. N. Chomsky: Remarks on nominalization, Readings in English transformational grammar (eds.) R. A. Jacobs and P. S. Rosenbaum, Ginn & Co,.184-221 (1970). 11. N. Chomsky: Lectures on government and binding, Foris Publications (1981). 12. C. Domingo and V. Lav´ın: The query complexity of learning some subclasses of context-free grammars, Proc. 2nd European Conference on Machine Learning, 404–414 (1995). 13. N. Fukui: Theory of projection in syntax , CSLI Publications (1995). 14. E. Gold: Language identification in the limit, Information and Control 10, 447–474 (1967). 15. C. D. L. Higuera: Characteristic sets for polynomial grammatical inference, Machine Learning 27, 125–138 (1997). 16. J. E. Hopcroft and J. D. Ullman: Introduction to automata theory, languages and computation, Addison-Wesley Publishing (1979). 17. H. Ishizaka: Polynomial time learnability of simple deterministic languages, Machine Learning 5, 151–164 (1990). 18. R. Jackendoff: X syntax: A study of phrase structure, MIT Press (1977). 19. X. Ling: Learning and invention of Horn clause theories – a constructive method –, Methodologies for Intelligent Systems 4, 323–331 (1989). 20. N. Littlestone: Learning quickly when irrelevant attributes abound: A new linearthreshold algorithm, Machine Learning 2, 285–318, 1988. 21. J. W. Lloyd: Foundations of logic programming (second, extended edition), Springer-Verlag (1987). 22. E. M¨ akinen: On the structural grammatical inference problem for some classes of context-free grammars, Information Processing Letters 42, 1–5 (1992). 23. S. Pinker: Language learnability and language development, Harvard University Press (1984). 24. L. Pitt and M. K. Warmuth: Prediction preserving reduction, Journal of Computer System and Science 41, 430–467, 1990. 25. Y. Sakakibara: Learning context-free grammars from structural data in polynomial time, Theoretical Computer Science 76, 223–242 (1990). 26. Y. Sakakibara: Efficient learning of context-free grammars from positive structural examples, Information and Computation 97, 23–60 (1992). 27. H. Sakamoto: Language learning from membership queries and characteristic examples, Proc. of 6th International Workshop on Algorithmic Learning Theory, LNAI 997, 55–65 (1995). 28. N. Sugimoto, K. Hirata and H. Ishizaka: Constructive learning of translations based on dictionaries, Proc. of 7th International Workshop on Algorithmic Learning Theory, LNAI 1160, 177–184 (1996).
A Polynomial Time Learning Algorithm of Simple Deterministic Languages via Membership Queries and a Representative Sample

Yasuhiro Tajima1 and Etsuji Tomita1

1 The Graduate School of Electro-Communications, The University of Electro-Communications, 1-5-1, Chofugaoka, Chofu-shi, Tokyo, 182-8585 Japan
{tajima, tomita}@ice.uec.ac.jp

Abstract. We show that a simple deterministic language is learnable via membership queries in polynomial time if the learner knows a special finite set of positive examples which is called a representative sample. Angluin (1981) showed that a regular language is learnable in polynomial time with membership queries and a representative sample. Thus our result is an extension of her work.
Key words : learning via queries, simple deterministic languages, representative sample
1 Introduction
We show that a simple deterministic language is learnable via membership queries in polynomial time if the learner knows a special finite set of positive examples which is called a representative sample. Here, the polynomial consists of the size of the grammar which generates the target language, the cardinality of the representative sample and the maximum length of a word in the representative sample. A representative sample was introduced by Angluin [1]. In [1], she showed the polynomial-time learnability of regular languages from membership queries and a representative sample. Here, for a deterministic finite automaton M, a representative sample Q is a finite subset of the target language L(M) such that all transitions of M are used in accepting the words in Q. Thus our result is a proper extension of the learnability results in this setting. There is some related work on learning via queries. It has been shown that simple deterministic languages are polynomial-time learnable with some modified queries by Yokomori [8], and with membership queries and extended equivalence queries by Ishizaka [4]. Here, the extended equivalence query takes a context-free grammar as input, while the target language must be simple deterministic. Sakamoto [6] used a special example, similar to a representative sample, to show the learnability of a subset of the context-free languages. Angluin [2] showed that regular languages are polynomial-time learnable from membership queries and counterexamples. In our algorithm, the technique used to determine the equivalence of two nonterminals is similar to Angluin's [2] idea.
2 Definitions

2.1 Basic Definitions
Basic definitions are based on [3]. A context-free grammar is a 4-tuple G = (N, Σ, P, S) where N is a finite set of nonterminals, Σ is a finite set of terminals, P is a finite set of rules and S ∈ N is the start symbol. If P is ε-free and every rule in P is of the form A → aβ, where A ∈ N, a ∈ Σ and β ∈ N*, then G = (N, Σ, P, S) is said to be in Greibach normal form. A context-free grammar G is a simple deterministic grammar [5] iff G is in Greibach normal form and, for every A ∈ N and a ∈ Σ, if A → aβ is in P then A → aγ is not in P for any γ ∈ N* such that γ ≠ β. In addition, such a set P of rules is called simple deterministic.

Let A → aβ be in P, where β ∈ N*, A ∈ N and a ∈ Σ, and let γ, γ′ ∈ N*. Then γAγ′ ⇒_G γaβγ′ denotes the one-step derivation from γAγ′ to γaβγ′ in G. We define ⇒*_G to be the reflexive and transitive closure of ⇒_G. When it is not necessary to specify the grammar G, α ⇒ α′ and α ⇒* β denote α ⇒_G α′ and α ⇒*_G β, respectively. A word generated from γ ∈ (N ∪ Σ)* by G is a word w ∈ Σ* such that γ ⇒*_G w, and the language generated from γ by G is denoted by L_G(γ) = {w ∈ Σ* | γ ⇒*_G w}. A word generated from S by G, for the start symbol S, is called a word generated by G, and the language generated by G is denoted by L(G) = L_G(S).

In this paper, |β| denotes the length of β if β is a string, and |W| denotes the cardinality of W if W is a set. For any simple deterministic grammar G1 = (N1, Σ, P1, S1), there exists a simple deterministic grammar G2 = (N2, Σ, P2, S2) such that L(G1) = L(G2) and every rule A → aβ in P2 satisfies |β| ≤ 2. Such a grammar G2 is said to be in 2-standard form. For a simple deterministic grammar G = (Σ, N, P, S) and A ∈ N, if there exists a derivation S ⇒* xAz ⇒* xyz for some x, z ∈ Σ* and y ∈ Σ+, then A is called reachable and live. Throughout this paper, we assume that a simple deterministic grammar is in 2-standard form and that all nonterminals are reachable and live. This implies that the target language Lt satisfies Lt ≠ ∅. For a simple deterministic grammar G = (N, Σ, P, S) and a nonterminal A ∈ N, we define the thickness of A as the length of a shortest word in L_G(A), and the thickness of G as max{k_A | k_A is the thickness of A ∈ N}.

For w ∈ Σ+, proper_pre(w) = {w′ ∈ Σ* | w′w″ = w, w″ ∈ Σ+} is called the set of proper prefixes of w. For a set R, R² denotes the set of all concatenations of two elements of R. For an equivalence relation =_π over R and r ∈ R, the equivalence class of r is denoted by B(r, π) = {r′ ∈ R | r =_π r′}. A classification π over R is defined as the set of equivalence classes B(r, π).
2.2 A Representative Sample
Definition 1. Let G = (N, Σ, P, S) be a simple deterministic grammar and Q a finite subset of L(G). Then Q is a representative sample of G iff the following holds:
– For any A → aβ in P, there exists a word w ∈ Q such that S ⇒* xAγ ⇒ xaβγ ⇒* w for some x ∈ Σ* and γ ∈ N*. □

From this definition, for any simple deterministic grammar G = (N, Σ, P, S), there exists a representative sample Q such that |Q| ≤ |P|.

Definition 2. For a simple deterministic language L, a finite set Q ⊆ L is a representative sample iff there exists a simple deterministic grammar G = (N, Σ, P, S) such that L(G) = L and Q is a representative sample of G. □
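As an illustration not taken from the paper: for the simple deterministic grammar with productions S → aA, A → aAB | b, B → b, which generates {a^n b^n | n ≥ 1}, the single word aabb already forms a representative sample, since the derivation S ⇒ aA ⇒ aaAB ⇒ aabB ⇒ aabb uses every production exactly once. The sample Q = {aabb} used for the observation table in Section 6 is of exactly this one-word form.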
3 A Membership Query and Some Properties of Simple Deterministic Languages
M EM BER(w) : A membership query for a simple deterministic language Lt is defined as follows. Input : w ∈ Σ ∗ . Output : “yes” if w ∈ Lt and “no” if w 6∈Lt . Clearly, we can solve the membership query in O(|x|) time for any x ∈ Σ ∗ and a simple deterministic grammar G. We claim that the equivalence problem for two simple deterministic grammars are solvable in some polynomial time[7]. Theorem 1 (Wakatsuki et al.[7] Theorem 6.1). Let G1 = (N1 , Σ , P1 , S1 ) and G2 = (N2 , Σ , P2 , S2 ) be simple deterministic grammars. Let k1 and k2 be thicknesses of G1 and G2 , respectively. Then, there exists an algorithm which solves the equivalence problem for L(G1 ) and L(G2 ) in polynomial time, where the polynomial consists of |N1 |, |N2 |, |Σ | and max{k1 , k2 }. In addition, if it is not equivalent then the algorithm outputs w ∈ (L(G1 )−L(G2 ))∪(L(G2 )−L(G1 )) such that |w| is bounded by a polynomial of |N1 |, |N2 |, |Σ | and max{k1 , k2 }. 2 Corollary 1. Let G1 = (N1 , Σ , P1 , S1 ) and G2 = (N2 , Σ , P2 , S2 ) be simple deterministic grammars. Let k1 and k2 be thicknesses of G1 and G2 , respectively. Then there exists an algorithm which solves the equivalence problem for LG1 (α1 ) and LG2 (α2 ) in polynomial time for any α1 ∈ N1 ∪ N12 and any α2 ∈ N2 ∪ N22 , where the polynomial consists of |N1 |, |N2 |, |Σ | and max{k1 , k2 }. In addition, if it is not equivalent then the algorithm outputs w ∈ (LG1 (α1 ) − LG2 (α2 )) ∪ (LG2 (α2 ) − LG1 (α1 )) such that |w| is bounded by a polynomial of |N1 |, |N2 |, |Σ | 2 and max{k1 , k2 }.
4 The Main Result
In the remainder of this paper, we assume that Gh = (Nh, Σ, Ph, Sh) denotes the hypothesis grammar guessed by the learner, Lt denotes the target language, and Q denotes the representative sample given to the learner. Moreover, we assume that Gt = (Nt, Σ, Pt, St) is a simple deterministic grammar such that Lt = L(Gt) and Q is a representative sample of Gt. Now, we claim the main theorem.

Theorem 2. Simple deterministic languages are polynomial time learnable with membership queries and a representative sample. The polynomial consists of |Q|, |Nt|, |Σ| and max{|w| | w ∈ Q}. □

Note that which grammar was used to construct Q is independent of the learner. Thus treating the length of the longest word in Q as an independent parameter is reasonable.
5 Basic Ideas of This Learning Algorithm
In our learning algorithm, a context-free grammar which contains all rules of Gt is constructed and then the learner selects correct rules from them. Such a context-free grammar can be constructed from the representative sample based on the following Ishizaka’s lemma[4]. Lemma 1 (Ishizaka[4] Lemma 10). Let Gt = (Nt , Σ , Pt , St ) be a simple deterministic grammar. For any A(6= St ) ∈ Nt , there exist w ∈ L(Gt ) and a 3-tuple (wp , wm , ws ) such that – wp wm ws = w, ∗ ∗ – St ⇒ wp · A · ws ⇒ w and – it holds that for u ∈ Σ + , u ∈ LGt (A) iff M EM BER(wp · u · ws0 ) = yes and for any u0 ∈ proper pre(u), M EM BER(wp · u0 · ws0 ) = no where ws0 is the shortest suffix of ws such that wp wm ws0 ∈ L(Gt ). 2 Let R be the set of all 3-tuples such that R = {(wp , wm , ws ) ∈ Σ + × Σ + × Σ ∗ | wp wm ws ∈ Q} ∪ {(ε, w, ε) | w ∈ Q}, where Q is the representative sample given to the learner. Then, from this lemma, R contains a 3-tuple which corresponds to A for every nonterminal A ∈ Nt . Moreover, for every rule in Gt , say A → aβ, the rule which corresponds to A → aβ is contained in the set of rules such that Psuf f = {(wp , wm , ws ) → a | (wp , wm , ws ) ∈ R, wm = a ∈ Σ } ∪ {(wp , awm , ws ) → a(wp a, wm , ws ) | (wp , awm , ws ), (wp a, wm , ws ) ∈ R, a ∈ Σ} ∪ {(wp , awm1 wm2 , ws ) → a(wp a, wm1 , wm2 ws )(wp awm1 , wm2 , ws ) | (wp , awm1 wm2 , ws ), (wp a, wm1 , wm2 ws ), (wp awm1 , wm2 , ws ) ∈ R, a ∈ Σ },
because A → aβ is used to generate some word in Q from the definition of a representative sample. In other words, Psuf f is sufficient enough to induce a correct grammar. Now, we assume a classification over R is denoted by π and an equivalence class of A0 ∈ R is denoted by B(A0 , π). In addition, π satisfies that all members in R which is of the form (ε, w, ε) are in the same equivalence class. Let Gsuf f = (R/π, Σ , Psuf f /π, Ssuf f ) be a context-free grammar such that R/π = {B(A0 , π) | A0 ∈ R}, Psuf f /π = {B(A0 , π) → a | A0 → a ∈ Psuf f } ∪ {B(A0 , π) → aB(A1 , π) | A0 → aA1 ∈ Psuf f } ∪ {B(A0 , π) → aB(A1 , π)B(A2 , π) | A0 → aA1 A2 ∈ Psuf f }, Ssuf f = B((ε, w, ε), π), where w ∈ Q. Then, there exists a π such that we can make Gsuf f be equivalent to Gt by deleting incorrect rules from Psuf f /π. The learning problem of simple deterministic languages with a representative sample is solved by the following two procedures. – A procedure to make a classification π over R. – A procedure to delete incorrect rules from Psuf f /π. Thus our algorithm works as follows. 1. Make a classification π over R. 2. Delete incorrect rules from Psuf f /π. 3. Check whether π and the deletion are correct. If not, to try another π, go back to step 1. In our algorithm, the classification π is established using an observation table which is similar to Angluin[2]’s. The method to delete incorrect rules from Psuf f /π and the checking procedure whether the rule set is correct or not are newly introduced in this paper.
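The candidate sets R and P_suff of this section can be generated mechanically from Q. The sketch below is an illustrative implementation under an assumed encoding: 3-tuples are Python triples of strings, and a candidate rule is a pair (lhs-triple, right-hand side), the right-hand side being a terminal optionally followed by one or two triples.

```python
def build_candidates(Q):
    """Build the 3-tuple set R and the candidate rule set P_suff of Section 5."""
    R = {("", w, "") for w in Q}
    for w in Q:
        for i in range(1, len(w)):                      # wp and wm non-empty
            for j in range(i + 1, len(w) + 1):
                R.add((w[:i], w[i:j], w[j:]))

    P = set()
    for (p, m, s) in R:
        if len(m) == 1:                                 # (p, a, s) -> a
            P.add(((p, m, s), (m,)))
        if len(m) >= 2 and (p + m[0], m[1:], s) in R:   # -> a (pa, m', s)
            P.add(((p, m, s), (m[0], (p + m[0], m[1:], s))))
        for k in range(2, len(m)):                      # -> a B C
            a, m1, m2 = m[0], m[1:k], m[k:]
            B, C = (p + a, m1, m2 + s), (p + a + m1, m2, s)
            if B in R and C in R:
                P.add(((p, m, s), (a, B, C)))
    return R, P
```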
6 The Learning Algorithm
Now, we describe the details of the learning algorithm.

6.1 The Procedure to Make a Classification over R
The equivalence class for every r ∈ R is established based on behaviors for a set W ⊂ Σ ∗ such that – at the beginning, W = {v ∈ Σ ∗ | u, w ∈ Σ ∗ , uvw ∈ Q} where Q is the representative sample given to the learner, – W is closed under suffixes and prefixes.
This word set grows during the learning algorithm. In the first step to make the equivalence classes, the learner asks membership queries for every r ∈ R and w ∈ W to decide the following function T : R × W → {0, 1}:

  T(r, w) = 1  if MEMBER(p_r · w · short(r)) = yes and, for all w′ ∈ proper_pre(w), MEMBER(p_r · w′ · short(r)) = no;
  T(r, w) = 0  otherwise,

where short(r) is the shortest suffix of r_s such that MEMBER(r_p · r_m · short(r)) = yes, for r = (r_p, r_m, r_s). These results are expressed in an observation table such that R is presented in the column and W is presented in the row. In Table 1, we show an example of an observation table where Q = {aabb} and W = {a, b, aa, bb, ab, aab, abb, aabb}.

Table 1. An observation table

  T(·, ·)              a  b  aa  bb  ab  aab  abb  aabb
  r1 : (ε, aabb, ε)    0  0  0   0   1   0    0    1
  r2 : (a, abb, ε)     0  1  0   0   0   0    1    0
  r3 : (aa, bb, ε)     0  0  0   1   0   0    0    0
  r4 : (aab, b, ε)     0  1  0   0   0   0    0    0
  r1′: (a, ab, b)      0  0  0   0   1   0    0    1
  r2′: (aa, b, b)      0  1  0   0   0   0    1    0
  r5 : (a, a, bb)      1  0  0   0   0   0    0    0
The learner establishes the classification with the following definition.

Definition 3. Let r ∈ R. Then row(r) is the mapping f : W → {0, 1} such that f(w) = T(r, w) for w ∈ W. □

The equivalence relation =_π is defined for r, r′ ∈ R as

  r =_π r′ ⟺ row(r) = row(r′),

and π is defined as the classification induced by =_π. From the definition of row(·), all members of R which are of the form (ε, w, ε) are in the same equivalence class. Suppose that W1 ⊆ W2 ⊂ Σ*. Let π1 and π2 be the classifications which are made from W1 and W2, respectively. Then, it holds that π2 is finer than or equal to π1.
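The table-filling and classification step reduces to the following sketch, assuming the MEMBER oracle and the function short(·) are supplied as callables; the representation of classes as lists of triples is an assumption of the sketch.

```python
def classify(R, W, member, short):
    """Fill the table T of Section 6.1 with MEMBER queries and group the
    elements of R that share a row; returns the classes of pi."""
    def T(r, w):
        p, _, _ = r
        if not member(p + w + short(r)):
            return 0
        # every proper prefix of w must be rejected in the same context
        return int(all(not member(p + w[:k] + short(r)) for k in range(len(w))))

    rows = {}
    for r in R:
        rows.setdefault(tuple(T(r, w) for w in sorted(W)), []).append(r)
    return list(rows.values())
```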
6.2 The Procedure to Delete Incorrect Rules from P_suff/π
Now we consider a method to delete wrong rules from Psuf f /π. At first, the initial rule set of a hypothesis grammar is as follows. Psuf f = {(wp , wm , ws ) → a | (wp , wm , ws ) ∈ R, wm = a ∈ Σ } ∪ {(wp , awm , ws ) → a(wp a, wm , ws ) | (wp , awm , ws ), (wp a, wm , ws ) ∈ R, a ∈ Σ } ∪ {(wp , awm1 wm2 , ws ) → a(wp a, wm1 , wm2 ws )(wp awm1 , wm2 , ws ) | (wp , awm1 wm2 , ws ), (wp a, wm1 , wm2 ws ), (wp awm1 , wm2 , ws ) ∈ R, a ∈ Σ }, Psuf f /π = {B(A0 , π) → a | A0 → a ∈ Psuf f } ∪ {B(A0 , π) → aB(A1 , π) | A0 → aA1 ∈ Psuf f } ∪ {B(A0 , π) → aB(A1 , π)B(A2 , π) | A0 → aA1 A2 ∈ Psuf f }. The learner reduces them by the following three steps. step 1 deletes incorrect rules which are of the form A → aB. The learner deletes all rules in Psuf f /π which are of the form A → aB and meet the following condition. – There exists w ∈ W such that aw ∈ W and T (rA , aw) 6= T (rB , w) where rA ∈ B(rA , π) = A and rB ∈ B(rB , π) = B. This condition means that A → aB is incorrect when B can generate w but A cannot generate aw or vice versa. step 2 deletes incorrect rules which are of the form A → aBC. The learner deletes all rules in Psuf f /π which are of the form A → aBC and meet one of the following conditions. – There exists w ∈ W such that aw ∈ W , T (rA , aw) = 1 and T (rB , w1 ) = 0 or T (rC , w2 ) = 0 for any w1 , w2 ∈ W with w1 w2 = w. Here, rA ∈ B(rA , π) = A, rB ∈ B(rB , π) = B and rC ∈ B(rC , π) = C. In other words, aw should be derived from A but BC cannot generate w. – There exist w1 , w2 ∈ W such that aw1 w2 ∈ W , T (rA , aw1 w2 ) = 0 and T (rB , w1 ) = 1 and T (rC , w2 ) = 1. In other words, aw1 w2 should not be derived from A but BC generates w1 w2 . This condition means that the rule A → aBC is incorrect because of w. step 3 deletes rules which contain non-valid nonterminals.
We assume that P_suff/π has been processed by the above two steps.

Definition 4. We define the following:

  Σ_T(r) = {a ∈ Σ | T(r, a·w) = 1 for some a·w ∈ W},
  Σ_P(r) = {a ∈ Σ | B(r, π) → aβ is in P_suff/π for some β ∈ (R/π)*}.

If Σ_T(r) = Σ_P(r), then r is called valid. Suppose that γ = r1 r2 · · · rn ∈ R+; then γ is valid if ri is valid for every 1 ≤ i ≤ n. On the other hand, γ′ ∈ R+ is non-valid if it is not valid. □

Then the learner repeats the following process |P_suff/π| times.
1. Remove all rules which contain non-valid elements from P_suff/π.
2. For all r ∈ R, re-evaluate whether r is valid or not with the new P_suff/π, and go back to 1.
In other words, all rules which derive non-valid members of R are deleted.
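One reading of step 3 is the following fixed-point computation: repeatedly recompute the valid classes (those whose observed terminals Σ_T agree with the terminals Σ_P starting their surviving rules) and drop every rule whose right-hand side mentions a non-valid class. The data layout below (rules grouped by their left-hand class) is an assumption of this sketch.

```python
def prune_nonvalid(rules, classes, sigma_T):
    """Iterated pruning in the spirit of step 3 of Section 6.2.
    `rules`: dict mapping a class B to a set of productions (a, beta), beta a
    tuple of classes; `sigma_T`: dict mapping a class to its observed terminals."""
    for _ in range(sum(len(rs) for rs in rules.values()) + 1):
        valid = {B for B in classes
                 if sigma_T.get(B, set()) == {a for (a, _) in rules.get(B, set())}}
        pruned = {B: {(a, beta) for (a, beta) in rs
                      if all(C in valid for C in beta)}
                  for B, rs in rules.items()}
        if pruned == rules:              # fixed point reached
            return pruned
        rules = pruned
    return rules
```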
6.3 The Checking Procedure of π and P_suff/π
Now we have the classification π and the reduced rule set Psuf f /π. In this step, the learner checks them to make a set of simple deterministic grammars and to compare them each other. Assume an order over Psuf f /π denoted by ≤P , arbitrarily. Let P (A, a) be a rule A → aβ in Psuf f /π such that P (A, a) ≤P A → aγ for any γ ∈ (R/π)∗ . For a rule in Psuf f /π, say A → aβ, let G(A → aβ) = (R/π, Σ , PA→aβ , S0 ) be a simple deterministic grammar such that PA→aβ = {P (B, b) ∈ Psuf f /π | B 6= A, b 6= a} ∪ {A → aβ}, S0 = B((ε, w, ε), π), where w ∈ Q. If G(A → aβ) has nonterminals which are not reachable and live then delete all of them and all rules which contain them from G(A → aβ). Let G be the set of simple deterministic grammars such that G = {G(A → aβ) | A → aβ is in Psuf f /π}. In other words, there exists at least one grammar which contains A → aβ for every rule A → aβ in Psuf f /π. The checking procedure is as follows. For every A ∈ R/π and a ∈ Σ , 1. check the equivalence for LG (γ1 ) and LG (γ2 ), for every γ1 and γ2 such that both A → aγ1 and A → aγ2 are in Psuf f /π, and for every G ∈ G. 2. check the equivalence for LG1 (γ) and LG2 (γ), for every γ such that A → aγ is in Psuf f /π, and for every pair G1 and G2 of grammars such that both of them are in G. Formally, the checking procedure is shown in Fig. 1. Now we claim the following lemma.
Procedure check;
begin
  W′ := ∅;
  for all A ∈ R/π do
    for all a ∈ Σ do begin
      for all pairs of rules A → aγ1 and A → aγ2 in P_suff/π do
        for all G = (N, Σ, P, S) ∈ G do
          if γ1 ∈ N* and γ2 ∈ N* then begin
            check the equivalence of L_G(γ1) and L_G(γ2);
            if L_G(γ1) ≠ L_G(γ2) then
              W′ := W′ ∪ {aw}, where w ∈ (L_G(γ1) − L_G(γ2)) ∪ (L_G(γ2) − L_G(γ1));
          end
      for all pairs of grammars G1 = (N1, Σ, P1, S1) and G2 = (N2, Σ, P2, S2) in G do
        for all rules A → aγ in P_suff/π do
          if γ ∈ N1* and γ ∈ N2* then begin
            check the equivalence of L_{G1}(γ) and L_{G2}(γ);
            if L_{G1}(γ) ≠ L_{G2}(γ) then
              W′ := W′ ∪ {w}, where w ∈ (L_{G1}(γ) − L_{G2}(γ)) ∪ (L_{G2}(γ) − L_{G1}(γ));
          end
    end
  output W′;
end.

Fig. 1. The correctness checking procedure
Lemma 2. Let B(r, π) ∈ R/π and w ∈ W. For any G1 ∈ G, it holds that

  T(r, w) = 1 ⟺ B(r, π) ⇒*_{G1} w.

Proof: We prove this lemma by induction on the length of w.

Assume that |w| = 1, i.e. w = a ∈ Σ. From the definition of T, if T(r, a) = 1 then it holds that T(r, u) = 0 for any u ∈ W such that |u| ≥ 2 and whose first symbol is a. Now, since r is valid, there exists r′ ∈ B(r, π) which is of the form (p_{r′}, a, s_{r′}), where p_{r′}, s_{r′} ∈ Σ*. It implies that B(r′, π) → a is in P_suff/π and B(r, π) ⇒_{G1} a. Conversely, if B(r, π) ⇒_{G1} a then B(r, π) → a is in P_suff/π. It implies that there exists r′ ∈ B(r, π) which is of the form (p_{r′}, a, s_{r′}) for p_{r′}, s_{r′} ∈ Σ*, i.e. T(r′, a) = 1. Thus it holds that T(r, w) = 1 iff B(r, π) ⇒*_{G1} w.

Now suppose that w = a1 a2 · · · an for ai ∈ Σ, i = 1, 2, · · · , n, and that this lemma holds for any u ∈ W such that |u| ≤ n − 1.
Since r is valid, if T(r, w) = 1 then there exists r′ ∈ B(r, π) such that B(r′, π) → a1β is in P_suff/π for some β ∈ (R/π)+. Then it suffices to consider the following two cases.

1. In the case that β = B(s1, π): then B(r′, π) ⇒*_{G1} a1 a2 · · · an holds, because T(s1, a2 a3 · · · an) = 1 ⟺ B(s1, π) ⇒*_{G1} a2 a3 · · · an by the assumption of the induction.
2. In the case that β = B(s1, π)B(s2, π): then B(r′, π) ⇒*_{G1} a1 a2 · · · an holds, because T(s1, a2 a3 · · · am) = 1 ⟺ B(s1, π) ⇒*_{G1} a2 a3 · · · am and T(s2, a_{m+1} a_{m+2} · · · an) = 1 ⟺ B(s2, π) ⇒*_{G1} a_{m+1} a_{m+2} · · · an for some m such that 2 ≤ m < n, by the assumption of the induction.

Conversely, if B(r, π) ⇒*_{G1} w then there exists a rule B(r, π) → a1β in P_suff/π for some β ∈ (R/π)+. In addition, β ⇒*_{G1} a2 a3 · · · an by the assumption of the induction. It suffices to consider the following two cases.

1. In the case that β = B(s, π): then T(s, a2 a3 · · · an) = 1. It implies that T(r, w) = 1, because B(r, π) → a1β is not deleted by the procedure in Section 6.2.
2. In the case that β = B(s1, π)B(s2, π): then it holds that T(s1, a2 a3 · · · am) = 1 and T(s2, a_{m+1} a_{m+2} · · · an) = 1 for some m such that 2 ≤ m < n, because B(r, π) → a1β is not deleted by the procedure in Section 6.2.

Thus this lemma holds. □
The thickness of any G ∈ G is less than or equal to the length of the longest word in Q, by Lemma 2 above. Thus "checking the equivalence" in Fig. 1 is solvable in time polynomial in |R/π|, |Σ| and the length of the longest word in Q, by Corollary 1. The correctness checking procedure in Fig. 1 checks whether the plural rules in P_suff/π which have A on the left-hand side and a at the top of the right-hand side are equivalent or not. If the learner finds two rules which are not equivalent, then all subwords of a witness for them are added to the set W of the observation table. Then π becomes finer and |P_suff/π| decreases. If the output of the algorithm in Fig. 1 is empty, i.e. W′ = ∅, then all G ∈ G are equivalent. Now we claim the following lemmas.
Lemma 3. When the result of the procedure in Fig. 1 is W′ = ∅, then any G1 ∈ G satisfies L(G1) = Lt.

Proof: From Lemma 1, for any A (≠ St) ∈ Nt there exists r = (p, m, s) ∈ R such that St ⇒*_{Gt} pAs ⇒*_{Gt} pms. We denote such an r by r_A in the rest of this proof. It is sufficient to show the following claim.

Claim. It holds that w ∈ L_{Gt}(A) iff w ∈ L_{G1}(B(r_A, π)) for any A ∈ Nt, w ∈ Σ+ and G1 ∈ G.

(Proof of claim) We prove this claim by induction on the length of w.

Base step: Assume that w = a ∈ Σ, i.e. |w| = 1. It holds that Σ ⊆ W. Then, from Lemma 2, it holds that T(r_A, a) = 1 iff B(r_A, π) ⇒*_{G1} a for any G1 ∈ G. It implies that this claim holds.
Induction step: Let |w| = n and w = a1 a2 · · · an, where ai ∈ Σ for 1 ≤ i ≤ n. Assume that this claim holds for any w′ ∈ Σ+ such that |w′| ≤ n − 1. Let A → a1α be in Pt and α = A1 · · · Ak where Ai ∈ Nt for 1 ≤ i ≤ k ≤ 2. Let γ1 = rA1 · · · rAk where rAi ∈ R for 1 ≤ i ≤ k, and α1 = B(rA1, π) · · · B(rAk, π). Then every rAi is valid for 1 ≤ i ≤ k. From the definition of G, there exists G1 ∈ G which has a rule B(rA, π) → a1α1. From the assumption that W′ = ∅, it holds that LG1(α1) = LG(α1) for any G ∈ G. For any α2 such that B(rA, π) → a1α2 is in Psuff/π, it holds that LG1(α1) = LG1(α2) from W′ = ∅. On the other hand, from the assumption of this induction, |a2 · · · an| ≤ n − 1 implies that a2 · · · an ∈ LGt(A1 · · · Ak) iff a2 · · · an ∈ LG1(α1). Thus for any w′ ∈ Σ+ such that |w′| ≤ n − 1 and for any G ∈ G, it holds that a1w′ ∈ LG(B(rA, π)) iff a1w′ ∈ LGt(A). The above argument holds for any rule in Pt. If no rule of the form A → a1α is in Pt, then there is no rule of the form B(rA, π) → a1γ in Psuff/π because rA is valid. Thus w ∈ LGt(A) ⇐⇒ w ∈ LG(B(rA, π)) for any G ∈ G and for any w ∈ Σ+ such that |w| = n. Hence this claim holds. (End of claim) □

Lemma 4. We assume that the result of the procedure in Fig. 1 is W′ ≠ ∅. Let V = W ∪ {v ∈ Σ* | uvw ∈ W′, u, w ∈ Σ*}, here W is a part of the observation table. Let P′suff/π′ be the reduced set of rules which is constructed from the observation table which consists of R and V. Then there exists r ∈ R such that |{B(r, π) → aα in Psuff/π}| > |{B(r, π′) → aα in P′suff/π′}|.
Proof: It holds that |{B(r, π) → aα in Psuff/π}| decreases when W of the observation table increases. Suppose that this lemma does not hold. Then Psuff/π = P′suff/π′ and the learner makes the same G from Psuff/π = P′suff/π′. It implies that at least one of the following two cases holds.
1. There exists G ∈ G such that u′ ∈ LG(β1) and u′ ∉ LG(β2), where au′ ∈ W′ and both B(r, π) → aβ1 and B(r, π) → aβ2 are in Psuff/π for some r ∈ R. We assume that r is of the form (pr, amr, sr) where pr, mr, sr ∈ Σ*. Let β1 = B(r11, π)B(r12, π) · · · B(r1i, π) and β2 = B(r21, π)B(r22, π) · · · B(r2j, π). Then from Lemma 2, it holds that T(r11, u′1) = 1, · · · , T(r1i, u′i) = 1 for some u′1 · · · u′i = u′, and there exists j′ (1 ≤ j′ ≤ j) such that T(r2j′, u′j′) = 0 for any separation of u′ such that u′1 · · · u′j = u′. On the other hand, if T(r, a·u′) = 0 then it implies that B(r, π) → aB(r11, π) · · · B(r1i, π) is not in Psuff/π. If T(r, a·u′) = 1 then it implies that B(r, π) → aB(r21, π) · · · B(r2j, π) is not in Psuff/π. It is a contradiction.
2. There exist A → aγ in Psuff/π, G1 ∈ G and G2 ∈ G such that u′ ∈ LG1(γ) and u′ ∉ LG2(γ), where u′ ∈ W′. It suffices to consider the following two cases.
   a) If γ = B(r1, π) ∈ R/π then it holds that T(r1, u′) = 1 ⇐⇒ u′ ∈ LG(B(r1, π)) for any G ∈ {G1, G2} ⊆ G from Lemma 2. It is a contradiction.
   b) If γ = B(r1, π)B(r2, π) ∈ (R/π)² then, from Lemma 2, it holds that T(r1, u′1) = 1 ⇐⇒ u′1 ∈ LG(B(r1, π)) and T(r2, u′2) = 1 ⇐⇒ u′2 ∈ LG(B(r2, π)) for any G ∈ {G1, G2} ⊆ G and any u′1, u′2 ∈ Σ* such that u′1u′2 = u′. It is a contradiction.
Both cases lead to a contradiction. Thus this lemma holds. □
The whole learning algorithm is shown in Fig. 2. Obviously, any rule which corresponds to a rule in Pt is not deleted from Psuff/π. Thus the hypothesis grammar generates the target language.
7 Correctness and the Time Complexity of the Learning Algorithm
Lemma 5. The learning algorithm terminates in time polynomial in |Nt|, |Σ|, |Q| and the length of the longest word in Q.
Proof: We denote the length of the longest word in Q by lq. At the beginning of the learning algorithm,
– each of |R| and |W| is at most (1/2) lq(lq + 1)|Q|, and
– |Psuff| is at most |R|(|R|² + |R|), which is O(lq⁶|Q|³).
Procedure learning(INPUT, OUTPUT);
INPUT : Q : a representative sample of Lt;
OUTPUT : Gh : the correct hypothesis grammar;
begin
  R := {(x, y, z) | z ∈ Σ*, x, y ∈ Σ+, x·y·z ∈ Q} ∪ {(ε, w, ε) | w ∈ Q};
  W := {y ∈ Σ+ | x, z ∈ Σ*, x·y·z ∈ Q};
  make the initial rule set Psuff;
  for i = 1 to |R|³ + 2|R|² + 1 do
    fill the observation table and establish the classification π;
    delete incorrect rules from Psuff/π with step 1, step 2 and step 3;
    find P(A, a) for all A ∈ R/π and a ∈ Σ;
    make the set of grammars G;
    call Procedure check;
    if (W′ = ∅) then
      output any G ∈ G and terminate;
    endif
    for all w ∈ W′ do
      W := W ∪ {y ∈ Σ+ | x, z ∈ Σ*, x·y·z = w};
    done
  done
  output any G ∈ G;
end.
Fig. 2. The learning algorithm
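The exact definitions of the table entries T(r, w) and of the classification π are given in the earlier sections of the paper. Purely as an illustration of the first two steps of the main loop, the sketch below assumes that T(r, w) for r = (p, m, s) is obtained by a membership query on the string p·w·s and that π groups triples with identical table rows; both are assumptions of this sketch, not statements of the paper.

def fill_observation_table(R, W, member):
    """Hedged sketch: T[(r, w)] = 1 iff the string assembled from the
    context triple r = (p, m, s) and the column word w is accepted by
    the membership oracle `member`.  Contexts are plain strings, with
    '' standing for the empty word."""
    T = {}
    for (p, m, s) in R:
        for w in W:
            T[((p, m, s), w)] = 1 if member(p + w + s) else 0
    return T

def classify(R, W, T):
    """Group triples whose table rows coincide; each block plays the
    role of a candidate nonterminal B(r, pi).  This mirrors 'establish
    the classification pi' in Fig. 2 under the stated assumption."""
    blocks = {}
    for r in R:
        row = tuple(T[(r, w)] for w in sorted(W))
        blocks.setdefault(row, []).append(r)
    return list(blocks.values())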
Finding the value of T of an observation table for r ∈ R and w ∈ W takes at most lq + |w| time, where w is the longest word in W. Thus filling an observation table takes at most |R||W|(lq + |w|) time. Also, the classification π is established in O(|R||W|(lq + |w|)) time. Obviously, |Psuff/π| ≤ |Psuff|, thus the time complexity of the deletion procedure of Psuff/π is as follows.
– "step 1" takes at most |Psuff/π||W|²|R|² time.
– "step 2" takes at most |Psuff/π|(2|W|²|R|³) time.
– "step 3" takes at most |Psuff/π|(|R||W| + |R||Psuff/π| + |R||Psuff/π|) time. This is because, for all r ∈ R, to find ΣT(r) takes at most |W| time and to find ΣP(r) takes at most |Psuff/π| time.
It takes |Psuff/π| time to decide P(A, a) for every A ∈ R/π and a ∈ Σ. Thus it takes O(|Psuff/π|) time to generate G, and |G| is at most |Psuff/π|. The size of every grammar in G is bounded by |Psuff/π| because the set of nonterminals and the set of rules are a subset of R/π and Psuff/π, respectively. In addition, the thickness of every grammar in G is bounded by lq because every nonterminal in G ∈ G can generate a subword of the representative sample Q. The time complexity of the algorithm in Fig. 1 is bounded by
O(|R||Σ|(|Psuff/π|²|G|fEQ + |G|²|Psuff/π|fEQ)),
here fEQ is the time complexity of the equivalence checking procedure for simple deterministic grammars in Theorem 1. Then it holds that fEQ is bounded by a polynomial of |R|, |Σ| and lq. The size of |W′|, the output of the algorithm in Fig. 1, is bounded by a polynomial of |Psuff/π|, |R|, |Σ| and lq. Also, the length of the longest word in W′ can be bounded by a polynomial of |R|, |Σ| and lq from Corollary 1. Thus the length of the longest word in W, denoted by |w|, is bounded by a polynomial of |R|, |Σ| and lq. Now, the main loop of the learning algorithm is executed at most |R|³ + 2|R|² + 1 times. This is because |Psuff| is bounded by |R|² + 2|R| at the beginning and is monotonically decreasing by Lemma 4. Thus |W| is bounded by a polynomial of |R| and |w|. □

Lemma 6. The learning algorithm outputs a correct hypothesis.
Proof: If the algorithm terminates then W′ = ∅ or Psuff/π = ∅ from Lemma 4. From the assumption that Lt ≠ ∅, it holds that Psuff/π ≠ ∅. It implies that the hypothesis grammar is correct from Lemma 3. □
Summarizing all the above, we have proved Theorem 2.
8 Conclusions
In this paper, we have shown that simple deterministic languages are polynomial time learnable with membership queries and a representative sample. The time complexity of the learning algorithm is bounded by a polynomial in the size of the grammar which generates the target language, the cardinality of the representative sample and the length of the longest word in the representative sample.
References
1. Angluin, D.: A note on the number of queries needed to identify regular languages. Inf. & Cont. 51 (1981) 76–87
2. Angluin, D.: Learning regular sets from queries and counterexamples. Inf. & Comp. 75 (1987) 87–106
3. Hopcroft, J. E., Ullman, J. D.: Introduction to Automata Theory, Languages, and Computation. Addison-Wesley (1979)
4. Ishizaka, H.: Polynomial time learnability of simple deterministic languages. Machine Learning 5 (1990) 151–164
5. Korenjak, A. J., Hopcroft, J. E.: Simple deterministic languages. Proc. IEEE 7th Annu. Symp. on Switching and Automata Theory (1966) 36–46
6. Sakamoto, H.: Language learning from membership queries and characteristic examples. Proc. 6th Int. Conf. on Algorithmic Learning Theory - ALT '95: LNAI 997 (1995) 55–65
7. Wakatsuki, M., Tomita, E.: An improved branching algorithm for checking the equivalence of simple DPDA's and its worst-case time complexity (in Japanese). Trans. of IEICE J74-D-I (1991) 595–603
8. Yokomori, T.: Learning simple languages in polynomial time. Proc. SIG-FAI, JSAI 8801-3 (1988) 21–30
Improve the Learning of Subsequential Transducers by Using Alignments and Dictionaries
Juan Miguel Vilar*
Unidad Predepartamental de Informática, Universitat Jaume I, E12071 Castellón (Spain)
email: [email protected]
Summary. Subsequential transducers are finite state models that can be successfully employed in small to medium sized translation tasks. Among other advantages, they can be automatically inferred from training samples. We present a way of incorporating, in the inference algorithm, information that can be obtained by means of statistical translation models. Keywords: Automatic translation, subsequential transducers, learning in the limit, statistical translation models
1 Introduction
An approach for building an automatic translation system for a limited domain task is to develop a model from a representative set of translation examples. Two main advantages are expected from this approach: lower development costs, since the intervention of experts is less necessary (only a sufficient set of examples is required); and a system that is easily adaptable (should new needs arise, it would suffice to provide the system with additional examples meeting the new requirements). The main drawback is that the training sets needed in order to obtain good models may be too large. Subsequential transducers have been successfully used under this framework. These are finite state models that have interesting properties for different computational linguistic tasks [7], they can be easily integrated with speech recognition systems [9], and it is possible to use machine learning algorithms for building them from training samples [8]. Another approach is to use statistical models like those proposed in [2]. In this paper we propose a new algorithm, OMEGA (for OSTIA Modified for Employing Guarantees and Alignments), that modifies a subsequential transducer inference algorithm (OSTIA) so that alignment and dictionary information obtained with a statistical model can be employed in order to improve the models obtained. This combines the advantages of subsequential transducers with the more linguistically oriented statistical models.
* Work partially funded by the European Union under grant IT-LTR-OS-30268.
2 Basic Concepts
Here we present a brief introduction to some basic concepts from automata theory in order to fix the notation.
2.1 Alphabets and Strings
An alphabet X is a nonempty finite set of symbols. We use X* to represent the free monoid of the strings over X. Any subset of X* is called a (formal) language. Given two strings x̄, x̄′ ∈ X*, their concatenation is written as x̄x̄′. The symbol λ represents the empty string. Calligraphic uppercase letters like X and Y represent alphabets, individual symbols are represented by lowercase letters like x and y, and strings are denoted by lowercase letters with a bar like x̄ and ȳ. The length of x̄ is |x̄|. We refer to the individual elements of the strings by means of subindices, so that x̄ = x1 . . . xn. Substrings are denoted by x̄_i^j = xi . . . xj. The special notation y ∈ ȳ means that there exists at least one j such that y = yj. Given three strings x̄, ȳ and z̄ such that x̄ = ȳz̄, we say that ȳ is a prefix of x̄ and that z̄ is a suffix of x̄. From a formal point of view, the words of a language can be considered as the symbols of an alphabet and the sentences as the strings, so during the rest of the paper we treat symbols and strings as synonyms of words and sentences.
2.2 Deterministic Finite State Automata
A deterministic finite state automaton (DFA) is a finite state machine that recognizes languages over a finite alphabet. Formally, a deterministic finite state automaton is a tuple (X, Q, q0, δ, F) where X is an alphabet, Q is a finite set of states, q0 ∈ Q is the initial state, δ : Q × X → Q is the transition function¹, and F ⊆ Q is the set of final states. As usual, we extend δ to strings by defining δ(q, λ) = q and δ(q, x̄a) = δ(δ(q, x̄), a), for any state q, string x̄ and symbol a. A string x̄ is said to be accepted by an automaton if and only if δ(q0, x̄) ∈ F. The language accepted by an automaton is the set of strings accepted by it.
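As a concrete reading of this definition, the following is a minimal sketch of a DFA together with the extended transition function; the class and the example automaton are illustrative and are not taken from the text.

class DFA:
    """Minimal illustration of the tuple (X, Q, q0, delta, F)."""

    def __init__(self, delta, q0, finals):
        self.delta = delta      # dict: (state, symbol) -> state (partial)
        self.q0 = q0            # initial state
        self.finals = finals    # set of final states

    def extended_delta(self, q, word):
        # delta(q, lambda) = q; delta(q, xa) = delta(delta(q, x), a)
        for a in word:
            q = self.delta.get((q, a))
            if q is None:       # undefined transition: reject
                return None
        return q

    def accepts(self, word):
        return self.extended_delta(self.q0, word) in self.finals

# Example: automaton accepting strings over {a, b} that end in 'b'.
dfa = DFA({('p', 'a'): 'p', ('p', 'b'): 'q',
           ('q', 'a'): 'p', ('q', 'b'): 'q'}, 'p', {'q'})
assert dfa.accepts('aab') and not dfa.accepts('aba')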
2.3 Subsequential Transducers
There are different ways to extend DFAs so that they perform translations. We work with Subsequential Transducers (SSTs). These can be seen as DFAs with an output string associated to each edge and final state. Given an input string, the SST starts in the initial state and processes the input following a path that leads to a final state. The output associated to the input string is the concatenation of a common prefix, the output associated to the edges traversed, and the output associated to the final state reached.
¹ In this paper, we use function with the meaning of partial function.
[Figure: transducer diagram with states A–E and edges labelled un/a, cuadrado/λ, triángulo/λ, grande/large square, grande/large triangle; final-state outputs square, triangle, λ.]
Fig. 1. An example of subsequential transducer. The initial state has an arrow pointing to it and final states are marked by double-circling
Formally, a Subsequential Transducer is a tuple H = (X, Y, Q, q0, π̄, E, σ) where X and Y are the input and output alphabets, Q is a finite set of states, q0 ∈ Q is the initial state, E ⊆ Q × X × Y* × Q is the set of edges and σ : Q → Y* is the state emission function. Those states for which σ is defined are called final states. The set of edges satisfies the determinism condition: if the edges (p, x, ȳ, q) and (p, x, ȳ′, q′) belong to E, then ȳ = ȳ′ and q = q′. A path in an SST H is a string c̄ = {(pi, xi, ȳi, qi)}_{i=1}^{|c̄|} of E* such that qi = pi+1 for every i between 1 and |c̄| − 1. The input of the path c̄ is the string c̄X = x1 . . . x|c̄|; analogously, the output of c̄ is the string c̄Y = ȳ1 . . . ȳ|c̄|. When referring to the input (output) of the edge ci, we will write ci,X (ci,Y). The set of the paths departing from p and arriving at q is denoted by p ⇝ q. Given an input string x̄ ∈ X*, we can associate to it the output string π̄ȳσ(q) iff there exists a path c̄ ∈ q0 ⇝ q such that c̄X = x̄ ∧ c̄Y = ȳ and q is final. So, a subsequential transducer H defines a (partial) function |H| : X* → Y*.
A very simple SST can be seen in Fig. 1. Each state is represented by a node in the graph. The initial state is marked by an arrow pointing to it. Final states are double circled and their output is written in them. The edges of the SST correspond to edges in the graph. This transducer correctly translates the four Spanish sentences un triángulo, un cuadrado, un triángulo grande, and un cuadrado grande into English. Note how the empty string allows the transducer to cope with the different ordering of nouns and adjectives between English and Spanish.
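The translation process just described can be made concrete with a small sketch; the transducer built below reproduces the behaviour described for Fig. 1, with state names and the exact edge set taken as an illustration rather than from a formal specification in the text.

class SST:
    """Minimal sketch of a subsequential transducer: deterministic edges
    carrying output strings, plus a state emission function sigma."""

    def __init__(self, edges, sigma, q0, prefix=''):
        self.edges = edges    # dict: (state, input word) -> (output, state)
        self.sigma = sigma    # dict: final state -> output string
        self.q0 = q0
        self.prefix = prefix  # common output prefix (pi bar)

    def translate(self, sentence):
        q, out = self.q0, [self.prefix]
        for word in sentence.split():
            if (q, word) not in self.edges:
                return None            # input not in the domain of |H|
            emitted, q = self.edges[(q, word)]
            out.append(emitted)
        if q not in self.sigma:
            return None                # stopped in a non-final state
        out.append(self.sigma[q])
        return ' '.join(w for w in out if w)

# A transducer with the behaviour described for Fig. 1.
t = SST(edges={('A', 'un'): ('a', 'B'),
               ('B', 'cuadrado'): ('', 'C'),
               ('B', 'triángulo'): ('', 'D'),
               ('C', 'grande'): ('large square', 'E'),
               ('D', 'grande'): ('large triangle', 'E')},
        sigma={'C': 'square', 'D': 'triangle', 'E': ''},
        q0='A')
assert t.translate('un cuadrado grande') == 'a large square'
assert t.translate('un triángulo') == 'a triangle'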
2.4 The Problem of Inference
The problem we face is the inference of a model from a corpus of pairs of sentences, where the second component of each pair is the translation of the first. The model of translation that we will use is the subsequential transducer, so our problem can be stated as follows: Given a corpus C, find a subsequential transducer H such that the behaviour of H is compatible with C (i.e. ȳ = |H|(x̄) for all (x̄, ȳ) in C), and H is an "adequate" generalization of C. The second condition captures the essence of learning. The real problem is to find an adequate definition of adequate. We will follow the definition found in [8], which is an adaptation of the definition of "learning in the limit" proposed
in [6]. In essence, a learning algorithm identifies in the limit a class of functions if for any function of the class there exists a finite set of examples such that the algorithm outputs a correct representation of the function when taking as input those examples or any superset of them.
3 Inference of SSTs: The OSTIA Algorithm
The class of total subsequential functions can be identified in the limit by the Onward Subsequential Transducer Inference Algorithm (OSTIA), created by Oncina. A detailed description of this algorithm can be found in [4] or [8]; we give here only a brief description of it. In outline, the algorithm first creates a transducer with the shape of a tree for representing the corpus and then it traverses the tree in order to find possible merges of states that generalize the training set.
Initial Corpus Representation. The initial representation of the corpus is done in the form of a tree in which every pair of the corpus is represented. The states are the prefixes of the input strings. Edges connect each state to those that are obtained by adding a word to the corresponding prefix. This way, every state will be reached by exactly one string, the one that corresponds to its name. The output is placed as early as possible: the common prefix is the longest common prefix of all the output strings, the output of the edges is assigned so that the output of a prefix x̄ is the longest common prefix of the outputs of all the input strings in the corpus that begin with x̄, and the output of the states is the suffix of the output string not produced in the edges (a sketch of this construction is given after condition (1) below).
Traversal of the Tree. The states of the tree are visited following the order obtained by comparing the strings leading to them first by length and then lexicographically. The current state is compared to those that precede it and, in case it is possible, merged with the first state that precedes it and is found to be compatible.
Merging of States. The criterion for deciding state compatibility has two parts: the states must be locally compatible and the merges they induce must also be between pairs of compatible states. Local compatibility means that the output of the strings arriving at those states must be preserved; this can be achieved if at least one has no output or when the outputs of both are the same. Formally:
p and q are locally compatible iff ∄σ(p) ∨ ∄σ(q) ∨ σ(p) = σ(q).   (1)
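The following sketch illustrates the initial, onward tree representation described above (the corpus as a tree whose outputs are placed as early as possible). It is only an illustration under simplifying assumptions, with names chosen for readability; it is not OSTIA itself.

def onward_prefix_tree(corpus):
    """Hedged sketch of the initial onward tree: states are input
    prefixes (tuples of words), each edge emits the part of the longest
    common prefix that becomes determined at its target state, and final
    states keep the remaining suffix of the output."""
    pairs = [(tuple(x.split()), tuple(y.split())) for x, y in corpus]

    def lcp(seqs):
        # longest common prefix of a list of word tuples
        out = []
        for items in zip(*seqs):
            if len(set(items)) != 1:
                break
            out.append(items[0])
        return tuple(out)

    def out_up_to(prefix):
        # lcp of the outputs of all inputs that start with `prefix`
        return lcp([y for x, y in pairs if x[:len(prefix)] == prefix])

    common_prefix = out_up_to(())
    edges, sigma = {}, {}
    for x, y in pairs:
        for i in range(len(x)):
            p, q = x[:i], x[:i + 1]
            # edge output: what is common at q but was not yet emitted at p
            edges[(p, x[i])] = (out_up_to(q)[len(out_up_to(p)):], q)
        sigma[x] = y[len(out_up_to(x)):]   # state output: leftover suffix
    return common_prefix, edges, sigma

# Tiny usage example with two translation pairs.
cp, E, S = onward_prefix_tree([('un cuadrado', 'a square'),
                               ('un cuadrado grande', 'a large square')])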
After two states have been seen to be locally compatible, the induced merges are tried. This leads to a sequence of merges in which the output of some arcs or states may be moved. In case one of the induced merges is found to be impossible the original merge is discarded.
For an example of the need for new merges, consider this situation:
[Diagram omitted: two transducer fragments over states A–F, with arcs a/1, b/λ, b/0 and state outputs 0, 1, λ.]
We want to merge states A and D. After their union we arrive at
[Diagram omitted: the transducer obtained after merging A and D.]
There are now two arcs with input symbol a departing from A. In order to restore the determinism condition, it is necessary that they have the same output. So, part of the output of the arc from A to B is "pushed back" in order to make it equal to 1. This changes the output of B and of the arc leaving it. After these changes, B and E and finally C and F are merged, arriving at
0 A
4
b/0
0 B
1 C
4 Use of Domain and Range Information: OSTIA-DR
OSTIA identifies in the limit the class of total subsequential functions [8]. Identification of the class of partial subsequential functions can be achieved by OSTIA-DR, a modification also proposed by Oncina that makes use of a model for the domain and/or range of the function [4]. Both the domain and range are represented by deterministic finite state automata. Each state p of the initial tree can be labelled with D(p) and R(p), the states reached on the domain and range models when they are used to recognize the only string that leads to p. The new compatibility condition is obtained by augmenting (1) to give:
p and q are locally compatible iff (∄σ(p) ∨ ∄σ(q) ∨ σ(p) = σ(q)) ∧ D(p) = D(q) ∧ R(p) = R(q).   (2)
This condition implies that the merging of states does not change their labels, since they are equal to begin with. The only situation in which a label can change is when the output of an arc is pushed back, but in that case the updating of the labels is trivial.
5 Use of Information from Statistical Models
With the use of domain and range information, it is possible to identify in the limit the class of subsequential functions. But when trying to apply this result in practical translation tasks two main problems arise: the models of the input and output languages are not always available or they are poor; and the amount of samples needed to produce an acceptable model is large. When dealing with natural languages there are certain properties that can be exploited to alleviate these problems; we propose here the use of information gathered from statistical translation models in the form of dictionaries and alignments.
5.1 Dictionaries and Alignments
We define a dictionary D as a function from the output vocabulary into the power set of the input vocabulary, i.e. D : Y → 2^X. The value of D(y) is the set of words that must appear in the input for y to appear in the output. For example, imagine that we had D(car) = {coche, automóvil}. That would mean that for car to appear in the translation of a sentence, it is necessary that it contains either coche or automóvil. We use the convention that D(y) = ∅ if the word y can appear "spontaneously" in the output². We extend the dictionary to strings by defining D(ȳ) = ∪_{y∈ȳ} D(y). Finally, we use D⁻¹(x) as a shorthand for the set {y ∈ Y | x ∈ D(y)}. A pair (x̄, ȳ) is compatible with a dictionary D if and only if
∀y ∈ ȳ : D(y) = ∅ ∨ ∃x ∈ x̄ : x ∈ D(y).   (3)
This means that a pair is compatible with the dictionary if each word in the output sentence either appears spontaneously or is caused by one in the input sentence. The notion of compatibility can be extended to SSTs by defining an SST H to be compatible with a dictionary D if and only if for every path c̄ departing from q0 it is true that (c̄X, π̄c̄Y) is compatible with the dictionary, and for every sentence x̄ in the domain of H we have that (x̄, |H|(x̄)) is compatible with D. The concept of alignment we will use is similar to the one found in the IBM models (see [2]). An alignment is a function a from the positions in the output string into the positions in the input string.
² Think for instance of the word it when translating the Spanish llueve into the English it rains (verbs related to the weather have no formal subject in Spanish).
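Condition (3) is easy to check directly; a minimal sketch follows, where the dictionary is a mapping from output words to sets of possible input causes and the sample entries are purely illustrative.

def pair_compatible(x_words, y_words, D):
    """Check condition (3): every output word either may appear
    spontaneously (D(y) is empty) or has one of its possible causes
    among the input words."""
    xs = set(x_words)
    return all(not D.get(y, set()) or (D[y] & xs) for y in y_words)

# Hypothetical dictionary entries used only for illustration.
D = {'car': {'coche', 'automóvil'}, 'it': set()}
assert pair_compatible(['el', 'coche'], ['the', 'car'], D)
assert not pair_compatible(['la', 'casa'], ['car'], D)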
[Fig. 2 diagram omitted: a transducer with states A–E, edges un/a, cuadrado/λ, circulo/λ, grande/large square and grande/large circle, where each state is annotated with its G and N sets, e.g. G={a, square}, N={square} and G={a, circle}, N={circle} at the states reached after cuadrado and circulo.]
Fig. 2. An example of labelling for a transducer
We hope that the alignment relates the words so that xa(j) is the word that "causes" yj. Ideally, we expect that after seeing xa(j) it is safe to say that yj can be in the output. An alignment a of a pair (x̄, ȳ) is compatible with a dictionary D if and only if
∀j, 1 ≤ j ≤ |ȳ| : D(yj) = ∅ ∨ ∃i, 1 ≤ i ≤ a(j) : xi ∈ D(yj).   (4)
That is, every output word either can appear spontaneously or the position to which the alignment relates it has already seen one of its possible causes.
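Condition (4) can be checked in the same spirit; in the sketch below the alignment is a mapping from output positions to input positions (1-based, as in the text), and the dictionary entries are again illustrative.

def alignment_compatible(x_words, y_words, a, D):
    """Check condition (4): for each output position j, either y_j may
    appear spontaneously or one of its causes occurs in the input no
    later than position a(j)."""
    for j, y in enumerate(y_words, start=1):
        causes = D.get(y, set())
        if causes and not (causes & set(x_words[:a[j]])):
            return False
    return True

# Illustration with a hypothetical alignment (output pos -> input pos).
D = {'square': {'cuadrado'}, 'a': set(), 'large': {'grande'}}
x, y = ['un', 'cuadrado', 'grande'], ['a', 'large', 'square']
assert alignment_compatible(x, y, {1: 1, 2: 3, 3: 2}, D)
assert not alignment_compatible(x, y, {1: 1, 2: 2, 3: 2}, D)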
5.2 Guarantees and Needs
In order to use the dictionaries with OSTIA, we associate to the states of a transducer a pair of functions: the guarantees function G : Q → 2^Y and the needs function N : Q → 2^Y. The meaning of y ∈ G(p) is that it is safe that the output of any path departing from p contains y, since in every path of q0 ⇝ p there will be at least one word in D(y). The meaning of y ∈ N(p) is that every path in q0 ⇝ p must have an input word in D(y), since at least one of the paths departing from p will have y in its output and none of the words in D(y) in its input. Figure 2 can help in understanding these functions. Every path departing from state C can have in its output the word square since cuadrado is seen in every (in this case one) path arriving in C, so square is included in the set of guarantees for C. On the other hand, since circle is output in at least one path departing from D, it is included in the set of needs for D. We define G through the auxiliary function g, which corresponds to the guarantees for individual paths. Given a path c̄, the value of g(c̄) is the union of the possible words that can be output given the input words in the path, that is g(c̄) = ∪_{c∈c̄} D⁻¹(cX). Now, G(p) can be defined as the intersection of the corresponding g for every possible path ending in p:
G(p) = ∩_{c̄ ∈ q0 ⇝ p} g(c̄).   (5)
Analogously, for N we define the needs for a single path c̄ ending in a final state q as:
n(c̄, q) = (n(c̄₂., q) ∪ {y ∈ c1,Y | D(y) ≠ ∅}) − D⁻¹(c1,X)   if c̄ ≠ λ,
n(c̄, q) = {y ∈ σ(q) | D(y) ≠ ∅}   if c̄ = λ.   (6)
The first case is the general one: the needs of the path without the first arc are joined with the needs arising from this first arc, and the words that can be explained by the input of the arc are subtracted from that union. In case the path is empty, its needs are those derived from the output of the state. Now, N(p) is:
N(p) = ∪_{c̄ ∈ p ⇝ q, q final} n(c̄, q).   (7)
Obviously, we will not compute these functions explicitly after each merge; instead, these values will be computed when building the initial tree and they will be updated, if necessary, after successful merges [10]. The most important property of these functions is given by the following theorem:
Theorem 1. Let H = (X, Y, Q, q0, π̄, E, σ) be an SST and D a dictionary. If for every state p ∈ Q, N(p) ⊆ G(p) and D(π̄) = ∅, then H is compatible with D.
The proof can be found in [10]. The idea is that, by the construction of G and N, every path will contain in its input the words necessary for justifying its output.
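Equations (5)-(7) admit a direct recursive reading. The sketch below computes g for a single path and n for a path ending in a final state, with a path represented as a list of edges (p, x, output words, q); it is an illustration of the definitions, not of the incremental label updates the algorithm actually uses.

def d_inverse(D, x):
    """D^{-1}(x): the output words that have x among their possible causes."""
    return {y for y, causes in D.items() if x in causes}

def g_of_path(path, D):
    """g(c) = union of D^{-1}(c_X) over the edges of c; G(p) in (5) is the
    intersection of g over all paths reaching p."""
    out = set()
    for (_, x, _, _) in path:          # edge = (p, x, y_out, q)
        out |= d_inverse(D, x)
    return out

def n_of_path(path, sigma_q, D):
    """n(c, q) following equation (6), recursing on the tail of the path;
    N(p) in (7) is the union over paths from p to a final state."""
    if not path:                        # empty path: needs come from sigma(q)
        return {y for y in sigma_q if D.get(y, set())}
    (_, x, y_out, _), rest = path[0], path[1:]
    needs = n_of_path(rest, sigma_q, D) | {y for y in y_out if D.get(y, set())}
    return needs - d_inverse(D, x)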
6 Modifying OSTIA: The OMEGA Algorithm
The OMEGA, or Ω (for OSTIA Modified for Employing Guarantees and Alignments), algorithm uses the above ideas for building the initial tree representation and merging the states. The traversal of the tree follows the same order as in OSTIA.
Building of the Initial Tree. The information of the alignment is introduced in the initial tree representation by keeping the same structure for the tree but delaying the output of certain parts in order to avoid a word appearing earlier than its cause. This is accomplished by "pushing back" those output words that are aligned with later parts of the input. That is, if word yj is aligned to word xa(j) in one input pair, it will appear in the output of the arc corresponding to xa(j) or in a later one. After building the initial tree, each of the nodes is labelled with its values of the guarantees and needs functions G and N. The topology of the tree makes this labelling quite easy: G labels are assigned in a preorder traversal of the tree, while N labels are assigned in a postorder traversal.
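One simple way of realizing this delaying is sketched below: output word yj is emitted on the arc of input position max(a(1), ..., a(j)), so that it never precedes its aligned cause and the output order is preserved. This is only an illustration of the idea; the exact placement rule used by OMEGA may differ.

def delayed_outputs(x_words, y_words, a):
    """Assign each output word to an input arc no earlier than its aligned
    position; `a` maps output positions to input positions (1-based)."""
    per_arc = {i: [] for i in range(1, len(x_words) + 1)}
    latest = 1
    for j, y in enumerate(y_words, start=1):
        latest = max(latest, a[j])     # never move backwards: keep order
        per_arc[latest].append(y)
    return per_arc

# Yields un/['a'], cuadrado/[], grande/['large', 'square'] for this alignment.
print(delayed_outputs(['un', 'cuadrado', 'grande'], ['a', 'large', 'square'],
                      {1: 1, 2: 3, 3: 2}))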
Merging of States. The procedure for merging states needs to be modified in order to take into account the information provided by the dictionary. This modification has two main aspects: changing the condition for local compatibility and the updating of the labels after the merges. The merging of states p and q is not possible if N(p) ∪ N(q) ⊄ G(p) ∪ G(q), because this would violate the conditions of Theorem 1. Including this in (2) we obtain the new condition for local compatibility:
p and q are locally compatible iff (∄σ(p) ∨ ∄σ(q) ∨ σ(p) = σ(q)) ∧ D(p) = D(q) ∧ R(p) = R(q) ∧ N(p) ∪ N(q) ⊆ G(p) ∩ G(q).   (8)
When two states are joined, their labels need to be updated. In general, if p and q are the states to be joined, the new value of G(p) is G(p) ∩ G(q) and the new value of N(p) is N(p) ∪ N(q). Unlike in the case of OSTIA-DR, the states not intervening in the merging may need to have their labels updated, since the changes in the guarantees or needs of a state propagate through the paths leaving or entering it. Details can be found in [10].
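Condition (8) translates directly into a predicate over the state labels. In the sketch below, states are assumed to carry attributes sigma (None when the state is not final), dom and rng (the D and R labels) and the sets G and N; these attribute names are illustrative.

from collections import namedtuple

State = namedtuple('State', 'sigma dom rng G N')

def locally_compatible(p, q):
    """Condition (8): the OSTIA output test, equal domain and range
    labels, and the guarantees/needs test."""
    output_ok = p.sigma is None or q.sigma is None or p.sigma == q.sigma
    labels_ok = p.dom == q.dom and p.rng == q.rng
    dict_ok = (p.N | q.N) <= (p.G & q.G)
    return output_ok and labels_ok and dict_ok

# Tiny illustration with made-up labels.
p = State('square', 1, 2, {'a', 'square'}, {'square'})
q = State(None, 1, 2, {'a', 'square', 'large'}, set())
assert locally_compatible(p, q)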
7 Experiments
In this section we present some experiments carried out in order to evaluate the strengths and weaknesses of Ω. First, we carry out an extensive experiment comparing Ω with OSTIA in different aspects. After that, Ω is compared to another finite state inference technique proposed by Casacuberta.
7.1 The Corpus
We have tested the Ω algorithm on a corpus designed in the EuTrans project for the training of automatic translation systems [1]. The aim of the corpus is to reflect typical sentences that a tourist may say at the front desk of a hotel when visiting a foreign country. The situations considered are: asking for reservations; questions about rooms and the registration form; questions about departure; and a variety of other expressions like greetings, introductions, apologies, and so on. We have used a set of 80.000 Spanish to English pairs of the corpus for training the models and 3.897 sentences for testing them. The training and test sets were disjoint. The input and output vocabulary sizes are 689 and 514, respectively³. The average length of the input sentences is 9.7 and the average output length is 9.9. Bigrams were used as domain and range models. The alignments were trained using IBM's model 2 [2], slightly modified in order to allow some smoothing of the parameters, following [5]. After the training of the parameters, the Viterbi
³ This difference comes from the use of gender and number information in Spanish adjectives and the different verbal forms.
alignments were found and used as input for Ω. The dictionaries were induced from the alignments in a simple way: each word of the output vocabulary was assigned the words that were related to it by the alignments. This guarantees compatibility.
7.2 Error Correction
Typically, some of the test sentences will not be accepted by the transducers inferred from the training material. One approach to always obtaining a translation is to find a sentence that can be translated and is not too different from the original one. In our case, for the test sentence t̄, a sentence x̄ that belongs to the domain of the transducer and minimizes L(x̄, t̄) (the Levenshtein distance) is searched. The translation proposed for t̄ is that of x̄. In case t̄ is in the domain of the transducer, we have x̄ = t̄, as expected. There are at least two aspects to be considered for this effort in translating everything: although explicit models for the domain and range of the transducers are used, these models have been inferred from the training samples, so they are only approximations; and test data are assumed to be correct: deviations between the language models and the data are due to lack of training data or poor generalization of the model.
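The criterion can be made concrete with a word-level Levenshtein distance. The sketch below assumes, only for illustration, that the domain is available as a finite list of sentences; the paper instead performs this search with error correcting parsing over the inferred models.

def levenshtein(u, v):
    """Standard word-level edit distance between two token lists."""
    prev = list(range(len(v) + 1))
    for i, a in enumerate(u, start=1):
        cur = [i]
        for j, b in enumerate(v, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1]

def closest_in_domain(t, domain_sentences):
    """Pick the domain sentence minimizing L(x, t); its translation is
    then proposed for t."""
    return min(domain_sentences,
               key=lambda x: levenshtein(x.split(), t.split()))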
7.3 Category Labels
In order to improve the performance of the models, certain words and expressions were grouped together in so-called categories, following the approach in [1]. The process consists in substituting certain words in the corpus by category labels, training a transducer with this new corpus, and expanding in this transducer the arcs with categories using transducers for those categories. The categories chosen were: masculine and feminine names, surnames, dates, hours, room numbers, and general numbers. They were chosen because they are easy to recognize, their translation rules are simple, and the amount of special linguistic knowledge introduced is very low.
7.4 Results
The results obtained can be seen in Figs. 3 and 4. Sentence error rate is the percentage of sentences incorrectly translated, no matter the number of erroneous words. Word error rate is the percentage of erroneous words in the proposed translation. It can be seen that the word error rate is greatly reduced when Ω is used. This decrease is particularly acute when the number of training samples is relatively low. This was expected, since when the number of training samples grows the behaviour of OSTIA and Ω tends to be the same. Intuitively, when the number of samples is low the information obtained from the alignments plays an important role, while with enough samples the outputs themselves compensate for the lack of information.
[Figure 3 omitted: four plots of sentence error rate and word error rate against training set size (5.000 to 80.000), comparing OSTIA-DR and OMEGA, without and with categories.]
Fig. 3. Results obtained when using error correction. Note the difference in the scales
The sentence error rate is also lower when using Ω, except when categories are not used. This is a bit surprising, but it is due to the process of error correction and the characteristics of the automata obtained by Ω. Note that the use of guarantees and needs tends to make the inferred models stay on the "safe side": a word will not appear in the output unless it is clear that it should. This implies less state merging and leads to a lower generalization. The use of error correcting parsing smoothes this tendency, but on the word level. On the sentence level, error correction does not have enough power to recover entire sentences, since it only sees the input sentence and there are no provisions for reflecting the corrections in the output. This effect of error correction is analysed in Fig. 4. These results were obtained without error correction. The accuracy of Ω models on the sentences not rejected is much higher than that of the OSTIA models. We conclude that the difference in the total sentence error rate is due to this higher rejection of Ω models. And the differences in word error rate are even greater. The choice of whether or not to apply error correction will therefore depend on the intended use of the translations. A higher rejection rate can be an advantage if it is accompanied by much better results on those sentences actually translated. This is the case for instance if the translator is used in a country whose language is unknown to you: it is better to have no translation than to have a translation that you cannot rely on. On the other hand, if the output of the automatic translator is used as a first version for a human translator, the use of error correction
[Figure 4 omitted: plots of coverage, sentence error rate and word error rate against training set size (5.000 to 80.000), comparing OSTIA-DR and OMEGA.]
Fig. 4. Results obtained with categories and not using error correction. The coverage represents the percentage of test sentences that were translated. The sentence and word error rates are computed for the translated sentences. Note the difference in the scales
can be totally justified, and then a translation is always obtained, at the cost of having more errors.
7.5 Comparison with Other Methods of Inference
The work presented here is somewhat related to the one presented by Casacuberta in [3]. In order to compare the results, we present here some experiments done on the same corpora. A brief description of some features of the corpora used can be found in Table 1. For different reasons, these corpora are known as the EuTrans-I and EuTrans-II corpora. The EuTrans-I corpus is very similar to the first training corpus used in the previous sections, but the test set is different. The EuTrans-II corpus is completely unrelated, although it also has to do with hotel reception desk dialogues. The languages involved in this case were Italian and English and the acquisition was completely natural. The number of training sentences is comparatively low, which makes this task a very tough one. The results obtained can be seen in Table 2. This table presents the best results reported by Casacuberta in [3] and the corresponding results obtained by Ω. It can be seen that Casacuberta's technique (MGTI) is superior for the EuTrans-II task, while Ω works better for the EuTrans-I task. This can be explained by the different nature of Ω and MGTI. When the number of samples is enough, Ω can adequately move the output in the arcs, leading to good generalizations. MGTI is not able to do such movements. So in a relatively clean corpus with high regularities and enough samples, this characteristic favours Ω. When the training samples are exceedingly scarce, the ability of MGTI to smooth the models obtained wins. It seems that the kind of smoothing provided by the use of Error Correcting Parsing is not enough to compensate for the models obtained by Ω.
Table 1. Some features of the corpora used for further experimentation

  Eutrans-I              Spanish   English
  Training: Sentences     10,000    10,000
            Words         97,131    99,292
            Vocabulary       686       513
  Test:     Sentences      2,996     2,996
            Words         35,023    35,590

  Eutrans-II             Italian   English
  Training: Sentences      3,038     3,038
            Words         55,302    64,176
            Vocabulary     2,459     1,712
  Test:     Sentences        300       300
            Words          6,121     7,243

Table 2. Comparison of experimental results between Ω and MGTI

         Eutrans-I   Eutrans-II
  MGTI       9.7        27.2
  Ω          7.0        37.6
8 Conclusions
Subsequential transducers are an attractive option for small to medium size translation tasks. They can be derived from finite sets of samples by using the Ω algorithm, presented here, which modifies OSTIA (an earlier algorithm by Oncina) so that it uses information gathered by statistical models in the form of alignments and dictionaries, yielding an important reduction in the number of samples needed to achieve an adequate level of accuracy. The models inferred using Ω are more accurate: the translations of those sentences directly accepted by the models have fewer errors than the translations obtained by OSTIA models, and the difference is bigger when the number of training sentences is low. The price to pay is a slightly smaller number of translated sentences. When error correction is employed in order to translate every possible sentence, the word error rates obtained by Ω models are still lower than those of OSTIA models; the difference in sentence error rates is not so big, since error correction does not use information about the translation process. In comparison with other methods for finite state inference, the results are better when the corpus is large enough, but on small corpora where smoothing is crucial other methods may be better.
References
1. J. C. Amengual, J. M. Benedí, F. Casacuberta, A. Castaño, A. Castellanos, D. Llorens, A. Marzal, F. Prat, E. Vidal, and J. M. Vilar. Using categories in the EuTrans system. In Steven Krauwer, Doug Arnold, Walter Kasper, Manny Rayner, and Harold Somers, editors, Proceedings of the Spoken Language Translation Workshop, pages 44–53, Madrid (Spain), July 1997. Association of Computational Linguistics and European Network in Language and Speech.
2. Peter F. Brown, Stephen A. Della Pietra, Vincent J. Della Pietra, and Robert L. Mercer. The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics, 19(2):263–311, June 1993.
3. Francisco Casacuberta. Inference of finite-state transducers by using regular grammars and morphisms. In Proceedings of ICGI 2000, Lecture Notes in Artificial Intelligence. Springer-Verlag, September 2000.
4. A. Castellanos, E. Vidal, M. A. Varó, and J. Oncina. Language understanding and subsequential transducer learning. Computer Speech and Language, 12:193–228, 1998.
5. Ismael García-Varea, Francisco Casacuberta, and Hermann Ney. An iterative, DP-based search algorithm for statistical machine translation. In Proceedings of the ICSLP'98, volume 4, pages 1135–1138, Sydney (Australia), December 1998.
6. E. Mark Gold. Language identification in the limit. Information and Control, 10:447–474, 1967.
7. Mehryar Mohri. Finite-state transducers in language and speech processing. Computational Linguistics, 23(2):269–311, June 1997.
8. José Oncina, Pedro García, and Enrique Vidal. Learning subsequential transducers for pattern recognition interpretation tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(5):448–458, May 1993.
9. Juan Miguel Vilar, Victor Manuel Jiménez, Juan Carlos Amengual, Antonio Castellanos, David Llorens, and Enrique Vidal. Text and speech translation by means of subsequential transducers. In András Kornai, editor, Extended Finite State Models of Language, Studies in Natural Language Processing, pages 121–139. Cambridge University Press, 1999.
10. Juan Miguel Vilar Torres. Aprendizaje de Traductores Subsecuenciales para su empleo en tareas de dominio restringido (in Spanish). PhD thesis, Departamento de Sistemas Informáticos y Computación, Universidad Politécnica de Valencia, Valencia (Spain), 1998.
Author Index
Amengual, Juan Carlos Arikawa, Setsuo 241 Arimura, Hiroki 241 Benedí, José-Miguel
51
Emerald, J.D.
196 Nakamura, Katsuhiko 186 Nevado, Francisco 196 Parekh, Rajesh
Rico-Juan, Juan R. Ruiz, J. 115
65
Fernau, Henning 75 Florêncio, Christophe C. Fred, Ana L.N. 103 Fredouille, Daniel 25 Garcia, P. 115 Guimarães, Gabriela
Kobayashi, Satoshi
89
127
de la Higuera, Colin 15 Hirata, Kouichi 270 Honavar, Vasant 207 186 157
207
15
39 51
Ishiwata, Takashi
39
Martinek, Pavel 171 Muramatsu, Hidenori 229
Calera-Rubio, Jorge 221 Cano, A. 115 Carrasco, Rafael C. 221 Casacuberta, Francisco 1 Coste, François 25 Denis, François Dupont, Pierre
Lemay, Aurélien
141
221
Sánchez, Joan-Andreu 196 Sakakibara, Yasubumi 229 Sakamoto, Hiroshi 241 Sempere, José M. 75 Shimozono, Shinichi 270 Stephan, Frank 256 Subramanian, K.G. 65 Sugimoto, Noriko 270 Tajima, Yasuhiro 284 Terlutte, Alain 39 Terwijn, Sebastiaan A. 256 Thollard, Franck 141 Thomas, D.G. 65 Tomita, Etsuji 284 Toyoshima, Takashi 270 Vilar, Juan Miguel
298