FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning
THE KLUWER INTERNATIONAL SERIES IN ENGINEERING AND COMPUTER SCIENCE
OFFICE OF NAVAL RESEARCH Advanced Book Series
Consulting Editor: Andre M. van Tilborg

Other titles in the series:

Foundations of Knowledge Acquisition: Cognitive Models of Complex Learning, edited by Susan Chipman and Alan L. Meyrowitz, ISBN 0-7923-9277-9

Foundations of Real-Time Computing: Formal Specifications and Methods, edited by Andre M. van Tilborg and Gary M. Koob, ISBN 0-7923-9167-5

Foundations of Real-Time Computing: Scheduling and Resource Management, edited by Andre M. van Tilborg and Gary M. Koob, ISBN 0-7923-9166-7
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning
edited by

Alan L. Meyrowitz, Naval Research Laboratory
Susan Chipman, Office of Naval Research
KLUWER ACADEMIC PUBLISHERS
Boston / Dordrecht / London
Distributors for North America: Kluwer Academic Publishers 101 Philip Drive Assinippi Park Norwell, Massachusetts 02061 USA
Distributors for all other countries: Kluwer Academic Publishers Group Distribution Centre Post Office Box 322 3300 AH Dordrecht, THE NETHERLANDS
Library of Congress Cataloging-in-Publication Data (Revised for vol. 2) Foundations of knowledge acquisition. (The Kluwer international series in engineering and computer science ; SECS 194) Editors' names reversed in v. 2. Includes bibliographical references and index. Contents: v. [1] Cognitive models of complex learning — v. [2] Machine learning. 1. Knowledge acquisition (Expert systems) I. Chipman, Susan. II. Meyrowitz, Alan Lester. III. Series. QA76.E95F68 1993 006.3'1 92-36720 ISBN 0-7923-9277-9 (v. 1) ISBN 0-7923-9278-7 (v. 2)
Contents

1. Learning = Inferencing + Memorizing
   Ryszard S. Michalski

2. Adaptive Inference
   Alberto Segre, Charles Elkan, Daniel Scharstein, Geoffrey Gordon, and Alexander Russell

3. On Integrating Machine Learning with Planning
   Gerald F. DeJong, Melinda T. Gervasio, and Scott W. Bennett

4. The Role of Self-Models in Learning to Plan
   Gregg Collins, Lawrence Birnbaum, Bruce Krulwich, and Michael Freed

5. Learning Flexible Concepts Using a Two-Tiered Representation
   R. S. Michalski, F. Bergadano, S. Matwin, and J. Zhang

6. Competition-Based Learning
   John J. Grefenstette, Kenneth A. De Jong, and William M. Spears

7. Problem Solving via Analogical Retrieval and Analogical Search Control
   Randolph Jones

8. A View of Computational Learning Theory
   Leslie G. Valiant

9. The Probably Approximately Correct (PAC) and Other Learning Models
   David Haussler and Manfred Warmuth

10. On the Automated Discovery of Scientific Theories
    Daniel Osherson and Scott Weinstein

Index
Foreword

One of the most intriguing questions about the new computer technology that has appeared over the past few decades is whether we humans will ever be able to make computers learn. As is painfully obvious to even the most casual computer user, most current computers do not. Yet if we could devise learning techniques that enable computers to routinely improve their performance through experience, the impact would be enormous. The result would be an explosion of new computer applications that would suddenly become economically feasible (e.g., personalized computer assistants that automatically tune themselves to the needs of individual users), and a dramatic improvement in the quality of current computer applications (e.g., imagine an airline scheduling program that improves its scheduling method based on analyzing past delays). And while the potential economic impact of successful learning methods is sufficient reason to invest in research into machine learning, there is a second significant reason: studying machine learning helps us understand our own human learning abilities and disabilities, leading to the possibility of improved methods in education. While many open questions remain about the methods by which machines and humans might learn, significant progress has been made. For example, learning systems have been demonstrated for tasks such as learning how to drive a vehicle along a roadway (one has successfully driven at 55 mph for 20 miles on a public highway), for learning to evaluate financial loan applications (such systems are now in commercial use), and for learning to recognize human speech (today's top speech recognition systems all employ learning methods). At the same time, a theoretical understanding of learning has begun to appear. For example, we now can place theoretical bounds on the amount of training data a learner must observe in order to reduce its risk of choosing an incorrect hypothesis below some desired threshold. And an improved understanding of human learning is beginning to emerge alongside our improved understanding of machine learning. For example, we now have models of how human novices learn to become experts at various tasks: models that have been implemented as precise computer programs, and that generate traces very much like those observed in human protocols.
The book you are holding describes a variety of these new results. This work has been pursued under research funding from the Office of Naval Research (ONR) during the time that the editors of this book managed an Accelerated Research Initiative in this area. While several government and private organizations have been important in supporting machine learning research, this ONR effort stands out in particular for its farsighted vision in selecting research topics. During a period when much funding for basic research was being rechanneled to shorter-term development and demonstration projects, ONR had the vision to continue its tradition of supporting research of fundamental long-range significance. The results represent real progress on central problems of machine learning. I encourage you to explore them for yourself in the following chapters.

Tom Mitchell
Carnegie Mellon University
Preface

The two volumes of Foundations of Knowledge Acquisition document the recent progress of basic research in knowledge acquisition sponsored by the Office of Naval Research. The volume you are holding is subtitled Machine Learning; its companion volume is subtitled Cognitive Models of Complex Learning. Funding was provided by a five-year Accelerated Research Initiative (ARI) from 1988 through 1992, and made possible significant advances in the scientific understanding of how machines and humans can acquire new knowledge so as to exhibit improved problem-solving behavior. Previous research in artificial intelligence had been directed at understanding the automation of reasoning required for problem solving in complex domains; consequent advances in expert system technology attest to the progress made in the area of deductive inference. However, that research also suggested that automated reasoning can serve to do more than solve a given problem. It can be utilized to infer new facts likely to be useful in tackling future problems, and it can aid in creating new problem-solving strategies. Research sponsored by the Knowledge Acquisition ARI was thus motivated by a desire to understand those reasoning processes which account for the ability of intelligent systems to learn and so improve their performance over time. Such processes can take a variety of forms, including generalization of current knowledge by induction, reasoning by analogy, and discovery (heuristically guided deduction which proceeds from first principles, or axioms). Associated with each are issues regarding the appropriate representation of knowledge to facilitate learning, and the nature of strategies appropriate for learning different kinds of knowledge in diverse domains. There are also issues of computational complexity related to theoretical bounds on what these forms of reasoning can accomplish. Significant progress in machine learning is reported along a variety of fronts. Chapters in Machine Learning include work in analogical reasoning; induction and discovery; learning and planning; learning by competition, using genetic algorithms; and theoretical limitations.
Knowledge acquisition, as pursued under the ARI, was a coordinated research thrust into both machine learning and human learning. Chapters in the companion volume, Cognitive Models of Complex Learning, also published by Kluwer Academic Publishers, include summaries of work by cognitive scientists who do computational modeling of human learning. In fact, an accomplishment of research previously sponsored by ONR's Cognitive Science Program was insight into the knowledge and skills that distinguish human novices from human experts in various domains; the Cognitive interest in the ARI was then to characterize how the transition from novice to expert actually takes place. Chapters particularly relevant to that concern are those written by Anderson, Kieras, Marshall, Ohlsson, and VanLehn. The editors believe these to be valuable volumes from a number of perspectives. They bring together descriptions of recent and on-going research by scientists at the forefront of progress in one of the most challenging arenas of artificial intelligence and cognitive science. Moreover, those scientists were asked to comment on exciting future directions for research in their specialties, and were encouraged to reflect on the progress of science which might go beyond the confines of their particular projects.
Dr. Alan L. Meyrowitz
Navy Center for Applied Research in Artificial Intelligence

Dr. Susan Chipman
ONR Cognitive Science Program
FOUNDATIONS OF KNOWLEDGE ACQUISITION: Machine Learning
Chapter 1
LEARNING = INFERENCING + MEMORIZING

Basic Concepts of Inferential Theory of Learning and Their Use for Classifying Learning Processes
Ryszard S. Michalski
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030

ABSTRACT

This chapter presents a general conceptual framework for describing and classifying learning processes. The framework is based on the Inferential Theory of Learning, which views learning as a search through a knowledge space aimed at deriving knowledge that satisfies a learning goal. Such a process involves performing various forms of inference and memorizing the results for future use. The inference may be of any type: deductive, inductive, or analogical. It can be performed explicitly, as in many symbolic systems, or implicitly, as in artificial neural nets. Two fundamental types of learning are distinguished: analytical learning, which reformulates given knowledge into a desired form (e.g., skill acquisition), and synthetic learning, which creates new knowledge (e.g., concept learning). Both types can be characterized in terms of the knowledge transmutations involved in transforming given knowledge (input plus background knowledge) into the desired knowledge. Several transmutations are discussed in a novel way, such as deductive and inductive generalization, abductive derivation, deductive and inductive specialization, abstraction, and concretion. The presented concepts are used to develop a general classification of learning processes.

Key words: learning theory, machine learning, inferential theory of learning, deduction, induction, abduction, generalization, abstraction, knowledge transmutation, classification of learning.
INTRODUCTION

In the last several years we have been witnessing a great proliferation of methods and approaches to machine learning. Research in this field now spans such subareas or topics as empirical concept learning from examples, explanation-based learning, neural net learning, computational learning theory, genetic algorithm based learning, cognitive models of learning, discovery systems, reinforcement learning, constructive induction, conceptual clustering, multistrategy learning, and machine learning applications. In view of such a diversification of machine learning research, there is a strong need for developing a unifying conceptual framework for characterizing existing learning methods and approaches. Initial results toward such a framework have been presented in the form of the Inferential Theory of Learning (ITL) by Michalski (1990a, 1993). The purpose of this chapter is to discuss and elaborate selected concepts of ITL, and to use them to describe a general classification of learning processes. The ITL postulates that learning processes can be characterized in terms of operators (called "knowledge transmutations"; see the next section) that transform the input information and the learner's initial knowledge into the knowledge specified by the goal of learning. The main goals of the theory are to analyze and explain diverse learning methods and paradigms in terms of knowledge transmutations, regardless of the implementation-dependent operations performed by different learning systems. The theory aims at understanding the competence of learning processes, i.e., their logical capabilities. Specifically, it tries to explain what type of knowledge a system is able to derive from what type of input and prior knowledge, what types of inference and knowledge transformations underlie different learning strategies and paradigms, what the properties and interrelationships among knowledge transmutations are, how different knowledge transmutations are implemented in different learning systems, and so on. The latter issue is particularly important for developing systems that combine diverse learning strategies and methods,
because different knowledge representations and computational mechanisms facilitate different knowledge transmutations. Knowledge transmutations can be applied in a great variety of ways to a given input and background knowledge. Therefore, the theory emphasizes the importance of learning goals, which are necessary for guiding learning processes. Learning goals reflect the knowledge needs of the learner, and often represent a composite structure of many subgoals, some of which are consistent and some of which may be contradictory. As to the research methodology employed, the theory attempts to explain learning processes at a level of abstraction that allows it to be relevant both to cognitive models of learning and to the learning processes studied in machine learning. The above research issues make the Inferential Theory of Learning different from and complementary to Computational Learning Theory (e.g., Warmuth and Valiant, 1991), which is primarily concerned with the computational complexity or convergence of learning algorithms. The presented work draws upon the ideas described in (Michalski, 1983 & 1990a; Michalski and Kodratoff, 1990b; and Michalski, 1993).
LEARNING THROUGH INFERENCE

Any act of learning aims at improving the learner's knowledge or skill by interacting with some information source, such as an environment or a teacher. The underlying tenet of the Inferential Theory of Learning is that any learning can be usefully viewed as a process of creating or modifying knowledge structures to satisfy a learning goal. Such a process may involve performing any type of inference: deductive, inductive, or analogical. Figure 1 illustrates the information flow in a general learning process according to the theory. In each learning cycle, the learner generates new knowledge and/or a new form of knowledge by performing inferences from the input information and the learner's prior knowledge. When the obtained knowledge satisfies the learning goal, it is assimilated into the learner's knowledge base. The input information to a learning process can be observations, stated facts, concept instances,
previously formed generalizations or abstractions, conceptual hierarchies, information about the validity of various pieces of knowledge, etc.
[Figure 1 here: external input and internal input (background knowledge) feed a multitype inference process (deduction, induction, analogy), which produces the output.]
Figure 1. A schematic characterization of learning processes.

Any learning process needs to be guided by some underlying goal; otherwise the proliferation of choices of what to learn would quickly overwhelm any realistic system. A learning goal can be general (domain-independent) or domain-dependent. A general learning goal defines the type of knowledge that is desired by a learner. There can be many such goals, for example, to determine a concept description from examples, to classify observed facts, to concisely describe a sequence of events, to discover a quantitative law characterizing physical objects, to reformulate given knowledge into a more efficient representation, to learn a control algorithm to accomplish a task, to confirm a given piece of knowledge,
etc. A domain-specific goal defines specific knowledge needed by the learner. At the beginning of a learning process, the learner determines what prior knowledge is relevant to the input and the learning goal. This goal-relevant part of the learner's prior knowledge is called background knowledge (BK). The BK can be in different forms, such as declarative (e.g., a collection of statements representing conceptual knowledge), procedural (e.g., a sequence of instructions for performing some skill), or a combination of both. Input and output knowledge in a learning process can also be in such forms. One way of classifying learning processes is based on the form of input and output knowledge involved in them (Michalski, 1990a). The Inferential Theory of Learning (ITL) states that learning involves performing inference ("inferencing") from the information supplied and the learner's background knowledge, and memorizing those results that are found to be useful. Thus, one can write an "equation":

Learning = Inferencing + Memorizing    (1)
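As a concrete reading of equation (1) and Figure 1, the sketch below (in Python; all names and the toy rule representation are illustrative assumptions, not from the chapter) casts one pass of the learning cycle as inferencing over the input and background knowledge, followed by goal-guided memorizing.

```python
# A minimal sketch of the learning cycle of Figure 1 / equation (1).
# All names are illustrative.  `infer` stands for any multitype
# inference procedure (deductive, inductive, or analogical); `goal`
# decides which derived knowledge is worth memorizing.

def learn(inputs, background_knowledge, goal, infer):
    """Learning = Inferencing + Memorizing."""
    learned = set()
    for item in inputs:                              # external input
        for k in infer(item, background_knowledge):  # inferencing
            if goal(k):                              # goal-guided filter
                learned.add(k)                       # memorizing
    return background_knowledge | learned            # assimilate into BK

# Toy usage: BK holds rules (premise, consequent); inference derives
# the consequent of every rule whose premise matches the input fact.
bk = {("fire", "smoke")}
infer = lambda fact, kb: {c for (p, c) in kb if p == fact}
print(learn(["fire"], bk, goal=lambda k: True, infer=infer))
```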
It should be noted that the term "inferencing" is used in (1) in a very general sense, meaning any type of knowledge transformation or manipulation, including syntactic transformations and random searching for a specified knowledge entity. Thus, to be able to learn, a system has to be able to perform inference, and to have a memory that supplies the background knowledge and stores the results of inferencing. As mentioned earlier, ITL postulates that any learning process can be described in terms of generic units of knowledge change, called knowledge transmutations (or transforms). The transmutations derive one type of knowledge from another, hypothesize new knowledge, confirm or disconfirm knowledge, organize knowledge into structures, determine properties of given knowledge, insert or delete knowledge, transmit knowledge from one physical medium to another, etc. Transmutations may be performed by a learner explicitly, by well-defined rules of inference (as in many symbolic learning systems), or implicitly, by specific
mechanisms involved in information processing (as in neural-net learning or genetic algorithm based learning). The capabilities of a learning system depend on the types and the complexity of the transmutations it is capable of performing. Transmutations are divided into two classes: knowledge generation transmutations and knowledge manipulation transmutations. Knowledge generation transmutations change the content of knowledge by performing various kinds of inference. They include, for example, generalization, specialization, abstraction, concretion, similization, dissimilization, and any kind of logical or mathematical derivation (Michalski, 1993). Knowledge manipulation transmutations perform operations on knowledge that change not its content but its organization, physical distribution, etc. For example, inserting a learned component into a given structure, replicating a given knowledge segment in another knowledge base, or sorting given rules in a certain order are knowledge manipulation transmutations. This chapter discusses two important classes of knowledge generation transmutations: {generalization, specialization} and {abstraction, concretion}. These classes are particularly relevant to the classification of learning processes discussed in the last section. Because the Inferential Theory views learning as an inference process, it may appear that it applies only to symbolic methods of learning, and not to "subsymbolic" methods, such as neural net learning, reinforcement learning, or genetic algorithm-based learning. It is argued that it also applies to them, because from the viewpoint of input-output transformations, subsymbolic methods can also be characterized as performing knowledge transmutations and inference. Clearly, they can generalize inputs, determine similarity between inputs, abstract from details, etc. From the ITL viewpoint, symbolic and subsymbolic systems differ in the type of computational and representational mechanisms they use for performing transmutations. Whether a learning system works in parallel or sequentially, weighs inputs or performs logic-based transformations
affects the system's speed, but not its ultimate competence (within limits), because a parallel algorithm can be transformed into a logically equivalent sequential one, and a discrete neural net unit function can be transformed into an equivalent logic-type transformation. These systems differ in the efficiency and speed of performing different transmutations, which makes them more or less suitable for different learning tasks. In many symbolic learning systems, knowledge transmutations are performed in an explicit way, and in conceptually comprehensible steps. In some inductive learning systems, for example INDUCE, generalization transmutations are performed according to well-defined rules of inductive generalization (Michalski, 1983). In subsymbolic systems (e.g., neural networks), transmutations are performed implicitly, in steps dictated by the underlying computational mechanism (see, e.g., Rumelhart et al., 1986). A neural network may generalize an input example by performing a sequence of small modifications of the weights of internode connections. Although these weight modifications do not directly correspond to any explicit inference rules, the end result can nevertheless be characterized as a certain knowledge transmutation. The latter point is illustrated by Wnek et al. (1990), who described a simple method for visualizing generalization operations performed by various symbolic and subsymbolic learning systems. The method, called DIAV, can visualize the target and learned concepts, as well as the results of various intermediate steps, no matter what computational mechanism is used to perform them. To illustrate this point, Figure 2 presents a diagrammatic visualization of concepts learned by four learning systems: a classifier system using a genetic algorithm (CFS), a rule learning program (AQ15), a neural net (BpNet), and a decision tree learning system (C4.5). Each diagram presents an "image" of the concept learned by the given system from the same set of examples: 6% of the positive examples (5 out of the 84 positive examples constituting the concept) and 3% of the negative examples (11 out of a possible 348).
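To make the size of this experiment concrete, the sketch below (a rough illustration, not part of the original study; the attribute sets follow Figure 3 later in this section, while the two concept definitions are invented) enumerates the 432-cell description space and computes an "error image" as the set of cells on which a target and a learned concept disagree.

```python
from itertools import product

# The six attributes and value sets of Figure 3; the description
# space therefore contains 3*3*2*3*4*2 = 432 cells (instances).
ATTRIBUTES = {
    "head_shape":   ["R", "S", "O"],
    "body_shape":   ["R", "S", "O"],
    "smiling":      ["Y", "N"],
    "holding":      ["S", "B", "F"],      # sword, balloon, flag
    "jacket_color": ["R", "Y", "G", "B"],
    "tie":          ["Y", "N"],
}

space = [dict(zip(ATTRIBUTES, values))
         for values in product(*ATTRIBUTES.values())]
assert len(space) == 432

# Invented target and learned concepts, each a predicate on a cell.
target  = lambda c: c["smiling"] == "Y" and c["holding"] == "F"
learned = lambda c: c["smiling"] == "Y"       # an over-generalization

# The "error image": cells on which the two concepts disagree.
errors = [c for c in space if target(c) != learned(c)]
print(len(errors), "of 432 cells would be misclassified")
```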
[Figure 2 here: four diagrams, one per learning system (a classifier system, CFS; decision rules, AQ15; a neural net, BpNet; and a decision tree, C4.5). Each diagram marks the target concept, the learned concept, and the positive and negative training examples. The cell marked A corresponds to the description: HEAD-SHAPE = R & BODY-SHAPE = R & SMILING = Yes & HOLDING = F & JACKET-COLOR = B & TIE = N.]

Figure 2. A visualization of the target concept and concepts learned by four learning methods.
In the diagrams, the shaded area marked "Target concept" represents all possible instances of the concept to be learned. The shaded area marked "Learned concept" represents the generalization of the training examples hypothesized by a given learning system. The set-theoretic difference between the "Target concept" and the "Learned concept" represents errors in learning (an "error image"). Each instance belonging to the "Learned concept" but not to the "Target concept," or to the "Target concept" but not to the "Learned concept," will be incorrectly classified by the system. To understand the diagrams, note that each cell of a diagram represents a single combination of attribute values, i.e., an instance of a concept. A whole diagram represents the complete description space (432 instances). The attributes spanning the description space characterize a collection of imaginary robot-like figures. Figure 3 lists the attributes and their value sets.

ATTRIBUTE       LEGAL VALUES
Head Shape      R (round), S (square), O (octagon)
Body Shape      R (round), S (square), O (octagon)
Smiling         Y (yes), N (no)
Holding         S (sword), B (balloon), F (flag)
Jacket Color    R (red), Y (yellow), G (green), B (blue)
Tie             Y (yes), N (no)
Figure 3. Attributes and their value sets.

To determine the logical description that corresponds to a given cell (or set of cells), one projects the cell (or set of cells) onto the ranges of attribute values on the scales along the sides of the diagram, and "reads out" the description. To illustrate this, the legend of Figure 2 gives the description of the cell marked A in the diagram. By analyzing the images of the concepts learned by different paradigms, one can determine the degree to which they generalized the original examples, "see" the differences between different generalizations, determine how new or hypothetical examples will be classified according to the learned concepts, etc. For more details on the
properties of the diagrams, on the method of "reading out" descriptions from the diagrams, and on the implemented diagrammatic visualization system, DIAV, see (Michalski, 1978; Wnek et al., 1990; Wnek and Michalski, 1992). The diagrams allow one to view concepts as images, and thus to abstract from the specific knowledge representation used by a learning method. This demonstrates that, from the epistemological viewpoint taken by the ITL, it is irrelevant whether knowledge is implemented in the form of a set of rules, a decision tree, a neural net, or some other way. For example, in a neural net, the prior knowledge is represented in an implicit way, specifically, by the structure of the net and by the initial settings of the weights of the connections. The learned knowledge is manifested in the new weights of the connections among the net's units (Touretzky and Hinton, 1988). The prior and learned knowledge incorporated in the net could be re-represented, at least theoretically, in the form of images, or as explicit symbolic rules or numerical expressions, and then dealt with as any other knowledge. For example, using the diagrams in Figure 2, one can easily "read out" from them a set of rules equivalent to the concepts learned by the neural network and the genetic algorithm. The central aspect of any knowledge transmutation is the type of underlying inference, which characterizes a transmutation along the truth-falsity dimension. The type of inference thus determines the truth status of the derived knowledge. Therefore, before we discuss transmutations and their role in learning, we will first analyze the basic types of inference.

BASIC TYPES OF INFERENCE

As stated earlier, ITL postulates that learning involves conducting inference on the input and current BK, and storing the results whenever they are evaluated as useful. Such a process may involve any type of inference, because any possible type of inference may produce knowledge worth remembering. Therefore, from such a viewpoint, a complete learning theory has to include a complete theory of inference.
Such a theory of inference should account for all possible types of inference. Figure 4 presents an attempt to schematically illustrate all basic types of inference. The first major classification divides inferences into two fundamental types: deductive and inductive. The difference between them can be explained by considering the entailment

P ∪ BK ⊨ C    (2)

where P denotes a set of statements, called the premise, BK represents the reasoner's background knowledge, and C denotes a set of statements, called the consequent. Deductive inference is deriving consequent C, given premise P and BK. Inductive inference is hypothesizing premise P, given consequent C and BK. Thus, deductive inference can be viewed as "tracing forward" the relationship (2), and inductive inference as "tracing backward" such a relationship. Because of its importance for characterizing inference processes, relationship (2) is called the fundamental equation for inference.
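A toy propositional sketch of the fundamental equation (2) may help (the fact names and rule representation are invented for illustration): deduction traces the entailment forward by forward chaining, while induction searches for a premise that, together with BK, entails the observed consequent.

```python
# Deduction and induction as forward and backward tracing of
# P u BK |= C.  BK is a list of rules (antecedents -> consequents);
# all fact names are invented.

def deduce(premise, bk_rules):
    """Trace (2) forward: derive everything entailed by P and BK."""
    facts = set(premise)
    changed = True
    while changed:
        changed = False
        for antecedents, consequents in bk_rules:
            if antecedents <= facts and not consequents <= facts:
                facts |= consequents
                changed = True
    return facts

def induce(consequent, bk_rules, candidate_premises):
    """Trace (2) backward: hypothesize premises P with P u BK |= C."""
    return [p for p in candidate_premises
            if consequent <= deduce(p, bk_rules)]

bk = [({"element_of_X"}, {"has_property_q"})]   # "all elements of X have q"
print(deduce({"element_of_X"}, bk))             # deduction: derives q
print(induce({"has_property_q"}, bk,            # induction: hypothesizes P
             candidate_premises=[{"element_of_X"}, {"unrelated_fact"}]))
```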
[Figure 4 here: a diagram classifying basic inference types along two dimensions: deductive (truth-preserving) versus inductive (falsity-preserving), and conclusive versus contingent.]
Figure 4. A classification of basic types of inference.

Inductive inference underlies two major knowledge generation transmutations: inductive generalization and abductive derivation. They differ in the type of BK they employ, and in the type of premise P they
hypothesize. Inductive generalization is based on tracing backward a tautological implication, specifically the rule of universal specialization, ∀x P(x) ⇒ P(a), and produces a premise P that is a generalization of C, i.e., a description of a larger set of entities than the set described by C (Michalski, 1990a, 1993). In contrast, abductive derivation is based on tracing backward an implication that represents domain knowledge, and produces a description that characterizes reasons for C. Other, less well-known, types of inductive inference are inductive specialization and inductive concretion (see the section on inductive transmutations). In a more general view of deduction and induction that also captures their approximate or commonsense forms, the entailment relationship "⊨" may also include a "plausible" entailment, i.e., probabilistic or partial. The difference between "conclusive" (valid) and "plausible" entailment leads to another major classification of inference types. Specifically, inferences can be divided into those based on conclusive, or domain-independent, dependencies, and those based on contingent, or domain-dependent, dependencies. A conclusive dependency between statements or sets of statements represents a necessarily true logical relationship, i.e., a relationship that must be true in all possible worlds. Valid rules of inference or universally accepted physical laws represent conclusive dependencies. To illustrate a conclusive dependency, consider the statement "All elements of the set X have the property q." If this statement is true, then the statement "x, an element of X, has the property q" must also be true. The above relationship between the statements is true independently of the domain of discourse, i.e., of the nature of the elements in the set X, and thus is conclusive. If reasoning involves only statements that are assumed to be true, such as observations, "true" implications, etc., and conclusive dependencies (valid rules of inference), then deriving C, given P, is conclusive (or crisp) deduction, and hypothesizing P, given C, is conclusive (or crisp) induction. For example, suppose that BK is "All elements of the set X have the property q," and the input (premise P) is "x is an element of X."
Deriving the statement "x has the property q" is conclusive deduction. If BK is "x is an element of X" and the input (the observed consequent C) is "x has the property q," then hypothesizing the premise P, "All elements of X have the property q," is conclusive induction. Contingent dependencies are domain-dependent relationships that represent some world knowledge that is not totally certain, but only probable. The contingency of these relationships is usually due to the fact that they represent incomplete or imprecise information about the totality of factors in the world that constitute a dependency. These relationships hold with different "degrees of strength." To express both conclusive and contingent dependencies within one formalism, the concept of mutual dependency is introduced. Suppose S1 and S2 are sentences in PLC (Predicate Logic Calculus) that are either statements (closed PLC sentences; no free variables) or term expressions (open PLC sentences, in which some of the arguments are free variables). If there are free variables, such sentences can be interpreted as representing functions; otherwise they are statements with a truth-status. To state that there is a mutual dependency (for short, an m-dependency) between sentences S1 and S2, we write

S1 ⇔ S2 : α, β    (3)
where α and β, called merit parameters, represent an overall forward strength and backward strength of the dependency, respectively. If S1 and S2 are statements, then an m-dependency becomes an m-implication. Such an implication reduces to a standard logical implication if α is 1 and β is undetermined, or α is undetermined and β is 1; otherwise it is a bidirectional plausible implication. In such an implication, if S1 (S2) is true, then α (β) represents a measure of certainty that S2 (S1) is true, assuming that no other information relevant to S2 (S1) is known. If S1 and S2 are term expressions, then α and β represent an average certainty with which the value of S1 determines the value of S2, and conversely. An obvious question arises as to the method for representing and computing merit parameters. We do not assume that they need to have a
single representation. They could be numerical values representing a degree of belief, an estimate of the probability, ranges of probability, or a qualitative characterization of the strength of conclusions from using the implication in either direction. Here, we assume that they represent numerical degrees of dependency based on a contingency table (e.g., Goodman & Kruskal, 1979; Piatetsky-Shapiro, 1992), or estimated by an expert.
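One plausible contingency-table reading of the merit parameters is sketched below (this particular estimator is an assumption for illustration, not a definition from the chapter): the forward strength α is estimated as the fraction of cases where S2 holds given that S1 holds, and the backward strength β as the fraction where S1 holds given that S2 holds.

```python
# A hypothetical estimator of merit parameters from a 2x2 contingency
# table of joint observations of S1 and S2 (assumed for illustration).

def merit_parameters(n11, n10, n01, n00):
    """n11: S1 and S2 both hold; n10: only S1; n01: only S2; n00: neither."""
    alpha = n11 / (n11 + n10)   # forward strength: S2 given S1
    beta  = n11 / (n11 + n01)   # backward strength: S1 given S2
    return alpha, beta

# Example: invented counts for the fire/smoke dependency discussed below.
alpha, beta = merit_parameters(n11=80, n10=5, n01=15, n00=900)
print(f"fire -> smoke: alpha = {alpha:.2f}")   # strong forward dependency
print(f"smoke -> fire: beta  = {beta:.2f}")    # weaker backward dependency
```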
Another important problem is how to combine or propagate merit parameters when reasoning through a network of m-dependencies. Pearl (1988) discusses a number of ideas relevant to this problem. Since the certainty of a statement cannot be determined solely on the basis of the certainties of its constituents, regardless of its meaning, the ultimate solution of this open problem will require methods that take into consideration both the merit parameters and the meaning of the sentences. A special case of m-dependency is determination, introduced by Russell (1989) and used for characterizing a class of analogical inferences. Determination is an m-dependency between term expressions in which α is 1 and β is unspecified, that is, a unidirectional functional dependency. If either of the parameters α or β takes the value 1, then an m-dependency is called conclusive; otherwise it is called contingent. The idea of an m-dependency stems from research on human plausible reasoning (Collins and Michalski, 1989). Conclusions derived from inferences involving contingent dependencies (applied in either direction) and/or uncertain facts are thus uncertain. They are characterized by "degrees of belief" (probabilities, degrees of truth, likelihoods, etc.). For example, "If there is fire, there is smoke" is a bidirectional contingent dependency, because there could be a situation or a world in which it is false. It holds in both directions, but not conclusively in either direction. If one sees fire, then one may derive a plausible (deductive) conclusion that there is smoke. This conclusion, however, is not certain. Using reverse reasoning ("tracing backward" the above dependency), observing smoke, one may hypothesize that there is fire. This is also an uncertain inference, called contingent abduction. It may thus appear that there is no principal difference between contingent deduction and contingent abduction. These two types of inference are different if one assumes that there is a causal dependency between fire and smoke, or, generally, between P and C in the context of BK (i.e., P can be viewed as a cause, and C as its consequence). Contingent deduction derives a plausible consequent, C, of the causes represented by P. Abduction derives plausible causes, P, of the consequent C. A problem arises when there is no causal dependency between P and C in the context of BK. In such a situation, the distinction between plausible deduction and abduction can be based on the relative strength of the dependency between P and C in the two directions (Michalski, 1992). Reasoning in the direction of the stronger dependency is plausible deduction, and reasoning in the weaker direction is abduction. If a dependency is completely symmetrical, e.g., P ⇔ C, then the difference between deduction and abduction ceases to exist. In sum, both contingent deduction and contingent induction are based on contingent, domain-dependent dependencies. Contingent deduction produces likely consequences of given causes, and contingent abduction produces likely causes of given consequences. Contingent deduction is truth-preserving, and contingent induction (or contingent abduction) is falsity-preserving, only to the extent to which the contingent dependencies involved in the reasoning are true. In contrast, conclusive deductive inference is strictly truth-preserving, and conclusive induction is strictly falsity-preserving (if C is not true, then the hypothesis P cannot be true either). A conclusive deduction thus produces a provably correct (valid) consequent from a given premise. A conclusive induction produces a hypothesis that logically entails the given consequent (though the hypothesis itself may be false). The intersection of deduction and induction, i.e., inference that is both truth-preserving and falsity-preserving, represents equivalence-based inference (or reformulation). Analogy can be viewed as an extension of such equivalence-based inference, namely, as similarity-based inference. Every analogical inference can be characterized as a
combination of deduction and induction. Induction is involved in hypothesizing an analogical match, i.e., the properties and/or relations that are assumed to be similar between the analogs, whereas deduction uses the analogical match to derive unknown properties of the target analog. Therefore, in Figure 4, analogy occupies the central area. The above inference types underlie a variety of knowledge transmutations. We now turn to the discussion of various knowledge transmutations in learning processes.

TRANSMUTATIONS AS LEARNING OPERATORS

The Inferential Theory of Learning views any learning process as a search through a knowledge space, defined as the space of admissible knowledge representations. Such a space represents all possible inputs, all of the learner's background knowledge, and all knowledge that the learner can potentially generate. In inductive learning, the knowledge space is usually called a description space. The theory assumes that the search is conducted through an application of knowledge transmutations acting as operators. Such operators take some component of the current knowledge and some input, and generate a new knowledge component. A learning process is defined as follows:
Given
• Input knowledge I,
• Background knowledge BK,
• Learning goal G,
• A set of transmutations T,

Determine
• Output knowledge O, satisfying goal G, by applying transmutations T to input I and background knowledge BK.

The input knowledge, I, is the information (facts or general knowledge) that the learner receives from the environment. The learner may receive the input all at once or incrementally. The goal, G, specifies criteria that need to be satisfied by the output, O, in order that learning is
accomplished. Background knowledge is the part of the learner's prior knowledge that is "relevant" to a given learning process. Transmutations are generic types of knowledge transformation for which one can make a simple mental model. They can be implemented using many different computational paradigms. They are classified into two general categories: knowledge generation transmutations, which change the content or meaning of the knowledge, and knowledge manipulation transmutations, which change its physical location or organization, but do not change its content. Knowledge generation transmutations represent patterns of inference, and can be divided into synthetic and analytic. Synthetic transmutations are able to hypothesize intrinsically new knowledge, and thus are fundamental for knowledge creation (by "intrinsically new knowledge" we mean knowledge that cannot be conclusively deduced from the knowledge already possessed). Synthetic transmutations include inductive transmutations (those that employ some form of inductive inference) and analogical transmutations (those that employ some form of analogy). Analytic (or deductive) transmutations are those employing some form of deduction. This chapter concentrates on a few knowledge generation transmutations that are particularly important for the classification of learning processes described in the last section. A discussion of several other knowledge transmutations is in (Michalski, 1993). In order to describe these transmutations, we need to introduce the concepts of a well-formed description, the reference set of a description, and a descriptor. A set of statements is a well-formed description if and only if one can identify a specific set of entities that this set of statements describes. This set of entities (often a singleton) is called the reference set of the description. Well-formed descriptions have a truth-status, that is, they can be characterized as true or false or, generally, by some intermediate truth-value.
For the purpose of this presentation, we will make the simplifying assumption that descriptions can have one of only three truth-values: "true," "false," or "unknown." The "unknown" value is attached to hypotheses generated by contingent deduction, analogy, or inductive inference. The "unknown" value can be turned to true or false by subjecting the hypothesis to a validation procedure. A descriptor is an attribute, a function, or a relation whose value or status is used to characterize the reference set. Consider, for example, the statement: "Elizabeth is very strong, has a Ph.D. in Astrophysics from the University of Warsaw, and likes soccer." This statement is a well-formed description because one can identify a reference set, {Elizabeth}, that this statement describes. The description uses three descriptors: a one-place attribute, "degree-of-strength(person)"; a binary relation, "likes(person, activity)"; and a four-place relation, "degree-received(person, degree, topic, university)." The truth-status of this description is true if Elizabeth has the properties stated, false if she does not, and unknown if it is not known to be true but there is no evidence that it is false. Consider now the sentence: "Robert is a writer, and Barbara is a lawyer." This sentence is not a well-formed description. It could be split, however, into two sentences, each of which would be a well-formed description (one describing Robert, and the other describing Barbara). Finally, consider the sentence "George, Jane and Susan like mango, political discussions, and social work." This is a well-formed description of the reference set {George, Jane, Susan}. Knowledge generation transmutations apply only to well-formed descriptions. Knowledge manipulation transmutations apply to descriptions, as well as to entities that are not descriptions (e.g., terms or sets of terms). Below is a brief description of four major classes of knowledge generation transmutations. Each class consists of a pair of opposite transmutations, and the last also spans a range of intermediate transmutations.
1. Generalization vs. specialization

A generalization transmutation extends the reference set of the input description. Typically, a generalization transmutation is inductive, because the extended set is inductively hypothesized. A generalization transmutation can also be deductive, when the more general assertion is a logical consequence of the more specific one, or is deduced from the background knowledge and/or the input. The opposite transmutation is specialization, which narrows the reference set. A specialization transmutation usually employs deductive inference, but, as shown in the next section, there are also inductive specialization transmutations.

2. Abstraction vs. concretion

Abstraction reduces the amount of detail in a description of a reference set, without changing the reference set. This can be done in a variety of ways. A simple way is to replace one or more descriptor values by their parents in a generalization hierarchy of values. For example, suppose we are given the statement "Susan found an apple." Replacing "apple" by "fruit" would be an abstraction transmutation (assuming that the background knowledge contains a generalization hierarchy in which "fruit" is a parent node of "apple"). The underlying inference here is deduction. The opposite transmutation is concretion, which generates additional details about a reference set.

3. Similization vs. dissimilization

Similization derives new knowledge about some reference set on the basis of a detected partial similarity between this set and some other reference set, of which the reasoner has more knowledge. Similization thus transfers knowledge from one reference set to another reference set that is similar to the original one in some sense. The opposite transmutation is dissimilization, which derives new knowledge from the lack of similarity between the compared reference sets. Similization and dissimilization are based on analogical inference. They can be viewed as a combination of deductive and inductive
inference (Michalski, 1992). They represent patterns of inference described in the theory of plausible reasoning by Collins and Michalski (1989). For example, knowing that England grows roses, and that England and Holland have similar climates, a similization transmutation might hypothesize that Holland may also grow roses. The underlying background knowledge here is that there exists a dependency between the climate of a place and the type of plants growing in that location. A dissimilization transmutation would be to hypothesize that bougainvillea, which is widespread on the Caribbean islands, probably does not grow in Scotland, because Scotland and the Caribbean islands have very different climates.

4. Reformulation vs. randomization

A reformulation transmutation transforms a description into another description according to equivalence-based rules of transformation (i.e., truth- and falsity-preserving rules). For example, transforming the statement "This set contains numbers 1, 2, 3, 4 and 5" into "This set contains integers from 1 to 5" is a reformulation. The opposite transmutation is randomization, which transforms a description into another description by making random changes. For example, mutation in a genetic algorithm represents a randomization transmutation. Reformulation and randomization are two extremes of a spectrum of intermediate transmutations, called derivations. Derivations employ different degrees or types of logical dependence between descriptions to derive one piece of knowledge from another. An intermediate transmutation between the two extremes above is crossover, which is also used in genetic algorithms. Such a transmutation derives new knowledge by exchanging parts of two related descriptions.
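A minimal sketch of these two genetic-algorithm transmutations follows (descriptions are modeled as bit strings; the representation and mutation rate are illustrative assumptions, not from the chapter):

```python
import random

def mutate(description, rate=0.05):
    """Randomization transmutation: flip each bit with small probability."""
    return [bit ^ (random.random() < rate) for bit in description]

def crossover(parent_a, parent_b):
    """Derivation by exchanging parts of two related descriptions."""
    point = random.randrange(1, len(parent_a))   # one-point crossover
    return parent_a[:point] + parent_b[point:]

a = [1, 0, 1, 1, 0, 1]
b = [0, 1, 0, 0, 1, 0]
print(mutate(a))        # e.g., [1, 0, 1, 0, 0, 1]
print(crossover(a, b))  # e.g., [1, 0, 1, 0, 1, 0]
```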
INDUCTIVE TRANSMUTATIONS

Inductive transmutations, i.e., knowledge transformations employing inductive inference, are of fundamental importance to learning. This is due to their ability to generate intrinsically new knowledge. As discussed earlier, induction is an inference type opposite to deduction. The results of induction can be in the form of generalizations (theories, rules, laws, etc.), causal explanations, specializations, concretions, and others. The usual aim of induction is not to produce just any premise ("explanation") that entails a given consequent ("observable"), but the one that is the most "justifiable." Finding such a "most justifiable" hypothesis is important, because induction is an under-constrained inference, and just "reversing" deduction would normally lead to an unlimited number of alternative hypotheses. Taking into consideration the importance of determining the most justifiable hypothesis, the previously given characterization of inductive inference based on (2) can be further elaborated. Namely, an admissible induction is an inference which, given a consequent C and BK, produces a hypothetical premise P, consistent with BK, such that
(4)
and which satisfies the hypothesis selection criterion. In different contexts, the selection criterion (which may be a combination of several elementary criteria) is called a preference criterion (Popper, 1972; Michalski, 1983), bias (e.g., Utgoff, 1986), a comparator (Poole, 1989). These criteria are necessary for any act of induction because for any given consequent and a non-trivial hypothesis description language there could be a very large number distinct hypotheses that can be expressed in that language, and which satisfy the relation (4). The selection criteria specify how to choose among them. Ideally, these criteria should reflect the properties of a hypothesis that are desirable from the viewpoint of the reasoner's (or learner's) goals. Often, these criteria (or bias) are partially hidden in the description language used. For example, the description language may be limited to only conjunctive statements involving a given set of attributes, or determined by the mechanism performing induction (e.g., a method that generates decision trees is automatically limited to using only operations of conjunction and disjunction in the hypothesis representation). Generally,
22 these criteria reflect three basic desirable characteristics of a hypothesis: accuracy, utility, and generality. The accuracy expresses a desire to find a "true" hypothesis. Because the problem is logically under-constrained, the "truth" of a hypothesis can never be guaranteed. One can only satisfy (4), which is equivalent to making a hypothesis complete and consistent with regard to the input facts (Michalski, 1983). If the input is noisy, however, an inconsistent and/or incomplete hypothesis may give a better overall predictive performance than a complete and consistent one (e.g., Quinlan, 1989; Bergadano et al., 1992). The utility requires a hypothesis to be computationally and/or cognitively simple, and be applicable to performing an expected set of problems. The generality criterion expresses the desire to have a hypothesis that is useful for predicting new unknown cases. The more general the hypothesis, the wider scope of different new cases it will be able to predict. Form now on, when we talk about inductive transmutations, we mean transmutations that involve admissible inductive inference. While the above described view of induction is by no means universally accepted, it is consistent with many long-standing discussions of this subject going back to Aristotle (e.g., Adler and Gorman, 1987; see also the reference under Aristotle). Aristotle, and many subsequent thinkers, e.g., Bacon (1620), Whewell (1857) and Cohen (1970), viewed induction as a fundamental inference type that underlies all processes of creating new knowledge. They did not assume that knowledge is created only from low-level observations and without use of prior knowledge. Based on the role and amount of background knowledge involved, induction, can be divided into empirical induction and constructive induction. Empirical induction uses little background knowledge. Typically, an empirical hypothesis employs the descriptors (attributes, terms, relations, descriptive concepts, etc.) that are selected from among those that are used in describing the input instances or examples, and therefore such induction is sometimes called selective (Michalski, 1983).
23
In contrast, a constructive induction uses background knowledge and/or experiments to generate additional, more problem-oriented descriptors, and employs them in the formulation of the hypothesis. Thus, it changes the description space in which hypotheses are generated. Constructive induction can be divided into constructive generalization, which produces knowledge-based hypothetical generalizations, abduction, which produces hypothetical domain-knowledge-based explanations, and theory formation, which produces general theories explaining a given set of facts. The latter is usually developed by employing inductive generalization with abduction and deduction. There is a number of knowledge transmutations that employ induction, such as empirical inductive generalization, constructive inductive generalization, inductive specialization, inductive concretion, abductive derivation, and other (Michalski, 1993). Among them, the empirical inductive generalization is the most known form. Perhaps for this reason, it is sometimes mistakenly viewed as the only form of inductive inference. Constructive inductive generalization creates general statements that use other terms than those used for characterizing individual observations, and is also quite common in human reasoning. Inductive specialization is a relatively lesser known form of inductive inference. In contrast to inductive generalization, it decreases the reference set described in the input. Concretion is related to inductive specialization. The difference is that it generates more specific information about a given reference set, rather than reduces the reference set. Concretion is a transmutation opposite to abstraction. Abductive explanation employees abductive inference to derive properties of a reference set that can serve as its explanation. Figure 5 gives examples of the above inductive transmutations.
24
A. Empirical generalization (BK limited: "pure" generalization) Input: "A girl's face" and "Lvow cathedral" are beautiful paintings. BK: "A girl's face" and "Lvow cathedral" are paintings bv Dawski. Hypothesis: All paintings by Dawski are beautiful. B. Constructive inductive generalization (generalization + deduction) Input: "A girl's face" and "Lvow cathedral" are beautiful paintings. BK: "A girl's face" and "Lvow cathedral" are paintings by Dawski. Dawski is a known painter. Beautiful paintings by a known painter are expensive. Hypothesis: All paintings by Dawski are expensive. C. Inductive specialization Input: There is high-tech industry in Northern Virginia. BK: Fairfax is a town in Northern Virginia. Hypothesis: There is high-tech industry in Fairfax.
P. Inductive Concretion Input: John is an expert in some formal science. BK: John is Polish. Many Poles like logic. Logic is a formal science. Hypothesis: John is an expert in logic.
Et Afrdyetiye derivation Input: There is smoke in the house. BK: Fire usually causes smoke. Hypothesis: There is a fire in the house. F
General constructive induction (generalization plus abductive derivation) Input: Smoke is coming from John's apartment. BK: Fire usually causes smoke. John's apt, is in the Hemingway building. Hypothesis: The Hemingway building is on fire. Figure 5. Examples of inductive transmutations.
25 In Figure 5, examples A, C and D illustrate conclusive inductive transmutations (in which the generated hypothesis conclusively implies the consequent), and examples B, E and F illustrate contingent inductive transmutations (the hypothesis only plausibly implies the consequent).In example B, the input is only a plausible consequence of the hypothesis and BK, because background knowledge states that "Beautiful paintings by a known painter are expensive." This does not imply that all paintings that are expensive are necessarily beautiful. The difference between inductive specialization (Example C) and concretion (Example D) is that the former reduces the set being described (that is, the reference set), and the latter increases the information about the reference set. In example C, the reference set is reduced from Virginia to Fairfax. In example D, the reference set is John; the concretion increases the amount of information about it. HOW ABSTRACTION DIFFERS FROM GENERALIZATION Generalization is sometimes confused with abstraction, which is often employed as part of the process of creating generalizations. These two transmutations are quite different, however, and both are fundamental operations on knowledge. This section provides additional explanation of abstraction, and illustrates the differences between it and generalization. As mentioned earlier, abstraction creates a less detailed description of a given reference set from a more detailed description, without changing the reference set. The last condition is important, because reducing information about the reference set by describing only a part of it would not be abstraction. For example, reducing a description of a table to a description of one of its legs would not be an abstraction operation. To illustrate an abstraction transmutation, consider a transformation of the statement "My workstation has a Motorola 25-MHz 68030 processor" to "My workstation is quite fast." To make such an operation, the system needs domain-dependent background knowledge that "a processor with the 25-MHz clock speed can be viewed as quite fast," and a rule "If a processor is fast then the computer with that
processor can be viewed as fast." Note that the more abstract description is a logical consequence of the original description in the context of the given background knowledge, and carries less information.

The abstraction operation often involves a change in the representation language, from one that uses more specific terms to one that uses more general terms, with the proviso that the statements in the second language are logically implied by the statements in the first language. A very simple form of abstraction is to replace a specific attribute value in the description of an entity (e.g., the length in centimeters) by a less specific value (e.g., the length stated in linguistic terms, such as short, medium and long). A more complex abstraction would involve a significant change of the description language, e.g., taking a description of a computer in terms of electronic circuits and connections, and changing it into a description in terms of the functions of the individual modules.

In contrast to abstraction, which reduces information about a reference set but does not change it, generalization extends the reference set. To illustrate simply the difference between generalization and abstraction, consider a statement which says that attribute (descriptor) d takes value v for the set of entities S. Let us write such a statement in the form:
    d(S) = v        (5)
Changing (5) to the statement d(S) = v', in which v' represents a more general concept, e.g., a parent node in a generalization hierarchy of values of the attribute d, is an abstraction operation. By changing v to v', less information is conveyed about the reference set S. Changing (5) to a statement d(S') = v, in which S' is a superset of S, is a generalization operation. The generated statement conveys more information than the original one, because the property d is now assigned to a larger set. For example, transforming the statement "color(my-pencil) = light-blue" into "color(my-pencil) = blue" is an abstraction operation. Such an operation is deductive, if one knows that light-blue is a kind of blue.
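The contrast between the two operations on statement (5) can be illustrated with a small sketch. The value hierarchy, set hierarchy, and function names below are hypothetical, invented for this illustration only.

    # A minimal sketch contrasting abstraction and generalization on a
    # statement d(S) = v.  The hierarchies and names are hypothetical.

    # Generalization hierarchy of attribute values: child -> parent.
    VALUE_PARENT = {"light-blue": "blue", "navy": "blue"}

    # Reference-set hierarchy: set -> superset.
    SET_PARENT = {"my-pencil": "all-my-pencils"}

    def abstract(stmt):
        """Abstraction: same reference set, less specific value (v -> v')."""
        d, s, v = stmt
        return (d, s, VALUE_PARENT.get(v, v))

    def generalize(stmt):
        """Generalization: same value, larger reference set (S -> S')."""
        d, s, v = stmt
        return (d, SET_PARENT.get(s, s), v)

    stmt = ("color", "my-pencil", "light-blue")
    print(abstract(stmt))     # ('color', 'my-pencil', 'blue')          -- deductive
    print(generalize(stmt))   # ('color', 'all-my-pencils', 'light-blue') -- inductive
    print(abstract(generalize(stmt)))  # ('color', 'all-my-pencils', 'blue')

The last line corresponds to the combined operation discussed next: the same statement both generalized and abstracted.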
Transforming the original statement into "color(all-my-pencils) = light-blue" is a generalization operation. Assuming that one does not have prior knowledge that all the pencils I possess are light-blue, this is an inductive operation. Finally, transforming the original statement into "color(all-my-pencils) = blue" is both generalization and abstraction. Thus, associating the same information with a larger set is a generalization operation; associating a smaller amount of information with the same set is an abstraction operation.

In summary, generalization transforms descriptions along the set-superset dimension, and abstraction transforms descriptions along the level-of-detail dimension. Generalization often uses the same description space (or language), whereas abstraction often involves a change in the representation space (or language). The transmutation opposite to generalization is specialization; the transmutation opposite to abstraction is concretion. Generalization is typically an inductive operation, and abstraction a deductive operation.

As a parallel concept to constructive induction, which was discussed before, one may introduce the concept of constructive deduction. Similarly to constructive induction, constructive deduction is a process of deductively transforming a source description into a target description that uses new, more goal-relevant terms and concepts than the source description. As in constructive induction, the process uses background knowledge for this purpose. Depending on the available background knowledge, constructive deduction may be conclusive or contingent. Abstraction can be viewed as a form of constructive deduction that reduces the amount of information about a given reference set, without changing the set itself. Such a reduction may involve using terms at a "higher level of abstraction" that are derived from the "lower-level" terms. Constructive deduction is a more general concept than abstraction, as it includes any type of deductive knowledge derivation, including transformations of given knowledge into equivalent but different forms, and plausible deductive derivations, such as those based on probabilistic inference (e.g., Schum, 1986; Pearl, 1988) or plausible reasoning (e.g., Collins and Michalski,
1989). In such cases, the distinction between constructive induction and constructive deduction becomes a matter of the degree to which different forms of reasoning play the primary role.

A CLASSIFICATION OF LEARNING PROCESSES

Learning processes can be classified according to many criteria, such as the type of inferential learning strategy used (in our terminology, the type of primary transmutation employed), the type of knowledge representation (logical expressions, decision rules, frames, etc.), the way information is supplied to the learning system (batch vs. incremental), the application area, and so on. Classifications based on such criteria have been discussed in Carbonell, Michalski and Mitchell (1983) and Michalski (1986). The Inferential Theory of Learning outlined above offers a new way of looking at learning processes, and suggests some other classification criteria.

The theory considers learning as a knowledge transformation process whose primary purpose may be either to increase the amount of the learner's knowledge, or to increase the effectiveness of the knowledge already possessed. Therefore, the primary learning purpose can be used as a major criterion for classifying learning processes. Based on this criterion, learning processes are divided into two categories: synthetic and analytic.

The main goal of synthetic learning is to acquire new knowledge that goes beyond the knowledge already possessed, i.e., beyond its deductive closure. Thus, such learning relies on synthetic knowledge transmutations. The primary inference types involved in such processes are induction and/or analogy. (The term "primary" is important, because every inductive or analogical inference also involves deductive inference. The latter form is used, for example, to test whether a generated hypothesis entails the observations, to perform an analogical knowledge transfer based on the hypothesized analogical match, to generate new terms using background knowledge, etc.)

The main goal of analytic learning processes is to transform knowledge that the learner already possesses into the form that is most
desirable and/or effective for achieving the given learning goal. Thus, such learning relies on analytic knowledge transmutations, and the primary inference type used is therefore deduction. For example, one may have complete knowledge of how an automobile works, and can therefore in principle diagnose its problems. By analytic learning, one can derive simple tests and procedures for more efficient diagnosis.

Other important criteria for the classification of learning processes include:

• The type of input information: whether it is in the form of (classified) examples, or in the form of (unclassified) facts or observations.

• The type of primary inference employed in the learning process: induction, deduction or analogy.

• The role of the learner's background knowledge in the learning process: whether learning relies primarily on the input data, primarily on the background knowledge, or on some balanced combination of the two.

Figure 6 presents a classification of learning processes according to the above criteria. A combination of specific outcomes along each criterion determines a class of learning methodologies. Individual methodologies differ in terms of the knowledge representation employed, the underlying computational mechanism, or the specific learning goal (e.g., learning rules for recognizing unknown instances, learning classification structures, or learning equations). Methodologies such as empirical generalization, neural-net learning and genetic-algorithm-based learning all share a general goal (knowledge synthesis), have input in the form of examples or observed facts (rather than rules or other forms of general knowledge), perform induction as the primary form of inference, and involve a relatively small amount of background knowledge. The differences among them lie in the knowledge representation employed and the underlying computational mechanism. If the input to a synthetic learning method consists of examples classified by some source of knowledge, e.g., a teacher, then we have learning from examples. Such learning can be divided in turn into "instance-to-class" and "part-to-whole" categories (not shown in the figure).
[Figure 6 appears here: a tree-structured diagram classifying learning processes. Its levels correspond to the criteria discussed above: primary purpose (synthetic vs. analytic); type of input (from observation vs. from examples, and example-guided vs. specification-guided); type of primary inference (inductive, analogy, deductive); role of prior knowledge (empirical induction vs. constructive induction, and constructive deduction vs. axiomatic deduction); and learning goal and/or representational paradigm. The leaves name methodologies, among them empirical symbolic generalization, qualitative discovery, conceptual clustering, abductive learning, constructive inductive generalization, simple and advanced case-based learning, learning by analogy, abstraction, problem reformulation, "pure" explanation-based learning, integrated empirical and explanation-based learning, learning by plausible deduction, automatic program synthesis, operationalization, neural net learning, genetic algorithms, reinforcement learning, and multistrategy (task-adaptive) systems.]

Figure 6. A general classification of learning processes.
31 In the "instance-to-class" category, examples are independent entities that represent a given class or concepts. For example, learning a general diagnostic rule for a given disease from characteristics of the patients with this disease is an "instance-to-class" generalization. Here each patient is an independent example of the disease. In the "part-to-whole" category, examples are interdependent components that have to be investigated together in order to generate a concept description. For example, a "part-to-whole" inductive learning is to hypothesize a complete shape and look of a prehistoric animal from a collection of its bones. When the input to a synthetic learning method includes facts that need to be described or organized into a knowledge structure, without the benefit of advise of a teacher, then we have learning from observation. The latter is exemplified by learning by discovery, conceptual clustering and theory formation categories. The primary type of inference used in synthetic learning is induction. As described earlier, inductive inference can be empirical (background knowledge-limited) or constructive (background knowledge-intensive). Most work in empirical induction has been concerned with empirical generalization of concept examples using attributes selected from among those present in the descriptions of the examples. Another form of empirical learning includes quantitative discovery, in which learner constructs a set of equations characterizing given data. Empirical inductive learning (both from examples, also called supervised learning, and from observation, also called unsupervised learning) can be done using several different methodologies, such as symbolic empirical generalization, neural net learning, genetic algorithm learning, reinforcement learning ("learning from feedback"), simple forms of conceptual clustering and case-based learning. The above methods typically rely on (or need) relatively small amount of background knowledge, and all perform some form of induction. They differ from each other in the type of knowledge
representation, the computational paradigm, and/or the type of knowledge they aim to learn. Symbolic methods frequently use representations such as decision trees, decision rules, logic-style representations (e.g., Horn clauses or limited forms of predicate calculus), semantic networks or frames. Neural nets use networks of neuron-like units; genetic algorithms often use classifier systems. Conceptual clustering typically uses decision rules or structural logic-style descriptions, and aims at creating classifications of given entities together with descriptions of the created classes. Reinforcement learning acquires a mapping from situations to actions that optimizes some reward function, and may use a variety of representations, such as neural nets, sets of mathematical equations, or domain-oriented languages (Sutton, 1992).

In contrast to empirical inductive learning, constructive inductive learning is knowledge-intensive. It uses background knowledge and/or search techniques to create new attributes, terms or predicates that are more relevant to the learning task, and uses them to derive characterizations of the input. These characterizations can be generalizations, explanations or both. As described before, abduction can be viewed as a form of knowledge-intensive (constructive) induction, which "traces backward" domain-dependent rules to create explanations of the given input. Many methods for constructive induction use decision rules for representing both background knowledge and acquired knowledge.

For completeness, we will also mention some other classifications of synthetic methods, not shown in Figure 6. One classification is based on the way facts or examples are presented to the learner. If examples (in supervised learning) or facts (in unsupervised learning) are presented all at once, then we have one-step or non-incremental inductive learning. If they are presented one by one, or in portions, so that the system has to modify the currently held hypothesis after each input, we have incremental inductive learning. Incremental learning may proceed with no memory, with partial memory, or with complete memory of past facts or examples. Most incremental
machine learning methods fall into the "no memory" category, in which all knowledge of past examples is incorporated in the currently held hypothesis. Human learning typically falls into the "partial memory" category, in which the learner remembers not only the currently held hypothesis, but also representative past examples supporting it.

A second classification is based on whether the input facts or examples can be assumed to be totally correct, or may contain errors and/or noise. Thus, we can have learning from a perfect source or from an imperfect (noisy) source of information.

A third classification characterizes learning methods (or processes) by the way instances are matched with concept descriptions. Such matching can be done in a direct way, which can be complete or partial, or in an indirect way. The latter employs inference and a substantial amount of background knowledge. For example, rule-based learning may employ a direct match, in which an example has to exactly satisfy the condition part of some rule, or a partial match, in which a degree of match is computed and the rule that gives the best match is fired. Advanced case-based learning methods employ matching procedures that may conduct an extensive amount of inference to match a new example with past examples (e.g., Bareiss, Porter and Wier, 1990). Learning methods based on the two-tiered concept representation (Bergadano et al., 1992) also use inference procedures for matching an input with the stored knowledge. In both cases, the matching procedures perform a "virtual" generalization transmutation.

Analytic methods can be divided into those that are guided by an example in the process of knowledge reformulation (example-guided), and those that start with a knowledge specification (specification-guided). The former category includes explanation-based learning (e.g., DeJong et al., 1986), explanation-based generalization (Mitchell et al., 1986), and explanation-based specialization (Minton et al., 1987; Minton, 1988). If the deduction employed in the method is based on axioms, then the method is called axiomatic. A "pure" explanation-based generalization is an
example of an axiomatic method, because it is based on a deductive process that utilizes a complete and consistent domain knowledge. This domain knowledge plays a role analogous to that of axioms in formal theories. Synthesizing a computer program from its formal specification is a specification-guided form of analytic learning. Analytic methods that involve truth-preserving transformations of description spaces and/or plausible deduction are classified as methods of "constructive deduction." One important subclass consists of methods utilizing abstraction as a knowledge transformation operation. Other subclasses include methods employing contingent deduction, e.g., plausible deduction or probabilistic reasoning.

The type of knowledge representation employed in a learning system can be used as another dimension for classifying learning systems (also not shown in Figure 6). Learning systems can be classified according to this criterion into those that use a logic-style representation, decision trees, production rules, frames, semantic networks, grammars, neural networks, classifier systems, PROLOG programs, etc., or a combination of different representations. The knowledge representation used in a learning system is often dictated by the application domain. It also depends on the type of learning strategy employed, as not every knowledge representation is suitable for every type of learning strategy.

Multistrategy learning systems integrate two or more inferential strategies and/or computational paradigms. Currently, most multistrategy systems integrate some form of empirical inductive learning with explanation-based learning, e.g., Unimem (Lebowitz, 1986), Odysseus (Wilkins, Clancey, and Buchanan, 1986), Prodigy (Minton et al., 1987), GEMINI (Danyluk, 1987 and 1989), OCCAM (Pazzani, 1988), IOE (Dietterich and Flann, 1988) and ENIGMA (Bergadano et al., 1990). Some systems also include a form of analogy, e.g., DISCIPLE-1 (Kodratoff and Tecuci, 1987) and CLINT (De Raedt and Bruynooghe, 1993). Systems applying analogy are sometimes viewed as multistrategy, because analogy is an inference combining induction and deduction. An advanced
case-based reasoning system that uses different inference types to match an input with past cases can also be classified as multistrategy.

The Inferential Theory of Learning is a basis for the development of multistrategy task-adaptive learning (MTL), first proposed by Michalski (1990a). The aim of MTL is to synergistically integrate such strategies as empirical learning, analytic learning, constructive induction, analogy, abduction, abstraction, and ultimately also reinforcement strategies. An MTL system determines by itself which strategy, or combination of strategies, is most suitable for a given learning task. In an MTL system, strategies may be integrated loosely, in which case they are represented as different modules, or tightly, in which case one underlying representational mechanism supports all strategies. Various aspects of research on MTL have been reported by Michalski (1990c) and by Tecuci and Michalski (1991a,b). Related work was also reported by Tecuci (1991a,b; 1992).

Summarizing, the theory postulates that learning processes can be described in terms of generic patterns of inference, called transmutations. A few basic knowledge transmutations have been discussed, and characterized in terms of three dimensions:

A. The type of logical relationship between the input and the output: induction vs. deduction.

B. The direction of the change of the reference set: generalization vs. specialization.

C. The direction of the change in the level of detail of the description: abstraction vs. concretion.

Each of the above dimensions corresponds to a different mechanism of knowledge transmutation that may occur in a learning process. The operations involved in the first two mechanisms, induction vs. deduction and generalization vs. specialization, have been relatively well explored in machine learning. The operations involved in the third mechanism, abstraction vs. concretion, have been studied relatively less. Because these three mechanisms are interdependent, not all combinations of operations can occur in a learning process. The problems of how to quantitatively
and effectively measure the amount of change in the reference set and in the level of detail of descriptions are important topics for future research.

The presented classification of learning processes characterizes and relates to each other major subareas of machine learning. Like any classification, it is useful only to the degree to which it illustrates important distinctions and relations among the various categories. The ultimate goal of this classification effort is to show that diverse learning mechanisms and paradigms can be viewed as parts of one general structure, rather than as a collection of unrelated components.

SUMMARY

The goals of this research are to develop a theoretical framework and an effective methodology for characterizing and unifying diverse learning strategies and approaches. The proposed Inferential Theory looks at learning as a process of making goal-oriented knowledge transformations. Consequently, it proposes to analyze learning methods in terms of generic types of knowledge transformation, called transmutations, that occur in learning processes. Several transmutations have been discussed and characterized along three dimensions: the type of logical relationship between input and output (induction vs. deduction), the change in the reference set (generalization vs. specialization), and the change in the level of detail of a description (abstraction vs. concretion). Deduction and induction have been presented as the two basic forms of inference. In addition to the widely studied inductive generalization, other forms of induction have been discussed, such as inductive specialization, concretion, and abduction. It has also been shown that abduction can be viewed as a knowledge-based induction, and abstraction as a form of deduction.

The Inferential Theory can serve as a conceptual framework for the development of multistrategy learning systems that combine different inferential learning strategies. Research in this direction has led to the formulation of multistrategy task-adaptive learning (MTL), which dynamically and synergistically adapts the learning strategy, or a combination of strategies, to the learning task.
Many of the ideas discussed here are at a very early stage of development, and many issues remain unresolved. Future research should develop a more formal characterization of the presented transmutations, and effective methods for characterizing different knowledge transmutations and measuring their "degrees." Another important research area is to determine how various learning algorithms and paradigms map into the described knowledge transmutations.

In conclusion, the ITL provides a new viewpoint for analyzing and characterizing learning processes. By addressing their logical capabilities and limitations, it strives to analyze and understand the competence aspects of learning processes. Among its major goals are to develop effective methods for determining what kind of knowledge a learner can acquire from what kind of inputs, to determine the areas of most effective applicability of different learning methods, and to gain new insights into how to develop more advanced learning systems.

ACKNOWLEDGMENTS

The author expresses his gratitude to George Tecuci and Tom Arciszewski for useful and stimulating discussions of the material presented here. Thanks also go to many other people for their insightful comments and criticism of various aspects of this work, in particular, Susan Chipman, Hugo De Garis, Mike Hieb, Ken Kaufman, Yves Kodratoff, Elizabeth Marchut-Michalski, Alan Meyrowitz, David A. Schum, Brad Utz, Janusz Wnek, Jianping Zhang, and the students who took the author's Machine Learning class. This research was done in the Artificial Intelligence Center of George Mason University. The research activities of the Center have been supported in part by the Office of Naval Research under grants No. N00014-88-K-0397, No. N00014-88-K-0226, No. N00014-90-J-4059, and No. N00014-91-J-1351, in part by the National Science Foundation under grant No. IRI-9020266, and in part by the Defense Advanced Research Projects Agency under grants administered by the Office of Naval Research, No. N00014-87-K-0874 and No. N00014-91-J-1854.
REFERENCES

Adler, M.J. and Gorman, W. (Eds.), The Great Ideas: A Syntopicon of Great Books of the Western World, Vol. 1, Ch. 39 (Induction), pp. 565-571, Encyclopedia Britannica, Inc., 1987.

Aristotle, Posterior Analytics, in The Works of Aristotle, Vol. 1, R.M. Hutchins (Ed.), Encyclopedia Britannica, Inc., 1987.

Bacon, F., Novum Organum, 1620.

Bareiss, E.R., Porter, B. and Wier, C.C., "PROTOS: An Exemplar-based Learning Apprentice," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Carbonell, J.G. and Mitchell, T.M. (Eds.), Morgan Kaufmann, 1990.

Bergadano, F., Matwin, S., Michalski, R.S. and Zhang, J., "Learning Two-tiered Descriptions of Flexible Concepts: The POSEIDON System," Machine Learning Journal, Vol. 8, No. 1, January 1992.

Carbonell, J.G., Michalski, R.S. and Mitchell, T.M., "An Overview of Machine Learning," in Machine Learning: An Artificial Intelligence Approach, Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Eds.), Morgan Kaufmann Publishers, 1983.

Cohen, L.J., The Implications of Induction, London, 1970.

Collins, A. and Michalski, R.S., "The Logic of Plausible Reasoning: A Core Theory," Cognitive Science, Vol. 13, pp. 1-49, 1989.

Danyluk, A.P., "Recent Results in the Use of Context for Learning New Rules," Technical Report No. TR-98-066, Philips Laboratories, 1989.

DeJong, G. and Mooney, R., "Explanation-Based Learning: An Alternative View," Machine Learning Journal, Vol. 1, No. 2, 1986.

Dietterich, T.G. and Flann, N.S., "An Inductive Approach to Solving the Imperfect Theory Problem," Proceedings of the 1988 Symposium on Explanation-Based Learning, pp. 42-46, Stanford University, 1988.

Goodman, L.A. and Kruskal, W.H., Measures of Association for Cross Classifications, Springer-Verlag, New York, 1979.

Kodratoff, Y. and Tecuci, G., "DISCIPLE-1: Interactive Apprentice System in Weak Theory Fields," Proceedings of IJCAI-87, pp. 271-273, Milan, Italy, 1987.

Lebowitz, M., "Integrated Learning: Controlling Explanation," Cognitive Science, Vol. 10, No. 2, pp. 219-240, 1986.
Michalski, R.S., "A Planar Geometrical Model for Representing Multi-Dimensional Discrete Spaces and Multiple-Valued Logic Functions," Report No. 897, Department of Computer Science, University of Illinois, Urbana, January 1978.

Michalski, R.S., "Theory and Methodology of Inductive Learning," in Machine Learning: An Artificial Intelligence Approach, Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Eds.), Tioga Publishing Co., 1983.

Michalski, R.S., "Understanding the Nature of Learning: Issues and Research Directions," in Machine Learning: An Artificial Intelligence Approach, Vol. II, Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Eds.), Morgan Kaufmann Publishers, 1986.

Michalski, R.S., "Toward a Unified Theory of Learning: Multistrategy Task-adaptive Learning," Reports of the Machine Learning and Inference Laboratory, MLI-90-1, January 1990a.

Michalski, R.S. and Kodratoff, Y., "Research in Machine Learning: Recent Progress, Classification of Methods and Future Directions," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Kodratoff, Y. and Michalski, R.S. (Eds.), Morgan Kaufmann Publishers, Inc., 1990b.

Michalski, R.S., "Learning Flexible Concepts: Fundamental Ideas and a Method Based on Two-tiered Representation," in Machine Learning: An Artificial Intelligence Approach, Vol. III, Kodratoff, Y. and Michalski, R.S. (Eds.), Morgan Kaufmann Publishers, Inc., 1990c.

Michalski, R.S., "Inferential Theory of Learning: Developing Foundations for Multistrategy Learning," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R.S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993.

Minton, S., "Quantitative Results Concerning the Utility of Explanation-Based Learning," Proceedings of AAAI-88, pp. 564-569, Saint Paul, MN, 1988.

Minton, S., Carbonell, J.G., Etzioni, O., et al., "Acquiring Effective Search Control Rules: Explanation-Based Learning in the PRODIGY System," Proceedings of the 4th International Machine Learning Workshop, pp. 122-133, University of California, Irvine, 1987.

Mitchell, T.M., Keller, T. and Kedar-Cabelli, S., "Explanation-Based Generalization: A Unifying View," Machine Learning Journal, Vol. 1, January 1986.
Pazzani, M.J., "Integrating Explanation-Based and Empirical Learning Methods in OCCAM," Proceedings of EWSL-88, pp. 147-166, Glasgow, Scotland, 1988.

Pearl, J., Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, 1988.

Piatetsky-Shapiro, G., "Probabilistic Data Dependencies," Proceedings of the ML92 Workshop on Machine Discovery, Zytkow, J.M. (Ed.), Aberdeen, Scotland, July 4, 1992.

Popper, K.R., Objective Knowledge: An Evolutionary Approach, Oxford at the Clarendon Press, 1972.

Poole, D., "Explanation and Prediction: An Architecture for Default and Abductive Reasoning," Computational Intelligence, No. 5, pp. 97-110, 1989.

Porter, B.W. and Mooney, R.J. (Eds.), Proceedings of the 7th International Machine Learning Conference, Austin, TX, 1990.

De Raedt, L. and Bruynooghe, M., "CLINT: A Multistrategy Interactive Concept Learner," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R.S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993 (to appear).

Rumelhart, D.E., McClelland, J.L. and the PDP Research Group, Parallel Distributed Processing, Vols. 1 & 2, A Bradford Book, The MIT Press, Cambridge, Massachusetts, 1986.

Russell, S., The Use of Knowledge in Analogy and Induction, Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1989.

Schafer, D. (Ed.), Proceedings of the 3rd International Conference on Genetic Algorithms, George Mason University, June 4-7, 1989.

Schum, D.A., "Probability and the Processes of Discovery, Proof, and Choice," Boston University Law Review, Vol. 66, Nos. 3 and 4, May/July 1986.

Segre, A.M. (Ed.), Proceedings of the Sixth International Workshop on Machine Learning, Cornell University, Ithaca, New York, June 26-27, 1989.

Sutton, R.S. (Ed.), Special Issue on Reinforcement Learning, Machine Learning Journal, Vol. 8, No. 3/4, May 1992.

Tecuci, G., "A Multistrategy Learning Approach to Domain Modeling and Knowledge Acquisition," in Kodratoff, Y. (Ed.), Proceedings of the European Conference on Machine Learning, Porto, Springer-Verlag, 1991a.
Tecuci, G., "Steps Toward Automating Knowledge Acquisition for Expert Systems," in Rappaport, A., Gaines, B. and Boose, J. (Eds.), Proceedings of the AAAI-91 Workshop on Knowledge Acquisition "From Science to Technology to Tools," Anaheim, CA, July 1991b.

Tecuci, G. and Michalski, R.S., "A Method for Multistrategy Task-adaptive Learning Based on Plausible Justifications," in Birnbaum, L. and Collins, G. (Eds.), Machine Learning: Proceedings of the Eighth International Workshop, San Mateo, CA, Morgan Kaufmann, 1991a.

Tecuci, G. and Michalski, R.S., "Input 'Understanding' as a Basis for Multistrategy Task-adaptive Learning," in Ras, Z. and Zemankova, M. (Eds.), Proceedings of the 6th International Symposium on Methodologies for Intelligent Systems, Lecture Notes on Artificial Intelligence, Springer-Verlag, 1991b.

Touretzky, D., Hinton, G. and Sejnowski, T. (Eds.), Proceedings of the 1988 Connectionist Models Summer School, Carnegie Mellon University, June 17-26, 1988.

Utgoff, P., "Shift of Bias for Inductive Concept Learning," in Machine Learning: An Artificial Intelligence Approach, Vol. II, Michalski, R.S., Carbonell, J.G. and Mitchell, T.M. (Eds.), Morgan Kaufmann Publishers, 1986.

Warmuth, M. and Valiant, L. (Eds.), Proceedings of the 4th Annual Workshop on Computational Learning Theory, Santa Cruz, CA, Morgan Kaufmann, 1991.

Whewell, W., History of the Inductive Sciences, 3 vols., Third edition, London, 1857.

Wilkins, D.C., Clancey, W.J. and Buchanan, B.G., An Overview of the Odysseus Learning Apprentice, Kluwer Academic Press, New York, NY, 1986.

Wnek, J., Sarma, J., Wahab, A.A. and Michalski, R.S., "Comparing Learning Paradigms via Diagrammatic Visualization: A Case Study in Concept Learning Using Symbolic, Neural Net and Genetic Algorithm Methods," Proceedings of the 5th International Symposium on Methodologies for Intelligent Systems, University of Tennessee, Knoxville, TN, North-Holland, October 24-27, 1990.

Wnek, J. and Michalski, R.S., "Comparing Symbolic and Subsymbolic Learning: A Case Study," in Machine Learning: A Multistrategy Approach, Vol. IV, Michalski, R.S. and Tecuci, G. (Eds.), Morgan Kaufmann, 1993.
Chapter 2

Adaptive Inference

Alberto Segre, Charles Elkan (2), Daniel Scharstein, Geoffrey Gordon (3), and Alexander Russell (4)

Department of Computer Science, Cornell University, Ithaca, NY 14853
Abstract

Automatically improving the performance of inference engines is a central issue in automated deduction research. This paper describes and evaluates mechanisms for speeding up search in an inference engine used in research on reactive planning. The inference engine is adaptive in the sense that its performance improves with experience. This improvement is obtained via a combination of several different learning mechanisms, including a novel explanation-based learning algorithm, bounded-overhead success and failure caches, and dynamic reordering and reformulation mechanisms. Experimental results show that the beneficial effect of multiple speedup techniques is greater than the beneficial effect of any individual technique. Thus a wide variety of learning methods can reinforce each other in improving the performance of an automated deduction system.
(1) Support for this research was provided by the Office of Naval Research grants N00014-88-K-0123 and N00014-90-J-1542, and through gifts from the Xerox Corporation and the Hewlett-Packard Company.
(2) Current address: Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA 92093.
(3) Current address: Corporate Research and Development, General Electric, Schenectady, NY 12301.
(4) Current address: Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02134.
INTRODUCTION

This paper presents an overview of our work in adaptive inference. It represents part of a larger effort studying the application of machine learning techniques to planning in uncertain, dynamic domains.(5) In particular, it describes the implementation and empirical evaluation of a definite-clause, adaptive, automated deduction system. Our inference engine is adaptive in the sense that its performance characteristics change with experience. While others have previously suggested augmenting PROLOG interpreters with explanation-based learning components (Prieditis & Mostow, 1987), our system is the first to integrate a wide variety of speedup techniques such as explanation-based learning, bounded-overhead success and failure caching, heuristic antecedent reordering strategies, learned-rule management facilities, and a dynamic abstraction mechanism.

Adaptive inference is an effort to bias the order of search exploration so that more problems of interest are solvable within a given resource limit. Adaptive methods include techniques normally considered speedup learning methods as well as other techniques not normally associated with machine learning. All the methods that we consider, however, rely on an underlying assumption about how the inference engine is to be used. The goal of most work within the automated deduction community is to construct inference engines which are fast and powerful enough to solve very large problems once. Large problems which were previously not mechanically solvable in a reasonable amount of time are of special interest. Once a problem is solved, another, unrelated, problem may be attempted. In contrast, we are interested in using our inference engine to solve a collection of related problems drawn from a fixed (but possibly unknown) problem distribution. These problems are all solved using the same domain theory.

A complicating factor is that the inference engine is operating under rigid, externally imposed resource constraints. For example, in our own planning work, it is necessary to keep the resource constraint low enough that the SEPIA agent is able to plan in real time. A stream of queries, corresponding to goals initiated by sensory input to the agent, is passed to the inference engine; the inference engine uses a logic of approximate plans (Elkan, 1990) to derive sequences of actions which are likely to achieve the goal.
(5) Our SEPIA intelligent agent architecture (Segre & Turney, 1992a, 1992b) builds on our previous work in learning and planning (Elkan, 1990; Segre, 1987, 1988, 1991; Segre & Elkan, 1990; Turney & Segre, 1989a, 1989b). The goal of the SEPIA project is to build a scalable, real-time learning agent.
Since much of the agent's world does not change from one query to the next, information obtained while answering one query can dramatically affect the size of the search space which must be explored for subsequent ones. The information retained may take many different forms: facts about the world state, generalized schemata of inferential reasoning, advice regarding fruitless search paths, etc. Regardless of form, however, the information is used to alter the search behavior of the inference engine. All of the adaptive inference techniques we employ share this same underlying theme.

The message of this paper is that multiple speedup techniques can be applied in combination to significantly improve the performance of an automated deduction system (Segre, 1992). We begin by describing the design and implementation of our definite-clause automated deduction system and the context in which we intend to use it. Next, we present a methodology for reliably measuring changes in the performance of our system (Segre, Elkan & Russell, 1990, 1991; Segre, Elkan, Gordon & Russell, 1991). Of course, in order to discuss the combination of speedup techniques, it is necessary to understand each technique individually; thus we introduce each speedup technique, starting with our bounded-overhead caching system (Segre & Scharstein, 1991). We then discuss speedup learning and EBL* (our heuristic formulation of explanation-based learning) (Elkan & Segre, 1989; Segre & Elkan, 1990) and show how EBL* can acquire more useful new information than traditional EBL systems. We also briefly touch on other speedup techniques, describe our current efforts in combining and evaluating these techniques, and sketch some future directions for adaptive inference.

DEFINITE-CLAUSE INFERENCE ENGINES

A definite-clause inference engine is one whose domain theory (i.e., knowledge base) is a set of definite clauses, where a definite clause is a rule with a head consisting of a single literal and a body consisting of some number of non-negated antecedent literals. A set of definite clauses is a pure PROLOG program, but a definite-clause inference engine may be much more sophisticated than a standard pure PROLOG interpreter. All definite-clause inference engines, however, search an implicit AND/OR tree defined by the domain theory and the query, or goal, under consideration. Each OR node in this implicit AND/OR tree corresponds to a subgoal that must be unified with the head of some matching clause in the domain theory, while each AND node corresponds to the body of a clause in the domain theory. The children of an OR node represent alternative paths to search for a proof of the subgoal, while the children of an AND node represent sibling subgoals which
require mutually-consistent solutions.

We are particularly interested in resource-limited inference engines. Resource limits specify an upper bound on the resources which may be allocated to solving a given problem or query before terminating the search and assuming no solution exists. Such limits are generally imposed in terms of the maximum depth of search attempted, the maximum number of nodes explored, or the maximum CPU time expended before failing. While some inference engines may not appear to possess explicit resource limits, in practice all inference engines must be resource-limited, since in most interesting domains some problems require an arbitrarily large amount of resources.

Any resource limit creates a horizon effect: only queries with proofs that are sufficiently small according to the resource measure are solvable; others are beyond the horizon. More precisely, a domain theory and a resource-limited inference engine architecture together determine a resource-limited deductive closure, D_R, which is the set of all queries whose solutions can be found within the given resource bound R. D_R is, by construction, a subset of the deductive closure D of the domain theory. The exact size and composition of D_R depend on several factors: the domain theory, the resource limit, and the search strategy used.

The search strategy determines the order in which the nodes of the implicit AND/OR tree are explored. Different exploration orders not only correspond to different resource-limited deductive closures D_R, but also to different proofs of the queries in D_R, as well as different node-expansion costs. For example, breadth-first inference engines guarantee finding the shallowest proof, but require excessive space for problems of any significant size. Depth-first inference engines require less space, but risk not terminating when the domain theory is recursive. Choosing an appropriate search strategy is a critical design decision when constructing an inference engine.

The Testbed Inference Engine

We have implemented a backward-chaining definite-clause inference engine in Common Lisp. The inference engine's inference scheme is essentially equivalent to PROLOG's SLD-resolution inference scheme. Axioms are stored in a discrimination-net database along with rules indexed by the rule head. The database performs a pattern-matching retrieval guaranteed to return a superset of those database entries which unify with the retrieval pattern. The cost of a single database retrieval in this model grows linearly with the number of matches found and logarithmically with the number of entries in the database.
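The implicit AND/OR space induced by a definite-clause theory, together with a simple depth resource limit, can be sketched in a few lines. This is a hypothetical illustration only, simplified to the propositional case (no variables or unification); it is not the authors' Common Lisp implementation, and the theory and names are invented.

    # A minimal sketch of the implicit AND/OR space for definite clauses,
    # propositional case.  Hypothetical illustration only.

    # A clause is a (head, body) pair; facts have empty bodies.
    THEORY = {
        "s": [["r", "q"]],        # s <- r and q
        "r": [["p"], ["q"]],      # two alternative clauses for r
        "p": [[]],                # a fact
        "q": [[]],
    }

    def prove(goal, theory, depth=10):
        """OR node: try each matching clause; AND node: prove every
        antecedent.  The depth argument acts as a crude resource limit."""
        if depth == 0:
            return False                    # beyond the resource horizon
        for body in theory.get(goal, []):   # children of the OR node
            if all(prove(g, theory, depth - 1) for g in body):  # AND node
                return True
        return False

    print(prove("s", THEORY))   # True

Queries whose shallowest proof exceeds the depth argument fail even though they lie in the deductive closure D; this is the horizon effect described above.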
The system relies on a well-understood technique called iterative deepening (Korf, 1985) to force completeness in recursive domains while still taking advantage of depth-first search's favorable storage characteristics. As generally practiced, iterative deepening involves limiting depth-first search exploration to a fixed depth. If no solution is found by the time the depth-limited search space is exhausted, the depth limit is incremented and the search is restarted. In return for completeness in recursive domains, depth-first iterative deepening generally entails a constant-factor overhead when compared to regular depth-first search; the size of this constant depends on the branching factor of the search space and the value of the depth increment. Changing the increment changes the order of exploration of the implicit search space and, therefore, the performance of the inference engine.

Our inference engine performs iterative deepening on a generalized, user-defined notion of depth while respecting the overall search resource limit specified at query time. Fixing a depth-update function (and thus a precise definition of depth) and an iterative-deepening increment establishes the exploration order of the inference engine. For example, one might define the iterative-deepening update function to compute the depth of the search; with this strategy, the system performs traditional iterative deepening. Alternatively, one might specify update functions for conspiratorial iterative deepening (Elkan, 1989), iterative broadening (M. Ginsberg & Harvey, 1990), or numerous other search strategies.(6)

Our implementation supports the normal PROLOG cut and fail operations, and therefore constitutes a full PROLOG interpreter. Unlike PROLOG, however, our inference engine also supports procedural attachment (i.e., escape to Lisp), which, among other things, allows for dynamic restriction and relaxation of resource limits. In addition, for a successful query our system produces a structure representing the derivation tree for a solution rather than a PROLOG-like answer substitution. When a failure is returned, the system indicates whether the failure was due to exceeding a resource limit or whether we can in fact be guaranteed the absence of any solution.
(6) The conspiracy size of a subgoal corresponds to the number of other, as yet unsolved, subgoals in the current proof structure. Thus conspiratorial best-first search prefers narrow proofs to bushy proofs, regardless of the actual depth of the resulting derivation. Iterative broadening is an analogous idea that performs iterative deepening on the breadth of the candidate proofs.
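The generalized iterative-deepening driver just described can be sketched as follows. The driver and its parameter names are our illustrative assumptions; in the authors' engine the user-defined depth-update function lives inside the prover, which here is abstracted into a callback.

    # A minimal sketch of iterative deepening over a user-defined notion
    # of depth.  Hypothetical illustration; names are ours.

    def iterative_deepening(prove, start_limit, increment, max_limit):
        """Repeatedly run a limit-bounded search, raising the limit until
        a solution is found or the overall resource limit is exhausted."""
        limit = start_limit
        while limit <= max_limit:
            result = prove(limit)   # limit-bounded depth-first search
            if result is not None:
                return result
            limit += increment      # restart with a wider horizon
        return None

    # Usage (hypothetical), where prove_query(q, d) returns a proof or None:
    #   iterative_deepening(lambda d: prove_query(query, d),
    #                       start_limit=2, increment=2, max_limit=20)

Whether the bounded quantity is search depth, conspiracy size, or proof breadth is determined entirely by the prover's depth-update function, which is what makes the scheme generalize to the strategies cited above.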
The proof object is a tree whose structure reflects the successful portion of the search. The nodes in the tree are of two different types. A consequent node is used to represent an instance of a domain theory element (a rule consequent or a fact) that matches (i.e., unifies with) the query or current subgoal. A subgoal node represents an instantiation of a domain theory rule antecedent. The edges of the tree make explicit the relations between the nodes, and are also of two distinct types. A rule edge links the consequent node representing a rule's consequent to the subgoal nodes representing the rule's antecedents, while a match edge links a subgoal node to the consequent node below it (i.e., each match edge corresponds to a successful unification). The root of any proof tree is the consequent node linked by a match edge to the subgoal node representing the original query. The leaves of trees representing completed proofs are also consequent nodes, where each leaf represents a fact in the domain theory.

A proof tree is valid relative to a given domain theory if and only if: (1) all subgoal-consequent node pairs linked by a match edge in the tree represent identical expressions, and (2) every rule instance in the tree is a legal instance of a rule in the domain theory. If a proof tree is valid, then the truth value of the goal appearing at its root is logically entailed by the truth value of the set of leaves of the tree that are subgoals. An example should help make this clear.

An Example

Consider the following simple domain theory, where universally quantified variables are indicated by a leading question mark:

Facts:
H(A,B)    H(C,A)    J(?x)    K(B)    K(C)
Rules:
M(?y) <- N(?x,?x) ∧ H(?x,?y)
N(?x,?x) <- J(?x)
P(?x) <- H(?x,?y) ∧ J(?y)
Q(?x) <- H(?y,?x) ∧ K(?y)
R(?x,?y) <- P(?x) ∧ M(?y)
S(?x) <- R(?x,?y) ∧ Q(?y)
Given the query S(A), the derivation shown in Figure 1 is constructed by the inference engine. In the proof of Figure 1, if a subgoal node and a consequent node are linked by a match edge, then they unify. The original query S(A) is represented as a subgoal node, while the root of the proof tree is the consequent node directly below the original query. Note that every
rule instance (collection of nodes connected by rule edges) in the proof, e.g., the rule instance

    R(A,A) <- P(A) ∧ M(A)        (1)

is a legal substitution instance of a domain-theory rule; thus this proof is valid. Given that the proof is valid and that the leaf subgoal nodes all correspond to valid facts from the domain theory, we can conclude that the root node lies within the domain theory's deductive closure.
[Figure 1 appears here: the complete proof tree for S(A). The legible fragments of the figure include the nodes R(A,A), N(C,C), J(C), H(C,A) and K(C), with double lines marking the matches of subgoals such as J(C) and K(C) to facts.]
Figure 1: Original proof of S(A). Rule edges are represented by single lines, while match edges are represented by double lines.
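To make the example executable, here is a small backward-chaining prover with unification, run on this domain theory. It is an illustrative sketch only: the encoding of the theory follows our reconstruction of the garbled original above, and the authors' testbed is a Common Lisp system with a discrimination-net database rather than the naive list scans used here.

    # A minimal backward-chaining (SLD) prover for the example theory.
    # Hypothetical sketch; all names and encodings are ours.
    import itertools

    # Atoms are tuples (predicate, arg, ...); strings starting with "?"
    # are variables.  A clause is (head, [body atoms]); facts have [].
    THEORY = [
        (("H", "A", "B"), []), (("H", "C", "A"), []),
        (("J", "?x"), []), (("K", "B"), []), (("K", "C"), []),
        (("M", "?y"), [("N", "?x", "?x"), ("H", "?x", "?y")]),
        (("N", "?x", "?x"), [("J", "?x")]),
        (("P", "?x"), [("H", "?x", "?y"), ("J", "?y")]),
        (("Q", "?x"), [("H", "?y", "?x"), ("K", "?y")]),
        (("R", "?x", "?y"), [("P", "?x"), ("M", "?y")]),
        (("S", "?x"), [("R", "?x", "?y"), ("Q", "?y")]),
    ]
    counter = itertools.count()

    def walk(t, s):                      # follow variable bindings
        while isinstance(t, str) and t.startswith("?") and t in s:
            t = s[t]
        return t

    def unify(a, b, s):
        a, b = walk(a, s), walk(b, s)
        if a == b:
            return s
        if isinstance(a, str) and a.startswith("?"):
            return {**s, a: b}
        if isinstance(b, str) and b.startswith("?"):
            return {**s, b: a}
        if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
            for x, y in zip(a, b):
                s = unify(x, y, s)
                if s is None:
                    return None
            return s
        return None

    def rename(atom, sfx):               # standardize clause variables apart
        return tuple(t + sfx if isinstance(t, str) and t.startswith("?")
                     else t for t in atom)

    def prove(goal, s, depth):
        """Depth-limited SLD resolution; yields (bindings, proof tree)."""
        if depth == 0:
            return
        sfx = "_%d" % next(counter)
        for head, body in THEORY:
            s1 = unify(rename(head, sfx), goal, dict(s))
            if s1 is None:
                continue
            for s2, subs in prove_all([rename(a, sfx) for a in body],
                                      s1, depth - 1):
                yield s2, (goal, subs)

    def prove_all(goals, s, depth):
        if not goals:
            yield s, []
            return
        for s1, p in prove(goals[0], s, depth):
            for s2, ps in prove_all(goals[1:], s1, depth):
                yield s2, [p] + ps

    def resolve(t, s):                    # apply final bindings to a term
        t = walk(t, s)
        return tuple(resolve(x, s) for x in t) if isinstance(t, tuple) else t

    s, proof = next(prove(("S", "A"), {}, 8))

    def show(node, s, indent=0):
        goal, children = node
        print("  " * indent + str(resolve(goal, s)))
        for child in children:
            show(child, s, indent + 1)

    show(proof, s)   # prints the instantiated subgoals of the derivation:
                     # S(A), R(A,A), P(A), H(A,B), J(B), M(A), N(C,C),
                     # J(C), H(C,A), Q(A), H(C,A), K(C)

Note that the instantiated subgoals printed by the sketch reproduce the nodes visible in Figure 1, including R(A,A), N(C,C), J(C), H(C,A) and K(C), which is what motivated the reconstruction of the rules given above.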
MEASURING THE PERFORMANCE OF ADAPTIVE SYSTEMS

It is often difficult to extrapolate reliably from empirical data. In Segre, Elkan and Russell (1991), we describe how a series of common methodological choices can compromise the reliability of conclusions drawn from experimental evaluations of speedup learning. We show how problems with the measures, manipulations, and controls of many previous experiments found in the machine-learning literature can preclude drawing reliable and robust conclusions. In Segre, Elkan, Gordon and Russell (1991), we introduce a methodology for experimental evaluations of speedup learning designed to overcome some of these flaws. Our methodology supports conclusions concerning the relative performance of speedup learning systems, allowing us to predict the average-case behavior of a problem solver on large problems from data about performance on a test set of small problems. Such forecasts are based on a theory, or mathematical model, of how problem-solving systems behave. The mathematical model is both justified analytically and supported experimentally. For the purpose of this paper, we will provide just enough justification for our methodology to make the experiments presented later understandable.

Briefly, our methodology models automated deduction as search. The search space explored by an inference engine is a function of the problem being solved, the domain theory used, and the inference engine itself. Our assumption is that, independent of a particular automated deduction system's implementation details, the size of the space it explores — and therefore the time it requires to search it — grows exponentially in a coefficient capturing the intrinsic difficulty of the problem, a coefficient we will denote δ. The intuition underlying δ is that, if we know the difficulty of a given problem a priori, we should be able to project approximately how large a space must be explored by a particular problem solver using a particular domain theory in order to find a solution.(7)

Since we are particularly interested in predicting the performance characteristics of an inference engine, we would like to relate resource use — especially the time t needed to find a solution — to characteristics of the problem, the inference engine, and the domain theory. It is reasonable to assume that a larger search space takes proportionally longer time to search;
(7) Note that δ represents an abstract notion of problem difficulty; it should not be confused with depth of search (although depth of search is usually highly correlated with problem difficulty).
thus the time t to solve a problem of intrinsic difficulty δ (using a given problem solver and domain theory) is expected to be

    t = c · b^δ        (2)
where c is a constant capturing the efficiency of a problem solver in terms of the cost of generating a single node. Equation 2 represents our belief that the time required to obtain a solution is directly related to a characteristic of the particular problem: the characteristic we have informally described as the intrinsic problem difficulty.

The soundness of predictions made using our methodology depends on the applicability of the analytic model. Our model is extremely simple; it may not be obvious that it adequately describes more complex problem-solving architectures. In Segre, Elkan, Gordon and Russell (1991) we explore more detailed analytic models of several common problem-solving architectures, showing in each case that, under reasonable conditions, the detailed models reduce mathematically to the simple model presented here.

Equation 2 can be rewritten so as to express a linear relation between a dependent variable log(t) and an independent variable δ:

    log(t) = δ · log(b) + log(c)        (3)
Equation 3 suggests that, by measuring t over a set of problems of known δ and using standard methods of parametric statistics (e.g., simple linear regression), we can obtain experimental estimates of b and c. Direct comparisons between two different problem solvers can be made by comparing their respective empirically obtained b and c parameters. If b for one is lower than b for the other, then, for difficult enough problems (i.e., problems with large enough δ), the first problem solver will operate systematically faster than the second. Furthermore, once the parameters of a particular problem solver have been estimated, we can use Equation 3 to make projections about the expected behavior of the problem solver on future problems of known δ.

Our statistical analysis yields more robust results than simply comparing solution costs over a set of problems. Our introduction of the δ parameter allows us to make explicit the notion that certain problems are harder than others, thus preventing their respective solution costs from being combined naively. Thus the basic experimental procedure advocated here is quite simple: collect datapoints of the form (δ, t) and use parametric statistics to derive experimental estimates of b and c for the model. This experimental procedure is intended to serve as a rough guide; before conducting a particular experiment, one must decide on a case-by-case basis exactly which
observable measures of t and δ should be used.(8) In addition, since different problem solvers operating with the same resource limits may solve different subsets of the test problems, one must be careful to adjust the analysis for the effects of resource limits on the data collected. For the experiments reported here, we will avoid this problem by setting sufficiently large resource limits; for a full discussion of the issues involved, see Segre, Elkan, Gordon and Russell (1991).

CACHING

In this section, we describe our inference engine's caching scheme. A cache is a device that stores the result of a previous computation so that it can be reused. It trades increased storage cost for reduced dependency on a slow resource. The use of caches has been proposed for storing previously proven subgoals (e.g., success caching) in automated deduction systems (Plaisted, 1988). Here the extra storage required to store successfully proven subgoals is traded against the increased cost of repeatedly proving these subgoals. The utility of such a cache depends on how often subgoals are likely to be repeated; in the case of iterative deepening, we know a priori that subgoals are repeated frequently.

In addition to caching successfully proven subgoals, caching failed subgoals can also improve performance (Elkan, 1989). These failure caches record failed subgoals, along with the resource bounds in force at the time of the failures. Future attempts to prove a cached subgoal are not undertaken unless the resources available are greater than they were when the failed attempt occurred. Failure cache entries may record either an outright failure (i.e., the entire search tree rooted at the subgoal was exhausted without success) or a resource-limited failure (i.e., the search tree rooted at the subgoal was examined unsuccessfully as far as resources allowed, but greater resources may later yield a solution). Resource-limited failure cache entries must contain an additional annotation, describing the resources available at the time of the cached failure.
(8) In Segre, Elkan and Russell (1991) we discuss the problems associated with using CPU time directly as a measure of t; in Segre, Elkan, Gordon and Russell (1991) we advocate using either the number of nodes expanded or the number of nodes generated in the search as a machine-independent experimental approximation of t. There are several credible alternatives available for the independent variable δ: for some problem solvers, depth of the solution produced may be a reasonable estimator of difficulty. Other choices include the logarithm of the size of the solution produced, the cost of solving the problem with a control automated deduction system, or, in some domains (e.g., planning), the size or length of a human-generated solution.
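As a concrete illustration of the procedure built around Equations 2 and 3, the following sketch fits log(t) against δ by simple linear regression to recover estimates of b and c. The datapoints are fabricated for illustration only; in practice t would be, e.g., the number of nodes expanded, and δ the chosen difficulty estimator.

    # A minimal sketch of estimating b and c in t = c * b**delta via
    # Equation 3.  The datapoints are fabricated for illustration.
    import math

    # (delta, t) pairs: problem difficulty vs. observed cost (hypothetical).
    data = [(1, 12), (2, 25), (3, 48), (4, 95), (5, 210)]

    xs = [d for d, _ in data]
    ys = [math.log(t) for _, t in data]
    n = len(data)

    # Least-squares fit of ys = slope * xs + intercept (Equation 3).
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    intercept = my - slope * mx

    b, c = math.exp(slope), math.exp(intercept)
    print("b = %.2f, c = %.2f" % (b, c))
    # A smaller fitted b means a systematically faster solver on
    # sufficiently hard problems, per the discussion of Equation 3.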
Success and failure caches affect the search at OR-node choice points. In their simplest forms, they serve to prune the search space rooted at the current subgoal. Success caches act as extra database facts, grounding the search process, while failure caches censor a search which is already known to be fruitless. Either way, they serve as effective speedup techniques by dynamically injecting bias into the search, altering the set of problems which are solvable within a given resource bound.

Traditionally, cache implementations for inference engines have allowed an unlimited number of cache entries. Unfortunately, as the cache grows, the cost of expanding a node also increases. However, since each problem solved by a traditional inference engine is considered independently, clearing the cache between problems is sufficient to avoid excessive cache overheads: such a cache can never contain more entries than the total number of nodes actually searched. If, as is our intent, the system attempts to solve many related problems in succession, clearing the cache flushes valuable information: precisely the information we wish to exploit. On the other hand, allowing the cache to grow without limit also causes cache overhead to grow monotonically, eventually outweighing any possible advantage of caching.

In Segre and Scharstein (1991), we explore the benefits of bounded-overhead caches; that is, caches which require at most a fixed amount of space and entail a fixed amount of overhead per lookup. In our implementation, success and failure entries coexist in a single, fixed-size cache. At each OR-node choice point, the inference engine first checks the cache for a matching success entry. If one is found, the subgoal is considered solved. If no matching entry is found, the inference engine checks for a failure entry. If it finds one with a sufficiently large resource limit, the subgoal is considered unsolvable, and the inference engine is forced to backtrack. If neither type of cache hit occurs, the inference engine proceeds to try proving the subgoal normally. When finished, it inserts a new entry into the cache: a success entry if the subgoal is solved, and a failure entry if no proof is found within the current resource bounds.

Once the cache is full, adding a new entry entails deleting an existing one. A cache-management policy is used to decide which existing entry should be replaced. Cache-management policies are nothing more than heuristics which assign relative importance to cache entries. Simple replacement policies such as first-in-first-out (FIFO), least-recently-used (LRU), and least-frequently-used (LFU) are suggested by analogy with paged memory systems. These cache-management strategies exploit knowledge about memory access patterns. For paged memory systems, empirical
studies of memory traces have shown that both programs and data exhibit locality of reference; that is, access patterns tend to cluster in locally-constrained areas of memory. In automated deduction, one might expect iterative deepening to exhibit some property which can serve in place of locality of reference; an analytic understanding of this property would certainly aid in designing high-performance management policies for automated deduction caches.9 For now, we continue to rely on simple policies such as LRU while actively studying the problem of designing high-performance cache management policies for iterative deepening.

Using a fixed-size cache limits the overhead associated with a caching scheme. More importantly, it permits us to apply information acquired in the course of solving one problem to subsequent problems. Unfortunately, even a bounded-overhead cache may adversely affect performance. To see why this is so, consider the interaction of a simple success cache with the inference engine's backtracking behavior. When forced to backtrack over a subgoal that has matched a success cache entry, the inference engine will necessarily consider all alternative paths at that choice point. Since cache entries represent deductively entailed — and therefore redundant — information, some of the alternate paths considered at this choice point are subsumed by the matching cache entry just found wanting. Thus the inference engine will waste time exploring some alternate paths which are known a priori to be fruitless. By increasing the branching factor with redundant choice points, unsuccessful cache entries may actually cause an inflated number of nodes to be searched.10 We can avoid this problem in a general sense by restricting the applicability of cache entries and changing the backtracking behavior of the
9 Traditional cache-management policies typically assume that the cost of replacing an entry is independent of the identity of the entry being replaced. Cache-management policies such as LRU, LFU and FIFO rely at least implicitly on this assumption; a decision to replace a cache element is made based only on its past usefulness rather than any notion of its cost. More recent work on caching systems for shared-memory non-uniform memory access (NUMA) machines must also take into account the differing latencies of local vs. remote data items. NUMA systems generally need only worry about two possible costs: cheaper local access as opposed to more expensive remote access. For inference engines, of course, success and failure cache entries are not of uniform cost and, unlike the binary NUMA model, may be arbitrarily expensive.
10 This problem is related to the utility problem found in speedup learning systems (Minton, 1990b).
inference engine at cache hits (Elkan, 1989). By permitting success cache hits only where the candidate cache entry is at least as general as the current subgoal, we can ignore alternative choice points when backtracking over a cache hit. Failure cache hits are restricted to situations where the cache entry is at least as general as the subgoal, and, as usual, the current resource limit is dominated by the resource limit associated with the cache entry. These cache hit generality constraints prevent a cache hit from binding variables in the current search context, obviating the need to consider any alternate search paths that may exist at this subgoal. Once a cache hit occurs, the entire search space rooted at that subgoal is effectively pruned and needn't be explored upon backtracking. Thus imposing cache hit generality constraints — while producing less frequent cache hits — avoids adverse search effects altogether.11

11 An unfortunate side effect of imposing generality constraints is the problem of introducing duplicate cache entries. Duplicate entries, if handled consistently, should eventually be deleted by any reasonable cache-management strategy.
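The cache discipline described above can be summarized in a few lines of code. The following is a minimal Python sketch under our own naming assumptions; it is not the authors' implementation, and it elides subgoal matching and the generality test, which in a real inference engine involve unification and subsumption checks.

from collections import OrderedDict

class BoundedCache:
    """Fixed-size success/failure cache with an LRU replacement policy.
    Success entries mark a subgoal as solved; failure entries record the
    resource limit under which the subgoal was shown unsolvable."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()  # subgoal -> ("success", None) | ("failure", limit)

    def lookup(self, subgoal, current_limit):
        entry = self.entries.get(subgoal)
        if entry is None:
            return None                        # miss: prove the subgoal normally
        self.entries.move_to_end(subgoal)      # refresh LRU position
        kind, limit = entry
        if kind == "success":
            return "solved"
        if limit >= current_limit:             # cached failure dominates current limit
            return "unsolvable"                # force backtracking
        return None

    def insert(self, subgoal, solved, limit):
        if subgoal not in self.entries and len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)   # evict the least-recently used entry
        self.entries[subgoal] = ("success", None) if solved else ("failure", limit)
        self.entries.move_to_end(subgoal)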
We next empirically measure the performance of our caching system, contrasting various caching strategies and configurations. The domain theory and problem set used for this test are shown in the Appendix. They consist of 26 problems drawn from a simple situation-calculus formulation of a classic AI block-stacking world (Sussman, 1973). In order to use the basic experimental procedure outlined in Section 2, we must first decide which observable measures of t and δ should be used for this experiment. Our general strategy is to use yet a third inference engine as a control system; in this case, a simple breadth-first search system. In this fashion, we can use some measurable parameter of the control system as δ and the same measurable parameter of the tested systems as t. We would like our results to be independent of a particular inference engine implementation; therefore, as suggested in Segre, Elkan, Gordon and Russell (1991), we use the number of nodes expanded e as a measure of performance in place of CPU time t. We justify this simplification by assuming that c is invariant over all tested systems. By assuming t ≈ ec, and cancelling the (constant) factor c from both sides of Equation 2, we concentrate on the implementation-independent performance effects (i.e., the reduction of search space size) without folding in aspects of the cache implementation. We must be careful, however, not to make comparisons of experimentally-obtained b values among systems displaying different cache overheads.

For the independent variable we make a similar choice, using the number of nodes expanded by a control inference engine, log(ebfs), as the experimental measure of problem difficulty δ. Now Equation 3 can be simplified, yielding a new regression model:

log(e) = log(b) log(ebfs).   (4)
Thus from a collection of datapoints of the form (ebfs, e) obtained by solving a set of test problems, we can obtain the regression slope log(b) using a one-parameter simple linear regression and the regression model of Equation 4. As with the two-parameter model, obtaining a lower regression slope in the one-parameter model corresponds to greater empirically-measured speed, provided the two systems being compared display comparable node expansion costs c.

We first measure the performance of the non-caching inference engine against the breadth-first search control system. Figure 2 shows the results of this first trial. The computed regression slope log(b) = 1.026 indicates that depth-first iterative deepening explores relatively more nodes than the control breadth-first search inference engine, which would yield a slope of exactly log(b) = 1 when measured against itself. Note that our method does not support drawing conclusions regarding the relative performance of these two systems, since their respective node expansion costs can not realistically be considered equal. We do note, however, that the observed increase in relative number of nodes explored is expected, given that the iterative-deepening system must reexplore all nodes in levels 1 through n−1 in order to search the nth level.12

We next measure the performance of our inference engine on the same 26 randomly-ordered block stacking problems with caching enabled. We are particularly interested in seeing the effects of cache size on the size of the search space that must be searched, as well as the independent effects of success and failure caching. We perform four separate trials, each using the same unit-increment depth-first iterative deepening system.
12 An increment value of 1 may well produce the worst-case performance for iterative deepening. Depending on the problem population, increasing the increment value may substantially reduce the computed regression slope. The performance advantage of depth-first iterative deepening lies in the fact that storage requirements are reduced in comparison to breadth-first search, thus leading in practice to a lower per-node search cost. Note that the reduced per-node search cost is not evident when using e as an approximation of t, and is heavily dependent on the specifics of the implementation.
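The analysis step itself is small. Here is a hypothetical Python sketch of the one-parameter regression of Equation 4 (least squares through the origin in log-log space); the inputs would be the (ebfs, e) pairs gathered over the problem set.

import numpy as np

def regression_slope(e_bfs, e):
    """Fit log(e) = log(b) * log(e_bfs); return (slope, standard error)."""
    x = np.log(np.asarray(e_bfs, dtype=float))
    y = np.log(np.asarray(e, dtype=float))
    slope = np.sum(x * y) / np.sum(x * x)      # one-parameter least squares
    residuals = y - slope * x
    sigma2 = np.sum(residuals ** 2) / (len(x) - 1)
    return slope, np.sqrt(sigma2 / np.sum(x * x))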
Figure 2: Measured performance of a depth-first iterative-deepening definite-clause inference engine on 26 problems drawn from an AI blocks-stacking world. The initial depth limit is 1, and the inference engine applies a unit-increment iterative-deepening strategy. The computed regression slope of 1.026 implies this particular inference engine explores more nodes than the breadth-first search control system.

The first time through the 26 problems, we enable caching but do not impose cache size limits. While the unlimited-size caching system — like the breadth-first search control system — is not of practical interest due to its unfavorable cache overhead and space requirements, it does serve to quantify the maximum beneficial search effect attainable with caching. Next, we perform three more sets of nine trials, where each trial uses a different size cache. All trials rely on a least-recently used cache management policy. The first set of trials caches both success and failure entries, while the second and third sets perform success-only and failure-only caching, respectively. The results are shown in Figure 3. Note that for a given cache size, all three caching systems should display identical node expansion costs c, and are therefore directly comparable. We can reliably conclude that a mixed success/failure caching system is faster than either a success-only or failure-only caching system for all tested cache sizes. This confirms the preliminary
Regression Slope (Performance) vs. Cache Size
Figure 3: Search performance of success-and-failure caching, success-only caching, and failure-only caching systems using an LRU replacement policy. The horizontal lines correspond to the performance of a non-caching system (log(b) = 1.026) and an infinite-size caching system (log(b) = 0.850), and are provided only for comparison.

results reported elsewhere for infinite-size caches (Elkan, 1989), and is somewhat surprising, since — given the cache hit generality constraints imposed — one might expect success caching to have little or no positive effect on performance. Intuitively, since most subgoals fail, failure caching would seem much more likely to have a positive performance effect, especially in situation-calculus domains. Instead, while the relative frequency of success and failure cache entries may differ, their effects on system performance seem equally important.

EXPLANATION-BASED LEARNING

In this section, we describe our framework for devising domain-independent explanation-based learning algorithms. Our work on the EBL* family of explanation-based learning algorithms hinges on a novel formal perspective of EBL as a heuristic search through the space of transformed explanations (Elkan & Segre, 1989; Segre & Elkan, 1990). We prove there that the EBL* family of algorithms is complete in the sense that every valid
rule extractable from an explanation can be extracted by some member of the EBL* family. By virtue of this completeness property, EBL* algorithms subsume all possible EBL algorithms.

In keeping with most previous domain-independent characterizations of EBL (Hirsh, 1987; Mitchell, Keller & Kedar-Cabelli, 1986; Mooney & Bennett, 1986), we adopt the first-order definite-clause knowledge representation used by our inference engine for our EBL* family of algorithms. In general, an EBL algorithm takes an explanation — in this case, a derivation or proof — and produces new information that serves to change the behavior of the inference engine on future queries. In some EBL systems (e.g., ARMS, Segre, 1988, GENESIS, Mooney, 1990, and BAGGER, Shavlik, 1990), this bias takes the form of acquired problem space macro-operators (after STRIPS, Fikes, Hart & Nilsson, 1972) that alter the search space by compressing generalizations of previously useful subproofs into more efficiently applicable proof idioms. Thus EBL is essentially adding redundant operators which, when integrated with the existing operators, bias the exploration of the search space. Acquired macro-operators may often lead to quick solutions; however, in other circumstances a macro-operator may delay the discovery of a goal node. Other EBL systems (e.g., LEX2, Mitchell, Utgoff & Banerji, 1983, and PRODIGY, Minton, 1990a) represent acquired bias as explicit search-control heuristics for existing problem space operators.13 These heuristics typically alter the ordering of alternative choices by promoting more promising operators so that they are attempted first. Some heuristics may reject certain operators outright, while others may select a particular operator as especially suitable in the current situation (to the detriment of all other operators). As in the macro-operator systems, while the heuristics should contribute to a quicker solution of a future goal, the cost of evaluating them may instead conspire to slow down the search.

For the purpose of this paper, we will assume that an EBL algorithm takes a proof as its input and produces a new, generalized, domain-theory macro-operator as its output.14 Traditional EBL algorithms (Kedar-Cabelli & McCarty, 1987; Mitchell, Keller & Kedar-Cabelli, 1986; Mooney & Bennett, 1986) extract new, more general, rules which can be used to help prove

13 We use the term search-control heuristic to describe all learned search-control knowledge, regardless of whether or not it is deductively sound.

14 It is possible to reformulate a search-control heuristic learning system into a macro-operator learning system (Segre & Elkan, 1990).
similar — although perhaps not identical — queries. Different algorithms differ in exactly which macro-operator is constructed from a given proof. More precisely, it is possible to operate on the proof (e.g., prune portions of the derivation tree) so as not to compromise its validity; by using different pruning strategies, different EBL algorithms will produce different macro-operators.

Like traditional EBL algorithms, EBL* starts with a proof obtained from the inference engine, transforms the derivation and chunks it to obtain a new rule. Unlike traditional EBL algorithms, algorithms in the EBL* family are defined by a repertoire of basic operations for transforming proofs. This repertoire is provably complete in the sense that any rule that can be extracted from a valid transformed proof tree can also be extracted by some sequence of EBL* operators. The EBL* repertoire consists of five operations for transforming explanations.

Operator 1 (Specialization) Compose the substitution labeling each node with a given new primitive substitution.

Operator 2 (Generalization) In the substitution labeling each node, delete a given primitive substitution.

Operator 3 (Match Edge Pruning) Delete a consequent node and the subtree rooted there. In the labels of nodes in the remaining explanation, delete the primitive substitutions derived from the deleted subtree.

Operator 4 (Match Edge Grafting) Create a match edge linking a leaf subgoal node with the root of a new explanation tree. In the original explanation, compose the substitutions labeling each ancestor of the old leaf subgoal node with the substitution derived from the added subtree.

Operator 5 (Rule Edge Pruning) Delete a subgoal node and the subtree rooted there. Do not delete the substitution derived from the deleted subtree.
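To fix intuitions, here is one possible rendering of an explanation tree and the last of these operators in Python. The representation (goal strings, substitution dictionaries) is our own illustrative choice, not the chapter's data structure.

from dataclasses import dataclass, field

@dataclass
class Node:
    goal: str                                         # subgoal literal, e.g. "H(?x,?y)"
    substitution: dict = field(default_factory=dict)  # bindings labeling this node
    children: list = field(default_factory=list)      # supporting subproofs

def rule_edge_prune(parent, child):
    """Operator 5 (rule edge pruning): delete `child` and the subtree
    rooted there, but preserve the substitution the subtree derived by
    folding it into the parent's label."""
    parent.children.remove(child)
    parent.substitution.update(child.substitution)
    return parent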
We claim that any EBL algorithm past or present that transforms an explanation in some validity-preserving way before extracting a new rule or fact from the transformed explanation can be encoded as a series of EBL* operations. This implies that the critical difficulty lies in deciding exactly how to sequence the application of EBL* operators. In other words, while the theoretical merit of EBL* depends on the completeness of its repertoire, its practical usefulness depends on knowing how to order particular transformations of an explanation in order to learn operational rules. Traditional EBL algorithms can thus be seen as instances of EBL* algorithms that apply only the first three EBL* operators in a fixed, standard, sequence. Instead, we propose the use of explicitly stated search-control heuristics to guide the explanation-transformation process. The purpose of these heuristics is to restrict exploration of the space of possible explanations and encourage the extraction of operational rules. Here are five general (domain-independent) heuristics that can be used to construct a particular EBL* algorithm.

Heuristic 0 Consequent leaf nodes represent domain theory facts, and should be trimmed from the tree (Operator 3).

Heuristic 1 A subgoal not involving any variable mentioned in the root goal of the proof tree should be deleted along with its subproof, preserving the bindings induced (Operator 5). Intuitively, such a subgoal provides background information that can be compiled into the rule to be learned.

Heuristic 2 A subgoal possessing an answer substitution subsuming all of its other answer substitutions should be deleted along with its subproof, preserving the bindings induced (Operator 5). There is no need ever to prove the same subgoal again, so it can also be compiled into the acquired rule (similar to partial evaluation, Van Harmelen & Bundy, 1988).

Heuristic 3 Chains of unary-subgoal rules (e.g., S(?x,?y) → P(?x,?y)) should be deleted (Operator 3). A rule of this form often expresses a taxonomic isa relationship, and chains of isa deductions should
not be compiled into the acquired rule.15

Heuristic 4 If the same subgoal appears repeatedly in an explanation, then the different appearances of the subgoal should be kept as one antecedent in the learned rule (Operator 5). Later proofs using the learned rule will then avoid proving the same subgoal more than once.

These heuristics are given as an example of how to go about devising an EBL* transformation strategy. For the most part, these heuristics focus on the application of the fifth EBL* operator, and thus might be seen as ways of augmenting a traditional EBL algorithm as emulated by EBL*. With the exception of Heuristic 2, all of these heuristics are easily implemented.16

An example will help to make this clear. Consider once again the derivation of Figure 1. A traditional EBL algorithm would transform the original derivation by first pruning the derivation according to its operationality criterion (Mitchell, Keller & Kedar-Cabelli, 1986). For example, the algorithm might choose simply to prune the consequent nodes corresponding to facts in the domain theory since they are by definition operational (Operator 3). Next, each rule instance is replaced by the corresponding original rule from the domain theory (Operator 1), and, finally, the constraints represented by match edges are reestablished to make the transformed derivation valid once again (Operator 2). The new macro-operator produced from the transformed proof in Figure 4 is:

S(?x) ← H(?x,?y) ∧ J(?y) ∧ J(?z) ∧ K(?z) ∧ H(?z,?u) ∧ H(?u,?w) ∧ K(?w).   (5)
In contrast, we can devise an EBL* algorithm based on the heuristics just described that produces the new, more succinct — yet equally valid — macro-operator:

S(?x) ← H(?x,?y).   (6)
15 Mooney and Bennett also propose a similar pruning of isa relations (Mooney & Bennett, 1986).

16 A complete implementation of Heuristic 2 entails enumerating all possible proofs for a subgoal; generally a very bad idea. In Segre and Elkan (1990) we introduce two efficient strong approximations of Heuristic 2 that do not entail enumerating all possible proofs of subgoals. These approximations correspond to detecting unique subproofs as well as universally-true subproofs in a resource-bounded fashion. Both unique subproofs and universally-true subproofs represent special cases of Heuristic 2 which are easily implemented in our inference engine.
Figure 4: Transformed proof of S(A) after a prototypical traditional EBL algorithm is applied.

This particular EBL* algorithm operates by applying Heuristic 1 everywhere it is applicable (short of applying it recursively). Next, Heuristic 0 is used to prune remaining consequent nodes corresponding to domain theory facts, and Heuristic 3 is used to prune chains of unary-subgoal rules. Both approximations of Heuristic 2 as well as Heuristic 4 are applied until the derivation reaches quiescence (i.e., no further changes can be made to the proof). The resulting proof is shown in Figure 5. We next turn our attention to comparing the performance of the EBL* algorithm just described with the performance of a traditional EBL algorithm which does not alter the proof before extracting a new macro-operator.
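The phrase "applied until the derivation reaches quiescence" suggests a simple control loop. A hypothetical Python sketch, assuming each heuristic is a function that returns a transformed proof (or the proof unchanged when inapplicable) and that the heuristics only prune, so the loop must terminate:

def ebl_star_transform(proof, heuristics):
    """Apply transformation heuristics until a full pass leaves the
    proof unchanged, then return the transformed proof."""
    while True:
        changed = False
        for heuristic in heuristics:
            transformed = heuristic(proof)
            if transformed != proof:
                proof, changed = transformed, True
        if not changed:
            return proof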
Figure 5: Transformed proof of S(A) after the EBL* algorithm of Section 5 has been applied.

As before, we will use the same randomly-ordered set of 26 block-stacking problems from the Appendix. Our experimental procedure is to randomly select 2 training problems from the original 26 problems. The system augments its domain theory by learning from the training problems and then tests the augmented theory on the remaining 24 problems. The test problems are measured and analyzed in the same fashion as the data collected in Section 2. The results are shown in Figure 6. Figure 6a represents the performance of the traditional EBL system, while Figure 6b shows the performance of the EBL* system. The regression parameters and standard error estimates obtained:

log(e) = (1.069 ± 0.053) log(ebfs)   (7a)

and:

log(e) = (0.982 ± 0.020) log(ebfs)   (7b)

indicate that the EBL* system (Equation 7b) is significantly faster than the traditional EBL system (Equation 7a).
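The chapter does not spell out the significance test behind this comparison; one plausible reading, given that each slope comes with a standard error, is a two-sample z-test on the fitted slopes, sketched here with illustrative names:

def slopes_differ(b1, se1, b2, se2, z_crit=1.96):
    """Return True if two regression slopes differ significantly,
    treating the estimates as independent and approximately normal."""
    z = (b1 - b2) / (se1 ** 2 + se2 ** 2) ** 0.5
    return abs(z) > z_crit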
Traditional EBL

Figure 6a: Performance of a traditional EBL algorithm after learning from 2 problems on the remaining set of 24 problems drawn from the AI blocks-stacking world. The inference engine is performing unit-increment depth-first iterative deepening.

Inspection of the macro-operators acquired shows that the EBL* system acquires somewhat more specific macro-operators from the training problems. The loss of generality does not appear to hinder the macro-operators' usefulness, but does seem to limit their application in less appropriate situations. We see clear evidence of these effects in Figure 6. The effect of this kind of speedup learning is the spreading of the datapoints away from the regression line; problems helped by learning will drop, while problems hindered by learning will rise (recall each datapoint's x-coordinate value ebfs is fixed a priori). If enough problems are helped, then the overall slope of the regression line will decrease, and we can conclude that the problem solver is, in the limit, faster after learning. Conversely, if the problems hindered by learning outweigh the problems helped, the regression slope will rise and we can conclude the problem solver is in the limit slower after learning. Careful examination of the plots shown in Figures 6a and 6b illustrates that some problems are in fact hindered more by the EBL macro-operators than the EBL* macro-operators. In fact, the more general EBL macro-operators succeed in hindering the solution of two problems which were instead helped by the EBL* macro-operators.
Figure 6b: Performance of an EBL* algorithm after learning from 2 problems on the remaining set of 24 problems drawn from the AI blocks-stacking world. The inference engine is performing unit-increment depth-first iterative deepening.

Can we say something about the utility of EBL or EBL* when compared to a non-learning system? Not from this experiment. The assumption that the node expansion cost c is uniform across all three systems does not hold. While the two learning systems can be expected to have roughly comparable c parameters (each learning system acquires exactly two macro-operators), the non-learning system will not. The best we can do is observe that the learning systems search smaller spaces than the non-learning system for this particular distribution of queries: determining whether this reduction in the end corresponds to faster (or slower) performance is necessarily implementation-dependent. On the other hand, our conclusions relating the two learning systems are much stronger: EBL* clearly outperforms EBL for this particular training set and query distribution.

COMBINING TECHNIQUES

Up to this point, we have examined individual speedup learning techniques. It is our belief that the combined effect of multiple speedup techniques will exceed the beneficial effects due to the individual techniques.
Two distinct types of synergy can arise between different speedup techniques. The first is a natural synergy, where simply composing the two techniques is sufficient. We've already observed one example of natural synergy: the combined success-and-failure caching system of Figure 3 significantly outperformed success-only and failure-only caching systems of identical cache size and cache overhead. Another example of natural synergy occurs between EBL and failure caching. It is well understood that the macro-operators added by EBL constitute redundant paths in the search space. While these redundant paths may serve to shortcut solutions to certain problems, they may increase the cost of solution for other problems, sometimes even pushing solutions outside the resource-limited deductive closure DR (Minton, 1990b). As for unsolvable problems (i.e., those problems whose solution lies outside D altogether), the cost of completely exhausting the search space can only increase with EBL. While this is not a cause for concern — since EBL only really makes sense for resource-limited problem solvers — the use of failure caching can nonetheless reduce this effect. To see how this is so, consider two alternate paths in the search space representing the same solution to a given subgoal. One path represents a learned macro-operator, while the other path represents the original domain theory. To determine that a similar but unsolvable subgoal is a failure, an EBL-only system would have to search both subtrees. However, an EBL system with failure caching need not necessarily search the second subtree.

A second type of synergy arises by design. For example, a bounded-overhead caching system requires certain information about cache behavior in order to apply a cache management policy. This information could also be exploited by other speedup techniques (e.g., by an EBL* pruning heuristic); since the information is already being maintained, there should be no additional overhead associated with using this information. It is precisely this type of synergy by design which we hope will provide the greatest advantage of adaptive inference systems.

In this section, we present some empirical findings — again based on the 26 randomly-ordered blocks world problems of the Appendix — that illustrate the natural synergy between EBL* and caching. We wish to compare a non-learning system with the caching system and the EBL* system tested earlier as well as with a system that performs both EBL* and caching. Unfortunately, these four systems do not exhibit uniform node expansion cost c. However, in the interests of simplifying the analysis, we again assume that the node expansion cost c is uniform across all four systems and limit our comparisons to the changes in search space size entailed by the different speedup techniques. While reductions in the size of
the search space generally entail a performance improvement, the magnitude of the improvement depends heavily on the details of the implementation (here represented by the exact relation between the various c parameters).

Figure 7 presents the results obtained when applying both EBL* and caching to the same 24 situation-calculus blocks world problems. Our experimental procedure is to use the same 2 training problems used in Section 5. The system augments its domain theory by learning from the training problems and then tests the augmented theory on the remaining 24 problems. Unlike Section 5, however, performance is measured on the test problems with caching enabled. A cache size of 45 elements was used. The regression parameters obtained (Equation 8a) can be compared directly to the regression parameters obtained for the non-caching EBL* system (Equation 8b):

log(e) = (0.865 ± 0.019) log(ebfs)   (8a)

log(e) = (0.982 ± 0.020) log(ebfs)   (8b)

We can also compare these regression parameters to the regression parameters for a non-learning system (Equation 8c) and a 45-element LRU caching system (Equation 8d):17

log(e) = (1.026 ± 0.004) log(ebfs)   (8c)

log(e) = (0.902 ± 0.007) log(ebfs)   (8d)
We can draw several preliminary conclusions from these results. First, both the EBL*-only and caching-only systems search significantly fewer nodes than the non-learning system. Second, the EBL*-plus-caching system searches significantly fewer nodes than any of the other systems. As discussed previously, we cannot conclude outright that the EBL*-plus-caching system is necessarily faster than the other systems; however, inasmuch as this domain theory and problem set are representative of search problems as a whole, these results do lend credence to the view that several types of speedup learning can advantageously be composed. Among the factors governing whether or not an improvement in final performance emerges are specifics of the implementation.
17 Note that the parameters obtained for Equations 8c and 8d were computed using only the 24 datapoints corresponding to the learning system's test set. Nevertheless, they are essentially the same as the values shown in Figures 2 and 3, respectively, which were computed on the entire 26-problem situation-calculus problem set.
EBL* and LRU Caching
Figure 7: Performance of an EBL* algorithm after learning from 2 problems on the remaining set of 24 problems drawn from the AI blocks-stacking world. The inference engine is performing unit-increment depth-first iterative deepening, and LRU caching is enabled with a cache size of 45.

Another factor which was not controlled for in the experiment just reported is the selection of the training set for the learning systems. It is clear that the overall performance of a learning system is critically dependent on which problems are used in constructing new macro-operators. While it is usually safe to compare the performance of two similar learning systems without controlling for differing training sets (as we did in Section 5), this procedure will generally not produce a meaningful comparison with non-learning systems. Therefore, we now repeat the preliminary experiment just described, altering the experimental procedure slightly to reduce the reliance on training set composition. In this experiment, we perform 20 passes over the 26 problems, each time randomly selecting two problems as training examples and measuring performance of the original domain theory plus the two new macro-operators on the remaining 24 problems. By considering the performance of the learning systems over all passes, we control for training set composition. We
repeat the 20 passes twice, once for the EBL*-only system and once for the EBL*-plus-caching system. We then compare the results obtained with the datapoints obtained for the non-learning system and the cache-only system.

The EBL*-only system solved all 24 problems within a resource limit of 600,000 nodes searched on only 11 of the 20 passes. On the 9 remaining passes, some of the problems were not solved within the resource bound. For the 9 incomplete passes, we make optimistic estimates of search space explored by treating unsolved problems as if they were solved after exploring the entire resource limit. When analyzed individually, the regression slopes for the 11 complete passes ranged from a low of log(b) = 0.745 ± 0.061 to a high of log(b) = 1.039 ± 0.051 (for the 9 incomplete passes, these ranged from log(b) = 0.774 ± 0.071 to log(b) = 1.334 ± 0.096). Ten of 11 complete passes searched significantly fewer nodes than the non-learning system of Figure 2, while only 2 of 9 incomplete passes seemed to do so, even considering that these are optimistic estimates of performance (note that the use of optimistic performance estimates does not affect qualitative conclusions). A somewhat more useful analysis is shown in Figure 8; all 480 datapoints obtained (20 passes over 24 problems with unsolved problems charged the entire resource limit) are plotted together. The computed regression slope and standard error for the collected trials, which represents the average expected search performance over the entire problem distribution, is log(b) = 1.062 ± 0.019. This represents significantly slower performance than that of the non-learning system.18 Our optimistic estimate of overall search performance for the EBL*-only system factors out which problems are selected for training, and supports the conclusion that using this particular EBL* algorithm is not a good idea unless one has some additional information to help select training problems.

A similar procedure is now used to measure the performance of the EBL*-plus-caching system. Each pass in this trial used the same randomly selected training problems as in the last trial. For the combined system, all 24 problems in the test set were solved within the resource bound on each and every pass.19 Here, the individually analyzed regression slopes ranged

18 Note that the regression slope computed for the non-learning system does not change even if the data is repeated 20 times; only the standard error decreases. Thus we can compare the slope of this learning system directly to the slope of the non-learning system (log(b) = 1.026) from Figure 2.

19 In fact, the resource bound used for this experiment was selected to meet this condition.
Depth-First Iterative Deepening with EBL* (20 passes): log(e) = (1.062 ± 0.019) log(ebfs)
Figure 8: Search performance of an iterative-deepening inference engine using EBL* on two randomly selected problems on the remaining 24 situation-calculus problems of the Appendix. Repeated 20 times for a total of 480 datapoints, many of which may be represented by a single point on the plot. Unsolved problems are charged 600,000 nodes, the entire resource limit.

from a low of log(b) = 0.667 ± 0.051 to a high of log(b) = 1.245 ± 0.054. Sixteen of twenty passes performed less search than the base system of Figure 2. The combined 480 datapoints are shown in Figure 9; the computed regression slope and standard error are log(b) = 0.897 ± 0.014. There are several conclusions one can draw from these results:

(1) The EBL*-plus-caching system demonstrates better performance than the EBL*-only system, independent of training set selection. Note that the optimistic estimate of performance used for the EBL*-only system does not affect this (qualitative) conclusion, but rather only the magnitude of performance advantage observed.

(2) The EBL*-plus-caching system demonstrates better performance than both the non-learning and cache-only systems. This performance advantage is roughly independent of training set selection. Naturally, better training sets imply better performance; but on the average, the
Figure 9: Search performance of an iterative-deepening inference engine with a 45-element LRU cache and using EBL* on two randomly selected problems on the remaining 24 situation-calculus problems of the Appendix. Repeated 20 times for a total of 480 datapoints.

advantages of learning outweigh the disadvantages regardless of the precise training set composition.

(3) The relative performance of the EBL*-only system with respect to the non-learning or cache-only system is critically dependent on the composition of the training set. In those situations where better training sets are selected, performance is potentially better than that of either a non-learning or cache-only system.
In summary, independent of which problems are selected for learning, the use of EBL* and a fixed-size LRU caching system will search significantly fewer nodes than any of the other systems tested previously.20
20 Note that these conclusions are independent of training set composition but not training set size. The size of the training set was fixed a priori on the basis of the number of problems available overall. Additional experiments with differing training set sizes would have to be performed to determine the best training set size for this particular query distribution.
DISCUSSION AND CONCLUSION

The main point of this paper is that multiple speedup techniques, when brought to bear in concert against problems drawn from a fixed (possibly unknown) problem distribution, can provide better performance than any single speedup technique. While the results just presented are certainly encouraging, there is still much room for improvement. We are pursuing our research on adaptive inference along several different lines.

First, we are investigating additional speedup learning techniques with the intent to incorporate them in our adaptive inference framework. In particular, we are studying fast antecedent reordering strategies and the automatic construction of approximate abstraction hierarchies (Russell, Segre & Camesano, 1992). Given a predetermined search strategy (e.g., depth-first, breadth-first, etc.), the computation time required to find a proof for a given query relies on the order of exploration of the implicit search space. This is a much-studied problem in the automated reasoning and logic programming communities (Smith & Genesereth, 1985; Barnett, 1984). Most previously proposed heuristics are necessarily ad hoc; our heuristics are derived from successive approximations of an analytic model of search. By adding successively more sweeping assumptions about the behavior of the search process, we have built successively more efficient heuristics for reordering the body of a domain theory clause.

Second, we are looking at how system performance may be improved by sharing information among the various speedup learning components. One example of this kind of sharing is using information maintained by the cache management strategy to support a dynamic abstraction hierarchy mechanism. Hierarchical planners generally achieve their computational advantage either by relying on a priori knowledge to construct appropriate hierarchies (Sacerdoti, 1974) or by automatically constructing hierarchies from syntactic cues in the domain theory (Knoblock, 1990). Unfortunately, neither of these approaches is very useful in practice. Our approach to this problem within the SEPIA planner framework is to use information maintained by the cache management strategy to decide which subgoals possess sufficiently high (or sufficiently low) probability of success to warrant being treated as explicit assumptions. Assumption subgoals are simply treated as true (or false) in order to produce an approximate plan very quickly. The assumptions are then verified using a secondary iterative-deepening strategy that relies on the inference engine's dynamic resource-reallocation scheme. If the appropriate assumptions are made, the cost of deriving a plan with assumptions plus the cost of verifying the assumptions is notably less than the cost of deriving the plan without using assumptions at all.
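The assumption mechanism just described can be caricatured in a few lines. In this hypothetical Python sketch, cache statistics are reduced to per-subgoal (hits, attempts) counts, and the thresholds are invented for illustration:

def assume_or_search(subgoal, cache_stats, hi=0.95, lo=0.05):
    """Decide whether a subgoal should be treated as an explicit
    assumption (true or false) or planned for normally; assumed
    subgoals are verified later by a secondary search."""
    hits, attempts = cache_stats.get(subgoal, (0, 0))
    if attempts == 0:
        return "search"             # no evidence: prove it normally
    p = hits / attempts
    if p >= hi:
        return "assume-true"
    if p <= lo:
        return "assume-false"
    return "search"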
Finally, we are beginning to look at the problem of revising incorrect or partially-specified domain theories. Speedup learning techniques are meant to use a complete and correct domain theory more efficiently. Clearly, in more realistic domains, we cannot assume that the original domain theory is complete and correct. Generally stated, the theory revision problem is the problem of revising inaccurate or incomplete domain theories on the basis of examples which expose these inaccuracies. There has been much recent research devoted to the theory revision problem for propositional domain theories (Cain, 1991; A. Ginsberg, 1988a, 1988b; A. Ginsberg, Weiss & Politakis, 1988; Ourston & Mooney, 1990; Towell & Shavlik, 1990); the first-order problem is substantially harder (Richards & Mooney, 1991). Nevertheless, the shared central idea in each of these projects is to find a revised domain theory which is at once consistent with the obtained examples and as faithful as possible to the original domain theory. Here, faithfulness is generally measured in syntactic terms, e.g., the smallest number of changes. We are working on a first-order theory revision algorithm which is both efficient and incremental (Feldman, Segre & Koppel, 1991a, 1991b; Feldman, Koppel & Segre, 1992). Our probabilistic theory revision algorithm is based on an underlying mathematical model and therefore exhibits a set of desirable characteristics not shared by other theory revision algorithms.

In this paper, we have presented our framework for adaptive inference, and we have briefly outlined some of the speedup techniques used in our system. We have also described a new experimental methodology for use in measuring the effects of speedup learning, and we have presented several exploratory evaluations of speedup techniques intended to guide the design of adaptive inference systems. We expect to integrate other learning techniques such as heuristic antecedent reordering, dynamic abstraction hierarchies, and our probabilistic first-order domain-theory revision system into the adaptive inference framework in order to produce a comprehensive, adaptive inference engine.

Acknowledgements

Thanks to Lisa Camesano, Ronen Feldman, Mark Ollis, Sujay Parekh, Doron Tal, Jennifer Turney, and Rodger Zanny for assisting in various portions of the research reported here. Thanks also to Debbie Smith for help in typesetting this manuscript.
References

Barnett, J. (1984). How Much is Control Knowledge Worth? A Primitive Example. Artificial Intelligence, 22, pp. 77-89.

Cain, T. (1991). The DUCTOR: A Theory Revision System for Propositional Domains. Proceedings of the Eighth International Machine Learning Workshop (pp. 485-489). Evanston, IL: Morgan Kaufmann Publishers.

Elkan, C., Segre, A. (1989). Not the Last Word on EBL Algorithms (Report No. 89-1010). Department of Computer Science, Cornell University, Ithaca, NY.

Elkan, C. (1989). Conspiracy Numbers and Caching for Searching And/Or Trees and Theorem-Proving. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 341-346). Detroit, MI: Morgan Kaufmann Publishers.

Elkan, C. (1990). Incremental, Approximate Planning. Proceedings of the National Conference on Artificial Intelligence (pp. 145-150). Boston, MA: MIT Press.

Feldman, R., Segre, A., Koppel, M. (1991a). Refinement of Approximate Rule Bases. Proceedings of the World Congress on Expert Systems. Orlando, FL: Pergamon Press.

Feldman, R., Segre, A., Koppel, M. (1991b). Incremental Refinement of Approximate Domain Theories. Proceedings of the Eighth International Machine Learning Workshop (pp. 500-504). Evanston, IL: Morgan Kaufmann Publishers.

Feldman, R., Koppel, M., Segre, A. (1992, March). A Bayesian Approach to Theory Revision. Workshop on Knowledge Assimilation. Symposium conducted at the AAAI Workshop, Palo Alto, CA.

Fikes, R., Hart, P., Nilsson, N. (1972). Learning and Executing Generalized Robot Plans. Artificial Intelligence, 3, pp. 251-288.
Ginsberg, A. (1988a). Knowledge-Base Reduction: A New Approach to Checking Knowledge Bases for Inconsistency and Redundancy. Proceedings of the National Conference on Artificial Intelligence (pp. 585-589). St. Paul, MN: Morgan Kaufmann Publishers.

Ginsberg, A. (1988b). Theory Revision via Prior Operationalization. Proceedings of the National Conference on Artificial Intelligence (pp. 590-595). St. Paul, MN: Morgan Kaufmann Publishers.

Ginsberg, A., Weiss, S., Politakis, P. (1988). Automatic Knowledge Base Refinement for Classification Systems. Artificial Intelligence, 35, 2, pp. 197-226.

Ginsberg, M., Harvey, W. (1990). Iterative Broadening. Proceedings of the National Conference on Artificial Intelligence (pp. 216-220). Boston, MA: MIT Press.

Hirsh, H. (1987). Explanation-based Generalization in a Logic-Programming Environment. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 221-227). Milan, Italy: Morgan Kaufmann Publishers.

Kedar-Cabelli, S., McCarty, L. (1987). Explanation-Based Generalization as Resolution Theorem Proving. Proceedings of the Fourth International Machine Learning Workshop (pp. 383-389). Irvine, CA: Morgan Kaufmann Publishers.

Knoblock, C. (1990). A Theory of Abstraction for Hierarchical Planning. In D.P. Benjamin (Ed.), Change of Representation and Inductive Bias (pp. 81-104). Hingham, MA: Kluwer Academic Publishers.

Korf, R. (1985). Depth-First Iterative Deepening: An Optimal Admissible Tree Search. Artificial Intelligence, 27, 1, pp. 97-109.

Minton, S. (1990a). Learning Search Control Knowledge. Hingham, MA: Kluwer Academic Publishers.

Minton, S. (1990b). Quantitative Results Concerning the Utility of Explanation-Based Learning. In J. Shavlik & T. Dietterich (Eds.), Readings in Machine Learning (pp. 573-587). San Mateo, CA: Morgan Kaufmann Publishers.
Mitchell, T., Utgoff, P., Banerji, R. (1983). Learning by Experimentation: Acquiring and Refining Problem-Solving Heuristics. In R. Michalski, J. Carbonell & T. Mitchell (Eds.), Machine Learning: An Artificial Intelligence Approach, Vol. 1 (pp. 163-190). San Mateo, CA: Morgan Kaufmann Publishers.

Mitchell, T., Keller, R., Kedar-Cabelli, S. (1986). Explanation-Based Generalization: A Unifying View. Machine Learning, 1, 1, pp. 47-80.

Mooney, R., Bennett, S. (1986). A Domain Independent Explanation-Based Generalizer. Proceedings of the National Conference on Artificial Intelligence (pp. 551-555). Philadelphia, PA: Morgan Kaufmann Publishers.

Mooney, R. (1990). A General Explanation-Based Learning Mechanism. San Mateo, CA: Morgan Kaufmann Publishers.

Ourston, D., Mooney, R. (1990). Changing the Rules: A Comprehensive Approach to Theory Refinement. Proceedings of the National Conference on Artificial Intelligence (pp. 815-820). Boston, MA: MIT Press.

Plaisted, D. (1988). Non-Horn Clause Logic Programming Without Contrapositives. Journal of Automated Reasoning, 4, pp. 287-325.

Prieditis, A., Mostow, J. (1987). PROLEARN: Towards a Prolog Interpreter that Learns. Proceedings of the National Conference on Artificial Intelligence (pp. 494-498). Seattle, WA: Morgan Kaufmann Publishers.

Richards, B., Mooney, R. (1991). First-Order Theory Revision. Proceedings of the Eighth International Machine Learning Workshop (pp. 447-451). Evanston, IL: Morgan Kaufmann Publishers.

Russell, A., Segre, A., Camesano, L. (1992). Effective Conjunct Reordering for Definite-Clause Theorem Proving. Manuscript in preparation.

Sacerdoti, E. (1974). Planning in a Hierarchy of Abstraction Spaces. Artificial Intelligence, 5, pp. 115-135.

Segre, A. (1987). Explanation-Based Learning of Generalized Robot Assembly Plans. Dissertation Abstracts International, AAD87-21756. (University Microfilms No. AAD87-21756.)
Segre, A. (1988). Machine Learning of Robot Assembly Plans. Hingham, MA: Kluwer Academic Publishers.

Segre, A., Elkan, C., Russell, A. (1990). On Valid and Invalid Methodologies for Experimental Evaluations of EBL (Report No. 90-1126). Ithaca, NY: Cornell University.

Segre, A., Elkan, C. (1990). A Provably Complete Family of EBL Algorithms. Manuscript submitted for publication.

Segre, A., Elkan, C., Gordon, G., Russell, A. (1991). A Robust Methodology for Experimental Evaluations of Speedup Learning. Manuscript submitted for publication.

Segre, A., Elkan, C., Russell, A. (1991). Technical Note: A Critical Look at Experimental Evaluations of EBL. Machine Learning, 6, 2, pp. 183-196.

Segre, A. (1991). Learning How to Plan. Robotics and Autonomous Systems, 8, 1-2, pp. 93-111.

Segre, A., Scharstein, D. (1991). Practical Caching for Definite-Clause Theorem Proving. Manuscript submitted for publication.

Segre, A., Turney, J. (1992a). Planning, Acting, and Learning in a Dynamic Domain. In S. Minton (Ed.), Machine Learning Methods for Planning and Scheduling. San Mateo, CA: Morgan Kaufmann Publishers.

Segre, A., Turney, J. (1992b). SEPIA: A Resource-Bounded Adaptive Agent. Artificial Intelligence Planning Systems: Proceedings of the First International Conference. College Park, MD: Morgan Kaufmann Publishers.

Shavlik, J. (1990). Extending Explanation-Based Learning. San Mateo, CA: Morgan Kaufmann Publishers.

Smith, D., Genesereth, M. (1985). Ordering Conjunctive Queries. Artificial Intelligence, 26, pp. 171-215.

Sussman, G. (1973). A Computational Model of Skill Acquisition (Report No. 297). Cambridge, MA: MIT Artificial Intelligence Laboratory.
Towell, G., Shavlik, J., Noordewier, M. (1990). Refinement of Approximate Domain Theories by Knowledge-Based Neural Networks. Proceedings of the National Conference on Artificial Intelligence (pp. 861-866). Boston, MA: MIT Press.

Turney, J., Segre, A. (1989a). A Framework for Learning in Planning Domains with Uncertainty (Report No. 89-1009). Ithaca, NY: Cornell University.

Turney, J., Segre, A. (1989b, March). SEPIA: An Experiment in Integrated Planning and Improvisation. Workshop on Planning and Search. Symposium conducted at the AAAI Workshop, Palo Alto, CA.

Van Harmelen, F., Bundy, A. (1988). Explanation-Based Generalisation = Partial Evaluation (Research Note). Artificial Intelligence, 36, 3, pp. 401-412.
Appendix

Blocks world domain theory and randomly-ordered problem set used for the experiments reported herein. The domain theory describes a world containing 4 blocks, A, B, C, and D, stacked in various configurations on a Table. It consists of 11 rules and 9 facts; there are 26 sample problems whose first solutions range in size from 4 to 77 nodes and vary in depth from 1 to 7 inferences deep.
Facts:

holds(on(A,Table),S0)
holds(on(B,Table),S0)
holds(on(C,D),S0)
holds(on(D,Table),S0)
holds(clear(A),S0)
holds(clear(B),S0)
holds(clear(C),S0)
holds(empty( ),S0)
holds(clear(Table),S0)

Rules:

holds(and(?x,?y),?s) ← holds(?x,?s) ∧ holds(?y,?s)
differ(?x,?y)
holds(on(?x,?y),do(putdown(?z,?w),?s)) ← holds(on(?x,?y),?s)
holds(clear(?z),do(putdown(?x,?y),?s)) ← holds(clear(?z),?s) ∧ differ(?z,?y)

Problem Set:

holds(and(on(C,B),on(B,A)),?s)
holds(on(C,Table),?s)
holds(and(on(B,D),on(A,C)),?s)
holds(and(clear(D),on(A,B)),?s)
holds(and(on(D,B),on(B,C)),?s)
holds(and(on(A,B),on(B,C)),?s)
holds(on(D,A),?s)
holds(and(on(D,B),on(C,A)),?s)
holds(clear(D),?s)
holds(and(on(C,D),on(A,C)),?s)
holds(and(on(A,B),on(B,D)),?s)
holds(and(on(D,A),on(A,C)),?s)
holds(and(on(B,A),on(C,B)),?s)
holds(and(on(C,A),on(D,B)),?s)
holds(on(A,D),?s)
holds(and(on(A,B),clear(D)),?s)
holds(and(on(A,D),on(D,B)),?s)
holds(and(on(A,C),on(C,B)),?s)
holds(and(on(B,C),on(A,B)),?s)
holds(on(A,C),?s)
holds(on(D,C),?s)
holds(and(on(B,D),on(C,A)),?s)
holds(and(on(A,C),on(C,D)),?s)
holds(and(on(A,B),on(D,C)),?s)
holds(and(on(D,C),on(C,B)),?s)
holds(and(on(C,B),on(D,C)),?s)
Chapter 3
On Integrating Machine Learning with Planning

Gerald F. DeJong
Melinda T. Gervasio
Scott W. Bennett

University of Illinois
Beckman Institute & Computer Science Department
405 N. Mathews Ave., Urbana, IL 61801
dejong@cs.uiuc.edu
[email protected]
bennett@cs.uiuc.edu

ABSTRACT

Domain-independent classical planning is faced with serious difficulties. Many of these are traceable to some facet of the frame problem. Perfect knowledge of the planner's world and operators is impossible for most domains. With anything less, small unmodeled errors can result in large discrepancies between the expected and observed worlds. Such discrepancies may interfere with the achievement of a goal. Reactivity provides an extreme response. It allows only sensor information after an action is taken to judge the action's effects, and abandons projection altogether. There must be a vast space of possibilities between the extremes of classical planning and reactivity. This paper describes two, called completable planning and permissive planning. Machine learning, in the form of Explanation-Based Learning, is used in completable planning to recognize deferrable goals, resulting in many of the benefits offered by reactivity but in a domain-independent form. In permissive planning, machine learning techniques are used to refine plans through experience so that they become less sensitive to the necessarily approximate knowledge.
The research reported in this paper was supported by the Office of Naval Research under grants N-00014-86-0309 and N-00014-91-J-1563.
INTRODUCTION

Increasingly it has become accepted that domain-independent classical planning, in which one finds a set of actions and constraints that provably achieves a goal when applied to a problem's initial state, is problematic. Chapman (1987) showed that with certain assumptions, the general classical planning problem can be prohibitively difficult or impossible. An approach known generally as reactivity attempts to circumvent the obstacles of classical planning [Agre and Chapman, 1987; Firby, 1987; Schoppers, 1987; Suchman, 1987]. Classical planning demands that the planner know and model the effects of its actions. For any sequence of actions a classical planner must be able to judge accurately their cumulative changes to a state. This ability to project states through operators may seem modest at first, but the failure of classical planning can be viewed as traceable to this task. A central tenet of pure reactivity is to do no projection. The "planner" (or better, the agent) makes no attempt to anticipate how the world will look after an action is executed. Instead, an action is selected based entirely upon the agent's sensor-supplied knowledge of the current world state. After the execution of an action, the world changes in some (possibly complex) way. The next action is selected in the same manner, based upon the updated sensor values. In a way, a purely reactive system employs the world itself to model the effects of actions. Mentally trying out a plan step and backtracking if the plan is inadequate, a cornerstone of classical planning, is not allowed. The method of selecting actions, and the demands such a mechanism places upon its sensors, are the subject of much current research [Agre and Chapman, 1987; Brooks, 1987; Hammond, Converse, and Marks, 1990; Rosenschein & Kaelbling, 1987]. In any case, the resulting system relies on its sensor abilities to drive action selection.

Reactivity, for all its recent popularity and promise, is only one approach to the problems of classical planning. In this chapter we explore two others. Both rely heavily on machine learning. We call the first Completable Planning. It is a melding of constrained reactivity for deferred goals into a basically classical planning framework. We call the second approach Permissive Planning. In it, the basic concept of a micro-world is rejected. All representations reasoned about during planning are taken to be approximate descriptions of the real world. Machine learning is relied upon to adjust planning concepts to be more permissive, that is, to work successfully in the real world in spite of their necessarily simplified representations.

COMPLETABLE PLANNING

Integrated systems [Drummond & Bresina, 1990; Gervasio, 1990b; Kaelbling, 1986; Mitchell, 1990; Turney & Segre, 1989] combine reactivity with classical planning to benefit from the goal-directedness provided by classical
planning as well as the flexibility and quick response time provided by reactivity. Completable planning [Gervasio, 1990a, 1990b; Gervasio & DeJong, 1991] is an integrated approach in which a classical planner is augmented with the ability to defer achievable goals. By providing reactive abilities, completable planning reduces reliance on perfect a priori information. By requiring that all deferred goals be achievable, completable planning enables the continued construction of provably correct plans. The integrated approach is naturally limited by the achievability constraint on deferred goals. However, it provides a solution to an interesting group of planning problems. There are many problems in relatively well-behaved domains where enough a priori information is available to enable the construction of almost complete plans. There are also certain kinds of information which are difficult to predict a priori but can trivially be gathered during execution. In these cases, a planner which constructs plans completely a priori faces an extremely difficult task, while a planner which leaves everything to be dynamically determined by reacting to the execution environment loses the goal-directed behavior provided by a priori planning. A completable planner faces neither problem, with its ability to defer achievable goals and utilize runtime information in planning. Completable planning is implemented in our system through contingent explanation-based learning [Gervasio, 1990a, 1990b; Gervasio & DeJong, 1991], an augmented explanation-based learning strategy for learning completable plans. Through contingent EBL, a planner can learn general completable plans with deferred goals and corresponding achievability proofs.

Given an initial state description I and a goal state description G, the planning problem involves determining a plan P, consisting of a sequence of actions, which when executed from the initial state will achieve a goal state. Let states(S) be the set of states satisfying the partial state description S, and let PREC(p) and EFF(p) be the sets of states satisfying the precondition and effect state descriptions of an action or action sequence p. Given I and G, a provably-correct plan for I and G is an ordered sequence of actions of the form {p1; p2; ...; pn}, constrained in the following manner:

states(I) ⊆ states(PREC(p1))
For pi ∈ {p1, p2, ..., pn-1}, states(EFF(pi)) ⊆ states(PREC(pi+1))
states(EFF(pn)) ⊆ states(G).

This is shown graphically in Figure 1. A completable plan differs from a classical (provably-correct) plan in that it may be incomplete prior to execution — i.e., particular plan components may not yet be determined. To retain the notion of provably-correct plans, additional constraints are necessary. In particular, deferred goals must be achievable. Achievability is defined as follows:
A completable plan differs from a classical (provably-correct) plan in that it may be incomplete prior to execution—i.e., particular plan components may not yet be determined. To retain the notion of provably-correct plans, additional constraints are necessary. In particular, deferred goals must be achievable. Achievability is defined as follows:

achievable(S1, S2) iff ∀ s ∈ states(S1) there exists a plan p which when executed from s will result in a state in states(S2).

Given an initial state description I and goal state description G, a completable plan for I and G is an ordered sequence of plan components of the form {p1; p2; ...; pn}, constrained in the following manner:

states(I) ⊆ states(PREC(p1)) OR achievable(PREC(p1), I)
For pi ∈ {p1, p2, ..., pn-1}, states(EFF(pi)) ⊆ states(PREC(pi+1)) OR achievable(PREC(pi+1), EFF(pi))
states(EFF(pn)) ⊆ states(G) OR achievable(G, EFF(pn)).
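Operationally, the completable-plan definition merely relaxes each chaining test of the previous sketch with an achievability escape. In the sketch below (again ours), achievability is taken as an oracle predicate supplied by the caller; in the actual approach it is discharged by the achievability proofs discussed next, never by searching for a concrete plan.

;; Sketch: the completable-plan constraints.  ACHIEVABLE-P is an oracle
;; predicate mirroring the argument order of achievable(S1, S2) above;
;; it asserts the existence of a plan without determining one.
(defun completable-p (steps i g achievable-p)
  (flet ((link-ok (from to)
           (or (subsumes-p from to)
               (funcall achievable-p to from))))
    (and (link-ok i (plan-step-prec (first steps)))
         (loop for (p q) on steps while q
               always (link-ok (plan-step-eff p) (plan-step-prec q)))
         (link-ok (plan-step-eff (car (last steps))) g))))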
Proving Achievability

An important criterion for the success of the integrated approach is proving achievability without having to determine actions to achieve the associated deferred goal. Note that the definition of achievability does not require the determination of a precise plan but rather only the existence of a plan. By deferring planning decisions, a planner buys itself the ability to use less precise a priori information as well as the ability to utilize runtime information. By deferring only achievable planning decisions, a planner enables itself to construct incomplete yet provably-correct plans. Three classes of problems have been identified wherein achievability proofs can be constructed without determining the actions to achieve the associated goals.

Repeated Actions and Terminal Goal Values. The first class of problems involves repeated actions towards a terminal goal value, such as hammering a nail all the way into a piece of wood or completely unloading a clothes-dryer. The achievability proof for this class of problems lies in the notion of incremental progress, with every execution of an action resulting in a state nearer to the goal state until the goal is reached, at which point the action has no effect. Instead of being precomputed, the precise number of repetitions needed can be determined at execution time by repeatedly performing the action until the goal is reached.

Continuously-Changing Quantities and Intermediate Goal Values. The second class of problems involves continuously-changing quantities and intermediate goal values, such as whipping cream until soft peaks form or accelerating to some higher velocity. Proving achievability for this class of problems involves reasoning about the achievability of the limits on the value of a continuously-changing quantity, which guarantees the achievability of all the values within those limits. During execution, this smaller interval can be dynamically determined by monitoring the changing quantity for the desired value.

Multiple Opportunities. The third class of problems involves multiple opportunities, such as choosing a paper cup for coffee or deciding which gas station to stop at on a long trip. The achievability proof for these problems depends upon the existence of several objects of some type needed to achieve the goal. Choosing a specific object can be deferred until the objects are in sight and one can be chosen trivially.

Achievability proofs are implemented as rule schemata, which are second-order predicate calculus rules which serve as templates from which to derive first-order predicate calculus rules for use in theorem-proving. Figure 2 shows an example of such a rule schema for the class of problems involving continuously-changing quantities and intermediate goal values, as well as a rule derived from that schema for an increasing quantity.
Intermediate Value Rule Schema
[∀q v0 v2 t0 t2
  ((value q v0 t0) AND (continuous q) AND ((θ [t0 t2]) → (value q v2 t2)))
  → [∀v1 (between v1 v0 v2)
      → [∃t1 (within t1 [t0 t2]) AND ((θ [t0 t1]) → (value q v1 t1))]]]
Intermediate Value Rule for an Increasing Quantity
[∀q v0 v2 t0 t2 v1
  ((value q v0 t0) AND (continuous q) AND
   ((qualitative_behavior q increasing (t0 t2)) → (value q v2 t2)) AND
   (between v1 v0 v2))
  → [∃t1 (within t1 [t0 t2]) AND
      ((qualitative_behavior q increasing (t0 t1)) → (value q v1 t1))]]
Figure 2. Sample Rule Schema and Derived Rule.

The reasoning embodied by this schema, also known as the Intermediate Value Theorem in calculus, is as follows. Let q be a continuous quantity having a value v0 at some time t0. Also let it be the case that certain conditions θ being true over some interval (t0 t2) will result in q having some other value v2 at time t2. Then for all values v1 between v0 and v2 there exists some time t1 within the interval (t0 t2) such that if θ holds over the interval (t0 t1), q will have the value v1 at t1. This reasoning as applied to an increasing quantity is depicted graphically in Figure 3.
Figure 3. Reasoning about intermediate values for an increasing quantity.
Probably Completable Plans

A major limitation of the achievability criterion defined above is that it is absolute—i.e., goal achievability must be guaranteed. In the real world, however, it is unlikely that goal achievement will ever be completely guaranteed, as this would involve reasoning about all possible contributing and limiting factors. In [Gervasio & DeJong, 1991] we address probable achievability. Probable achievability can be discussed in terms of the constraints it places on the different constituent components of a completable plan. There are three types of plan components: unconditional actions, conditionals, and repeat-loops. Unconditional actions are simply actions to be executed without environmental input. Classical plans consist solely of unconditional actions, and thus unconditional actions can be said to constitute the classical part of a completable plan, and conditionals and repeat-loops the reactive part. In completable planning, the deferred goals addressed by the reactive components must be achievable, and thus constraints must be placed on conditionals and repeat-loops to guarantee their achievement of the preconditions of succeeding actions.

Probably Completable Conditionals. Conditionals deal with the problem of over-general initial state descriptions. Prior to execution, all a planner may know is that at a particular point in the execution of a plan, it will be in one of several possible states satisfying some description. However, the different states satisfying this description may require different actions to achieve the preconditions for succeeding actions. For example, in planning to get to some higher floor in a new building, after going through the front door all you may know is that you will be in the lobby. However, depending upon various factors such as whether there will be a staircase or an elevator or both, your proximity to each one, which floor you wish to go to, and the functionality of the elevator, you would like to take different actions. Through conditionals, decisions can be made during execution regarding appropriate actions, using any additional information which becomes available at that point in execution.
A conditional is of the form {COND c1 → q1; c2 → q2; ...; cn → qn}, where each ci → qi is an action-decision rule which represents the decision to execute the plan qi when the conditions ci are true. Like the situation-action rules used in reactive systems such as [Drummond & Bresina, 1990; Kaelbling, 1986; Mitchell, 1990; Schoppers, 1987], action-decision rules map different situations into different actions, allowing a system to make decisions based on its current environment. However, in a completable plan a conditional pi = {COND c1 → q1; c2 → q2; ...; cn → qn} must also satisfy the following constraints for achievability:

1. Exhaustiveness: states(c1 ∨ ... ∨ cn) must be a probably exhaustive subset of states(EFF(pi-1)).
2. Observability: each ci must consist of observable conditions, where an observable condition is one for which there exists a sensor which can verify the truth or falsity of the condition.
3. Achievement: for each qi, states(EFF(qi)) ⊆ states(PREC(pi+1)).

This is shown graphically in Figure 4.

Figure 4. A completable conditional pi with three action-decision rules.
For the exhaustiveness constraint, coverage can be represented using qualitative or quantitative probabilities. The greater the coverage, the greater the conditional's chance of achieving PREC(pi+1). The observability constraint requires knowledge of sensory capability, and here we use the term sensor in the broader sense of some set of sensory actions, which we will assume the system knows how to execute to verify the associated condition. It is needed to ensure that the conditional can be successfully evaluated during execution. Finally, the achievement constraint ensures that the actions taken in the conditional achieve the preconditions of the succeeding plan component. Provided these three constraints are satisfied, the conditional is considered probably completable, and the goal PREC(pi+1) of the conditional is probably achievable.
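To make the execution semantics concrete, here is a small sketch of our own (not the authors' code) of how such a conditional might be run. Each action-decision rule pairs an observable test with a sub-plan; the tests are evaluated against runtime sensing, and a failure to match any condition signals the coverage gap that the incremental learning procedure described later is designed to repair.

;; Sketch: executing a completable conditional {COND c1->q1; ...; cn->qn}.
;; RULES is a list of (test . subplan) pairs whose tests are observable
;; conditions evaluated from a runtime percept.  If no condition holds,
;; coverage was incomplete, and we signal an error to trigger learning.
(defun execute-conditional (rules sense execute)
  (let ((percept (funcall sense)))
    (dolist (rule rules (error "Coverage gap: no condition matched."))
      (when (funcall (car rule) percept)
        (return (funcall execute (cdr rule)))))))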
Probably Completable Repeat-Loops. A repeat-loop is of the form {REPEAT q UNTIL c}, which represents the decision to execute the plan q until the test c yields true. Repeat-loops are similar in idea to servo-mechanisms, but in addition to the simple yet powerful failure-recovery strategy such mechanisms provide, repeat-loops also permit the construction of repeated action sequences achieving incremental progress towards the goal, which may be viewed as a reactive, runtime method of achieving generalization-to-N [W. W. Cohen, 1988; Shavlik & DeJong, 1987]. Repeat-loops are thus useful in completable plans for mainly two reasons: simple failure recovery and iteration for incremental progress.

Repeat-loops for simple failure-recovery are useful with actions having nondeterministic effects, which arise from knowledge limitations preventing a planner from knowing which of several possible effects a particular action will have. For example, in attempting to unlock the door to your apartment, pushing the key towards the keyhole will probably result in the key lodging in the hole. However, once in a while, the key may end up jamming beside the hole instead; but repeating the procedure often achieves the missed goal. In completable planning, if an action has several possible outcomes, and if the successful outcome is highly probable, and if the unsuccessful ones do not prevent the eventual achievement of the goal, then a repeat-loop can be used to ensure the achievement of the desired effects. A repeat-loop p = {REPEAT q UNTIL c} for failure-recovery must satisfy the following constraints for achievability:

1. Observability: c must be an observable condition.
2. Achievement: c must be a probable effect of q.
3. Repeatability: the execution of q must not irrecoverably deny the preconditions of q until c is achieved.

This is shown graphically in Figure 5a. The observability constraint is needed, once again, to be able to guarantee successful evaluation, while the achievement and repeatability constraints together ensure a high probability of eventually exiting the repeat-loop with success. As with the exhaustiveness constraint for conditionals, the repeatability constraint may be relaxed so that the execution of q need only probably preserve or probably allow the re-achievement of the preconditions of q.

Repeat-loops for incremental progress deal with over-general effect state descriptions. Once again, knowledge limitations may result in a planner not having precise enough information to make action decisions a priori. In actions which result in changing the value of a quantity, for example, your knowledge may be limited to the direction of change or to a range of possible new values, which may not be specific enough to permit making decisions regarding precise actions—for example, determining the precise number of action repetitions or the precise length of time over which to run a process in order to achieve the goal. The implicit determination of such values during execution is achieved in completable planning through the use of repeat-loops which achieve incremental progress towards the goal and use runtime information to determine when the goal has been reached. A repeat-loop p = {REPEAT q UNTIL c} for incremental progress must satisfy the following constraints for achievability:
1. Continuous observability: c must be an observable condition which checks a particular parameter for equality to a member of an ordered set of values—for example, a value within the range of acceptable values for a quantity.
2. Incremental achievement: each execution of q must result in incremental progress towards and eventually achieving c—i.e., it must reduce the difference between the previous parameter value and the desired parameter value by at least some finite, non-infinitesimal ε.
3. Repeatability: the execution of q must not irrecoverably deny the preconditions of q until c is achieved.

This is shown graphically in Figure 5b.
Figure 5. Completable repeat-loops: a. Failure recovery; b. Incremental progress.

The continuous observability constraint ensures that the progress guaranteed by the incremental achievement and repeatability constraints can be detected and the goal eventually verified. For both failure recovery and iteration for incremental progress, if the repeat-loop satisfies the constraints, the repeat-loop is considered probably completable and the goal c is achievable.
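An executor for such repeat-loops can be sketched as below. The sketch is ours, and the iteration cap is our own addition: since the achievement and repeatability constraints make success only probable, a bound keeps a loop whose assumptions fail in practice from running forever.

;; Sketch: executing {REPEAT q UNTIL c}.  TEST is the observable exit
;; condition c, evaluated from sensing after each execution of SUBPLAN.
;; MAX-ITERATIONS is an assumption of this sketch, not of the chapter.
(defun execute-repeat-loop (subplan test sense execute
                            &key (max-iterations 100))
  (loop repeat max-iterations
        do (funcall execute subplan)
        when (funcall test (funcall sense))
          do (return t)
        finally (return nil)))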
Contingent Explanation-Based Learning

Explanation-based learning (EBL) is a knowledge-intensive procedure by which general concepts may be learned from an example of the concept [DeJong & Mooney, 1986; Mitchell, Keller, & Kedar-Cabelli, 1986]. EBL involves constructing an explanation for why a particular training example is an example of the goal concept, and then generalizing the explanation into a general functional definition of that concept or more general subconcepts. In planning, explanation and generalization may be carried out over situations and actions to yield macro-operators or general control rules. Here, we are interested in learning macro-operators or general plans.

Reactive plans present a problem for standard explanation-based learning [Mooney & Bennett, 1986]. Imagine the problem of learning how to cross the street. After the presentation of an example, an explanation for how the crosser got to the other side of the street may be that the crossing took place through some suitably-sized gap between two cars. Unfortunately, the generalization of this explanation would then include the precondition that there be such a suitably-sized gap between some two cars—a precondition which for some future street-crossing can only be satisfied by reasoning about the path of potentially every car in the world over the time interval of the expected crossing! The basic problem is that standard explanation-based learning does not distinguish between planning decisions made prior to execution and those made during execution. After execution, an explanation may thus be constructed using information which became available only during execution, yielding a generalization unlikely to be useful in future instances.

Contingent explanation-based learning uses conjectured variables to represent deferred goals and completors for the execution-time completion of the partial plans derived from the general plan. A conjectured variable is a planner-posed existential used in place of a precise parameter value prior to execution, thus acting as a placeholder for the eventual value of a plan parameter. In the integrated approach, a planner is restricted to introducing conjectured variables only if achievability proofs can be constructed for the associated deferred goals. This is achieved by allowing conjectured variables in the domain knowledge of a system only in the context of a supporting achievability proof. In this manner, the provably-correct nature of classical plans may be retained in spite of the presence of conjectured variables. A completor is an operator which determines a completion to a completable plan by finding an appropriate value for a particular conjectured variable during execution. The supporting achievability proof accompanying a conjectured variable in a completable plan provides the conditions guaranteeing the achievement of the deferred goal represented by the variable. These conditions are used in constructing an appropriate completor. There are currently three types of completors, one for each of the three types of achievability proofs discussed earlier. Iterators perform a particular action repeatedly until some goal is achieved. Monitors observe a continuously-changing quantity to determine when a particular goal value for that quantity has been reached. Filters look for an object of a particular type. The contingent explanation-based learning algorithm is summarized in Figure 6.
Input training example and goal concept.
Construct an explanation for why the example is an example of the goal concept.
If an explanation is successfully constructed Then
    Generalize and construct a general plan using the goal (root), the
    preconditions (leaves) determining applicability, and the sequence of
    operators achieving the goal.
    Identify the conjectured variables in the generalized explanation.
    If there are conjectured variables Then
        For every conjectured variable:
            Identify the supporting achievability conditions.
            Construct an appropriate completor using these conditions.
            Add the completor to the operators of the general plan.
        Output general completable reactive plan.
    Else Output general non-reactive plan.
Else Signal FAILURE.
Figure 6. Contingent EBL Algorithm.

Example: Learning a Completable Plan for Spaceship Acceleration

A system written in Common LISP and running on an IBM RT Model 125 implements the integrated approach to planning and learning reactive operators. The system uses a simple interval-based representation and borrows simple qualitative reasoning concepts from Qualitative Process Theory [Forbus, 1984]. The system is thus able to reason about quantity values at time points as well as quantity behaviors over time intervals. For example, (value (velocity spaceship) 65 10) represents the fact that the spaceship is traveling at 65 m/s at time 10, and (behavior (velocity spaceship) increasing (10 17)) represents the fact that the spaceship's velocity was increasing from time 10 to 17. The system also uses a modified EGGS algorithm [Mooney & Bennett, 1986] in constructing and generalizing contingent explanations.

The system is given the task of learning how to achieve a particular goal velocity higher than some initial velocity—i.e., acceleration. The example presented to the system involves the acceleration of a spaceship from an initial velocity of 65 m/s at time 10 to the goal velocity of 100 m/s at time 17.1576, with a fire-rockets action executed at time 10 and a stop-fire-rockets action executed at time 17.1576. In explaining the example, the system uses the intermediate value rule for an increasing quantity in Figure 2 to prove the achievability of the goal velocity. It determines that the following conditions hold: 1) velocity increases continuously while the rockets are on, 2) if the rockets are on long enough, the maximum velocity of 500 m/s will be reached, and 3) the goal velocity of 100 m/s is between the initial velocity of 65 m/s and 500 m/s. There is thus some time interval over which the spaceship can be accelerated so as to achieve the goal. In this particular example, that time interval was (10 17.1576).

The general explanation yields a two-operator (fire-rockets and stop-fire-rockets) completable plan. This plan contains a conjectured variable for the time the goal velocity is reached and the stop-fire-rockets action is performed. Using the conditions provided by the achievability proof, a monitor operator for observing the increasing velocity during the acceleration process and indicating when the goal velocity is reached to trigger the stop-fire-rockets operator is created and incorporated into the general plan.
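The shape of the resulting monitor-completed plan can be suggested with a short sketch. This is our illustration only; fire-rockets, stop-fire-rockets, and sample-velocity stand in for the plan's operators and the sensor interface, none of which are spelled out in the chapter.

;; Sketch of the learned two-operator completable plan.  The monitor loop
;; is the completor: it binds the conjectured stop time by watching the
;; increasing velocity rather than precomputing it.  The three functional
;; arguments are hypothetical interfaces to the effectors and sensors.
(defun accelerate-to (goal-velocity fire-rockets stop-fire-rockets
                      sample-velocity)
  (funcall fire-rockets)
  ;; Monitor completor: a runtime stand-in for the conjectured stop time.
  (loop until (>= (funcall sample-velocity) goal-velocity))
  (funcall stop-fire-rockets))

A call such as (accelerate-to 100.0 ...) would keep the rockets firing until the sampled velocity reaches 100 m/s, mirroring the example above.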
Alternatively, the system can learn a classical plan from the same example by using equations derived from the principle of the conservation of linear momentum in order to explain the achievement of the goal velocity. This involves reasoning about various quantities, including the combustion rate of fuel and the velocity of the exhaust from the spaceship, in order to determine the acceleration rate. The learned general plan also involves two operators, but the time to stop the rocket firing is precomputed using some set of equations rather than determined during execution. Given the problem of achieving a goal velocity of vf from the initial velocity of vi at time ti, the system may construct either a completable plan from the general completable plan or a classical plan from the general classical plan (Figure 7).

Completable Plan
[ fire-rockets at time ti
  monitor increasing velocity for the goal value of vf, binding t to the time this value is reached
  stop-fire-rockets at time t ]

Classical Plan
[ fire-rockets at time ti
  given vi = velocity at time ti
        vf = goal velocity
        ve = relative exhaust velocity
        ṁ  = burn rate
        M  = initial mass of spaceship
  stop-fire-rockets at time tf = ti + t, where t is precomputed from these quantities ]

Figure 7. Completable vs. classical acceleration plans.

In computing the time at which to stop the rocket firing, the classical plan assumes a constant exhaust velocity and burn rate. Provided the expected values are accurate, it will achieve the goal velocity. However, if the actual values differ, the spaceship may not reach or may surpass the goal velocity. Even small deviations from the expected values could have devastating effects if a plan involved many such a priori computations, through which errors could get propagated and amplified. In contrast, the completable plan makes no assumptions regarding the exhaust velocity and burn rate, and instead uses execution-time information to determine when to stop firing the rockets. It is thus more likely to achieve the goal velocity regardless of such variations. For a classical planner to correctly compute when to stop the rockets, it would have to completely model the rocket-firing process—including the fuel-to-oxygen ratio, combustion chamber dimensions, nozzle geometry, material characteristics, and so on.
This intractability is avoided in the integrated approach through the deferment of planning decisions and the utilization of execution-time information in addressing deferred decisions.

Extensions to Contingent EBL

To extend completable planning to probable achievability we extended contingent EBL to learn probably completable plans. The idea of probably completable plans lends itself naturally to incremental learning strategies. Conditionals, for example, represent a partitioning of a set of states into subsets requiring different actions to achieve the same goal. With probable achievability, a plan may include only some of these subsets. As problems involving the excluded subsets are encountered, however, the plan can be modified to include the new conditions and actions. Similarly, incremental learning can be used to learn failure-recovery strategies within repeat-loops. The motivation behind the incremental learning of reactive components is similar to the motivation behind much work on approximations and learning from failure, including [Bennett, 1990; Chien, 1989; Hammond, 1986; Mostow & Bhatnagar, 1987; Tadepalli, 1989]. The primary difference between these approaches and completable planning is that in these approaches, a system has the ability to correct the assumptions behind its incorrect approximations and thus tends to converge upon a single correct solution for a problem. In completable planning, uncertainty is inherent in the knowledge representation itself, and the system instead addresses the problem of ambiguity through reactivity. As a system learns improved reactive components, it thus tends to increase a plan's coverage of the possible states which may be reached during execution.

The preconditions of an action may be satisfied either prior to execution or during execution. The procedure in Figure 8 is applied to learned general plans to distinguish between these two types of preconditions.

For each precondition pr:
    If pr is not satisfied by I then
        If pr is observable then
            Find all operators supported by pr.
            For each such operator:
                Make the execution of that operator conditional on pr.
            Remove pr from the general plan's preconditions.

Figure 8. Procedure to distinguish between preconditions.

A conditional manifests itself in an explanation as multiple, disjunctive paths between two nodes (Figure 9a), with a path representing one action-decision rule, the leaves which cannot be satisfied in the initial state forming the condition, and the operators along the path forming the action.
Since coverage may be incomplete, a system may fail to satisfy any of the conditions within a conditional, in which case the system has the option of learning a new alternative (Figure 9b) to solve the current problem and to increase coverage in future problems (Figure 9c).
Figure 9. Explanation structures in learning new conditionals: a. old conditional; b. new alternative; c. new conditional.

The procedure in Figure 10 adds a new rule into a conditional.

new-to-add := plan components in the new plan not matching any in the old plan
old-to-change := plan component in the old plan not matching any in the new plan
Make a new action-decision rule using new-to-add.
Append the new rule to the action-decision rules of old-to-change.
For each precondition pr in the new plan:
    If pr is not already in the old plan then add pr to the preconditions of the old plan.

Figure 10. Procedure to add a new rule to a conditional.

Recall that for conditionals to be completable, they must satisfy the constraints of exhaustiveness, observability, and achievement. Since the plans here are derived from explanations, the constraint of achievement is already satisfied. The procedure above checks for observability. For the exhaustiveness constraint, let X be the desired minimum coverage, where X can be a user-supplied value or one computed from other parameters such as available resources and the importance of success. Coverage can be represented by qualitative probabilities—for example, the term "usually" can be used to denote high probability. The exhaustiveness constraint is satisfied in a conditional {COND c1 → q1; ...; cn → qn} iff the probability of (c1 ∨ c2 ∨ ... ∨ cn) is at least X.

Repeat-loops for simple failure-recovery address the problem of actions with nondeterministic effects or multiple possible outcomes, and thus repeat-loops are first constructed by identifying such actions in the general plan using the procedure in Figure 11.

For each action a in the plan:
    If the outcome of a used in the plan is a probable outcome among others then
        If the desired outcome c is observable then
            Construct a repeat-loop for a.

Figure 11. Procedure for constructing a repeat-loop.

Recall that for a repeat-loop for failure-recovery to be completable, it must satisfy the constraint of repeatability aside from the constraints of observability and achievement. If the unsuccessful outcomes of a do not prevent the repetition of a, then the repeatability constraint is satisfied, and the probable eventual achievement of the desired effects is guaranteed.
However, for unsuccessful outcomes which deny the preconditions of a, actions to recover the preconditions must be learned. Precondition-recovery strategies within a repeat-loop can be characterized as a conditional, where the different states are the different outcomes, the actions are the different recovery strategies, and the common effect state is the precondition state of the action a. If we let ui be an unsuccessful outcome, and ri be the recovery strategy for ui, then a repeat-loop eventually takes the form {REPEAT {q; [COND u1 → r1; ...; un → rn]} UNTIL c}. Learning the embedded conditional for failure recovery can be done as in the previous section.

Example: Learning a Probably Completable Train Route Plan

The system was given the task of learning a plan to get from one small city to another going through two larger cities using a train. The primary source of incompleteness preventing complete a priori planning is the system's knowledge with regard to the state of the railroads. For a system to get from one city to another, the cities have to be connected by a railroad, and the railroad has to be clear. For a railroad to be considered clear, it must not be flooded, not be congested with traffic, be free of accidents, and not be under construction. These conditions cannot be verified a priori for all railroads, hence the need for conditionals.

The training example involves getting from the city of Wyne to the city of Ruraly, where the rail connectivity between the two cities is shown in Figure 12. Here, the railroad A-B is a major railroad and sometimes gets congested.
Figure 12. Rail connectivity between Wyne and Ruraly.
Also, the northern railroads to and from X, C, and Z are susceptible to flooding. And accidents and construction may occur from time to time. The initial training example given to the system is the route Wyne-A-B-Ruraly, which is generally the quickest way to get from Wyne to Ruraly. The learned general plan gets a train from one city to another with two intermediate stops, where only the railroad between the two intermediate cities is susceptible to heavy traffic and needs to be checked for it (Figure 13). When the system encounters a situation in which none of the conditions in a conditional is satisfied—in this example, the no-traffic condition is false just as the system is to execute (go Amatrak A B A-B) to achieve (at Amatrak B)—the system is given the alternative route A-C-B, which gets the system to B and allows it to continue with the next step in its original plan and achieve its goal of getting to Ruraly.
Figure 13. Initial Learned Plan.

From this experience, the system modifies its old plan to include the new alternative of going through another city between the two intermediate cities. The system thus now has two alternatives when it gets to city A. When it encounters a situation in which A-B is congested and A-C is flooded, it is given yet another alternative, A-D-E-B, from which it learns another plan to get from A to B and modifies the old plan as before. Now, in planning to get from Wyne to Ruraly, the system constructs the conditional in Figure 14, which corresponds to the second conditional in Figure 13.

PLAN1
[COMPS
  [COND ((NOT (ACC A-B)) (NOT (CON A-B)) (NOT (TRF A-B)))
          -> ((GO AMATRAK A B A-B))
        ((NOT (ACC A-C)) (NOT (CON A-C)) (NOT (FLD A-C)))
          -> (((GO AMATRAK A C A-C))
              (COND (((NOT (ACC C-B)) (NOT (CON C-B)) (NOT (FLD C-B)))
                      -> ((GO AMATRAK C B C-B)))))
        ((NOT (ACC A-D)) (NOT (CON A-D)))
          -> (((GO AMATRAK A D A-D))
              (COND (((NOT (ACC D-E)) (NOT (CON D-E)))
                      -> ((GO AMATRAK D E D-E))))
              (COND (((NOT (ACC E-B)) (NOT (CON E-B)))
                      -> ((GO AMATRAK E B E-B)))))]]
Figure 14. Final conditional in the specific plan for getting from Wyne to Ruraly.

Note that the incremental learning algorithm permits the system to learn conditionals only on demand. In this example, alternative routes for getting from Wyne to A and from B to Ruraly are not learned. Assuming either a training phase or an evaluation step for determining whether particular situations are likely to occur again, a system can use this algorithm to learn minimally contingent plans.

Limitations

Completable planning represents a trade-off. A planner in this approach incurs the additional cost of proving achievability as well as completing plans during execution. Our intuitions, however, are that there is a whole class of interesting problems for which proving achievability is much easier than determining plans and where additional runtime information facilitates planning. Future work will investigate such problems more thoroughly to develop a crisper definition of the class of problems addressed by completable planning. As these problems are better defined, contingent EBL may also need to be extended to enable
the learning of completable plans with different kinds of deferred decisions. This includes learning to construct different types of achievability proofs and completors. Another direction for future work is a more thorough analysis of the trade-off between the advantages brought and the costs incurred by completable planning. Aside from the a priori planning cost completable plans have over reactive plans, and the runtime evaluation cost completable plans have over classical plans, in proving achievability completable plans also sometimes require knowledge about the general behavior of actions not always available in traditional action definitions. On the other hand, completable planning minimizes a priori information requirements. Related to this is the development of completable planning within a hierarchical planning framework [Sacerdoti, 1974; Stefik, 1981]. Casting completable planning in such a framework gives rise to several interesting research issues, including the development of an abstraction hierarchy which incorporates runtime decision-making (as in [Firby, 1987]) and using achievability as a criterion for defining a hierarchy.

PERMISSIVE PLANNING

Permissive planning is, in some ways, the dual of the reactive approach. Like the reactive approach, it gives up the notion of a provably correct plan. However, the concept of projection remains. Indeed, it is, if anything, more central than before. In most real-world domains it is impossible to describe the world correctly and completely. It follows that internal system representations of the world must, at best, be approximate. Such approximations may arise from imperfect sensors, incomplete inferencing, unknowable features of the world, or limitations of a system's representation ability. We introduce the concept of permissiveness of a plan as a measure of how faithfully the plan's preconditions must reflect the real world in order for the plan to accomplish its goals. One plan is more permissive than another if its representations can be more approximate while continuing to adequately achieve its goals. We do not propose to quantify this notion of permissiveness. Instead, we employ a machine learning approach which enhances the permissiveness of acquired planning concepts. The approach involves acquiring and refining generalized plan schemata or macro-operators which achieve often-occurring general goals and sub-goals. Acquisition is through rather standard explanation-based learning [DeJong & Mooney, 1986; Mitchell, Mahadevan, & Steinberg, 1985; Mitchell et al., 1986; Segre, 1988]. However, the refinement process is unique.

Improving Permissiveness

To drive refinement, the system constantly monitors its sensors during plan execution. When sensor readings fall outside of anticipated bounds, execution
ceases and the plan is judged to have failed. The failure can only be due to a data approximation; if there were no mismatch between internal representations and the real world, the plan would have the classical planning property of provable correctness. The plan's failure is diagnosed. Ideally, only a small subset of the system's data approximations could underlie the monitored observations. The system conjectures which of its data representations, if incorrect, might account for the observations. Next, the system uses qualitative knowledge of the plan's constituent operators. The small conjectured error is symbolically propagated through the plan to its parameters. The plan parameters are adjusted so as to make the planning schema less sensitive to the diagnosed discrepancy with the world. If the process is successful, the refined schema is uniformly more permissive than the original, which it replaces. Thus, through interactions with the world, the system's library of planning schemata becomes increasingly permissive, reflecting a tolerance of the particular discrepancies that the training problems illustrate. This, in turn, results in a more reliable projection process.

Notice that there is no improvement of the projection process at the level of individual operators. Performance improvement comes at the level of plan schemata, whose parameters are adjusted to make them more tolerant of real-world uncertainties in conceptually similar future problems. Adjustment is neither purely analytical nor purely empirical. Improvement is achieved through an interaction between qualitative background knowledge and empirical evidence derived from the particular real-world problems encountered.

Domain Requirements

The notion of permissive planning is not tied to any particular domain. Though domain-independent, it is nonetheless not universally applicable. There are characteristics of domains, and problem distributions within domains, that indicate or counter-indicate the use of permissive planning. An application that does not respect these characteristics is unlikely to benefit from the technique.

For permissive planning to help, internal representations must be approximations to the world. By this we mean that there must be some metric for representational faithfulness, and that along this metric, large deviations of the world from the system's internal representations are less likely than small deviations. Second, some planning choices must be subject to continuous real-valued constraints or preferences. These choices are called parameters of the plan schema. They are usually real-valued arguments to domain operators that must be resolved before the plan can be executed. Permissiveness is achieved through tuning preferences on these parameters. Finally, the planner must be supplied with information on how each operator's preconditions and arguments qualitatively change its effects. This information is used to regress symbolic representations of the diagnosed out-of-bounds approximations through the planning structure. Such propagation determines how parameters should be adjusted so as to decrease the likelihood of similar future failures. Once determined, the information so gained embodies a new preference for how to resolve parameter values.
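A minimal rendering of this qualitative bookkeeping, assuming (as a simplification of ours, not a description of the implementation) that each parameter carries just the sign of its qualitative influence on the monitored quantity:

;; Sketch: choosing a tuning direction for one plan parameter.
;; INFLUENCE-SIGN is +1 or -1, the qualitative effect of increasing the
;; parameter on the monitored quantity; DISCREPANCY says which way the
;; observed quantity deviated.  The product fixes the preferred direction.
(defun tuning-direction (influence-sign discrepancy)
  (let ((needed (ecase discrepancy (:too-low +1) (:too-high -1))))
    (if (plusp (* influence-sign needed))
        :prefer-larger
        :prefer-smaller)))

For instance, if arm height influences obstacle clearance positively (+1) and the diagnosed clearance was :too-low, (tuning-direction +1 :too-low) yields :prefer-larger, pushing future height choices upward.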
Permissive Planning in Robotics

Clearly, many domains do not respect these constraints. However, robotic manipulation domains form an important class in which the above characteristics are naturally enforced. Consider the data approximation constraint. A typical expression in a robotics domain may refer to real-world measurements. Object positions and dimensions require the representation of metric quantities. An example might be something like (HEIGHT-IN-INCHES BLOCK3 2.2). Such an expression is naturally interpreted as an approximation to the world. Indeed, expressions such as this one are useless in the real world under a standard semantics. The conditions of truth require that the height of the world object denoted by BLOCK3 be exactly 2.2 inches. Technically, no deviation whatsoever is permitted. If the height of BLOCK3 is off by only 10^-40 inches, the expression is false—just as false as if it were off by 5 inches or 50 inches. Clearly, such an interpretation cannot be tolerated; the required accuracy is beyond the numerical representational capabilities of most computers. Another nail is driven into the coffin of standard semantics by real-world constraints. Actual surfaces are not perfectly smooth. Since the top and bottom of BLOCK3 most likely vary by more than 10^-40 inches, the "height" of a real-world object is not a well-defined concept. In fact, no working system could interpret expressions such as the one above as describing the real world. The most common of several alternatives is to relegate the system to a micro-world. Here, the system implementor takes on the responsibility for insuring that no problems will result from necessarily imprecise descriptions of the domain. In general, this requires the implementor to characterize in some detail all of the future processing that will be expected of the system. Often he must anticipate all of the planning examples that the system will be asked to solve. Other alternatives have been pursued involving explicit representations of and reasoning about error [Brooks, 1982; Davis, 1986; Erdmann, 1986; Hutchinson & Kak, 1990; Lozano-Perez, Mason, & Taylor, 1984; Zadeh, 1965] and guaranteed conservative representations [Malkin & Addanki, 1990; Wong & Fu, 1985; Zhu & Latombe, 1990]. These either sacrifice completeness, correctness, or efficiency, and offer no way of tuning or optimizing their performance through interactions with the world.

Expressions such as (HEIGHT-IN-INCHES BLOCK3 2.2) are extremely common in robotic domains and can be easily interpreted as satisfying our informal definition of an approximation: the metric for faithfulness is the real-valued height measure, and, presumably, if a reasonable system describes the world using the expression (HEIGHT-IN-INCHES BLOCK3 2.2), it is more likely the case that any point on the top surface of BLOCK3 is 2.2001 inches high than 7.2 inches high.
It is essential that the expression not saddle the system with the claim that BLOCK3 is precisely 2.2 inches high.

The second condition for permissive planning requires that continuous real-valued parameters exist in the system's general plans. Geometric considerations in robotic manipulation domains insure that this condition is met. Consider some constraints on a robot manipulator motion past a block (in fact BLOCK3), which rests on the table. Some height must be adopted for the move. From the geometrical constraints there is a minimum height threshold for the path over the block. Since the arm must not collide with anything (in particular with BLOCK3), it must be raised more than 2.2 inches above the table. This height threshold is one of the plan parameters. Any value greater than 2.2 inches would seem to be an adequate bound on the parameter for the specific plan; if 2.2 inches is adequate, so is 2.3 inches, or 5.0 inches, etc. Thus, the plan supports the parameter as a continuous real-valued quantity. Notice that once the specific plan of reaching over BLOCK3 is generalized by EBL, the resulting plan schema parameterizes the world object BLOCK3 to some variable, say ?x, and the value 2.2 to ?y, where (HEIGHT-IN-INCHES ?x ?y) is believed, and the threshold parameter to ?z, where ?z is equivalent to (+ ?y ε) for the tight bound, or equivalent to (+ ?y ε 0.1) for the bound of 2.3, or equivalent to (+ ?y ε 2.8) for the bound of 5.0, etc. The value of ε insures that the bound is not equaled and can be made arbitrarily small in a perfect world. As will become clear, in permissive planning ε may be set identically to zero or left out entirely.

The final condition for permissive planning requires qualitative information specifying how the effects of domain operators relate to their preconditions and arguments. This constraint, too, can be naturally supported in robotic manipulation domains. Consider again the plan of moving the robot arm past BLOCK3. The plan involves moving the arm vertically to the height ?z and then moving horizontally past the obstacle. The required qualitative information is that the height of the robot arm (the effect of MOVE-VERTICALLY) increases as its argument increases and decreases as its argument decreases. With this rather simple information the generalized plan schema for moving over an obstacle can be successfully tuned to prefer higher bounds, resulting in a more permissive plan schema.

One might imagine that the system would behave similarly if we simply chose to represent BLOCK3 as taller than it really is. But permissive planning is more than adopting static conservative estimates for world values. Only in the context of moving past objects from above does it help to treat them as taller than their believed heights. In other contexts (e.g., compliantly stacking blocks) it may be useful to pretend the blocks are shorter than believed. Permissive planning adjusts the planning concepts, not the representations of the world.
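This context-sensitivity can be pictured as a preference table keyed by plan schema and parameter. The table below is entirely hypothetical, a sketch of the idea rather than GRASPER's actual data structures; only the two contexts from the discussion above appear in it.

;; Sketch: context-specific parameter preferences.  The same believed
;; block height yields opposite tunings in different plan schemata; the
;; world representation itself is never edited.
(defparameter *parameter-preferences*
  '(((:schema move-over-obstacle :parameter clearance-height)
     . :prefer-larger)    ; tolerate under-estimated obstacle heights
    ((:schema stack-compliantly :parameter descent-height)
     . :prefer-smaller))) ; tolerate over-estimated block heights

(defun preference-for (schema parameter)
  (cdr (assoc (list :schema schema :parameter parameter)
              *parameter-preferences* :test #'equal)))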
Permissive planning therefore preserves the context over which each parameter adjustment results in improved, rather than degraded, performance.

From a different point of view, permissive planning amounts to blaming the plan for execution failures, even when in reality the accuracy of the representations of the world, not the plan, is at fault. This is a novel approach to planning which results in a different, rather strange semantics for the system's representations. Current research includes working out a more formal account of the semantics for representations in permissive plans. Straightforward interpretations of the expressions as probabilistic seem not to be sufficient. Nor are interpretations that view the expressions as fuzzy or as having uncertainty or error bounds. The difficulty lies in an inability to interpret an expression in isolation. An expression "correctly" describes a world if it adequately supports the permissive plans that make use of it. Thus, an expression cannot be interpreted as true or not true of a world without knowing the expression's context, including the system's planning schemata, their permissiveness, and the other representations that are believed.

The GRASPER System

The GRASPER system embodies our ideas of permissive planning. GRASPER is written in Common Lisp running on an IBM RT125. The system includes an RTX SCARA-type robotic manipulator and a television camera mounted over the arm's workspace. The camera sub-system produces bitmaps from which object contours are extracted by the system. The RTX robot arm has encoders on all of its joint motors and the capability to control many parameters of the motor controllers, including motor current, allowing a somewhat coarse control of joint forces.

The GRASPER system learns to improve its ability to stably grasp isolated novel real-world objects. Stably grasping complex and novel objects is an open problem in the field of robotics. Uncertainty is one primary difficulty in this domain. Real-world visual sensors cannot, even in principle, yield precise information. Uncertainty can be reduced and performance improved by engineering the environment (e.g., careful light source placement). However, artificially constraining the world is a poor substitute for conceptual progress in planner design. The position, velocity, and force being exerted by the arm, whether sensed directly or derived from sensory data, are also subject to errors, so that the manipulator's movements cannot be precisely controlled. Nor can quantities like the position at which the manipulator first contacts an object be known precisely. Intractability also plays a significant role in this domain. To construct plans in a reasonable amount of time, object representations must be simplified. This amounts to introducing some error in return for planning efficiency. Altogether, the robotic grasping domain provides a challenging testbed for learning techniques. Figure 15 shows the laboratory setup.
Figure 15. GRASPER Experimental Setup.

Our current goal for the GRASPER system in the robotic grasping domain is to successfully grasp isolated plastic pieces from several puzzles designed for young children. The system does not possess any model of the pieces prior to viewing them with its television camera. Since the pieces are relatively flat and of fairly uniform thickness, an overhead camera is used to sense piece contours. These pieces have interesting shapes and are challenging to grasp. The goal is to demonstrate improved performance at the grasping task over time in response to failures.

Concept Refinement in GRASPER

No explicit reasoning about the fact that data approximations are employed takes place during plan construction or application. Thus, planning efficiency is not compromised by the presence of approximations. Indeed, efficiency can be enhanced, as internal representations for approximated objects may be simpler. The price of permissive planning with approximations is the increased potential for plan execution failures due to discrepancies with the real world.

GRASPER's permissive planning concepts contain three parts. First, there is a set of domain operators to be applied, along with their constraints. This part is similar to other EBL-acquired macro-operators [Mooney, 1990; Segre, 1988] and is not refined. Second, there is a specification of the parameters within the macro-operator and, for each, a set of contexts and preferences for their settings. Third, there is a set of sensor expectations. These include termination conditions for the macro and bounds on the expected readings during executions of the macro. If the termination conditions are met and none of the expectations are violated, then the execution is successful. Otherwise it is a failure. A failed execution indicates a real-world contradiction; a conclusion, supported by the system's internal world model, is inconsistent with the measured world.
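The sensor-expectation part of a concept can be sketched as interval bounds checked during execution, with an out-of-bounds reading aborting the macro and handing the discrepancy to failure diagnosis. The representation below, an alist of expected intervals, is our assumption for illustration, not GRASPER's format.

;; Sketch: monitoring sensor expectations during macro execution.
;; EXPECTATIONS is a list of (sensor low high) triples; READINGS is an
;; alist of observed values.  A reading outside its interval aborts the
;; check and reports the offending sensor for qualitative regression.
(defun check-expectations (expectations readings)
  (dolist (expectation expectations t)
    (destructuring-bind (sensor low high) expectation
      (let ((value (cdr (assoc sensor readings))))
        (unless (and value (<= low value high))
          (return (values nil sensor value)))))))

For example, (check-expectations '((gripper-separation 28 32)) '((gripper-separation . 0))) returns NIL plus the offending sensor and value, the analogue of the close-gripper failure described below.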
It is only during failure handling that the system accesses information about approximations. In the spirit of permissive planning, the planning concept that supports the contradiction is blamed for the failure. A symbolic specification of the difference between the observed and expected sensor readings is qualitatively regressed through the concept's explanation. This regression identifies which parameters can influence the discrepancy and also discovers in which direction they should be tuned in order to reduce the discrepancy. A parameter is selected from among the candidates, and a new preference is asserted for the context corresponding to the failure conditions. The preferences themselves are qualitative—"Under these conditions, select the smallest consistent value from the possibilities available." The resulting planning concept contains more context-specific domain knowledge and is uniformly more likely to succeed than its predecessor.

As an aside, it is important that the proposed new parameter preference be consistent with previous context preferences for that parameter. If the new preference cannot be reconciled with existing experiential preferences, the original macro-operator structure is flawed or an incorrect selection was made (possibly during some previous failure) from the candidate parameters. Ongoing research is investigating how to handle such inconsistencies in a theoretically more interesting way than simple chronological backtracking across previous decisions. The current system does no more than detect such "over-tuning" of parameters.

We will now consider a brief example showing the GRASPER system refining its grasping concept. The system already possesses an EBL-acquired planning concept for grasping small objects. Basically, the concept says to raise the arm with the gripper pointing down, to select grasping points on the object, to position itself horizontally over the object's center of mass, to open the gripper, to rotate the wrist, to lower the arm, and to close the gripper. Also specified are parameters (like how high initially to move the gripper, how close the center of mass must be to the line between the grasp points, how wide to open the gripper before descending, etc.), and sensor expectations for positions, velocities, and forces for the robot's joints. Through experience prior to the example, the grasping concept was tuned to open the gripper as wide as possible before descending to close on the object. This yields a plan more permissive of variations in the size, shape, and orientation of the target object.

A workspace is presented to the GRASPER system. Figure 16 shows the output of the vision system (on the left) and the internal object representations on the right, with gensymed identifiers for the objects. The upper center object (OBJECT5593) is specified as the target for grasping. Figure 17 highlights the selected target object.
Figure 16. System Status Display During Grasp of Object5593.
Figure 17. Grasp Target and Planned Finger Positions (arrows illustrate the planned finger positions).

The dark line indicates the polygonal object approximation. This is the object's internal representation used in planning. The light colored pixels show the vision system output, which more accurately follows the object's true contours. The arrows illustrate the planned positions for the fingers in the grasping plan. Notice that the fingers are well clear of the object due to previous experience with the opening-width parameter. The chosen grasp points are problematic. A human can correctly anticipate that the object may "squirt" away to the lower left as the gripper is closed. GRASPER, however, has a "proof" that closing on the selected grasp points will achieve a stable grasp. The proof is simply a particular instantiation of GRASPER's planning schema showing that it is satisfiable for OBJECT5593. The proof is, of course, contingent on the relevant internal representations of the world being "accurate enough," although this contingency is not explicitly represented nor acknowledged by the schema.
In particular, the coefficient of friction between any two surfaces is believed to be precisely 1.0. This is incorrect. If it were correct, the gripper could stably grasp pieces whose grasped faces made an angle of up to 45 degrees. The system believes the angle between the target faces of OBJECT5593 is 41.83 degrees, well within the 45 degree limit. This is also incorrect. The action sequence is executed by the arm while monitoring for the corresponding anticipated sensor profiles. During a component action (the execution of the close-gripper operator) the expected sensor readings are violated, as shown in Figure 18.
Figure 18. Expected vs. Observed Features. ly. Some expectations are qualitative and so cannot be easily captured on such a graph. Position is in millimeters, force is in motor duty cycle where 64 is 100%. Only the observed data for the close-gripper action are given. This action starts approximately 10 seconds into the plan and concludes when the two fingers touch, approximately 18 seconds into the plan. The termination condition for close-gripper (the force ramping up quickly with littlefingermotion) is met, but close-gripper expects this to occur while the fingers are separated by the width of the piece. This expectation is violated so the close-gripper action and the grasp plan both fail. It is assumed that thefingerstouched because the target piece was not between them as they closed. A television picture after the failure verifies that the gripper was able to close completely because the target object is not where it used to be. The piece is found to have moved downward and to the left. The movement is attributed to the plan step in which expectations began to go awry, namely, the close-gripper action.
108 The failure is explained in terms of the original "proof" that the close-gripper action would result in a stable grasping of OBJECT5593. While reasoning about execution failures, the system has access to information about which world beliefs are approximate. The failure is "explained" when the system discovers which approximate representations may be at fault. The system must identify approximations which, if their values were different, would transform the proof of a stable grasp into a proof for the observed motion of the piece. In this example, the offending approximations are the angle between the target faces of OBJECT5593 which may be an under-estimate and the coefficient of friction between the gripper fingers and the faces which may be an over-estimate. Errors of these features in the other direction (e.g., a coefficient of friction greater than 1.0) could not account for the observation. We might, at this point, entertain the possibility of refining the approximations. This would be the standard AI debugging methodology which contributes to the conceptual underpinnings of many diverse AI researchfromdefault reasoning to debugging almost-correct plans to diagnosis to refining domain knowledge. However, debugging the system's representations of the world is not in the spirit of permissive planning. We do not believe a fully debugged domain theory is possible even in principle. The approximate beliefs (face angle and coefficient of friction representations) are left as before. Instead, the system attempts to adjust the plan to be less sensitive to the offending approximations. This is done by adjusting preferences for the parameters of the planning concept Adjustment is in the direction so as to increase the probability that the original conclusion of a stable grasp will be reached and to reduce the probability of the observed object motion. This is a straightforward procedure given the qualitative knowledge of the plan. All parameters that, through the structure of the plan, can qualitatively oppose the effects of the out-of-bound approximations are candidates. In the example, the only relevant plan parameter supports the choice of object faces. The previous preferences on the parameters to choose between face pairs are that they each have a minimum length of 5 cm., they straddle the center of geometry, and the angle they form must be greater than 0 and less than 45 degrees. The first and second preferences are unchanged; the third is qualitatively relevant to the offending approximations and is refined. The initial and refined preferences are shown in Figure 19. Note that the refinement is itself qualitative, not quantitative. Previously, the particular angle chosen within the qualitatively homogeneous intervalfrom0 to 45 degrees was believed to be unimportant (a flat preference). The system now believes that angles within that interval can influence the success of grasping and that small angles (more nearly parallel faces) are to be preferred. Angles greater than 45 degrees are not entertained. Notice that this preference improves robustness regardless of which approximation (the
[Figure 19. Refinement of Angle Preference: two plots of preference (Pref) versus angle between faces (0 to 180 degrees), showing the preference function before the example and the preference function after the example.]
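To make the qualitative refinement concrete, here is a minimal sketch of the two preference functions plotted in Figure 19. Only the qualitative shape (flat before, decreasing after, with angles of 45 degrees or more excluded) comes from the text; the particular numeric form is an illustrative assumption.

def preference_before(angle_deg):
    # Flat preference: any angle in the interval (0, 45) is equally good.
    return 1.0 if 0.0 < angle_deg < 45.0 else 0.0

def preference_after(angle_deg):
    # Refined preference: smaller angles (more nearly parallel faces) are
    # preferred; angles of 45 degrees or more are still not entertained.
    if not (0.0 < angle_deg < 45.0):
        return 0.0
    return 1.0 - angle_deg / 45.0

# The planner would rank candidate face pairs by this score, so that,
# e.g., preference_after(10.0) > preference_after(30.0).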
Notice that this preference improves robustness regardless of which approximation (the coefficient of friction or the angle between faces) was actually at fault for the failed grasping attempt. When given the task again, GRASPER plans the grasping points given in Figure 20, which is successful.
[Figure 20. Successful Grasp Positions. Arrows illustrate planned finger positions.]
The change results in improved grasping effectiveness for other sizes and shapes of pieces as well. In basic terms, the refinement says that one way to effect a more conservative grasp of objects is to select grasping faces that make a more shallow angle to each other.
Empirical Results
The GRASPER system was given the task of achieving stable grasps on the 12 smooth plastic pieces of a children's puzzle. Figure 21 shows the gripper and several of the pieces employed in these experiments. A random ordering and set of orientations was selected for presentation of the pieces. Target pieces were also placed in isolation from other objects. That is, the workspace never had pieces near enough to the grasp target to impinge on the decision made for grasping the target. The first run was performed with preference tuning turned off. The results are illustrated in Figure 22. Failures observed during this run included finger stubbing failures (FS), where a gripper finger struck the top of the
object while moving down to surround it, and lateral slipping failures (LS), where, as the grippers were closed, the object slipped out of grasp, sliding along the table surface.
Figure 21. Gripper and Pieces.
[Figure 22. Comparison of Tuning to Non-tuning in Grasping the Pieces of a Puzzle: two panels recording failures (FS = finger stubbing) over trials 1 through 12, without tuning and with tuning.]
The given coefficient of friction (1.0) and the choice for opening width as the object chord resulted in a high error rate. There were 9 finger stubbing failures and 1 lateral slipping failure in 12 trials. In our second run, preference tuning was turned on. An initial stubbing failure on trial 1 led to a tuning of the chosen-opening-width parameter, which determines how far to open for the selected grasping faces. Since the generated qualitative tuning explanation illustrates that opening wider would decrease the chance of this type of failure, the system tuned the parameter to choose the largest opening width possible (constrained only by the maximum gripper opening). In trials 2 and 3, finger stubbing failures did not occur because the opening width was greater than the object width for that orientation. Vertical slipping failures (VS), about which the current implementation does not have knowledge, did occur. Preventing vertical slipping failures involves knowing shape information
along the height dimension of the object, which we are considering providing in the future using a model-based vision approach. In trial 5, a lateral slipping failure is seen, and the qualitative tuning explanation suggests decreasing the contact angle between selected grasping surfaces, as in the example above. Single examples of the finger stubbing and lateral slipping failures were sufficient to eliminate those failure modes from the later test examples.
Limitations
Permissive planning is not a panacea. To be applicable, the domain must satisfy some strong constraints, as outlined above. Furthermore, there are other obstacles besides projection that must be surmounted in salvaging some vestige of traditional planning. In particular, the search space of an unconstrained planner seems intractably large. Here we might buy into an IOU for Minton-style [Minton, 1988] utility analysis for EBL concepts. That is not the focus of the current research. However, the endeavor of permissive planning would be called into question should that research go sour or fail to extend to schema-type planners. Our current conceptualization of permissive planning is more general than is supported by the implementation. For example, there is no reason that increasing permissiveness need be relegated to adjusting parameter preferences. Structural changes in the planning concept may also be entertained as a means of increasing permissiveness. The current implementation may be pushed to do so through simple chronological backtracking through permissiveness and planning choices when inconsistent parameter preferences arise. We are searching for a more elegant method. Our current theory of permissive planning also leaves room for improvement. A more formal and general specification of permissive planning is needed. There are questions about the scope of applicability, correctness, and source of power that can be resolved only with a more precise statement of the technique. For example, we currently rely heavily on qualitative propagation through the planning "proof." Is qualitative reasoning necessary, or is its use merely a consequence of the fact that permissiveness is achieved through tuning preferences on continuous parameters? The current theory and implementation also rely on the notion of centralized macro-operators. These provide the context, a kind of appropriate memory hook, for permissiveness enhancements. But is commitment to a schema-like planner necessary to support permissive planning, or only sufficient? These are the questions that drive our current research.
CONCLUSIONS
A primary motivation for our work is that internal representations of the external physical world are necessarily flawed. It is neither possible nor desirable for a planner to manipulate internal representations that are perfectly faithful to
the real world. Even apparently simple real-world objects, when examined closely, reveal a subtlety and complexity that is impossible to model perfectly. Increasing the complexity of the representations of world objects can dramatically degrade planning time. Furthermore, in most domains, there can be no guarantee that a feature of the world, no matter how inconspicuous it seems, can be safely ignored. Very likely, some plan will eventually be entertained that exploits the over-simplified representation. As a result, the standard planning process of projection, anticipating how the world will look after some actions are performed, is problematic. The problems arising from imperfect a priori knowledge in classical planning were recognized as early as the STRIPS system, whose PLANEX component employed an execution algorithm which adapted predetermined plans to the execution environment [Fikes, Hart, and Nilsson, 1972]. Augmenting traditional planning with explicit reasoning about errors and uncertainties complicates the problem [Brooks, 1982; Davis, 1986; Erdmann, 1986; Hutchinson & Kak, 1990; Lozano-Perez et al., 1984; Zadeh, 1965]. Such systems, which model error explicitly, are subject to a similar problem: the error model employed is seldom faithful to the distributions and interactions of the actual errors and uncertainties. The same issues of mismatches between domain theories and the real world arise when the domain theory is a theory about uncertainties. Other work, such as [Wilkins, 1988], addresses these problems via execution monitoring and failure recovery. More recently, Martin and Allen (1990) presented a method for combining strategic (a priori) and dynamic (reactive) planning, but it uses an empirically based approach rather than a knowledge-based approach for proving achievability. The idea of incrementally improving a plan's coverage is also presented in [Drummond & Bresina, 1990], where a plan's chance of achieving the goal is increased through robustification. This deals primarily with actions having different possible outcomes, while the conditionals in this work deal with the problem of over-general knowledge. The idea of conditionals is also related to the work on disjunctive plans, such as [Fox, 1985; Homem de Mello & Sanderson, 1986], although these have focused on the construction of complete, flexible plans for closed-world manufacturing applications. There has also been work in learning disjunctions using similarity-based learning techniques [Shell & Carbonell, 1989; Whitehall, 1987]. Other work on integrating a priori planning and reactivity [Cohen, Greenberg, Hart, and Howe, 1989; Turney & Segre, 1989] focuses on the integration of the planning and execution of multiple plans. There has also been some work in learning stimulus-response rules for becoming increasingly reactive [Mitchell, 1990]. In this paper we have described two other approaches. Permissive planning endorses a kind of uncertainty-tolerant interaction with the world. Rather than
debugging or characterizing the flawed internal representations, the planning process itself is biased, through experience, to prefer the construction of plans that are less sensitive to the representational flaws. In this way the projection process becomes more reliable with experience. Completable planning and contingent EBL take advantage of the benefits provided by classical planning and reactivity while ameliorating some of their limitations through learning. Perfect characterizations of the real world are difficult to construct, and thus classical planners are limited to toy domains. However, the real world often follows predictable patterns of behavior, which reactive planners are unable to utilize due to their nearsightedness. Contingent EBL enables the learning of plans for use in the completable planning approach, which provides for the goal-directed behavior of classical planning while allowing for the flexibility provided by reactivity. This makes it particularly well-suited to many interesting real-world domains. It is our belief that machine learning will play an increasingly central role in systems that reason about planning and action. Through techniques such as explanation-based learning, a system can begin to actively adapt to its problem-solving environment. In so doing, effective average-case performance may be possible by exploiting information inherent in the distribution of problems, while simultaneously avoiding the known pitfall of attempting guaranteed or bounded worst-case domain-independent planning.
REFERENCES
Agre, P. & Chapman, D. (1987). Pengi: An Implementation of a Theory of Activity. Proceedings of the National Conference on Artificial Intelligence (pp. 268-272). Seattle, Washington: Morgan Kaufmann.
Bennett, S. W. (1990). Reducing Real-world Failures of Approximate Explanation-based Rules. Proceedings of the Seventh International Conference on Machine Learning (pp. 226-234). Austin, Texas: Morgan Kaufmann.
Brooks, R. A. (1982). Symbolic Error Analysis and Robot Planning (Memo 685). Cambridge: Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Brooks, R. A. (1987). Planning is Just a Way of Avoiding Figuring Out What to Do Next (Working Paper 303). Cambridge: Massachusetts Institute of Technology, Artificial Intelligence Laboratory.
Chapman, D. (1987). Planning for Conjunctive Goals. Artificial Intelligence 32, 3, 333-377.
Chien, S. A. (1989). Using and Refining Simplifications: Explanation-based Learning of Plans in Intractable Domains. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 590-595). Detroit, Michigan: Morgan Kaufmann.
Cohen, W. W. (1988). Generalizing Number and Learning from Multiple Examples in Explanation Based Learning. Proceedings of the Fifth International Conference on Machine Learning (pp. 256-269). Ann Arbor, Michigan: Morgan Kaufmann.
Cohen, P. R., Greenberg, M. L., Hart, D. M., & Howe, A. E. (1989). Trial by Fire: Understanding the Design Requirements for Agents in Complex Environments. Artificial Intelligence Magazine 10, 3, 32-48.
Davis, E. (1986). Representing and Acquiring Geographic Knowledge. Morgan Kaufmann.
DeJong, G. F. & Mooney, R. J. (1986). Explanation-Based Learning: An Alternative View. Machine Learning 1, 2, 145-176.
Drummond, M. & Bresina, J. (1990). Anytime Synthetic Projection: Maximizing the Probability of Goal Satisfaction. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 138-144). Boston, Massachusetts: Morgan Kaufmann.
Erdmann, M. (1986). Using Backprojections for Fine Motion Planning with Uncertainty. International Journal of Robotics Research 5, 1, 19-45.
Fikes, R. E., Hart, P. E., & Nilsson, N. J. (1972). Learning and Executing Generalized Robot Plans. Artificial Intelligence 3, 4, 251-288.
Firby, R. J. (1987). An Investigation into Reactive Planning in Complex Domains. Proceedings of the National Conference on Artificial Intelligence (pp. 202-206). Seattle, Washington: Morgan Kaufmann.
Forbus, K. D. (1984). Qualitative Process Theory. Artificial Intelligence 24, 85-168.
Fox, B. R. & Kempf, K. G. (1985). Opportunistic Scheduling for Robotic Assembly. Proceedings of the Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 880-889).
Gervasio, M. T. (1990a). Learning Completable Reactive Plans Through Achievability Proofs (Technical Report UIUCDCS-R-90-1605). Urbana: University of Illinois, Department of Computer Science.
Gervasio, M. T. (1990b). Learning General Completable Reactive Plans. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1016-1021). Boston, Massachusetts: Morgan Kaufmann.
Gervasio, M. T. & DeJong, G. F. (1991). Learning Probably Completable Plans (Technical Report UIUCDCS-91-1686). Urbana: University of Illinois, Department of Computer Science.
Hammond, K. (1986). Learning to Anticipate and Avoid Planning Failures through the Explanation of Failures. Proceedings of the National Conference on Artificial Intelligence (pp. 556-560). Philadelphia, Pennsylvania: Morgan Kaufmann.
Hammond, K., Converse, T., & Marks, M. (1990). Towards a Theory of Agency. Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 354-365). San Diego, California: Morgan Kaufmann.
Homem de Mello, L. S. & Sanderson, A. C. (1986). And/Or Graph Representation of Assembly Plans. Proceedings of the National Conference on Artificial Intelligence (pp. 1113-1119). Philadelphia, Pennsylvania: Morgan Kaufmann.
Hutchinson, S. A. & Kak, A. C. (1990). SPAR: A Planner That Satisfies Operational and Geometric Goals in Uncertain Environments. Artificial Intelligence Magazine 11, 1, 30-61.
Kaelbling, L. P. (1986). An Architecture for Intelligent Reactive Systems. Proceedings of the 1986 Workshop on Reasoning About Actions & Plans (pp. 395-410). Timberline, Oregon: Morgan Kaufmann.
Lozano-Perez, T., Mason, M. T., & Taylor, R. H. (1984). Automatic Synthesis of Fine-Motion Strategies for Robots. International Journal of Robotics Research 3, 1, 3-24.
Malkin, P. K. & Addanki, S. (1990). LOGnets: A Hybrid Graph Spatial Representation for Robot Navigation. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1045-1050). Boston, Massachusetts: Morgan Kaufmann.
Martin, N. G. & Allen, J. F. (1990). Combining Reactive and Strategic Planning through Decomposition Abstraction. Proceedings of the Workshop on Innovative Approaches to Planning, Scheduling and Control (pp. 137-143). San Diego, California: Morgan Kaufmann.
Minton, S. (1988). Learning Search Control Knowledge: An Explanation-Based Approach. Norwell: Kluwer Academic Publishers.
Mitchell, T. M., Mahadevan, S., & Steinberg, L. I. (1986). LEAP: A Learning Apprentice for VLSI Design. Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 573-580). Los Angeles, California: Morgan Kaufmann.
Mitchell, T. M., Keller, R., & Kedar-Cabelli, S. (1986). Explanation-Based Generalization: A Unifying View. Machine Learning 1, 1, 47-80.
Mitchell, T. M. (1990). Becoming Increasingly Reactive. Proceedings of the Eighth National Conference on Artificial Intelligence (pp. 1051-1058). Boston, Massachusetts: Morgan Kaufmann.
Mooney, R. J. & Bennett, S. W. (1986). A Domain Independent Explanation-Based Generalizer. Proceedings of the National Conference on Artificial Intelligence (pp. 551-555). Philadelphia, Pennsylvania: Morgan Kaufmann.
Mooney, R. J. (1990). A General Explanation-Based Learning Mechanism and its Application to Narrative Understanding. London: Pitman.
Mostow, J. & Bhatnagar, N. (1987). FAILSAFE—A Floor Planner that uses EBG to Learn from its Failures. Proceedings of the Tenth International Joint Conference on Artificial Intelligence. Milan, Italy: Morgan Kaufmann.
Rosenschein, S. J. & Kaelbling, L. P. (1987). The Synthesis of Digital Machines with Provable Epistemic Properties (CSLI-87-83). Stanford: CSLI.
Sacerdoti, E. (1974). Planning in a Hierarchy of Abstraction Spaces. Artificial Intelligence 5, 115-135.
Schoppers, M. J. (1987). Universal Plans for Reactive Robots in Unpredictable Environments. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 1039-1046). Milan, Italy: Morgan Kaufmann.
Segre, A. M. (1988). Machine Learning of Robot Assembly Plans. Norwell: Kluwer Academic Publishers.
Shavlik, J. W. & DeJong, G. F. (1987). An Explanation-Based Approach to Generalizing Number. Proceedings of the Tenth International Joint Conference on Artificial Intelligence (pp. 236-238). Milan, Italy: Morgan Kaufmann.
Shell, P. & Carbonell, J. (1989). Towards a General Framework for Composing Disjunctive and Iterative Macro-operators. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence (pp. 596-602). Detroit, Michigan: Morgan Kaufmann.
Stefik, M. (1981). Planning and Metaplanning (MOLGEN: Part 2). Artificial Intelligence 16, 2, 141-170.
Suchman, L. A. (1987). Plans and Situated Actions. Cambridge: Cambridge University Press.
Tadepalli, P. (1989). Lazy Explanation-Based Learning: A Solution to the Intractable Theory Problem. Proceedings of the Eleventh International Joint Conference on Artificial Intelligence. Detroit, Michigan: Morgan Kaufmann.
Turney, J. & Segre, A. (1989). SEPIA: An Experiment in Integrated Planning and Improvisation. Proceedings of The American Association for Artificial Intelligence Spring Symposium on Planning and Search (pp. 59-63).
Whitehall, B. L. (1987). Substructure Discovery in Executed Action Sequences (Technical Report UILU-ENG-87-2256). Urbana: University of Illinois, Department of Computer Science.
Wilkins, D. E. (1988). Practical Planning: Extending the Classical Artificial Intelligence Planning Paradigm. San Mateo: Morgan Kaufmann.
Wong, E. K. & Fu, K. S. (1985). A Hierarchical Orthogonal Space Approach to Collision-Free Path Planning. Proceedings of the 1985 Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 506-511).
Zadeh, L. A. (1965). Fuzzy Sets. Information and Control 8, 3, 338-353.
Zhu, D. J. & Latombe, J. C. (1990). Constraint Reformulation in a Hierarchical Path Planner. Proceedings of the 1990 Institute of Electrical and Electronics Engineers International Conference on Robotics and Automation (pp. 1918-1923). Cincinnati, Ohio: Morgan Kaufmann.
Chapter 4
THE ROLE OF SELF-MODELS IN LEARNING TO PLAN
Gregg Collins, Lawrence Birnbaum, Bruce Krulwich, and Michael Freed
Northwestern University
The Institute for the Learning Sciences
Evanston, Illinois
ABSTRACT
We argue that in order to learn to plan effectively, an agent needs an explicit model of its own planning and plan execution processes. Given such a model, the agent can pinpoint the elements of these processes that are responsible for an observed failure to perform as expected, which in turn enables the formulation of a repair designed to ensure that similar failures do not occur in the future. We have constructed simple models of a number of important components of an intentional agent, including threat detection, execution scheduling, and projection, and applied them to learning within the context of competitive games such as chess and checkers.
INTRODUCTION
The search for a domain-independent theory of planning has been a dominant theme in AI since its inception. This concern was explicit, for example, in Newell and Simon's (1963) pioneering model of human problem solving and planning, the General Problem Solver (GPS). The line of classical planning work that followed GPS, including STRIPS (Fikes and Nilsson, 1971), ABSTRIPS (Sacerdoti, 1974), and NOAH (Sacerdoti, 1977), has maintained this concern to the present day, as is perhaps best exemplified by Wilkins's (1984) SIPE. The reason for this concern seems clear enough, if we consider the alternative: Without a domain-independent theory, a planner cannot be viewed as anything more than a collection of domain- and task-dependent routines having
no particular relationship to each other. Such a view of planning offers no approach to the problem of adapting a planner to new domains. The pursuit of a domain-independent theory of planning, however, has led to an unexpected and unfortunate outcome, in that the resulting models are essentially general-purpose search procedures, embodying hardly any knowledge about planning, and no knowledge about the world. This knowledge resides, instead, in the operators that specify the search space. In effect, such models of planning are analogous to programming languages: Generality has been achieved, but only by sacrificing almost completely any useful constraints on how the planner should be programmed. The responsibility for both the efficiency of the planning process, and the efficacy of the resulting plans, lies almost entirely with the human being who writes the operator definitions. In other words, we are left with a domain-independent theory of planning that offers very little guidance in attempting to adapt planners to new domains. Unfortunately, this was the very problem that motivated the search for such a theory in the first place. The alternative, then, is to assume that a domain-independent theory of planning must be knowledge-intensive rather than knowledge-poor, if it is to provide effective guidance in adapting to new domains. Human planners know a great deal about planning in the abstract, and it is this knowledge that enables them to adapt quickly to new domains and tasks. Our approach thus takes much of its inspiration from Sussman's (1975) and Schank's (1982) work showing how abstract planning knowledge, in the form of critics or thematic organization points (TOPs), can be used to improve planning performance in specific domains. More generally, we wish to construct a theory in which the detailed, domain-specific knowledge necessary for effective planning is generated by the planner itself, as a product of the interaction between the planner's knowledge about planning and its experience in particular domains. Our ultimate goal is a model that is capable of transferring lessons learned in one domain to other domains, through the acquisition of such abstract planning knowledge itself (Birnbaum and Collins, 1988).
DEBUGGING THE PLANNER
Any theory of learning must first of all address the question of when to learn. Sussman (1975) pioneered an approach to this problem, which has come to be known as failure-driven learning, in which
learning is motivated by the recognition of performance failures. A failure-driven learning system contains a debugging component that is called into play whenever the system's plans go awry; the repairs suggested by this component are then incorporated into the system's planning knowledge, thereby improving future performance in similar situations. Because this approach directly relates learning to task performance, it has become the dominant paradigm for learning how to plan within AI (see, e.g., Schank, 1982; Hayes-Roth, 1983; Kolodner, 1987; Simmons, 1988; Hammond, 1989a). Of course, such an approach immediately raises the question: what is being debugged? The obvious answer, and the one that has generally been taken for granted, is simply "plans." This view is based on what has come to be known as the "classical" tradition in planning, in which planners are assumed to produce as output completely self-contained, program-like plans, which plan executors are assumed to then faithfully carry out. The completely self-contained nature of plans within this framework leads rather naturally to the assumption that whenever an agent fails to achieve its goals, the fault must lie within the individual plans themselves. However, the classical conception of planning has become increasingly untenable as the role of reactivity in goal-directed behavior has become more clearly understood (see, e.g., Hayes-Roth and Hayes-Roth, 1979; Brooks, 1986; Agre and Chapman, 1987; Firby, 1989; Hammond, 1989b). The shift towards reactive models of planning has, in particular, called into question the idea that plans are completely self-contained structures. In so doing, it raises serious problems for any theory that is based on the idea of debugging monolithic plans of this sort. Reactive models of planning are in large part motivated by the recognition that, since the conditions under which an agent's plans will be carried out cannot be completely anticipated, much of the responsibility for determining the particular actions that the agent will perform at a given time must lie in the plan execution component of that agent, rather than resting exclusively with the plans themselves. In order to be capable of carrying out the additional responsibilities required by these models, the plan execution component can no longer be taken to be a simple, general-purpose program interpreter. Rather, it must be seen as a highly articulated set of components, each devoted to controlling a particular aspect of behavior.
Consider, for example, a simple plan for keeping a piece safe in chess, formulated as a self-contained, program-like structure:

while the piece is on the board do
    if a threat against the piece is detected then either
        a. move the piece
        b. guard the piece
        c. interpose another piece
        d. remove the threat
        e. ... etc.

There are two key points to notice about this plan. First, an agent cannot yield complete control of its behavior to a plan of this sort, because the plan will never relinquish control unless a threat is successfully executed and the piece is taken, and the agent cannot afford to execute such a plan to the exclusion of all others. Second, the details of the actions to be carried out in service of this plan cannot be specified in very much detail in advance of detecting a particular threat. Thus, to be able to carry out such a plan, the agent must perform some version of timesharing or multitasking. The plan must relinquish control of the agent's computational, perceptual, and behavioral resources until such time as a threat against the piece is actually detected. This in turn implies that some executive component of the agent must be charged with the responsibility of returning control to the plan at the appropriate time, i.e., when such a threat is detected. Thus, a task that was formerly the responsibility of individual plans—threat detection—now becomes the responsibility of a specialized component of the planning architecture. In light of the above discussion, we need to reconsider our original question of what is being debugged in a failure-driven approach to learning how to plan. Since a great deal of the responsibility for determining what to do has now been shifted to the agent's plan executor, any adequate approach to learning by debugging must be capable of determining the causes of, and repairs for, performance errors arising from the operation of this execution component. Approaches that consider only the plans themselves as the objects to be debugged are obviously incapable of making such determinations. Thus, as more responsibility is shifted to the plan executor, the focus of debugging effort must be shifted there as well.
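The following minimal sketch, in Python, illustrates this shift: the condition-checking loop is removed from the plan, and threat detection becomes the job of an executive that resumes the plan when a threat is detected. All names here (Threat, KeepPieceSafePlan, executive_step) are illustrative assumptions rather than the actual architecture.

from collections import namedtuple

Threat = namedtuple('Threat', ['target', 'source'])

def applicable(response, piece, threat):
    # Stand-in for a domain-specific legality test.
    return response == 'move-piece'

class KeepPieceSafePlan:
    def __init__(self, piece):
        self.piece = piece

    def resume(self, threat):
        # The plan no longer polls for threats; the executive resumes it
        # only when a threat against its piece has been detected.
        for response in ('move-piece', 'guard-piece',
                         'interpose-piece', 'remove-threat'):
            if applicable(response, self.piece, threat):
                return (response, self.piece, threat)
        return None

def executive_step(board, plans, detect_threats):
    # The executive owns the ubiquitous task of threat detection and
    # returns control to the relevant plan at the appropriate time.
    for threat in detect_threats(board):
        for plan in plans:
            if plan.piece == threat.target:
                return plan.resume(threat)
    return None   # no threat detected: resources remain free for other plans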
Moreover, this shift offers a basis for addressing our original concern of how to adapt a planner to new domains. Errors that arise from the operation of the plan executor are the very sorts of errors that are most likely to yield lessons of broad applicability. Because all plans make extensive use of the components that comprise the plan executor, repairing bugs in these commonly held resources has the potential to improve the execution of any plan, regardless of the domain in which it is intended to function. More generally, this argument applies to any component of the intentional architecture: When ubiquitous functions such as threat detection are assigned to specialized components of the agent's architecture, any improvement in a particular component benefits all plans utilizing that component. Thus, learning that occurs in the context of one task or domain may subsequently yield improved performance in other tasks and domains. In a sense, the increased specialization entailed by this approach offers opportunities to increase efficiency in much the same way that manufacturing efficiency exploits specialization of workers and equipment on an assembly line: By breaking plans up into constituent pieces, and distributing responsibility for those pieces among components of the agent specialized for those purposes, we can optimize each component for its particular purpose. To extend the analogy, when a faulty item is discovered coming out of a factory, one might simply repair that item and continue on; but it is obviously more sensible to determine where in the manufacturing process the fault was introduced, and to see whether anything can be done to avoid such problems in the future. Our thesis is that a similar approach can be applied when learning how to plan. To put this in plainer terms, when a plan fails, debug the planner, not just the plan.
A MODEL-BASED APPROACH TO DEBUGGING
From the perspective outlined above, the process of debugging an intentional system must involve determining which element of that system is responsible for an observed failure. This is a difficult problem inasmuch as the architecture of the agent is, as we have argued, a rather complex mechanism. Our approach to this problem utilizes model-based reasoning, a methodology that has been developed in AI for reasoning about and debugging complex mechanisms such as electronic circuits (see, e.g., Stallman and Sussman, 1977; Davis, 1984; deKleer and Williams, 1987). In this paradigm for debugging, the diagnostic system uses a model of the device being debugged to
generate predictions about what the behavior of the device would be if it were functioning properly. These predictions are then compared with the actual behavior of the device. When a discrepancy is detected, the diagnostic system attempts to determine which of a set of possible faults is the underlying cause of the discrepancy. A fault is expressed as the failure of an assumption in the device model. For example, the model of an electronic circuit might include a number of assumptions of the following sort: that each circuit component is working according to specification, that each connection between these components is functioning properly, and that the input and output of the circuit have certain characteristics. A circuit debugger using such a model would then generate a set of predictions, for example, that the voltage across a given resistor should have a certain value. If the measured voltage were found to disagree with the prediction, the system would try to fault one or more of the assumptions included in the model. A reasonable diagnosis might thus be, for example, that a particular transistor was not functioning as specified by the model, or that the input voltage to the circuit was not as anticipated. The key issue in model-based debugging is inferring the faulty assumptions underlying an observed symptom. The ability to relate failed predictions to underlying assumptions in this way depends upon understanding how those predictions follow from the assumptions. Inasmuch as the performance expectations are generated by inference from the model in the first place, the most straightforward approach to this task is to record these inferences in the form of explicit justification structures (see, e.g., deKleer et al., 1977; Doyle, 1979); one of the first applications of such explicit reasoning records to the task of learning to plan can be found in Carbonell's (1986) notion of derivational traces. By examining these justification structures, the system can then determine which assumptions of the model are relevant to an observed failure. (Of course, this does not guarantee the system's ability to diagnose the cause of the failure, since it may be ambiguous which assumption was responsible for the fault; we will discuss the use of justification structures to support debugging in more detail below.) Applied to our task of learning to plan by debugging, the paradigm described above comprises the following two steps: First, a model of the agent's intentional architecture is used to generate predictions about the performance of its plans; second, deviations from these predictions are used to pinpoint where in the mechanism an observed fault lies.
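As a rough illustration of such reasoning records, the following sketch shows one way justification structures might be represented; the node type and field names are assumptions made for exposition, not the machinery of the systems cited above.

class Justification:
    def __init__(self, claim, supporters=(), disjunctive=False):
        self.claim = claim                   # the expected proposition
        self.supporters = list(supporters)   # beliefs it was inferred from
        self.disjunctive = disjunctive       # True: any supporter suffices;
                                             # False: all are required

# For example, the prediction that a threat will be blocked, justified by
# the assumptions that a counterplan exists and that it runs in time:
blocked = Justification(
    "threat-blocked",
    supporters=[Justification("counterplan-exists"),
                Justification("counterplan-executed-before-deadline")])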
Thus, our approach entails that an intentional agent needs a model of itself in order to adequately diagnose its failures. Such a model will enable an agent to determine for itself where the responsibility for an observed failure lies. Determining the locus of failure is only part of the problem, however. Some way must also be found to ensure that similar failures do not occur in the future. This entails modifying either the agent, its environment, or both. Our primary concern here is with learning, that is, improving the performance of the agent itself in response to failures. Our approach to this task is again based on an analogy with man-made systems, such as engines, television sets, or factory production lines. Just as optimizing the performance of such a system involves manipulating the controls of that system in response to feedback about the system's performance, improving an agent's performance will similarly involve finding the right control parameters to manipulate, and changing their settings in response to perceived plan failures. Consider, for example, a task that must be carried out by any real-world agent: threat detection. If the agent were, in general, detecting threats too late to respond to them effectively, despite having adequate time to do so in principle, a sensible response would be to increase the rate at which threats are checked for. On the other hand, if the agent were spending too much time checking for threats, and neglecting other tasks as a result, the parameter governing this rate should be adjusted the other way, so that less time is spent on the detection process. To provide an adequate basis for a theory of learning, then, our debugging model must support reasoning of this type, i.e., it must provide a means of identifying the controllable parameters of the agent that are relevant to a given fault. In order for the debugger to identify relevant control parameters in this way, knowledge about those parameters and their settings must be part of the agent's model of itself. If the model includes assumptions about the correctness of the current settings of controllable parameters of the system, then the diagnosis process outlined above can, in principle, determine when the setting of a given parameter is responsible for an observed fault. Such a diagnosis suggests an obvious repair, namely, adjusting the faulty parameter setting in some way. In the example given above, for instance, the model might contain an assumption stating that the rate at which the agent checks for threats is rapid enough so that every threat that arises will be detected in time to respond to it
effectively. If this assumption is faulted, for example, when the system fails to counter some threat, then the appropriate repair is clearly to increase this rate. Thus, model-based reasoning not only provides a paradigm for fault diagnosis, it also provides a basis for a theory of repair.
THE MODEL OF THE AGENT
A central issue in our approach is the development of explicit models for intentional agents that can be used in debugging their performance. We have constructed simple models of a number of important components of an intentional agent, including threat detection, execution scheduling, projection, and case retrieval and adaptation. These models have been implemented in a computer program called CASTLE (for "Concocting Abstract Strategies Through Learning from Expectation failures"), and applied to learning within the context of competitive games such as chess and checkers (see, e.g., Collins et al., 1989; Birnbaum et al., 1990; Birnbaum et al., 1991; Collins et al., 1991). In this section we will describe models of two aspects of a simple planner, dealing with threat detection and execution scheduling.
A model of threat detection
The task of the threat detector is to monitor the current state of the world, looking for situations that match the description of known threats. In our approach, this matching task is modeled as a simple rule-based process: The planner's threat detection knowledge is encoded as a set of condition-action rules, with each rule being responsible for recognizing a particular type of threat. In chess, for example, the planner could be expected to possess rules specifying, among other things, the various board configurations that indicate that an attack on a piece is imminent. When a threat-detection rule is triggered, the threat description associated with that rule is passed on to the plan selection component of the system, which will attempt to formulate a response to the threat. Because the system cannot, in general, afford the cognitive resources that would be necessary to apply all known threat-detection rules to all input features at all times, the threat-detection component also includes
two mechanisms for modulating the effort spent on this task: First, the threat-detection rules are evaluated at intervals, where the length of the interval is an adjustable parameter of the system; second, attention focusing circumscribes the domain to which the threat-detection rules will be applied.

∀x, t  detect(x, t) & t < ert(x) → added-to(x, T(t))
    Threats are placed on a threat queue when detected.
∀x, t1, t2  added-to(x, T(t1)) & t2 < ert(x) & ¬∃t' (t1 < t' ≤ t2 & remove-from(x, T(t'))) → x ∈ T(t2)
    Threats remain on the threat queue until removed.
could-detect(r, x, t, q)
    A rule is capable of detecting a threat at a particular time if there exist bindings such that evaluating the rule with those bindings at that time would result in the detection of the threat.
could-detect-in-time(r, x, q)
    A rule can detect a threat in time if there is a time, while that threat is active but before it can be realized, at which the rule could detect that threat.
∀x ∃r, q  could-detect-in-time(r, x, q)
    Our rules are capable of detecting all active threats in time.
∀x  (∃r, q  could-detect-in-time(r, x, q)) → (∃r, q, t  focus-results(t) ⊇ q & active(x, t) & t < ert(x) & could-detect(r, x, t, q))
    Any threat that could be detected without attention focusing can still be detected with such focusing.
∀t  focus-results(t) = union-over(F(t), focus-rule-evaluation-result(f, t))
    The bindings available at a given time are the results of applying all of the focus rules at that time.
∀r, t  evaluate-rule(r, focus-results(t), t)
    At each turn, all of the threat rules are applied using bindings provided by the focus rules.
∀x, t  remove-from(x, T(t)) ↔ x ∈ T(t) & ¬active(x, t)
    A threat is removed from the threat queue when it is no longer active.

Figure 1: Partial model of threat detection. (The predicate "ert" stands for earliest realization time, i.e., the earliest time at which the threatened action can be expected to occur.)
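The following is a minimal sketch of the detection cycle these axioms describe, assuming threat rules and focus rules are plain functions; none of the names are CASTLE's own.

def focus_results(board, focus_rules):
    # union-over(F(t), ...): pool the bindings proposed by every focus rule.
    bindings = set()
    for focus in focus_rules:
        bindings |= focus(board)
    return bindings

def detection_cycle(board, threat_rules, focus_rules, threat_queue, is_active):
    # evaluate-rule: each threat rule is applied only to the focused bindings.
    for binding in focus_results(board, focus_rules):
        for rule in threat_rules:
            threat = rule(board, binding)     # None if the rule does not fire
            if threat is not None:
                threat_queue.append(threat)   # added-to(x, T(t))
    # remove-from: drop threats that are no longer active.
    threat_queue[:] = [x for x in threat_queue if is_active(x, board)]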
Let's consider this last issue in more detail. Broadly speaking, the agent has a choice between two strategies for threat detection: It can recompute the set of outstanding threats each time it checks its rules, or it can incrementally compute this set by noting the changes that have occurred since the previous check. In turn-taking games, such as chess, the latter approach is arguably more efficient, and indeed appears to be the one employed by humans. In order to benefit from the incremental approach, however, a planner must find a way to ensure that it can detect new threats without having to re-detect all of the old threats, since otherwise it is in effect doing all of the work entailed by the recomputation approach. Given the rule-based framework for threat detection described above, focusing on threats resulting from changes can be implemented as a set of restrictions on the domain of application of the threat-detection rules. In our model these restrictions are themselves implemented as a set of focus rules that delimit the set of bindings over which the threat-detection rules are evaluated. The above is a rather brief outline of the threat detection model we have developed. A portion of the model is shown in Figure 1.
A model of execution scheduling
Since a planner has limited resources, the formulation of a viable plan in service of an active goal is not, by itself, enough to guarantee that the plan will be carried out. Thus, another important aspect of any planner is a priority-based scheduling mechanism for determining which plans should be executed when. While such a mechanism could, in principle, be arbitrarily complex, we have chosen to model execution scheduling using a simple priority queue, assuming discrete time (which is sufficient for the turn-taking games we are currently investigating). Given such an approach, the basic model of execution scheduling is as follows: Once a goal is formed, a plan is chosen for that goal, assigned a deadline and a priority, and placed on the priority queue. At each time increment, the queue is checked, and the highest priority plan on the queue is chosen for execution. Thus a plan will be successfully executed if and only if there is a time before its deadline when it is the highest priority plan on the queue. A portion of the model appears in Figure 2.
∀p, t, g  execute(p, t) & t < deadline(p) & plan-for(p, g) → achieve(g)
    A goal will be achieved if a plan for that goal is executed before its deadline.
∀p, t  execute(p, t) ↔ p ∈ Q(t) & highest-priority(p, t)
    A plan will be executed at a particular time if it is on the execution queue, and is the highest-priority plan at that time.
∀p, t  highest-priority(p, t) ↔ ¬∃p'  p' ∈ Q(t) & priority(p) < priority(p')
    A plan has the highest priority at a given time if there is no other plan on the queue with a higher priority at that time.
∀p, ts, te  added-to(p, Q(ts)) & ts ≤ te & ¬∃t' (ts < t' ≤ te & remove-from(p, Q(t'))) → p ∈ Q(te)
    A plan that is put on the queue stays on the queue until it is removed.
∀g, t, p  goal(g, t) & call-planner(g, t) & planner(g, t) = p → added-to(p, Q(t))
    If the planner is given a goal, a plan for that goal will be added to the execution queue.

Figure 2: Partial model of execution scheduling.
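As a concrete illustration of this model, here is a minimal sketch of such a scheduler, assuming plans are callables; the class and its details are illustrative, not CASTLE's implementation.

import heapq
import itertools

class Scheduler:
    def __init__(self):
        self.queue = []                # entries: (-priority, seq, deadline, plan)
        self.time = 0
        self._seq = itertools.count()  # tiebreaker so plans are never compared

    def add_plan(self, plan, priority, deadline):
        heapq.heappush(self.queue, (-priority, next(self._seq), deadline, plan))

    def step(self):
        # One time increment: execute(p, t) for the highest-priority plan.
        self.time += 1
        # A goal is achieved only if its plan runs before its deadline, so
        # plans whose deadlines have passed are discarded as failures.
        self.queue = [e for e in self.queue if e[2] >= self.time]
        heapq.heapify(self.queue)
        if self.queue:
            _, _, _, plan = heapq.heappop(self.queue)
            plan()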
CASE STUDY I: THE FORK
The models described above were developed as part of an account of learning in competitive situations—in particular, chess. In use, therefore, the models are augmented by chess-specific assumptions that specify how the mechanisms they describe are being deployed in order to block threats in chess. For example, these assumptions include the assertions that all threats against material will be detected, that goals to block those threats will be formulated, and that those goals will succeed. These assumptions reflect the performance goals that the system has for itself in the domain of chess.
Figure 3: Fork example; Opponent (white) to move
We will now consider an instance of learning in response to a classic chess tactic, the fork. In our account, the novice chess player initially notices the fork when it fails to block a threat against one of its pieces. Consider a situation in which white's rook and knight are under attack from a black pawn (see Figure 3). Though it is white's turn to play, there is no move that would protect both pieces. Since rooks are more valuable than knights, white will save the rook, and black will capture the knight. If our planner is placed in this situation, the capture of the knight will cause a failure of its expectation that all threats to its pieces will be blocked. We now give an informal account of how fault diagnosis must proceed in this case. (The fault diagnosis procedure is described more fully below.) By the first two axioms of execution scheduling (see Figure 2), the planner's ability to block a threat depends upon there being a time such that a counterplan for the threat is on the queue, it is before the deadline, and there is no higher priority plan on the queue. Thus, the failure of the goal to block the threat implies that there was no such time, i.e., that some necessary condition for execution must have been missing at each time when the plan might, in principle, have been executed. We can therefore group all of the times into equivalence classes, based on which necessary condition failed to hold. In order to fix the problem, the learner must ensure that in similar situations in the future, all of these conditions hold for at least one time. That, in turn, entails picking a time out of one of these failure equivalence classes and figuring out how the condition that failed at that time could have been made true, while ensuring that the other conditions would remain true as well. The learner is thus presented with a choice of possible ways to fix the problem, depending upon which missing condition it attempts to reinstate. For example, it could have executed the block at an earlier time if it could have found a way to schedule the plan earlier; it could have executed the block at a later time if it could have found a way to delay the execution of the threat, say, by putting the opponent in check; and, finally, it could have executed the block at the time when it saved the rook, and saved the knight instead, if it could have made the plan for saving the knight a higher priority than the plan for saving the rook. In the particular instance of the fork that we are considering here, the only possible approach is to try to execute the blocking plan earlier. However, by the axioms describing the execution scheduling and threat detection mechanisms, this requires adding the blocking plan to the priority queue earlier, which in turn requires formulating a goal to block
the threat earlier, which, finally, requires detecting the threat earlier. Reasoning with the model in this way thus leads to the conclusion that the agent must detect forks earlier than it needs to detect simple threats against material.
DIAGNOSING FAULTS
The goal of fault diagnosis, as discussed above, is to explain the failure of a prediction as a consequence of the failure of some set of underlying assumptions. The connections between underlying assumptions and consequent expectations are represented in terms of explicit justification structures. For instance, the justification structure underlying the expectation that the planner will block the threat to its knight in the fork example above is a conjunction of three antecedent expectations: That moving the knight will block the threat, that this move will be executed, and that such execution will take place before the opponent takes the knight. The expectation that the knight move will be executed is in turn justified by the conjunction of two prior expectations: That the move appears on the plan queue, and that it is the highest priority item on the queue. These expectations are in turn justified by another level of expectations, and so on. Diagnosing a fault thus involves "backing up" through these justification structures, recursively explaining the failure of an expectation as the result of a failure of one or more of its immediately antecedent expectations. This approach is fundamentally similar to that proposed by Smith, Winston, Mitchell, and Buchanan (1986) and Simmons (1988). More generally, a justification structure links an expectation with a set of supporting beliefs. Such a justification may be either conjunctive, meaning all the supporters must be true to justify the expectation, or disjunctive, meaning at least one of the supporters must be true to justify the expectation. When an expectation fails, the diagnosis algorithm attempts to determine which of its supporters should be faulted. If the support set is conjunctive, then at least one supporter must be faulted. If the support set is disjunctive, then all of the supporters must be faulted. The basic algorithm non-deterministically expands the set of faulted expectations according to these simple rules until a stopping criterion (namely, the ability to generate a repair) is met. However, when faulting conjunctive supports, the degree of arbitrary choice required can be reduced by checking whether a proposition in the support set was observed to be true during the execution of the plan, and if so exonerating it, i.e., removing it from consideration.
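A minimal sketch of this fault-propagation step, under the assumption that justifications are simple nodes with claim, supporters, and disjunctive fields (as in the earlier sketch), and with observed_true and repairable standing in for the system's data base and repair module:

def diagnose(failed, observed_true, repairable):
    # Expand sets of faulted expectations until one admits a repair.
    frontier = [[failed]]
    while frontier:
        faults = frontier.pop()
        if repairable(faults):          # stopping criterion: a repair exists
            return faults
        for f in faults:
            if not f.supporters:
                continue                # bottomed out in a bare assumption
            rest = [g for g in faults if g is not f]
            if f.disjunctive:
                # A disjunctive justification fails only if every disjunct
                # failed, so all supporters are faulted together.
                frontier.append(rest + list(f.supporters))
            else:
                # A conjunctive justification needs only one faulted
                # supporter; supporters observed to be true are exonerated.
                for s in f.supporters:
                    if s.claim not in observed_true:
                        frontier.append(rest + [s])
    return None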
Since the goal of the diagnosis process is to provide enough information to allow a repair to be effected, the process should, in principle, continue until a repair is actually generated. Because of this, the diagnostic module in our system calls the repair module at each step, checking to see if the current set of faulted expectations provides enough information for a repair to be generated. This is particularly important for reducing the effort entailed in propagating a fault through a disjunctive justification. Although, in principle, every proposition in a disjunctive support set must be faulted when the supported proposition is faulted, it does not necessarily follow that the diagnostic procedure must explain the failure of each disjunct. Since the overarching goal of the process is to fix the problem, and since the repair of a single disjunct will suffice to restore the supported expectation, it may be enough to explain the failure of one disjunct. In other words, the faulting of a disjunctive justification results in the goal to unfault one of the disjuncts. One additional point deserves special mention. Previous work in justification-based diagnosis has by and large presumed that the assumptions embedded within justification structures will be simple propositions without any quantified variables. While this limitation is not generally a problem for fault diagnosis in, say, digital circuits (although it does pose problems in the analysis of circuits with memory), it is impossible to describe even moderately realistic models of planning and plan execution in such a restricted form. We have thus been forced to confront the issue of how current diagnostic methods can be extended to handle universally and existentially quantified assumptions. For example, there will be many instances in which the planner expects that the execution of a plan will involve some entity that meets a particular set of constraints, but does not know in advance the exact identity of this entity. This may be true, for example, of objects that will be found in the planning environment, including tools, raw materials, obstacles, and other agents. In our model, it is also true of execution times for plans, since the planner leaves the execution scheduler some latitude in determining when the plan should actually be carried out. The upshot is that many of the assumptions underlying the expectation that a plan will succeed are existentially quantified: They assert that a time, tool, or material meeting certain constraints will actually exist. The problem is that faulting such an existentially quantified assumption is equivalent to faulting an infinite set of
disjuncts, one for each object over which the existential ranges. Of course, even in principle, the diagnostic engine cannot consider why the assertion failed for each instance of the variable. Our solution to this problem is to partition the set of instances into classes in such a way that the elements of each class have all failed to meet exactly the same constraint or constraints, while meeting all others. This organizes the set of faulted instances in such a way that the repair module need only consider one instance of each type of failure, rather than repeatedly trying to unfault instances that have failed for the same reason, about which it can do nothing. In the case of the fork, for example, the planner's expectation that there will be a time at which it will be able to carry out the plan to save the knight depends on there being a time before the deadline when the plan is on the schedule queue, and is of higher priority than anything else on the queue. Each of these three constraints defines a class of times, namely, those at which the expectation failed because that particular constraint was not met. (In fact, there are several other equivalence classes, since multiple constraints may fail for a given time. However, since the goal of fault diagnosis given disjunctive justifications is to find something to unfault, we start with those classes corresponding to a single failed constraint, on the grounds that these are likely to prove easier to fix.) Universally quantified assumptions are equally prevalent in planning as are existentials, and similarly problematic in that such assumptions are equivalent to arbitrarily long conjunctions. For example, the model of threat detection described above includes the assumption that for every threat, there exists a threat detection rule capable of detecting that threat given appropriate bindings. This is equivalent to an arbitrarily long conjunction stating, individually for each threat that might ever arise, that a rule exists to detect that threat. Since such a justification structure cannot actually be built, the basic fault diagnosis algorithm cannot be employed any further. Our approach to this problem is to respond to the faulting of a universally quantified assumption by faulting a limited number of the conjuncts justifying the assumption, without attempting to construct the entire set of such conjuncts. In other words, when a universally quantified assumption is faulted, the diagnostic process searches directly for one or more specific counterexamples to the assumption.
"In fact, there are several other equivalence classes, since multiple constraints may fail for a given time. However, since the goal of fault diagnosis given disjunctive justifications is to find something to unfaulty we start with those classes corresponding to a single failed constraint on the grounds that these are likely to prove easier to fix.
THE FORK REVISITED
We are now in a position to describe in more detail how our system diagnoses the fault underlying its vulnerability to the fork. The fault diagnosis algorithm is initially called when the expectation monitor notices that a piece was taken despite the planner's expectation that it would be able to block the threat against that piece. The system starts by retrieving the justification structure that supports this expectation, as follows:

Faulting expectation: (ACHIEVE (BLOCK (CAPTURE-THREAT K2)))
Checking for justification... found justification:
(EXISTS (T2)
  (AND (PLANFOR (MOVE K2 3 4) (BLOCK (CAPTURE-THREAT K2)))
       (EXECUTE (MOVE K2 3 4) T2)
       (<= T2 (ERT (CAPTURE-THREAT K2)))))
This justification says roughly that the belief that it would block the capture threat to its knight was based on its belief that there would exist a time—T2—at which it would execute a counterplan to the capture, and that T2 would be early enough that there would be no possibility that the capture had already been effected. Because this supporting proposition is existentially quantified over the variable T2, the system formulates a set of equivalence classes based on the constraints on T2, as described above. Since the first conjunct in the justification—the assertion that the counterplan would, in principle, have blocked the threat—does not depend on T2, and since the truth of this proposition can be verified by examining the system's data base, the proposition is removed from consideration. In other words, since the system knows that it had a counterplan to the capture—namely, moving the knight—it exonerates this as a possible cause of the failure. Fault equivalence classes are then formed around the other two conjuncts:

Existentially quantified justification - collecting constraints on variable T2:
(PLANFOR (MOVE K2 3 4) (BLOCK (CAPTURE-THREAT K2))) does not reference T2
Verified proposition: (PLANFOR (MOVE K2 3 4) (BLOCK (CAPTURE-THREAT K2)))
Forming equivalence classes over values of variable T2:
--> T2 such that (EXECUTE (MOVE K2 3 4) T2) fails
--> T2 such that (<= T2 (ERT (CAPTURE-THREAT K2))) fails
The system has thus formulated two equivalence classes, one for times when the counterplan could not, for some reason, be executed, and another for times for which it was too late to block the threat anyway. The system must now determine which of these classes to attempt to unfault an instance of, by calling the repair module. A repair might be possible if, for example, the system were able to determine a way to put the opposing king in check while at the same time moving or defending one of the threatened pieces. Since no such opportunity is present in the current instance, we would not expect the system to find a repair at this juncture:

Disjunctive justification - calling repair for each disjunct:
Calling repair on (EXECUTE (MOVE K2 3 4) T2)... no repair plan found
Calling repair on (<= T2 (ERT (CAPTURE-THREAT K2)))... no repair plan found
Having failed to find a repair, the system must now extend its explanation of the fault by choosing one of the conjuncts in the previous support set and retrieving its justification. Because the conjunct appears in the scope of an existential quantifier, the propositions in its support set also fall within that scope. The process of extending the explanation thus resembles macro expansion, in that the elements in the support set replace the chosen conjunct in the existential.7 Once the existential has been expanded, the system must recompute the set of fault equivalence classes:

7 This is necessary because the unfaulter needs access to all of the constraints that apply to an existentially quantified variable, not just the faulted constraints, in order to ensure that the proposed repair does not simply trade one type of failure for another.

Expanding conjunct: (EXECUTE (MOVE K2 3 4) T2)
(EXISTS (T2)
  (AND (MEMBER (MOVE K2 3 4) Q T2)
       (HIGHEST-PRIORITY (MOVE K2 3 4) T2)
       (<= T2 (ERT (CAPTURE-THREAT K2)))))
Reformulating equivalence classes over values of variable T2:
--> T2 such that (MEMBER (MOVE K2 3 4) Q T2) fails
--> T2 such that (HIGHEST-PRIORITY (MOVE K2 3 4) T2) fails
--> T2 such that (<= T2 (ERT (CAPTURE-THREAT K2))) fails
Disjunctive justification - calling repair for each disjunct:
Calling repair on (MEMBER (MOVE K2 3 4) Q T2)... no repair plan found
Calling repair on (HIGHEST-PRIORITY (MOVE K2 3 4) T2)... no repair plan found
Calling repair on (<= T2 (ERT (CAPTURE-THREAT K2)))... no repair plan found
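The expansion step just traced amounts to splicing a conjunct's own justification into the existential in its place. A minimal Python sketch of that operation, using an illustrative tuple encoding of the terms rather than the system's actual representation:

def expand_conjunct(existential, target, support):
    # Replace `target` in the existential's conjunction with the
    # conjuncts that justify it, keeping every other constraint in
    # scope so the unfaulter still sees all conditions on the
    # quantified variable.
    quantifier, var, conjuncts = existential
    expanded = []
    for c in conjuncts:
        if c == target:
            expanded.extend(support)  # the macro-expansion step
        else:
            expanded.append(c)
    return (quantifier, var, expanded)

justification = ("EXISTS", "T2",
                 [("EXECUTE", ("MOVE", "K2", 3, 4), "T2"),
                  ("<=", "T2", ("ERT", ("CAPTURE-THREAT", "K2")))])
support = [("MEMBER", ("MOVE", "K2", 3, 4), "Q", "T2"),
           ("HIGHEST-PRIORITY", ("MOVE", "K2", 3, 4), "T2")]
print(expand_conjunct(justification, justification[2][0], support))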
Thus, the system has expanded the executability constraint into its two supporting conjuncts: a plan is executable if it is on the execution queue and is the highest priority item on that queue. Expanded within the context of the existential quantification of T2, we now have a justification structure asserting the expectation that there would be a time at which the counterplan to the knight capture was on the queue, that it was the highest priority task on the queue, and that the capture had not yet taken place. The system now constructs equivalence classes once again. This time there are three such classes: times before the counterplan was put on the queue, times at which the counterplan was on the queue but was not the highest priority, and times after the capture threat had been realized (in other words, times that were too early, times when there was something more important to do, and times when it was too late). This process continues until either a repairable fault is uncovered, or the system reaches assumptions that have no known justification. We now turn our attention to the problem of generating repairs once a fault has been diagnosed, within the context of a somewhat simpler example.

CASE STUDY II: DISCOVERED ATTACKS
As described above, our model makes a number of assumptions that reflect its performance goals in the domain of chess. These include,
in particular, the assumption that if a threat against material exists, it will be detected. There are, of course, a number of ways in which this assumption might fail; each of these presents our system with an opportunity to learn a different lesson. Consider, in particular, the example of discovered attacks in chess, in which the movement of one piece opens a line of attack for another piece. Novices often fall prey to such attacks, and the key point is that this is not because they fail to understand the mechanism of the threat, i.e., the way in which the piece can move to make the capture. Instead, the problem appears to be one of attention: novices simply fail to consider new threats arising from pieces other than the one just moved. In other words, the problem lies not in their ability to detect the given threat in principle, but rather in their decisions about where to look for threats in practice. In fact, without such a distinction—e.g., if the method for detecting threats entailed scanning the entire board after each move—the problem of discovered attacks, and the need to distinguish them from other sorts of attacks, would not even arise. The model of threat detection described above reflects this distinction by introducing a notion of focus rules that limit the application of threat-detection rules. The former embody knowledge of where to look; the latter, of what to look for. Our system is capable of learning both sorts of rules, depending upon the cause of the failure in a given situation.
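The division of labor between the two rule sets can be rendered schematically as follows. This Python sketch is only an illustration of the architecture described above; the rule interfaces are assumptions, not the system's actual code.

def detect_threats(board, focus_rules, threat_rules):
    # Focus rules propose bindings (where to look); threat-detection
    # rules are evaluated only over those bindings (what to look for).
    # A threat lying outside every focus rule's bindings therefore
    # goes unnoticed even if some detection rule could recognize it.
    threats = []
    for focus in focus_rules:
        for binding in focus(board):
            for rule in threat_rules:
                threat = rule(board, binding)
                if threat is not None:
                    threats.append(threat)
    return threats

On this architecture, a discovered attack is precisely a threat for which a detection rule exists but for which no focus rule delivers the needed bindings.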
Figure 4: Discovered attacks example; Opponent (white) to move
The following example depicts a scenario in which the system falls prey to a discovered attack, and thereby learns to improve the attention-focusing portion of its threat detection component. In figure 4(a), the system's opponent is to move. In figure 4(b), we see the result of the opponent's move, which is to advance one of its pawns one square. This opens a discovered attack by the opponent's bishop on the system's rook. However, because of overly restrictive attention-focusing, the system fails to notice this threat, despite the fact that it has a threat detection rule which, if applied to the appropriate portions of the board, would in fact have detected the threat. In figure 4(c), we see that the system has chosen to capture the opponent's pawn with its knight, leaving the threat on the rook unaddressed. Finally, in figure 4(d), the opponent carries out the threat and captures the rook.

Let's examine the system's decision-making in some detail. Any decision to make a particular move must be based upon some assessment of how well that move compares to the available alternatives. Such a comparison, in turn, depends upon the ability to project the implications of each alternative considered. Once an alternative is selected, relevant aspects of the projection made in service of that choice become, in effect, predictions about the future course of events upon which the rationality of the system's chosen action depends. These predictions, in turn, depend upon assumptions stemming from the system's self-model, for example, the assumption that its threat detection component is capable of detecting all outstanding threats. In response to the failure of one of these predictions, the system will attempt to determine which aspect of its decision-making process was at fault, as described above. In this particular case, the failed prediction is that capturing the opponent's pawn would be of greater value than blocking any extant threat. When the opponent takes the system's rook, this prediction is seen to have failed. The justification structure underlying the failed prediction is roughly the following: the system believes that no such threat exists because it has not detected such a threat and it assumes that it can detect all threats. The latter assumption, in turn, is justified by the assumption that there is a threat rule capable of detecting any given threat, and the assumption that the threat rules have been evaluated using appropriate bindings. Finally, the latter is justified by the assumption that the system's focus rules have generated the appropriate bindings. This justification structure, and the assumptions it contains, are derived from the model of threat detection described above.
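That chain of assumptions can be pictured as a small tree of beliefs, each supported by the beliefs below it. A schematic Python rendering, with illustrative labels rather than the system's actual vocabulary:

justification = {
    "belief": "capturing the pawn is better than blocking any threat",
    "because": [
        {"belief": "no threat against the rook was detected"},
        {"belief": "all existing threats would have been detected",
         "because": [
             {"belief": "a threat rule exists for any given threat"},
             {"belief": "threat rules were evaluated with appropriate bindings",
              "because": [
                  {"belief": "focus rules generated the appropriate bindings"}]}]}]}

def leaves(node):
    # Assumptions with no further support are the candidate faults
    # that diagnosis must either exonerate or indict.
    subs = node.get("because", [])
    if not subs:
        yield node["belief"]
    for sub in subs:
        yield from leaves(sub)

print(list(leaves(justification)))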
There are two components that might be at fault in this case: the threat rules and the focus rules. Determining which component is at fault means exonerating all of the alternatives. In this case, the reasoning involved turns on the model's assumption that the use of attention focusing is not degrading the system's ability to detect threats. Recall that in our model, attention focusing is implemented by using focus rules to restrict the range of bindings over which threat rules will be evaluated. Thus, the assumption that the focus rules are not degrading performance amounts to the assumption that this limitation on the bindings of the threat rules will not prevent the detection of any existing threat, i.e., that any threat that can be detected by the threat detection rules without limiting their bindings in this way can also be detected when the limitations imposed by the focus rules are enforced. Applied to our current example, the instantiated form of this assumption states that if the threat to the rook could have been detected at all, it could be detected with the focus rules in force. Attempting to fault this assumption means attempting to show its negation, namely that the threat to the rook could have been detected by the threat detection rules in principle, but could not be detected in practice because the threat lay outside of the range delimited by the focus rules. Since both of these are in fact true in this case, it can be concluded that the assumption in question is faulty, and that the problem therefore lies in the focus rules.8

REPAIR AND LEARNING

As we discussed above, our approach to repair entails adjusting the controllable parameters of the planning system that are implicated in the failure. Things are not quite as simple as this description implies, however, because a "parameter" in our agent model will generally be much more complex than a simple numerical value. In particular, rule sets such as the threat detection and focus rules referred to above are in fact adjustable parameters of the system as well, because the system can, and indeed must, add and delete rules from these sets in order to learn.
8 The system is able to establish these facts by evaluating the predicate could-detect (see figure 1 above) applied to the threat against the rook that it failed to detect. In order for the system's deductive retrieval mechanism to evaluate such a predicate, it must have access to a declarative formulation of the content of the threat detection and focusing rules, so that, for example, it can determine whether a given rule is able to detect a given threat.
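The exoneration test the footnote describes can be sketched as follows. This is a hedged illustration only: the could_detect interface and its bindings argument are assumptions standing in for the system's declarative rule formulations, not its actual API.

def focus_assumption_faulted(threat, threat_rules, focus_rules, board):
    # The assumption "focusing does not degrade detection" is faulted
    # exactly when the threat is detectable by some threat rule over
    # unrestricted bindings, yet undetectable once bindings are limited
    # to those the focus rules deliver.
    in_principle = any(rule.could_detect(threat, board, bindings=None)
                       for rule in threat_rules)
    in_practice = any(rule.could_detect(threat, board, bindings=focus(board))
                      for rule in threat_rules
                      for focus in focus_rules)
    return in_principle and not in_practice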
Adjusting such parameters, then, requires the ability to construct new rules that will modify the system's behavior in appropriate ways. In the example discussed above, for instance, the diagnosis process has determined that the problem lies in the focus rules. One way to repair such a problem would be to add a new focus rule to the set. In order to formulate such a rule, the system must know two things: what the new rule should do, and under what conditions it should fire. Because we have modeled the agent's architecture in terms of discrete rule sets with well-defined functions, we are in a position to specify both of these things a priori, at least in general terms. Focus rules, for example, have the job of ensuring that potentially enabled threats are examined by the threat detection rules. A focus rule accomplishes this by delivering a set of bindings that would allow the threat rules to detect the existence of such a threat, at just the point when it is potentially enabled. This specification provides a blueprint for the construction of a new rule. Because the purpose of this new rule is to prevent similar failures in the future, its trigger conditions must consist of a set of features that are characteristic of situations in which such a failure might arise. Thus, in the current example, the new focus rule should trigger whenever there is the potential that a threat will be enabled in the same way that the threat against the rook was enabled in this instance. The trigger conditions of the new focus rule should therefore comprise a set of observable features indicating that such an enablement has occurred.
[Figure 5 diagram; node labels include: Focus, Detection, New threats, Previously available, Old threats, Not disabled, Active threats, Move selection, Available opportunities]
Figure 5: Explanation for enablement of threat
One approach to the problem of identifying such a feature set is to use some form of explanation-based learning (see, e.g., DeJong and Mooney, 1986; Mitchell, Keller, and Kedar-Cabelli, 1986; Schank, Collins, and Hunter, 1986). If the system can explain how the current threat was enabled, then this account can be used to pick out features of the current situation that can be used to identify future situations in which threats have been similarly enabled. In other words, the model's specification of the trigger conditions for focus rules—namely, that they should trigger when threats are potentially enabled—can be used as an EBL target concept (Krulwich, 1991). The first step in this process is to construct an explanation of how the threat against the rook was enabled in this instance. This explanation (see figure 5) is centered around the following chain of assertions:

• A new threat by the opponent's bishop against the rook was enabled because
• The line of attack between the two pieces became clear because
• The opponent's pawn was moved out of the line of attack

After generalizing and adjusting the leaves of this explanation, the system uses it to construct a new rule (shown in figure 6) that will correctly focus on discovered attacks. The rule directs the system's attention to potential moves through the square vacated by the previous move. Once this rule is added to the focus rule set, the system no longer falls prey to discovered attacks, regardless of the particular pieces involved or their location on the board.

(def-brule learned-focus-method25
  (focus learned-focus-method25 ?player
         (move ?player (capture ?taken-piece) ?taking-piece
               (rc->loc ?row1 ?col1) (rc->loc ?row2 ?col2))
         (world-at-time ?time2))
  <=
  (and (move-to-make (move ?other-player ?interm-piece
                           (rc->loc ?r-interm ?c-interm)
                           (rc->loc ?r-other ?c-other))
                     ?player ?goal ?time1)
       (loc-on-line ?r-interm ?c-interm ?row1 ?col1 ?row2 ?col2)
       (at-loc ?player ?taking-piece (rc->loc ?row1 ?col1)
               (- gen-time2.24 2))))

Figure 6: A new focus rule
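The behavior the learned rule encodes—after any move, attend to lines of attack passing through the square the move vacated—can be paraphrased in Python as follows. The board-geometry helpers (lines_through, pieces_on) and the move's origin attribute are hypothetical conveniences, not part of the system.

def discovered_attack_focus(board, last_move):
    # Learned focus behavior: yield bindings directing the threat
    # rules at pieces whose line of attack passes through the square
    # just vacated, which is exactly how a discovered attack arises.
    vacated = last_move.origin
    for line in board.lines_through(vacated):   # ranks, files, diagonals
        for piece in board.pieces_on(line):
            yield (piece, line)                 # bindings for threat rules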
CONCLUSION

We have argued that in order to learn to plan effectively, an agent needs an explicit model of its own planning and plan execution processes. Given such a model, the agent can pinpoint those elements of its planning and execution processes that are responsible for an observed failure to perform as expected, and can formulate a repair designed to ensure that similar failures do not occur in the future. In other words, a self-model of this sort enables an agent to determine for itself what it needs to learn from a given experience.

The proposal that learning to plan entails the use of a self-model may, at first blush, appear somewhat radical. However, the fact is that a variety of everyday planning and problem-solving behaviors straightforwardly depend upon self-models of this sort. For example, many common-sense plans, such as cooking, require waiting for the appropriate moment to perform a particular task. In such situations, human planners often employ the strategy of setting an alarm that will recall their attention to the activity when the pending task needs to be performed, thus freeing them to attend to other matters in the interim. For example, chemists often put a stopper in a test tube in which they are boiling something, so that the "Pop!" that occurs when rising pressure forces the stopper out will alert them to the fact that it is time to remove the test tube from the heat.9 To set up such an alarm, an agent must have some notion of the kinds of events that will in fact attract its attention—in other words, a model of the properties of its attention-focusing component. Such a model might, for example, specify that flashing lights, loud noises, and quick movements attract the agent's attention; that once its attention is so attracted, it will attempt to explain the cause of the event; and that if the cause is due to the agent itself, it will recall its purpose in setting up the event. Armed with this theory, the agent can decide whether a particular type of event, for instance the "Pop!" of a stopper being disgorged from a test tube, can successfully serve as an alarm. Mnemonic devices, for example the proverbial string tied around one's finger, employ similar techniques to attack a slightly different problem, that of retrieving a piece of information at the appropriate time. To develop and employ such techniques, the agent must have a model not only of its attention-focusing mechanisms, but also of its memory. The ubiquity of such examples in everyday life
argues that any theory of planning that aims to achieve human-level performance must ultimately come to grips with the need for self-modeling, even without taking into account the issue of learning.

9 Thanks to Ken Forbus for this example.

Acknowledgments: We thank Matt Brand, Kris Hammond, Louise Pryor, Chris Riesbeck, and Roger Schank for many useful discussions. This work was supported in part by the Office of Naval Research under contract N00014-89-J-3217, by the Air Force Office of Scientific Research under contract AFOSR-91-0341-DEF, and by the Defense Advanced Research Projects Agency, monitored by the Air Force Office of Scientific Research under contract F49620-88-C-0058. The Institute for the Learning Sciences was established in 1989 with the support of Andersen Consulting, part of The Arthur Andersen Worldwide Organization. The Institute receives additional support from Ameritech, an Institute Partner, and from IBM.

REFERENCES

Agre, P., and Chapman, D. (1987). Pengi: An implementation of a theory of activity. Proceedings of the 1987 AAAI Conference, Seattle, WA, pp. 268-272.

Birnbaum, L., and Collins, G. (1988). The transfer of experience across planning domains through the acquisition of abstract strategies. Proceedings of the 1988 Workshop on Case-Based Reasoning, Clearwater Beach, FL, pp. 61-79.

Birnbaum, L., Collins, G., Freed, M., and Krulwich, B. (1990). Model-based diagnosis of planning failures. Proceedings of the 1990 AAAI Conference, Boston, MA, pp. 318-323.

Birnbaum, L., Collins, G., Brand, M., Freed, M., Krulwich, B., and Pryor, L. (1991). A model-based approach to the construction of adaptive case-based planning systems. Proceedings of the 1991 Workshop on Case-Based Reasoning, Washington, DC, pp. 215-224.

Brooks, R. (1986). A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, vol. 2, no. 1.

Carbonell, J. (1986). Derivational analogy: A theory of reconstructive problem solving and expertise acquisition. In R. Michalski, J. Carbonell, and T. Mitchell, eds., Machine Learning: An Artificial
Intelligence Approach, Volume II, Morgan Kaufmann, Los Altos, CA, pp. 371-392.

Collins, G., Birnbaum, L., and Krulwich, B. (1989). An adaptive model of decision-making in planning. Proceedings of the Eleventh IJCAI, Detroit, MI, pp. 511-516.

Collins, G., Birnbaum, L., Krulwich, B., and Freed, M. (1991). Plan debugging in an intentional system. Proceedings of the Twelfth IJCAI, Sydney, Australia, pp. 353-358.

Davis, R. (1984). Diagnostic reasoning based on structure and behavior. Artificial Intelligence, vol. 24, pp. 347-410.

DeJong, G., and Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, vol. 1, pp. 145-176.

de Kleer, J., and Williams, B. (1987). Diagnosing multiple faults. Artificial Intelligence, vol. 32, pp. 97-130.

Doyle, J. (1979). A truth maintenance system. Artificial Intelligence, vol. 12, pp. 231-272.

Fikes, R., and Nilsson, N. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, vol. 2, pp. 189-208.

Firby, R. (1989). Adaptive execution in complex dynamic worlds. Research report no. 672, Yale University, Dept. of Computer Science, New Haven, CT.

Hammond, K. (1989a). Case-Based Planning: Viewing Planning as a Memory Task. Academic Press, San Diego.

Hammond, K. (1989b). Opportunistic memory. Proceedings of the Eleventh IJCAI, Detroit, MI, pp. 504-510.

Hayes-Roth, F. (1983). Using proofs and refutations to learn from experience. In R. Michalski, J. Carbonell, and T. Mitchell, eds., Machine Learning: An Artificial Intelligence Approach, Vol. I, Tioga, Palo Alto, CA, pp. 221-240.
Kolodner, J. (1987). Capitalizing on failure through case-based inference. Proceedings of the Ninth Cognitive Science Conference, Seattle, WA, pp. 715-726.

Krulwich, B. (1991). Determining what to learn in a multi-component planning system. Proceedings of the Thirteenth Cognitive Science Conference, Chicago, IL, pp. 102-107.

Mitchell, T., Keller, R., and Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, vol. 1, pp. 47-80.

Newell, A., and Simon, H. (1963). GPS, a program that simulates human thought. In E. Feigenbaum and J. Feldman, eds., Computers and Thought, McGraw-Hill, New York, pp. 279-293.

Sacerdoti, E. (1974). Planning in a hierarchy of abstraction spaces. Artificial Intelligence, vol. 5, pp. 115-132.

Sacerdoti, E. (1977). A Structure for Plans and Behavior. American Elsevier, New York.

Schank, R. (1982). Dynamic Memory: A Theory of Reminding and Learning in Computers and People. Cambridge University Press, Cambridge, England.

Schank, R., Collins, G., and Hunter, L. (1986). Transcending inductive category formation in learning. The Behavioral and Brain Sciences, vol. 9, pp. 639-686.

Simmons, R. (1988). A theory of debugging plans and interpretations. Proceedings of the 1988 AAAI Conference, St. Paul, MN, pp. 94-99.

Stallman, R., and Sussman, G. (1977). Forward reasoning and dependency-directed backtracking in a system for computer-aided circuit analysis. Artificial Intelligence, vol. 9, pp. 135-196.

Sussman, G. (1975). A Computer Model of Skill Acquisition. American Elsevier, New York.

Wilkins, D. (1984). Domain independent planning: Representation and plan generation. Artificial Intelligence, vol. 22, pp. 269-302.
Chapter 5

LEARNING FLEXIBLE CONCEPTS USING A TWO-TIERED REPRESENTATION

R. S. Michalski, F. Bergadano1, S. Matwin2 and J. Zhang
Center for Artificial Intelligence
George Mason University
Fairfax, VA 22030

ABSTRACT

Most human concepts are flexible in the sense that they inherently lack precise boundaries, and these boundaries are often context-dependent. This chapter describes a method for representing and inductively learning flexible concepts from examples. The basic idea is to represent such concepts using a two-tiered representation. Such a representation consists of two structures ("tiers"): the Base Concept Representation (BCR), which captures explicitly the basic and context-independent concept properties, and the Inferential Concept Interpretation (ICI), which characterizes allowable concept modifications and context-dependency. The proposed method has been implemented in the POSEIDON3 system (also called AQ16), and tested on various practical problems, such as learning the concept of "acceptable union contracts" and "voting patterns of Republicans and Democrats in the U.S. Congress." In the experiments, the system generated concept descriptions that were both more accurate and simpler than those produced by the other methods tested, such as methods employing simple exemplar-based representations, decision tree learning, and some previous methods for rule learning.

1 On leave of absence from the University of Torino, Italy.
2 On leave of absence from the University of Ottawa, Canada.
3 The system is named after POSEIDON, the Greek god of the sea, water and waves, which represent fluidity and the changing aspects of nature.
INTRODUCTION

Typical assumptions underlying a large part of machine learning research are that concepts have precise boundaries, are context-independent, and are representable by a single symbolic description. An important consequence of this assumption is that recognizing instances of such concepts, which we call crisp, is very simple: if an instance satisfies a given concept description, then it belongs to the concept; otherwise it does not. Another common assumption is that concept instances are equally representative, that is, there is no distinction in typicality among instances. In some methods, these assumptions are partially relaxed by assigning to a concept a fuzzy set membership function (e.g., Zadeh, 1974), or a probability distribution (e.g., Cheeseman et al., 1988; Fisher, 1987). However, once such a measure is defined explicitly for a given concept, the concept has a fixed, well-defined meaning. Moreover, these methods remain unsatisfactory for coping with context-dependency, handling exceptional cases, or capturing gradual changes of knowledge about the concept properties.

When one looks at human concepts, one can see that most of them inherently lack precisely defined boundaries, and that their meaning is often context-dependent. Although on the surface these properties can be viewed as undesirable, one can argue that they contribute to a cognitive economy of human knowledge representations (Michalski, 1987). Our view is that this imprecision and context-dependency can be more adequately captured by rules of inference and flexible concept matching than by a probability distribution or a numerical set membership function. In other words, we postulate that the imprecision and context-dependency often have a logical, rather than a probabilistic, character. This is confirmed by the observation that people usually decide about the concept membership of borderline instances through inference—by reasoning from general knowledge, drawing an analogy, or performing induction—rather than by conducting a statistical analysis. Examples of human concepts can often be characterized by a degree
of typicality in representing the concept. For example, a robin is usually viewed as a more typical bird than a penguin or an ostrich. The typicality is usually viewed as the degree to which an instance shares the common concept properties. Another property of concepts is that in different contexts they may have different meanings. For example, the concept "bird" may apply to a live, flying bird, a sculpture, a chick hatching out of the egg, or even an airplane. Thus, human concepts are flexible, as their boundaries have a certain degree of fluidity, and can change with the context in which the concepts are used. It is clear that in order to learn such concepts, machine learning systems need to employ richer concept representations than are currently used.

This chapter describes an approach to learning flexible concepts based on the idea of two-tiered representation (TT), proposed by Michalski (1987). In this representation, a concept is described by two structures ("tiers"), the base concept representation (BCR) and the inferential concept interpretation (ICI). The BCR defines explicitly the basic properties of the concept, while the ICI describes implicitly, through rules and matching procedures, the allowed modifications of the explicit meaning, and its changes or extensions in different contexts. In the general definition of the two-tiered representation, the "distribution" of the meaning between the two tiers is not fixed, but depends on the properties of the reasoning agent, and on the criteria for evaluating the quality of concept descriptions. In the instantiation of the two-tiered approach that applies to modeling human concept representation, the BCR is assumed to describe the most typical, common, and intentional meaning of a concept, while the ICI handles the exceptional or borderline cases, and context-dependency (Michalski, 1990). The ICI for specific concepts is often inherited from more general concepts.

Early ideas, experiments and the first method for learning two-tiered concept representations were presented in (Michalski et al., 1986; Michalski, 1988; and Michalski, 1990). The general idea was to induce, in the first step, a concept description that is a complete and consistent characterization of all training examples. Such a description is often
overly complex and performs poorly on new examples if the concept has flexible and/or complex borders, or if the examples are noisy. Therefore, in the second step, such descriptions are simplified or optimized according to some criterion of description "quality." The method employed a simple form of description simplification, called TRUNC, which removes those parts of the description that cover only a small fraction of examples (the so-called light disjuncts, or light rules). Such a description change can be logically interpreted as a specialization operation. As the ICI, the method applied a flexible matching procedure. An intriguing result of that research was that the description's complexity was substantially reduced without affecting its performance on new examples.

The new method, described here, significantly extends these early ideas. One important advance is the development of a heuristic double-level search procedure, called TRUNC-SG, which explores the space of two-tiered descriptions to derive a globally optimized description. The search employs both generalization and specialization operators, and is guided by a new criterion, the general description quality measure (GDQ). This measure considers the accuracy of the description, the computational cost of both tiers (the Base Concept Representation and the Inferential Concept Interpretation), and its cognitive comprehensibility (Bergadano et al., 1988). By introducing such a general description quality measure, any form of concept learning can be viewed as a process of modifying the input concept description in order to maximize a given description quality measure. The initial concept description can be in the form of positive examples only, positive and negative examples, a complete and/or consistent concept description, an initial description supplied by a teacher, an abstract concept definition (as in explanation-based learning), or a combination of these forms.

Another advance is that flexible matching is used not only in the recognition process, as in (Michalski et al., 1986), but also in the learning process, i.e., in searching for high "quality" concept descriptions. This feature also distinguishes the method from the related work described in (Bergadano and Giordana, 1989), which does not involve deductive
reasoning in the learning phase, and evaluates the performance of generated descriptions solely on the basis of the coverage of examples. These earlier approaches may be compared to using hands when learning how to row a boat, and then using oars in the performance phase. The idea that learning is more effective if one uses the same instruments in the learning and performance phases was also present in some incremental learning systems (e.g., Fisher, 1987).

The work described here also represents an important advance over tree-pruning techniques (e.g., Quinlan, 1987). These techniques apply a much more restrictive description reduction operator (a tree-pruning operator that performs a generalization of the class replacing the pruned subtree, and a specialization of the other classes), and do not use deductive matching or flexible interpretation of the learned descriptions. Other advances include the ability to take into consideration the typicality of training instances (when it is known), and the use of a rule base for the Inferential Concept Interpretation.

This chapter describes the basic ideas of two-tiered representation, the method proposed, and experimental results from comparing it with several other methods, such as variants of exemplar-based learning, decision tree learning, learning complete and consistent descriptions, and the earlier method using two-tiered representation based on the TRUNC procedure. The experiments have shown that the proposed method compares favorably with the other methods: the descriptions it learned were both simpler and more accurate in classifying testing examples.

TWO-TIERED CONCEPT REPRESENTATION

Motivation and Definition

Traditional work on concept representation has assumed that the whole meaning of a concept resides in a single structure, e.g., a semantic network, a logic-based description, or a decision tree. Such a structure is expected to capture all relevant properties of the concept(s) and define the
concept boundary (e.g., Collins and Quillian, 1972; Minsky, 1975; Smith and Medin, 1981; Sowa, 1984). When concepts have flexible boundaries, or the learning examples contain a considerable amount of noise, it may be advantageous to construct a concept representation that is partially inconsistent and/or incomplete with regard to the given examples. This idea was confirmed by the work on pruning decision trees (Quinlan, 1987), in the HILLARY system (Iba et al., 1988), and in the work on two-tiered representation (Michalski, 1987; and this chapter).

In traditional approaches, the recognition of a concept instance is typically done by directly matching the instance description with the stored concept representation. Such matching may include comparing feature values in an instance with those in the concept description, or tracing links in a semantic network, but is not assumed to involve any complex inferential processes. More recently, researchers working on exemplar-based reasoning (e.g., Bareiss, 1989; Kolodner, 1988; Hammond, 1989) have proposed various inference mechanisms for classifying new instances. In these methods, however, the concept representation consists of stored examples (cases). Such a representation taxes memory, and makes it difficult to compare different concepts.

The two-tiered representation employs a general concept description (BCR), and an inference mechanism (ICI) for matching the description with instances. Such a concept representation can be much simpler than one that stores individual examples, or their independent generalizations. The BCR can be viewed as a characterization of the "central tendency" of a concept; it contains the most relevant properties, and specifies the basic intention behind the concept. The ICI handles special cases, exceptions4 and context-dependency. It treats them either by extending the base concept representation (concept extension), or by specializing it (concept contraction). This process involves the background knowledge and relevant inference rules contained in the ICI.

4 The term "exceptions" is used here in its colloquial meaning. The subsection Types of Match gives it a precise meaning.
Inference allows the recognition, extension or modification of the concept meaning according to its context. When an unknown entity is to be recognized, it is first matched against the Base Concept Representation. Then, depending on the outcome, the entity may be related to the concept's inferential extensions or contractions. A simple inferential matching can be merely a probabilistic inference based on some measure of similarity, e.g., the flexible matching method (Michalski et al., 1986). Advanced matching may involve any kind of inference—deductive, analogical or inductive.

Let us illustrate the idea of two-tiered representation using the concept of "chair."

BCR:
Superclass: A piece of furniture.
Function: To seat one person.
Structure: A seat supported by legs and a backrest attached from the side.
Physical properties: The number of legs is usually four. Often made of wood. The height of the seat is usually about 14-18 inches from the end of the legs, etc. (The BCR may also include a picture of 3D models of typical chairs.)

ICI:
Possible variations of the properties in the BCR: The number of legs can vary from one to four. The legs may be replaced by any support. The shape of the seat, the legs and the backrest, and the material of which they are made are irrelevant, as long as the function is preserved. The backrest may be very small or missing, etc.
Context dependency:
Context = museum exhibit --> the chair is not used for seating persons any more.
Context = toys --> the size can be much smaller than stated in the BCR. The chair does not serve for seating persons, but correspondingly small dolls.
Special cases:
If the legs are replaced by wheels --> type(chair) = wheelchair
Chair without the backrest --> type(chair) = stool
Chair with armrests --> type(chair) = armchair

This simple example illustrates several important features of two-tiered representation. Commonly occurring cases of chairs match the BCR completely, and the ICI does not need to be involved. For such cases, the recognition time can thus be reduced. The BCR is not the same as a description of a prototype (e.g., Rosch and Mervis, 1975), as it can be a
generalization characterizing different typical cases, or be a set of different prototypes. The ICI does not represent only distortions or corruptions of the prototype; it can describe some radically different cases. When an entity does not satisfy the base representation of any relevant concept (which concepts are relevant is indicated by the context of discourse), or satisfies the base representation of more than one concept, the ICI is involved. The ICI can be changed, upgraded or extended without any change to the Base Concept Representation. While BCR-based recognition involves just direct matching, ICI-based recognition can involve a variety of transformations and any type of inference.

The ideas of two-tiered representation are supported by research on the so-called transformational model (Smith and Medin, 1981). In this model, matching object features with concept descriptions may transform object features into those specified in the concept description. Such matching is inferential. Some recent work in cognitive linguistics also seems to support the ideas of two-tiered representation. For example, Lakoff (1987), in his idealized cognitive models approach, stipulates that humans represent concepts as a structure which includes a fixed part and mappings that modify it. The fixed part is a propositional structure, defined relative to some idealized model. The mappings are metaphoric or metonymic transformations of the concept's meaning.

As mentioned before, in the general two-tiered model, the distribution of the concept meaning between the BCR and ICI can vary, depending on the criterion of concept description quality. For example, the BCR can be just concept examples, and the ICI can be a procedure for inferential matching, as used in the case-based reasoning approach. Consequently, the case-based reasoning approach can be viewed as a special case of the general two-tiered representation.

Concept Representation Language

In the proposed method, the formalism used for concept representation is based on the variable-valued logic system VL1 (Michalski, 1975). This formalism allows us to express simply and
implemented, F maps events from the set E, and concept descriptions from the set D, into a degree of match in the interval [0..1]:

F: E x D --> [0..1]

The value of F for an event e and a concept description D is defined as the probabilistic sum of F for its rules. Thus, if D consists of two rules, r1 and r2, we have:

F(e, D) = F(e, r1) + F(e, r2) - F(e, r1) x F(e, r2)

A weakness of the probabilistic sum is that it is biased toward descriptions with many rules. If a concept description D has a large number of rules, the value of F(e, D) may be close to 1, even if F(e, r) for each rule r is relatively small (see Table 4). To avoid this effect, if the value of F(e, r) falls below a certain threshold, then it is assumed to be 0. (In our method this problem does not occur, because concept descriptions are typically reduced to only a few rules; see the TRUNC-SG procedure in the subsection Basic Algorithm.)

The degree of match, F(e, r), between an event e and a rule r is defined as the average of the degrees of fit for its constituent conditions, weighted by the proportion of positive examples to all examples covered by the rule:

F(e, r) = (Σi F(e, ci) / n) x #rpos / (#rpos + #rneg)
where F(e, ci) is the degree of match between the event e and the condition ci in the rule r, n is the number of conditions in r, and #rpos and #rneg are the number of positive examples and the number of negative examples covered by r, respectively.
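These two formulas translate directly into code. The following Python sketch is only an illustration of the computation just described; the condition encodings and the linear decay rate for out-of-range values are assumptions.

def rule_match(event, rule):
    # F(e, r): average condition fit, weighted by the rule's training
    # precision #rpos / (#rpos + #rneg).
    conditions, rpos, rneg = rule
    avg_fit = sum(fit(event[attr]) for attr, fit in conditions) / len(conditions)
    return avg_fit * rpos / (rpos + rneg)

def description_match(event, rules, threshold=0.1):
    # F(e, D): probabilistic sum of the rules' degrees of match,
    # zeroing per-rule scores below the threshold to counter the bias
    # toward many-rule descriptions noted above.
    total = 0.0
    for rule in rules:
        f = rule_match(event, rule)
        if f < threshold:
            f = 0.0
        total = total + f - total * f   # a (+) b = a + b - a*b
    return total

def linear_condition(lo, hi, decay=0.25):
    # Fit for [attr = lo..hi]: 1 inside the range; outside it, a
    # decreasing function of the distance to the nearest endpoint
    # (the linear form and rate here are our assumption).
    def fit(value):
        if lo <= value <= hi:
            return 1.0
        distance = lo - value if value < lo else value - hi
        return max(0.0, 1.0 - decay * distance)
    return fit

def nominal_condition(allowed):
    # Fit for [attr = v1 v v2 ...]: 1 if satisfied, 0 otherwise.
    return lambda value: 1.0 if value in allowed else 0.0

rule = ([("size", linear_condition(1, 5)),
         ("color", nominal_condition({"red", "blue"}))], 9, 1)
print(description_match({"size": 6, "color": "red"}, [rule]))  # 0.7875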
The degree of match between an event and a condition depends on the type of the attribute in the condition. Four types of attributes are distinguished: nominal, structured-nominal, linear and structured-linear (Michalski and Stepp, 1983). Values of a structured-nominal (linear) attribute are nodes of an unordered (ordered) generalization hierarchy. In an ordered hierarchy, the child nodes of any parent node constitute a totally ordered set. In a nominal or structured-nominal condition, the referent is a single value or an internal disjunction of values, e.g., [color = red v blue v green]. The degree of match is 1 if such a condition is satisfied by an event, and 0 otherwise. In a linear or structured-linear condition, the referent is a range of values, or an internal disjunction of ranges, e.g., [weight = 1..3 v 6..9]. A satisfied condition returns the value of match 1. If the condition is not satisfied, the degree of match is a decreasing function of the distance between the value and the nearest end-point of the interval. If the maximum degree of match between an example and all the candidate concepts is smaller than a preset threshold, the result is "no match."

Inferential Concept Interpretation: Deductive Rules

In addition to flexible matching, the Inferential Concept Interpretation includes a set of deductive rules that allow the system to recognize exceptions and context-dependent cases. For example, flexible matching allows an agent to recognize an old sequoia as a tree, although it does not match the typical size requirements. Deductive reasoning is required to recognize a tree without leaves (in the winter time), or to include in the concept of tree its special instance (e.g., a fallen tree). In fact, flexible matching is most useful for covering instances that are close to the typical case, while deductive matching is appropriate for dealing with the concept transformations necessary to include exceptions, or to take into consideration context-dependency. The deductive inference rules in the Inferential Concept Interpretation are expressed as Horn clauses. The inference process is implemented using the LOGLISP system (Robinson and Sibert, 1982). Numerical quantifiers and internal connectives are also allowed. They are represented in the annotated predicate calculus (Michalski, 1983).

Types of Match. The method recognizes three types of match between an event and a two-tiered description:

1. Strict match: An event matches the Base Concept Representation exactly, and is said to be S-covered.
2. Flexible match: An event is not S-covered, but matches the Base Concept Representation through the flexible matching function. In this case, the event is said to be F-covered.

3. Deductive match: An event is not F-covered, but matches the concept by conducting a deductive inference using the Inferential Concept Interpretation rules. In this case, the event is said to be D-covered. (In general, this category could be extended to also include matching by analogy and induction; Michalski, 1989.)

The above concepts provide a basis for proposing a precise definition of classes of concept examples that are usually characterized only informally. Specifically, examples that are S-covered are called representative examples; examples that are F-covered are called nearly-representative examples; and examples that are D-covered are called exceptions. As mentioned earlier, one of the major advances of the presented method over previous methods using two-tiered representation (e.g., Michalski et al., 1986) is that the Inferential Concept Interpretation includes not only a flexible matching procedure, but also inference rules. Thus, using our newly introduced terminology, we can say that the method can handle not only representative or nearly-representative examples, but also exceptions.
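Recognition thus proceeds as a cascade through the three types of match. A minimal Python sketch of that cascade, assuming each concept object exposes strict, flexible and deduce interfaces (our assumption, not POSEIDON's actual API):

def classify(event, concepts, flex_threshold=0.5):
    # S-covered: strict match against some Base Concept Representation.
    for concept in concepts:
        if concept.strict(event):
            return concept, "strict"
    # F-covered: best flexible match, if above the preset threshold.
    best = max(concepts, key=lambda c: c.flexible(event))
    if best.flexible(event) >= flex_threshold:
        return best, "flexible"
    # D-covered: deductive match via the ICI rules.
    for concept in concepts:
        if concept.deduce(event):
            return concept, "deductive"
    return None, "no match"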
AN OVERVIEW OF THE POSEIDON SYSTEM

Basic Algorithm

The ideas presented above have been implemented in a system called POSEIDON (also called AQ16). Table 1 presents the two basic phases in which the system learns the Base Concept Representation. The first phase generates a general, consistent and complete concept description, and the second phase optimizes this description according to a General Description Quality measure. The optimization is done by applying different description modification operators.

Phase 1
Given: Concept examples obtained from some source; relevant background knowledge.
Determine: A complete and consistent description of the concept.

Phase 2
Given: A complete and consistent description of the concept; a general description quality (GDQ) measure; the typicality of examples (if available).
Determine: The Base Concept Representation that maximizes GDQ.

Table 1. Basic Phases in Generating the BCR in POSEIDON.

The search process is defined by:

Search space: A tree structure, in which nodes are two-tiered concept descriptions (BCR + ICI).
Operators: Condition removal, rule removal, referent modification.
Goal: Determine a description that maximizes the general description quality criterion.

The complete and consistent description is determined by applying the AQ inductive learning algorithm (using the program AQ15; Michalski et al., 1986). The second phase improves this description by conducting a "double level" best-first search. This search is implemented by the TRUNC-SG procedure ("SG" symbolizes the fact that the method uses both specialization and generalization operators). In this "double level" search, the first level is guided by a general description quality measure, which ranks candidate descriptions. The second level search is guided by heuristics controlling the search operators to be applied to a given description. The search operators simplify the description by removing some of its components, or by modifying the arguments or referents of
some of its predicates. A general structure of the system is presented in Figure 1.
[Figure 1 diagram: Source of Examples --> Phase 1: Generate Consistent and Complete Description (AQ) --> Phase 2: Compute Description Quality]
Figure 1. Learning Phases in POSEIDON.

The goal of the search is not necessarily to find an optimal solution, as this would require a combinatorial search. Rather, the system tries to maximally improve the given concept description by expanding only a limited number of nodes in the search tree. The nodes to be expanded are suggested by the various heuristics discussed before. The BCR is learned from examples. The Inferential Concept Interpretation contains two parts: a flexible matching function and a rule base. The rule base contains rules that explain exceptional examples, and is acquired through interaction with an expert.
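The "double level" best-first search can be sketched in outline. The following Python fragment is a schematic rendering of TRUNC-SG's control loop only, assuming hashable descriptions, a quality function, and operator generators; the operator-scheduling heuristics of the second level are omitted.

import heapq

def trunc_sg(initial, quality, operators, budget=50):
    # Best-first search over two-tiered descriptions: repeatedly expand
    # the highest-quality candidate by applying the search operators,
    # keeping the best description seen within a fixed node budget.
    frontier = [(-quality(initial), 0, initial)]
    best = initial
    seen = {initial}
    tick = 0
    while frontier and budget > 0:
        _, _, description = heapq.heappop(frontier)
        budget -= 1
        for op in operators:
            for candidate in op(description):
                if candidate in seen:
                    continue
                seen.add(candidate)
                tick += 1   # unique tie-breaker for the heap
                heapq.heappush(frontier, (-quality(candidate), tick, candidate))
                if quality(candidate) > quality(best):
                    best = candidate
    return best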
Operators for Optimizing the Base Concept Representation

A description can be modified using three general operators: rule removal, condition removal and referent modification. The rule removal operator removes one or more rules from a ruleset. This is a specialization operator, because it leads to "uncovering" some examples. It is the reverse of the "adding an alternative" generalization rule (Michalski, 1983). Condition removal (from a rule) is a generalization operator, as it is equivalent to the "dropping condition" generalization rule. The referent modification operator changes the referent in a condition (i.e., the set of attribute values stated in a condition). Such changes can either generalize or specialize a description. Consequently, two types of referent modification operators are defined: condition extension, which generalizes the description, and condition contraction, which specializes the description.

To illustrate these two types of referent modification, consider the condition [size = 1..5 v 7]. Changing this condition to [size = 1..7] represents a condition extension operator. Changing it to [size = 1..5] represents a condition contraction operator. On the other hand, if the initial condition is [size ≠ 1..5 v 7], then changing it to [size ≠ 1..7] represents a condition contraction operator. Similarly, changing it to [size ≠ 1..5] represents a condition extension operator. A summary of the effect of the different operators on a description is given in Table 2:

Search operator                 Effect on the description
Rule removal (RR)               specializes
Condition removal (CR)          generalizes
Condition extension (CE)        generalizes
Condition contraction (CC)      specializes

Table 2. Search operators and their effect on the description
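If a rule is taken to be a sequence of conditions and a referent to be a set of admissible values, the four operators reduce to a few lines each. The encodings below are illustrative assumptions:

def remove_rule(rules, i):
    # RR: specialization -- deleting a disjunct uncovers some examples.
    return rules[:i] + rules[i + 1:]

def remove_condition(rule, j):
    # CR: generalization -- the "dropping condition" rule.
    return rule[:j] + rule[j + 1:]

def extend_referent(condition, values):
    # CE: generalization, e.g. [size = 1..5 v 7] -> [size = 1..7].
    attribute, referent = condition
    return (attribute, referent | values)

def contract_referent(condition, values):
    # CC: specialization, e.g. [size = 1..5 v 7] -> [size = 1..5].
    attribute, referent = condition
    return (attribute, referent - values)

condition = ("size", set(range(1, 6)) | {7})
print(extend_referent(condition, {6}))    # ('size', {1, 2, 3, 4, 5, 6, 7})
print(contract_referent(condition, {7}))  # ('size', {1, 2, 3, 4, 5})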
Thus, applying the above search operators can either specialize or generalize the given description. A generalized (specialized) description potentially covers a larger (smaller) number of training examples, which can be positive or negative. At any given search step, the algorithm chooses an operator on the basis of an evaluation of the changes in coverage caused by applying the operator (see the Basic Algorithm subsection).

Learning the Inferential Concept Interpretation

As indicated above, by applying a search operator (RR, CR, CE or CC) to the current Base Concept Representation, one can make it either more general or more specific. If the modified representation is more specific, some positive examples previously covered may cease to be S-covered. These examples may, however, still be covered by the existing Inferential Concept Interpretation (and thus would become F-covered or D-covered). On the other hand, if the modified base representation is more general than the original one, some negative examples, previously uncovered, may now become S-covered. They may, however, still be excluded by the existing Inferential Concept Interpretation rules.

Consequently, two types of rules in the Inferential Concept Interpretation can be distinguished: rules that cover positive examples left uncovered by the base representation ("positive exceptions"), and rules that eliminate negative examples covered by the base representation ("negative exceptions"). A problem then is how to acquire these rules. The rules can be supplied by an expert, inherited from higher level concepts, or deduced from other knowledge. If the rules are supplied by an expert, they may not be operationally effective, but they can be made so through analytic learning (e.g., Mitchell et al., 1986; Prieditis and Mostow, 1987). If the expert-supplied rules are too specific or only partially correct, they may be improved inductively (e.g., Michalski and Larson, 1978; Dietterich and Flann, 1988; Mooney and Ourston, 1989). Thus, in general, rules for the Inferential Concept Interpretation can be developed by different strategies.
In the implemented method, the system identifies exceptions (i.e., examples not covered by the Base Concept Representation), and asks an expert for a justification. The expert is required to express this justification in the form of rules. The search procedure, shown in Fig. 1, guides the process by determining which examples require justification. This way, the role of the program is to learn the "core" part of the concept from the supplied examples, and to identify the exceptional examples. The role of the teacher is to provide concept examples, and to justify why the examples identified by the learning system as exceptions are also members of the concept class.

QUALITY OF CONCEPT DESCRIPTIONS

Factors Influencing the Description Quality

The learning method utilizes a general description quality measure that guides the search for an improved two-tiered description. The General Description Quality measure takes into consideration three basic characteristics of a description: its accuracy, its comprehensibility, and its cost. This section discusses these three components, and describes a method for combining them into a single measure.

The accuracy expresses the description's ability to produce correct classifications. Major factors in estimating the description's predictive power are its degree of completeness and consistency with regard to the input examples. When learning from noisy examples, however, achieving a high degree of completeness and consistency may lead to an overly complex and overspecialized description. Such a description may be well tuned to the particular training set, but may perform poorly in classifying future examples. For that reason, when learning from imperfect inputs, it may be better to produce descriptions that are only partially complete and/or consistent.

If an intelligent system is supposed to give advice to humans, the knowledge used by such a system should be comprehensible to human experts. A "black box" classifier, even one with high predictive power, is not satisfactory in such situations. To be comprehensible, a description should
involve terms, relations and concepts that are familiar to experts, and be syntactically simple. This requirement is called the comprehensibility principle (Michalski, 1983). Since there is no established measure of a description's comprehensibility, we approximate it by representational simplicity. Such a measure is based on the number of different operators involved in the description: disjunctions, conjunctions, and the relations embedded in individual conditions. In the case of two-tiered representations, the measure takes into account the operators occurring in both the BCR and the ICI, and weighs the relative contribution of each part to the comprehensibility of the whole description.

The third criterion, the description cost, captures the cost of storing the description and using it in computations to make a decision. Other things being equal, descriptions which are easier to store and easier to use for recognizing new examples are preferred. When evaluating the description cost, two characteristics are of primary importance. The first is the cost of measuring the values of the variables occurring in the description. In some application domains, e.g., in medicine, this is a very important factor. The second characteristic is the computational cost (time and space) of evaluating the description. Again, in some real-time applications, e.g., in speech or image recognition, there may be stringent constraints on the evaluation time. The cost and the comprehensibility of a description are frequently mutually dependent, but generally these are different criteria.

The criteria described above need to be combined into a single evaluation measure that can be used to compare different concept descriptions. One solution is to have an algebraic formula that, given numeric evaluations for the individual criteria, produces a number that represents their combined value. Such a formula may involve, e.g., a multiplication, a weighted sum, a maximum/minimum, or a t-norm/t-conorm of the component criteria (e.g., Weber, 1983).

Although the above approach is often appropriate, it also has significant disadvantages. First, it combines a set of heterogeneous
evaluations into a single number, and the meaning of this final number is hard for a human expert to understand. Second, it usually forces the system to evaluate all the criteria for each description, even when it would be sufficient to compare descriptions on the basis of just one or two of the most important ones. The latter situation occurs when one description is so much better than another according to some important criterion that it is not worth even considering the alternatives. To overcome these problems, we use a combination of a lexicographic evaluation and a linear function-based evaluation, which is described in the next section.

Combining Individual Factors Into a Single Preference Criterion

Given a set of candidate descriptions, we use the General Description Quality criterion to select the "best" description. Such a criterion consists of two measures, the lexicographic evaluation functional (LEF) and the weighted evaluation functional (WEF). The LEF, which is computationally less expensive than the WEF, is used to rapidly focus on a subset of the most promising descriptions. The WEF is used to select the final description. A general form of a LEF (Michalski, 1983) is:

LEF: <(Criterion1, T1), (Criterion2, T2), ..., (Criterionk, Tk)>

where Criterion1, Criterion2, ..., Criterionk are elementary criteria used to evaluate a description, and T1, T2, ..., Tk are the corresponding tolerances, expressed in %. The criteria are applied to every candidate description in order from left to right (reflecting their decreasing importance). At each step, all candidate descriptions whose score on a given criterion is within the tolerance range of the best-scoring description on this criterion are considered equivalent with respect to this criterion, and are kept on the CANDIDATE LIST; the other descriptions are discarded. If only one description remains on the list, it is chosen as the best. If the list is non-empty after applying all criteria, a standard solution is to choose the description that scores highest on the first criterion. In POSEIDON, we chose another approach in the latter case (see below).

The LEF evaluation scheme is not affected by the problems of using a linear function evaluation, mentioned above. The importance of a
Each application of an elementary criterion reduces the CANDIDATE LIST, and thus each subsequent criterion needs to be applied only to a reduced set. This makes the evaluation process very efficient. In POSEIDON, the default LEF consists of the three elementary criteria discussed above, i.e., the accuracy, the representational simplicity and the description cost, specified in that order. The next section describes them in detail. Tolerances are program parameters, and are set by the user. If the tolerance for some criterion is too small, the chances of using the remaining criteria decrease. If the tolerance is too large, the importance of the criterion is decreased. For this reason, the LEF criteria in POSEIDON are applied with relatively large tolerances, so that all the elementary criteria are taken into account.

If, after applying the last criterion, the CANDIDATE LIST still contains several candidates, the final choice is made according to the weighted evaluation functional (WEF). The WEF is a standard linear function of the elementary criteria. The description with the highest WEF value is selected. Thus, the above approach uses a computationally efficient LEF to obtain a small candidate set, and then applies a more complex measure to select from it the best description.

Taking the Typicality of Examples into Consideration

Accuracy is a major criterion in determining the quality of a concept description. In determining accuracy, current machine learning methods usually assume that it depends only on the number of positive and negative examples (training and/or testing) correctly classified by the description. One can argue, however, that in evaluating accuracy one might also take into consideration the typicality of the examples (Rosch and Mervis, 1975). If two descriptions cover the same number of positive and negative examples, the one that covers more typical positive examples and fewer typical negative examples can be considered more accurate. For the above reason, we propose a measure of completeness and
consistency of a description that takes into account the typicality of the examples. In POSEIDON, the typicality of examples can be obtained in one of two ways. The first way is that the system estimates it by the frequency of the occurrence of examples in the data (notice that this is different from the usual cognitive measure of typicality, which captures primarily the degree to which an example resembles a prototypical example). The second way is that the typicality of examples is provided by an expert who supplies the training examples. If the typicality is not provided, the system makes the standard assumption that the typicality is the same for all examples.

In the measures below, the degree of completeness of a description is proportional to the typicality of the positive events covered, and the consistency is inversely proportional to the typicality of the negative events covered.5 Since the system is working with a two-tiered description, other factors are also taken into account. One is that, according to the idea of two-tiered representation, a "high quality" concept description should cover the typical examples explicitly, and the non-typical ones only implicitly. Thus, the typical examples should be covered by the Base Concept Representation, and the non-typical, or exceptional, ones by the Inferential Concept Interpretation. In POSEIDON, the Base Concept Representation is inductively learned from examples provided by a teacher. Therefore, the best performance of the system will be achieved if the training set contains mostly typical examples of the concept being learned. For the exceptional examples, the teacher is expected to provide rules that explain them. These rules become part of the Inferential Concept Interpretation. An advantage of such an approach is that the system learns a description of the typical examples by itself, and the teacher needs to explain only the special cases.

5 When negative examples are instances of another concept, as is often the case, their typicality is understood as the typicality of being members of that other concept.
In view of the above, the examples covered explicitly (strictly-covered, or S-COV) are assumed to contribute to the completeness of a description more than those covered flexibly (F-COV) or deductively (D-COV).

General Description Quality Measure

This section defines the General Description Quality (GDQ) measure implemented in POSEIDON. As mentioned above, the measure combines the accuracy, the representational simplicity and the cost of a description. The accuracy is based on two factors, the typicality-based completeness, T_COM, and the typicality-based consistency, T_CON. These two factors are defined for a two-tiered concept description, D, as follows:

$$T\_COM(D) = \frac{\sum_{e^+ \in S\text{-}COV} w_s\,Typ(e^+) + \sum_{e^+ \in F\text{-}COV} w_f\,Typ(e^+) + \sum_{e^+ \in D\text{-}COV} w_d\,Typ(e^+)}{\sum_{e^+ \in POS} Typ(e^+)}$$

$$T\_CON(D) = 1 - \frac{\sum_{e^- \in S\text{-}COV} w_s\,Typ(e^-) + \sum_{e^- \in F\text{-}COV} w_f\,Typ(e^-) + \sum_{e^- \in D\text{-}COV} w_d\,Typ(e^-)}{\sum_{e^- \in NEG} Typ(e^-)}$$

where POS and NEG are the sets of all positive and negative examples, respectively, of the concept described by the two-tiered concept description D, and Typ(e) expresses the degree of typicality of example e for the given concept. The weights w_s, w_f and w_d represent the different significance of the types of coverage (S-COV, F-COV and D-COV). The thresholds t1 and t2 reflect the desirability of a given type of coverage for a given degree of typicality:

w_s = 1 if Typ(e) > t2, else w
w_f = 1 if t2 ≥ Typ(e) > t1, else w
w_d = 1 if t1 ≥ Typ(e), else w

where the thresholds t1 and t2 satisfy the relation 0 < t1 < t2 ≤ 1, and 0 < w < 1.
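The following sketch shows one way these typicality-based factors could be computed. It is a hedged illustration under our reading of the formulas above (in particular, T_CON is taken as one minus the weighted ratio, so that covering fewer negatives yields higher consistency); all function and parameter names are ours, not POSEIDON's.

```python
def coverage_weight(typ, kind, t1, t2, w):
    # w_s / w_f / w_d from the definitions above: weight 1 when the kind of
    # coverage ('S', 'F' or 'D') is compatible with the example's typicality,
    # and w (0 < w < 1) otherwise.
    compatible = {'S': typ > t2,
                  'F': t1 < typ <= t2,
                  'D': typ <= t1}[kind]
    return 1.0 if compatible else w

def t_com(covered_pos, all_pos_typ, t1, t2, w):
    # covered_pos: list of (typicality, kind) for the covered positive examples;
    # all_pos_typ: typicalities of all positive examples.
    num = sum(t * coverage_weight(t, k, t1, t2, w) for t, k in covered_pos)
    return num / sum(all_pos_typ)

def t_con(covered_neg, all_neg_typ, t1, t2, w):
    # Analogous ratio over the covered negatives, subtracted from 1.
    num = sum(t * coverage_weight(t, k, t1, t2, w) for t, k in covered_neg)
    return 1.0 - num / sum(all_neg_typ)
```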
The role of w is to decrease the weight of the examples that are covered in a way (S, F or D) that is not compatible with their typicality. Using the terms T_COM and T_CON, the description accuracy is defined as:

Accuracy = w1 · T_COM + w2 · T_CON
where w1 + w2 = 1. The weights w1 and w2 reflect the expert's judgment about the relative importance of completeness and consistency for the given problem. The default value of both is 0.5.

A measure of the comprehensibility of a concept description is difficult to define. As mentioned earlier, we approximate it by the representational simplicity, defined as:

$$RepSimplicity(D) = TC - \Big( v_1 \sum_{op \in BCR(D)} C(op) + v_2 \sum_{op \in ICI(D)} C(op) \Big)$$

where TC is the sum of the complexities of all operators in the description D, BCR(D) is the set of all operator occurrences in the BCR of the description, and ICI(D) is the set of all operator occurrences in the ICI. C(op), the complexity of an operator, is a real function that maps each operator symbol into a real number representing its complexity. The complexities of the operators are chosen by an expert, subject to the following constraints:

C(range) < C(internal v) < C(=) < C(≠) < C(&) < C(v) < C(⇒)

When the operator is a predicate, C increases with the number of its arguments. The parameters v1 and v2 represent the relative weights of the operators in the BCR and the ICI, respectively, with v1 + v2 = 1. The Base Concept Representation is supposed to describe the general and easy-to-define meaning of the concept, while the Inferential Concept Interpretation is mainly used to handle rare or exceptional events. As a consequence, the Base Concept Representation should be easier to comprehend than the Inferential Concept Interpretation, and thus v1 should be larger than v2.

The cost of a description D depends on two factors:
• Measuring-Cost (MC): the cost of measuring the values of the variables used in the concept description:

$$MC(D) = \frac{1}{|POS| + |NEG|} \sum_{e \in POS \cup NEG} \;\sum_{v \in Vars(e)} mc(v)$$

• Evaluation-Cost (EC): the cost of evaluating the concept description:

$$EC(D) = \frac{1}{|POS| + |NEG|} \sum_{e \in POS \cup NEG} ec(e)$$

where Vars(e) is the set of all variables of the concept description occurring in the classification of e, mc(v) is the cost of measuring the value of the variable v, and ec(e) is the computational cost of evaluating the concept description in order to classify the event e. The latter depends on the computing time and/or on the number of operators involved in the evaluation. We now define the cost of a description:

Cost(D) = u1 · MC(D) + u2 · EC(D)

where u1 and u2 are weights defining the relative importance of the measuring-cost and the evaluation-cost for a given problem.

The general description quality (GDQ) measure has the form of a Lexicographic Evaluation Functional (LEF), in which the above-defined concepts of accuracy, representational simplicity and description cost are used as the elementary criteria. The tolerances and the other parameters defined above can be chosen by a user to reflect the problem domain, or determined experimentally. They also have default values, so that the user does not have to specify them. More details about the general description quality measure are given in (Bergadano et al., 1988).
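To illustrate the cost component of the measure just defined, a minimal sketch follows. The helper names (vars_of, mc, ec) and the default weights are our own assumptions, not part of POSEIDON.

```python
def measuring_cost(examples, vars_of, mc):
    # MC(D): average, over all examples, of the cost of measuring the
    # variables needed to classify an example; mc maps a variable to its cost.
    return sum(mc[v] for e in examples for v in vars_of(e)) / len(examples)

def evaluation_cost(examples, ec):
    # EC(D): average computational cost ec(e) of evaluating the description.
    return sum(ec(e) for e in examples) / len(examples)

def description_cost(examples, vars_of, mc, ec, u1=0.5, u2=0.5):
    # Cost(D) = u1 * MC(D) + u2 * EC(D); equal default weights are an assumption.
    return (u1 * measuring_cost(examples, vars_of, mc)
            + u2 * evaluation_cost(examples, ec))
```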
LEARNING BY MAXIMIZING DESCRIPTION QUALITY

As mentioned before, learning a base concept representation (BCR) of a concept is performed in two phases. In the first phase, a complete and consistent concept description is learned inductively from examples. In the second phase, the obtained complete and consistent description is optimized according to the general description quality criterion. In POSEIDON, the first phase is done using the AQ15 learning program, described in (Michalski et al., 1986a). The following subsections describe the second phase (the TRUNC-SG procedure).

Search Heuristics for Optimizing Base Concept Representation

The task of optimizing the BCR by directly applying the General Description Quality measure is computationally expensive. It requires that every newly generated description be matched flexibly against all training examples. To make this process more efficient, a double-level search method is employed. The first level uses a simple heuristic to determine which operator, RR, CR, CE or CC, is likely to improve the description, and the second level actually applies the operator and evaluates the resulting description according to the General Description Quality measure.

The first level applies the so-called Potential Accuracy Improvement heuristic (PAI). The PAI is a function of the change in the coverage of positive and negative examples caused by an operator application. Specifically:

PAI = ΔP/TP - ΔN/TN

where ΔP (ΔN) is the change in the number of positive (negative) examples that would be covered by the description after applying the operator, and TP (TN) is the total number of positive (negative) examples. For the generalizing operators, CR and CE, ΔP and ΔN are non-negative, and for the specializing operators, RR and CC, ΔP and ΔN are non-positive.

The advantage of the Potential Accuracy Improvement measure is that it can be computed much more efficiently than the General Description Quality. For every condition in the current description, the set of examples covered by it is maintained as a bit vector. The set of examples covered by a ruleset (representing a complete description) is then obtained by intersection and union operations. The matching time can be improved further by also maintaining bit vectors for the examples covered by whole rules (the matching time trades off against the memory for storing the bit vectors). Note that computing the General Description Quality requires flexible matching, and thus cannot be done by intersection and union operations on bit vectors.
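The bit-vector bookkeeping just described can be sketched as follows, using Python integers as bit vectors (bit i is set iff example i is covered). The names and details are illustrative rather than POSEIDON's actual data structures.

```python
def rule_cover(condition_masks, n_examples):
    # A rule (conjunction of conditions) covers the intersection of the
    # example sets covered by its conditions: bitwise AND.
    cov = (1 << n_examples) - 1     # start from the set of all examples
    for m in condition_masks:
        cov &= m
    return cov

def ruleset_cover(rule_masks):
    # A ruleset (disjunction of rules) covers the union: bitwise OR.
    cov = 0
    for m in rule_masks:
        cov |= m
    return cov

def pai(old_cov, new_cov, pos_mask, neg_mask, tp, tn):
    # PAI = dP/TP - dN/TN, where dP (dN) is the change in the number of
    # covered positive (negative) examples caused by the operator.
    d_p = bin(new_cov & pos_mask).count("1") - bin(old_cov & pos_mask).count("1")
    d_n = bin(new_cov & neg_mask).count("1") - bin(old_cov & neg_mask).count("1")
    return d_p / tp - d_n / tn
```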
The PAI formula above does not take into consideration the degree of reduction of the description complexity caused by applying an operator. For example, removing a rule reduces the complexity more than removing a condition. To account for this, POSEIDON assigns a higher weight (preference) to applying the RR operator (rule removal) than to applying the CR operator (condition removal).

The condition removal operator generalizes the description; therefore, the description (ruleset) resulting from its application may cover some additional examples (positive or negative). Due to this, some rule(s) may become redundant. If the CR operation produces a rule that differs from another rule only in the value of one attribute, the two rules can be merged into one, in which the attribute is related to the internal disjunction of the values (this is a case of the so-called "refunion" operation; see Michalski and Stepp, 1983). For example, the rules [shape = circle] & [size = 2..6] and [shape = square] & [size = 2..6] can be replaced by the single rule [shape = circle v square] & [size = 2..6].

It is worth noting that in the case of the operators RR and CR, the Potential Accuracy Improvement heuristic can be simplified by using an approximation:

PAI' = #P/TP - #N/TN

where #P (#N) is the number of positive (negative) examples covered by the component (rule or condition) to be removed. Such a heuristic is very efficient because it needs to be computed only once for every condition and every rule in the initial description. This computation can be done before the search starts, and does not need to be repeated for every node in the search.

The operator that promises the largest Potential Accuracy Improvement is chosen, and applied to the description under consideration. The descriptions so generated are then subjected to an evaluation by the General Description Quality criterion. The search algorithm (TRUNC-SG) is presented in Table 3. Let us explain the motivation and the individual steps of the algorithm. Step 1
chooses the node (description) for expansion on a best-first basis, that is, it chooses the node with the highest General Description Quality. This is not always an optimal choice, because "worse" nodes can sometimes lead to better descriptions after a number of removals. Whether the search behaves in this manner depends on the adequacy of the General Description Quality as a measure of concept quality.
Search Algorithm TRUNC-SG

1. Identify in the search tree the best candidate description D. (Initially, D is the complete and consistent description obtained by AQ15 in Phase I. Subsequently, it is the highest ranked description according to the General Description Quality criterion.)

2. Apply to D the operator, selected from among the operators
   RRi  - remove the i-th rule
   CRij - remove the j-th condition from the i-th rule
   CCij - contract the referent of the j-th condition in the i-th rule
   CEij - extend the referent of the j-th condition in the i-th rule
   that maximizes the Potential Accuracy Improvement measure.

3. Compute the General Description Quality (GDQ) of the description obtained in step 2. If the GDQ of this description does not exceed the GDQ of the original D by more than Δ (an experimental threshold), then proceed to step 1. (Computing the description accuracy for the GDQ employs flexible matching.)

4. Identify the exceptional examples, that is, (a) the positive examples that cease to be covered, and (b) the negative examples that become covered. Ask an expert to provide rules explaining these examples. If such rules are obtained, add them to the Inferential Concept Interpretation; otherwise, add the exceptional example(s) to it.

5. Update the GDQ value of the new node by taking into account the added Inferential Concept Interpretation.

6. If the stopping criterion is satisfied, then STOP; otherwise proceed to step 1.

Table 3. The Algorithm for Optimizing a Concept Description.
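For readers who prefer code, here is a schematic rendering of the loop in Table 3. It is our own sketch, not POSEIDON's implementation: the Node structure, parameter names, and the expert-interaction hook are assumptions, and the bookkeeping is simplified (e.g., the Δ test of step 3 and the separate explanation threshold T discussed below are merged into one comparison).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    bcr: list                                 # rules of the base representation
    ici: list = field(default_factory=list)   # ICI rules and/or stored exceptions

def trunc_sg(initial, gen_ops, gdq, pai, exceptions_of, explain,
             delta=0.0, k1=100, k2=10):
    """gen_ops(node): applicable operators (RR, CR, CC, CE), each node -> node;
    gdq(node): General Description Quality (higher is better);
    pai(node, op): Potential Accuracy Improvement of applying op to node;
    exceptions_of(node): lost positives plus newly covered negatives;
    explain(examples): expert-provided ICI rules, or None."""
    frontier = [initial]
    best, explored, since_improve = initial, 0, 0
    while explored < k1 and since_improve < k2:     # step 6: stopping criterion
        node = max(frontier, key=gdq)               # step 1: best-GDQ node
        ops = gen_ops(node)
        if not ops:
            break
        op = max(ops, key=lambda o: pai(node, o))   # step 2: best-PAI operator
        child = op(node)
        explored += 1
        if gdq(child) <= gdq(node) + delta:         # step 3: not sufficiently
            since_improve += 1                      #   better; back to step 1
            continue
        exceptions = exceptions_of(child)           # step 4: ask the expert
        rules = explain(exceptions)
        child.ici.extend(rules if rules else exceptions)
        frontier.append(child)                      # step 5: gdq(child) now
        if gdq(child) > gdq(best):                  #   reflects the added ICI
            best, since_improve = child, 0
        else:
            since_improve += 1
    return best
```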
Step 2 chooses the "best" search operator according to the Potential Accuracy Improvement heuristic, and applies it to the current description.

Step 3 computes the General Description Quality of the new node. It should be noted that, in the General Description Quality measure, the typical examples covered directly by the base representation can weigh more than those covered through flexible matching. The examples covered by the Inferential Concept Interpretation rules weigh more than the ones covered through flexible matching, but less than the ones covered by the Base Concept Representation. A new description (node) is worth considering only if it is "sufficiently" better (by more than Δ) than the previous one; otherwise the control goes back to Step 1 (the reason for this is given below).

Step 4 determines the exceptional examples, and asks an expert for an explanation of them. If an explanation is provided, appropriate rules are added to the Inferential Concept Interpretation. These rules may extend or contract the Base Concept Representation. For example, the rule removal operator might uncover some positive examples that were previously covered. In this case, the new rules added to the Inferential Concept Interpretation would allow the system to reason about such "special" positive examples, and explain why they should be classified as instances of the concept being learned. On the other hand, the condition removal operator might cause some negative examples to be covered. In this case, new Inferential Concept Interpretation rules would have to be added to contract the Base Concept Representation.

An important issue concerning step 4 is when an explanation should be requested from an expert (the "explainer"). The problem is that in some cases the chosen operator may not be appropriate, because it leads to a very poor description. In such a case, it is not worthwhile to ask the expert for an explanation, and the search should continue in another direction. The method employs the following strategy. Suppose that N is the node (description) to be expanded, and M is the node obtained after applying an operator, e.g., condition removal. The effort to obtain an explanation is made only if the General Description Quality of M is
172 "significantly" better than that of N (above a certain threshold T). In this case, the explainer is given the General Description Quality evaluations of both descriptions, N and M, and asked for an explanation. These evaluations give the explainer a sense of importance of the request. If the explainer cannot provide an explanation, the exceptional examples are directly added to the Inferential Concept Interpretation. Step 5 updates the General Description Quality of the obtained two-tiered description by taking into consideration the added Inferential Concept Interpretation rules. Step 6 decides whether to stop or continue the search. The stopping criterion is satisfied when the number of nodes explored exceeds value kl, or when the General Description Quality is not improved after the exploration of k2 nodes since the last improvement. The search parameters kl and k2 have a default value, which is modifiable by the user. When the search stops, the best node found until this point defines the chosen two-tiered concept description. In conclusion, let us point out the main difference between the above two-level search and the standard best-first search. The difference is that only one operator is applied to the (best-GDQ) node selected for expansion, rather than all available operators, as in the standard search. The operator applied is the "best" according to the PAI heuristic. Such a procedure helps to avoid generating low quality nodes, and thus makes unnecessary the computation of the General Description Quality for these nodes. Other operators are applied only if the results obtained along this branch of the search tree turn out to be unsatisfactory. An Abstract Example An abstract example of the search process is given in Figure 2. Individual nodes represent both components of a two-tiered description (BCR and ICI) generated at any given search step, and show the coverage of training examples by the description. The rectangular areas represent the coverage by the Base Concept Representation, and the curved lines denote the coverage by the Inferential Concept Interpretation. In the example, the accuracy is computed according to the formula described before, assuming that all examples have the same typicality.
Figure 2. An Illustration of the Search Process.

The initial description is represented by node 1. The BCR contains two rules, represented by two rectangular areas, which cover five positive examples out of eight, and one negative example out of five. The
Inferential Concept Interpretation extends this coverage by recognizing one more positive example. The next nodes correspond to descriptions obtained by the application of the operators marking the branches of the search tree. For example, node 3 is obtained by eliminating condition c5 in the second rule of the initial description. The new description is more accurate, because all positive examples are now covered, without changing the coverage of the negative examples. By truncating the first rule in node 3, node 5 is generated. The description no longer covers negative examples, and is simpler. This node is then accepted as the optimized description resulting from the search. The other nodes lead to concept representations that are inferior with respect to the General Description Quality, and are discarded. The quality has been computed with w1 = w2 = 0.5. For simplicity, the cost is omitted, and the complexity of the Inferential Concept Interpretation is ignored. The complexity of the Base Concept Representation is indicated by the number of rules and the number of conditions.

EXPERIMENTS

The proposed method was implemented in the POSEIDON system (also called AQ16). To evaluate its performance, it was tested, together with several other methods, in two problem domains. The other methods tested included: simple forms of exemplar-based learning, learning consistent and complete descriptions (implemented in AQ15), generating top rule descriptions (described by Michalski et al., 1986), and generating pruned decision trees (implemented in the ASSISTANT program; Cestnik, Kononenko & Bratko, 1987). All these methods were applied to the same training data, and tested on the same testing data from the two problem domains. The first domain was labor-management contracts, and the problem was to learn a general description that discriminates between acceptable and unacceptable contracts. The second domain was congressional voting, and the problem was to characterize the voting behavior of Republicans and Democrats in the US House of Representatives.
Experimental Data

Labor-management contracts. The data regarding labor-management contracts were obtained from Collective Bargaining, a review of current collective bargaining issues published by the Department of Labor of the Government of Canada. The data describe labor-management contracts negotiated between various organizations and labor unions with at least 500 members, and concluded in the second half of 1987 or the first half of 1988. The experiments focused on the personal and business services sector. This sector includes unions representing hospital staff, teachers, university professors, social workers and certain classes of administrative personnel. The data involved multivalued attributes, and thus the VL1 language was directly applicable. Each contract is described by sixteen attributes, belonging to two main categories. One category concerns issues related to salaries, e.g., pay increases in each year of the contract, the cost of living allowance, stand-by pay, etc., and the second category concerns issues related to fringe benefits, e.g., different kinds of pension contributions, holidays, vacation, dental insurance, etc. Positive examples represent contracts that have been accepted by both parties. Negative examples represent contracts deemed unacceptable by at least one of the parties. Here is an example of an acceptable labor-management contract:

Duration of the contract = 2 years
Wage increase in the first year = 7.5%
Wage increase in the second year = 3.5%
Cost-of-living allowance = unknown
Hours of work per week = 38
Pension offer = none
Stand-by pay = $0.12/hr
Shift differential = second shift is paid 25% more than first shift
Educational allowance is offered
Holidays per year = 11 days
VacLength = better than average in the industry
Long term disability insurance = offered by the employer
50% dental insurance cost = covered by the employer
Bereavement leave = available
Employer-sponsored health plan = not mentioned
The above description is represented as the following VL1 rule:

[Dur = 2] [Wage1 = 7.5] [Wage2 = 3.5] [Cola = unknown] [Workhours = 38] [Pension = none] [StbyPay = 12] [ShiftDff = 25] [Educallw = yes] [Hlds = 11] [VacLen = better] [LngTrmDisbl = true] [Dntl-ins = half] [Bereavement = yes] [EmpHlthPln = unknown] ::> [Contract Class = acceptable]

In the rule above, and in the rules that follow, the following abbreviations are used:

StbyPay for "Stand-by pay"
VacLen for "Vacation length"
Hlds for "Holidays per year"
LngTrmDisbl for "Long term disability insurance"
EmpHlthPln for "Employer-sponsored health plan"
ShiftDff for "Shift differential"
Contract Class for "Contract classification"

Also, for simplicity, the conjunction is represented by concatenation. The training set consisted of 18 positive and 9 negative examples of contracts; the testing set consisted of 19 positive and 11 negative examples.

US Congress voting record. The data regarding the US Congress voting record were the same as those used by Lebowitz (1987) in his experiments on conceptual clustering. The data represent the 1981 voting records of 100 selected representatives (50 in the training set and 50 in the testing set). The problem was to learn descriptions discriminating between the voting records of Democrats and Republicans. Below is an example of the voting record of a Democrat in the US Congress:

Draft registration = no
Ban aid to Nicaragua = no
Cut expenditure on MX missiles = yes
Federal subsidy to nuclear power stations = yes
Subsidy to national parks in Alaska = yes
Fair housing bill = yes
Limit on PAC contributions = yes
Limit on food stamp program = no
Federal help to education = no
State = north east
Population = large
Occupation = unknown
Cut in Social Security spending = no
Federal help to Chrysler Corp. = vote not registered
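As an aside for implementers, such a VL1 rule is essentially a conjunction of attribute tests, and strict matching reduces to checking every condition. The sketch below uses a simplified subset of the contract attributes; the data structures are our own, not those of AQ15 or POSEIDON.

```python
# A conjunctive VL1-style rule: attribute -> set of admissible values
# (a set with several elements would encode an internal disjunction).
rule = {
    "Dur": {2},
    "Wage1": {7.5},
    "Hlds": {11},
    "VacLen": {"better"},
}

def strict_match(rule, example):
    # An example strictly matches the rule iff every condition is satisfied.
    return all(example.get(attr) in values for attr, values in rule.items())

contract = {"Dur": 2, "Wage1": 7.5, "Hlds": 11, "VacLen": "better"}
print(strict_match(rule, contract))   # True
```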
A Description of Experiments

For each problem domain, the experiments involved the following steps:

1. Learn a complete and consistent description from the training examples (by the AQ15 program).

2. Determine the top rule description from the above description using the TRUNC method (Michalski et al., 1986). The top rule description consists of the single rule that covers the maximum number of positive examples among all the rules in the complete and consistent description. Such a description is easy to determine, because AQ15 generates rules together with measures indicating the number of examples covered totally and uniquely by each rule, denoted the t-weight and the u-weight of a rule, respectively (see below). In the experiments, one top rule description was generated for the positive concept examples, and one for the negative examples (the latter from a complete and consistent description of the negative examples). An instance was classified as belonging to the concept if it best matched the top rule description of the positive examples, and was rejected if it best matched the top rule description of the negative examples. If both descriptions were matched with roughly the same degree, the instance was classified as "no match." Learning the top rule description, and using it with flexible matching, represents a simple but important version of the two-tiered concept learning approach (Michalski, 1990).

3. Determine an optimized two-tiered description from the complete and consistent description using the TRUNC-SG procedure.

4. Determine descriptions of the given concepts using other methods, specifically, variants of the exemplar-based learning approach and the decision tree learning program ASSISTANT.

5. Test the performance of all the generated descriptions on the testing examples.

To illustrate the difference between the complete and consistent descriptions, the top rule descriptions, and the optimized descriptions created by POSEIDON, the figures below show a sample of these descriptions in the Labor Management domain. Figure 3 shows a complete and consistent description produced by AQ15. In the figure, t (the t-weight) is the total number of examples covered by a rule, and u (the u-weight) is the number of examples uniquely covered by the rule.
Figure 3. Complete and Consistent Descriptions Generated by AQ15.
By selecting from each description the rule with the largest t-weight, the following top rule descriptions were obtained (Figure 4):

BCR:
{[dur > 1] & [wg_incr_yr2 > 3%] & [#hlds > 10] ::> [Contract Class = acceptable]   (t = 17, u = 17)}
{[wg_incr_yr1 = 2..4%] & [#hlds < 10] & [vacation = AVG] ::> [Contract Class = unacceptable]   (t = 5, u = 5)}

ICI: Flexible matching

Figure 4. Top Rule Descriptions Obtained by the TRUNC Method.
By optimizing the complete and consistent description using the TRUNC-SG method, and acquiring the ICI rules from an expert, the following optimized two-tiered description was obtained (Figure 5):

BCR:
[wage_incr_yr1 > 4.5%] or
[wage_incr_yr2 > 3.0%] or
[#hlds > 9] or
[vacation > AVG]
::> [Contract Class = acceptable]

[wage_incr_yr1 < 4.0%] & [#hlds < 10] or
[wage_incr_yr2 < 4.0%] & [vacation < average] or
[Dur = 1] & [wage_incr_yr1 < 4.0%] or
[wage_incr_yr2 < 3.0%]
::> [Contract Class = unacceptable]

ICI: Flexible matching plus deductive matching using the rules:

[wage_incr_yr1 > 5.5%] & [vacation < average] ::> [Contract Class = acceptable]
[wage_incr_yr1 ≤ 3%] & [wage_incr_yr2 < wage_incr_yr1] ::> [Contract Class = unacceptable]
[wage_incr_yr1 ≤ 4%] & [wage_incr_yr2 < 3%] & [hours_work > 40] & [pension = empl_contr] ::> [Contract Class = unacceptable]

Figure 5. Optimized Two-tiered Descriptions Obtained by POSEIDON.

During the BCR description optimization process, the system determined the training events that were incorrectly classified by the base representation. An expert was asked to formulate rules explaining these examples (the ICI rules in Figure 5). For example, the first ICI rule for an unacceptable contract (Figure 5) describes contracts with a wage increase in the first year lower than or equal to 3%, and an even lower increase in the second year. In such circumstances, the holiday and vacation time do not matter, and the contract is classified as unacceptable (by the union). As one can see, the optimized BCR descriptions are significantly simpler than the complete and consistent descriptions generated by AQ15.
They also seem to represent the most important characteristics of the labor-management contracts. Specifically, a contract is acceptable when it offers a significant wage increase (the first two rules in Figure 5), or it offers many holiday days, or the vacation time is above average.

Results From Testing POSEIDON and Other Methods

As mentioned earlier, the experiments tested POSEIDON and three other methods, specifically, variants of exemplar-based learning, the method for learning consistent and complete descriptions, the method for generating top rule descriptions, and a method for generating pruned decision trees. All of these methods were employed to learn a concept description from the same set of training examples. All the learned descriptions were then applied to the same testing examples. The performance was evaluated by counting the number of examples that were classified correctly, classified incorrectly, or unclassified.

Tables 4 to 7 present the results of the different experiments. A summary of all the results is shown in Table 8. In all tables, the column "Correct" specifies the percentage of the testing events that were correctly classified, and the column "No_Match" specifies the percentage of unclassified examples (i.e., the examples that did not match any description to a sufficient degree); the remaining examples were classified incorrectly. To provide an estimate of the complexity of the descriptions learned, the tables also list the number of conditions and rules in each description. In the case of pruned decision trees, the table lists the number of nodes and leaves (the number of leaves corresponds to the number of rules that can be directly determined from the decision tree).

Experiment 1 (Table 4) tested a factual description, and variants of the exemplar-based approach (1-, 3- and 5-nearest neighbor match). A factual description is a disjunction of all the training events, and, as such, is obviously complete and consistent with regard to the training set. The first part of Experiment 1 tested the factual description on the testing examples using the strict match method. In such a method, a testing example must exactly match one of the training examples to be classified.
In this case, obviously, the description had no predictive power. It produced No_Match answers for all the testing examples of the labor contract data, and for 96% of the testing examples of the congressional voting data (two examples were the same in the training and testing sets).

Simple Exemplar-based Description
Labor management problem (Labor): 27 rules and 432 conditions
Congress problem (Congress): 51 rules and 969 conditions

                            Correct             No_Match
                        Labor   Congress    Labor   Congress
Strict Match
  Training Set           100%     100%        0%       0%
  Testing Set              0%       4%      100%      96%
1-Nearest Neighbor
  Training Set           100%     100%        0%       0%
  Testing Set             77%      86%        0%       0%
3-Nearest Neighbors
  Training Set           100%     100%        0%       0%
  Testing Set             83%      84%        0%       0%
5-Nearest Neighbors
  Training Set           100%     100%        0%       0%
  Testing Set             80%      84%        0%       0%
Table 4. Results of Experiment 1.

Subsequent parts of Experiment 1 tested the factual description using the k-nearest neighbor method with different values of k. The method involves determining the k closest (best "fitting") learning examples to the one being classified, and assigning to it the class of the majority of those closest examples. Such a method is equivalent to simple forms of exemplar-based learning. The 1-Nearest Neighbor row lists the results of applying the factual description with a matching method somewhat similar to the one described in (Kibler and Aha, 1987). The only difference is that Kibler and Aha's method uses the maximum function for evaluating a ruleset (disjunction), while our flexible matching uses the probabilistic sum. The method was also tested with k = 3 and k = 5.
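The difference between these two ruleset-evaluation functions can be shown in a few lines. This is a generic sketch: how the per-rule match degrees themselves are computed is a separate matter, and the example degrees are invented.

```python
def max_combine(degrees):
    # Maximum function for a disjunction, as in Kibler and Aha's method.
    return max(degrees)

def prob_sum(degrees):
    # Probabilistic sum for a disjunction: a (+) b = a + b - a*b,
    # folded over the match degrees of all rules in the ruleset.
    total = 0.0
    for d in degrees:
        total = total + d - total * d
    return total

degrees = [0.6, 0.5]          # e.g., two rules matched to degrees 0.6 and 0.5
print(max_combine(degrees))   # 0.6
print(prob_sum(degrees))      # 0.8 -- rewards several partially matching rules
```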
The second experiment used concept descriptions generated by AQ15 without truncation (Table 5). Such descriptions are consistent and complete with regard to the training examples, i.e., they classify all training examples 100% correctly when the strict matching method is used. The flexible matching method did not change this result.

Complete and Consistent Description (No truncation)
Labor-management problem (Labor): 11 rules and 28 conditions
Congress problem (Congress): 10 rules and 32 conditions

                            Correct             No_Match
                        Labor   Congress    Labor   Congress
Strict Match
  Training Set           100%     100%        0%       0%
  Testing Set             80%      86%        3%       0%
Flexible Match
  Training Set           100%     100%        0%       0%
  Testing Set             80%      86%        3%       0%

Table 5. Results of Experiment 2.

For the testing set, the number of correct classifications was relatively high (80-86%), and the same for the strict and flexible matching methods. Flexible matching made no difference, probably due to two factors. Firstly, the complete and consistent descriptions include many specific rules, leaving little space for the "no match" cases (3%), in which flexible matching could help. Secondly, the descriptions consisted only of disjoint rules, as the program was run with the "disjoint cover" parameter. In such a situation, the "multiple match" cases do not occur, and flexible matching cannot help.

The above results are similar to those obtained in the previous experiment, which used an exemplar-based approach (Table 4). The main difference is that the AQ descriptions are much simpler in terms of the number of rules and conditions involved (11 vs. 27 rules in the labor-management problem, and 10 vs. 51 rules in the congressional voting problem). The simpler descriptions allow the system to
be more efficient in the recognition mode.

The third experiment (Table 6) tested the top rule descriptions determined from the above complete and consistent descriptions. As shown in Table 6, the performance of these rules using flexible matching was comparable to that of the complete and consistent descriptions, as well as of the factual descriptions (compare with Tables 4 and 5).

The Top Rule Description (the TRUNC method)
Labor-management problem (Labor): 2 rules and 6 conditions
Congress problem (Congress): 2 rules and 6 conditions

                            Correct             No_Match
                        Labor   Congress    Labor   Congress
Strict Match
  Training Set            52%      62%       48%      38%
  Testing Set             63%      69%       30%      24%
Flexible Match
  Training Set            81%      75%        0%       0%
  Testing Set             83%      85%        0%       0%

Table 6. Results of Experiment 3.

It may be surprising that the top rule descriptions performed better on the testing set than on the training set. This is due to the fact that the training set contained more exceptions than the testing set. The system used the TRUNC method, in which the truncation process removes the rules that cover all but the most typical training examples. The top rule descriptions consist of only one rule per concept, and therefore they are significantly simpler than the factual and the consistent and complete descriptions (they use only 2 vs. 11 vs. 27 rules in the Labor Management problem, and 2 vs. 10 vs. 51 rules in the Congress Voting problem). It is quite revealing that such simple rules performed as well as the much more complex descriptions generated by the previous methods. The
fourth experiment (Table 7) tested the optimized descriptions generated by POSEIDON, i.e., derived by the TRUNC-SG method. The descriptions were tested using flexible matching alone (Flexible Match), and in combination with deductive matching (Deductive Match).

Optimized Description (POSEIDON)
Labor-management problem (Labor): 9 rules and 12 conditions
Congress problem (Congress): 10 rules and 21 conditions

                            Correct             No_Match
                        Labor   Congress    Labor   Congress
Strict Match
  Training Set            63%      84%       37%      16%
  Testing Set             43%      73%       54%      23%
Flexible Match
  Training Set            85%     100%       15%       0%
  Testing Set             83%      92%        4%       0%
Deductive Match
  Training Set            96%      96%        4%       0%
  Testing Set             90%      92%        0%       0%

Table 7. Results of Experiment 4.

For comparison, the performance of these descriptions was also tested using strict match; the latter is a rather impractical combination. As expected, these descriptions used with strict matching gave relatively poor performance. The optimized descriptions (BCR) combined with deductive matching (ICI) gave the best performance (90-92% correct). When used with flexible matching only, the performance was slightly lower. The descriptions are simpler than the complete and consistent descriptions, even though they include the Inferential Concept Interpretation rules. They are, of course, more complex than the top rule descriptions, which do not use any interpretation rules.
For the Labor data, the descriptions applied with deductive matching produced higher performance than when used with flexible matching only (90 vs. 83%).6 For the Congress data, the performance was the same for the two matching methods. This is because the deductive rules were acquired on the training set; in this particular testing set, the D-covered events were the same as the F-covered ones.

Table 8 summarizes the results of the experiments; specifically, it compares the performance and complexity of the descriptions generated by simple exemplar-based methods, the two-tiered descriptions generated by POSEIDON, and the pruned decision trees generated by ASSISTANT (a descendant of Quinlan's ID3 program; Cestnik et al., 1987).

ASSISTANT was applied to the same training and testing data that were used in the previous experiments (whose results were presented in Tables 4, 5, 6 and 7). The decision trees obtained by ASSISTANT were optimized using a tree-pruning mechanism (Cestnik et al., 1987). This mechanism is compared with the TRUNC-SG method in the next section. The factual description was applied with the flexible matching function. The complexity of a rule-based description was measured by the number of rules (#Rules) and the number of conditions (#Conds). The complexity of a decision tree was measured by the number of leaves (#Leaves) and the number of nodes (#Nodes).

6 This difference, for the Labor Contract data, is not statistically significant (χ²). Nevertheless, we think that there are other reasons to prefer deductive matching over flexible matching. Deductive classification is based on rules and knowledge-based inference, and is therefore easier for humans to understand. The rules may be modified locally, while changing the flexible matching function is difficult and produces uncontrolled, global consequences. In other words, examples that are correctly recognized through ICI deductive rules are also explained ipso facto in terms of domain knowledge. The same cannot be said of examples correctly recognized by flexible matching, which is a knowledge-independent distance measure. To reflect this, the GDQ measure assigns a higher score to a description with deductive matching than with flexible matching.

Table 8. Summary of the Results of Testing Descriptions Generated by Different Methods.
In the above experiments, for both problem domains, the learning method implemented in POSEIDON produced descriptions that are simpler (except for the top rule descriptions), and that also perform better on the testing data, than the other tested methods. Being simpler, these descriptions are also easier to understand, and have a lower evaluation cost. The meaning of the concept defined by such descriptions depends on the base representation (i.e., a TRUNC-SG-optimized description learned from examples) and the inferential concept interpretation (consisting of an a priori defined flexible matching procedure and a set of deductive rules formulated by the expert). Using rules in the inferential concept interpretation has the advantage that exceptional cases are easy to explain. In the current method, the system determines which examples are exceptional (those that are misclassified by the base representation); the expert analyzes them and determines the rules for the ICI.

The top rule descriptions were significantly simpler than any other descriptions, but performed somewhat worse than the optimized descriptions and the decision trees. Depending on the desired trade-off between accuracy and simplicity, the top rule description or the optimized description can be taken as the base representation of the concept being defined.

The Role of Parameters and Related Issues

POSEIDON has many parameters that can be controlled by a user. On the surface, this might be considered a disadvantage. In our view, a learning system that allows the user to explicitly modify parameters that affect the learning processes (but which are not just method-dependent) is to be preferred over a system that does not explicitly define such parameters. The point is that in the latter systems these parameters are defined only implicitly, by the assumptions and the structure of the method. For example, many systems do not take into consideration the typicality of examples. In POSEIDON, this is equivalent to the assumption that the typicality of all examples is equal to the default value 1. As another example, consider the cost of measuring the values of attributes. If a learning program does not have parameters representing such costs, then
this is equivalent to an assumption that all the costs are the same (which in reality is often not true). By being able to control such learning parameters, the user can produce results that better fit the task at hand. For example, for some tasks the accuracy of the descriptions may be the decisive criterion, while for others the description simplicity may be of equal concern.

An important problem to be investigated is the sensitivity of POSEIDON to its various parameters. While a comprehensive answer to this problem goes beyond the scope of this paper, we report below a preliminary sensitivity analysis regarding the parameters controlling the trade-off between the description accuracy and simplicity. These parameters are considered to have the most important effect on the performance of the learned descriptions. Specifically, they are the tolerances in the lexicographic evaluation functional measuring the description quality. To explain their role, let us briefly review the description quality measure. This measure combines several criteria, such as the accuracy, the simplicity, and the cost. Each criterion is associated with a tolerance interval such that differences within this interval are not considered important. Thus, if the tolerance interval of the accuracy is very narrow, then the accuracy becomes the prevailing criterion in the quality evaluation. On the other hand, if this tolerance interval is wide, the remaining criteria become more significant.

An experiment was performed using the same Congress voting data as used in the experiments reported in Tables 4-7. The training set had 51 examples, and the testing set had 49 examples. The concept to be learned was the voting record of Republicans in the US Congress. The description tested in Table 7 had 10 rules and 21 conditions, and yielded an accuracy of 100% on the training set and 92% on the testing set. The description was obtained using an accuracy tolerance (τ1) value equal to 0.05. To determine the method's sensitivity to this parameter, the accuracy tolerance τ1 was set to the values 0.55, 0.35, 0.02 and 0.005, and for each value the description accuracy was measured. For these accuracy tolerances, the system's performance on the testing set was 88%, 88%, 90%, and 92%,
respectively. Thus, this experiment seems to indicate that the accuracy of the descriptions slowly grows as the tolerance interval on the accuracy in the description quality measure is narrowed, which confirms the intuitive expectation. In general, when the accuracy tolerance interval is wide, the simplicity of the description assumes an important role, yielding performance close to that of the top rule in the two-tiered description. Intermediate values, such as the one used in the experiments presented in Table 7 (τ1 = 0.05), produced the best results, e.g., the performance of 92% on the testing set from the Congress data. In the case of a narrow tolerance interval for the accuracy, the simplicity has a lower impact on the quality of the description. An interesting topic for future research is to systematically investigate the influence of such parameter changes on the performance of the descriptions.7

Another issue that should be explored in the future is the role of the typicality of the learning examples. In the presented method, if the input examples are assigned typicality values, the generated base concept representation will tend to cover the most typical examples, while the inferential concept interpretation will tend to cover the less typical ones. A problem for future investigation is to determine the effect of the typicality on the overall quality of the generated concept descriptions. When typicality information is unavailable, the system itself will assign examples to different classes of typicality. The examples covered by the base representation are classified as typical, those covered by flexible

7 In our experiment, for the small values of τ1 = 0.02 and 0.005, which emphasize the role of accuracy in the measured quality of a description, the performance on the testing set was close or equal to the performance obtained for τ1 = 0.05, and higher than the 86% performance of AQ15 in Table 5. The reason is that in the last two experiments, as in the original experiment in Table 7, it was always possible to find a description that was simpler than the one produced by AQ15, but still 100% correct on the training data. Therefore, by giving more importance to accuracy, the simpler description was preferred, and better performance on the test set was obtained.
matching as nearly-typical, and those covered by the deductive rules as non-typical.8 An interesting experiment would be to compare such classifications with human classifications.

Another interesting issue relates to noise in the data. A preliminary analysis indicates that the proposed method has a significant ability to handle noisy data. Experiments show that noisy examples are usually covered by "light" rules, i.e., rules that cover few examples. By removing such rules from the description, the effect of noise can be significantly reduced (Zhang and Michalski, 1989). Future research should investigate these aspects of the method in greater detail.

RELATED WORK

The research presented here relates to various efforts on learning imprecise concepts, in particular to learning methods generating pruned decision trees (e.g., Quinlan, 1987; Cestnik, Kononenko & Bratko, 1987; Fisher and Schlimmer, 1988). In these methods, a concept description (or a set of descriptions) is represented as a single tree structure ("one tier") that is supposed to account for all concept instances. An unknown instance is classified by following the nodes of the decision tree from the root to the leaf indicating the class. To avoid overfitting, some parts (subtrees) of the originally generated decision tree are pruned away. As a result, such decision trees do not cover some of the training examples. Since the recognition process does not use flexible matching, such pruned trees must always produce some error on the training examples, although the overall performance on new examples may increase.

The two-tiered method avoids overfitting by simplifying the original descriptions, yielding base concept representations that, in the formal logical sense, are usually also incomplete and inconsistent. The two-tiered method, however, can compensate for the lack of coverage or for an

8 This three-way classification of the examples can be viewed as a simple method of learning typicality. A similar feature is available in Cobweb (Fisher, 1987). On the other hand, if the typicality information is available, it is used by POSEIDON to improve the quality of the learned description.
excessive coverage of the first tier (BCR) by the application of the second tier (ICI). This can be done by flexible matching and/or deductive inference rules. The latter are normally unaffected by noise, because they depend on a deeper understanding of the domain. In addition, the presented method takes into consideration the typicality of the examples (if it is available). This feature gives the method additional help in handling noisy examples.

The method presented in (Quinlan, 1987) is based on a hill-climbing approach that first truncates conditions, and then rules. No search is performed; only one alternative truncation is tried at every step. The final result might therefore be far from optimal. By avoiding the search, such a procedure should, however, be significantly faster than the one implemented in POSEIDON. If the speed of learning and the simplicity of the descriptions are of central importance, then the TRUNC method (which determines the top rule descriptions without search) should be applied rather than TRUNC-SG. In the same paper (Quinlan, 1987), other methods for pruning decision trees are also described. Some of these methods require a separate testing set for the simplification phase, and others use the same training set that was used in creating the tree. The simplification phase in POSEIDON can also be done either using the original training set, or using a separate set of examples.

The experiments by Fisher and Schlimmer (1988) on pruning decision trees use a statistical measure to determine the attributes to be pruned. Such measures require a rather large data sample, and thus do not apply well to small training sets. In the two-tiered approach, the training events are analyzed logically, rather than statistically, both in the phase creating a complete and consistent description, and in the optimization phase. Consequently, the two-tiered approach seems better suited for learning from a relatively small number of examples. An interesting possibility for future research is to integrate a statistical measure, such as the one used by Fisher and Schlimmer, into the process of rule learning
and truncating with large data sets.

The system developed by Iba et al. (1988) uses a trade-off measure that is somewhat similar to the general description quality (GDQ) measure proposed in this paper. Our GDQ measure considers more factors. Besides taking into account the typicality of the instances covered by the description, it considers the different types of matching between an instance and a description. Moreover, the simplicity measured by the GDQ depends not only on the number of rules in the description, as in (Iba et al., 1988), but also on the different syntactic features of the description.

The inductive algorithm implemented in CN2 uses a heuristic function to terminate the search during rule construction (Clark & Niblett, 1989). The heuristic is based on an estimate of the noise present in the data. Such pruning of the search space of inductive hypotheses results in rules that may not classify all the training examples correctly, but that perform well on testing data. CN2 can be viewed as an induction algorithm that includes pre-truncation, while the algorithm reported here is based on post-truncation: CN2 applies truncation during rule generation, and POSEIDON applies truncation after rule generation. The advantage of pre-truncation is the efficiency of the learning process. On the other hand, such an approach has difficulty with identifying irrelevant conditions and redundant rules.

The two-tiered method described here can also be viewed as a kind of constructive induction in the sense of (Michalski, 1983). In fact, the whole learned description may include new terms, absent from the examples used for learning. This behavior is also encountered in several other systems (e.g., Sammut and Banerji, 1986; Drastal, Czako & Raatz, 1989). However, constructive learning in POSEIDON is due to the second tier, which is based on domain knowledge characterizing the non-typical examples. This is different from using domain knowledge to rewrite or augment the whole training set (e.g., Rouveirol, 1991), or to generate new attributes by a data-driven approach (Bloedorn & Michalski, 1992) or a hypothesis-driven approach (Wnek and Michalski, 1991).

The exemplar-based learning system PROTOS (Bareiss, 1989) is
related to the approach presented here in that it also involves forming a base concept description and acquiring the matching knowledge via explanations of training events provided by a teacher. There are, however, major differences: 1) PROTOS stores exemplars as base concept descriptions, whereas POSEIDON generates simple and easy-to-understand generalizations as base concept descriptions; 2) PROTOS uses domain knowledge in classifying all new cases, whereas POSEIDON uses the Inferential Concept Interpretation rules only for classifying exceptions; 3) during the learning process, PROTOS asks the teacher for explanations of all exemplars, whereas POSEIDON asks only for explanations of the exceptions.

The problem of using some typicality measure of examples has so far not been given much attention in machine learning, although there have been attempts in this direction. For example, Michalski and Larson (1978) introduced the idea of "outstanding representatives" of a concept to focus the learning process on the most significant examples. In cognitive science, the concept of the typicality of examples has been studied extensively (e.g., Rosch and Mervis, 1975; Smith and Medin, 1981). The concept of two-tiered representation has naturally led us to propose a precise definition of representative, nearly-representative and exceptional examples, namely, as those that are covered by the first tier, by the second tier's procedure for flexible matching, and by the second tier's inference rules, respectively. The ideas of two-tiered representation are also consistent with recent research on two-stage category construction (Ahn and Medin, 1992).

To summarize, there are several major differences between the presented method and the related research described in the literature. First, the method has the ability to recover from the loss of coverage due to description truncation by using the second tier. Specifically, the procedure of flexible matching or the deductive rules are used to cover examples not covered explicitly. As has been demonstrated experimentally, this ability often leads to a significant reduction of the concept descriptions and, at the same time, to an improvement of their predictive power. Second, the description reduction is done by independently performing both generalization and
specialization operators. Third, any part of the description may be truncated in the simplification process, not only specific parts (as, e.g., in decision tree truncation). Fourth, the method is able to take into account the typicality of the examples. Finally, the method uses a general description quality measure, which takes into consideration a number of different aspects of a description.

To relate the presented two-tiered approach to other basic machine learning approaches, Table 9 characterizes them in terms of the type of concept representation used and the kind of matching applied for classification.
Method              Representation    Matching
Simple induction    General           Precise
Exemplar-based      Specific          Inferential
Two-tiered          General           Inferential
Table 9. A comparison of the two-tiered method with simple inductive and exemplar-based methods.

SUMMARY AND OPEN PROBLEMS

The most significant aspect of the presented method is that it represents concepts in a two-tiered fashion, in contrast to traditional learning methods that represent concepts by a monolithic structure. In this representation, the first tier, the base concept representation (BCR), captures the explicit and common concept meaning, and the second tier, the inferential concept interpretation (ICI), defines allowable modifications of the base meaning and exceptions. Typical concept instances match the BCR, and thus can be recognized efficiently. Such a two-tiered representation is particularly suitable for learning flexible concepts, i.e., concepts that lack a precise definition and are context-dependent.

In the POSEIDON system that implements the method, the BCR is learned in two steps. First, a complete and consistent description is learned by a conventional learning program (AQ15). Next, this description is optimized according to a general description quality measure.
This is done by a double-level search process that uses both generalization and specialization operators. The general description quality measure takes into account not only properties of the BCR, but also of the ICI. This is done by measuring the complexity and accuracy of the total description.

The ICI has two components: one specifies a flexible matching function, and the second specifies inference rules for handling exceptions and context-dependency. The ICI rules can be of two types. Rules of the first type extend the meaning of the concept, while rules of the second type contract this meaning. Rules of the first type are employed when an instance is covered neither by the BCR (not S-covered) nor by the flexible matching function (not F-covered). Rules of the second type are used when an unknown instance matches the base representation of more than one concept, or when concept membership has to be confirmed. In both cases, the rules are used deductively. An advantage of using rules for matching over other matching methods is that the rules can serve as an explanation of why a given instance does or does not belong to the concept.

The experimental results have strongly supported the hypothesis that two-tiered concept descriptions can be simpler and easier to understand than "single-tier" descriptions. In the experiments, these descriptions also had greater prediction accuracy, i.e., performed better on new examples. For example, the two-tiered descriptions obtained for the acceptable labor-management contracts gave a performance of over 90% correct using only about 9 rules. In contrast, the best performance of a simple exemplar-based method gave 80% correct predictions on new examples and used 27 rules, and the corresponding pruned decision tree performed at 86% and had 29 leaves (each of which may be viewed as corresponding to one rule). The system also performed better than the previous method based on the TRUNC procedure in terms of prediction accuracy (80%), but at the cost of a more complex concept description. In addition, two-tiered descriptions are relatively easy to understand, and can easily represent explicit domain knowledge.
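To make the recognition process just described concrete, the following is a minimal sketch, in Python, of how a two-tiered classification decision could be organized. All names and the toy concept are illustrative inventions (and flexible matching is simplified to a boolean predicate); this is a sketch of the scheme, not the POSEIDON implementation:

    # A self-contained sketch of two-tiered recognition. A "concept"
    # holds a strict BCR predicate, a flexible-match predicate, and
    # ICI extension rules, all as plain Python functions. The names
    # are hypothetical, chosen only to mirror the text above.
    from dataclasses import dataclass, field
    from typing import Callable, List

    @dataclass
    class Concept:
        bcr_match: Callable[[dict], bool]        # first tier (BCR)
        flexible_match: Callable[[dict], bool]   # second tier, flexible matching
        extension_rules: List[Callable[[dict], bool]] = field(default_factory=list)

    def recognize(instance, concept):
        if concept.bcr_match(instance):          # S-covered: typical instance
            return True
        if concept.flexible_match(instance):     # F-covered: close to base meaning
            return True
        # ICI rules of the first type extend the concept meaning
        # to cover exceptional (non-typical) instances.
        return any(rule(instance) for rule in concept.extension_rules)

    # Toy example: a "chair" concept with an exception rule for a
    # legless hanging design (a made-up illustration).
    chair = Concept(
        bcr_match=lambda x: x["legs"] == 4 and x["has_seat"],
        flexible_match=lambda x: 3 <= x["legs"] <= 5 and x["has_seat"],
        extension_rules=[lambda x: x["legs"] == 0 and x["has_seat"] and x["hangs"]],
    )
    print(recognize({"legs": 0, "has_seat": True, "hangs": True}, chair))  # True

Rules of the second type (contraction rules) would be applied deductively in the same style, to reject an instance that matches the base representation of more than one concept.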
The presented method is different in several significant ways from the earlier method of learning two-tiered representations (Michalski et al., 1986). The flexible matching procedure is used not only in the testing phase, but also in the learning phase. In addition to a flexible matching function, the method employs rules for extending or contracting the concept meaning. The earlier TRUNC method used only one specialization operator (rule removal), while the TRUNC-SG method employed in POSEIDON uses two generalization and two specialization operators. The price for this is that the new method is significantly more complex.

There are many interesting problems for future research. An especially interesting problem is how to integrate the description optimization phase with the initial description generation phase (done by AQ). Another interesting problem is how to learn second-tier rules from examples. In the initial method developed by Plante and Matwin (1990), the inferential concept interpretation rules are learned by a chunking process in situations where multiple explanations of positive or negative training events are provided.

Future research should also address the application of constructive induction (Michalski, 1983) in the process of learning flexible concepts. In constructive induction, background knowledge is used to construct new attributes and/or higher-level descriptors. As a result, the produced descriptions can capture the salient features of the concept, and can be simpler and more comprehensible. The ideas of constructive induction seem to be very relevant to the method proposed. For example, through constructive induction the system may be able to fold several rules into a single one, or prevent the removal of relevant rules.

The current system does not address the problem of dynamically emerging hierarchies of concepts. The system only learns one concept at a time, and concepts do not change or split as new examples become available. Another open issue is the ability of the system to reorganize itself. The distribution of knowledge between the Base Concept Representation and the Inferential Concept Interpretation should be determined by the performance of the system on large testing sets. If it turns out, for instance, that some inferential concept interpretation rules
are used very often, then they could be compiled into the base representation. Further research is needed on the role and the importance of different parameters used in the method, and on the trade-offs that they can control.

This paper has focused on learning attributional descriptions, which characterize entities by attributes and ignore their structural properties. Although such descriptions are quite powerful and sufficient for many practical problems, there are applications that require structural descriptions, which characterize entities as systems of components and the relationships among these components. Developing a method for learning two-tiered structural descriptions is therefore an important topic for future research. A relatively straightforward solution to the above problem would be to replace the AQ15 program by a version of INDUCE (e.g., Michalski, 1983) for learning the initial complete and consistent description. The basic search procedure would essentially be the same, but would deal with a more complex knowledge representation. A structural representation would allow additional description modification operators; the descriptions could thus be modified in more ways, which would increase both the flexibility and the complexity of the search process. Also, the computation of the general quality of descriptions would require proper modification, and flexible matching would need to be extended to handle structural concept descriptions. As practical problems frequently require only attributional descriptions, and the method is domain-independent, POSEIDON has the potential to be useful for concept learning and knowledge acquisition in a wide range of real-world applications.

ACKNOWLEDGMENTS

The authors thank Hugo de Garis, Attilio Giordana, Ken Kaufman, Elizabeth Marchut-Michalski, Doug Medin, Franz Oppacher, Lorenza Saitta, Gail Thornburg, and Gheorghe Tecuci for useful comments and criticisms. The authors express special gratitude to Alan Meyrowitz and Susan Chipman for their support of the research described in this chapter.
In addition, Alan Meyrowitz provided detailed comments on this chapter that helped in the preparation of the final version. The authors thank Zbig Koperczak for his help in acquiring the data used in the experiments.

This research was done in the Artificial Intelligence Center of George Mason University. The research activities of the Center are supported in part by the Defense Advanced Research Projects Agency under the grants administered by the Office of Naval Research, No. N00014-87-K-0874 and No. N00014-91-J-1854, in part by the Office of Naval Research under grants No. N00014-88-K-0397, No. N00014-88-K-0226, No. N00014-90-J-4059, and No. N00014-91-J-1351, and in part by the National Science Foundation under grant No. IRI-9020266. The second author was supported in part by the Italian Ministry of Education (ASSI), and the third author was supported in part by the Natural Sciences and Engineering Research Council of Canada.

REFERENCES

Ahn, W. & Medin, D.L. (1992). A two-stage model of category construction. Cognitive Science, 16, 81-121.
Bareiss, R. (1989). Exemplar-based knowledge acquisition. Academic Press.
Bergadano, F. & Giordana, A. (1989). Pattern classification: an approximate reasoning framework. International Journal of Intelligent Systems.
Bergadano, F., Matwin, S., Michalski, R.S. & Zhang, J. (1988a). Learning flexible concept descriptions using a two-tiered knowledge representation: Part 1 - ideas and a method. Reports of Machine Learning and Inference Laboratory, MLI-88-4, Center for Artificial Intelligence, George Mason University.
Bergadano, F., Matwin, S., Michalski, R. S. & Zhang, J. (1988b). Measuring quality of concept descriptions. Proceedings of the Third European Working Session on Learning, Glasgow, 1-14.
Bloedorn, E. & Michalski, R.S. (1992). Data-driven constructive induction AQ17: A method and experiments. Reports of Machine Learning and Inference Laboratory, Center for Artificial Intelligence, George Mason University (to appear).
Cheeseman, P., Kelly, J., Self, M., Stutz, J., Taylor, W. & Freeman, D. (1988). AutoClass: A Bayesian classification system. Proceedings of the Fifth International Conf. on Machine Learning, Ann Arbor, 54-64.
Cestnik, B., Kononenko, I. & Bratko, I. (1987). ASSISTANT 86: A knowledge-elicitation tool for sophisticated users. Proceedings of the 2nd European Workshop on Learning, 31-45.
Clark, P. & Niblett, T. (1989). The CN2 induction algorithm. Machine Learning, Vol. 3, No. 4, 261-283.
Collins, A. M. & Quillian, M. R. (1972). Experiments on semantic memory and language comprehension. In L. W. Gregg (Ed.), Cognition, Learning and Memory. John Wiley.
DeJong, G. & Mooney, R. (1986). Explanation-based learning: An alternative view. Machine Learning, Vol. 1, No. 2.
Dietterich, T. (1986). Learning at the knowledge level. Machine Learning, Vol. 1, No. 3, 287-315.
Dietterich, T. & Flann, N. (1988). An inductive approach to solving the imperfect theory problem. Proceedings of the Explanation-Based Learning Workshop, Stanford University, 42-46.
Drastal, G., Czako, G. & Raatz, S. (1989). Induction in an abstraction space: A form of constructive induction. Proceedings of IJCAI 89, Detroit, 708-712.
Fisher, D. (1987). Knowledge acquisition via incremental conceptual clustering. Machine Learning, Vol. 2, 139-172.
Fisher, D. H. & Schlimmer, J. C. (1988). Concept simplification and prediction accuracy. Proceedings of the Fifth Int'l Conf. on Machine Learning, Ann Arbor, 22-28.
Iba, W., Wogulis, J. & Langley, P. (1988). Trading off simplicity and coverage in incremental concept learning. Proceedings of the Fifth Int'l Conf. on Machine Learning, Ann Arbor, 73-79.
Kedar-Cabelli, S. T. & McCarthy, L. T. (1987). Explanation-based generalization as resolution theorem proving. Proceedings of the 4th Int. Workshop on Machine Learning, Irvine.
Kibler, D. & Aha, D. (1987). Learning representative exemplars of concepts. Proceedings of the 4th International Workshop on Machine Learning, Irvine.
Kolodner, J. (Ed.) (1988). Proceedings of the Case-Based Reasoning Workshop, DARPA, Clearwater Beach, FL.
Lakoff, G. (1987). Women, Fire, and Dangerous Things: What Categories Reveal about the Mind. University of Chicago Press.
Lebowitz, M. (1987). Experiments with incremental concept formation: UNIMEM. Machine Learning, Vol. 2, No. 2.
Michalski, R.S. (1975). Variable-valued logic and its applications to pattern recognition and machine learning. In D. C. Rine (Ed.), Computer science and multiple-valued logic theory and applications. North-Holland Publishing Co., 506-534.
Michalski, R.S. & Larson, J. B. (1978). Selection of most representative training examples and incremental generation of VL1 hypotheses: The underlying methodology and the description of programs ESEL and AQ11. Reports of the Department of Computer Science, TR 867, University of Illinois at Urbana-Champaign.
Michalski, R.S. (1983). A theory and methodology of inductive learning. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga (now Morgan Kaufmann).
Michalski, R.S. & Stepp, R.E. (1983). Learning from observation: conceptual clustering. In R.S. Michalski, J.G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga (now Morgan Kaufmann).
Michalski, R. S., Mozetic, I., Hong, J. & Lavrac, N. (1986). The multipurpose incremental learning system AQ15 and its testing application to three medical domains. Proceedings of the 5th AAAI, 1041-1045.
Michalski, R. S. (1989). Two-tiered concept meaning, inferential matching and conceptual cohesiveness. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogy. Cambridge: Cambridge University Press.
Michalski, R. S. & Ko, H. (1988). On the nature of explanation, or why did the wine bottle shatter? AAAI Symposium: Explanation-Based Learning, Stanford University, 12-16.
Michalski, R. S. (1987). How to learn imprecise concepts: A method employing a two-tiered knowledge representation for learning. Proceedings of the Fourth International Workshop on Machine Learning, Irvine, CA, 50-58.
Michalski, R. S. (1990). Learning flexible concepts: fundamental ideas and a methodology. In Y. Kodratoff and R. S. Michalski (Eds.), Machine Learning: An artificial intelligence approach, Vol. III. San Mateo, CA: Morgan Kaufmann Publishers.
Mooney, R. & Ourston, D. (1989). Induction over the unexplained: integrated learning of concepts with both explainable and conventional aspects. Proceedings of the 6th Int'l Workshop on Machine Learning, Ithaca, NY, 5-7.
Minsky, M. (1975). A framework for representing knowledge. In P. Winston (Ed.), The Psychology of Computer Vision.
Mitchell, T. M., Keller, R. & Kedar-Cabelli, S. (1986). Explanation-based generalization: A unifying view. Machine Learning, Vol. 1, No. 1, 11-46.
Mitchell, T. M. (1977). Version spaces: an approach to concept learning. Ph.D. Dissertation, Stanford University.
Plante, B. & Matwin, S. (1990). Learning second tier rules by chunking of multiple explanations. Research Report, Department of Computer Science, University of Ottawa.
Prieditis, A. E. & Mostow, J. (1987). PROLEARN: Towards a Prolog interpreter that learns. Proceedings of IJCAI 87, Milan, 494-498.
Quinlan, J. R. (1987). Simplifying decision trees. Int. Journal of Man-Machine Studies, Vol. 27, 221-234.
Robinson, J. A. & Sibert, E. E. (1982). LOGLISP: An alternative to Prolog. Machine Intelligence, Vol. 10, J. E. Hayes & D. Michie (Eds.), 399-419.
Rosch, E. & Mervis, C. B. (1975). Family resemblances: Studies in the internal structure of categories. Cognitive Psychology, Vol. 7, 573-605.
Rouveirol, C. (1991). Deduction and semantic bias for inverse resolution. Proceedings of IJCAI 91, Sydney, Australia.
Sammut, C. & Banerji, R.B. (1986). Learning concepts by asking questions. In R.S. Michalski, J. G. Carbonell & T.M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga (now Morgan Kaufmann Publishers).
Smith, E. E. & Medin, D. L. (1981). Categories and concepts. Harvard University Press.
Sowa, J. F. (1984). Conceptual structures. Addison Wesley.
Sturt, E. (1981). Computerized construction in Fortran of a discriminant function for categorical data. Applied Statistics, Vol. 30, 213-222.
Watanabe, S. (1969). Knowing and guessing: a formal and quantitative study. John Wiley.
Weber, S. (1983). A general concept of fuzzy connectives, negations and implications based on t-norms and t-conorms. Fuzzy Sets and Systems, Vol. 11, 115-134.
Winston, P. H. (1975). Learning structural descriptions from examples. In P. Winston (Ed.), The Psychology of Computer Vision. McGraw-Hill.
Wnek, J. & Michalski, R.S. (1991). Hypothesis-driven constructive induction in AQ17: Method and experiments. Reports of Machine Learning and Inference Laboratory, Center for Artificial Intelligence, George Mason University.
Zadeh, L. A. (1974). Fuzzy logic and its applications to approximate reasoning. Information Processing. North Holland, 591-594.
Zhang, J. & Michalski, R. S. (1989). Rule optimization via SG-trunc method. Proceedings of the Fourth European Working Session on Learning, Glasgow.
Chapter 6

Competition-Based Learning1

John J. Grefenstette, Kenneth A. De Jong, William M. Spears
Navy Center for Applied Research in Artificial Intelligence
Information Technology Division
Naval Research Laboratory
Washington, DC 20375-5000

Abstract

This paper summarizes recent research on competition-based learning procedures performed by the Navy Center for Applied Research in Artificial Intelligence at the Naval Research Laboratory. We have focused on a particularly interesting class of competition-based techniques called genetic algorithms. Genetic algorithms are adaptive search algorithms based on principles derived from the mechanisms of biological evolution. Recent results on the analysis of the implicit parallelism of alternative selection algorithms are summarized, along with an analysis of alternative crossover operators. Applications of these results in practical learning systems for sequential decision problems and for concept classification are also presented.

INTRODUCTION

One approach to the design of more flexible computer systems is to extract heuristics from existing adaptive systems. We have focused on a class of learning systems that use competition-based procedures, called genetic algorithms (GAs). GAs are based on principles derived from one of the most impressive examples of adaptation available: the adaptation achieved by natural systems to their environment through the mechanisms of biological evolution. The principles were first elucidated in a computational framework by John Holland (1975). Holland's analysis of natural adaptive systems shows that biological evolution embodies a sophisticated kind of generate-and-test strategy that rapidly identifies and exploits regularities in the environment. By extracting these processes from the specific context of genetics, the algorithms can be applied to a wide range of optimization and learning problems. GAs have in fact been applied successfully to routing and scheduling problems, machine vision, engineering design optimization, gas pipeline control systems, and others. In the area of machine learning, GAs have been used to learn rules for sequential decision problems as well as to learn classification rules from examples (De Jong, 1990). GAs have also been widely used for learning both the topology and the weights of neural nets.

1 Sponsored in part by the Office of Naval Research under Work Request N00014-91WX24011.

Our research efforts for the past few years have fallen into two main categories: the analysis of genetic algorithms, and the application of genetic algorithms to machine learning problems. This article will focus primarily on recent progress in the analysis of genetic algorithms. The remainder of the article is organized as follows: The next section contains a brief tutorial on genetic algorithms. This is followed by two sections that outline recent progress in the analysis of two fundamental topics in the field: how knowledge structures are selected for reproduction, and how the selected structures are recombined to create new plausible knowledge structures. These sections are followed by a brief overview of our work in developing machine learning systems based on genetic algorithms. The final section describes the directions of current work.
OVERVIEW OF GENETIC ALGORITHMS

Genetic algorithms are adaptive search procedures based on principles derived from the dynamics of natural population genetics. GAs are distinguished from other search methods by the following features:

• A population of structures that can be interpreted as candidate solutions to the given problem.
• The competitive selection of structures for reproduction, based on each structure's fitness as a solution to the given problem.
• Idealized genetic operators that alter the selected structures in order to create new structures for further testing.

These features enable the GA to exploit the accumulating knowledge obtained during the search in such a way as to achieve an efficient balance between the need to explore new areas of the search space and the need to focus on high performance regions of the space. This section provides a general overview of a simple form of genetic algorithm. For more detailed
discussions, see (Holland, 1975; Goldberg, 1989).

    procedure GA
    begin
        t = 0;
        initialize P(t);
        evaluate structures in P(t);
        while termination condition not satisfied do
        begin
            t = t + 1;
            select P(t) from P(t-1);
            alter structures in P(t);
            evaluate structures in P(t);
        end
    end.

Figure 1: A Genetic Algorithm

A genetic algorithm simulates the dynamics of population genetics by maintaining a knowledge base of structures that evolves over time in response to the observed performance of its structures in their operational environment. A specific interpretation of each structure (e.g., as a collection of parameter settings, a condition/action rule, etc.) yields a point in the space of alternative solutions to the problem at hand, which can then be subjected to an evaluation process and assigned a measure called its fitness, reflecting its potential worth as a solution. The search proceeds by repeatedly selecting structures from the current knowledge base on the basis of fitness and applying idealized genetic search operators to these structures to produce new structures (offspring) for evaluation. The basic paradigm is shown in Figure 1, and is explained in more detail below.

At iteration t, the GA maintains a population of structures P(t) representing candidate solutions to the given problem. Population P(0) may be initialized using whatever knowledge is available about possible solutions. In the absence of such knowledge, the initial population should represent a random sample of the search space. Each structure is evaluated and assigned a measure of its fitness as a solution to the problem at hand. When each structure in the population has been evaluated, a new population of structures is formed in two steps.
First, structures in the current population are selected to be reproduced on the basis of their relative fitness. That is, high performing structures may be chosen several times for replication and poorly performing structures may not be chosen at all. In the absence of any other mechanisms, the resulting selective pressure would cause the best performing structures in the initial knowledge base to occupy a larger and larger proportion of the knowledge base over time. Next, the selected structures are altered using idealized genetic operators to form a new set of structures for evaluation. The primary genetic search operator is the crossover operator, which combines the features of two parent structures to form two similar offspring. There are many possible forms of crossover. The simplest version operates by swapping corresponding segments of a string or list representation of the parents. For example, if the parents are represented by the lists

    (a1 a2 a3 a4 a5) and (b1 b2 b3 b4 b5)

then crossover might produce the offspring

    (a1 a2 b3 b4 b5) and (b1 b2 a3 a4 a5).
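To make the paradigm of Figure 1 and the segment-swapping operator concrete, here is a minimal, self-contained sketch in Python, specialized to bit-string structures with proportional selection, 1-point crossover, and bit-flip mutation. The objective function (counting ones) and all parameter values are illustrative choices, not part of the algorithm itself:

    import random

    def evaluate(x):
        # Illustrative objective: count the ones in the bit string.
        return sum(x)

    def select(population, fitnesses):
        # Proportional (fitness-weighted) selection of a new population.
        return [random.choices(population, weights=fitnesses)[0][:]
                for _ in population]

    def alter(population, p_cross=0.6, p_mut=0.01):
        # Pair up structures for 1-point crossover, then mutate pointwise.
        for i in range(0, len(population) - 1, 2):
            if random.random() < p_cross:
                a, b = population[i], population[i + 1]
                cut = random.randrange(1, len(a))
                a[cut:], b[cut:] = b[cut:], a[cut:]   # swap tail segments
        for x in population:
            for j in range(len(x)):
                if random.random() < p_mut:
                    x[j] = 1 - x[j]                   # bit-flip mutation

    def ga(pop_size=20, length=32, generations=50):
        population = [[random.randint(0, 1) for _ in range(length)]
                      for _ in range(pop_size)]
        for _ in range(generations):
            fitnesses = [evaluate(x) for x in population]
            population = select(population, fitnesses)
            alter(population)
        return max(population, key=evaluate)

    print(evaluate(ga()))   # fitness of the best structure found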
Other forms of crossover operators have been defined for other representations (e.g., Whitley et al., 1989; Koza, 1989; Grefenstette, 1991b). Specific decisions as to whether both resulting structures are to be entered into the knowledge base, whether the precursors are to be retained, and which other structures, if any, are to be purged define a range of alternative implementations. The crossover operator usually draws only on the information present in the structures of the current knowledge base in generating new structures for testing. If specific information is missing, due to storage limitations or loss incurred during the selection process of a previous iteration, then crossover is unable to produce new structures that contain it. A mutation operator, which alters one or more components of a selected structure, provides the means for introducing new information into the knowledge base. Again, a wide range of mutation operators have been proposed, ranging from completely random alterations to more heuristically motivated local search operators. In most cases, mutation serves as a secondary search operator that ensures the reachability of all points in the search space.

The power of the GA lies not in the testing of individual structures but in the efficient exploitation of the wealth of information that the testing of structures provides with regard to the interactions among the components comprising these structures. Specific configurations of component values
observed to contribute to good performance (e.g., a specific pair of parameter settings, a specific group of rule conditions, etc.) are preserved and propagated through the structures in the knowledge base in a highly parallel fashion. This, in turn, forms the basis for subsequent exploitation of larger and larger such configurations. Intuitively, we can view these structural configurations as the regularities in the space that emerge as individual structures are generated and tested. Once encountered, they serve as building blocks in the generation of new structures. That is, GAs actually search the space of all feature combinations, quickly identifying and exploiting combinations that are associated with high performance. The ability to perform such a search on the basis of the evaluation of completely specified candidate solutions is called the implicit parallelism of GAs.

To summarize, the power of a GA derives from its ability to exploit, in a near-optimal fashion, information about the utility of a very large number of structural configurations without the computational burden of explicit calculation and storage. This leads to a focused exploration of the search space wherein attention is concentrated in regions that contain structures of above average utility. The knowledge base, nonetheless, is widely distributed over the space, insulating the search from susceptibility to stagnation at a local optimum.

A great variety of genetic algorithms have been studied and compared. Often these comparisons take the form of empirical studies, but the generality of the results is often difficult to assess, since they usually depend on the particular characteristics of the search space. More analytic tools for comparison need to be developed. Our recent efforts have included new analyses of the fundamental components of genetic algorithms: the rules for selecting knowledge structures for reproduction, and the effects of various crossover operators. The following two sections describe our progress on these two topics.

ANALYSIS OF SELECTION ALGORITHMS

One way to improve our understanding of genetic algorithms is to identify properties that are invariant across the many seemingly different versions of the algorithms. Grefenstette (1991a) focuses on invariances among genetic algorithms that differ along two dimensions: (1) the way the user-defined objective function is mapped to a fitness measure, and (2) the way the fitness measure is used to assign offspring to parents. The remainder of this section summarizes those results.
The process of reproducing knowledge structures in a genetic algorithm can be decomposed into four steps. First, each structure x is evaluated according to an objective function u(x) that defines the problem-specific criterion for success. Second, a fitness function is applied to the result of the evaluation to obtain f(x), the fitness of x. The range of f must be a non-negative interval, and larger values of f(x) indicate more desirable solutions to the objective function.2 Third, a selection algorithm assigns a target number of offspring to each population member. Finally, a probabilistic sampling algorithm assigns to each member of the population an integer number of offspring.

2 This notation is reversed from that used in (Grefenstette, 1991a). The mnemonic here is that f(x) denotes the fitness, and u(x) denotes the user-defined utility (e.g., cost to be minimized or profit to be maximized). We hope that standard notation may be adopted soon; in the meantime, this paper uses the more intuitive notation.

The first step is, of course, entirely problem dependent, and will not concern us further. For the final step, several sampling algorithms have been investigated, culminating in one called stochastic universal sampling by Baker (1987), which appears to provide an optimal sampling method. Accordingly, variations on the sampling algorithm will not concern us further. That leaves the middle two steps open for variations, and in fact, many variations are in current use. A short discussion of some of the major variants of fitness functions and selection algorithms will give a fair indication of the range of possibilities.

The fitness function maps the raw score of the objective function to a non-negative interval. Such a mapping is always necessary if the goal is to minimize the objective function, since higher fitness values correspond to lower objective function values in that case. More generally, the fitness function often serves to scale the raw values returned by the objective function in order to provide a high level of selective pressure. Scaling that accentuates small differences is especially desirable late in the search, when the variance in objective performance tends to diminish. One popular approach to scaling (Grefenstette, 1986) is to define the fitness function as a dynamic, linear transformation of the objective value:

    f(x) = a(u(x) - b(t))

where a is positive for maximization problems and negative for minimization problems, and b(t) represents the worst value seen in the last few generations. The trajectory of b(t) generally rises over time, providing greater selection pressure later in the search. This method is sensitive, however, to "lethals", i.e., poor performing individuals that may occasionally arise through crossover or mutation. A more robust method has been called sigma scaling (Goldberg, 1989):
    f(x) = u(x) - (μ - c·σ)   if u(x) > (μ - c·σ)
    f(x) = 0                  otherwise

where μ is the mean objective function value of the current population, σ is the current population standard deviation, and c is a constant. Sigma scaling provides a level of selective pressure that is sensitive to the spread of performance values in the population. Besides these two forms of fitness functions, many other variations have been proposed and implemented (Goldberg, 1989).
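As a concrete illustration, here is a small sketch of both scaling schemes just described; the baseline b(t) and the constant c are illustrative parameter choices supplied by the caller:

    import statistics

    def linear_dynamic_fitness(u_values, b_t, a=1.0):
        # f(x) = a * (u(x) - b(t)), where b_t is the worst objective
        # value observed over the last few generations.
        return [a * (u - b_t) for u in u_values]

    def sigma_scaled_fitness(u_values, c=2.0):
        # f(x) = u(x) - (mu - c*sigma) when above the floor, else 0.
        mu = statistics.mean(u_values)
        sigma = statistics.pstdev(u_values)
        floor = mu - c * sigma
        return [u - floor if u > floor else 0.0 for u in u_values]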
We next consider variations in the selection phase. The selection algorithm assigns an expected number of children C(x) to each population member x, based on the fitness values. The most widely used method is proportional selection, defined as:

    C(x) = f(x) / f̄

where f̄ is the average fitness of the current population. This method was originally proposed and analyzed by Holland, who showed that it results in a nearly-optimal allocation of trials, under certain circumstances (Holland, 1975). In practice, this selection algorithm may lead to premature convergence, due to the unlimited number of offspring that may be assigned to "super individuals" that may arise early in a search (Baker, 1989). Other forms of selection are less brittle in this respect. For example, rank-based selection assigns offspring according to the formula:

    C(x) = a + b · rank(x)

where rank(x) indicates the relative position of x in the population, from 0 for the worst performer to 1 for the best, and a and b are constants chosen so that a is the minimum number of offspring and a + b is the maximum. Rank-based selection eliminates the problem of premature convergence to "super individuals" by providing a strict upper bound on the number of offspring assigned to any one member in a given generation. In practice, rank-based selection tends to provide a slower, steady rate of convergence than proportional selection. A final example is threshold selection, in which all population members whose objective function falls below a (possibly time-varying) threshold are deleted, and the survivors are assigned an equal number of offspring to fill the vacated slots.
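The two formulas above translate directly into code; the following sketch computes the expected offspring counts C(x) (the probabilistic sampling step that converts these expectations into integers is omitted, and the constants a and b are illustrative):

    def proportional_selection(fitnesses):
        # C(x) = f(x) / f_bar, where f_bar is the average fitness.
        f_bar = sum(fitnesses) / len(fitnesses)
        return [f / f_bar for f in fitnesses]

    def rank_based_selection(fitnesses, a=0.5, b=1.0):
        # C(x) = a + b * rank(x); rank(x) scales from 0 (worst) to
        # 1 (best), so a is the minimum expected offspring count
        # and a + b the maximum.
        n = len(fitnesses)
        order = sorted(range(n), key=lambda i: fitnesses[i])
        expected = [0.0] * n
        for r, i in enumerate(order):
            expected[i] = a + b * (r / max(n - 1, 1))
        return expected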
These three examples should give an indication of the range of selection algorithms that have been explored in genetic algorithms. Understanding the similarities and differences between these options is a fundamental step toward a deeper understanding of genetic algorithms.

Given these two dimensions of variations in the design of genetic algorithms, we say that a genetic algorithm is admissible if it meets what appear to be the weakest reasonable requirements along these dimensions. We can then show that any admissible genetic algorithm exhibits a form of implicit parallelism, meaning that it allocates search effort in a way that differentiates among a large number of competing areas of the search space on the basis of a limited number of explicit evaluations of knowledge structures. These results provide a sense of coherence to the field, in that commonalities are exposed among superficially different versions of the genetic algorithm. These results can also serve to spotlight the features which distinguish broad classes of genetic algorithms from one another.

A few definitions are required to make these ideas concrete. We say that a fitness function is monotonic if

    f(x) ≤ f(y)   iff   u(x) ≤ u(y).

A fitness function is strictly monotonic if it is monotonic and if

    u(x) < u(y)   then   f(x) < f(y).

That is, a monotonic fitness function does not reverse the sense of any pairwise ranking provided by the objective function. A strictly monotonic fitness function preserves the relative ranking of any two points in the search space with distinct objective function values. Referring to the examples mentioned earlier, the linear dynamic fitness function is strictly monotonic, but sigma scaling is monotonic but not strict, since it may assign zero fitness to knowledge structures that have different objective function values.

A selection algorithm is monotonic if

    C(x) ≤ C(y)   iff   f(x) ≤ f(y).

A selection algorithm is strictly monotonic if it is monotonic and if

    f(x) < f(y)   then   C(x) < C(y).

A monotonic selection algorithm is one that respects the "survival-of-the-fittest" heuristic. A strictly monotonic selection algorithm assigns a higher expectation of reproduction to knowledge structures with more deserving fitness values.
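As a quick illustration of the monotonic-but-not-strict case, the sigma_scaled_fitness sketch given earlier assigns the same zero fitness to any structures whose objective values fall below the floor μ - c·σ (the numbers below are made up):

    u = [10.0, 9.0, 2.0, 1.0]
    print(sigma_scaled_fitness(u, c=0.5))
    # The two poor performers both fall below mu - c*sigma and receive
    # fitness 0: their u(x) values differ but their f(x) values do not,
    # so the mapping is monotonic but not strictly monotonic.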
For example, proportional selection and rank selection are both strictly monotonic, whereas threshold selection is monotonic but not strict, since it may assign the same number of offspring to knowledge structures with different fitness values. Finally, we say that a GA is admissible if its fitness function and selection algorithm are both monotonic. A GA is strict iff its fitness function and selection algorithm are both strictly monotonic.

The main results in (Grefenstette, 1991a) relate the dynamic behavior of monotonic and strict genetic algorithms to the notion of partial domination of one set by another. Consider two arbitrary subsets of the solution space, A and B. Let the representatives of subset A in the population at time t be

    A(t) = <a1, ..., an>,

sorted such that u(ai) ≥ u(ai+1) for 1 ≤ i < n (we assume without loss of generality that we are maximizing u). Similarly, let the representatives of subset B be

    B(t) = <b1, ..., bn>,

also sorted in order of decreasing u. These definitions can be extended in a natural way to cover the case where A(t) and B(t) have differing cardinalities (Grefenstette, 1991a): if A(t) is smaller than B(t), we augment A(t) by adding copies of the best representative of A; if B(t) is smaller than A(t), we augment B(t) by adding copies of the worst representative of B. Finally, we say B partially dominates A (written A ≺ B) if

    u(ai) ≤ u(bi)   for 1 ≤ i ≤ n,

and at least one inequality is strict.

Figure 2: Two Regions Defined by Range of Objective Values

Intuitively, if A ≺ B, then the number of offspring allocated to set B grows strictly faster than the number allocated to set A, since any subset of B dominates any subset of A. The effect of this strategy, compounded over succeeding generations, is that the search effort allocated to set B will grow exponentially faster than the search effort allocated to set A. This is a highly plausible heuristic, and in some cases is the optimal adaptive strategy (Holland, 1975). This example illustrates implicit parallelism because it holds no matter where the dotted lines are drawn. The contribution of this new analysis is to extend Holland's original result, which applied only to a particular form of genetic algorithm, to the entire class of admissible genetic algorithms. The result is independent of the precise fitness function or selection algorithm, as long as they satisfy the requirement of admissibility (or strictness). These results provide new insights into the common characteristics of genetic search algorithms.
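The partial domination relation is easy to state operationally; the following sketch checks A ≺ B for two sets of objective values, including the padding rule for unequal sizes described above:

    def partially_dominates(u_a, u_b):
        # Returns True if B partially dominates A (A ≺ B): after sorting
        # both in decreasing objective value (maximization assumed),
        # u(a_i) <= u(b_i) for all i, with at least one strict inequality.
        a = sorted(u_a, reverse=True)
        b = sorted(u_b, reverse=True)
        # Pad the smaller set: A with its best member, B with its worst.
        n = max(len(a), len(b))
        a += [a[0]] * (n - len(a))
        b += [b[-1]] * (n - len(b))
        return (all(x <= y for x, y in zip(a, b))
                and any(x < y for x, y in zip(a, b)))

    # Example: B's representatives are uniformly at least as good as A's.
    print(partially_dominates([3.0, 2.0, 1.0], [3.5, 2.0, 1.5]))  # True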
ANALYSIS OF CROSSOVER

The analysis in the previous section refers exclusively to the distribution of search effort resulting from the selection, or reproduction, phase of the genetic algorithm. The selection phase is followed by operations that create modified structures from the selected parent structures. There are usually two distinct forms of structural alteration: crossover and mutation. Crossover refers to operations in which pairs of selected knowledge structures exchange information, producing new structures that inherit similarities from
both parents. In contrast, mutation operations apply to individual structures to create small variations in the newly formed structures. Without crossover and mutation, the population in a genetic algorithm would quickly converge to multiple copies of the most fit structure in the initial population. With crossover and mutation, genetic algorithms combine the focus of attention, or exploitation, provided by selection with the exploration of new structures created by these idealized genetic operators. In order to gain a complete picture of the operation of a genetic algorithm, it is necessary to complement our previous analysis of the selection phase with an analysis of the effects of the crossover and mutation operators on the search. Since mutations generally play a relatively minor role as a background search operator in genetic algorithms, our recent efforts have focused on the more dominant crossover operators.

As in the case of fitness functions and selection algorithms, there has been an interesting variety of crossover operators developed for genetic algorithms. Traditionally, genetic algorithms have relied upon 1-point or 2-point crossover operators. Many recent empirical studies, however, have shown the benefits of higher numbers of crossover points. Syswerda (1989) introduced a "uniform" crossover operator in which the allele (i.e., gene value) of any position in an offspring was determined by a random selection from the corresponding alleles of the two parents. He provided an initial analysis of the disruptive effects of uniform crossover, and compared it with both 1-point and 2-point crossover. He presented some provocative results suggesting that, in spite of higher disruption properties, uniform crossover can exhibit better recombination behavior, which can improve empirical performance. One of the goals of our analysis has been to better understand the effects of these various crossover operators.

Holland (1975) provided the initial formal analysis of the behavior of GAs by characterizing how they bias the makeup of new offspring in response to feedback on the fitness of previously generated individuals. More specifically, let H be a hyperplane in the representation space. For example, if the structures are represented by six binary features, then the hyperplane denoted by H = 0#1### consists of all structures in which the first feature is absent and the third feature is present. The order of a hyperplane is the number of features that are defined as either 0 or 1. For example, the hyperplane H specified above is a 2nd-order hyperplane. By focusing on hyperplane subspaces of L-dimensional spaces, Holland showed that the expected number of samples (individuals) allocated to a particular kth-order hyperplane Hk at time t + 1 is given by:
    m(Hk, t+1) ≥ m(Hk, t) · (f(Hk) / f̄) · (1 - Pm·k - Pc·Pd(Hk))

In this expression, f(Hk) is the average fitness of the current samples allocated to Hk, f̄ is the average fitness of the current population, Pm is the probability of using the mutation operator, Pc is the probability of using the crossover operator, and Pd(Hk) is the probability that the crossover operator will be "disruptive" in the sense that the children produced will not be members of the same subspace as their parents. The usual interpretation of this result is that subspaces with higher than average payoffs will be allocated exponentially more trials over time, while those subspaces with below average payoffs will be allocated exponentially fewer trials. This assumes that there are enough samples to provide reliable estimates of hyperplane fitness, and that the effects of crossover and mutation are not too disruptive. The effects of mutation are generally insignificant in practice and may be neglected in a first-order analysis.

Considerable attention has been given to estimating Pd, the probability that a particular application of crossover will be disruptive. Holland (1975) provided a simple and intuitive analysis of the disruption of 1-point crossover: as long as the crossover point does not occur within the defining boundaries of Hk (i.e., in between any of the k fixed defining positions), the children produced from parents in Hk will also reside in Hk. De Jong (1975) extended this analysis to n-point crossover by noting that no disruption can occur if there is an even number of crossover points (including 0) between each of the defining positions of a hyperplane. De Jong's original analysis applied to the special case of 2nd-order hyperplanes, i.e., hyperplanes with exactly two defining positions. This analysis was extended by Spears and De Jong (1991a) to arbitrary higher-order hyperplanes.

A result of the general analysis for the particular case of 3rd-order hyperplanes is shown in Figure 3. The term P3,even is the probability that there is an even number of crossover points (including 0) between each of the defining positions of a 3rd-order hyperplane. The number of crossover points is indicated by n. These results show that by choosing an even number of crossover points, we can reduce the representational bias of crossover, in the sense that the influence of defining length on the disruption of hyperplanes declines as the number of crossover points increases. However, this reduction in bias comes at the expense of increasing the disruption of the shorter defining length hyperplanes. If we interpret the area above a particular curve as a measure of the cumulative disruption potential of its associated crossover operator, then these curves suggest that 2-point crossover is the best as far as minimizing disruption. This confirms De Jong's original analysis, and much of the standard practice in the field.
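The quantity Pk,even is easy to estimate empirically. The sketch below samples random n-point crossovers (distinct cut sites, a common convention) and checks whether an even number of cut points falls between each pair of adjacent defining positions; the chromosome length and defining positions are illustrative:

    import random

    def even_between_all(cuts, defining):
        # A cut at site c falls between loci c-1 and c. The crossover is
        # non-disruptive (in this worst-case analysis) if every gap between
        # adjacent defining positions contains an even number of cuts.
        for lo, hi in zip(defining, defining[1:]):
            k = sum(1 for c in cuts if lo < c <= hi)
            if k % 2 == 1:
                return False
        return True

    def estimate_p_even(length=64, defining=(5, 20, 40), n_points=2,
                        trials=100000):
        hits = 0
        for _ in range(trials):
            cuts = random.sample(range(1, length), n_points)
            if even_between_all(cuts, defining):
                hits += 1
        return hits / trials

    print(estimate_p_even(n_points=2))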
Figure 3: P3,even on 3rd-Order Hyperplanes (probability plotted against defining length, from 0 to L, for n = 1 through 6 crossover points)

This line of analysis may be overly conservative in the sense that it assumes a worst case scenario: the parents are assumed to be complementary strings, differing at every position along the chromosome. As a result, the Pk,even curves are very weak bounds on Pd. A more realistic bound on the disruption caused by crossover requires a better estimate of the true disruption probability Pd. The primary reason for the weakness of the Pk,even bound is that it ignores the fact that many of the cases in which an odd number of crossover points fall between hyperplane defining positions are not disruptive to the sampling process. This occurs whenever the second parent happens to have identical alleles on the hyperplane defining positions which are exchanged by "odd" crossovers. (Note that an "odd" crossover occurs when an odd number of crossover points falls within two adjacent defining positions of the hyperplane.) Deriving an expression for the probability that both parents will share common alleles on the defining positions of a particular hyperplane is difficult in general because of the complexity of the population dynamics. We can, however, get a feeling for the effects of shared alleles on disruption by making the following simplifying assumption: the probability Peq of two parents sharing an allele is constant across all loci.
With this assumption we can generalize Pk,even to Pk,s (i.e., the probability of survival) by including "odd" crossovers which are not disruptive.

Figure 4: Pk,s on 3rd-Order Hyperplanes with Peq = 0.5 (probability of survival plotted against defining length)

Figure 4 shows the effects of counting the non-disruptive "odd" crossovers, assuming a value of Peq = 0.5, which is likely to hold in the early generations when matches are least likely. Note that the amount of expected disruption has been significantly reduced, compared to Figure 3, and the relative difference in disruption among different numbers of crossover points is reduced as well. At the same time, note that the curves for the various numbers of crossover points have held their relative position with respect to one another.

In (Spears and De Jong, 1991b), a similar analysis is applied to uniform crossover. The results show that, as expected, uniform crossover eliminates all representational bias, yielding a horizontal line in graphs like Figure 4. The precise location of the line depends upon P0, the probability of swapping at any one position. The higher the value of P0, the lower the horizontal curve in Figure 4 and the higher the rate of disruption. A value of P0 = 0.2 can be shown to produce roughly the same overall disruption as 2-point crossover.
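A sketch of uniform crossover with the tunable swap probability P0 discussed above (the default value is the one the text singles out):

    import random

    def uniform_crossover(parent_a, parent_b, p0=0.2):
        # Each position is swapped independently with probability p0;
        # p0 = 0.2 gives roughly the same overall disruption as 2-point
        # crossover, while p0 = 0.5 maximizes mixing (and disruption).
        child_a, child_b = list(parent_a), list(parent_b)
        for i in range(len(child_a)):
            if random.random() < p0:
                child_a[i], child_b[i] = child_b[i], child_a[i]
        return child_a, child_b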
In summary, this analysis highlights three important properties of uniform crossover. The first is the ease with which the disruptive effect of uniform crossover can be precisely controlled by varying P0. The second important property is that the disruptive potential of uniform crossover is independent of the defining length of hyperplanes. This allows uniform crossover to perform equally well, regardless of the distribution of important genes. Finally, when disruption does occur, it can be shown (De Jong and Spears, 1992) that uniform crossover results in a minimally biased exploration of the search space. We are currently extending these results toward a more complete theory for recombination operators. Our goal is to understand these interactions well enough to design genetic algorithms that can make adaptive decisions about the proper balance between exploration and exploitation.

MACHINE LEARNING WITH GENETIC ALGORITHMS

We have been using the powerful adaptive search strategies embodied in genetic algorithms to design and implement a variety of performance-oriented learning systems. The general context is one in which the environment defines one or more tasks to be performed, and the learning problem involves both skill acquisition (how to perform a task) and skill refinement (improving task performance). The approach taken is to identify a set of structures which control the performance aspects of the system, and to use a genetic algorithm to search the space of admissible structures to find ones that result in good performance on the tasks to be learned. The projects described in the following sections fall into two general categories based on the space of admissible structures being searched. The SAMUEL and GABIL systems search the space of admissible production rules for sets of rules which solve difficult sequential decision problems and concept classification tasks. The applications of GAs to NP-complete problems and neural networks take a more parameterized-system point of view. Here the performance of the problem solver is controlled by a fixed set of predefined parameters, and GAs are used to search the associated parameter space for combinations of parameters which result in good performance. Each of these projects is described in more detail in the following sections.
Competition-Based Learning for Sequential Decision Tasks

When the behavior of a rule-based system can be tested in a simulated environment, it becomes possible to consider generating and testing sets of rules off-line before they are used in the real task domain. The behavior of a set of rules can be monitored in a simulation to discover any weaknesses or inadequacies. We are investigating techniques that allow a learning system to actively explore alternative behaviors in simulation, and to construct high performance rules from this experience. If we can design a payoff function that quantitatively measures the performance of the system with a given rule set, we can then view the learning process as a heuristic optimization problem, i.e., a search through a space of knowledge structures looking for structures that lead to high performance. Our research is currently focused on learning rules for a variety of tactical scenarios, using genetic algorithms as the method for exploring the space of possible rule sets. Our approach has been implemented in a system called SAMUEL (Grefenstette and Cobb, 1991). The primary features of SAMUEL are:

• A restricted but high level rule language;
• Partial matching;
• Utility-driven conflict resolution;
• Numeric credit assignment at the level of individual rules; and
• Genetic learning at the level of rule sets.

The system is described in detail in (Grefenstette, Ramsey and Schultz, 1990). We have experimented with SAMUEL on a variety of tasks involving multiple-agent environments, including evading a predator, stalking a prey, and dog-fighting. SAMUEL has been able to learn high performance strategies for each of these tasks. In this section, we will briefly describe a number of recent studies of this approach. The reader is referred to the published articles for more complete details.

The foundations for SAMUEL can be traced to the analysis of the credit assignment problem in (Grefenstette, 1988). The credit assignment problem arises when long sequences of rules fire between successive external rewards. It can be shown that the two distinct approaches to rule learning with genetic algorithms each offer a useful solution to a different level of the credit assignment problem. Analytic and experimental results are presented that support the hypothesis that multiple levels of credit assignment, at both the level of individual rules and the level of rule sets, can improve the performance of rule learning systems based on genetic algorithms.
These multiple levels are both present in SAMUEL.

One focus of our experimental work has been the robustness of the rules learned in simulated environments. Robustness can be measured by testing the learned rules in new environments that have been systematically altered from the simulation environment in which the rules were learned. For example, either the learning environment or the target environment may contain noise. Experiments reported in (Ramsey, Schultz, and Grefenstette, 1990) examine the effect of learning tactical plans without noise and then testing the plans in a noisy environment, and the effect of learning plans in a noisy simulator and then testing the plans in a noise-free environment. Empirical results show that, while the best results are obtained when the training model closely matches the target environment, using a training environment that is more noisy than the target environment is better than using a training environment that has less noise than the target environment.

One of the interesting aspects of SAMUEL is that it employs a symbolic, attribute-value rule language, rather than the low-level representations adopted by many genetic algorithm-based systems. The use of a symbolic rule language in SAMUEL is intended to facilitate the incorporation of traditional machine learning methods into the system where appropriate. The rule language in SAMUEL also makes it easier to incorporate existing knowledge, whether acquired from experts or by symbolic learning programs. In (Schultz and Grefenstette, 1990), the use of available heuristic domain knowledge to initialize the population to produce better plans is investigated, and two methods for initialization of the knowledge base are empirically compared. These results provide an interesting contrast with most published work on genetic algorithms, which usually assumes tabula rasa initial conditions. The results presented here show that genetic algorithms can be used to improve partially correct decision rules, as well as to learn rules from scratch.

The use of a high-level language also facilitates the explanation of the learned rules. Gordon (1991a, 1991b) describes a method for improving the comprehensibility, accuracy, and generality of reactive plans learned by genetic algorithms. The method involves two phases: (1) formulate explanations of execution traces, and (2) generate new reactive rules from the explanations. The explanation phase involves translating the execution trace of a reactive planner into an abstract language, and then using Explanation-Based Learning to identify general strategies within the abstract trace. The rule generation phase consists of taking a subset of the explanations and using these explanations to generate a set of new reactive rules to add to the
original set for the purpose of performance improvement. The particular subset of the explanations that is chosen yields rules that provide new domain knowledge for handling knowledge gaps in the original rule set. The original rule set, in a complementary manner, provides expertise to fill the gaps where the domain knowledge provided by the new rules is incomplete.

Genetic algorithms gain much of their power from mechanisms derived from the field of population genetics. However, it is possible, and in some cases desirable, to augment the standard mechanisms with additional features not available in biological systems. In (Grefenstette, 1991b), we examine the use of Lamarckian learning operators in the SAMUEL architecture. The operators are Lamarckian in the sense that strategies are modified through the addition or deletion of rules, based on the experience of the strategy in the test environment. These changes are then passed along as "genetic material" to subsequent generations of strategies. The use of this mechanism is illustrated on three tasks in multi-agent environments.

Cobb and Grefenstette (1991) explore the effect of explicitly searching for the persistence of each decision in a time-dependent sequential decision task. Prior studies showed the effectiveness of SAMUEL in solving a simulation problem where an agent learns how to evade a predator that is in pursuit. In the previous work, an agent applies a control action at each time step. This paper examines a reformulation of the problem: the agent learns not only the level of response of a control action, but also how long to apply that control action. By examining this problem, the work shows that it is appropriate to choose a representation of the state space that compresses time information when solving a time-dependent sequential decision problem. By compressing time information, critical events in the decision sequence become apparent.

We have begun to apply the SAMUEL approach to more complex learning environments. In (Schultz, 1991), SAMUEL is used to learn high-performance reactive strategies for navigation and collision avoidance. The task domain requires an autonomous underwater vehicle to navigate through a randomly generated, dense mine field and then rendezvous with a stationary object. The vehicle has a limited set of sensors, including sonar, and can set its speed and direction. The strategy that is learned is expressed as a set of reactive rules (i.e., stimulus-response rules) that map sensor readings to actions to be performed at each decision time-step. Simulation results demonstrate that an initial, human-designed strategy which has an average success rate of only eight percent on randomly generated mine fields can be improved by this system so that the final strategy can achieve a success rate
of 96 percent. This study provides encouraging evidence that this approach to machine learning may scale up to realistic problems. We will continue to advance these techniques, with the intention of exploring possible applications to laboratory robots and Navy research vehicles in the near future.

Genetic Algorithms for Concept Learning

Genetic algorithms (GAs) have traditionally been used for non-symbolic learning tasks. In (Spears and De Jong, 1990a) we consider the application of a GA to a symbolic learning task: supervised concept learning from examples. A GA concept learner (GABIL) is implemented that learns a concept from a set of positive and negative examples. The performance of the system is measured on a set of concept learning problems and compared with the performance of two existing systems: ID5R and C4.5. Preliminary results suggest that, despite minimal system bias, GABIL is an effective concept learner and is quite competitive with ID5R and C4.5 as the target concept increases in complexity. In (Spears and Gordon, 1991) we identify strategies responsible for the success of these concept learners. We then implement a subset of these strategies within GABIL to produce a multistrategy concept learner. Finally, this multistrategy concept learner is further enhanced by allowing the GA to adaptively select the appropriate strategies.

GA and Neural Nets

Genetic algorithms and neural nets (NNs) have been used as heuristics for some NP-complete problems. Unfortunately, the results have been mixed, because NP-complete problems are not equivalent with respect to how well they map onto NN (or GA) representations. The Traveling Salesman Problem is a classic example of a problem that does not map naturally to either NNs or GAs. Suppose we are able to identify an NP-complete problem that has an effective representation in the methodology of interest (GAs or NNs) and develop an efficient problem solver for that particular case. Other NP-complete problems which do not have effective representations can then be solved by transforming them into the canonical problem, solving it, and transforming the solution back into a solution of the original problem. Spears and De Jong (1990b) outline GA and NN paradigms that solve boolean satisfiability (SAT) problems, and use Hamiltonian circuit problems to illustrate how either paradigm can be used to solve other NP-complete problems after they are transformed into equivalent SAT problems. Initial empirical results are
presented which indicate that although both paradigms are effective for solving SAT problems, the GA paradigm may be superior for more complex boolean expressions. Recently, genetic algorithms have been used to design neural network modules and their control circuits. In these studies, a genetic algorithm without crossover outperformed a genetic algorithm with crossover. Spears and Anand (1991) re-examine these studies, and conclude that the results were caused by an inadequate population size. New results are presented that illustrate the effectiveness of crossover when the population size is adequate.

CURRENT DIRECTIONS

Each phase of the research described here has suggested areas for further study. The analysis of the effects of the fitness function and selection algorithms has addressed only the broadest differences among the alternatives. Future analysis could focus on the issue of the sensitivity of the fitness functions and selection algorithms. For example, genetic algorithms with linear fitness functions and proportional selection are highly sensitive to the objective function. That is, large differences in objective function values are reflected as large differences in growth rates. The use of dynamic scaling fitness functions reduces this sensitivity. Rank-based selection schemes reduce sensitivity to the objective function even more (see the sketch at the end of this section). Empirical studies have shown a general correlation between convergence and sensitivity that should be explored in a more formal setting. It is not clear that more sensitivity is necessarily better. Given that genetic algorithms will often be used to optimize a surface with unknown properties, the genetic algorithm designer should be prepared to use algorithms with the appropriate sensitivity for the application at hand. We conjecture that, given appropriate formal definitions of sensitivity, theorems similar to those in (Grefenstette, 1990b) could be developed to characterize the searches performed by various sub-classes of admissible genetic algorithms distinguished by the sensitivity of the selection algorithm. We will continue the analysis of crossover by focusing on "construction theory", the dual of the disruption theory reported here. Construction theory refers to the analysis of how effectively various crossover operators can build more elaborate structures from the patterns existing in the population. Understanding the constructive effects of crossover is a key element in understanding genetic algorithms.
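To make the sensitivity contrast above concrete, the following minimal Python sketch (our own hypothetical illustration, not code from SAMUEL) compares proportional and rank-based selection probabilities under two scalings of the objective function:

    # Hypothetical illustration of selection sensitivity (not SAMUEL code).
    def proportional_probs(fitnesses):
        total = sum(fitnesses)
        return [f / total for f in fitnesses]

    def rank_probs(fitnesses):
        # Linear ranking: the individual with the i-th lowest fitness
        # gets weight i, so only the ordering matters.
        order = sorted(range(len(fitnesses)), key=lambda i: fitnesses[i])
        weights = [0.0] * len(fitnesses)
        for rank, i in enumerate(order, start=1):
            weights[i] = float(rank)
        total = sum(weights)
        return [w / total for w in weights]

    mild = [1.0, 2.0, 3.0]      # one objective scaling
    harsh = [1.0, 2.0, 30.0]    # same ordering, stretched values
    print(proportional_probs(mild))    # [0.167, 0.333, 0.500]
    print(proportional_probs(harsh))   # [0.030, 0.061, 0.909]: highly sensitive
    print(rank_probs(mild))            # [0.167, 0.333, 0.500]
    print(rank_probs(harsh))           # unchanged: rank ignores magnitudes

Proportional selection shifts growth rates drastically when the objective values are rescaled, while the rank-based probabilities are identical in both cases.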
Our machine learning applications of genetic algorithms will continue to explore the use of traditional, symbolic machine learning operators (e.g., specialization, generalization, etc.) as mutation operators within the genetic framework of SAMUEL. The approach is currently being tested on more challenging, multi-agent tasks, including tasks that require the cooperative efforts of several learning agents. Based on the rate of progress to date, we expect to see continued progress toward the development of practical machine learning systems that exploit the power of genetic algorithms.
ACKNOWLEDGMENTS

The authors acknowledge the contributions to this project by the other members of the Machine Learning Group at NRL: Helen Cobb, Diana Gordon, Connie Ramsey, and Alan Schultz.
REFERENCES

Baker, J. E. (1987). Reducing bias and inefficiency in the selection algorithm. Proceedings of the Second International Conference on Genetic Algorithms and Their Applications (pp. 14-21). Cambridge, MA: Erlbaum.

Baker, J. E. (1989). Analysis of the effects of selection in genetic algorithms. Doctoral dissertation, Department of Computer Science, Vanderbilt University, Nashville.

Cobb, H. G. and J. J. Grefenstette (1991). Learning the persistence of actions in reactive control rules. Proceedings of the Eighth International Machine Learning Workshop (pp. 293-297). Evanston, IL: Morgan Kaufmann.

De Jong, K. A. (1975). An analysis of the behavior of a class of genetic adaptive systems. Doctoral dissertation, Department of Computer and Communication Sciences, University of Michigan, Ann Arbor.

De Jong, K. A. (1990). Genetic-algorithm-based learning. In Machine Learning: An artificial intelligence approach, Vol. 3, Y. Kodratoff and R. Michalski (eds.), Morgan Kaufmann.

De Jong, K. A. and W. M. Spears (1992). A formal analysis of the role of multi-point crossover in genetic algorithms. Annals of Mathematics and Artificial Intelligence.

Goldberg, D. E. (1989). Genetic algorithms in search, optimization, and machine learning. Reading, MA: Addison-Wesley.

Gordon, D. F. (1991a). An enhancer for reactive plans. Proceedings of the Eighth International Machine Learning Workshop (pp. 505-508). Evanston, IL: Morgan Kaufmann.

Gordon, D. F. (1991b). Improving the comprehensibility, accuracy, and generality of reactive plans. Proceedings of the Sixth International Symposium on Methodologies for Intelligent Systems (pp. 358-367). Charlotte, NC: Springer-Verlag.

Grefenstette, J. J. (1986). Optimization of control parameters for genetic algorithms. IEEE Transactions on Systems, Man, and Cybernetics, SMC-16(1), 122-128.

Grefenstette, J. J. (1988). Credit assignment in rule discovery systems based on genetic algorithms. Machine Learning, 3(2/3), 225-245.

Grefenstette, J. J. (1991a). Conditions for implicit parallelism. In Foundations of Genetic Algorithms, G. J. E. Rawlins (ed.), Bloomington, IN: Morgan Kaufmann.

Grefenstette, J. J. (1991b). Lamarckian learning in multi-agent environments. Proceedings of the Fourth International Conference on Genetic Algorithms (pp. 303-310). San Diego, CA: Morgan Kaufmann.

Grefenstette, J. J. and H. G. Cobb (1991). User's guide for SAMUEL, Version 1.3. NRL Memorandum Report 6820. Washington, DC.

Grefenstette, J. J., C. L. Ramsey and A. C. Schultz (1990). Learning sequential decision rules using simulation models and competition. Machine Learning, 5(4), 355-381.

Holland, J. H. (1975). Adaptation in natural and artificial systems. Ann Arbor: University of Michigan Press.

Koza, J. R. (1989). Hierarchical genetic algorithms operating on populations of computer programs. Proceedings of the 11th International Joint Conference on Artificial Intelligence. San Mateo, CA: Morgan Kaufmann.

Ramsey, C. L., A. C. Schultz and J. J. Grefenstette (1990). Simulation-assisted learning by competition: Effects of noise differences between training model and target environment. Proceedings of the Seventh International Conference on Machine Learning (pp. 211-215). Austin, TX: Morgan Kaufmann.

Schultz, A. C. (1991). Using a genetic algorithm to learn strategies for collision avoidance and local navigation. Proceedings of the Seventh International Symposium on Unmanned, Untethered Submersible Technology (pp. 213-225). Durham, NH.

Schultz, A. C. and J. J. Grefenstette (1990). Improving tactical plans with genetic algorithms. Proceedings of the IEEE Conference on Tools for AI 90 (pp. 328-334). Washington, DC: IEEE.

Spears, W. M. and V. Anand (1991). A study of crossover operators in genetic programming. Proceedings of the Sixth International Symposium on Methodologies for Intelligent Systems (pp. 409-418). Charlotte, NC: Springer-Verlag.

Spears, W. M. and K. A. De Jong (1990a). Using genetic algorithms for supervised concept learning. Proceedings of the IEEE Conference on Tools for AI 90 (pp. 335-341). Washington, DC: IEEE.

Spears, W. M. and K. A. De Jong (1990b). Using neural networks and genetic algorithms as heuristics for NP-complete problems. International Joint Conference on Neural Networks (pp. 118-121). Washington, DC: Lawrence Erlbaum Associates.

Spears, W. M. and K. A. De Jong (1991a). An analysis of multi-point crossover. In Foundations of Genetic Algorithms, G. J. E. Rawlins (ed.), Bloomington, IN: Morgan Kaufmann.

Spears, W. M. and K. A. De Jong (1991b). On the virtues of parameterized uniform crossover. Proceedings of the Fourth International Conference on Genetic Algorithms (pp. 230-236). San Diego, CA: Morgan Kaufmann.

Spears, W. M. and D. F. Gordon (1991). Adaptive strategy selection for concept learning. Proceedings of the Workshop on Multistrategy Learning (pp. 231-246). Harpers Ferry, WV: George Mason University.

Syswerda, G. (1989). Uniform crossover in genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms (pp. 2-9). Fairfax, VA: Morgan Kaufmann.

Whitley, D., T. Starkweather and D. Fuquay (1989). Scheduling problems and traveling salesmen: The genetic edge recombination operator. Proceedings of the Third International Conference on Genetic Algorithms (pp. 133-141). Fairfax, VA: Morgan Kaufmann.
Chapter 7

Problem Solving via Analogical Retrieval and Analogical Search Control

Randolph Jones
Artificial Intelligence Laboratory
University of Michigan
1101 Beal Avenue
Ann Arbor, MI 48109-2110
ABSTRACT

In this chapter we describe EUREKA, a problem solver that uses analogy as its basic reasoning and learning process. EUREKA introduces a learning mechanism called analogical search control, and uses a model of memory based on spreading activation to retrieve analogies and solve problems. These relatively simple mechanisms allow the system to account for a number of psychological phenomena in problem solving. In this chapter we focus on some of the computational aspects of the system. To this end, we provide a full description at the theoretical and implementation levels, and present the results of some experiments that explore the model's computational behavior.

INTRODUCTION

The standard paradigm for problem solving in AI involves applying operators and expanding new states in order to find a path between the initial state and a goal state. We introduce a model that differs from most previous problem-solving systems, in that it uses analogy as its sole matching mechanism. Thus, its problem-solving behavior combines standard problem-solving techniques with analogical reasoning. The use of analogy requires us to address the role of retrieval and memory in problem solving. Humans have a large capacity for knowledge in long-term memory, and difficulties occur in choosing the correct analogy for
a given problem situation. In our view, the retrieval of knowledge from memory plays a central role in one's ability to solve a problem.¹ In addition, any mechanisms that influence retrieval should have a strong effect on the types of learning that can occur. In our model, one of the most important forms of learning involves the adjustment of memory structures to alter retrieval patterns. These adjustments are designed to facilitate the retrieval of useful knowledge, although we find that there is sometimes a trade-off between flexibility and performance improvement. We have designed the EUREKA model in order to demonstrate the advantages of a system that relies on analogical reasoning and incorporates a model of memory. The memory model we have chosen is based on the use of spreading activation for retrieval, and analogical reasoning is carried out through a new mechanism, called analogical search control.² These relatively simple reasoning and learning methods provide a problem solver that can account for many of the basic psychological phenomena in problem solving. The remainder of this paper consists of a description of the EUREKA model of problem solving, including its retrieval and analogy mechanisms, followed by an evaluation of some of the model's computational aspects.

THE EUREKA MODEL

As we have suggested, analogy and retrieval lie at the heart of the EUREKA model. We have combined these mechanisms with a problem solver that relies heavily on the means-ends-analysis framework. The standard means-ends approach selects operators by examining the entire set of operators and choosing one that reduces some of the differences between the current state and the goal conditions (Ernst & Newell, 1969; Fikes & Nilsson, 1971; Minton, 1988/1989). This procedure is repeated,
¹ This view is also held by a number of other researchers in artificial intelligence (e.g., Carbonell, 1986; Hammond, 1986/1988; Kolodner, Simpson, & Sycara, 1985; Schank, 1982).
² The term "analogical search control" was introduced in work with the CASCADE 3 system (VanLehn & Jones, in press), but the mechanism itself was first developed in EUREKA (Jones, 1989).
with backtracking if necessary, until the problem is solved or it becomes clear that the problem cannot be solved. In EUREKA, the search for an operator is replaced with a retrieval method that considers only a small subset of the operator memory. In addition, the strict method for selecting operators is relaxed to allow the system more flexibility and the ability to generalize its knowledge. In the following sections we describe three important aspects of the model in detail. These include the representation and organization of knowledge; the performance engine, which includes a problem-solving component and retrieval mechanism; and the learning mechanisms.
Representation, Performance, and Learning

EUREKA is a memory-based problem solver. As it solves problems, it records its actions into memory, and it uses these memories to guide future problem-solving efforts. To begin our discussion, we should note that there are two levels at which it is useful to view the knowledge structures in EUREKA. At a low level, EUREKA'S long-term memory is a semantic network consisting of nodes (representing concepts) connected by small sets of labeled links (representing relations). This network is used by EUREKA'S retrieval mechanism to retrieve analogies from memory. At a higher level, the semantic network represents solution traces to problems, which are used by the problem solver to direct future efforts. We will begin this section with a discussion of the knowledge representation and learning mechanism used by the problem solver, and then examine the semantic network together with the retrieval algorithm.
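To make the low-level view concrete, such a semantic network can be pictured as concept nodes joined by typed, weighted links. The Python sketch below is our own minimal rendering, not EUREKA's actual data structures; the chair example and the trace strengths (8, 4, and 3) are assumed values, chosen to match the discussion of Figure 4 later in this section.

    # A minimal, hypothetical rendering of EUREKA-style memory: concept
    # nodes joined by typed links, each link carrying a trace strength.
    from collections import defaultdict

    class SemanticNetwork:
        def __init__(self):
            # links[node][link_type] maps target node -> trace strength
            self.links = defaultdict(lambda: defaultdict(dict))

        def add_link(self, source, link_type, target, strength=1.0):
            # Re-asserting a known relation strengthens its link rather
            # than storing a duplicate copy.
            prior = self.links[source][link_type].get(target, 0.0)
            self.links[source][link_type][target] = prior + strength

    net = SemanticNetwork()
    net.add_link("chair", "is-a", "furniture")
    net.add_link("chair", "part-of", "seat", strength=8.0)
    net.add_link("chair", "part-of", "back", strength=4.0)
    net.add_link("chair", "part-of", "legs", strength=3.0)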
The Basic Problem Solver. We can best describe EUREKA'S problem solver as two recursive functions: TRANSFORM and APPLY. These functions appear in Table 1. Each of them in turn attempts to satisfy what we call TRANSFORM goals and APPLY goals. This algorithm is similar to standard means-ends analysis as used in GPS (Ernst & Newell, 1969) or STRIPS (Fikes & Nilsson, 1971). The primary difference in EUREKA'S algorithm concerns the method for choosing operators, which we will discuss in detail later. First, let us consider the TRANSFORM function, which is presented
Table 1. EUREKA'S two main problem-solving functions.

TRANSFORM(StateX, Conditions): Returns StateZ
    If StateX satisfies Conditions
    Then Return StateZ as StateX
    Else Let Operator be SELECT_OPERATOR(StateX, Conditions);
         If Operator is empty
         Then Return StateZ as "Failed State"
         Else Let StateY be APPLY(StateX, Operator);
              If StateY is "Failed State"
              Then Return StateZ as "Failed State"
              Else Return StateZ as TRANSFORM(StateY, Conditions)

APPLY(StateX, Operator): Returns StateY
    Let P be PRECONDITIONS(Operator);
    If StateX satisfies P
    Then Return StateY as EXECUTE(StateX, Operator)
    Else Let StateW be TRANSFORM(StateX, P);
         If StateW is "Failed State"
         Then Return StateY as "Failed State"
         Else Return StateY as APPLY(StateW, Operator)
with a goal in the form of "TRANSFORM StateX into StateZ, which satisfies Conditions." The value returned by the function is either a state that satisfies the Conditions or a special "failure" state. TRANSFORM first checks to see if StateX already satisfies the Conditions. If this is the case, then TRANSFORM succeeds and returns with StateZ bound to StateX. If this is not the case, then TRANSFORM must set up an APPLY goal followed by another TRANSFORM goal in an attempt to complete the transformation. To set up an APPLY goal, TRANSFORM selects an operator. Usually, the chosen operator will reduce some differences between the current state StateX and the Conditions. Although this would necessarily be true in a standard means-ends system, it is not always the case in EUREKA. As we describe in the next section, the model selects an operator and sets up an APPLY goal using that operator. This subgoal may fail,
in which case the current TRANSFORM goal also fails. However, if the APPLY subgoal succeeds, it returns a new state, StateY, resulting from the application of the Operator to StateX. With this new state, TRANSFORM sets up a recursive subgoal of the form "TRANSFORM StateY into StateZ, which satisfies Conditions." When the current TRANSFORM goal finishes, StateZ receives the value returned by the subgoal, whether it is a failure or a successful state. The APPLY function has a similar form to the TRANSFORM function, but is somewhat simpler because APPLY does not have to worry about selecting an operator to set up its subgoals. The form of an APPLY goal is "APPLY Operator to StateX to produce StateY." The function first checks the preconditions of the Operator to see if they are satisfied in StateX. If so, APPLY simply executes the Operator in StateX and returns the resulting state in StateY. If the preconditions are not met, the current state, StateX, must be further transformed until the Operator can be applied. More precisely, EUREKA must set up a subgoal of the form "TRANSFORM StateX into StateW, which satisfies the preconditions of Operator." If this subgoal fails, then APPLY also fails. If the subgoal succeeds, it returns a state, StateW, in which the Operator should now be executable. At this point, APPLY calls itself in order to execute the operator in the new state. The final subgoal is of the form "APPLY Operator to StateW to get StateY," and the value returned by the recursive call becomes the return value of the parent APPLY goal. Each time EUREKA encounters a TRANSFORM or APPLY goal, it performs a simple form of rote learning by storing that goal in memory and appropriately linking the goals from a single problem. As an example, after a successful problem-solving attempt the memory structure in Figure 1 might be asserted into memory. When EUREKA fails to solve a problem, we do not allow it to backtrack as most problem-solving systems can. Rather, it must begin a new attempt at the problem from the top-level TRANSFORM goal. Because multiple attempts to solve the problem will probably share some goals, a directed acyclic graph grows in memory to represent multiple attempts at a problem.³ An example
³ The problem-solving traces that EUREKA stores are reminiscent of Carbonell's (1986) derivational traces, although not as detailed.
Figure 1. EUREKA'S representation for a single attempt at a "blocks-world" problem.
is provided in Figure 2. The system continues attempting to solve a given problem until it is successful or it reaches a predetermined limit on the number of attempts it has made. During each individual attempt, EUREKA'S learning mechanisms encourage (but do not guarantee) it to search previously unexplored paths. In the current implementation, the system attempts to solve a problem 50 times before it gives up.
Figure 2. Abstract representation of multiple attempts at a single problem.
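For readers who prefer running code, here is a compact Python transcription of the control structure in Table 1. It is a sketch under simplifying assumptions: states are frozensets of literals, operators are (preconditions, additions, deletions) triples, and select_operator stands in for EUREKA's analogical selection mechanism, described in the next section. The depth bound is our addition to keep the sketch terminating; Table 1 itself has none.

    FAILED = "failed state"

    def transform(state, conditions, select_operator, depth=50):
        # TRANSFORM StateX into a state that satisfies Conditions.
        if depth == 0:
            return FAILED
        if conditions <= state:              # StateX satisfies Conditions
            return state
        operator = select_operator(state, conditions)
        if operator is None:
            return FAILED
        state_y = apply_op(state, operator, select_operator, depth - 1)
        if state_y == FAILED:
            return FAILED
        return transform(state_y, conditions, select_operator, depth - 1)

    def apply_op(state, operator, select_operator, depth):
        # APPLY Operator to StateX, first achieving its preconditions.
        preconditions, additions, deletions = operator
        if preconditions <= state:
            return (state - deletions) | additions   # execute the operator
        state_w = transform(state, preconditions, select_operator, depth)
        if state_w == FAILED:
            return FAILED
        return apply_op(state_w, operator, select_operator, depth)

With a frozenset encoding of blocks-world literals and a simple greedy select_operator, these two functions reproduce the backward-chaining behavior described above.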
The Role of Retrieval in Problem Solving. The retrieval of knowledge is a central issue in the EUREKA model. In terms of performance, learning, and memory, the critical point of problem solving concerns choosing an appropriate operator to help satisfy a TRANSFORM goal. In order to accomplish this, the system must retrieve and choose the "best" operator from all of the candidates in memory. What the system believes is "best" changes as it gains experience. In many standard problem solvers, retrieval consists of a search through the entire operator memory, with the final selection being made according to some predetermined evaluation function. A number of other systems use control rules or preference rules to retrieve and evaluate operators (Laird, Rosenbloom, & Newell, 1986b; Minton, 1988/1989; Ohlsson, 1987).
Figure 3. A portion of EUREKA'S semantic network.
EUREKA controls its search by retrieving appropriate analogies to the current TRANSFORM goals. Retrieval is modeled by a process of spreading activation. When EUREKA stores its problem-solving traces in memory, they are actually stored in the form of a semantic network. As an example, Figure 3 provides a portion of the network that represents the top-level TRANSFORM goal in Figure 1. Each node in the network has an associated level of activation, which represents its current relevance to the system. In addition, each link has a trace strength, which represents the semantic strength of a specific relation between two concepts. These features are similar to those used by Anderson (1976, 1983) in his ACT model.
Every time the TRANSFORM function is executed, activation spreads from the nodes in the semantic network that represent the current state and goal conditions. This activation is used to retrieve a set of related
TRANSFORM goals from long-term memory. After these states have been retrieved, one is selected as an analogy for solving the current TRANSFORM goal. This selection is based on two factors: the degree of match to the current state and goal conditions, and the history of success or failure in using each of the retrieved TRANSFORM goals for search control in the past. After EUREKA has selected a candidate TRANSFORM goal, it creates an analogical mapping to generalize the retrieved goal, so that it can guide the next search step for solving the current TRANSFORM goal. Finally, EUREKA selects an operator that was used successfully in the retrieved goal, applies the analogical mapping to it, and attempts to APPLY the new operator.
There are three distinct steps in the process of operator selection: the retrieval of a set of past goals, the evaluation and selection of an appropriate analogy, and the selection of an operator from the retrieved analogical TRANSFORM goal. Most problem-solving systems downplay the aspect of retrieval by making all operators candidates for evaluation and selection. In our attempt to provide a more plausible psychological model, we only allow EUREKA to retrieve a small set of operators that are likely to be relevant to the problem, and then make the final selection from this small set. Naturally, there are drawbacks to this type of retrieval framework. It is possible that the appropriate operators for a given situation may not be retrieved. In this case, the system may not be able to solve a "potentially solvable" problem. However, the possibility of failing to retrieve necessary knowledge comes with the ability to solve many different types of problems efficiently. A complete model of problem solving should explain the conditions under which a problem might not be solved. As we have noted above, failure can occur even if the appropriate knowledge is stored somewhere in memory; the knowledge may just be inaccessible to the retrieval mechanism. Spreading Activation in EUREKA. When EUREKA must retrieve an operator to solve a TRANSFORM goal, it spreads activation from the concepts involved in the current TRANSFORM goal, including concepts in the current working state, the current goal conditions, and information about the TRANSFORM goal. The spreading-activation algorithm is a variation of the algorithm used by Anderson (1976, 1983) in his ACT
Table 2. EUREKA'S spreading-activation algorithm.

Let ACTIVATION_THRESHOLD be 0.01;
Let DAMPING_FACTOR be 0.4;
Let INITIAL_ACTIVATION be 1.0;

SPREAD_INIT(Source):
    SPREAD(Source, INITIAL_ACTIVATION, NIL)

SPREAD(Source, Value, Path):
    If Value is less than ACTIVATION_THRESHOLD or Source is in Path
    Then EXIT
    Else Increase Source.Activation by Value;
         For each link X from Source
             Let Target be the node connected to Source by X;
             Let NewValue be SPREAD_VALUE(Source, X, Value) x DAMPING_FACTOR;
             Push Source onto Path;
             SPREAD(Target, NewValue, Path)

SPREAD_VALUE(Source, Link, Value):
    Let Total be 0;
    For each link X from Source
        If X is the same type of link as Link
        Then increase Total by X.trace_strength;
    Return Value x (Link.trace_strength / Total)
framework, and it is provided in Table 2. In EUREKA'S current implementation, activation spreads in a depth-first manner, although there is no compelling reason why another type of algorithm could not be used. When a node in the semantic network receives an amount of activation, it is first added to any other activation associated with the node and stored. Next, that activation is divided and passed on to the neighboring nodes. Finally, the spreading process is recursively applied to the new set of nodes.
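A direct Python rendering of Table 2 follows, written against a plain-dict encoding of the toy network sketched earlier. The damping factor, threshold, cycle check, and same-type division of activation follow the table; the graph encoding and the example trace strengths are our own assumptions, chosen so that seat's share reproduces the 0.213 value discussed with Figure 4 below.

    DAMPING_FACTOR = 0.4
    ACTIVATION_THRESHOLD = 0.01

    def spread(links, activation, source, value=1.0, path=()):
        # Per Table 2: stop when activation is negligible or the path cycles.
        if value < ACTIVATION_THRESHOLD or source in path:
            return
        activation[source] = activation.get(source, 0.0) + value
        for link_type, targets in links.get(source, {}).items():
            # Nodes compete only with others reached by the same link type.
            total = sum(targets.values())
            for target, strength in targets.items():
                new_value = value * (strength / total) * DAMPING_FACTOR
                spread(links, activation, target, new_value, path + (source,))

    links = {"chair": {"is-a": {"furniture": 1.0},
                       "part-of": {"seat": 8.0, "back": 4.0, "legs": 3.0}}}
    activation = {}
    spread(links, activation, "chair")
    # chair: 1.0; furniture: 0.4; seat: 0.4 * 8/15, approximately 0.213
    print(activation)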
If activation were allowed to spread in an unrestricted manner, it would eventually spread throughout the entire network, so there are certain restrictions and assumptions associated with the process. First, the amount of activation passed to any neighboring node is always less than the amount of activation in the source node. This is achieved by multiplying the activation value by a "damping" factor. Another restriction requires that if a concept receives an amount of activation that is close to zero (specified by a threshold), activation stops spreading at that point. These two conditions guarantee that the amount of activation being spread decreases and that each activation path eventually terminates. The current implementation uses a damping value of 0.4 and a threshold of 0.01. Initially activated concepts start with an activation value of 1.0. The implementation includes one additional simplifying assumption: each path of activation is temporarily recorded to ensure that it never cycles. In this way, activation that is spread from a given node can never be spread back to that node after traveling through a cyclic path in the network. A final detail of the spreading-activation process concerns the manner in which activation is divided when passed from a source node to a number of neighboring nodes. A simple algorithm would divide the activation evenly among all neighboring nodes. If this were the case, however, all neighboring concepts would be equally likely to be retrieved, regardless of their degree of relatedness to the source concept. Anderson (1976, 1983) overcomes this problem by associating trace strengths with the links between nodes and dividing the activation among neighboring concepts in proportion to these trace strengths. In the EUREKA model, we rely heavily on the use of trace strengths to influence retrieval patterns. The system's low-level learning mechanism involves adjusting these trace strengths to encourage the retrieval of familiar and useful knowledge. Therefore, EUREKA'S spreading mechanism also divides activation in proportion to trace strengths. A useful metaphor to describe spreading-activation processes involves the competition of nodes in the network for activation from neighboring source concepts. The competition is based on the strengths of the links to those concepts. However, in EUREKA many different types of links connect concepts. For instance, the node representing a chair
has part-of links to its various parts, an is-a link to the node for furniture, and various links to situations in which a chair is involved. For example, it would be undesirable for the node for furniture to compete for activation with the node for chair-legs, because these two concepts bear very different relationships to the concept of a chair. Therefore, EUREKA'S spreading mechanism only allows competition for activation between nodes connected to a source by links of the same type.⁴
Figure 4. Competition for activation is limited to links of the same type.
For example, consider Figure 4, in which chair is given an activation level of 1.0, and there are three concepts that are part-of a chair. The activation available for spreading from chair is the initial amount of activation times the damping factor, or 1.0 x 0.4 = 0.4. Because furniture is the only node connected to chair by an is-a link, it receives the full activation value of 0.4. However, each of the three part-of nodes receives
⁴ Neches (1981/1982) provides another alternative for separating the competition for activation. His approach spreads activation through different types of links depending on the system's current situation.
a share of the 0.4 value based upon the strength of its link to chair relative to the other part-of links. For example, seat gets an activation value of 0.4 times its proportional share of the part-of trace strength (0.213 in the figure's example), whereas its competitors receive less activation. It is easy to see that one way EUREKA'S behavior can change is by changing the character of the semantic network. This can happen in two ways: first, by adding new nodes to the network (which happens when problem traces are stored); second, by updating the trace strengths of the links between nodes. EUREKA updates trace strengths in two situations. When the system attempts to store a relation that it has seen before, it increments the trace strength of that relation rather than storing a new copy. In addition, when a TRANSFORM goal is successfully used to help solve a new problem, it increases the trace strengths of all the links in the portion of the semantic network that represents that TRANSFORM goal. The latter mechanism allows the system to improve its behavior with success, and is a primary source of learning in EUREKA. Selecting an Appropriate Analogy. As we have seen, EUREKA spreads activation throughout long-term memory to retrieve a small number of stored TRANSFORM goals (the ones that end up with the most activation). The cutoff point for retrieving goals is set by the activation of the most strongly activated goal. In the current implementation, any goal that has less than one percent of the activation of the most active goal is not retrieved. From empirical studies with the system, we have found that this algorithm usually causes anywhere from one to ten goals to be retrieved. This set of goals is then pruned to include only those TRANSFORM goals that have been previously satisfied, because EUREKA cannot make use of TRANSFORM goals that have not been solved. After the system has pruned the retrieved set of TRANSFORM goals, it has a small set of situations that are related to the current goal in some way. EUREKA must choose one of these to use as an analogy to the current TRANSFORM goal. This choice is made by creating a partial match between the current TRANSFORM goal and each of the retrieved goals. Each retrieved goal is assigned a value between zero and one that signifies its degree of match with the current goal. This score is calculated by taking the ratio of the number of common literals in the current goal and the retrieved goal to the total number of literals that represent the retrieved goal. After assigning a partial-match value, p,
to each goal, EUREKA multiplies p by a selection factor that represents how useful the particular goal has proven in the past. This factor is calculated by dividing the number of times the retrieved goal has been successfully used as an analogy (s) by the number of times it has been chosen as a candidate analogy (t). This leaves each retrieved goal with a numerical attribute, defined as (s/t) x p, that predicts the relevance of the goal to the current problem. These values provide another source of learning for EUREKA, but they do not have a large impact on improving its behavior. Rather, their purpose is to encourage EUREKA to explore new sections of the problem space so it does not get stuck in unproductive cycles. After the calculations are done, one of the retrieved goals is randomly selected based on the goals' associated values. The selection is based on a weighted distribution that gives more precedence to the goals with the highest scores. However, because the selection is probabilistic, no candidates are completely ruled out unless they share absolutely no literals with the current goal. It is always possible that a goal that matches poorly will be selected. This allows the system a degree of flexibility, but for the most part it encourages selection of highly matched goals. Selection and Analogical Transformation of an Operator. At this point, EUREKA has chosen a TRANSFORM goal that it believes is similar (in an analogical sense) to the current TRANSFORM goal of the problem it is working on. Its last step is to choose an operator from the retrieved goal to APPLY to the current problem. To do this, the system chooses one of the operators that led to a solution from the retrieved TRANSFORM goal.⁵ However, it is important to note that the operators that EUREKA stores in memory are completely instantiated, containing no variables. Thus, any retrieved operator must usually be analogically mapped to the current situation in order to be executable. The analogy mechanism computes a number of partial mappings
⁵ Usually there is only one operator that led to satisfaction of the retrieved TRANSFORM goal. If there is more than one, a random selection is made.
from concepts in the stored goal to concepts in the current goal. The mappings are limited by the requirement that at least some structure matches between the two goals after mapping. This greatly limits the number of analogical matches that are derived, and limits situations in which all the concepts in one goal map to all the concepts in the other. After the transformations have been computed, each one is evaluated according to the new degree of match between goals and the number of analogical assumptions involved (i.e., the number of objects that must be mapped). Finally, the best analogical match is chosen in a random manner, similar to the selection of a goal from the retrieved set. This mechanism lets EUREKA use stored goals for search control even when they do not completely match the current problem. This is one of the primary advantages of analogical search control, in that it can apply across a wide range of problems, and will allow the system to make inductive assumptions if necessary. The ability to base all of its decisions on analogy to past behavior is a final source of learning for EUREKA. Every time the system chooses an operator from memory, it does so based on the experiences it has stored in memory and the retrieval patterns of its semantic network. As an example, suppose the system has already solved the problem shown in Figure 1 and it is given a new problem with an initial state consisting of Blocks E, F, G, and H sitting on a table, and a goal condition of having Block E stacked on Block F. Assuming EUREKA has already retrieved and selected the stored goal, a number of transformations would be proposed by the analogical mapping mechanism, including A→E and B→F, along with others.⁶ These two specific transformations would provide the best match between the goal conditions in the current and retrieved TRANSFORM goals. Therefore, EUREKA would select the operator it applied in the old case and analogically map it to STACK(E,F) for the current problem. We should stress that the evaluation of degree of match gives more weight to matches on the goal conditions, giving rise to the type of operator selection found with strict means-ends analysis. However, the
⁶ In EUREKA'S current implementation, there is no other constraint on which objects can map to each other.
system can select other matches when there are no retrieved goals that match the conditions well. This can lead to a forward-chaining type of behavior if the current states of the TRANSFORM goals match, or something between forward chaining and means-ends analysis if there is only a partial match between goals. We refer to this type of reasoning as flexible means-ends analysis.⁷ Note that if EUREKA'S current TRANSFORM goal is one that it has previously solved, it has a higher chance of being retrieved by the spreading-activation mechanism. It will also have the highest possible degree of partial match because it is matching against itself. This means that the system will tend to repeat whatever it has done successfully in the past to solve the goal. However, we should stress that EUREKA'S decisions are based on probabilistic choices, so even in this case it may select a different state, though that would be highly unlikely. In addition, retrieved goals that are likely to be more relevant to the current situation generally have a higher degree of match to the current goal. Because of the high degree of match, the retrieved goal is more likely to be selected. This argument is based on the assumption that structural similarity implies greater relevance (Gentner, 1983). Along with the retrieval mechanism, this discourages EUREKA from selecting operators that work in the domain of chemistry, for example, when it is busy working on a problem in the blocks world. Although this type of selection is discouraged, it is not ruled out completely. In this way, the mechanism allows the selection of a useful situation from another domain that can be used as an analogy to solve a problem in the current domain. Therefore, EUREKA has a single algorithm involving the retrieval, selection, and analogical mapping of stored goals that accounts for a number of types of problem solving. These include cases of straightforward operator application, simple generalization within a domain, and broad analogies across domains.
⁷ Although this type of reasoning was introduced in EUREKA (Jones, 1989), this name was introduced by Langley and Allen (1991). Flexible means-ends analysis has been successfully incorporated into their DAEDALUS system and also the GIPS system (Jones & VanLehn, 1991).
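The scoring and probabilistic selection just described reduce to a few lines of arithmetic. The sketch below is a hypothetical summary, not EUREKA's code: literal matching is reduced to set intersection, and each retrieved goal carries its success count s and selection count t (assumed nonzero here).

    import random

    def score(current_literals, goal):
        # Degree of match p, weighted by the past-success factor s/t.
        common = len(current_literals & goal["literals"])
        p = common / len(goal["literals"])
        return (goal["s"] / goal["t"]) * p

    def select_analogy(current_literals, retrieved_goals):
        # Weighted random draw: poorly matching goals remain selectable
        # unless they share no literals at all with the current goal.
        weights = [score(current_literals, g) for g in retrieved_goals]
        if sum(weights) == 0.0:
            return None
        return random.choices(retrieved_goals, weights=weights, k=1)[0]

    goal = {"literals": {"on(A,B)", "clear(A)"}, "s": 3.0, "t": 4.0}
    print(score({"on(A,B)", "clear(C)"}, goal))   # (3/4) * (1/2) = 0.375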
One remaining question concerns the kinds of knowledge the system has available initially. If EUREKA started without any knowledge in its long-term memory, it would never be able to solve any problems, because there would be no previously solved problems on which to base future decisions. Therefore, EUREKA must start with a set of operators that it can apply to new problems. To be consistent with the problem-solving mechanism, these operators are stored in the form of simple problems that require only one operator application to be solved. Each problem is represented as a simple satisfied goal that is not connected to any other goal in any type of sequence. In this way, each operator initially stands on its own, rather than being involved in a more complicated problem-solving episode. This gives EUREKA the ability to solve new problems "by analogy" before it has seen any other complete problems.

Summary

This ends our description of the EUREKA model of problem solving. As we have seen, there are three interdependent components in EUREKA, involving the system's memory, performance and retrieval, and learning mechanisms. Although the model is based on a means-ends framework, it has a number of features that distinguish it from standard means-ends systems. First, EUREKA records all of its past behavior in memory in order to use analogical search control to guide future problem solving. The model also relaxes the strict requirements of standard means-ends analysis for operator selection. EUREKA chooses operators by performing an analogical match on past TRANSFORM goals that it has solved. This allows the system to exhibit means-ends style characteristics in general, but also allows the flexibility to break out of that pattern. In addition, this mechanism lets EUREKA make generalizations within or across domains, and it allows the system to search larger portions of the problem space when it cannot find operators that it believes are clearly relevant to the current problem. Finally, EUREKA incorporates a model of retrieval based on spreading activation, which provides the ability to focus on local areas of long-term memory and to learn by influencing retrieval patterns. Elsewhere
(Jones, 1989; Jones & Langley, 1991), we have shown that these mechanisms combine to create a model of problem solving that can account for many aspects of human behavior. These experiments include improvement in performance on individual problems, transfer within and across domains (e.g., speed-up learning and analogical reasoning), negative transfer or Einstellung, and the role of external cues in the form of hints. Due to space considerations, we cannot describe those results in detail here. Instead, we will discuss and evaluate some of the computational aspects of EUREKA.

EXAMINATION OF SOME COMPUTATIONAL CHARACTERISTICS

EUREKA was originally designed to provide a computational model of psychological behavior in problem solving. In this sense, it constitutes an architecture for problem solving. However, as a running computer program, it also contains a number of computational mechanisms and explicit parameters that are worthy of exploration. In this section, we focus on some of EUREKA'S learning parameters and examine its retrieval and selection mechanisms from a computational standpoint.
Learning Parameters

The first two experiments reported here were designed to evaluate EUREKA'S behavior across a wide range of settings for the parameters involved in its decision algorithms. There are two primary parameters involved in the decision points: one concerns the amount by which the trace strengths on links in the semantic network are increased during problem solving, and the other involves the amount of punishment or reward the system associates with selecting a retrieved goal for use with analogical search control. These factors influence behavior in the retrieval of a set of TRANSFORM goals from memory, and in the selection of a single goal from that set. In principle, changing the values of these parameters could drastically change the amount of knowledge retrieved from memory and the likelihood that it will be selected for search control once it has been retrieved. It is not necessarily desirable (or possible) to come up with a "best" set of values for the parameters. Rather, the particular parameter
values represent a specific bias in a continuum of possible behaviors. For these experiments, we wish to explore this behavior space and exhibit the tendencies of the system with respect to certain ranges of the parameter values. Retrieval of Knowledge. Our first experiment measured EUREKA'S behavior with respect to the parameter for increasing the trace strengths of the links in memory after a problem has been successfully solved. There are two occasions on which trace strengths are increased. Whenever a relation is encountered that is already stored in memory, the link representing that relation has its trace strength incremented by one. However, when the system succeeds in solving a problem, the trace strengths of all links representing TRANSFORM goals that helped to solve the problem are increased by a factor v. In the first part of this experiment, we tested the effects of this factor on the system's ability to generalize knowledge across problems in the blocks world. That is, we gave the system blocks-world operators with no variables in them. These operators were overly specific, and they would not apply to a very wide range of problems in a standard problem solver. However, with its analogical mechanisms, EUREKA is able to generalize the operators to new problems that it encounters.
We ran EUREKA first on a blocks-world problem that required no generalization of the operators and then on a problem that required generalization. We measured the amount of effort required to solve the latter problem by recording the number of TRANSFORM and APPLY goals the system visited while solving the problem. By varying the value of the link strengthening factor, v, we can examine the effect of v on transfer of knowledge within the domain. Figure 5 compares the effort spent on the problem requiring generalization to the value of v. This graph shows an initial decline in the effort spent on solving the new problem as the learning factor, v, increases. This occurs because the successful use of the overly specific operators on the test problem causes the likelihood that the operators will be retrieved in future similar situations to increase with v. However, it is interesting to note that after a point near v = 20, performance actually starts to degrade. An explanation for this is that operators receive too much reward for being successful in the training problem and they become easily retrieved even
Figure 5. Number of goals compared to the retrieval increment factor, v. (Axes: number of goals visited vs. value of factor increment.)
when they are inappropriate to a new problem. This indicates that negative transfer effects can increase as v gets very large. In the second part of this experiment, we ran a number of experiments on practice effects with various values of v. Specifically, we had the system attempt to solve a single problem multiple times. In other experiments of this type, EUREKA exhibits a gradual increase in performance with repetition. We compared the effort spent on the problem in the first trial with the effort spent on the tenth trial for various values of v. This comparison was calculated as a percentage decrease in the number of goals visited and the results are shown in Figure 6. A lower value in this figure indicates a greater improvement in performance across trials. Performance improvement appears to increase with v, although this increase becomes less pronounced as v becomes large. Unlike the previous experiment, there are no visible negative transfer effects in these results, but that is to be expected because a single problem was being solved for each trial. Therefore, no transfer was occurring between distinct problems.
Figure 6. EUREKA'S behavior with respect to the retrieval increment factor, v. (Axes: percentage reduction in goals visited between trials vs. value of factor increment.)
Selection of Retrieved Goals. Our second experiment examined the factor used to select an old goal as an analogy once it has been retrieved. Recall that this factor is multiplied by the degree of match between two goals to derive a final factor for selection. The selection factor is computed by storing two values: t is a measure of how often a goal has been selected for use in analogical search control in a particular situation, and s is a measure of how often a problem has been solved when the goal was chosen in that situation. When a problem is solved, each goal that was used to control search has its s and t attributes incremented by a fixed value w. When EUREKA fails to solve a problem, only t is incremented. The increment factor, w, is the variable of interest in this experiment. As in the second part of the experiment on the retrieval parameter, we repeatedly ran EUREKA on a number of individual problems. This time, however, we varied the value of w between zero and 100, measuring the percentage change in each dependent variable between the first and last trial. Again, a decrease in this value represents an average increase in performance improvement. The results are graphed in Figure 7. The
Figure 7. EUREKA'S behavior with respect to the selection increment factor, w. (Axes: percentage reduction in goals visited between trials vs. value of factor increment.)
number of goals visited exhibits a gradual improvement as w becomes large, appearing to reach asymptotic values at about w = 1. These results are consistent with what we know of w's role in EUREKA. This factor's major purpose is to encourage the system to explore new paths after failures and to prefer old paths that have been successful. As such, we would expect improvement on individual problems to be more dramatic as the factor is increased. However, it is interesting that increases in w appear to have little impact as it becomes large. Again, this may be due to the fact that the system was solving individual problems repeatedly, and it was able to reach a plateau of reasonable behavior on those problems.
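The bookkeeping behind w is correspondingly simple; the following lines are our own restatement of the update rule described above, not EUREKA's code:

    def update_selection_stats(goal, solved, w):
        # On success both s and t grow by w; on failure only t does,
        # which shrinks the selection factor s/t.
        goal["t"] += w
        if solved:
            goal["s"] += w

    g = {"s": 1.0, "t": 1.0}
    update_selection_stats(g, solved=False, w=1.0)
    print(g["s"] / g["t"])   # 0.5: a failure makes this goal less preferred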
Computational Aspects of Spreading Activation

Our final analysis and experiment concern the computational utility of a spreading-activation approach to retrieval. Spreading activation has received quite a bit of attention from psychologists as a plausible model of human memory. Most of the psychological literature on this topic
concentrates on accounting for the amount of time people take to perform memory-related tasks involving fact retrieval and word recognition (Anderson, 1974, 1983; Collins & Quillian, 1969; Meyer & Schvaneveldt, 1971). In addition, Holyoak and Koh (1987) have proposed spreading activation as a mechanism for the retrieval of analogies in problem solving. However, we argue that this retrieval mechanism also has advantages from a computational standpoint. In particular, search by spreading activation is only influenced by the structure of memory and not by its specific contents. Other types of retrieval algorithms (e.g., Laird, Rosenbloom, & Newell, 1986a; Minton, 1988/1989; Ohlsson, 1987) can require an extensive analysis of portions of memory. In contrast, spreading activation uses a local algorithm that does not require attending to multiple complex concepts at one time. Spreading activation also imposes certain limits and constraints on the type and depth of search. More knowledge-intensive types of retrieval usually do not limit the size of memory that may be examined during retrieval. Finally, spreading activation specifies a paradigm under which the retrieval of knowledge occurs, placing a bias on which types of knowledge will be retrieved under different conditions. Theoretical Analysis of Spreading Activation. Time and memory complexity are important considerations when dealing with the computational characteristics of spreading activation. Consider a typical algorithm that implements this mechanism.⁸ At first glance, it appears that there is an exponential growth in the number of semantic-network nodes visited during spreading activation. This growth is based on the branching factor of the network and the depth to which activation is spread. However, with most spreading-activation systems (including EUREKA), the amount of activation spread from a node is inversely proportional to the number of nodes connected to the source. Consider an idealized network in which the fan from each node is f and the trace strengths of all links are the same. Even if there are cycles
⁸ Although spreading activation is well-suited for parallelism, we consider a serial algorithm for this analysis because the current EUREKA model is implemented in this manner.
Figure 8. Spreading activation viewed as a tree traversal. (Successive levels of the tree contain 1, f, f², ... nodes.)
in the semantic network, the activation process treats each branch of its traversal separately. Thus, we can view a spreading-activation process as the traversal of a tree, where multiple tree nodes may correspond to a single node in the semantic network (see Figure 8). To determine how long spreading takes, we derive a formula for the number of nodes visited during this traversal. The total number of nodes visited, T, is the summation of the number of nodes in each level of the tree up to a certain distance d. This distance is determined by the specific mechanism's parameters. For a network with fan factor f, we get
$$ T = 1 + f + f^2 + \cdots + f^d \qquad (1) $$
We can simplify this equation to
$$ T = \frac{f^{d+1} - 1}{f - 1}, \qquad (2) $$
which is exponential with respect to d, the depth of the spreading process. However, d is determined by the amount of activation received by the furthest nodes and the threshold for stopping the spreading process. If we let a_n represent the amount of activation that is received by a node
n levels away from the source, then we have
$$ a_n = \frac{a_0}{f^n}, \qquad (3) $$
where a_0 is the initial activation given to the source node. If we define the threshold for stopping the spread of activation as h, then activation will spread until a_n < h. In Equation 2, we used d as the number of levels that activation spread to. Therefore, we must have
$$ a_d = \frac{a_0}{f^d} < h \qquad (4) $$
and
$$ a_{d-1} = \frac{a_0}{f^{d-1}} \ge h. \qquad (5) $$
These equations can be rewritten as
$$ f^d > \frac{a_0}{h} \qquad (6) $$
and
$$ f^{d-1} \le \frac{a_0}{h}. \qquad (7) $$
Substituting Equations 6 and 7 into Equation 2 gives us upper and lower bounds for the number of nodes visited:
$$ \frac{f^2 a_0 / h - 1}{f - 1} \ \ge\ T\ >\ \frac{f a_0 / h - 1}{f - 1}. \qquad (8) $$
Notice that Equation 8 does not involve any exponential relationships, because d has been factored out. The equation also implies that the time and space required for spreading activation is independent of the size of memory, close to linear with respect to the inverse of the threshold, 1/h, and nearly constant with respect to f. Naturally, we have made some simplifying assumptions. However, any pattern of activation can be viewed in terms of a tree traversal, as we have done. This leaves only one variable that can complicate things: the fan factor f. It
is important to note that, for a single step of activation, if f is high then a large number of nodes receive relatively small amounts of activation. If f is low, a small number of nodes receive a relatively large amount of activation. This balancing effect causes the time required to spread activation to remain approximately constant when the threshold h is fixed. In fact, in our implementation of spreading activation, there is an additional decay factor that attenuates the amount of activation that is passed from one node to another. This can further decrease the number of nodes visited during spreading. The advantage of these results is that we can make reasonable assumptions about the time and space required for the implementation to run. In addition, we can expect to integrate large amounts of knowledge into memory without degrading the efficiency of the system's retrieval mechanism. Empirical Evaluation of Spreading Activation. To supplement this simplified analysis of spreading activation, we have run an experiment with the EUREKA system to demonstrate the independence of retrieval time with respect to memory size. In this experiment, we ran EUREKA on a number of problems, continuously adding knowledge to the semantic network. The problems and extra knowledge were taken from domains including the blocks world, Towers of Hanoi, water-jug problems, and "radiation" problems. At various points we started the retrieval process by spreading activation from a small set of specified nodes. Finally, we graphed the time taken to spread activation from each source against the total number of nodes in the network. These results are provided in Figure 9. Each curve in the figure represents the spreading time from a single source node. The most obvious characteristic of this graph is that each curve eventually levels off, indicating the type of behavior that we predicted. For large networks, as predicted, retrieval time does seem to be independent of the size of memory. There are some other aspects of this graph that should be discussed. First, notice that sometimes spreading activation appears to visit more nodes than are in the network. This happens because there is a large number of cycles in a typical network, so individual nodes are visited many times during the spreading process. Thus, even if the total number of nodes visited is larger than the number of nodes in the network, it does not mean that every node in the network is visited.
Empirical Evaluation of Spreading Activation. To supplement this simplified analysis of spreading activation, we have run an experiment with the EUREKA system to demonstrate the independence of retrieval time with respect to memory size. In this experiment, we ran EUREKA on a number of problems, continuously adding knowledge to the semantic network. The problems and extra knowledge were taken from domains including the blocks world, the Tower of Hanoi, water-jug problems, and "radiation" problems. At various points we started the retrieval process by spreading activation from a small set of specified nodes. Finally, we graphed the time taken to spread activation from each source against the total number of nodes in the network. These results are provided in Figure 9. Each curve in the figure represents the spreading time from a single source node. The most obvious characteristic of this graph is that each curve eventually levels off, indicating the type of behavior that we predicted. For large networks, as predicted, retrieval time does seem to be independent of the size of memory. There are some other aspects of this graph that should be discussed. First, notice that sometimes spreading activation appears to visit more nodes than are in the network. This happens because there is a large number of cycles in a typical network, so individual nodes are visited many times during the spreading process. Thus, even if the total number of nodes visited is larger than the number of nodes in the network, it does not mean that every node in the network is visited.

[Figure 9. Comparing retrieval time to total network size: the number of nodes visited during retrieval, plotted against the size of memory.]
Also, notice that the curves are somewhat jagged for small network sizes. This suggests that retrieval time is influenced quite a bit by the specific structure of the network, rather than by its size. Apparently, adding a few links or altering the strengths on links can significantly influence retrieval time, at least for small networks. The curves eventually appear to smooth out, but this could be partly because our interval for measuring retrieval time increased as the network grew. A final argument with respect to the computational advantages of spreading activation concerns the lack of knowledge used by the mechanism itself. All the knowledge of the system exists within the semantic network. This knowledge is not consulted while carrying out the spreading-activation process. In these terms, spreading activation can be considered a knowledge-free process. This means that the processing for spreading activation can be highly localized, because decisions made on where to spread activation and how much to spread to any given
node need not consider the state or knowledge content of other parts of the network. This localization ability lends itself nicely to a simple algorithm for spreading activation that could easily be implemented in a parallel fashion. This would further increase the efficiency of the retrieval mechanism in terms of time. This advantage would be limited in a system where the search for retrievable knowledge depends on other knowledge in memory.

DISCUSSION

The EUREKA model provides contributions to research on problem solving along a number of distinct dimensions. In this final section we examine the model along some of these dimensions, discussing the contributions this work has made and outlining directions for future research.

Problem-Solving Abilities

As we have mentioned, EUREKA's current major purpose is to provide a psychological model for problem solving, and it is far from being a complete and powerful problem solver. However, it does suggest a number of techniques and mechanisms that should prove useful in computational problem solving. One of EUREKA's strengths is that it integrates techniques from a number of problem-solving paradigms without relying completely on any one in particular. For example, the model views problem solving as a task involving the retrieval of useful knowledge from memory. This is similar to the view provided in case-based reasoning (e.g., Hammond, 1988; Kolodner et al., 1985; Schank, 1982). Under this view, problem solving is less involved in examining productions or operators and evaluating their utility for the current problem, and more concerned with retrieving past experiences that will suggest useful approaches to apply to the current problem. An important difference between EUREKA and standard case-based problem solvers is that EUREKA uses analogical search control to "choose a case" to reason from each time it must make a decision, rather than
choosing one full problem-solving episode from memory that must be made to fit the current problem. This allows the system to use knowledge from multiple cases, if necessary, proposing a solution to the schema-combination problem. Although we have not experimented with schema combination in EUREKA, it has been successfully modeled with analogical search control in CASCADE 3 (VanLehn & Jones, in press). Another difference between EUREKA and case-based systems is that it builds its own "cases" from experience, rather than having to start with a library of full cases, because it can initially treat each operator as a "case" consisting of one step with which to reason. This also allows the system to fall back on more conventional problem-space search when necessary. In addition, EUREKA's learning mechanism has some of the flavor of problem solvers that learn from examples in order to identify heuristically when an operator would be good to apply (e.g., SAGE, Langley, 1985; PRODIGY, Minton, 1988/1989; LEX, Mitchell, Utgoff, & Banerji, 1983). By strengthening the links of relations involved in useful TRANSFORM goals, EUREKA attempts to identify the specific relations that are most relevant for retrieval. This is precisely the goal of systems that learn heuristics by analyzing failed and successful solution paths. EUREKA's advantage is that it uses this single mechanism for all of its learning, including search-control learning and learning at the knowledge level. One disadvantage of the current model is that it does not learn from failures as many of these systems do. Therefore, it will not learn as quickly as it might in some situations. Another contribution of EUREKA is that it suggests a method for the efficient retrieval of knowledge from a large database. Many of the most powerful contemporary problem solvers (e.g., SOAR, Laird et al., 1986b, and PRODIGY, Minton, 1988/1989) rely on the ability to access their entire memory if necessary. This approach provides these systems with the ability to solve wide ranges of problems of non-trivial complexity. However, these systems should suffer when presented with problems involving large amounts of domain knowledge, or when provided with general knowledge from large numbers of problem domains in which most of the knowledge in memory is irrelevant to each particular problem. Minton (1988/1989) has called one facet of this issue the "utility
problem" for explanation-based learning. EUREKA's spreading-activation mechanism provides the ability to focus on small portions of memory and provides a decision-making mechanism that does not slow down as the size of memory increases. Thus, at least one portion of the utility problem disappears, but there is naturally a tradeoff involved. Because EUREKA has an implicit limit on the amount of memory it will examine, there will be cases when the system cannot solve a problem even though it has the appropriate knowledge stored in memory. However, we predict that this type of mechanism will have strong heuristic value, providing a solution most of the time.
As we have suggested, EUREKA is currently somewhat weak as a performance problem solver, but it contains a number of mechanisms that should be useful in the context of more powerful problem solvers. In the future, we plan to extend EUREKA to take advantage of this potential. For example, one important factor in EUREKA's weakness is its lack of higher-level control knowledge of the type found in UPL (Ohlsson, 1987), SOAR, or PRODIGY. In addition, for the sake of psychological validity, we built in the assumption that the system cannot backtrack. However, we expect that supplying EUREKA with a limited backtracking ability, along with the ability to learn higher-level control knowledge that operates on the retrieved knowledge, will greatly increase the complexity of the problems that it can solve. Indeed, systems that have borrowed mechanisms from EUREKA are capable of solving much more difficult problems (Jones & VanLehn, 1991; Langley & Allen, 1991; VanLehn & Jones, in press).

Analogical Reasoning

EUREKA also provides a context for the retrieval and use of analogies in problem solving, both within and across domains. Although the use of analogy has received a large amount of attention (see Hall, 1989, for a review), it is rarely incorporated in a problem solver in an elegant and general way. In addition, most research has focused on how to elaborate analogies once they have been suggested (e.g., Carbonell, 1983, 1986; Falkenhainer, Forbus, & Gentner, 1986; Holyoak & Thagard, 1989) and not on how to retrieve them in the first place. Anderson and Thompson's
(1989) PUPS and Holyoak and Thagard's PI are two notable exceptions that use analogical reasoning as a basic problem-solving process. They bear some resemblance to EUREKA, particularly in the use of spreading activation as a retrieval mechanism. However, PUPS, PI, and EUREKA use the results of spreading activation in quite different ways, and improved performance arises from very different types of learning mechanisms. Where PUPS and PI store new generalized operators based on past analogies and adjust the preconditions on these operators, EUREKA learns by simply storing problem-solving traces without generalization and then learning new retrieval patterns. Because EUREKA does all of its reasoning by analogy, the retrieval mechanism only needs to retrieve analogies. The system also provides a mechanism for decision making in problem solving that includes analogy as one activity in a continuum of possible problem-solving behaviors, allowing analogies to arise naturally when they are useful. The system does not need to switch from straightforward problem-solving mode into analogy mode, as has been the case in other work on analogical problem solving (e.g., Anderson, 1983; Holland et al., 1986). One extension of this ability would involve the use of alternative analogical-mapping mechanisms. EUREKA's matcher is a relatively simple one that generates a number of partial matches and evaluates them. The evaluation function involves the degree of match between two structures and the number of assumptions required to achieve the match. As we have mentioned, the elaboration of analogies is a well-studied problem, and we might expect EUREKA's performance to improve if equipped with a smarter analogical transformation mechanism, such as the structure-mapping engine (Falkenhainer, 1989; Falkenhainer et al., 1986) or Holyoak and Thagard's (1989) ACME algorithm. Because this component of the system is independent of the other components, it should be easy to replace it with alternative mechanisms. Another area for future work concerns the development of analogy as the sole reasoning mechanism. Depending on the knowledge in memory and the current problem, this mechanism manifests itself in seemingly different types of problem-solving behavior. These include straightforward operator application (or deductive reasoning), the ability to generalize operators within a domain, and the ability to draw
broad analogies across domains (inductive and "abductive" reasoning). As suggested previously, this is a desirable characteristic because the system does not have to make any high-level decisions about which type of performance mode it should use on each problem. Rather, the most appropriate method arises from the general mechanism, based on the system's current knowledge base and the demands of the current problem. We believe that using a general analogical method as the sole reasoning method can provide further benefits in problem solving and other parts of AI. For example, using this approach should be useful in concept induction tasks, in which similar objects form natural classes. In addition, a single analogical method should prove useful in the areas of reasoning and explanation-based learning. (In fact, this type of approach has been suggested independently by Falkenhainer, 1989.) We want to explore the benefits that can be realized by viewing various forms of reasoning as special cases of analogy.

Concluding Remarks

Our experiences in constructing and evaluating the EUREKA model have been encouraging. Not only can the model explain a number of human learning behaviors by incorporating a theory of memory with a problem solver based on means-ends analysis (Jones, 1989; Jones & Langley, 1991), but it also addresses a number of issues in computational problem solving and suggests methods for improving systems in that area. Through our experimental evaluation of EUREKA we explored the nature of EUREKA's behavior with respect to its parameters for retrieval and selection of knowledge from memory, and the utility of a retrieval mechanism based on spreading activation for large memories. We feel that our model provides evidence for the utility of a problem solver that incorporates a psychologically plausible retrieval mechanism and a general analogical matching mechanism. Our research has also opened a number of interesting new questions concerning the use of analogical reasoning and the nature of problem difficulty. By examining
these questions, we feel that EUREKA can eventually provide the basis for a general architecture for computational problem solving.

Acknowledgements

Discussions with Pat Langley, Bernd Nordhausen, Don Rose, David Ruby, and Kurt VanLehn led to the development of many of the ideas in this paper. This research was supported in part by contract N00014-84K-0345 from the Computer Science Division, Office of Naval Research, and a University of California Regents' Dissertation Fellowship.

References

Anderson, J. R. (1974). Retrieval of propositional information from long-term memory. Cognitive Psychology, 5, 451-474.
Anderson, J. R. (1976). Language, memory, and thought. Hillsdale, NJ: Lawrence Erlbaum.
Anderson, J. R. (1983). The architecture of cognition. Cambridge, MA: Harvard University Press.
Anderson, J. R., & Thompson, R. (1989). Use of analogy in a production system architecture. In S. Vosniadou & A. Ortony (Eds.), Similarity and analogical reasoning. Cambridge, England: Cambridge University Press.
Carbonell, J. G. (1983). Learning by analogy: Formulating and generalizing plans from past experience. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Los Altos, CA: Morgan Kaufmann.
Carbonell, J. G. (1986). Derivational analogy: A theory of reconstructive problem solving and expertise acquisition. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach (vol. 2). Los Altos, CA: Morgan Kaufmann.
Collins, A., & Quillian, M. R. (1969). Retrieval time from semantic memory. Journal of Verbal Learning and Verbal Behavior, 8, 240-248.
Ernst, G., & Newell, A. (1969). GPS: A case study in generality and problem solving. New York: Academic Press.
Falkenhainer, B. C. (1989). Learning from physical analogies: A study in analogy and the explanation process. Doctoral dissertation, University of Illinois at Urbana-Champaign.
Falkenhainer, B., Forbus, K. D., & Gentner, D. (1986). The structure-mapping engine. Proceedings of the Fifth National Conference on Artificial Intelligence (pp. 272-277). Philadelphia: Morgan Kaufmann.
Fikes, R. E., & Nilsson, N. J. (1971). STRIPS: A new approach to the application of theorem proving to problem solving. Artificial Intelligence, 2, 189-208.
Gentner, D. (1983). Structure-mapping: A theoretical framework for analogy. Cognitive Science, 7, 155-170.
Hall, R. P. (1989). Computational approaches to analogical reasoning: A comparative analysis. Artificial Intelligence, 39, 39-120.
Hammond, K. J. (1988). Case-based planning: An integrated theory of planning, learning, and memory (Doctoral dissertation, Yale University, 1986). Dissertation Abstracts International, 48, 3025B.
Holland, J. H., Holyoak, K. J., Nisbett, R. E., & Thagard, P. R. (1986). Induction: Processes of inference, learning, and discovery. Cambridge, MA: MIT Press.
Holyoak, K. J., & Koh, K. (1987). Surface and structural similarity in analogical transfer. Memory and Cognition, 15, 332-340.
Holyoak, K. J., & Thagard, P. (1989). Analogical mapping by constraint satisfaction. Cognitive Science, 13, 295-355.
Jones, R. M. (1989). A model of retrieval in problem solving. Doctoral dissertation, University of California, Irvine.
Jones, R. M., & Langley, P. (1991). An integrated model of retrieval and problem solving. Manuscript submitted for publication.
Jones, R. M., & VanLehn, K. (1991). Strategy shifts without impasses: A computational model of the sum-to-min transition. In Proceedings of the Thirteenth Annual Conference of the Cognitive Science Society (pp. 358-363). Chicago: Lawrence Erlbaum.
Kolodner, J. L., Simpson, R. L., & Sycara, K. (1985). A process model of case-based reasoning in problem solving. In Proceedings of the Ninth International Joint Conference on Artificial Intelligence (pp. 284-290). Los Angeles: Morgan Kaufmann.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986a). Chunking in Soar: The anatomy of a general learning mechanism. Machine Learning, 1, 11-46.
Laird, J. E., Rosenbloom, P. S., & Newell, A. (1986b). Universal subgoaling and chunking: The automatic generation and learning of goal hierarchies. Hingham, MA: Kluwer Academic.
Langley, P. (1985). Learning to search: From weak methods to domain-specific heuristics. Cognitive Science, 9, 217-260.
Langley, P., & Allen, J. A. (1991). The acquisition of human planning expertise. In L. A. Birnbaum & G. C. Collins (Eds.), Machine Learning: Proceedings of the Eighth International Workshop (pp. 80-84). Evanston, IL: Morgan Kaufmann.
Meyer, D. E., & Schvaneveldt, R. W. (1971). Facilitation in recognizing pairs of words: Evidence of a dependence between retrieval operations. Journal of Experimental Psychology, 90, 227-234.
Minton, S. (1989). Learning effective search control knowledge: An explanation-based approach (Doctoral dissertation, Carnegie Mellon University, 1988). Dissertation Abstracts International, 49, 4906B-4907B.
Mitchell, T. M., Utgoff, P. E., & Banerji, R. (1983). Learning by experimentation: Acquiring and refining problem-solving heuristics. In R. S. Michalski, J. G. Carbonell, & T. M. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Los Altos, CA: Morgan Kaufmann.
Neches, R. (1982). Models of heuristic procedure modification (Doctoral dissertation, Carnegie Mellon University, 1981). Dissertation Abstracts International, 43, 1645B.
Ohlsson, S. (1987). Transfer of training in procedural learning: A matter of conjectures and refutations? In L. Bolc (Ed.), Computational models of learning. Berlin: Springer-Verlag.
Schank, R. C. (1982). Dynamic memory. Cambridge, England: Cambridge University Press.
VanLehn, K., & Jones, R. M. (in press). Integration of explanation-based learning of correctness and analogical search control. In S. Minton & P. Langley (Eds.), Proceedings of the symposium on learning, planning and scheduling. Los Altos, CA: Morgan Kaufmann.
Chapter 8

A View of Computational Learning Theory*

Leslie G. Valiant
Harvard University and NEC Research Institute
Abstract The distribution-free or "pac" approach to machine learning is described. The motivations, basic definitions and some of the more important results in this theory are summarized.
INTRODUCTION

At present computers are programmed by an external agent, a human, who presents an explicit description of the algorithm to be executed to the machine. There would be clear advantages in having machines that could learn, in the sense that they could acquire knowledge by means other than explicit programming. It is not self-evident that machine learning is feasible at all. The existence of biological systems that perform learning apparently as a result of computations in their nervous systems provides, however, a strong plausibility argument in its favor. When contrasted with currently understood methods of knowledge acquisition, learning as exhibited by humans, and even so-called lower animals, is a spectacular phenomenon. What is the nature of this phenomenon? What are the laws and limitations governing it? Although the engineering of learning systems and the understanding of human learning are both very relevant we place at the center of our investigations a third point of view, namely that of seeking to understand

* Research at Harvard was supported in part by the National Science Foundation NSF-CCR-89-02500, the Office of Naval Research ONR-N0014-85-K-0445, the Center for Intelligent Control ARO DAAL 03-86-K-0171 and by DARPA AFOSR 89-0506. This article appeared also in "Computation and Cognition", C.W. Gear (ed.), SIAM, Philadelphia (1991), 32-53.
for its own sake the computational phenomenon of learning. Understanding the ultimate possibilities may be more fruitful in the long run than either of the other two approaches, and may even turn out to be easier. As an analogy one can consider the question of understanding motion in physics. The engineering of moving vehicles and the understanding of human movement both raise interesting and challenging questions. The more general question of understanding the laws and limitations of motion itself, however, has yielded the more fundamental insights. In general usage the word learning has a great variety of senses. Our aim here is to discuss just one of these, namely inductive learning, which we consider to be both central and amenable to analysis. Induction has been investigated extensively by philosophers and its nature debated. It has to do with the phenomenon of successful generalization. Humans appear to be able to abstract from experience principles that are not strictly implied in the experience. For example, after seeing relatively few classified examples of a category, such as that of a chair, a child is able to classify further natural examples with remarkable and mysterious accuracy. The centrality of induction has been recognized from the beginning. According to Aristotle "all belief comes from either syllogism or induction". Its nature has proved more elusive. Hume's view, that regularities in experience give rise to habits in expectation, seems to capture the essence of induction. It does not make explicit, however, the specific nature of the regularities that give rise to such habits. It seems clear that these regularities must have some particular nature, which is another way of saying that for generalization to work some assumptions have to be made. In the absence of any assumptions, a child after seeing some objects, all identified as chairs, would be unjustified in reaching any opinion whatever about unseen objects. In 1984 a theory was proposed by the author (Valiant, 1984) in which Hume's regularities are not imposed by properties of the observed world, but by computational limitations internal to the learner. In particular the assumption underlying the induction process is that what is being learned can be learned by a computational process that is quantitatively feasible. This approach offers two philosophical advantages. First it makes no assumptions about the world, complex as it is. This is important since, in contrast with physics where many simple laws have been
found, at the level of cognition and human concepts no analogous simple regularities have been identified. The second philosophical advantage is that the assumptions that are made can be argued to be self-evidently true. The concepts that humans do learn from examples are, by definition, learnable from examples. The assumptions, however, are not vacuous. For example, learnability implies that the program learned has a small representation, a restriction that reduces the set of possibilities to a minute subset of all possible functions. Current evidence suggests that the constraint of computational feasibility on the learning process restricts the class even further than this. In this paper we review some of the recent results that relate to this one framework for studying learning. Our treatment is necessarily at best partial even for this one model and no attempt is made here to relate it to other approaches. Various reviews of related material are given by Dietterich (1990), Haussler (1987), Kearns (1990), Kearns, Li, Pitt, and Valiant (1987b), and by Laird (1989). Our aim is to give a brief view of these results. We make particular reference to questions such as the following: Which of the results are unexpected or surprising? What new insights have been gained? What range of learning phenomena can be usefully discussed in this framework? What new algorithms have been discovered? Learning appears to be a rich and diverse field and we are clearly a long way from having even the roughest outline of the possibilities.
A MODEL FOR LEARNING BY EXAMPLE

Our model can be viewed as a specification of the functional behavior desired of a mechanism that purports to do learning. In the simplest version it models the learning of a concept or function from positive and negative examples of it. We will discuss this version most extensively. The definition attempts, however, to capture learning at a broader level and, as we shall see, is adaptable to a variety of learning situations. The model incorporates two basic notions. The first is that one cannot hope to perform inductive learning to perfection. Some level of error in what is learned is inevitable. The learner should be able to estimate an upper bound on the error at any stage of learning and
should be able to control it. In particular, he should be able to make it arbitrarily small if he is willing to spend more resources such as time or the number of examples seen. To be specific we shall insist that the resources needed to reduce the error to ε should grow only as a fixed polynomial p(1/ε). The second basic notion is that in any theory of learning one should give an account of how the computations performed can be done in feasibly few steps. In particular, we shall require that the resources be again bounded by a fixed polynomial in the relevant parameters, such as the size of the minimal description of the program being learned, or the number of variables in the system. As we shall see later there is evidence suggesting that the class of all programs is not learnable in the above sense. Hence learning algorithms will restrict themselves to some special class C of functions. Typically the algorithm computes a representation of some special form of the function, and the choice of this knowledge representation is often critical. Depending on the application area different styles of representation may be appropriate. For cognitive tasks one based on logic is natural. The simplest such representations are Boolean expressions. Where continuous variables are involved geometric criteria are more appropriate. Thus we may represent an example of a concept in n-dimensional space by a set of n coordinate values. A concept may be represented, for example, as a hyperplane that separates positive from negative examples. Lastly in linguistic contexts where sequences are important one may consider automata-theoretic characterizations. In general we consider that the cognitively most relevant setting for learning is one in which the system already has much knowledge. The basic variables can then be either the primitive input variables of the system or the outputs of arbitrarily complex programs that are already part of the system (by virtue of preprogramming or previous learning). Thus learning is always relative to the existing knowledge base. This is important for two reasons. First, it implies that a theory of relative learning is sufficient. Second, it highlights an important advantage of learning over programming. It may be infeasible to augment the knowledge of a system by programming, even in a minor way, if the state of the system is so complex that it is difficult for the outside agent to understand it. In contrast, learning takes place relative to the current
state since the program sought takes as inputs the outputs of whatever complex programs the system already contains. We shall define our basic model as one for learning predicates (or concepts) from examples and counterexamples of it. The aim of the learning algorithm is to find a rule (or program or hypothesis) that reliably categorizes unseen examples as being positive or negative. Let X be the domain from which the examples are drawn. A concept c ⊆ X is the set of positive examples. This is sometimes denoted by pos(c), while its complement X − pos(c) is denoted by neg(c). It is meaningful to discuss the learnability of a class C of concepts rather than that of a single concept. For example, for Boolean functions over n variables we would define X to be {0, 1}^n. An example of a Boolean concept class over {0, 1}^5 would be the class of 2-disjunctive normal form expressions (2-DNF) consisting of all predicates that can be written as a disjunction of length-two conjunctions. An example of an individual concept in this class would be that defined by the expression x₁x₃ + x₂x₃ + x₁x₄ + x₂x₄. It turns out that it is sometimes important in the learning context to distinguish the functions being learned from particular representations of them. The learner needs to represent the hypothesis somehow and we shall denote the class of such representations by H. The above example is logically equivalent to the expression (x₁ + x₂)(x₃ + x₄), but the difficulty of learning some class containing this instance may be different.
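As a quick sanity check of this equivalence, one can simply enumerate all assignments; a minimal sketch in Python:

```python
from itertools import product

# Verify that the 2-DNF expression x1x3 + x2x3 + x1x4 + x2x4 and its
# rewriting (x1 + x2)(x3 + x4) agree on every assignment: they denote
# the same function, though as representations they may differ.
for x1, x2, x3, x4 in product((0, 1), repeat=4):
    dnf = (x1 and x3) or (x2 and x3) or (x1 and x4) or (x2 and x4)
    cnf = (x1 or x2) and (x3 or x4)
    assert bool(dnf) == bool(cnf)
```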
For brevity our notation will sometimes identify functions with their representation where this makes no difference. Learning by an efficiently universal class of representations, such as Boolean circuits, is sometimes called prediction (Haussler, Littlestone & Warmuth, 1988b). In general we want to determine how fast the computational difficulty of learning increases as some size parameter |c| of the concept grows. We shall therefore regard as stratified both the domain X = ∪_{n≥1} X_n as well as the class of representations C = ∪_{n≥1} C_n. Typically n will be the number of variables. We can also introduce a further parameter s to define, for example, the size of a Boolean expression. Then C_n = ∪_{s≥1} C_{n,s}, where C_{n,s} denotes the subclass of C consisting of concepts with parameters n and s. We assume that for each c ∈ C there are two probability distributions D⁺ and D⁻ that describe the relative probability of occurrence in nature of the elements in pos(c) and neg(c) respectively. The distributions represent the nature of the world, about which we wish to make no assumptions. Hence they are allowed to be unknown and arbitrary except for time invariance. Learning in a changing world is clearly more difficult. Analysis of that situation remains to be done. If c ∈ C is the concept being learned and h ∈ H is the hypothesis of the learner, we define the error e⁺(h) to be the probability according to D⁺ that a random x ∈ pos(c) belongs to neg(h). Analogously e⁻(h) is the probability according to D⁻ that a random x ∈ neg(c) belongs to pos(h). We now define a class C to be learnable by representation class H, both over some domain X, if the following holds: There exists a learning algorithm A that, for some polynomial p, for any c ∈ C, for any D⁺, D⁻ and any 0 < ε, δ < 1, given access to random draws from D⁺, D⁻ in any one step, will output in p(1/ε, 1/δ, |c|) steps a hypothesis h that with probability at least 1 − δ will have e⁺(h) < ε and e⁻(h) < ε, where |c| is some agreed measure of concept complexity (e.g., number of variables plus size of description of c in bits). The definition finesses the issue of the distribution being unknown by requiring only that the hypothesis perform well on the same unknown distribution from which the examples are drawn. The requirement that the same algorithm perform well on a variety of distributions seems natural since in human learning one must presume that no more than a limited number of learning algorithms are being applied in a wide variety of contexts. Furthermore current analysis suggests that insistence on good performance even in worst-case distributions is not as onerous as worst-case requirements in some other areas of computation, such as graph algorithms, appear to be. For example, restricting to uniform distributions is not known to make many classes learnable that are not so otherwise. If the computational requirement is removed from the definition then we are left with the notion of nonparametric inference in the sense of statistics, as discussed in particular by Vapnik (1982). For discrete domains all reasonable representations are then learnable (Blumer, Ehrenfeucht, Haussler & Warmuth, 1987). What gives special flavor to our
definition is the additional requirement of efficient computation. This appears to restrict the learnable classes very severely. This model has been described as "probably approximately correct" or "pac" learning (Angluin, 1987b). Since the efficiency aspect is so central a more accurate acronym would be "epac" learning. The quantitative requirement in the definition is that the runtime, and hence also the number of examples sought, has to be bounded by a fixed polynomial in 1/ε and 1/δ as well as in the parameters that specify the complexity of the instance being learned. With doubly stratified classes such as C_{n,s} both n, the number of variables, and s, the size of the concept, would be parameters. The model is not restricted to discrete domains. Blumer, Ehrenfeucht, Haussler and Warmuth (1989) describe a formulation allowing geometric concepts. For example in n-dimensional Euclidean space a natural concept class is that of a half-space (Minsky & Papert, 1988). In such domains one has to state how one charges for representing and operating on a real number. Typically charging one unit is appropriate for both cases. In the definition as given learning refers to the acquisition of new information and the parameter optimised is the accuracy ε. Similar formulations are possible for other situations also. For example the learner may not be acquiring new information but may seek to increase some measure of performance at a task as a result of training.
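To illustrate the definitions, the sketch below pac-learns conjunctions of literals by the classical elimination algorithm. It draws labeled examples from a single source (one of the equivalent variants discussed in the next section); the oracle interface, parameter names, and the particular (sufficient but not tight) sample-size bound are our own illustrative choices.

```python
import math

def learn_conjunction(sample_oracle, n, eps, delta):
    """Pac-learn a conjunction of literals over n Boolean variables.

    sample_oracle() returns (x, label), where x is a length-n tuple of
    bits drawn from the unknown distribution and label indicates
    membership in the target concept.  Elimination algorithm: start
    with all 2n literals and drop every literal contradicted by a
    positive example.  The hypothesis is always at least as specific
    as the target, so it never errs on negative examples.
    """
    # One standard sufficient sample size for error eps, confidence delta.
    m = math.ceil((2 * n / eps) * (math.log(2 * n) + math.log(1 / delta)))
    literals = {(i, v) for i in range(n) for v in (0, 1)}  # "x_i must equal v"
    for _ in range(m):
        x, label = sample_oracle()
        if label:  # positive example: remove the literals it violates
            literals -= {(i, 1 - x[i]) for i in range(n)}
    return lambda x: all(x[i] == v for (i, v) in literals)
```

Note how the resources grow polynomially in n, 1/ε and 1/δ, exactly as the definition demands.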
ROBUSTNESS OF MODEL

For any computational model it is important to ask how it relates to alternative formalisms that attempt to capture similar intuitions. A series of recent results has established that the pac model is robust in the sense that the set of learnable classes remains invariant under a large range of variations in the definitions. Several aspects of the definition contain some arbitrariness and it is natural to ask first whether the particular choices made make any difference. Haussler, Kearns, Littlestone and Warmuth (1988a) review some twenty-eight variations and show them all equivalent. One issue, for example, is whether the decision to have separate sources for positive and
negative examples rather than a single source and distribution, enhances or diminishes the power of the model. It turns out that it makes no difference. Among further variations shown to be equivalent in Haussler et al. (1988a) are those generated by the choice of whether the parameters ε, δ, n, s are given as part of the input to the learning algorithm or not. Also, allowing the learning algorithm to be randomized rather than deterministic adds no power (under weak regularity assumptions on H) since any randomness needed can be extracted from the source of random examples. A further issue is the treatment of the confidence parameter δ. It does not make any difference whether we insist on the complexity being bounded by a constant, or by a polynomial in log(1/δ), rather than by a polynomial in 1/δ. Haussler et al. (1988b) consider models where the examples are viewed as coming in a sequence. The algorithm, on seeing each example, makes a prediction about it, and on receiving the true classification updates itself if necessary. They define models where the total number of mistakes made is polynomially bounded, or the probability that a mistake is made at any one step diminishes appropriately. They show that these models are equivalent to the pac model if the representation is universal (e.g. Boolean circuits). Two further variations are group learning and weak learning, both of which appear on the surface to be strictly less demanding models. In the first the requirement is to find a hypothesis that, when given a set of examples of appropriate polynomial size promised to be all positive or all negative, determines which is the case, as the pac model does for single examples. In the second, weak learning, we revert to classifying single examples again but are satisfied with accuracy 1/2 + 1/p where p is a polynomial in the relevant parameters. This captures the gambling context where it is sufficient to have any discernible odds in one's favor. That these two models are equivalent to each other was shown by Kearns and Valiant (1989), and Kearns, Li and Valiant (1989). Subsequently in a surprising development Schapire (1989) gave a construction showing that they are equivalent to the pac model also. His construction shows that the accuracy of any learning algorithm can be boosted provided it works for all distributions. An alternative construction has been given recently by Freund (1990).
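Schapire's construction is intricate, but a toy calculation conveys why a slight edge over guessing is so powerful. The sketch below computes the error of a majority vote over k hypothetically *independent* predictors of accuracy 1/2 + γ; real boosting cannot assume independence (it re-weights the distribution instead), so this is only an intuition pump, not the construction itself.

```python
from math import comb

def majority_error(k, gamma):
    # Probability that a majority vote of k independent predictors,
    # each correct with probability 1/2 + gamma, is wrong, i.e. that
    # at most floor(k/2) are correct.  Use odd k to avoid ties.
    p = 0.5 + gamma
    return sum(comb(k, i) * p**i * (1 - p)**(k - i) for i in range(k // 2 + 1))

# majority_error(101, 0.05) is about 0.16, while majority_error(1001, 0.05)
# is below 0.001: a weak advantage compounds rapidly.
```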
The results above all give evidence of robustness with respect to changes in definition of the model. A second important but different robustness issue is that of whether learnability of classes is preserved under various simple mathematical operations on the classes or their members. It is shown by Kearns, Li, Pitt and Valiant (1987a) that the learnability of a class is preserved under a wide class of substitutions of variables. It follows from this that learning most classes of Boolean formulae does not become easier if we forbid repetitions of variables or negations of variables. Pitt and Warmuth (1988) consider much more general reductions that preserve learnability. They use these to show such unexpected relationships as that the learnability of finite automata would imply the learnability of Boolean formulae. Lastly we mention that the closure properties of learnable classes can be investigated for such operations as union, differences, nested differences and composition (Helmbold, Sloan & Warmuth, 1989; Kearns et al., 1987a; Ohguro & Maruoka, 1989).

SOME MORE DEMANDING VARIANTS
Resilience To Errors

The model as described does not allow for any errors in the data. In a practical situation one would expect that examples would be occasionally misclassified or their description corrupted. Clearly it would be desirable to have algorithms that are resilient in the sense that they would generate hypotheses of acceptable accuracy even when errors occur in the examples at some rate. Several models of error have been proposed. It is generally assumed that each example called is correct with probability 1 − μ independent of previous examples. A worst-case model, the so-called malicious model, allows that with probability μ the example returned be arbitrary both as far as the description of the example as well as its classification. Both parts can be constructed by an adversary with full knowledge of the state of the learning algorithm. Even with this model a certain level of error can be tolerated for some classes of Boolean functions (Valiant, 1985). By
very general arguments Kearns and Li (1988) have shown, however, that the accuracy rate (1 − ε) cannot exceed 1 − μ/(1 − μ). If we disallow corruption of the data but allow the classification to be wrong with probability μ for each example independently, then learning becomes more tractable. Angluin and Laird (1987) show that learning to arbitrarily small ε can be done for any known μ < 1/2. Analyses of intermediate models are given by Shackelford and Volper (1988) and by Sloan (1988). The issue of errors is clearly important. There are large gaps in our current knowledge even for simple representations such as conjunctions (Kearns et al., 1988). For geometric concepts in the case that the erroneous examples can be arbitrary even less is known. For example, there is no satisfactory algorithm or theory known for learning half-spaces with such errors.

Positive Examples Only

The question of whether humans learn largely from positive examples has received much attention. From a philosophical viewpoint induction from examples of just one kind appears even more paradoxical than the general case. It turns out, however, that such learning is feasible in some simple cases such as conjunctions and vector spaces (Helmbold et al., 1989; Shvaytser, 1990; Valiant, 1984). Some general criteria for learning from positive only examples are given in Natarajan (1987). Learning from examples of one kind has features that distinguish it from the general case. On the assumption that P = NP the class of all Boolean circuits is learnable in the two-sided case. This is not true in the one-sided case. In fact, learning simple disjunctions (e.g., x₁ + x₃ + x₅) requires exponentially many examples for information theoretic reasons (i.e., independent of computation) if only positive examples are available (Gereb-Graus, 1989; Kearns et al., 1987a; Shvaytser, 1990).

Irrelevant Attributes

We view learning as most interesting when it is allowed to be hierarchical. When learning a new concept we assume that it has a short description involving few variables, but these variables can be either
primitives of the input devices or the outputs of much higher level functions previously programmed or learned. In human learning the number of concepts recognized at any time has been estimated as upwards of 10⁵. Hence we have to aim at situations in which the number of variables n is of this order, but most of them are irrelevant to any one new concept. Having the sample complexity grow linearly with n is unsatisfactory. We could hypothesize that humans have, in addition to induction capabilities, a focusing mechanism that on semantic grounds identifies which ten, say, of the 10⁵ variables are really relevant. This, however, is exactly what we wish to avoid. We would like to absorb this "relevance identification" within the learning task, rather than leave it unexplained. The first indication that this might be possible was a result of Haussler (1988). He showed that, among other things, learning conjunctions of length k over n variables could be done from only O(k log n) examples. The reduction of the dependence on n from linear to logarithmic is the significant point here. Littlestone (1988) subsequently showed that the same effect could be achieved by a very elegant class of algorithms that resembled the classical perceptron algorithm but used a multiplicative update rule, as sketched below. Very recently in a further surprising development Blum (1990a) described a context in which the learning of short hypotheses could be made independent of the total number of variables. Here each example is described by a set of variables that indicate the ones that are positive in the example. The complexity of learning certain Boolean formulae, such as conjunctions, can be bounded by an expression in terms of the length of description of the examples and of the hypothesis, even in an infinite attribute space.
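The sketch below shows the general shape of such a multiplicative-update learner, in the style of Littlestone's Winnow for monotone disjunctions; the particular promotion and demotion factors and the threshold are illustrative choices rather than a faithful transcription of his algorithm.

```python
def make_winnow(n, theta=None):
    """Multiplicative-update learner in the style of Winnow
    (Littlestone, 1988); parameter choices here are illustrative.
    """
    w = [1.0] * n                        # one weight per attribute
    theta = float(n) if theta is None else theta

    def predict(x):                      # x is a 0/1 sequence of length n
        return int(sum(wi * xi for wi, xi in zip(w, x)) >= theta)

    def update(x, y):                    # call only when predict(x) != y
        factor = 2.0 if y == 1 else 0.5  # promote on false negatives,
        for i in range(n):               # demote on false positives
            if x[i]:
                w[i] *= factor

    return predict, update
```

For a target disjunction over k of the n variables, the number of mistakes such algorithms make grows roughly as O(k log n), so performance degrades only logarithmically with the number of irrelevant attributes.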
Heuristics

The assumption in the basic model that the examples are totally consistent with a rule from a known class is one which one would like to relax. Error models offer one direction of relaxation. A second approach is to have the hypotheses learned still belong to a known class, but now regard them as heuristics in the sense that they account only for a certain percentage of the examples, say 80%. It may be that there is a simple rule of thumb that explains such a sizable percentage of examples, but that a much more complex hypothesis would be required to explain a larger fraction. It turns out that learning heuristics, even when they are simple conjunctions, is more difficult than in the basic model (Pitt & Valiant, 1988; Valiant, 1985).

Learning Functions

Learning Boolean predicates is a special case of the problem of learning more general functions. Haussler (1990) has given an interesting formulation of this in the spirit of the pac model. An important instance of function learning is that of learning distributions. Instead of having a hypothesis that predicts whether an example is a member of the concept, it now outputs a probability. In spite of the greater generality of this formulation, Kearns and Schapire (1990) have shown that positive results can be obtained in this framework also.

Reliable and Useful Learning

In some contexts one may wish for hypotheses that are reliable in the sense that they never misclassify. In a probabilistic setting this is too much to expect unless one allows the hypothesis to output "don't know" in some cases. Rivest and Sloan (1988) have shown that such a model is viable and plays a significant role in hierarchical learning. Reliable learning becomes useful, in their sense, if the probability of a "don't know" is suitably small. Reliable and useful learning is a much more demanding model than the basic one and has been applied only in very restricted cases (Kivinen, 1989).

Limiting the Computational Model

In all of the above we required that computations be performed in polynomial time on a general purpose model of computation such as a Turing machine. Since biological nervous systems have particular characteristics it is natural to ask how these results change if we restrict the
models of computation appropriately. Such results have been obtained for certain models that are efficiently parallel (Boucheron & Sallantin, 1988; Vitter & Lin, 1988), space bounded (Floyd, 1989), or attempt to model neural systems directly (Valiant, 1988).

Unsupervised Learning

Many learning situations involve no teacher labeling examples as positive or negative. The case when a totally passive world is observed by the learner is called unsupervised learning. A simple instance of unsupervised learning is that of detecting pairs of attributes that occur with high correlation (Paturi, Rajasekaran, & Reif, 1989). More generally it is associated with clustering and other statistical techniques. A point of view put forward in Valiant (1985, 1988) is that the most plausible way of overcoming the apparent limitations of both supervised and unsupervised learning separately is to combine them. For example, no effective algorithm is currently known for learning disjunctive normal form expressions in general. On the other hand one can imagine a system that learns special cases by the following two-tier strategy. It first learns conjunctions in some unsupervised sense, such as by detecting those pairs or n-tuples of variables that occur with high statistical correlation. In a separate stage it then learns a disjunction of these in supervised mode. It is possible that in human learning this kind of dynamic learning, where one alternates supervised and unsupervised phases, plays an important role.

SOME LESS DEMANDING VARIANTS
Special Distributions

It is possible that in biological learning special properties of distributions are exploited. Unfortunately we have no indications as to what properties natural distributions have that make learning easier than for worst-case distributions. As far as mathematical simplicity is concerned, the obvious
case to consider is when D⁺ and D⁻ are uniform, and this case has received some attention. For the distribution-free model an outstanding open problem is that of learning disjunctive normal form (DNF). Even when restricted to the uniform distribution DNF is not known to be learnable in polynomial time, although it is learnable in time n^{O(log n)}. Furthermore, the class of formulae with a constant number of alternations of conjunctions and disjunctions (so called constant depth circuits) is learnable in time exponential in (log n)^d where the d depends on the depth (Linial, Mansour & Nisan, 1989). Some restrictions of DNF that are NP-hard to learn in the general model become learnable for the uniform distribution. These are μ-DNF, where each variable occurs once (Kearns et al., 1987a), and k-term DNF, where the disjunction is over k conjunctions (Gu & Maruoka, 1988; Kucera & Protasi, 1988; Ohguro & Maruoka, 1989). Baum has considered uniformly distributed points on a sphere in the context of geometric concepts. He has shown that for learning half-spaces better polynomial bounds can be obtained than in the general case (Baum, 1990a). On the other hand for learning the intersection of two half-spaces by the k-nearest neighbor algorithm exponential time is required (Baum, 1990c). The notion of learnability for fixed distributions has been analyzed by Benedek and Itai (1988) and Natarajan (1990). Finally we note that other special distributions have been investigated also. Li and Vitanyi (1989) consider one that is in some sense the hardest. Baum (1990b) considers distributions in Euclidean n-space that are symmetric about the origin, and shows that the intersection of two half-spaces is learnable for these.

Ignoring Computation

As mentioned earlier if we ignore the computational aspect then we are back to purely statistical or information theoretic questions, which to within polynomial factors are trivial for discrete domains (Blumer et al., 1987). For infinite domains many issues come up which are more fully considered in Vapnik and Chervonenkis (1971), Blumer, Ehren-
feucht, Haussler and Warmuth (1989), Ben-David, Benedek and Mansour (1989), Linial, Mansour and Rivest (1988), and Benedek and Itai (1987). The major tool here is the Vapnik-Chervonenkis dimension, which is a discrete quantity that characterizes the number of examples required for learning. The VC-dimension has been worked out for several concept classes. It is n + 1, for example, for half-spaces in n dimensions. Furthermore for learning to within confidence δ and error ε fairly tight expressions are known on the number of examples needed (Blumer, Ehrenfeucht, Haussler & Warmuth, 1989; Ehrenfeucht, Haussler, Kearns & Valiant, 1989). One application given by Baum and Haussler (1989) is to neural nets, where this kind of analysis has given guidance as to the number of examples needed for a generalization to be reliable.
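For reference, one commonly cited sufficient sample size from Blumer, Ehrenfeucht, Haussler and Warmuth (1989) is easy to evaluate; the helper below is our own wrapper around that bound, and the constants should be taken as theirs, not as tight in practice.

```python
from math import ceil, log2

def sample_bound(d, eps, delta):
    """Sample size sufficient for any consistent learner over a class of
    VC dimension d (Blumer, Ehrenfeucht, Haussler & Warmuth, 1989):
        m >= max((4/eps) * log2(2/delta), (8d/eps) * log2(13/eps)).
    """
    return ceil(max((4 / eps) * log2(2 / delta),
                    (8 * d / eps) * log2(13 / eps)))

# Half-spaces in 10 dimensions have VC dimension 11, so
# sample_bound(11, 0.1, 0.05) examples (about 6200) suffice for
# error 0.1 with confidence 0.95.
```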
LEARNING BY ASKING QUESTIONS

So far we have only considered passive learners. One would expect that a more aggressive learner who can ask questions would be a more successful one. This turns out to be the case. The basic model can be adapted easily to this more general case. For each kind of question we allow the learner to ask we hypothesize an oracle that is able to answer it. One such oracle is MEMBER. A call to this oracle with example x when the unknown concept is c would return "yes" or "no" depending on whether x ∈ c. Another oracle called EQUIVALENCE takes as input a hypothesis h and recognizes whether h = c. If there is equivalence it outputs "yes". Otherwise it produces a counterexample to the equivalence. For any fixed set of such oracles one can define learnability exactly as before except that the learning algorithm can consult one of these oracles in any one step (Valiant, 1984). If oracles are available one can also consider completely deterministic learning models where random draws from examples are dispensed with altogether. One such model is the "minimal adequate teacher" (Angluin, 1987a) which consists of the combination of MEMBER and EQUIVALENCE oracles. We note that the latter can be replaced by a probabilistic source of examples as in the pac model, since an equivalence h = c can be tested to a high degree of confidence by calling for enough random examples of c and checking for consistency with h.
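This sampling simulation of EQUIVALENCE is simple to spell out; the sketch below is our own phrasing of the standard argument. If the hypothesis in fact has error at least ε, then m = (1/ε) ln(1/δ) random examples expose a counterexample with probability at least 1 − δ.

```python
from math import ceil, log

def approx_equivalence(h, sample_oracle, eps, delta):
    """Simulate an EQUIVALENCE query with random draws: return a
    counterexample if one is found, else None.  If error(h) >= eps,
    all m draws agree with h with probability at most
    (1 - eps)^m <= delta for m = (1/eps) * ln(1/delta).
    """
    m = ceil((1 / eps) * log(1 / delta))
    for _ in range(m):
        x, label = sample_oracle()
        if h(x) != label:
            return x          # counterexample to h = c
    return None               # h is probably approximately correct
```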
The deterministic model, however, often makes analysis more manageable in specific cases. Several classes are now known to be learnable with such a minimal adequate teacher. These include deterministic finite automata (Angluin, 1987a), read-once Boolean formulae (Angluin, Hellerstein & Karpinski, 1989), and one-counter automata (Berman & Roos, 1987). A number of related results are given in Angluin (1987b), Hancock (1990), Rivest and Schapire (1987, 1989), Sakakibara (1988), and Goldman, Rivest, and Schapire (1989). The issue of oracles is discussed more systematically in Angluin (1987b). Allowing oracles appears to enlarge the range of positive results that can be obtained. The question of what constitutes a reasonable oracle, however, is unresolved. Clearly one can devise oracles that ask explicitly about the hypothesis, such as the identity of the ith line in the program defining it, that trivialize the learning problem. On the other hand membership oracles seem very plausible in relation to human learning.

LIMITS TO WHAT CAN BE LEARNED

There are both information-theoretic and computational limitations to learning. Examples of the former already mentioned are the exponential lower bound on learning conjunctions from negative examples alone, and the lower bounds on sample complexity derived in terms of the Vapnik-Chervonenkis dimension. Current knowledge suggests that the computational limitations are much more severe. Without the computational constraint the class of all Boolean circuits (or any equivalent representation of discrete programs) is learnable. Once we insist on polynomial time computability only restricted subclasses are known to be learnable.

Representation-dependent Limits

Suppose we have an algorithm for learning C by the class H of representations. If we enlarge C then the problem will typically get more difficult. Enlarging H, on the other hand, and keeping C unchanged will typically make learning, if anything, easier since no more has to be
learned but we have more ways of representing the hypotheses. Thus if C is learnable by H, then replacing H by a larger class H′ could, in principle, make learning either harder or easier. In this sense learnable classes are not monotonic. Another way of describing this phenomenon is the following. If C is not learnable by H, then this may be due to two reasons: either C is too large or H is too restricted. It turns out that existing techniques for proving NP-completeness impediments to learning are all of the second kind. Among the simplest classes C that are known to be hard to learn in this sense are 2-term DNF (i.e., disjunctions of two conjunctions) and Boolean threshold functions (i.e., half-spaces of the form Σ aᵢxᵢ ≥ b where each aᵢ ∈ {0, 1}). For these classes learning C by C is NP-hard. In both cases, however, by enlarging C as functions we can obtain learnable classes. In the first case 2-CNF suffices, and in the second unrestricted half-spaces (Pitt & Valiant, 1988). A further example of an NP-complete learning problem is the intersection of two half-spaces (Megiddo, 1986). This remains NP-complete even in the case of {0, 1} coefficients corresponding to certain three-node neural nets (Blum & Rivest, 1988). NP-hardness results are also known for learning finite automata (Li & Vazirani, 1988; Pitt, 1989; Pitt & Warmuth, 1989) and other classes of neural nets (Judd, 1988; Lin & Vitter, 1989).

Representation-independent Limits

As mentioned above there is a second reason for a class C not being learnable, in this case by any representation, and that is that C is too large. For reasons not well understood the only techniques known for establishing a negative statement of this nature are cryptographic. The known results are all of the form that if a certain cryptographic function is hard to compute then C is not learnable by any H. For such proofs the most natural choice of H is Boolean circuits since they are universal, and can be evaluated fast given their descriptions and a candidate input. The first such result was implicit in the random function construction of Goldreich, Goldwasser and Micali (1986). It says that assuming one-way functions exist, the class of all Boolean circuits is not learnable even for the uniform distribution and even with access to a membership oracle.
Various consequences can be deduced from this by means of reduction (Pitt & Warmuth, 1988; Warmuth, 1989). Since positive learning results are difficult to find even for much more restricted models it was natural to seek negative results closer to the known learnable classes. In Kearns and Valiant (1989) it was shown that deterministic finite automata, unrestricted Boolean formulae (i.e., tree structured circuits) and networks of threshold elements (neural nets) of a certain constant depth, are each as hard to learn as it is to compute certain number-theoretic functions, such as factoring Blum integers (i.e., the products of two primes both equal to 3 mod 4) or inverting the RSA encryption function.

MODELS USEFUL FOR ALGORITHM DISCOVERY

Having precise models of learning seems to aid the discovery of learning algorithms. It focuses the mind on what has to be achieved. One significant finding has been that different models encourage different lines of thought and hence the availability of a variety of models is fruitful. Many of the algorithms discovered recently were developed for models that are either superficially or truly restrictions of the basic pac model. One such model is that of learning from positive examples alone. This constraint suggests its own style of learning. Another model is the deterministic one using oracles discussed above. Although the results for these translate to the pac model with oracles the deterministic formulation often seems the right one. A third promising candidate is the weak learning model. In seeking algorithms for classes not known to be learnable this offers a tempting approach which has not yet been widely exploited. We shall conclude by mentioning two further models both of which have proved very powerful. The first is Occam learning (Blumer et al., 1987). After seeing random examples the learner seeks to find a hypothesis that is consistent with them and somewhat shorter to describe than the number of examples seen. This model implies learnability (Blumer et al., 1987) and is essentially implied by it (Board & Pitt, 1990; Schapire, 1989). It expresses the idea that it is good to have a short hypothesis, but avoids the trap of insisting on the shortest one, which usually gives rise to NP-completeness even in the simplest cases.
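In the discrete case the quantitative content of this idea reduces to a union bound, sketched below in our own notation: among the at most 2^s hypotheses describable in s bits, the chance that some hypothesis of error above ε remains consistent with m random examples is at most 2^s (1 − ε)^m.

```python
from math import ceil, log

def occam_sample_size(s_bits, eps, delta):
    # m examples suffice so that, with probability at least 1 - delta,
    # every hypothesis describable in s_bits bits that is consistent
    # with all m examples has error below eps:
    #   2^s * (1 - eps)^m <= delta  when  m >= (s*ln 2 + ln(1/delta))/eps.
    return ceil((s_bits * log(2) + log(1 / delta)) / eps)
```

A short consistent hypothesis therefore generalizes from modestly many examples, even though finding the shortest one may be intractable.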
NP-completeness even in the simplest cases. Occam learning can be generalized to arbitrary domains by replacing the bound on hypothesis size by a bound on the VC dimension (Blumer et al., 1989). There are many examples of algorithms that use the Occam model. These include algorithms for decision lists (Rivest, 1987), restricted decision trees (Ehrenfeucht & Haussler, 1989), semilinear sets (Abe, 1989) and pattern languages (Kearns & Pitt, 1989).

The second model is that of worst-case mistake bounds (Littlestone, 1988). Here, after each example the algorithm makes a classification. It is required that for any sequence of examples there be only a fixed polynomial number of mistakes made. It can be shown that learnability in this sense implies pac learnability (Angluin, 1987b; Kearns et al., 1987a; Littlestone, 1989). Recently Blum (1990b) showed that the converse is false if one-way functions exist. There are a number of algorithms that are easiest to analyze for this model. The classical perceptron algorithm of Rosenblatt (1961), Minsky and Papert (1988) has this form, except that in the general case the mistake bound is exponential. Littlestone's algorithms that perform well in the presence of irrelevant attributes (Littlestone, 1988), as well as Blum's more recent ones (Blum, 1990a), are intimately tied to this model, as are a number of other algorithms including one for integer lattices (Helmbold, Sloan & Warmuth, 1990).
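To make the mistake-bound style concrete, the following is a minimal sketch (ours, not the chapter's) of Littlestone's Winnow rule for monotone disjunctions, the canonical algorithm that copes well with many irrelevant attributes; the function names and the particular promotion and demotion choices are illustrative.

```python
# Minimal sketch of Littlestone's Winnow rule (Littlestone, 1988) for
# learning a monotone disjunction over n Boolean attributes in the
# worst-case mistake-bound model.  The mistake bound is roughly
# O(k log n) for a target disjunction of k relevant attributes.

def winnow_predict(weights, x, threshold):
    """Predict 1 iff the total weight of the active attributes reaches the threshold."""
    return 1 if sum(w for w, xi in zip(weights, x) if xi) >= threshold else 0

def winnow_run(examples, n):
    """Process a sequence of (x, label) pairs online; return the mistake count."""
    weights = [1.0] * n     # one weight per attribute, initially 1
    threshold = n           # a standard choice of threshold
    mistakes = 0
    for x, label in examples:
        guess = winnow_predict(weights, x, threshold)
        if guess != label:
            mistakes += 1
            if label == 1:  # promotion: double the weights of active attributes
                weights = [w * 2 if xi else w for w, xi in zip(weights, x)]
            else:           # demotion: eliminate the weights of active attributes
                weights = [0.0 if xi else w for w, xi in zip(weights, x)]
    return mistakes
```

On any sequence of examples consistent with a k-literal monotone disjunction, the mistake count of this rule grows roughly as O(k log n) rather than linearly in n, which is what makes it robust to irrelevant attributes.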
References

Abe, N. (1989). Polynomial learnability of semilinear sets. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 25-40.

Angluin, D. (1987a). Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106.

Angluin, D. (1987b). Queries and concept learning. Machine Learning, 2:319-342.

Angluin, D., Hellerstein, L., & Karpinski, M. (1989). Learning read-once formulas with queries (Technical Report UCB/CSD 89/528). Computer Science Division, University of California, Berkeley.

Angluin, D. & Laird, P. (1987). Learning from noisy examples. Machine Learning, 2:343-370.

Baum, E. (1990a). The perceptron algorithm is fast for non-malicious distributions. Neural Computation, 2:249-261.

Baum, E. (1990b). A polynomial time algorithm that learns two hidden unit nets. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Baum, E. (1990c). When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples? Lecture Notes in Computer Science, 412:2-25.

Baum, E. & Haussler, D. (1989). What size net gives valid generalization? Neural Computation, 1(1):151-160.

Ben-David, S., Benedek, G., & Mansour, Y. (1989). A parametrization scheme for classifying models of learnability. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 285-302.

Benedek, G. & Itai, A. (1987). Nonuniform learnability (Technical Report TR 474). Computer Science Department, Technion, Haifa, Israel.
Benedek, G. M. & Itai, A. (1988). Learnability by fixed distributions. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 80-90.

Berman, P. & Roos, R. (1987). Learning one-counter languages in polynomial time. In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 61-67.

Blum, A. (1990a). Learning boolean functions in an infinite attribute space. In Proceedings of the 22nd ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY.

Blum, A. (1990b). Separating distribution-free and mistake-bound learning models over the boolean domain. In Proceedings of the 31st IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 211-218.

Blum, A. & Rivest, R. (1988). Training a 3-node neural network is NP-complete. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 9-18.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Occam's razor. Information Processing Letters, 25:377-380.

Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1989). Learnability and the Vapnik-Chervonenkis dimension. J. ACM, 36(2):929-965.

Board, R. & Pitt, L. (1990). On the necessity of Occam algorithms. In Proceedings of the 22nd ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY.

Boucheron, S. & Sallantin, J. (1988). Some remarks about space-complexity of learning, and circuit complexity of recognizing. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 125-138.

Dietterich, T. (1990). Machine learning. Annual Review of Computer Science, 4.
Ehrenfeucht, A. & Haussler, D. (1989). Learning decision trees from random examples. Information and Computation, 231-247.

Ehrenfeucht, A., Haussler, D., Kearns, M., & Valiant, L. (1989). A general lower bound on the number of examples needed for learning. Information and Computation, 247-261.

Floyd, S. (1989). Space-bounded learning and the Vapnik-Chervonenkis dimension. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 349-364.

Freund, Y. (1990). Boosting a weak learning algorithm by majority. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Gereb-Graus, M. (1989). Lower Bounds on Parallel, Distributed and Automata Computations. PhD thesis, Harvard University.

Goldman, S., Rivest, R., & Schapire, R. (1989). Learning binary relations and total orders. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 46-53.

Goldreich, O., Goldwasser, S., & Micali, S. (1986). How to construct random functions. J. ACM, 33(4):792-807.

Gu, Q. & Maruoka, A. (1988). Learning monotone boolean functions by uniform distributed examples. Manuscript.

Hancock, T. (1990). Identifying μ-formula decision trees with queries. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Haussler, D. (1987). Bias, version spaces and Valiant's learning framework. In Proceedings of the 4th International Workshop on Machine Learning, Morgan Kaufmann, 324-336.

Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2):177-222.
Haussler, D. (1990). Learning conjunctive concepts in structural domains. Machine Learning, 4.

Haussler, D., Kearns, M., Littlestone, N., & Warmuth, M. (1988a). Equivalence of models of polynomial learnability. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 42-55.

Haussler, D., Littlestone, N., & Warmuth, M. (1988b). Predicting {0,1}-functions on randomly drawn points. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 280-296.

Helmbold, D., Sloan, R., & Warmuth, M. (1989). Learning nested differences of intersection-closed concept classes. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 41-56.

Helmbold, D., Sloan, R., & Warmuth, M. (1990). Learning integer lattices. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Judd, J. (1988). Learning in neural nets. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 2-8.

Kearns, M. (1990). The Computational Complexity of Machine Learning. MIT Press.

Kearns, M. & Li, M. (1988). Learning in the presence of malicious errors. In Proceedings of the 20th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 267-279.

Kearns, M., Li, M., Pitt, L., & Valiant, L. (1987a). On the learnability of Boolean formulae. In Proceedings of the 19th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 285-295.
Kearns, M., Li, M., Pitt, L., & Valiant, L. (1987b). Recent results on Boolean concept learning. In Proceedings of the 4th International Workshop on Machine Learning, Los Altos, CA, Morgan Kaufmann, 337-352.

Kearns, M., Li, M., & Valiant, L. (1989). Learning boolean formulae. Submitted for publication.

Kearns, M. & Pitt, L. (1989). A polynomial-time algorithm for learning k-variable pattern languages from examples. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 57-71.

Kearns, M. & Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 3rd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA.

Kearns, M. & Valiant, L. (1989). Cryptographic limitations on learning boolean formulae and finite automata. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 433-444.

Kivinen, J. (1989). Reliable and useful learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 365-380.

Kucera, L., Marchetti-Spaccamela, A., & Protasi, M. (1988). On the learnability of dnf formulae. In ICALP, 347-361.

Laird, P. (1989). A survey of computational learning theory (Technical Report RIA-89-01-07-0). NASA Ames Research Center.

Li, M. & Vazirani, U. (1988). On the learnability of finite automata. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 359-370.

Li, M. & Vitanyi, P. (1989). A theory of learning simple concepts under simple distributions and average case complexity for the universal distribution. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 34-39.
Lin, J.-H. & Vitter, J. S. (1989). Complexity issues in learning by neural nets. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 118-133.

Linial, N., Mansour, Y., & Nisan, N. (1989). Constant depth circuits, Fourier transforms and learnability. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 574-579.

Linial, N., Mansour, Y., & Rivest, R. (1988). Results on learnability and the Vapnik-Chervonenkis dimension. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 56-68.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: a new linear-threshold algorithm. Machine Learning, 2(4):245-318.

Littlestone, N. (1989). From on-line to batch learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 269-284.

Megiddo, N. (1986). On the complexity of polyhedral separability (Technical Report RJ 5252). IBM Almaden Research Center.

Minsky, M. & Papert, S. (1988). Perceptrons: An Introduction to Computational Geometry. MIT Press.

Natarajan, B. (1987). On learning boolean functions. In Proceedings of the 19th ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 296-304.

Natarajan, B. (1990). Probably approximate learning over classes of distributions. Manuscript.

Ohguro, T. & Maruoka, A. (1989). A learning algorithm for monotone k-term dnf. In Fujitsu IIAS-SIS Workshop on Computational Learning Theory.

Paturi, R., Rajasekaran, S., & Reif, J. (1989). The light bulb problem. In Proceedings of the 2nd Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 261-268.
Pitt, L. (1989). Inductive inference, DFAs and computational complexity. In Jantke, K. (editor), Analogical and Inductive Inference, Lecture Notes in Computer Science, Vol. 397, 18-44. Springer-Verlag.

Pitt, L. & Valiant, L. (1988). Computational limitations on learning from examples. J. ACM, 35(4):965-984.

Pitt, L. & Warmuth, M. (1988). Reductions among prediction problems: on the difficulty of predicting automata. In Proceedings of the 3rd IEEE Conference on Structure in Complexity Theory, 60-69.

Pitt, L. & Warmuth, M. (1989). The minimum consistent DFA problem cannot be approximated within any polynomial. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 421-432.

Rivest, R. (1987). Learning decision lists. Machine Learning, 2(3):229-246.

Rivest, R. & Sloan, R. (1988). Learning complicated concepts reliably and usefully. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 69-79.

Rivest, R. L. & Schapire, R. (1987). Diversity-based inference of finite automata. In Proceedings of the 28th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 78-88.

Rivest, R. L. & Schapire, R. (1989). Inference of finite automata using homing sequences. In Proceedings of the 21st ACM Symposium on Theory of Computing, The Association for Computing Machinery, New York, NY, 411-420.

Rosenblatt, F. (1961). Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, Washington, D.C.

Sakakibara, Y. (1988). Learning context-free grammars from structural data in polynomial time. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 330-344.
Schapire, R. (1989). On the strength of weak learnability. In Proceedings of the 30th IEEE Symposium on Foundations of Computer Science, IEEE Computer Society Press, Washington, D.C., 28-33.

Shackelford, G. & Volper, D. (1988). Learning k-DNF with noise in the attributes. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 97-105.

Shvaytser, H. (1990). A necessary condition for learning from positive examples. Machine Learning, 5:101-113.

Sloan, R. (1988). Types of noise for concept learning. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 91-96.

Valiant, L. (1984). A theory of the learnable. Comm. ACM, 27(11):1134-1142.

Valiant, L. (1985). Learning disjunctions of conjunctions. In Proceedings of the 9th International Joint Conference on Artificial Intelligence, 560-566, Los Altos, CA. Morgan Kaufmann.

Valiant, L. (1988). Functionality in neural nets. In Proceedings of the American Association for Artificial Intelligence, 629-634, San Mateo, CA. Morgan Kaufmann.

Vapnik, V. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag.

Vapnik, V. & Chervonenkis, A. Y. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16(2):264-280.

Vitter, J. & Lin, J.-H. (1988). Learning in parallel. In Proceedings of Workshop on Computational Learning Theory, Morgan Kaufmann, San Mateo, CA, 106-124.

Warmuth, M. (1989). Toward representation independence in PAC learning. In Jantke, K. (editor), Analogical and Inductive Inference, Lecture Notes in Computer Science, Vol. 397, 78-103. Springer-Verlag.
Chapter 9

The Probably Approximately Correct (PAC) and Other Learning Models*

David Haussler and Manfred Warmuth
[email protected], [email protected]
Baskin Center for Computer Engineering and Information Sciences
University of California, Santa Cruz, CA 95064
*We gratefully acknowledge the support from ONR grants N00014-86-K-0454-P00002, N00014-86-K-0454-P00003, and N00014-91-J-1162. A preliminary version of this paper appeared in Haussler (1990).

ABSTRACT

This paper surveys some recent theoretical results on the efficiency of machine learning algorithms. The main tool described is the notion of Probably Approximately Correct (PAC) learning, introduced by Valiant. We define this learning model and then look at some of the results obtained in it. We then consider some criticisms of the PAC model and the extensions proposed to address these criticisms. Finally, we look briefly at other models recently proposed in computational learning theory.

INTRODUCTION

It's a dangerous thing to try to formalize an enterprise as complex and varied as machine learning so that it can be subjected to rigorous mathematical analysis. To be tractable, a formal model must be simple. Thus, inevitably, most people will feel that important aspects of the activity have been left out of the theory. Of course, they will be right. Therefore, it is not advisable to present a theory of machine learning as having reduced the entire field to its bare essentials. All that can be hoped for is that some aspects of the phenomenon are brought more clearly into focus using the tools of mathematical analysis, and that perhaps a few new insights are gained. It is in this light that we wish
to discuss the results obtained in the last few years in what is now called PAC (Probably Approximately Correct) learning theory (Angluin, 1988). Valiant introduced this theory in 1984 (Valiant, 1984) to get computer scientists who study the computational efficiency of algorithms to look at learning algorithms. By taking some simplified notions from statistical pattern recognition and decision theory, and combining them with approaches from computational complexity theory, he came up with a notion of learning problems that are feasible, in the sense that there is a polynomial time algorithm that "solves" them, in analogy with the class P of feasible problems in standard complexity theory.

Valiant was successful in his efforts. Since 1984 many theoretical computer scientists and AI researchers have either obtained results in this theory, or complained about it and proposed modified theories, or both. The field of research that includes the PAC theory and its many relatives has been called computational learning theory. It is far from being a monolithic mathematical edifice that sits at the base of machine learning; it's unclear whether such a theory is even possible or desirable. We argue, however, that insights have been gained from the varied work in computational learning theory. The purpose of this short monograph is to survey some of this work and reveal those insights.

DEFINITION OF PAC LEARNING

The intent of the PAC model is that successful learning of an unknown target concept should entail obtaining, with high probability, a hypothesis that is a good approximation of it. Hence the name Probably Approximately Correct. In the basic model, the instance space is assumed to be $\{0,1\}^n$, the set of all possible assignments to $n$ Boolean variables (or attributes), and concepts and hypotheses are subsets of $\{0,1\}^n$. The notion of approximation is defined by assuming that there is some probability distribution $D$ defined on the instance space $\{0,1\}^n$, giving the probability of each instance. We then let the error of a hypothesis $h$ w.r.t. a fixed target concept $c$, denoted $error(h)$ when $c$ is clear from the context, be defined by
$$error(h) = \sum_{x \in h \,\Delta\, c} D(x),$$
where $\Delta$ denotes the symmetric difference. Thus, $error(h)$ is the probability that $h$ and $c$ will disagree on an instance drawn randomly according to $D$.
The hypothesis $h$ is a good approximation of the target concept $c$ if $error(h)$ is small.

How does one obtain a good hypothesis? In the simplest case one does this by looking at independent random examples of the target concept $c$, each example consisting of an instance selected randomly according to $D$, and a label that is "+" if that instance is in the target concept $c$ (positive example), otherwise "-" (negative example). Thus, training and testing use the same distribution, and there is no "noise" in either phase. A learning algorithm is then a computational procedure that takes a sample of the target concept $c$, consisting of a sequence of independent random examples of $c$, and returns a hypothesis.

For each $n \geq 1$ let $C_n$ be a set of target concepts over the instance space $\{0,1\}^n$, and let $C = \{C_n\}_{n \geq 1}$. Let $H_n$, for $n \geq 1$, and $H$ be defined similarly. We can define PAC learnability as follows: The concept class $C$ is PAC learnable by the hypothesis space $H$ if there exists a polynomial time learning algorithm $A$ and a polynomial $p(\cdot,\cdot,\cdot)$ such that for all $n \geq 1$, all target concepts $c \in C_n$, all probability distributions $D$ on the instance space $\{0,1\}^n$, and all $\epsilon$ and $\delta$, where $0 < \epsilon, \delta < 1$, if the algorithm $A$ is given at least $p(n, 1/\epsilon, 1/\delta)$ independent random examples of $c$ drawn according to $D$, then with probability at least $1 - \delta$, $A$ returns a hypothesis $h \in H_n$ with $error(h) < \epsilon$. The smallest such polynomial $p$ is called the sample complexity of the learning algorithm $A$.

The intent of this definition is that the learning algorithm must process the examples in polynomial time, i.e. be computationally efficient, and must be able to produce a good approximation to the target concept with high probability using only a reasonable number of random training examples. The model is worst case in that it requires that the number of training examples needed be bounded by a single fixed polynomial for all target concepts in $C$ and all distributions $D$ on the instance space. It follows that if we fix the number of variables $n$ in the instance space and the confidence parameter $\delta$, and then invert the sample complexity function to plot the error $\epsilon$ as a function of training sample size, we do not get what is usually thought of as a learning curve for $A$ (for this fixed confidence), but rather the upper envelope of all learning curves for $A$ (for this fixed confidence), obtained by varying the target concept and distribution on the instance space. Needless to say, this is not a curve that can be observed experimentally. What is usually plotted experimentally is the error versus the training sample size for particular target concepts on instances chosen randomly according to a single fixed distribution on the instance space. Such a curve will lie below the curve obtained by inverting the sample complexity. We will return to this point later.
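Concretely, $error(h)$ can be estimated by sampling from $D$, which is also how empirical learning curves of the kind just mentioned are produced. The following is a minimal sketch, assuming for illustration a uniform distribution on $\{0,1\}^n$ and treating the hypothesis and target as arbitrary Boolean predicates; all names here are ours.

```python
import random

# Minimal sketch of the PAC error measure: error(h) is the probability,
# under the instance distribution D, that the hypothesis h and the target
# concept c disagree.  Here D is taken to be uniform over {0,1}^n and h, c
# are arbitrary predicates on n-bit tuples.

def estimate_error(h, c, n, num_samples=10000, seed=0):
    """Monte Carlo estimate of Pr_D[h(x) != c(x)] for D uniform on {0,1}^n."""
    rng = random.Random(seed)
    disagreements = 0
    for _ in range(num_samples):
        x = tuple(rng.randint(0, 1) for _ in range(n))
        if h(x) != c(x):
            disagreements += 1
    return disagreements / num_samples

# Example: target c = (x1 AND x2), hypothesis h = x1.  They disagree exactly
# when x1 = 1 and x2 = 0, so the estimate should be close to 0.25.
c = lambda x: x[0] == 1 and x[1] == 1
h = lambda x: x[0] == 1
print(estimate_error(h, c, n=10))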
Another thing to notice about this definition is that target concepts in a concept class $C$ may be learned by hypotheses in a different class $H$. This gives us some flexibility. Two cases are of interest. The first is that $C = H$, i.e. the target class and hypothesis space are the same. In this case we say that $C$ is properly PAC learnable. Imposing the requirement that the hypothesis be from the class $C$ may be necessary, e.g. if it is to be included in a specific knowledge base with a specific inference engine. However, as we will see, it can also make learning more difficult. The other case is when we don't care at all about the hypothesis space $H$, so long as the hypotheses in $H$ can be evaluated efficiently. This occurs when our only goal is accurate and computationally efficient prediction of future examples. Being able to freely choose the hypothesis space may make learning easier. If $C$ is a concept class and there exists some hypothesis space $H$ such that hypotheses in $H$ can be evaluated on given instances in polynomial time and such that $C$ is PAC learnable by $H$, then we will say simply that $C$ is PAC learnable.

There are many variants of the basic definition of PAC learnability. One important variant defines a notion of syntactic complexity of target concepts and, for each $n \geq 1$, further classifies each concept in $C_n$ by its syntactic complexity. Usually the syntactic complexity of a concept $c$ is taken to be the length of (number of symbols or bits in) the shortest representation of $c$ in a fixed concept representation language. In this variant of PAC learnability, the number of training examples is also allowed to grow polynomially in the syntactic complexity of the target concept. This variant is used whenever the concept class is specified by a concept representation language that can represent any Boolean function, for example, when discussing the learnability of DNF (Disjunctive Normal Form) formulae or decision trees. Other variants of the model let the algorithm request examples, use separate distributions for drawing positive and negative examples, or use randomized (i.e. coin flipping) algorithms (Kearns, Li, Pitt, & Valiant, 1987a). It can be shown that these latter variants are equivalent to the model described here, in that, modulo some minor technicalities, the concept classes that are PAC learnable in one model are also PAC learnable in the other (Haussler, Kearns, Littlestone, & Warmuth, 1991a). Finally, the model can easily be extended to non-Boolean attribute-based instance spaces (Haussler, 1988) and instance spaces for structural domains such
as the blocks world (Haussler, 1989). Instances can also be defined as strings over a finite alphabet so that the learnability of finite automata, context-free grammars, etc. can be investigated (Pitt, 1989).
OUTLINE OF RESULTS FOR PROPER PAC LEARNABILITY

HARDNESS RESULTS BASED ON THE ASSUMPTION RP ≠ NP

A number of fairly sharp results have been found for the notion of proper PAC learnability. The following summarizes some of these results. For precise definitions of the concept classes involved, the reader is referred to the literature cited. The negative results are based on the complexity theoretic assumption that RP ≠ NP (Pitt & Valiant, 1988).

1. Conjunctive concepts are properly PAC learnable (Valiant, 1984) (a sketch of the standard elimination algorithm appears after this list), but the class of concepts in the form of the disjunction of two conjunctions is not properly PAC learnable (Pitt & Valiant, 1988), and neither is the class of existential conjunctive concepts on structural instance spaces with two objects (Haussler, 1989).

2. Linear threshold concepts (perceptrons) are properly PAC learnable on both Boolean and real-valued instance spaces (Blumer, Ehrenfeucht, Haussler, & Warmuth, 1989), but the class of concepts in the form of the conjunction of two linear threshold concepts is not properly PAC learnable (Blum & Rivest, 1988). The same holds for disjunctions and linear thresholds of linear thresholds (i.e. multilayer perceptrons with two hidden units). In addition, if the weights are restricted to 1 and 0 (but the threshold is arbitrary), then linear threshold concepts on Boolean instance spaces are not properly PAC learnable (Pitt & Valiant, 1988).

3. The classes of k-DNF, k-CNF, and k-decision lists are properly PAC learnable for each fixed k (Valiant, 1985; Rivest, 1987), but it is unknown whether the classes of all DNF functions, all CNF functions, or all decision trees are properly PAC learnable.
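For result (1.), the learner behind the positive statement is the classical elimination idea; the sketch below is our rendering of that standard approach (following Valiant, 1984), with illustrative names.

```python
# Minimal sketch of the elimination-style learner for conjunctive concepts:
# start with the conjunction of all 2n literals and delete every literal
# falsified by some positive example.  The result is the most specific
# conjunction consistent with the positive examples, and it never errs on
# negative examples of the true target conjunction.

def learn_conjunction(positive_examples, n):
    """positive_examples: iterable of n-bit tuples labeled '+'.
    Returns the surviving literals as a set of pairs (index, sign),
    where sign True means x_i and sign False means not-x_i."""
    literals = {(i, True) for i in range(n)} | {(i, False) for i in range(n)}
    for x in positive_examples:
        # a positive example falsifies x_i when x[i] == 0, and not-x_i when x[i] == 1
        literals -= {(i, x[i] == 0) for i in range(n)}
    return literals

def evaluate_conjunction(literals, x):
    """True iff x satisfies every surviving literal."""
    return all((x[i] == 1) == sign for i, sign in literals)
```

This learner is consistent in the sense defined in the next section, and with the usual polynomial sample size it satisfies the PAC criterion for conjunctions.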
Most of the difficulties in proper PAC learning are due to the computational difficulty of finding a hypothesis in the particular form specified by the target class. For example, while Boolean threshold functions with 0-1 weights are not properly PAC learnable on Boolean instance spaces (unless RP = NP), they are PAC learnable by general Boolean threshold functions. Here we have a concrete case where enlarging the hypothesis space makes the computational problem of finding a good hypothesis easier. The class of all Boolean threshold functions is simply an easier space to search than the class of Boolean threshold functions with 0-1 weights. Similar extended hypothesis spaces can be found for the two classes mentioned in (1.) above that are not properly PAC learnable. Hence, it turns out that these classes are PAC learnable (Pitt & Valiant, 1988; Haussler, 1989). However, it is not known if any of the classes of DNF functions, CNF functions, decision trees, or multilayer perceptrons with two hidden units are PAC learnable.

METHODS FOR PROVING PAC LEARNABILITY; FORMALIZATION OF BIAS

All of the positive learnability results above are obtained by

1. showing that there is an efficient algorithm that finds a hypothesis in a particular hypothesis space that is consistent with a given sample of any concept in the target class, and

2. showing that the sample complexity of any such algorithm is polynomial.

By consistent we mean that the hypothesis agrees with every example in the training sample. An algorithm that always finds such a hypothesis (when one exists) is called a consistent algorithm. As the size of the hypothesis space increases, it may become easier to find a consistent hypothesis, but it will require more random training examples to insure that this hypothesis is accurate with high probability. In the limit, when any subset of the instance space is allowed as a hypothesis, it becomes trivial to find a consistent hypothesis, but a sample size proportional to the size of the entire instance space will be required to insure that it is accurate. Hence, there is a fundamental tradeoff between the computational complexity and the sample complexity of learning.

Restriction to particular hypothesis spaces of limited size is one form of bias that has been explored to facilitate learning (Mitchell, 1980). In addition to the cardinality of the hypothesis space, a parameter known as the Vapnik-Chervonenkis (VC) dimension of the hypothesis space has been shown to be useful in quantifying the bias inherent in a restricted hypothesis space (Haussler, 1988). The VC dimension of a hypothesis space $H$, denoted $VCdim(H)$, is defined to be the maximum number $d$ of instances that can be labeled as positive and negative examples in all $2^d$ possible ways, such that each labeling is consistent with some hypothesis in $H$ (Cover, 1965; Vapnik, 1982).
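The shattering condition in this definition can be checked directly, by brute force, for small finite classes. The sketch below is purely illustrative (it is exponential in the set size, and the names are ours).

```python
from itertools import combinations

# Brute-force illustration of shattering: a set S of instances is shattered
# by H if every one of the 2^|S| labelings of S is realized by some
# hypothesis in H.  Exponential in |S| and linear in |H|; tiny examples only.

def is_shattered(instances, hypotheses):
    """hypotheses: iterable of Boolean predicates; instances: tuple of points."""
    realized = {tuple(h(x) for x in instances) for h in hypotheses}
    return len(realized) == 2 ** len(instances)

def vc_dimension(domain, hypotheses):
    """Largest d such that some d-subset of the finite domain is shattered."""
    d = 0
    for size in range(1, len(domain) + 1):
        if any(is_shattered(s, hypotheses) for s in combinations(domain, size)):
            d = size
    return d

# Example: threshold functions x >= t on the domain {1, 2, 3} realize every
# labeling of any single point but cannot realize the labeling (1, 0) on an
# increasing pair, so their VC dimension is 1.
domain = (1, 2, 3)
hypotheses = [lambda x, t=t: x >= t for t in range(5)]
print(vc_dimension(domain, hypotheses))   # prints 1
```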
Let $H = \{H_n\}_{n \geq 1}$ be a hypothesis space and $C = \{C_n\}_{n \geq 1}$ be a target class, where $C_n \subseteq H_n$ for $n \geq 1$. Then it can be shown (Shawe-Taylor, Anthony, & Biggs, 1989) that any consistent algorithm for learning $C$ by $H$ will have sample complexity at most
$$O\!\left(\frac{1}{\epsilon}\left(VCdim(H_n)\ln\frac{1}{\epsilon} + \ln\frac{1}{\delta}\right)\right).$$
This improves on earlier bounds given in Blumer, Ehrenfeucht, Haussler, and Warmuth (1989), but may still be a considerable overestimate. In terms of the cardinality of $H_n$, denoted $|H_n|$, it can be shown (Vapnik, 1982; Natarajan, 1989; Blumer et al., 1987) that the sample complexity is at most
$$\frac{1}{\epsilon}\left(\ln|H_n| + \ln\frac{1}{\delta}\right).$$

For most hypothesis spaces on Boolean domains, the second bound gives the better bound. However, linear threshold functions are a notable exception, since the VC dimension of this class is linear in $n$, while the logarithm of its cardinality is quadratic in $n$ (Blumer et al., 1989). Most hypothesis spaces on real-valued attributes are infinite, so only the first bound is applicable.
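As a worked instance of the second bound, consider conjunctions over $n$ Boolean variables, for which $|H_n| \leq 3^n$ since each variable appears positively, negatively, or not at all; the particular numbers in the sketch below are illustrative.

```python
import math

# Worked instance of the cardinality bound m >= (1/eps)(ln|H_n| + ln(1/delta)).
# For conjunctions over n Boolean variables, |H_n| <= 3^n, so the bound reads
# m >= (1/eps)(n ln 3 + ln(1/delta)).

def sample_bound(log_size, eps, delta):
    """Examples sufficient for any consistent learner, by the cardinality bound."""
    return math.ceil((log_size + math.log(1.0 / delta)) / eps)

n, eps, delta = 20, 0.1, 0.05
m = sample_bound(n * math.log(3.0), eps, delta)
print(m)   # 250: 20*ln3 = 21.97, ln(1/0.05) = 3.00, (21.97 + 3.00)/0.1 = 249.7
```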
LEARNABILITY PRESERVING REDUCTIONS: A COMPLEXITY THEORETIC APPROACH

We now return to the computational difficulty of PAC learning. The theory of complexity classes and reducibilities (e.g., NP-completeness) has been particularly useful in providing evidence for the intractability of general computational problems. In Pitt and Warmuth (1990) a similar complexity theory for learnability is developed.¹

¹In Pitt and Warmuth (1990) this is done with respect to a notion of "(polynomial) predictability," which is equivalent to "learnability" as defined in this paper (Haussler, Kearns, Littlestone, & Warmuth, 1991a).

Following the approach taken in complexity theory, we would like to allow a learning algorithm to receive more training examples (and to spend more time) before achieving accurate learning, depending on the "complexity" of the hidden target concept to be learned. As mentioned in the section "Definition of PAC Learning," a reasonable measure of this complexity is the length (e.g. number of bits) of the representation of the concept in some given concept representation language. Thus the learnability will depend on what type of representations of the concepts we have chosen. For example, we may choose to represent regular languages by Deterministic Finite Automata (DFAs), Nondeterministic Finite Automata (NFAs), regular expressions, etc. We would like to ask the question "Are DFAs learnable?" rather than the question "Are regular languages learnable?". This motivates the following definition: A learning problem consists of a concept representation language and a mapping from representations in this language to concepts. The concept class of a learning problem is the class of all concepts that can be represented in the given representation language. Whereas in previous sections we have referred to the PAC learnability of concept classes, here we will be more precise and speak of the PAC learnability of learning problems, emphasizing the fact that the difficulty of learning may depend on the concept representation language used to represent the concept class.

As in complexity theory, when faced with a new learning problem we can first attempt to reduce it to a problem that is known to be learnable, thus showing that the new problem is learnable as well. If we are unsuccessful, instead we can try to reduce a problem that is known or suspected to be unlearnable to the new problem, thus gathering evidence that the new problem is not learnable. The type of reduction from one learning problem to another introduced in Pitt and Warmuth (1990) is called a polynomial-time learning-preserving reduction. This type of reduction generalizes those used previously in Kearns, Li, Pitt, and Valiant (1987b), Littlestone (1988), and Haussler (1989). A polynomial-time learning-preserving reduction consists of two mappings: a polynomial-time computable function $f$ that maps unlabeled examples of the first learning problem to unlabeled examples of the second learning problem, and a function $g$ that maps representations of concepts used in the first problem to representations of concepts used in the second problem.²
²An interesting feature of the definition of reduction is that the mapping $g$ need not be computable. It is only required that $g$ be length preserving within a polynomial.
For example, one such reduction shows that learning Boolean functions represented as k-CNFs reduces to learning Boolean functions represented as conjunctive concepts. Here k-CNFs are Boolean formulas in conjunctive normal form with at most k literals per clause, where k is a constant, and conjunctive concepts are Boolean conjunctions of literals (also called monomials). If the number of variables in the k-CNF problem is $n$ then the learning problem for conjunctive concepts will have one new variable for each of the $O(n^k)$ possible clauses of at most k literals.
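The instance mapping $f$ of this reduction is simple enough to sketch; the following illustrative code (ours, with assumed names) evaluates every possible clause of at most k literals on an instance, producing the expanded example on which a learner for conjunctions can then be run.

```python
from itertools import combinations

# Sketch of the instance mapping f in the k-CNF -> conjunctions reduction:
# every possible clause of at most k literals over x_1..x_n becomes one new
# Boolean attribute whose value is the truth value of that clause on the
# instance.  A k-CNF over the original variables is then exactly a
# conjunction over the new attributes.  The expanded dimension is O(n^k).

def all_clauses(n, k):
    """All clauses: nonempty sets of at most k literals (var index, sign)."""
    literals = [(i, s) for i in range(n) for s in (True, False)]
    for size in range(1, k + 1):
        for clause in combinations(literals, size):
            yield clause

def map_instance(x, n, k):
    """f(x): evaluate every clause on x, producing the expanded example."""
    def holds(clause):
        # a clause (a disjunction) holds iff at least one literal is satisfied
        return any((x[i] == 1) == sign for i, sign in clause)
    return tuple(1 if holds(c) else 0 for c in all_clauses(n, k))
```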
Besides the alternating DFA learning problem, a number of other problems that are learning-complete for the complexity class P have been found:

• Convex Vertex Represented Polytope (Long & Warmuth, 1991): the concept is an unknown convex polytope represented by the list of its vertices; positive examples are points in the polytope and negative examples are points outside of the polytope.

• Horn Clause Consistency (Pitt & Warmuth, 1990): the concept is an unknown conjunction of Horn clauses; positive examples are sets of facts that are consistent with the conjunction.

• Augmented CFG Emptiness (Pitt & Warmuth, 1990): the concept is an unknown context free grammar; positive examples are sets of productions that, when added to the grammar, yield a grammar generating the empty language.

It is unlikely that problems that are learning-complete for P are learnable. As a matter of fact, as discussed in the next section, there is convincing evidence that the opposite is the case.

HARDNESS FOR PAC LEARNABILITY BASED ON CRYPTOGRAPHIC ASSUMPTIONS

It is a much stronger result to show that a learning problem is not PAC learnable than it is to show that it is not properly PAC learnable, since the former result implies that the problem is not PAC learnable by any reasonable hypothesis space. Indeed, it follows from the work of Goldreich, Goldwasser, and Micali (1986) that problems that are learning-complete for P are not PAC learnable (even in an extremely weak sense) assuming the existence of any cryptographically secure pseudorandom bit generator, which is equivalent to the existence of a certain type of one-way function (Levin, 1987). While such an assumption is stronger than the assumption that RP ≠ NP, there is still convincing evidence for its validity.

Simpler problems can also be shown not to be PAC learnable based on stronger cryptographic assumptions. In particular, Kearns and Valiant (1989b) show that a polynomial-time learning algorithm for DFAs can be used to invert certain cryptographic functions. This is done by first showing that learning arbitrary Boolean formulas is as hard as inverting the given cryptographic functions. Then, since it can be shown that learning Boolean formulas reduces
to learning DFAs, it follows that DFAs are not polynomially learnable based on the same cryptographic assumptions.

Such hardness results are disheartening. However, note that all of these hardness results are worst case with respect to the distribution and target concept. Thus when faced with learning a problem that is learning-complete for a reasonably large complexity class, the practitioner might look for assumptions that can be made on the distribution of the examples that will make the problem easier on average in some suitable sense. Further, one might assume that the target concept is drawn at random according to some reasonable distribution rather than assuming that the target concept is worst case. We discuss some ideas along these lines in the following section, but only from the perspective of sample complexity. To date there has been very little general work on the average case computational complexity of machine learning.
CRITICISMS OF THE PAC MODEL

The two criticisms most often leveled at the PAC model by AI researchers interested in empirical machine learning are

1. the worst-case emphasis in the model makes it unusable in practice (Buntine, 1990; Sarrett & Pazzani, 1989), and

2. the notions of target concepts and noise-free training data are too restrictive in practice (Amsterdam, 1988; Bergadano & Saitta, 1989).

We take these in turn. There are two aspects of the worst case nature of the PAC model that are at issue. One is the use of the worst case model to measure the computational complexity of the learning algorithm; the other is the definition of the sample complexity as the worst case number of random examples needed over all target concepts in the target class and all distributions on the instance space. Here we address only the latter issue.

As pointed out in the section "Definition of PAC Learning" above, the worst case definition of sample complexity means that even if we could calculate the sample complexity of a given algorithm exactly, we would still expect it to overestimate the typical error of the hypothesis produced as a function of the training set size on any particular target concept and particular distribution
on the instance space. This is compounded by the fact that we usually cannot calculate the sample complexity of a given algorithm exactly, even when it is a relatively simple consistent algorithm. Instead we are forced to fall back on the upper bounds on the sample complexity that hold for any consistent algorithm, given in the previous section, which themselves may contain overblown constants. The upshot of this is that the basic PAC theory is not good for predicting learning curves. Some variants of the PAC model come closer, however.

One simple variant is to make it distribution specific, i.e. define and analyze the sample complexity of a learning algorithm for a specific distribution on the instance space, e.g. the uniform distribution on a Boolean space (Benedek & Itai, 1988; Sarrett & Pazzani, 1989). There are two potential problems with this. The first is finding distributions that are both analyzable and indicative of the distributions that arise in practice. The second is that the bounds obtained may be very sensitive to the particular distribution analyzed, and not be very reliable if the actual distribution is slightly different.

A more refined, Bayesian extension of the PAC model is explored in Buntine (1990). Using the Bayesian approach involves assuming a prior distribution over possible target concepts as well as training instances. Given these distributions, the average error of the hypothesis as a function of training sample size, and even as a function of the particular training sample, can be defined. Also, $1 - \delta$ confidence intervals like those in the PAC model can be defined as well. Experiments with this model on small learning problems are encouraging, but further work needs to be done on sensitivity analysis, and on simplifying the calculations so that larger problems can be analyzed. This work, and the other distribution specific learning work, provides an increasingly important counterpart to PAC theory.

Another variant of the PAC model designed to address these issues is the "probability of mistake" model explored in Haussler, Littlestone, and Warmuth (1990), Haussler, Kearns, and Schapire (1991b), and Opper and Haussler (1991). This model is designed specifically to help understand some of the issues in incremental learning. Instead of looking at sample complexity as defined above, the measure of performance here is the probability that the learning algorithm incorrectly guesses the label of the $t$-th training example in a sequence of $t$ random examples. Of course, the algorithm is allowed to update its hypothesis after each new training example is processed, so as $t$ grows, we expect the probability of a mistake on example $t$ to decrease.
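The quantity this model studies can be estimated empirically by averaging, over many independent runs, the indicator of a mistake on the $t$-th trial. A minimal sketch follows, assuming a learner interface with predict and update methods; the names and the interface are ours, for illustration.

```python
import random

# Sketch of how the probability of a mistake on example t is estimated
# empirically: run an incremental learner many times on fresh random
# sequences and average, at each t, the indicator of a mistake on the t-th
# example.  'make_learner' builds a fresh learner with predict(x) and
# update(x, label); 'draw_example' samples one labeled example.

def mistake_curve(make_learner, draw_example, horizon, runs=200,
                  rng=None):
    rng = rng or random.Random(0)
    totals = [0] * horizon
    for _ in range(runs):
        learner = make_learner()
        for t in range(horizon):
            x, label = draw_example(rng)
            if learner.predict(x) != label:   # mistake on example t+1
                totals[t] += 1
            learner.update(x, label)          # the hypothesis may change each trial
    return [m / runs for m in totals]         # estimated Pr[mistake on example t]
```

As the text notes, the resulting curve is exactly the averaged empirical learning curve of error versus sample size.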
For a fixed target concept and a fixed distribution on the instance space, it is easy to see that the probability of a mistake on example $t$ is the same as the average error of the hypothesis produced by the algorithm from $t - 1$ random training examples. Hence, the probability of mistake on example $t$ is exactly what is plotted on empirical learning curves that plot error versus sample size and average several runs of the learning algorithm for each sample size.

In Haussler et al. (1990) the focus is on the worst case probability of mistake on the $t$-th example, over all possible target concepts and distributions on the training examples. In Haussler et al. (1991b) and Opper and Haussler (1991) the probability of mistake on the $t$-th example is examined when the target concept is selected at random according to a prior distribution on the target class and the examples are drawn at random from a certain fixed distribution. This is a Bayesian approach. The former we will call the worst case probability of mistake and the latter we will call the average case probability of mistake.

The results can be summarized as follows. Let $C = \{C_n\}_{n \geq 1}$ be a concept class and $d_n = VCdim(C_n)$ for all $n \geq 1$. First, for any concept class $C$ and any consistent algorithm for $C$ using hypothesis space $C$, the worst case probability of mistake on example $t$ is at most $O((d_n/t)\ln(t/d_n))$, for $t \geq d_n$. Furthermore, there are particular consistent algorithms and concept classes where the worst case probability of mistake on example $t$ is at least $\Omega((d_n/t)\ln(t/d_n))$; hence this is the best that can be said in general of arbitrary consistent algorithms. Second, for any concept class $C$ there exists a (universal) learning algorithm for $C$ (not necessarily consistent or computationally efficient) with worst case probability of mistake on example $t$ at most $d_n/t$. On the other hand, any learning algorithm for $C$ must have worst case probability of mistake on example $t$ at least $\Omega(d_n/t)$, so this universal algorithm is essentially optimal. Third, if we focus on average case behavior, then there is a different universal learning algorithm, called the Bayes optimal learning algorithm (or the weighted majority algorithm (Littlestone & Warmuth, 1991)), and there is a closely related, more efficient algorithm called the Gibbs (or randomized weighted majority) algorithm, that have average case probability of mistake on example $t$ at most $d_n/t$ and $2d_n/t$, respectively. Furthermore, there are particular concept classes $C$, particular prior probability distributions on the concepts in these classes, and particular distributions on the instance spaces of these classes, such that the average case probability of mistake on example $t$ is at least $\Omega(d_n/t)$ for any learning algorithm (with constant $\approx 1/2$). This indicates
that the above general bounds are tight to within a small constant. Even better forms of these upper and lower bounds can be given for specific distributions on the examples, specific target concepts, and even specific sequences of examples.

These results show two interesting things. First, certain learning algorithms perform better than arbitrary consistent learning algorithms in the worst case and average case; therefore, even in this restricted setting there is definitely more to learning than just finding any consistent hypothesis in an appropriately biased hypothesis space. Second, the worst case is not always much worse than the average case.

Some recent experiments in learning perceptrons and multilayer perceptrons have shown that in many cases $d_n/t$ is a rather good predictor of actual (i.e. average case) learning curves for backpropagation on synthetic random data (Baum, 1990; Tesauro & Cohn, 1991). However, it is still often an overestimate on natural data (Rumelhart, 1990), and in other domains such as learning conjunctive concepts on a uniform distribution (Sarrett & Pazzani, 1989). Here the distribution (and algorithm) specific aspects of the learning situation must also be taken into account. Thus, in general we concur that extensions of the PAC model are required to explain learning curves that occur in practice. However, no amount of experimentation or distribution specific theory can replace the security provided by a distribution independent bound.

The second criticism of the PAC model is that the assumptions of well-defined target concepts and noise-free training data are unrealistic in practice. This is certainly true. However, it should be pointed out that the computational hardness results for learning described above, having been established for the simple noise-free case, must also hold for the more general case. The PAC model has the advantage of allowing us to state these negative results simply and in their strongest form. Nevertheless, the positive learnability results have to be strengthened before they can be applicable in practice, and some extensions of the PAC model are needed for this purpose. Many have been proposed (see e.g. Angluin and Laird (1988) and Kearns and Li (1988)).

Since the definitions of target concepts, random examples and hypothesis error in the PAC model are just simplified versions of standard definitions from statistical pattern recognition and decision theory, one reasonable thing to do is to go back to these well-established fields and use the more general definitions that they have developed. First, instead of using the probability of misclassification as the only measure of error, a general loss function can be defined that for every pair consisting of a guessed value and an actual value of the classification, gives a non-negative real number indicating a "cost" charged
for that particular guess given that particular actual value. Then the error of a hypothesis can be replaced by the average loss of the hypothesis on a random example. If the loss is 1 if the guess is wrong and 0 if it is right (discrete loss), we get the PAC notion of error as a special case. However, using a more general loss function we can also choose to make false positives more expensive than false negatives or vice versa, which can be useful. The use of a loss function also allows us to handle cases where there are more than two possible values of the classification. This includes the problem of learning real-valued functions, where we might choose to use $|guess - actual|$ or $(guess - actual)^2$ as loss functions.

Second, instead of assuming that the examples are generated by selecting a target concept and then generating random instances with labels agreeing with this target concept, we might assume that for each random instance, there is also some randomness in its label. Thus, each instance will have a particular probability of being drawn and, given that instance, each possible classification value will have a particular probability of occurring. This whole random process can be described as making independent random draws from a single joint probability distribution on the set of all possible labeled instances. Target concepts with attribute noise, classification noise, or both kinds of noise can be modeled in this way. The target concept, the noise, and the distribution on the instance space are all bundled into one joint probability measure on labeled examples. The goal of learning is then to find a hypothesis that minimizes the average loss when the examples are drawn at random according to this joint distribution.

The PAC model, disregarding computational complexity considerations, can be viewed as a special case of this set-up using the discrete loss function, but with the added twist that learning performance is measured with respect to the worst case over all joint distributions in which the entire probability measure is concentrated on a set of examples that are consistent with a single target concept of a particular type. Hence, in the PAC case it is possible to get arbitrarily close to zero loss by finding closer and closer approximations to this underlying target concept. This is not possible in the general case, but one can still ask how close the hypothesis produced by the learning algorithm comes to the performance of the best possible hypothesis in the hypothesis space. For an unrestricted hypothesis space, the latter is known as the Bayes optimal classifier (Duda & Hart, 1973).
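A minimal sketch of these generalizations, with illustrative loss functions and an empirical average-loss routine (all names here are ours):

```python
# Sketch of the loss-function generalization of error: a loss maps a
# (guess, actual) pair to a nonnegative cost, and the quality of a
# hypothesis is its average loss on examples drawn from the joint
# distribution on labeled instances.

def discrete_loss(guess, actual):
    return 0.0 if guess == actual else 1.0       # recovers the PAC error

def quadratic_loss(guess, actual):
    return (guess - actual) ** 2                 # for real-valued prediction

def asymmetric_loss(guess, actual, false_pos_cost=5.0, false_neg_cost=1.0):
    # false positives can be charged more than false negatives, or vice versa
    if guess == actual:
        return 0.0
    return false_pos_cost if guess == 1 else false_neg_cost

def average_loss(h, labeled_sample, loss):
    """Empirical average loss of hypothesis h on a sample of (x, y) pairs."""
    return sum(loss(h(x), y) for x, y in labeled_sample) / len(labeled_sample)
```

With the discrete loss and a joint distribution concentrated on examples consistent with one target concept, average_loss reduces to the empirical version of the PAC error defined earlier.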
Some recent PAC research has used this more general framework. By using the quadratic loss function mentioned above in place of the discrete loss, Kearns and Schapire investigate the problem of efficiently learning a real-valued regression function that gives the probability of a "+" classification for each instance (Kearns & Schapire, 1990). In Haussler (1991) it is shown how the VC dimension and related tools, originally developed by Vapnik, Chervonenkis, and others for this type of analysis, can be applied to the study of learning in neural networks. Here no restrictions whatsoever are placed on the joint probability distribution governing the generation of examples, i.e. the notion of a target concept or target class is eliminated entirely. Using this method, specific sample complexity bounds are obtained for learning with feedforward neural networks under various loss functions.

OTHER THEORETICAL LEARNING MODELS

A number of other theoretical approaches to machine learning are flourishing in recent computational learning theory work. One of these is the total mistake bound model (Littlestone, 1988). Here an arbitrary sequence of examples of an unknown target concept is fed to the learning algorithm, and after seeing each instance the algorithm must predict the label of that instance. This is an incremental learning model like the probability of mistake model described above; however, here it is not assumed that the instances are drawn at random, and the measure of learning performance is the total number of mistakes in prediction in the worst case over all sequences of training examples (arbitrarily long) of all target concepts in the target class. We will call this latter quantity the (worst case) mistake bound of the learning algorithm. Of interest is the case when there exists a polynomial time learning algorithm for a concept class $C = \{C_n\}_{n \geq 1}$ with a worst case mistake bound for target concepts in $C_n$ that is polynomial in $n$. As in the PAC model, mistake bounds can also be allowed to depend on the syntactic complexity of the target concept.

The perceptron algorithm for learning linear threshold functions in the Boolean domain is a good example of a learning algorithm with a worst case mistake bound. This bound comes directly from the bound on the number of updates given in the perceptron convergence theorem (see e.g. Duda and Hart (1973)). The worst case mistake bound of the perceptron algorithm is polynomial (and at least linear) in the number $n$ of Boolean attributes when the target concepts are conjunctions, disjunctions, or any concept expressible with 0-1 weights and an arbitrary threshold (Hampson & Volper, 1986).
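For concreteness, the following is our sketch of the classical perceptron rule in this online setting, counting mistakes; the convergence theorem bounds this count whenever the sequence is linearly separable with a margin.

```python
# Sketch of the classical perceptron rule in the total-mistake-bound setting:
# on each trial it predicts with a linear threshold, and on every mistake it
# adds or subtracts the instance from the weight vector.  The perceptron
# convergence theorem bounds the total number of such updates (hence
# mistakes) for linearly separable sequences.

def perceptron_online(examples, n):
    """examples: sequence of (x, label) with x an n-tuple in {0,1}^n and
    label in {0, 1}.  Returns (weights, bias, total mistake count)."""
    w, b, mistakes = [0.0] * n, 0.0, 0
    for x, label in examples:
        guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        if guess != label:
            mistakes += 1
            sign = 1.0 if label == 1 else -1.0   # move toward the correct side
            w = [wi + sign * xi for wi, xi in zip(w, x)]
            b += sign
    return w, b, mistakes
```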
A variant of the perceptron learning algorithm with multiplicative instead of additive weight updates was developed that has a significantly improved mistake bound for target concepts with small syntactic complexity (Littlestone, 1988). The performance of this algorithm has also been extensively analyzed in the case when some of the examples may be mislabeled (Littlestone, 1989b).

It can be shown that if there is a polynomial time learning algorithm for a target class $C$ with a polynomial worst case mistake bound, then $C$ is PAC learnable. General methods for converting a learning algorithm with a good worst case mistake bound into a PAC learning algorithm with a low sample complexity are given in Littlestone (1989a). Hence, the total mistake bound model is actually not unrelated to the PAC model.

Another fascinating transformation of learning algorithms is given by the weighted majority method (Littlestone & Warmuth, 1991). This is a method of combining several incremental learning algorithms into a single incremental learning algorithm that is more powerful and more robust than any of the component algorithms. This method extends the Bayesian-style weighted majority algorithm mentioned in the previous section. The idea is simple. All the component learning algorithms are run in parallel on the same sequence of training examples. For each example, each algorithm makes a prediction and these predictions are combined by a weighted voting scheme to determine the overall prediction of the "master" algorithm. After receiving feedback on its prediction, the master algorithm adjusts the voting weights for each of the component algorithms, increasing the weights of those that made the correct prediction, and decreasing the weights of those that guessed wrong, in each case by a multiplicative factor.

It can be shown that this method of combining learning algorithms is very robust with regard to mislabeled examples. More importantly, the method produces a master algorithm with a worst case mistake bound that approaches the worst case mistake bound of the best component learning algorithm (Littlestone & Warmuth, 1991). Thus the performance of the master algorithm is almost as good as that of the best component algorithm. This is particularly useful when a good learning algorithm is known but a parameter of the algorithm has to be tuned for the particular application (Helmbold, Sloan, & Warmuth, 1990). In this case the weighted majority method is applied to a pool of component algorithms, each of which is a version of the original learning algorithm with a different setting of the parameter. The master algorithm's performance approaches the performance of the component algorithm with the best setting of the parameter.
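The following is a minimal sketch of the master algorithm just described, with an assumed component interface (predict and update are our names) and a multiplicative demotion factor beta; it is an illustration of the scheme, not the authors' implementation.

```python
# Sketch of the weighted majority master algorithm: the component
# predictors run in parallel, the master predicts by weighted vote, and
# after feedback the weights of components that guessed wrong are cut by a
# multiplicative factor beta in (0, 1).

def weighted_majority(components, trials, beta=0.5):
    """components: list of objects with predict(x) -> 0/1 and update(x, y);
    trials: iterable of (x, label) pairs.  Returns the master's mistake count."""
    weights = [1.0] * len(components)
    mistakes = 0
    for x, label in trials:
        votes = [c.predict(x) for c in components]
        yes = sum(w for w, v in zip(weights, votes) if v == 1)
        no = sum(w for w, v in zip(weights, votes) if v == 0)
        master_guess = 1 if yes >= no else 0
        if master_guess != label:
            mistakes += 1
        # demote every component that predicted incorrectly on this trial
        weights = [w * beta if v != label else w
                   for w, v in zip(weights, votes)]
        for c in components:
            c.update(x, label)
    return mistakes
```

Running this over a pool of copies of one algorithm, each with a different parameter setting, is exactly the tuning use mentioned above: the master's mistake count tracks that of the best-tuned copy up to a constant factor.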
The weighted majority method can also be adapted to the case when the predictions of the component algorithms are continuous (Littlestone & Warmuth, 1991). This leads to a method for designing a master algorithm whose worst case loss approaches the worst case loss of the best linear combination of the component learning algorithms (Littlestone, Long, & Warmuth, 1991). Here, instead of the total number of mistakes, the loss is the total squared prediction error. Finally, a version of the weighted majority method can also be used to obtain good mistake bounds in the case when the best component algorithm changes in various sections of the trial sequence. More general learning problems for "drifting" target concepts have been investigated as well (Helmbold & Long, 1991). This represents an interesting new direction in learning research.

Both the PAC and total mistake bound models can be extended significantly by allowing learning algorithms to perform experiments or make queries to a teacher during learning (Angluin, 1988). The simplest type of query is a membership query, in which the learning algorithm proposes an instance in the instance space and then is told whether or not this instance is a member of the target concept. The ability to make membership queries can greatly enhance the ability of an algorithm to efficiently learn the target concept in both the mistake bound and PAC models. It has been shown that there are polynomial time algorithms that make polynomially many membership queries and have polynomial worst case mistake bounds for learning

1. monotone DNF concepts (Disjunctive Normal Form with no negated variables) (Angluin, 1988),

2. μ-formulae (Boolean formulae in which each variable appears at most once) (Angluin, Hellerstein, & Karpinski, 1990b),

3. deterministic finite automata (Angluin, 1987), and

4. Horn sentences (propositional PROLOG programs) (Angluin, Frazier, & Pitt, 1990a).

In addition, there is a general method for converting an efficient learning algorithm that makes membership queries and has a polynomial worst case mistake bound into a PAC learning algorithm, as long as the PAC algorithm is also allowed to make membership queries. Hence, all of the concept classes
In addition, there is a general method for converting an efficient learning algorithm that makes membership queries and has a polynomial worst case mistake bound into a PAC learning algorithm, as long as the PAC algorithm is also allowed to make membership queries. Hence, all of the concept classes listed above are PAC learnable when membership queries are allowed. This contrasts with the evidence from cryptographic assumptions that classes (2) and (3) above are not PAC learnable from random examples alone (Kearns & Valiant, 1989a). Surprisingly, it can be shown, based on cryptographic assumptions, that slightly richer classes than those listed above are not PAC learnable even with membership queries (Angluin & Kharitonov, 1991). These include:

1. non-deterministic finite automata and
2. intersections of deterministic finite automata.

This is shown by generalizing the notion of polynomial-time learning-preserving reduction (Pitt & Warmuth, 1990) (described in a previous section) to the case when membership queries are allowed, and then reducing known cryptographically secure problems to the above learning problems.

CONCLUSION

In this brief survey we were able to cover only a small fraction of the results that have been obtained recently in computational learning theory. For a glimpse at some of these further results we refer the reader to Haussler and Pitt (1988), Rivest, Haussler, and Warmuth (1989), Fulk and Case (1990), and Valiant and Warmuth (1991). However, we hope that we have at least convinced the reader that the insights provided by this line of investigation, such as those about the difficulty of searching hypothesis spaces, the notion of bias and its effect on required training size, the effectiveness of majority voting methods, and the usefulness of actively making queries during learning, have made this effort worthwhile.

REFERENCES

Amsterdam, J. (1988). The Valiant learning model: Extensions and assessment. Master's thesis, MIT Department of Electrical Engineering and Computer Science.

Angluin, D. (1987). Learning regular sets from queries and counterexamples. Information and Computation, 75:87-106.

Angluin, D. (1988). Queries and concept learning. Machine Learning, 2:319-342.

Angluin, D., Frazier, M., and Pitt, L. (1990a). Learning conjunctions of Horn clauses. In 31st Annual IEEE Symposium on Foundations of Computer Science, pages 186-192.
Angluin, D., Hellerstein, L., and Karpinski, M. (1990b). Learning read-once formulas with queries. J. ACM, to appear.

Angluin, D. and Kharitonov, M. (1991). Why won't membership queries help? In Proceedings of the 23rd Annual ACM Symposium on Theory of Computing, pages 444-454, New Orleans. ACM.

Angluin, D. and Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4):343-370.

Baum, E. (1990). When are k-nearest neighbor and back propagation accurate for feasible sized sets of examples? In Snowbird Conference on Neural Networks for Computing. Unpublished manuscript.

Benedek, G. M. and Itai, A. (1988). Learnability by fixed distributions. In Proc. 1988 Workshop on Comp. Learning Theory, pages 80-90, San Mateo, CA. Morgan Kaufmann.

Bergadano, F. and Saitta, L. (1989). On the error probability of Boolean concept descriptions. In Proceedings of the 1989 European Working Session on Learning, pages 25-35.

Blum, A. and Rivest, R. L. (1988). Training a three-neuron neural net is NP-complete. In Proceedings of the 1988 Workshop on Computational Learning Theory, pages 9-18, San Mateo, CA. Morgan Kaufmann.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24:377-380.

Blumer, A., Ehrenfeucht, A., Haussler, D., and Warmuth, M. K. (1989). Learnability and the Vapnik-Chervonenkis dimension. Journal of the Association for Computing Machinery, 36(4):929-965.

Buntine, W. (1990). A Theory of Learning Classification Rules. PhD thesis, University of Technology, Sydney. Forthcoming.

Cover, T. M. (1965). Geometrical and statistical properties of systems of linear inequalities with applications in pattern recognition. IEEE Trans. on Electronic Computers, EC-14:326-334.

Duda, R. O. and Hart, P. E. (1973). Pattern Classification and Scene Analysis. Wiley.

Fulk, M. and Case, J., editors (1990). Proceedings of the 1990 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Goldreich, O., Goldwasser, S., and Micali, S. (1986). How to construct random functions. J. ACM, 33(4):792-807.

Hampson, S. E. and Volper, D. J. (1986). Linear function neurons: Structure and training. Biol. Cybern., 53:203-217.

Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36:177-221.

Haussler, D. (1989). Learning conjunctive concepts in structural domains. Machine Learning, 4:7-40.

Haussler, D. (1990). Probably approximately correct learning. In Proc. of the 8th National Conference on Artificial Intelligence, pages 1101-1108. Morgan Kaufmann.

Haussler, D. (1991). Decision theoretic generalizations of the PAC model for neural net and other learning applications. Information and Computation, to appear.
Haussler, D., Kearns, M., Littlestone, N., and Warmuth, M. K. (1991a). Equivalence of models for polynomial learnability. Information and Computation, to appear.

Haussler, D., Kearns, M., and Schapire, R. (1991b). Bounds on the sample complexity of Bayesian learning using information theory and the VC dimension. In Proceedings of the Fourth Workshop on Computational Learning Theory.

Haussler, D., Littlestone, N., and Warmuth, M. (1990). Predicting {0, 1}-functions on randomly drawn points. Technical Report UCSC-CRL-90-54, University of California Santa Cruz, Computer Research Laboratory.

Haussler, D. and Pitt, L., editors (1988). Proceedings of the 1988 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Helmbold, D. and Long, P. (1991). Tracking drifting concepts using random examples. In Proceedings of the 1991 Workshop on Computational Learning Theory, pages 13-23, San Mateo, CA. Morgan Kaufmann.

Helmbold, D., Sloan, R., and Warmuth, M. K. (1990). Learning nested differences of intersection closed concept classes. Machine Learning, 5:165-196.

Kearns, M. and Li, M. (1988). Learning in the presence of malicious errors. In 20th ACM Symposium on Theory of Computing, pages 267-279, Chicago.

Kearns, M., Li, M., Pitt, L., and Valiant, L. (1987a). On the learnability of Boolean formulae. In 19th ACM Symposium on Theory of Computing, pages 285-295, New York.

Kearns, M., Li, M., Pitt, L., and Valiant, L. G. (1987b). On the learnability of Boolean formulae. In Proceedings of the 19th Annual ACM Symposium on Theory of Computing, New York. ACM.

Kearns, M. and Schapire, R. (1990). Efficient distribution-free learning of probabilistic concepts. In 31st Annual IEEE Symposium on Foundations of Computer Science, pages 382-391.

Kearns, M. and Valiant, L. (1989a). Cryptographic limitations on learning Boolean formulae and finite automata. In 21st ACM Symposium on Theory of Computing, pages 433-444, Seattle, WA.

Kearns, M. and Valiant, L. G. (1989b). Cryptographic limitations on learning Boolean formulae and finite automata. In Proceedings of the 21st Annual ACM Symposium on Theory of Computing, pages 433-444, New York. ACM.

Levin, L. A. (1987). One-way functions and pseudorandom generators. Combinatorica, 7(4):357-363.

Littlestone, N. (1988). Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285-318.

Littlestone, N. (1989a). From on-line to batch learning. In Proceedings of the 2nd Workshop on Computational Learning Theory, pages 269-284. Morgan Kaufmann.

Littlestone, N. (1989b). Mistake Bounds and Logarithmic Linear-threshold Learning Algorithms. PhD thesis, University of California Santa Cruz.

Littlestone, N., Long, P., and Warmuth, M. (1991). On-line learning of linear functions. Technical Report UCSC-CRL-91-29, UC Santa Cruz. For an extended abstract see Proceedings of the Twenty-Third Annual ACM Symposium on Theory of Computing, New Orleans, Louisiana, May 1991, pages 465-475.
Littlestone, N. and Warmuth, M. (1991). The weighted majority algorithm. Technical Report UCSC-CRL-91-28, UC Santa Cruz. A preliminary version appeared in the proceedings of the 30th Annual IEEE Symposium on Foundations of Computer Science, October 1989, pages 256-261.

Long, P. and Warmuth, M. K. (1991). Composite geometric concepts and polynomial learnability. Information and Computation, to appear.

Mitchell, T. (1980). The need for biases in learning generalizations. Technical Report CBM-TR-117, Rutgers University, New Brunswick, NJ.

Natarajan, B. K. (1989). On learning sets and functions. Machine Learning, 4(1).

Opper, M. and Haussler, D. (1991). Calculation of the learning curve of Bayes optimal classification algorithm for learning a perceptron with noise. In Computational Learning Theory: Proceedings of the Fourth Annual Workshop. Morgan Kaufmann.

Pitt, L. (1989). Inductive inference, DFAs, and computational complexity. Technical Report UIUCDCS-R-89-1530, U. Illinois at Urbana-Champaign.

Pitt, L. and Valiant, L. (1988). Computational limitations on learning from examples. J. ACM, 35(4):965-984.

Pitt, L. and Warmuth, M. K. (1990). Prediction preserving reducibility. J. Comp. Sys. Sci., 41(3):430-467. Special issue for the Third Annual Conference on Structure in Complexity Theory (Washington, D.C., June 1988).

Rivest, R., Haussler, D., and Warmuth, M., editors (1989). Proceedings of the 1989 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2:229-246.

Rumelhart, D. (1990). Personal communication.

Sarrett, W. and Pazzani, M. (1989). Average case analysis of empirical and explanation-based learning algorithms. Technical Report 89-35, UC Irvine. To appear in Machine Learning.

Shawe-Taylor, J., Anthony, M., and Biggs, N. (1989). Bounding sample size with the Vapnik-Chervonenkis dimension. Technical Report CSD-TR-618, University of London, Surrey, England.

Tesauro, G. and Cohn, D. (1991). Can neural networks do better than the Vapnik-Chervonenkis bounds? In Lippmann, R., Moody, J., and Touretzky, D., editors, Advances in Neural Information Processing, Vol. 3, pages 911-917. Morgan Kaufmann.

Valiant, L. G. (1984). A theory of the learnable. Communications of the ACM, 27(11):1134-1142.

Valiant, L. G. (1985). Learning disjunctions of conjunctions. In Proc. 9th IJCAI, volume 1, pages 560-566, Los Angeles.

Valiant, L. G. and Warmuth, M., editors (1991). Proceedings of the 1991 Workshop on Computational Learning Theory. Morgan Kaufmann, San Mateo, CA.

Vapnik, V. N. (1982). Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York.
Chapter 10

On the Automated Discovery of Scientific Theories*

Daniel Osherson
IDIAP
Scott Weinstein
University of Pennsylvania
Abstract

This paper summarizes recent research results on applications of computational learning theory to problems involving rich systems of knowledge representation, in particular, first-order logic and extensions thereof.

INTRODUCTION

Science is such a useful activity that people have become interested in automating it, at least in part. A great deal of fruitful effort has been devoted to this task (e.g., [9, 10, 30]), but the limitations of existing systems lead to reflection about the very character of empirical inquiry. What is a scientific theory, after all, and what makes one theory better or worse than another? How should inquiry proceed in order to maximize our chances of believing a true theory, and minimize the chance of believing a false one? And how much success can be expected of the scientific enterprise, especially when carried out with limited access to data? Consideration of such matters leads to a set of interlocking issues at the heart of contemporary epistemology, including questions about probability, simplicity, approximate truth, hypothetical entities, and rational belief.

* Research support was provided by the Office of Naval Research under contracts Nos. N00014-87-K-0401 and N00014-89-J-1725. Correspondence to D. Osherson, IDIAP, C.P. 609, CH-1920 Martigny, Switzerland, e-mail: [email protected]
Already resistant to clarification and solution, these questions become even more difficult when scientists are conceived as resource-limited, computational agents. Faced with such conceptual complexity, the natural strategy is to experiment with alternative, simplifying assumptions about scientific practice and attempt to derive general theorems about empirical inquiry within the simpler contexts so defined. It may then be hoped that comparison and analysis of the results obtained will lead to insights that bear on the practical problem of building artificial systems of empirical inquiry in science, industry, medicine, etc. While there is no guarantee that such a research strategy will succeed, we note that it is analogous to past endeavours whose impact on technology has been substantial (for example, the analysis of alternative models of computation). Thus is born the discipline of Computational Learning Theory,¹ whose goal is to define and analyze increasingly realistic models of empirical inquiry. Each such model is adapted to a particular discovery problem, by which we mean a scientific or engineering situation in which (a) it is desirable to possess an accurate theory of the processes giving rise to available data, but (b) such a theory cannot be deduced in the strict logical sense from this data. The solution of a discovery problem thus requires some kind of inductive reasoning, and the ability to solve a range of discovery problems requires an inductive method of wide applicability. Contemporary research within the foregoing framework may be divided into two categories according to the expressiveness of the theories emitted by envisioned systems of inductive inference. The first category deals with theories based on knowledge representations like recursive functions, formal languages, Boolean functions, etc. The second deals with more expressive representational systems like first-order languages and extensions thereof. Within each of these categories two research areas have emerged, directed at different models of the data upon which inductive inference is based. In the first of these models, data is made available in some arbitrary order with no assumptions about the statistical processes that govern its generation. In the second model it is assumed that data arise via independent and identically distributed trials with respect to some underlying probability distribution (we refer to this below as iid data). We may thus picture the current state of mathematical research on inductive inference as the 2 × 2 matrix shown in Table 1.
¹ An entry to the literature is provided in [23, 2].
                  Restricted    Expressive
non-iid data          I             II
iid data             III            IV

Table 1: Contemporary Research on Machine Inductive Inference

Through the early 1980's most work in machine induction fell in Quadrant I (see [11] for an overview of this research). In 1984, Valiant [27] introduced a new model of inductive inference based on iid data. This model relaxed the requirements on the accuracy and reliability of inference algorithms. These relaxed requirements made possible the imposition of more stringent demands concerning efficiency, both in terms of the amount of data examined and the resources consumed to examine them. Valiant's approach gave rise to the research thrust in Quadrant III, which has yielded quantitative results relating the time complexity of learning algorithms to the level of accuracy and reliability demanded of the solutions they provide. Blumer et al. [1] elaborated and extended Valiant's model of machine induction to give a deep mathematical analysis of the conditions under which a wide range of discovery problems can be solved within this model. Their analysis has led to a vigorous research effort on the part of many researchers devoted to investigating reliable and efficient inference of classes of geometric concepts, recursive functions, and formal languages (see [23]). Simultaneous with the foregoing developments in Quadrant III, Osherson & Weinstein [14, 16, 18], building on earlier work of Shapiro [26] and Glymour [6], introduced a model of inference for first-order logical structures which extended the research in Quadrant I to the realm of highly expressive systems for knowledge representation. This work thus falls into Quadrant II. We obtained general results about the identification of classes of relational structures and about the behavior of algorithms satisfying various computational restrictions. In recent work, we have extended the research thrust in Quadrant III to the quantitative study of algorithms for inferring properties of relational structures.
This latter work thus belongs to Quadrant IV. In order to pursue this study, we have developed mathematical definitions of approximate truth which allow us to extend the iid data model to discovery problems which arise in the context of first-order logic. It is our hope that these developments open the way to further quantitative results on machine inductive inference in the domain of highly expressive knowledge representations. In this paper, we briefly describe our research on inductive inference corresponding to Quadrants II and IV. Our work in Quadrant II has focussed on a paradigm of scientific discovery known as "truth detection," wherein an inductive agent is responsible for determining the truth value of a first-order sentence in an unknown structure. Within this paradigm, data are presented in arbitrary order. In contrast, our research in Quadrant IV has been devoted to articulating concepts of approximate truth and investigating the inference of approximately true theories on the basis of iid data drawn from arbitrary relational structures. We now proceed to describe some highlights of this research, beginning with approximate truth.

APPROXIMATE TRUTH

Ideally, we desire our inductive inference agents to provide complete solutions to the problems posed to them, to work with 100% reliability, and to be computationally feasible. It was thus an essential contribution to the theory of learning to discover that for many situations of interest to us, the existence of such inductive inference agents is ruled out in principle. Such was the message of Gold's [7] seminal paper and of much of the research to which it gave rise. Valiant's [27] paper may be viewed as a response to the negative theorems of this literature. He showed that for small sacrifices in reliability and accuracy, efficient inductive inference algorithms could be designed for nontrivial learning problems. Valiant's paradigm came to be known as "Probably Approximately Correct" (or PAC) learning, since the solution sought need not be entirely correct nor obtained with perfect reliability. One goal of our research program has been to exploit Valiant's insight in the context of learning problems involving highly expressive languages, notably, first-order logical languages. For this purpose it is first necessary to formulate a sense in which solutions to such problems can be partial.
Two approaches have been pursued. The first approach extends the PAC framework in a straightforward way to the first-order context. The second approach adopts a new analysis of the sense in which a first-order sentence may be approximately true and then investigates algorithms designed to discover approximate truth-values for such sentences in a wide class of potential situations. We give some idea of the results achieved within each approach, starting with our extension of the PAC framework.

Learning First-Order Concepts in the PAC Framework

Within the PAC learning framework (see [1]), a space of points is selected, along with a collection of its subsets called "concepts." One concept, X, is selected arbitrarily, its content being initially unknown to the learner. Points are then sampled from the space according to a probability distribution that is also unknown to the learner. Each sampled point is labeled as falling in or out of X. The learner must convert the sampled points into a concept X′ such that the probability of the symmetric difference of X and X′ is low according to the distribution that governs sampling. It is desired that regardless of the concept X that was chosen before sampling began, the probability is high that a sample of points will be drawn that leads the learner to a successful conjecture. In this case, the concept class is said to be "learnable" in the space. We assume familiarity with the quantitative version of this concept-learning paradigm, which is presented in [1]. For simplicity in what follows, we allow learners to be any function from labeled samples to concepts, excluding coin tosses as further inputs.
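As a deliberately simple illustration of this protocol, the following Python sketch learns threshold concepts on the unit interval from labeled samples; the tightest-fit learner and the Monte Carlo error estimate are our own illustrative choices, not constructions from the text.

```python
import random

# Sketch of the PAC protocol: an unknown target concept X labels points sampled
# from an unknown distribution; the learner maps the labeled sample to a
# hypothesis X', and success means the symmetric difference of X and X' has
# small probability under the sampling distribution.

def target(x, t=0.37):           # X = [t, 1]; the threshold t is hidden from the learner
    return x >= t

def learner(sample):
    # Tightest-fit hypothesis: threshold at the smallest positively labeled point.
    positives = [x for x, label in sample if label]
    t_hat = min(positives) if positives else 1.0
    return lambda x: x >= t_hat

sample = [(x, target(x)) for x in (random.random() for _ in range(500))]
hypothesis = learner(sample)

# Estimate the error (probability of the symmetric difference) on fresh points.
fresh = [random.random() for _ in range(10000)]
error = sum(hypothesis(x) != target(x) for x in fresh) / len(fresh)
print(f"estimated error: {error:.4f}")   # small with high probability
```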
Now in a practical setting, the set of concepts cannot be arbitrary subsets of the given space. In order to be useful they must at least have finite descriptions in a well-behaved language; otherwise the learner could not communicate her findings to anyone else. First-order logic provides a set of descriptions of finite character, so we now proceed to embed the foregoing paradigm in a model-theoretic context. Our discussion will be relatively nontechnical. To begin, we fix an arbitrary, nonlogical vocabulary and denote the resulting predicate calculus (with identity) by L. For example, the nonlogical vocabulary might consist of a single binary relation symbol S. The set of sentences of L (that is, the formulas in which all variables are bound) is also denoted by L. Let x denote a distinguished free variable of L. By L(x) we denote the set of formulas in which just the variable x occurs free. Thus, for the language based solely on S, the following formulas belong to L(x).

(1)
(a) ∃y∃z (Szy ∧ Syx)
(b) ∀y (x = y ∨ Sxy)
(c) ∀y (x = y ∨ Syx)
Suppose now that a model M of L is given. Such a model consists of a nonempty set |M| (called M's domain) plus interpretations of the nonlogical vocabulary in that set. For example, O = (ω, <) is a model of the language based on S; the domain |O| of O is the set ω = {0, 1, 2, ...}. Each model determines the truth value of every θ ∈ L; for example, ∃x∀y(x = y ∨ Sxy) is true in O and ∃x∀y(x = y ∨ Syx) is false. Similarly, each model assigns a subset of its domain to every φ ∈ L(x); this set consists of exactly the domain elements a such that φ is true in the model when x is interpreted as a. To illustrate, O assigns the sets {2, 3, ...}, {0}, and ∅ to (1)a, b, c, respectively. It may thus be seen that any pair (M, Φ) consisting of a model M for L and a subset Φ of L(x) determines a concept-learning problem of the PAC variety. For example, O and (1) determine the problem in which ω is the underlying space of points, and the extensions of (1)a, b, c in O are the collection of concepts.
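To make these extensions tangible, here is a small Python sketch that evaluates the formulas of (1) over a finite initial segment of the model O = (ω, <); it is purely illustrative, and the caveat in the comments about truncating an infinite domain is ours.

```python
# Brute-force extensions of the formulas in (1) over a finite prefix {0,...,N-1}
# of the model O = (omega, <), where S is interpreted as the relation "<".
N = 10
D = range(N)
S = lambda a, b: a < b

# (1)(a): exists y, z with Szy and Syx  -> holds exactly when x >= 2
ext_a = {x for x in D if any(S(z, y) and S(y, x) for y in D for z in D)}

# (1)(b): for all y, x = y or Sxy       -> x is the minimum, i.e. {0}
ext_b = {x for x in D if all(x == y or S(x, y) for y in D)}

# (1)(c): for all y, x = y or Syx       -> x is the maximum; empty in omega,
# but the truncated domain wrongly supplies a maximum (N-1), a reminder that
# quantifiers over an infinite model cannot in general be checked by finite search.
ext_c = {x for x in D if all(x == y or S(y, x) for y in D)}

print(ext_a)  # {2, 3, ..., 9}
print(ext_b)  # {0}
print(ext_c)  # {9} on the prefix, though the extension in O itself is empty
```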
Given a class K of models and Φ ⊆ L(x), Φ is said to be learnable in K just in case Φ is PAC-learnable in every S ∈ K. Within this analysis two mathematical problems arise. They may be stated as follows.

(2)
(a) Given a set Φ ⊆ L(x), characterize the models in which Φ is learnable, and the models in which Φ is not learnable.
(b) Given a collection K of models, characterize the sets of formulas that can be learned in K, and the sets of formulas that cannot be learned in K.
To address these questions, a fundamental tool is the work of Blumer et al. [1] relating VC-dimension to learnability. Relying on their results, we have been able to prove a variety of theorems bearing on (2)a, b.
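For readers who want to experiment, here is a brute-force Python sketch of the VC-dimension computation that underlies the Blumer et al. results; it is workable only for small finite spaces and is our own illustration, not a tool used in [1] or [19].

```python
from itertools import combinations

# A set of points is shattered by a concept class if every one of its subsets
# can be carved out by intersecting it with some concept.  The VC dimension is
# the size of the largest shattered set.

def shattered(points, concepts):
    patterns = {frozenset(p for p in points if p in c) for c in concepts}
    return len(patterns) == 2 ** len(points)

def vc_dimension(space, concepts):
    dim = 0
    for k in range(1, len(space) + 1):
        if any(shattered(pts, concepts) for pts in combinations(space, k)):
            dim = k
        else:
            break
    return dim

# Example: initial-segment concepts {0,...,m-1} on a small space.  No pair of
# points can be shattered (no initial segment contains the larger point only),
# so the VC dimension is 1.
space = list(range(8))
concepts = [frozenset(range(m)) for m in range(9)]
print(vc_dimension(space, concepts))  # prints 1
```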
One finding of a positive character followed by one of a negative character may be described here; details, proofs, and further results are provided in [19]. The following standard terminology will be helpful. A set T ⊆ L is called a theory. Given theory T and model S, we write S ⊨ T just in case every member of T is true in S.

First finding: A theory T is called strong just in case it meets the following conditions, for all models S, U: (a) if S ⊨ T then |S| is infinite; (b) if S ⊨ T, U ⊨ T, and both S and U have denumerable domains, then S and U are isomorphic (in other words, T is "ω-categorical"). For example, the theory of dense linear orders without endpoints is strong (see [3, Proposition 1.4.2]). The following theorem shows that the class of all first-order concepts can be learned in any model of a strong theory.

(3) THEOREM: Suppose that T is a strong theory. Then L(x) is learnable in {S | S ⊨ T}.

Second finding: Given a set Φ ⊆ L(x), we say that a theory T expresses the learnability of Φ just in case for all models S, Φ is learnable in S iff S ⊨ T. Such theories have the useful property of providing a test for learnability in given situations. Unfortunately, no theory expresses the learnability of even relatively simple subsets of L(x). This is the content of the next theorem, which concerns the subset of L(x) consisting of formulas of the form ∃y∀z φ(x, y, z): no theory expresses the learnability of this set.
Determining the Approximate Truth of First-Order Theories

Our second approach to discovery problems within Quadrant IV starts from a definition of the concept "first-order sentence θ is approximately true in relational structure S." We shall here limit ourselves to brief discussion of this idea; details are provided in [12, 13]. Our theory starts from consideration of the degree to which one structure approximates another. Approximate truth in a structure is then construed as (exact) truth in an approximating structure. It is not claimed that this approach illuminates every aspect of the problem of approximate truth. Rather, our theory is designed for situations of the following kind. Let us conceive of a narrow strip of land (e.g., a coastline) undergoing mineral exploration. A point along the strip is to be designated randomly according to some unknown probability distribution. Once the site is designated, it will be decided whether to drill at that location. Let p be a variable for points along the strip, and consider the following predicates and hypothesis (5).

Lp  =  a lode exists within 1000 feet of the surface at point p.
Rp  =  there is superficial igneous rock at point p.

(5) (∀p)(Lp → Rp)
Even if false about the actual strip under exploration, (5) might be useful if true about a fictitious strip that approximates it. In this case, (5) can be considered to be approximately true about the actual strip. To give substance to the foregoing idea, let the actual and fictitious strips be represented by the same real interval I. Let L, R be the extensions of L and R in the actual strip, and L′, R′ be their extensions in the fictitious strip. For the fictitious strip to approximate the actual one we require that every point in L′ be near to some point in L, and that every point in the complement of L′ be near to some point in the complement of L; similarly for R′ and R. It is natural, however, to ask for greater nearness in high probability subregions than in low probability subregions, since our judgment about drilling is more likely to be put to the test in the former than in the latter. We thus define the "probability distance" of two points to be the probability mass of the interval that separates them. It can be seen that two points separated by a small
probability distance are either metrically close in a high mass interval or else common members of a low mass interval. Now fix b ∈ (0, 1). The fictitious strip is called a "b-variant" of the actual strip just in case for every point p′ there is a point p such that p′ is within probability distance b of p, and p′ ∈ L′ iff p ∈ L; likewise for R′ and R. Thus, for the fictitious strip to be a b-variant of the actual one, every point p′ ∈ L′ must be justified by a nearby point p ∈ L; likewise, every point p′ ∉ L′ must be justified by a nearby point p ∉ L, and similarly for R′ and R. In this case, we consider the fictitious strip to approximate the actual one, up to the parameter b. Sentences like (5) are considered to be "b-true" in the actual strip just in case they are standardly true in at least one of its b-variants. The following example illustrates the potential usefulness of b-true sentences. Let I, L, and R be as described above. We imagine that I has been partitioned into ten regions. A point will be drawn randomly from I according to an unknown, continuous probability distribution P, and the following question will be posed.

(6) Is p ∈ L for all p in the region from which the sampled point was drawn?

Suppose that inspection reveals there to be no superficial igneous rock in the region actually sampled, and that hypothesis (5) is known to be .01-true in the strip. Then it may be proved that (6) is false with probability at least .80. Our theory is a generalization of the foregoing illustration. We have pursued its development from both the deductive and inductive points of view. For brevity, only the inductive logic of approximate truth will be considered here.
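To fix intuitions, here is a small Python sketch that tests the b-variant condition for one predicate on a discretized strip; the grid representation, the approximation of probability distance, and all names are our own illustrative assumptions, not constructions from [12, 13].

```python
# Sketch: checking the b-variant condition for one predicate on a discretized
# strip.  Cell masses P sum to 1; prob_dist approximates the "probability mass
# of the interval that separates" two cells by summing the masses between them.

def prob_dist(i, j, P):
    lo, hi = min(i, j), max(i, j)
    return sum(P[lo + 1:hi])          # mass strictly between the two cells

def is_b_variant(L, L_prime, P, b):
    n = len(P)
    for i_prime in range(n):
        # Every point of the fictitious extension (and of its complement) must
        # be justified by a nearby point of the actual one (resp. its complement).
        ok = any(L[i] == L_prime[i_prime] and prob_dist(i, i_prime, P) <= b
                 for i in range(n))
        if not ok:
            return False
    return True

# Toy strip of 10 cells under a uniform distribution: the fictitious extension
# shifts the boundary of L by one cell, a zero-mass separation, hence a
# b-variant even for small b.
P = [0.1] * 10
L       = [c >= 5 for c in range(10)]
L_prime = [c >= 6 for c in range(10)]
print(is_b_variant(L, L_prime, P, b=0.05))   # True
```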
Our approach is based on a paradigm of empirical inquiry that may be called "probably approximately correct truth detection." Within this paradigm scientists convert a given first-order sentence θ, along with accuracy and reliability parameters b, c, into a set of queries. The queries bear on the interpretation of predicates within a fixed but unknown structure S. Illustrating with a unary predicate A, these queries take the form: Does the nth randomly sampled point from the domain of S fall into the S-extension of A? The scientist has no knowledge of the probability distribution that governs random sampling from S's domain. On the basis of a set of queries whose size grows no faster than polynomially in 1/b and 1/c, the scientist must emit, with reliability at least 1 - c, a b-truth-value for θ in S. In [12] we show that there is a formal scientist that succeeds in this task with respect to a wide class of first-order sentences and structures. Specifically, the class of sentences for which our method is provably successful includes all sentences in which no predicate letter occurs both negatively and positively. Such sentences are called "monotone." The class of structures in which the scientist can infer an approximate truth-value for monotone sentences includes all structures with continuous domain and measurable extensions for predicates. Placed in an arbitrary and unknown member of this extensive class of structures, and parameterized by any monotone sentence θ and any choice of accuracy and reliability parameters b, c, the scientist we define makes only polynomially many queries in 1/b and 1/c and emits, with reliability at least 1 - c, a b-truth-value for θ in the unknown structure. Our current work on this topic is devoted to extending the foregoing result to nonmonotone sentences, to exploring alternative conceptions of approximate truth, and to investigating other paradigms of scientific discovery in which approximate truth is a satisfactory goal.

TRUTH DETECTION

By a paradigm of scientific discovery let us understand any specification of the concepts "scientist" and "discovery problem" along with a criterion that determines the conditions under which a given scientist is credited with solving a given problem. In our research in Quadrant II we have investigated several paradigms of scientific discovery in which discovery problems are characterized using first-order logical languages and extensions thereof. We here describe one such paradigm, truth detection, and some of the recent results obtained about it. Let a countable, first-order language L with identity be fixed, suitable for expressing scientific theories and data in some field of empirical inquiry. Prior research in the field is conceived as verifying a set T of L-sentences, which constitute the axioms of a theory already known to be true. Each model of T thus represents a possible world consistent
with background knowledge. Nature has chosen one of these models (say, structure S) to be actual; her choice is unknown to us. (For ease of exposition, we will suppose that Nature's choice is limited to countable models of T.) Scientists are conceived as attempting to divine the truth-value in S of specific sentences not decided by T. Suppose that scientist Ψ wishes to determine the truth-value of θ in S. At the start of inquiry, Ψ knows no more about S than what is implied by T. As inquiry proceeds, more and more information about S becomes available. This information has the following character. We conceive of Ψ as being able to determine, for each atomic formula φ of L and any given sequence of objects from the universe of S, whether or not that sequence satisfies φ in S. Ψ receives the entire universe of S in piecemeal fashion and bases its conjecture at a given moment on the finite subset of the universe of S examined by that time. In response to each new datum, Ψ emits a fresh conjecture about the truth of θ in S, announcing either "true" or "false." To be counted as successful, Ψ's conjectures must stabilize to the correct one. Notice that no assumption is made about the process generating the data Ψ receives; in particular, in order to successfully detect the truth of θ in S, we require Ψ to stabilize to a correct conjecture no matter what data sequence is presented. (This distinguishes the current model from the iid data model presented in the preceding section, where data sequences are generated by randomly sampling points from a structure according to some time-invariant probability distribution over the universe of that structure.) Let us summarize the above discussion with the following definition: we say a scientist Ψ detects the truth-value of a sentence θ with respect to background knowledge T just in case for every countable model S of T and every data sequence e generated from S, Ψ stabilizes to a correct conjecture about the truth-value of θ in S.
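The stabilization requirement can be illustrated with a minimal Python sketch; the streaming interface and the restriction to a single existential sentence are our own simplifying assumptions, not part of the formal paradigm.

```python
# Sketch: a scientist that detects, in the limit, the truth-value of the
# sentence "there exists x with A(x)" from an enumeration of the universe.
# Each datum is a pair (object, holds_A).  The conjecture may change finitely
# often but must eventually stabilize to the correct value: if the sentence is
# true, some witness eventually appears in a complete enumeration; if it is
# false, the conjecture is "false" from the outset and never changes.

def scientist(data_stream):
    conjecture = False                   # no witness seen yet
    for obj, holds_A in data_stream:
        if holds_A:
            conjecture = True            # a witness settles the matter forever
        yield conjecture

stream = [("a", False), ("b", False), ("c", True), ("d", False)]
print(list(scientist(stream)))           # [False, False, True, True]
```

Note the asymmetry that is characteristic of limiting inference: truth is eventually confirmed by a datum, while falsity is never positively refuted, yet the conjectures still stabilize.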
Our research on truth detection has addressed the following questions, among others:

• For which sentences θ and theories T do there exist scientists who detect the truth of θ with respect to T?

• Are there theories T such that some single scientist detects the truth-value of all sentences with respect to T?

• How are the answers to the preceding questions altered if we impose computational or methodological constraints on the scientists in question?

We have also examined a further question, the answer to which provides considerable information about the choice of first-order logic as a mode of representation for background knowledge. The remainder of this section is devoted to this matter. With respect to the paradigm of truth detection, we may view discovery problems as parameterized by a theory T which represents the background knowledge available to a scientist at the outset of investigation. When discovery problems are so viewed, the following uniformity question naturally presents itself. Is there a uniform method M for solving the problem posed by T, if that problem is solvable at all? Such a method M might be uniform in T in the following sense. In the course of computing its conjecture about the truth-value of the sentence θ, M could receive answers to any queries it chose about the membership of individual sentences in T. M's computation of its conjecture in the face of incoming data would then be entirely effective relative to the answers it received to its queries. Such a method M may be represented by a Turing machine with an oracle. If M is such an oracle machine, we write M^T to denote the scientist computed by M when equipped with an oracle for T. Given this understanding of uniform method for the solution of discovery problems, the following theorem provides an affirmative answer to the above question.

(7) THEOREM: There is an oracle machine M such that for all sets of first-order sentences T and all first-order sentences θ, if there is a scientist who detects the truth-value of θ with respect to T, then M^T detects the truth-value of θ with respect to T.

Proof of the theorem is provided in [17]. Theorem (7) leads to a fundamental question about the role of first-order logic in inductive inference. We ask: In making possible the uniform solution of scientific discovery problems, what is the role of our choice of first-order logic for representing background knowledge? Could some yet more expressive logical language be used to represent background knowledge that would still allow for a uniform solution of the
discovery problems thus represented? Our research has provided a partial answer to these questions. In order to give the answer, we will need to consider a slight strengthening of the paradigm of truth detection and introduce some concepts from the theory of models. Let L′ be a regular logical language which contains our first-order language L. (For the concept of a regular logical language see [4]; suffice it to say that first-order logic itself is a regular logical language, as are most extensions of first-order logic, such as second-order logic, extensions by the addition of cardinality quantifiers, etc.) Let T be a set of L′-sentences, θ a sentence of L, and S a model of T. We say that e is a restricted data sequence for S and θ just in case e is the result of removing all information about atomic formulas containing vocabulary not present in θ from some data sequence e′ generated from S. We say that scientist Ψ strongly detects the truth-value of θ with respect to T just in case for every countable model S of T and every restricted data sequence e for S and θ, Ψ stabilizes to a correct conjecture about the truth-value of θ in S. Finally, we say that L′ has the uniform strong detection property just in case there is an oracle machine M such that for all sets of L′-sentences T and all first-order sentences θ, if there is a scientist who strongly detects the truth-value of θ with respect to T, then M^T strongly detects the truth-value of θ with respect to T. Our preceding theorem may be strengthened to exhibit further uniformity in the solvability of discovery problems characterized by first-order theories, as follows.

(8) THEOREM: First-order logic has the uniform strong detection property.

We are now in a position to show the extent to which first-order logic is unique in affording uniform solution of discovery problems. A regular logical language L′ has the Löwenheim-Skolem property just in case every satisfiable L′-sentence has a countable model. It is a fundamental fact of model theory that first-order logic has the Löwenheim-Skolem property. The following theorem provides a characterization of first-order logic as a maximal regular logic with the Löwenheim-Skolem and uniform strong detection properties (see [17] for proof).

(9) THEOREM: Let L′ be a regular logical language containing first-order logic L. If L′ has the Löwenheim-Skolem property and the
uniform strong detection property, then L′ = L.

Theorem (9) indicates that first-order logic has a special status as a knowledge representation language for scientific discovery problems. This result also suggests important topics for further research. First, are there proper extensions of first-order logic which fail to have the Löwenheim-Skolem property but do allow for uniform solution of discovery problems? Second, are there languages whose expressive power is incomparable with that of first-order logic which allow for uniform solution of discovery problems? Such languages might arise as fragments of proper extensions of first-order logic. The answers to both these questions may have significance for the choice of knowledge representation languages for discovery problems which arise in scientific or technological contexts. We plan to investigate these and related issues in our continuing research on automated scientific discovery.

CONCLUDING REMARKS

Each paradigm of empirical inquiry studied within Computational Learning Theory is a mathematical abstraction from the complex web of issues indicated in the introduction above. Study of these models is aimed at facilitating the development of practical algorithms for the automated solution of discovery problems arising in practice. It may be hoped, as well, that results within the theory partially clarify some of the questions that surround the nature of scientific activity itself. Some of our work has been focussed on such questions (e.g., [5, 15, 21]). The present discussion concludes with a brief summary of one pertinent result. Scientific inference is an essentially non-deductive affair inasmuch as true theories (apart from trivial, exceptional cases) cannot be deduced from the data available to scientists. Nonetheless, deductive logic is widely recognized to play a central role in scientific thought, for example, in drawing out the consequences of a theory for empirical test. For this reason deductive logic has been central to the analysis of several components of scientific activity. To illustrate, it has been suggested that the confirmation of a scientific theory is a function of the empirical verification of its logical consequences (see [8] for discussion).
Unfortunately, a simple analysis of confirmation on this basis founders on the richness of the set of logical consequences of a given theory. Thus, one consequence of the axiom A is A ∨ S for an arbitrary sentence S; yet verification of S (hence of A ∨ S) need not confirm A. To save the insight behind the idea that confirmation of consequences yields inductive support, it is tempting to exclude inferences like A ⊨ A ∨ S from the set of "scientifically relevant" deductions. After all, this latter inference has a suspicious character inasmuch as it does not depend on any particular relation between A and S. Following this line of thought, several definitions of scientifically relevant deduction have been advanced, leading to fruitful analyses of confirmation and theory comparison (see [29, 28, 24, 25]). To be pertinent to our understanding of scientific practice, however, a definition of relevant deduction must satisfy a further criterion. It must be the case that scientists whose deductive reasoning is limited to relevant inferences are just as scientifically competent as scientists not so limited. That is, for every scientific problem that is solvable in principle, there must exist a scientist who never reasons in deductively irrelevant fashion yet who also succeeds in solving that problem. Otherwise, the proposed definition of relevant deduction does not allow us to fully understand how it is that science sometimes succeeds. Starting with a simple definition of relevant deduction due to Schurz & Weingartner [25], we have shown that for every solvable problem of the kind described in the last section there is indeed a successful scientist whose deductive reasoning conforms to the definition. Details are given in [22]. Evidence is thereby provided that the kind of definition proposed in Schurz & Weingartner [25] is plausible as a representation of the deductive component of scientific reasoning. In this way, study of formally defined paradigms of inductive inference within Computational Learning Theory can shed some light on the foundations of scientific inquiry.
References

[1] Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. (1987). Learnability and the Vapnik-Chervonenkis dimension (Technical Report UCSC-CRL-87-20). Santa Cruz: University of California.
[2] Case, J. & Fulk, M. (Eds.) (1990). Proceedings of the third annual workshop on computational learning theory. San Mateo, CA: Morgan Kaufmann.

[3] Chang, C. C. & Keisler, H. J. (1973). Model theory. Amsterdam: North-Holland.

[4] Ebbinghaus, H.-D. (1985). Extended logics: the general framework. In Barwise, J. & Feferman, S. (Eds.), Model theoretic logics. New York: Springer-Verlag.

[5] Gaifman, H., Osherson, D., & Weinstein, S. (1990). A reason for theoretical terms. Erkenntnis, 32, 149-159.

[6] Glymour, C. (1985). Inductive inference in the limit. Erkenntnis, 22, 23-31.

[7] Gold, E. M. (1967). Language identification in the limit. Information and Control, 10, 447-474.

[8] Hempel, C. G. (1965). Aspects of scientific explanation and other essays in the philosophy of science. The Free Press.

[9] Langley, P., Bradshaw, G., & Simon, H. (1983). Rediscovering chemistry with the BACON system. In R. Michalski, J. Carbonell, & T. Mitchell (Eds.), Machine learning: An artificial intelligence approach. Palo Alto, CA: Tioga.

[10] Langley, P. & Nordhausen, B. (1986). A framework for empirical discovery. In Proceedings of the International Meeting on Advances in Learning, Les Arcs, France.

[11] Osherson, D., Stob, M., & Weinstein, S. (1986). Systems that Learn. Cambridge, MA: MIT Press.

[12] Osherson, D., Stob, M., & Weinstein, S. (1989). On approximate truth. In R. Rivest, D. Haussler, & M. Warmuth (Eds.), Proceedings of the second annual workshop on computational learning theory. San Mateo, CA: Morgan Kaufmann.

[13] Osherson, D., Stob, M., & Weinstein, S. (1989). A theory of approximate truth (Technical Report). Cambridge, MA: M.I.T.
[14] Osherson, D. & Weinstein, S. (1986). Identification in the limit of first-order structures. Journal of Philosophical Logic, 15, 55-81.

[15] Osherson, D., Stob, M., & Weinstein, S. (1988). Mechanical learners pay a price for Bayesianism. Journal of Symbolic Logic, 53, 1245-1251.

[16] Osherson, D. & Weinstein, S. (1989). Paradigms of truth detection. Journal of Philosophical Logic, 18, 1-42.

[17] Osherson, D., Stob, M., & Weinstein, S. (1991). A universal inductive inference machine. Journal of Symbolic Logic, 56, 661-672.

[18] Osherson, D., Stob, M., & Weinstein, S. (in press). A universal method of scientific inquiry. Machine Learning.

[19] Osherson, D., Stob, M., & Weinstein, S. (1991). New directions in automated scientific discovery. Information Sciences.

[20] Osherson, D. & Weinstein, S. (1989). Identifiable collections of countable structures. Philosophy of Science, 56, 95-105.

[21] Osherson, D. & Weinstein, S. (1990). On advancing simple hypotheses. Philosophy of Science, 51, 266-277.

[22] Osherson, D. & Weinstein, S. (in press). Relevant consequence and scientific discovery. Journal of Philosophical Logic.

[23] Rivest, R., Haussler, D., & Warmuth, M. (Eds.) (1989). Proceedings of the second annual workshop on computational learning theory. San Mateo, CA: Morgan Kaufmann.

[24] Schurz, G. (1991). Relevant deduction. Erkenntnis.

[25] Schurz, G. & Weingartner, P. (1987). Verisimilitude defined by relevant consequence-elements: A new reconstruction of Popper's idea. In T. A. Kuipers (Ed.), What is closer-to-the-truth? Amsterdam: Rodopi.

[26] Shapiro, E. (1981). An algorithm that infers theories from facts. In Proceedings of the seventh international joint conference on artificial intelligence.
[27] Valiant, L. (1984). A theory of the learnable. Communications of the ACM, 27, 1134-1142.

[28] Weingartner, P. (1988). Remarks on the consequence-class of theories. In E. Scheibe (Ed.), The role of experience in science. Walter de Gruyter.

[29] Weingartner, P. & Schurz, G. (1986). Paradoxes solved by simple relevance criteria. Logique et Analyse.

[30] Zytkow, J. (1987). Combining many searches in the FAHRENHEIT discovery system. In Proceedings of the Fourth International Workshop on Machine Learning, Irvine, CA.