Parallel Processing for Artificial Intelligence (Machine Intelligence & Pattern Recognition) (v. 3)

Introduction This book is the third volume in an informal series of books about parallel processing for Artificial Intel...

Author: C.B. Suttner

94 downloads 726 Views 4MB Size Report

This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!

Report copyright / DMCA form

DOWNLOAD PDF

Introduction This book is the third volume in an informal series of books about parallel processing for Artificial Intelligence. Like its predecessors, it is based on the assumption that the computational demands of many AI tasks can be better served by parallel architectures than by the currently popular workstations. However, no assumption is made about the kind of parallelism to be used. Transputers, Connection Machines, farms of workstations, Cellular Neural Networks, Crays, and other hardware paradigms of parallelism are used by the authors of this collection. Papers in this collection are from the areas of parallel knowledge representation, neural modeling, parallel non-monotonic reasoning, search and partitioning, constraint satisfaction, theorem proving, parallel decision trees, parallel programming languages and low-level computer vision. The final paper is an experience report about applications of massive parallelism which can be said to capture the spirit of a whole period of computing history. The articles of this book have been grouped into four chapters: Knowledge Representation, Search and Partitioning, Theorem Proving, and Miscellaneous. Most of the papers are losely based on a workshop that was held at IJCAI 1995 in Montreal, Canada. All papers have been extensively reviewed and revised resulting in a collection that gives a snapshot of the state of the art in Parallel Processing for Artificial Intelligence. In the chapter on Knowledge Representation, Shastri and Mani show how a good understanding of human neural processing can inspire and constrain the building of efficiently implemented parallel reasoning mechanisms. Boutsinas, Stamatiou and Pavlides apply parallel processing techniques to Touretzky-style nonmonotonic inheritance networks. Lee and Geller present an efficient representation of class hierarchies on a massively parallel supercomputer. Stoffel, Hendler and Saltz describe a powerful parallel implementation of their PARKA frame system. In the second chapter, all three papers, by Cook, Suttner and Berlandier and Neveu, deal with partitioning of a search space. Cook describes HyPS, a parallel hybrid search algorithm. Suttner presents static partitioning with slackness as an approach for parallelizing search algorithms. Berlandier and Neveu describe a partitioning technique applied to constraint satisfaction problems. The third chapter contains papers about parallel reasoning. Bergmann and Quantz describe a system based on the classical KL-ONE approach to knowledge representation. This system is called FLEX, as its main goal is to permit flexible reasoning. Fisher describes a technique for theorem proving that relies on the broadcast of partial results. The SiCoTHEO theorem prover of Schumann is based on competing search strategies. The last chapter contains a number of papers that do not fit well into any other category of this book. Destri and Marenzoni analyze different parallel architectures for their ability

vi to execute parallel computer vision algorithms. Kufrin's paper is about machine learning, namely the induction of a parallel decision tree from a given set of data. Lallement, Cornu, and Vialle combine methods from both connectionist and symbolic AI to build an agent-based programming language for Parallel Processing for Artificial Intelligence. Finally, Waltz presents an inside look at the development of the Connection Machine, and the problems that were arising when trying to make it financially viable. The Appendix of this book contains a list of references to papers about Parallel Processing for Artificial Intelligence that appeared at a number of workshops, giving the reader information about sources that are otherwise not easily accessible. The editors would like to thank the authors for their timely and diligent manuscript submissions, and all the reviewers for their efforts in selecting a set of high quality papers. We thank Laveen Kanal and Azriel Rosenfeld for their support in making this project happen. Additional thanks go to the staff at Elsevier, especially Y. Campfens, who patiently suffered through a sequence of delays in our preparation of the final document. We thank Y. Lee and J. Stanski of N JIT who helped with some of the "leg work" in editing the final version of this book. Finally, J. Geller would like to thank his family for giving him time off on Sundays to finish this document.

ix

About the Editors J a m e s Geller James Geller received an Electrical Engineering Diploma from the Technical University Vienna, Austria, in 1979. His M.S. degree (1984) and his Ph.D. degree (1988) in Computer Science were received from the State University of New York at Buffalo. He spent the year before his doctoral defense at the Information Sciences Institute (ISI) of USC in Los Angeles, working with their Intelligent Interfaces group. James Geller received tenure in 1993 and is currently associate professor in the Computer and Information Science Department of the New Jersey Institute of Technology, where he is also Director of the AI & OODB Laboratory. Dr. Geller has published numerous journal and conference papers in a number of areas, including knowledge representation, parallel artificial intelligence, and object-oriented databases. His current research interests concentrate on object-oriented modeling of medical vocabularies, and on massively parallel knowledge representation and reasoning. James GeUer was elected SIGART Treasurer in 1995. His Data Structures and Algorithms class is broadcast on New Jersey cable TV. Hiroaki Kitano

Dr. Hiroaki Kitano is a Senior Researcher at Sony Computer Science Laboratory. Dr. Kitano received a B.A. in Physics from International Christian University, and a Ph.D. in Computer Science from Kyoto University. He joined NEC's software Engineering Laboratory in 1984, and developed a number of very large software systems. From 1988 to 1993, he was a visiting researcher at the Center for Machine Translation, Carnegie Mellon University. In 1993, he received the Computers and Thought Award from the International Joint Conference on Artificial Intelligence. His current academic service includes, chairperson of the international committee for RoboCup (World Cup Robot Soccer), associate editor for Evolutionary Computing, Applied AI journal, and other journals, as well as executive member of various international committees. Christian Suttner

Christian Suttner studied Computer Science and Electrical Engineering at the Technische Universit~it Miinchen and the Virginia Polytechnic Institute and State University. He received a Diploma with excellence from the TU Miinchen in 1990, and since then he is working as a full-time researcher on parallel inference systems in the Automated Reasoning Research Group at the TU Miinchen. He received a Doctoral degree in Computer Science from the TUM in 1995. His current research interests include automated

theorem proving, parallelization of search-based systems, network computing, and system evaluation. Together with Geoff Sutcliffe, he created and maintains the TPTP problem library for automated theorem proving systems and designs and organizes theorem proving competitions.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 1997 Elsevier Science B.V.

Massively Parallel Knowledge Representation and Reasoning: Taking a Cue from the Brain Lokendra Shastri a* and D.R. Mani b aInternational Computer Science Institute 1947 Center Street, Ste. 600 Berkeley, CA 94707 bThinking Machines Corporation 14 Crosby Drive Bedford, MA 01730 Any intelligent system capable of common sense reasoning and language understanding must be capable of performing rapid inferences with reference to a large body of knowledge. The ability to perform rapid inferences with large knowledge bases is also essential for supporting flexible and effective access to the enormous body of electronically available data. Since complexity theory tells that not all inferences can be computed effectively, it is important to identify interesting classes of inference that can be performed effectively. Over the past several years we have tried to do so by working within a neurally motivated, massively parallel computational model. Our approach is motivated by the belief that situating the knowledge representation and reasoning problem within a neurally motivated computational architecture will not only enhance our understanding of the mind/brain, but it will also lead to the development of effective knowledge representation and reasoning systems implemented on existing hardware. In this chapter we substantiate this claim and review some results of pursuing this approach. These include a characterization of reflexive reasoning reasoning that can be performed effectively by neurally plausible networks; the design of CSN a connectionist semantic network that can perform inheritance and recognition in time proportional to the depth of the conceptual hierarchy; SHRUTI a connectionist knowledge representation and inference system that can encode a large number of facts, rules, and a type hierarchy, and perform a class of first-order inferences with extreme efficiency, and SHRUTI-CM5 an implementation of SHRUTI on the CM-5 that can encode over half a million rules, facts, and types and respond to reflexive queries within a few hundred milliseconds. 1. I N T R O D U C T I O N The ability to represent and use a large body of knowledge effectively is an essential characteristic of intelligent systems. For example, understanding language requires the *This work was partially funded by NSF grant IRI 88-05465, ARO grant DAA29-84-9-0027, ONR grants N00014-93-1-1149 and N00014-95-C-0182, and NSF resource grant CCR930001N.

hearer to draw inferences based on a large body of common sense knowledge in order to establish referential and causal coherence, generate expectations, and make predictions. Plausible estimates of the size of such a knowledge base range from several hundred thousand to more than a million items [8]. Nevertheless, we can understand language at the rate of several hundred words per minute. This clearly suggests that we are capable of performing a wide range of inferences with reference to a large knowledge base within a few hundred milliseconds. Any real-time language understanding system should be capable of replicating this remarkable human ability. There has been an explosive growth in electronically available information and the number of consumers of such information. The storage, transmission, access, and ultimately, the effective use of this large and heterogeneous body of data poses a number of technological challenges. A core challenge--and one that is relevant to our work--is the problem of providing intelligent content-based access to the available data. The ability to provide such access however, will depend in large part on a system's ability to bridge the "semantic gap" between a user's query and the relevant data items. This in turn would critically depend on a system's ability to perform rapid inferences based on a variety of knowledge such as ontological knowledge, terminological knowledge, domain knowledge, common sense knowledge, and user models. Several of these knowledge sources--in addition to the common sense knowledge base--will be ve.ry large and may contain several hundred thousand items. For example, the Unified Medical Language System's terminological component contains 190,863 entries consisting of medical, clinical and chemical concepts [26]. While database and information retrieval technology has evolved considerably over the past decade, the development of large-scale yet efficient knowledge based systems capable of supporting inference has lagged behind. There exist a number of robust and sophisticated database management systems that provide efficient access to ve.ry large databases, but there do not exist :high performance systems that can carry out efficient inference with respect to large knowledge bases. The integration of the inferential capabilities of an effective large-scale inference system and the full functionality of existing database and information systems should contribute to the development of a flexible, expressive, and efficient system for accessing large and heterogeneous databases. Thus from the point of view of building artificially intelligent systems capable of understanding natural language as well as from the perspective of supporting emerging technology for accessing electronically available information, it is important to develop high performance knowledge representation and reasoning systems. Complexity theo .ry however, rules out the existence of systems capable of performing all inferences effectively. Thus the key scientific challenge in building an efficient inference system consists of identif.ying interesting and useful classes of inference that can be performed effectively. AI researchers, including ourselves have pursued this goal using a number of strategies. Our approach focuses on identifying interesting and useful classes of inference that can be performed rapidly by neurally motivated and massively parallel computational models. The thesis underlying our approach is that crucial insights into the nature of knowledge representation and reasoning can be obtained by working within the computational constraints suggested by the human brain--the only extant system that exhibits the requisite attributes of response time and scalability. We believe that situating the

knowledge representation and reasoning problem within a neurally motivated computational architecture willnot only enhance our understanding of the mind/brain, but it will also lead to the development of effective knowledge representation and reasoning systems realized on existing high performance hardware platforms. In the rest of this chapter we describe some results of pursuing this approach. Section 2 discusses our approach and its motivation in more detail. Sections 3 and 4 describe two connectionist models of knowledge representation and reasoning. Section 5 describes the mapping of one of these models onto existing hardware platforms and Section 6 offers some conclusions. 2. C O M P U T A T I O N A L

EFFECTIVENESS

As the science of artificial intelligence has matured over four decades, it has become apparent that we had underestimated the complexity and intricacy of intelligent behavior. Today we realize that the task of building a system that performs intelligently in a limited domain is dramatically different from that of designing a system that displays the sort of natural intelligence we take for granted among humans and higher animals. This sharp difference is highlighted by the limitations of artificial intelligence systems developed to understand natural language, process visual information, and perform common sense reasoning. There are programs that "understand" English if the exchange is limited to talk about airplane tickets or restaurants; there are reliable vision systems that can identify a predefined set of objects presented under controlled conditions; but we have yet to design systems that can recognize objects with the skill of a monkey, or converse with the facility of a five year old. Given that existing AI systems perform credibly within restricted domains, one may be led to believe that in order to accommodate more complex domains all that is necessa .ry is to encode more facts and rules into our programs. But the situation is not so straightforward; it is not as though the existing programs are just miniature versions of larger programs that would perform intelligently in richer domains. The problem is that the solutions do not scale up: the techniques that work in restricted domains are inadequate for dealing with richer and more complex domains. As the domains grow bigger and more complex, we run into the stone wall of computational effectiveness; the performance of the system degrades and it can no longer solve interesting problems in acceptable time-scales. This is not surprising if we recognize that intelligent activity involves ve.ry dense interactions between many pieces of information, and in any system that encodes knowledge about a complex domain, these interactions can become too numerous for the system to perform effectively. A concern for computational effectiveness should be central to AI. From the viewpoint of AI, it does not suffice to offer a computational account of how an agent may solve an interesting set of problems. AI needs to solve a far more difficult problem: it must provide a computational account of how an agent may solve interesting problems in the time frame permitted by the environment. 2 The ability to satisfy the computational effectiveness constraint appears to be one of the basic properties of intelligent agents. Success, and 2The significance of computational effectiveness in the context of AI was first discussed in these terms in [17].

at times even the survival of an agent, may depend on his ability to make decisions and choose appropriate actions within a given time frame. In fact, in certain situations we would hesitate to label an activity as being "intelligent" if it takes arbitrarily long. To give an extreme example--if time were not a factor, even a dumb computer could beat the world's greatest chess player by simply enumerating the full search tree and following a path that guaranteed a win. No doubt this would take an aeon, but if time is not a factor this should not be of consequence. 3 It is tempting to ignore the computational effectiveness constraint by characterizing it as being merely a matter of efficiency or an implementation level detail. But doing so would be a mistake. Since computational effectiveness places strong constraints on how knowledge may be organized and accessed by cognitive processes, we believe that it may be essential to tackle the question of computational effectiveness at the very outset in order to understand the principles underlying the organization and use of information in intelligent systems. 2.1. C o m p u t a t i o n a l effectiveness n e c e s s i t a t e s a s t r o n g n o t i o n of t r a c t a b i l i t y As pointed out earlier, human agents perform a range of inferences while understanding language at the rate of several hundred words per minute. These inferences are performed rapidly, spontaneously and without conscious effort--as though they were a reflex response of our cognitive apparatus. In view of this we have described such reasoning as reflexive [20]. Reflexive reasoning may be contrasted with reflective reasoning which requires reflection, conscious deliberation, and often an overt consideration of alternatives and weighing of possibilities. Reflective reasoning takes longer and often requires the use of external props such as a paper and pencil. Some examples of such reasoning are solving logic puzzles, doing cryptarithmetic, or planning a vacation. W h a t should be the appropriate criteria of tractability in the context of knowledge representation and reflexive reasoning? Since polynomial time complexity is the usual "threshold" for distinguishing the tractable from the intractable in computer science, it may seem reasonable to adopt this notion of tractability in this context. But as argued in [21] reflexive reasoning requires a more stringent criterion of tractability. Let us amplify:

9 Reflexive reasoning occurs with respect to a large body of background knowledge. A serious attempt at compiling common sense knowledge suggests that our background knowledge base may contain as many as 106 items [8]. This should not be very surprising given that this knowledge includes, besides other things, our knowledge of naive physics and naive psychology; facts about ourselves, our family, friends, colleagues, history and geography; our knowledge of artifacts, sports, art, music; some basic principles of science a n d mathematics; and our models of social, civic, and political interactions. 9 Items in the background knowledge base are fairly stable and persist for a longtime once they are acquired. Hence this knowledge is best described as long-term knowledge and we will refer to this body of knowledge as the long-term knowledge base ( L T K B ) . 3Two caveats are in order. First, we are assuming that a path leading to a forced win exists, but such a path may not exist. Second, in addition to time, space or memory is also a critical resource!

9 Episodes of reflexive reasoning are triggered by "small" inputs. In the context of language understanding, an input (typically) corresponds to a sentence that would map into a small number of assertions. For example, the input "John bought a Rolls Royce" maps into just one assertion (or a few, depending on the underlying representation). The critical observation is that the size of the input, IInl, is insignificant c o m p a r e d t o the size of the long-term knowledge base ILTKBI. 4 9 The vast difference in the magnitude of ILTKBI and Ilnl becomes crucial when discussing the tractability of common sense reasoning and we h a v e t o be careful in how we measure the time and space complexity of the reasoning process. In particular, we need to analyze the complexity of reasoning in terms of ILTKB I as well as Ilnl. In view of the magnitude of ILTKBI, even a cursory analysis suggests that any inference procedure whose time complexity is quadratic or worse in ILTKBI cannot provide a plausible computational account of reflexive reasoning. A process that is polynomial in IIni however, does remain viable.

2.2. T i m e complexity of reflexive reasoning Observe that although the size of a person's ILTKBI increases considerably from, say, age five to thirty, the time taken by a person to understand natural language and draw the requisite inferences does not. This suggests that the time taken by an episode of reflexive reasoning does not depend on the ILTKBI. In view of this it is proposed t h a t a realistic criteria of tractability for reflexive reasoning is one where the time taken by an episode of reflexive reasoning is independent of ILTKBI and only depends on the depth of the derivation tree associated with the inference, s

2.3. Space complexity of reflexive reasoning The expected size of the LTKB also rules out any computational scheme whose space requirement is quadratic (or higher) in the size of the KB. For example, the brain has only about 1011 cells most of which are involved in processing of sensorimotor information. Hence even a linear space requirement is fairly generous and leaves room only for a modest constant of proportionality. In view of this, it is proposed that the admissible space requirement of a model of reflexive reasoning be no more than linear in ILTKBI. To summarize, it is proposed that as far as (reflexive) reasoning underlying language understanding is concerned, the appropriate notion of tractability is one where 9 the reasoning time is independent of [LTKB[ and is only dependent on the depth of the derivation tree associated with the inference, and

[In[ and

4A small input may, however, lead to a potentially large number of elaborate inferences. For example, the input "John bought a Rolls-Royce" may generate a number of reflexive inferences such as "John bought a car", "John owns a car", "John has a driver's license", "John is perhaps a wealthy man", etc. 5The restriction that the reasoning time be independent of ILTKBI may seem overly strong and one might argue that perhaps logarithmic time may be acceptable. Our belief that the stronger notion of effectiveness is relevant, however, is borne out by results which demonstrate that there does exists a class of reasoning that can be performed in time independent of ILTKBI.

9 the associated space requirement, i.e., the space required to encode the L T K B plus the space required to hold the working memo .ry during reasoning should be no worse than linear in ]LTKB]. 2.4. P a r a l l e l i s m The extremely tight constraint on the time available to perform reflexive inferences suggests that we must resort to massive parallelism. Many cognitive tasks, and certainly all the perceptual ones, that humans can perform in a few hundred milliseconds would require millions of instructions on a serial (von Neumann) computer, and it is apparent t h a t a serial computer will be unable to perform these tasks within an acceptable time frame [6]. The crux of the problem becomes apparent if one examines the architecture of a traditional von Neumann computer. In such a computer, the computational and the inferential power is concentrated in a single processing unit (the CPU) while the information on which the computations have to be performed is stored in an inert memo .ry which simply acts as a repository of the system's knowledge. As a result of the single processor design, only one processing step can be executed at any point in time, and during each processing step the CPU can only access a minuscule fraction of the memo .ry. Therefore, at any given instant, only an insignificant portion of the system's knowledge participates in the processing. On the other hand, intelligent behavior requires dense interactions between many pieces of information, and any computational architecture for intelligent information processing must be capable of supporting such dense interactions. It would therefore seem appropriate to treat each memo .ry cell--not as a mere reposito .ry of information, but rather as an active processing element capable of interacting with other such elements. This would result in a massively parallel computer made up of an extremely large number of simple processing elements--as many as there are memo .ry cells in a traditional computer. The processing capability of such a computer would be distributed across its memory, and consequently, such a computer would permit numerous interactions between various pieces of information to occur simultaneously. The above metaphor of computation matches the massively parallel and distributed nature of processing that occurs in the animal brain. 6 2.5. N e u r a l c o n s t r a i n t s W i t h nearly 1011 computing elements and 101'5 interconnections, the brain's capacity for encoding, communicating, and processing information is awesome and can easily support massively parallel processing. But if the brain is extremely powerfill, it is also extremely limited and imposes a number of rather strong computational constraints. First, neurons are slow computing devices. Second, they communicate relatively simple messages that can encode only a few bits of information. Hence a neuron's output cannot encode names, pointers, or complex structures. T The relative simplicity of a neuron's processing ability with reference to

~The importance of massive parallelism was discussed in the above terms in [17,18]. Several other researchers have also pointed out the significance of massive parallelism in AI. For example, see [47,11,12,32]. 7If we assume that information is encoded in the firing rate of a neuron then the amount of information that can be conveyed in a "message" would depend on AF, the range over which the firing frequency of a presynaptic neuron can vary, and AT, the window of time over which a postsynaptic neuron can "sample"

the needs of symbolic computation, and the restriction on the complexity of messages exchanged by neurons, impose strong constraints on the nature of neural representations and processes [6]. A specific limitation of neurally plausible systems is that they have difficulty representing composite structures in a dynamic fashion. Consider the representation of the fact give(John, Mary, a-Book). This fact cannot be represented dynamically by simply activating the nodes representing the roles giver, recipient, and give-object, and the constituents "John", "Mary", and "a-Book". Such a representation would suffer from cross-talk because it would be indistinguishable from the representation of give(Mary, John, a-Book). The problem is that this fact is a composite structure: it does not merely express an association between the constituents "John", "Mary", and "a-Book", rather it expresses a specific relation wherein each constituent fills a distinct role. Hence representing such a fact requires representing the appropriate bindings between roles and their fillers. It is easy to represent static (long-term) bindings using dedicated nodes and links (see Figure 1). For example, one could posit a separate "binder" node for each role-filler pair to represent role-filler bindings. Such a scheme is adequate for representing long-term knowledge because the required binder nodes may be created. This scheme however, is implausible for representing dynamic bindings arising during language understanding since these bindings have to be generated very rapidly--within a hundred milliseconds--and it is unlikely that there exist mechanisms for growth of new links within such time scales. An alternative would be to assume that interconnections between all possible pairs of roles and fillers already exist. These links normally remain "inactive" but the appropriate ones become "active" temporarily to represent dynamic bindings. This approach is also problematic because the number of all possible role-filler bindings is extremely large and will require a prohibitively large number of nodes and links. Techniques for representing role-filler bindings based on the von Neumann architecture cannot be used since they require communicating names or pointers of fillers to appropriate roles and vice versa. As mentioned above, the storage and processing capacity of nodes as well as the resolution of their outputs is not sufficient to store, process, and communicate names or pointers. As we shall see in Section 4, a t t e m p t s to solve representational problems such as the dynamic binding problem within a neurally constrained computational model lead to the identification of important constraints on the nature of reflexive reasoning.

the incident spike train. AT is essentially how long a neuron can "remember" a spike and depends on the time course of the postsynaptic potential and the ensuing changes in the membrane potential of the postsynaptic neuron. A plausible value of AF may be about 200. This means that in order to decode a message containing 2 bits of information, AT has to be about 15 msec, and to decode a 3-bit message it must be 35 about msec. One could argue that neurons may be capable of communicating more complex messages by using variations in interspike delays to encode information (e.g., see Strehler & Lestienne 1986). However, Thorpe and Imbert (1989) have argued that in the context of rapid processing, the firing rate of neurons relative to the time available to neurons to respond to their inputs implies that a presynaptic neuron can only communicate one or two spikes to a postsynaptic neuron before the latter must produce an output. Thus the information communicated in a message remains limited even if interspike delays are used as temporal codes. This does not imply that networks of neurons cannot represent and process complex structures. Clearly they can. The interesting question is how?

10

An instance

of gi

Figure 1. Static coding of bindings using binder nodes.

2.6.

Structured

connectionism

Structured Connectionist Models [6,22] are intended to emulate the information processing characteristics of the brain--albeit at an abstract computational level--and reflect its strengths and weaknesses. Arguably, the structured connectionist approach provides an appropriate framework for developing computational models that are constrained by the computational properties of the brain. Typically, a node in a connectionist network corresponds to an idealized neuron or a small ensemble of neurons, and a link corresponds to an idealized synaptic connection. The main computational features of structured connectionist models are as follows: 9 A structured connectionist model is a network of nodes and weighted links. 9 Nodes compute some simple functions of their inputs. 9 Nodes can only hold limited state information--while a node may maintain a scalar "potential", it cannot store and selectively manipulate bit strings. 9 Node outputs do not have sufficient resolution to encode symbolic names or pointers. 9 There is no central controller that instructs individual nodes to perform specific operations at each step of processing. 9 While links and link weights may change as a result of learning, they remain fixed during an episode of reflexive reasoning.

ll 2.7. M a p p i n g c o n n e c t i o n i s t m o d e l s to real machines The massively parallel structured connectionist models assume a ve.ry targe number of nodes and links, high fan-in and fan-out, and arbitrary interconnection patterns. These traits do not carry over to real machines. This shortcoming of real machines however, is offset by the fact that the processing speed and communication times of high performance platforms are several orders of magnitude faster than those assumed in connectionist models. Another important factor that facilitates the mapping of our models to real machines is the simplicity of messages exchanged by nodes. As we shall see, this allows us to leverage the active message facility provided by machines such as the Connection Machine CM-5 for low-latency interprocessor communication of short messages. Given the partial asymmetry in the strengths of connectionist models and existing hardware platforms, one needs to address several issues when mapping structured connectionist models to real machines. Some of these issues are the granularity of mapping, the coding of messages, processor allocation, and the tradeoff between load balancing and communication overhead. These issues have to be resolved based on a number of factors including the relative costs of communication, message handling, and computation, and the structural properties of the connectionist model. These issues are discussed in Section 5. 3. C S N - - A C O N N E C T I O N I S T S E M A N T I C N E T W O R K Several years ago we developed CSN, a connectionist semantic network [18] that solves a class of inheritance and recognition problems extremely fast--in time proportional to the depth of the conceptual hierarchy. In addition to offering computational effectiveness, CSN computes solutions to inheritance and recognition problems in accordance with a theory of evidential reasoning that derives from the principle of maximum entropy. The mapping between the knowledge level and the network level is precisely specified and, given a high-level specification of conceptual knowledge, a network compiler can generate the appropriate connectionist network. The solution scales because i) the time to answer questions only depends on the depth of the conceptual hierarchy, not on the size of the semantic memory, and ii) the number of nodes in the connectionist encoding is only linear in the number of concepts, properties, and property-value attachments in the underlying semantic network. Inheritance refers to the form of reasoning that leads an agent to infer property values of a concept based on the property values of its ancestors. For example, if the agent knows that "birds fly", then given that "Tweety is a bird", he may infer that "Tweety flies". Inheritance may be generalized to refer to the process of determining property values of a concept C, by looking up information directly available at C, and if such local information is not available, by looking up property values of concepts that lie above C in the conceptual hierarchy. Recognition is the dual of the inheritance problem. The recognition problem may be described as follows: "Given a description consisting of a set of properties, find a concept that best matches this description". Note that during matching all the property values of a concept may not be available locally. For this reason, recognition may be viewed as a very general form of pattern matching: one in which the target patterns are organized

12 in a hierarchy, and where matching an input pattern A with a target pattern Ti involves matching properties of A with local properties of Ti as well as with properties that Ti inherits from its ancestors. A principled treatment of inheritance and recognition is confounded by the presence of exceptions and conflicting information. Such information is bound to arise in any representation that admits default properties. Consider the following situation. An agent believes that most Quakers are pacifist and most Republicans are non-pacifist. She also knows that John is a Republican, Jack is a Quaker, and Dick is both a Quaker and a Republican. Based on her beliefs, it will be reasonable for the agent to conclude that John is, perhaps, a non-pacifist, and Jack is, perhaps, a pacifist. But what should the agent believe about Dick? Is Dick a pacifist or a non-pacifist? In [18,19] we proposed an evidential formalization of semantic networks to deal with such problematic situations. This formalization leads to a principled treatment of exceptions, multiple inheritance and conflicting information during inheritance, and the best match or partial match computation during recognition. The evidential formulation assumes that partial information about property values of concepts is available in the form of relative frequency distributions associated with some concepts. This information can bc treated as evidence during the processing of inheritance and recognition queries. Answering a query involves identifying relevant concepts and combining information (i.e., evidence) available at these concepts to compute the most likely answer. The method of estimating unknown relative frequencies using known relative frequencies is based on the principle of maximum entropy, and can be summarized as follows: If an agent does not know a relative frequency, he may estimate it by ascertaining the most likely state of the world consistent with its knowledge and use the relative frequency that holds in that world. Let us look at an informal example that illustrates the evidential approach. Consider the conceptual hierarchy shown in Figure 2 which is a generalization of the "Quaker example" mentioned above. The agent knows how the instances of some of the concepts are distributed with respect to their beliefs (pacifism or non-pacifism) and with respect to their ethnic origin (African or European). Answering an inheritance que .ry such as "Is Dick a pacifist or a non-pacifist" involves the following steps: 1. Determine the projection of the conceptual hierarchy with respect to the que .ry. The projection consists of concepts that lie above the concept mentioned in the que .ry and for which the distribution with respect to the property values mentioned in the query is known. Figure 3 shows the projected conceptual hierarchy for the example query "Is Dick a pacifist or a non-pacifist?" 2. If the projection has only one leaf, the question can be answered directly on the basis of the information available at the leaf. (In the case of our example que .ry however, the projection has two leaves Q U A K E R and R E P U B ) . 3. If the projection contains two or more leaves, combine the information available in the projection as follows: Combine information available at the leafs of the projection by moving up the projection. A common ancestor provides the reference frame for combining evidence

13 #PERSON= 200 #PERSON[has-belief,PACIFIST= 60 #PERSOM[has-belief,NON-PAC]= 140 :~RSsONN~':th~PERth'~ = 40

i

Propcrty:has-belie,f

Values:PACIFIST,NON-PAC

I

S~I

Property:has-ethEOi~O

~. ! ~ ~

I zo,~s

i

/

#MOR~ 50

I '~ #CHRIST- 60 #CHRIST[I~s-belief,PACIFISTI= 24 \#CHRIST[t~s-belief,NON-PAC]= 36

I ~_'! s" L

#QUAKER~as-belief,PACIFIST] = 7 ~_~

#QUAKE~NON-PAC] = 3

/

++ #REPUB= 80 #REPUB[has-belicf,PACIFIST]= 16 #REPUB[has-belief,NON-PAC]= 64 #REPUB[has-cth-org,AFRIC]= 5

# D E M O C = 120 #DEMOC[has-belid,PACIFIST] = 44

#REPUB[has-cth-orgJ~URO] = 75

#DEMOC[has-eth-org,EURO] = 85

#DEMOC[has-bclief,NON-PAC} = 76 #DEMOC[has-cth-org2~RIC] = 35

Figure 2. An example domain. #PERSON refer to the number of persons in the domain. #PERSON[has-belief, PACIFIST] refers to the number of persons who have the value pacifist for the property has-belief.

14

PERSON

]

................

Figure 3. The structure above the dotted line is the projection of the conceptual hierarchy (see previous Figure) that is relevant for determining whether Dick is a pacifist or a nonpacifist.

available at its descendents. The combination process is repeated until information from all the leaves in the projection is combined (at the root). In the example under discussion, the information available at QUAKER and REPUB would be combined at PERSON to produce the net evidence for Dick being a pacifist and Dick being a non-pacifist.

3.1. An evidential representation language Knowledge in the semantic memory is expressed in terms of a partially ordered set of concepts (i.e., an IS-A hierarchy of concepts) together with a partial specification of the property values of these concepts. The set of concepts is referred to as CSET, the partial ordering as <<, and the information about property values of a concept is specified using the distribution function ~;, where, 5(C, P) specifies how instances of C are distributed with respect to the values of property P. For example, 5(APPLE, has-color) may be {RED = 60, GREEN = 55, YELLOW = 23...}. Note that 5 is only a partial mapping. In terms of a traditional representation'language, knowing 5(C, P) amounts to knowing-explicitly--the values of P associated with C. For convenience we also make use of the # notation. Thus, #C[P, V] equals the number of instances of C that are observed by the agent to have the value V for property P. For example, given the above specification of 5(APPLE, has-color), #APPLE[has-color, RED] = 60. A generalization of the # notation is also used whereby #C[P1, V1][P2,V2]...[Pn,Vn] = the number of instances of C observed to have the value Vx for property P1, ... and value Vn for property pn. Observe that the representation language allows only property values to be exceptional, but not IS-A links: A concept is either an instance or subtype of another concept or it is

15 not; the << relation specifies this unequivocally. The notion of exceptions applies only to property values, and even here exceptions do not entail "'cancellation" or "blocking" of properties. This leads to a clear semantics of conceptual taxonomies while still allowing exceptional and conflicting information. In terms of the above notation, the inheritance and recognition problems may be stated as follows: I n h e r i t a n c e : Given a concept C and a property P, find I1/, the most likely value of property P for concept C. In other words, find Vi such that the most likely value of =/:/:C[P,V~] equals or exceeds the most likely value of #C[P, ~] for any other value ~ of P. R e c o g n i t i o n : Given a set of concepts, Z = {C1, C2, ...Ca}, and a description consisting of a set of property value pairs, i.e., DESCR = {[P1, V1], [P2, V2], ...[Pm, Vm]}, find a Ci such that relative to the concepts specified in Z, (7/is the most likely concept described by DESCR. In other words, find Ci such that the most likely value of: #Ci[P1, Yl], [/92, Y2], ...[Pm, Vm] exceeds the most likely value of: :#:Ci[P1, V1], [/92, V2], .-.[Pro, Vm] for any other C~- If z = {APPLE, GRAPE}, DESCR = {[has-color, RED],[hastaste, SWEET]} then the recognition problem may be paraphrased as: "Is something red in color and sweet in taste more likely to be an apple or a grape?" 3.2. C o n n e c t i o n i s t e n c o d i n g Figure 4 shows a fragment of a connectionist network that encodes some of the knowledge associated with the example shown in Figure 2. The network fragment only shows the encoding of: 9 Dick is a Quaker and a Republican. 9 Most Quakers have pacifist beliefs. 9 Most Republicans have non-pacifist beliefs. Furthermore, only nodes and links involved in computing inheritance are shown. It is assumed that one of the properties attached to persons is "has-belief", some of whose values are "pacifist" and "non-pacifist". The information about the relative frequencies of Quakers and Republicans with respect to pacifism and non-pacifism are encoded as weights on appropriate links. The encoding employs five distinct node types. These are the concept nodes, property nodes, binder nodes, relay nodes, and enable nodes. With reference to Figure 4, all solid boxes denote concept nodes, all triangular nodes denote binder nodes, and the single dashed box denotes a property node. Relay and enable nodes are not shown in Figure 4. A node is characterized by a state--either a c t i v e or inert, a potential--a real-value in the range [0,1], an output value--also in t h e r a n g e [0,1], and the functions P, Q, and V that define the values of potential, state and output at time t + 1, based on the values of potential, state and inputs at time t. For simplicity of description, we assume that nodes have multiple input sites, and incoming links are connected to specific sites (e.g., HCP, RELAY, CP, E, and EC in Figure 4). Each site has an associated site-function. These

16

BELIEF RELAY

#NON-PAC

#PACIFIST

#BELIEF

#BELIEF

#QUAKER[hb,PACI~ST]

RELAY

RELAY

NON-PAC

cP P A C I F I S T

cP

HCP

HCP #PERSON[hb.PACIFIST] #PACIFIST

PERSON RELAY

QUAKER

REPUB

RELAY

RELAY

RELAY

has-belief ....................

DICK

Figure 4. A fragment of the connectionist encoding of the example domain is depicted. The site E N A B L E has been abbreviated to E. Quantities along links denote weights. The property "has-belief" has been abbreviated to "hb" in the specification of weights. Only the weights on links from binder nodes to other concept and binder nodes pertaining to PACIFIST are shown. Analogous weights exist for NON-PACIFIST. There are links from has-belief to the E N A B L E site of each binder node in the network fragment. Only some of these links are shown.

17 functions carry out local computations based on the inputs incident at the site, and it is the result of this computation that is processed by the functions P, V and Q. The network must perform very specific computations in order to solve the inheritance and recognition problems and it must do so without the intervention of a central controller. The desired behavior is realized by introducing binder and relay nodes which mediate the spread of activation and provide a distributed control mechanism. In particular, relay nodes control the directionality of spreading activation along the IS-A links (top-down or bottom-up) while binder nodes control the flow of activation between property nodes, concept nodes, and property value nodes. The network is designed automatically by a network compiler that takes as input a high level specification of the knowledge to be encoded in the network. The syntax of the specification language parallels the representation language described above. In [18] we identify specific constraints that must be satisfied by the conceptual structure in order to achieve an efficient connectionist realization. For example, one of the constraints specifies that if the network is to correctly inherit the value of property P of a concept C, then the projection of the conceptual structure with respect to C and P must be a tree. Another constraint imposes a uniformity requirement on property value attachments in the conceptual structure and suggests that the conceptual hierarchy must comprise of several alternate "views" of the underlying concepts. A detailed discussion of these constraints is beyond the scope of this chapter.

3.3. Description of network behavior Nodes in the network can be in one of two states: active or inert. A node switches to an active state under conditions specified below and transmits an output equal to its potential. The computational characteristics of the node types shown in Figure 4 are as follows: Concept nodes: State: Node is in active state if it receives one or more inputs. Potential: If no inputs at site H C P then potential = the product of inputs at sites Q U E R Y , R E L A Y , C P and P V divided by the product of inputs at site I N V . else potential = t h e product of inputs at sites Q U E R Y , R E L A Y and H C P Binder nodes: State: Node is in active state if and only if it receives all the three inputs at site E N A B L E . Potential: If state = active then potential = product of inputs at E C else potential = 0

Property nodes switch to active state if they receive input at site Q U E R Y , and in this state their potential always equals 1.

18 3.4. T w o e x a m p l e s We illustrate the behavior of networks and the nature of inferences drawn by them with the help of two examples based on the knowledge encoded in Figure 2. Observe that there are two properties has-bel (has-belief) with values PACIFIST (pacifist) and NON-PAC (non-pacifist), and has-eth-org (ethnic-origin) with values AFRIC (African) and EURO (European). In qualitative terms, the distributional information is as follows. 9 Most persons are non-pacifist. 9 Most quakers are pacifist. 9 Most republicans are non-pacifist. 9 Most persons are of European descent. 9 Most republicans are of European descent. 9 Most persons of African descent are democrats. This information is encoded in a network in a manner analogous to that shown in Figure 4 with the distributional information encoded as weights on appropriate links. Queries to CSN are posed by activating appropriate nodes and answers are obtained by examining the level of activation (potential) of relevant nodes. The first example demonstrates how the network performs inheritance in the presence of conflicting information arising due to "multiple inheritance". Consider the inheritance query: "Is Dick a pacifist or a non-pacifist?" This question is posed to the network by activating the nodes DICK, has-belief, and BELIEF, and enabling the IS-A links from subconcepts to superconcepts. The resulting potentials of the nodes PACIFIST and NON-PAC determine whether Dick is more likely to be a pacifist or a non-pacifist. It can be shown that the potential of the node PACIFIST will equal: #QUAKER[has-bel, PACIFIST] x #REPUBLICAN[has-bel, PACIFIST] # B E L I E F x #PERSON[has-bel, PACIFIST] and that of the node NON-PAC will equal: #QUAKEa[has-bel, NON-PAC] x #REPUBLICAN[has-bel, NON-PAC] # B E L I E F x #PERSON[has-bel, NON-PAC] Ignoring the common factor, #BELIEF, in the above expressions, the potential of PACIFIST corresponds to the best estimate of the number of persons that are both quakers and republicans and believe in pacifism. Similarly, the potential of NON-PAC corresponds to the best estimate of the number of persons that are both quakers and republicans and believe in non-pacifism (for a justification, refer to [Shastri 88]). Hence, a comparison of the two potentials will give the most likely answer to the question: Is Dick a pacifist or a non-pacifist? The normalized potentials of PACIFIST and NON-PAC as a result of this query are 1.00 and 0.66 respectively. Thus, on the basis of the available information, Dick who is a republican and a quaker is more likely to be a pacifist than

19 a non-pacifist; the ratio of likelihoods being about 3:2. Similar queries for RICK, PAT, and SUSAN lead to the following results: Rick who is a Mormon republican is more likely to be a non-pacifist. The ratio of pacifist versus non-pacifist for Rick being 0.39 versus 1.00. Pat who is a Mormon democrat is also more likely to be a non-pacifist, but only marginally so (0.89 versus 1.00). Finally, Susan who is a quaker democrat is likely to be a pacifist with a ve.ry high probability (1.00 versus 0.29). As an example of recognition, consider the query: "among Dick, Rick, Susan, and Pat, who is more likely to be a pacifist of African descent?" The query is posed by activating the property nodes has-belief and has-ethnic-origin and the nodes PACIFIST and AFRIC, and enabling the top-down IS-A links. The resulting normalized potentials are for SUSAN 1.00, for PAT 0.57, for DICK 0.11, and for RICK 0.05. Susan who is a democrat and a quaker, best matches the description "person of African descent with pacifist beliefs". The least likely person turns out to be Rick. Observe that Rick is neither a democrat many of whom are of African origin, nor is he a q u a k e r - many of whom are pacifist.

4. S H R U T I - - A C O N N E C T I O N I S T

MODEL OF REFLEXIVE REASONING

Over the past years we have developed a connectionist knowledge representation and inference system, SHRUTI, that can encode a large number of facts and rules involving variables and quantifiers, and a type hierarchy, and perform a class of first-order inferences with extreme efficiency [1,23,15]. The time taken by SHRUTI t o perform an inference is just proportional to the depth of inference and is otherwise independent of ILTKB I (the size of the knowledge base). If the values of various system parameters are set to biologically plausible values, it can be shown that SHR.UTI can encode millions of rules and facts and yet draw interesting inferences in a few hundred milliseconds. Inference in SHRUTI can be viewed as the transient but systematic flow of rhythmic patterns of activity, where each phase in the rhythmic activity corresponds to a distinct entity involved in the reasoning episode and where variable bindings are represented as the synchronous firing of appropriate argument and object nodes. In SHRUTI a rule is an interconnection pattern that causes the propagation and transformation of rhythmic patterns of activity and a fact acts as a temporal pattern matcher that becomes active when it detects that the static bindings it encodes are present in the rhythmic pattern of activity. Emerging neurophysiological data suggests that the basic mechanisms proposed for representing and propagating dynamic variable bindings, namely, the propagation of rhythmic patterns of activity and the synchronous activation of nodes, exist in the brain and appear to play a role in the representation and processing of information [25]. By generalizing the results of our work on SHRUTI we have identified a class of inferences that can be performed by a parallel network whose size is linear in ILTKBI, in time that is proportional to the depth of inference and is otherwise independent of ILTKBI. This generalization relaxes the bound on the number of times a predicate may be instantiated during a derivation. Below we describe the form of knowledge that may be encoded in the network and offer a formal characterization of reflexive inference. Next we specify the computational model and give a brief overview of SHRUTI.

20 4.1. F o r m of rules, facts, a n d queries S H R U T I encodes knowledge in terms of entities, types (classes of entities), the memberof relation between entities and types, sub- and super-type relations between types, n-ary predicates, facts and rules. The sub- and super-type relations define a partial ordering of types and thus types are organized in the form of a directed acyclic graph (we will refer to it as the T-DAG). It is assumed that there exists a unique type that is a super-type of all types. For convenience we view entities as the leaves of the T-DAG. Rules have the following form: 3Xl:Xl,... , xp:Xp Vyl:Vl,..., yr:]~r [ P l ( . . . ) A ' " A Pn (. . .) =V 3Ul, . . . , ut Q ( . . .)] wherein an argument of Pi can be an entity or one of the variables xi and yi. An argument of Q can be an entity or one of the variables xi, y~, and ui. )(is and Yis are types and specify restrictions on the bindings of variables. Facts have the form: 3xl:X1,...

, x r : X r Vyl:Y1,..., Ys:Y8 P ( . . . )

where arguments of P are either entities or variables xi and Yi. Universally quantified variables are assumed to be distinct (i.e., repeated universally quantified variables are not supported). Observe that facts include ground atomic formulas as well as quantified assertions of the type "All permanent employees receive a bonus" and "there exists an employee who is the boss of all other employees". The form of queries is similar to that of facts except that whereas repeated universally quantified variables can occur in queries, existentially quantified variables are assumed to be distinct. Recently, we have extended SHRUTI to allow negative literals to occur in facts, queries, and rules [24]. The representation language shares the function-free property of Datalog [30] but differs from it in a number of ways. For example, unlike Datalog, our language does not force a dichotomy between extensional and intensional relations and allows both intensional as well as extensional relations to occur in the head of a rule. Thus the language allows rules to define new views (as in Datalog) as well as specify integrity constraints. Furthermore, the occurrence of extensional relations in the head of a rule also allows our system to reformulate a query into a number of alternate, but semantically related queries. While our language is more general than Datalog in its treatment of relations, it does impose a restriction on the form of rules (see below). 4.2. T h e class of reflexive inference A characterization of the class of reflexive inferences is provided in [21]. This characterization is facilitated by the following definitions: [] Any variable that occurs in multiple argument positions in the antecedent of a rule is a p i v o t a l variable.

[] A rule is balanced if all pivotal variables occurring in the rule also appear in its consequent. Observe that rules that do not contain pivotal variables are also balanced.

21 [] Let a derivation of a query Q obtained via backward chaining be called threaded, if all pivotal variables occurring in the derivation get bound as a consequence of bindings introduced in Q. [] Given a type hierarchy, rules, and facts, any que .ry that has a threaded derivation is called a reflexive que .ry. It can be shown that the worst case time for computing a yes answer to a reflexive yes-no que .ry Q is proportional to IInl yd, where: IInl is the number of distinct bindings specified in Q, V is the maximum arity of predicates in L T K B , and d equals the depth of the derivation of Q. Observe that V can be treated as a constant and hence, the worst case time required to answer a reflexive query is i) only proportional to d, ii) polynomial in IInl, but iii) independent of ILTKBI. An answer to a wh-que .ry can be computed in time proportional to IInl VD, except that IInl now equals the arity of the que .ry predicate Q, and D equals the greater of the two: the depth of the T-DAG and the diameter of the inferential dependency graph induced by the rules in L T K B . In [23] we had conjectured that when performing reasoning via backward chaining, any network model whose size is linear in ILTKBI and which computes inferences in time independent of ILTKBI must limit itself to answering reflexive queries involving balanced rules. A recent result [3] establishes a lower bound of f~(log n) on the time required for inferences that violate this restriction and thus provides a proof of our conjecture.

4.3. The computational model Our computational model is a network of nodes connected via weighted links. The model makes use of three node types each of which performs a simple computation and communicates via simple messages. p - b t u nodes: A p-btu node with threshold n becomes active upon receiving n synchronous inputs. In particular, a p-btu node B with threshold 1 receiving a (periodic) spike train from a p-btu node A will become active and produce a (periodic) spike train that is in-phase with the spike train it receives from A. 8 r - a n d n o d e s : A T-and node with a threshold of 1 becomes active on receiving an uninterrupted pulse of width >_ 7rmax. In other words, a T-and node behaves like a temporal and. Upon becoming active, such a node produces an output pulse similar to the input pulse. A threshold, n, associated with a T-and node indicates that the node will fire only if it receives n or more synchronous pulses. T-or n o d e : A T-Or node with threshold n becomes active on receiving n or more inputs during an interval 7r,nax. Upon becoming active, a T-Or node produces an output pulse of width 7rmax. Thus a T-or node behaves like a temporal or. The model also makes use of inhibitory modifiers that can block the flow of activation along a l i n k w a pulse propagating along an inhibitory modifier will block a synchronous pulse propagating along the link it impinges upon.

SThe "btu" in p-btu stands for binary threshold unit.

22

from John from Mary from a-Book

give

buy

from Mary from a-Ball

John

9

Mary

0

a-Book

O

s-Ball

0 csn-sell

Figure 5. An example encoding of rules and facts.

4.4. E n c o d i n g of rules a n d facts We discuss a simple example to illustrate how S H R U T I encodes rules and facts and performs inference. We suppress details pertaining to the mapping of multiple antecedent rules, the type hierarchy, and the dynamic representation of multiple instantiations of a predicate. A detailed description can be found in [23,15]. The network shown in Figure 5 encodes the following rules and facts:

1. Yx, y, z give(x,y,z) ~ own(y,z)

2. w , y buy(~,y)~ o~n(~,y)

4. give(John,Mary, a-Book) 5. 3 (x) buy(John, x) 6. own(Mary, a-Ball). The encoding makes use of two types of nodes mentioned above: p-btu nodes (depicted as circles) and v-and nodes (depicted as pentagons). The encoding of more complex rules

23

c:can-sell

f--W-V-"g f--V-W--T-W

c:own

c:give F1

e:give

I I I I I I I I I I I I

g-obj

recip 9 :buy

I I I I I I I I I I I I

b-obj buyer' e:ownl

I I I I I I I I I I I I I I

o-objl

owner e:can-sell a-Book

I I I I I I I I

=

cs-obj

input to e:can-sell ~

,.pu, to cs~jlt input to p-seller

u

I I I I I I ! 1 I I 1 1 I I I I

V

il

II

II

I I I I

I

I

Iiiiiiiii1111

I I I I I I I I I I I I I I o

1

2

3

4

5

6

7

8

9

time

Figure 6. Activation trace for the query can-sell(Mary, a-Book)?.

also makes use of the r-or nodes mentioned above. In Figure 5, inhibitory modifiers are shown as links ending in dark blobs. Each entity in the domain is encoded by a p-btu node. This acts as a "focal" node for this entity. The features of an entity and the roles it fills in various relations are encoded by linking its focal node to appropriate nodes in the network. Each n-ary predicate P is encoded by a "focal" cluster consisting of a pair of r-and nodes and n p-btu nodes, one for each of its n arguments. One of the r-and nodes is referred to as the enabler, e:P, and the other as the collector, c:P. In Figure 5 enablers point upward while collectors point downward. The enabler e:P becomes active whenever the system is being queried about P. On the other hand, the system activates the collector c:P of a predicate P whenever the system wants to assert that the current dynamic bindings of the arguments of P follow from the knowledge encoded in the system. All rules and facts pertaining to a predicate are represented by linking its focal cluster with other focal clusters and nodes as explained below. A rule is encoded by connecting the collector of the antecedent predicate to the collector of the consequent predicate, the enabler of the consequent predicate to the enabler of the antecedent predicate, and by connecting the arguments of the consequent predicate to the arguments of the antecedent predicate in accordance with the correspondence between

24 these arguments specified in the rule. A fact is encoded using a T-and node that receives an input from the enabler of the associated predicate. This input is modified by inhibitor. modifiers from the argument nodes of the associated predicate. If an argument is bound to an entity in the fact then the modifier from such an argument node is in turn modified by an inhibitory modifier from the appropriate entity node. The output of the T-and node is connected to the collector of the associated predicate. 4.5. T h e I n f e r e n c e P r o c e s s Posing a query to the system involves specifying the que .ry predicate and the argument bindings specified in the query. This is done as follows" Choose an arbitrary point in time--say, t0--as the point of reference for initiating the query (it is assumed that the system is in a quiescent state). The query predicate is specified by activating the enabler of the que .ry predicate with a pulse train of width and periodicity 7r starting at time to. The argument bindings specified in the query are communicated to the network as follows: Let the argument bindings in the query involve n distinct entities: Cl,...,c,~. With each ci, associate a delay 5i such that no two delays are within w of one another and the longest delay is less than 7 r - w. Here w is the allowable jitter (or lead/lag) between synchronously firing nodes, and 7r is the period of oscillation. Each of these delays may be viewed as a distinct phase within the period to and t0+Tr. Now the argument bindings of an entity c/are indicated to the system by providing an oscillatory spike train of periodicity 7r starting at to + (f/, to c/and all arguments to which c/is bound. This is done for each entity c/ (1 _< i < n) and amounts to representing argument bindings by the in-phase or synchronous activation of the appropriate entity and argument nodes. We illustrate the reasoning process with the help of an example. Consider the que .ry cansell(Mary, a-Book) (i.e., "Can Mary sell a-Book?") This que .ry is posed by providing inputs to the entities Mary and a-Book, the arguments p-seller, cs-obj and the enabler e:can-sell, as shown in Figure 6. Observe that Mary and p-seller receive in-phase activation and so do a-Book and cs-obj. Let us refer to the phase of activation of Mary and a-Book as Pl and p2 respectively. As a result of these inputs, Mary and p-seller fire synchronously in phase pl of every period of oscillation, while a-Book and cs-obj fire synchronously in phase P2 of every period. The node e:can-sell also fires and generates a pulse train of width 7r. The activations from the arguments p-seller and cs-obj reach the arguments owner and o-obj of the own predicate, and consequently, starting with the second period, owner and o-obj become active in pl and p2, respectively. At the same time, the activation from e:can-sell activates e:own. At this time, the system has essentially, created dynamic bindings for the arguments of predicate own and Mary has been bound to owner, and a-Book has been bound to o-obj. These newly created bindings in conjunction with the activation of e:own can be thought of as encoding the query own(Mary, a-Book) (i.e., ~"Does Ma.ry own a-Book?")! The v-and node associated with the fact own(Mary, a-Ball) does not match the que .ry and remains inactive. The activations from owner and o-obj reach the arguments recip and g-obj of give, and buyer and b-obj of buy respectively. Thus beginning with the third period, arguments recip and buyer become active in Pl, while arguments g-obj and b-obj become active in p2. In essence, the system has created new bindings for the predicates give and buy that together with the activation of the enabler nodes e:give and e:own can

25 be thought of as encoding two new queries: give(x,Mary, a-Book) (i.e., "Did someone give Maw a-Book?"), and buy(Mary, a-Book). Observe that now the T-and node associated with the fact give(John, Mary, a-Book)-this is the T-and node labeled F1 in Figure 5--becomes active as a result of the uninterrupted activation from e:give. Observe that the inhibitory inputs from recip and g-obj are blocked by the in-phase inputs from Mary and a-Book, respectively. The activation from the T-and node F1 causes c:give, the collector of give, to become active. The output from c:give in turn causes c:own to become active and transmit an output to c:can-sell. Consequently, c:can-sell, the collector of the query predicate can-sell, becomes active (refer to Figure 6) resulting in an affirmative answer to the query can-sell(Mary, a-Book)?. Conceptually, the proposed encoding of rules creates a directed inferential dependency graph: Each predicate argument is represented by a node in this graph and each rule is represented by links between nodes denoting the arguments of the antecedent and consequent predicates. In terms of this conceptualization, it should be easy to see that the evolution of the system's state of activity corresponds to a parallel breadth-first traversal of the directed inferential dependency graph. This means that i) a large number of rules can fire in parallel and ii) the time taken to generate a chain of inference is independent of the total number of rules and just equals 17r where l is the length of the chain of inference and 7r is the period of activity. The example discussed above assumed that each predicate was instantiated at most once during the inference process. In the general case, where a predicate may be instantiated several times during an episode of reasoning, the time required for propagating bindings from a consequent predicate to antecedent predicate(s) is proportional to kTr, where k is the number of dynamic instantiations of the antecedent predicate to which the bindings are being propagated.

4.6. Constraints and predictions SHRUTI identifies a number of representational and processing constraints on reflexive processing in addition to the constraints on the form of rules discussed above. These relate to the capacity of the "working memory" underlying reflexive processing and the bound on the depth of reasoning.

Working memory underlying reflexive processing: Dynamic bindings, and hence, dynamic (active) facts are represented in SHRUTI as a rhythmic pattern of activity over nodes in the LTKB network. In functional terms, this transient state of activation holds information temporarily during an episode of reflexive reasoning and corresponds to the working memory underlying reflexive reasoning (WMRR). Note that WMRR is just the state of activity of the LTKB network and not a separate buffer. Also note that the dynamic facts represented in the W M R R during an episode of reflexive reasoning should not be confused with the small number of short-term facts an agent may overtly keep track of during reflective processing and problem solving. W M R R should not be confused with the short-term memory implicated in various memory span tasks [2]. In our view, in addition to the overt working memory, there exist as many "working memories" as there are major processes in the brain since a "working memory" is nothing but the state of activity of a network. SHRUTI predicts that the capacity of W M R R is very large but at the same time it is constrained in critical ways. Most proposals characterizing the capacity of the working

26

memo~, underlying cognitive processing have not paid adequate attention to the structure of items in the working memo .ry and their role in processing. Even recent proposals such as [10] characterize working memory capacity in terms of "total activation". In contrast, the constraints on working memory capacity predicted by SHRUTI depend not on total activation but rather on the maximum number of distinct entities that can participate in dynamic bindings simultaneously, and the maximum number of (multiple) instantiations of a predicate that can be active simultaneously. B o u n d on t h e n u m b e r of d i s t i n c t entities referenced in W M R R : During an episode of reflexive reasoning, each entity involved in dynamic bindings occupies a distinct phase in the rhythmic pattern of activity. Hence the number of distinct entities that can occur as role-fillers in the dynamic facts represented in the working memory cannot exceed 7r,.~a~/w where 7rm~ is the maximum delay between two consecutive firings of cell-clusters involved in synchronous firing and ~v equals the width of the window of synchrony--i.e., the maximum allowable lead/lag between the firing of synchronous cell-clusters. If we assume that a neurally plausible value of ~rm~ is about 30 milliseconds and a conservative estimate of w is around 6 milliseconds, we are led to the following prediction: As long as the number of distinct entities referenced by the dynamic facts in the working memo .ry is five or less, there will essentially be no cross-talk among the dynamic facts. If more entities occur as role-fillers in dynamic facts, the window of synchrony w would have to shrink appropriately in order to accommodate all the entities. As w shrinks, the possibility of cross-talk between dynamic bindings would increase until eventually, the cross-talk would become excessive and disrupt the system's ability to perform systematic reasoning. The exact bound on the number of distinct entities that may fill roles in dynamic facts would depend on the largest and smallest feasible values of 7rm~ and ~v, respectively. However we can safely predict that the upper bound on the maximum number of entities participating in dynamic bindings can be no more than 10 (perhaps less). B o u n d on t h e m u l t i p l e i n s t a n t i a t i o n of r e l a t i o n s : The capacity of WMRR is also limited by the constraint that at most k dynamic facts pertaining to each relation may be active at any given time (recall that the total number of active dynamic facts can be very high). In general, the value of k need not be the same for all relations; some critical relations may have a higher value of k while some other relations may have a smaller value. The cost of maintaining multiple instantiations turns out to be significant in terms of space and time. For example, the number of nodes required to encode a rule for backward reasoning is proportional to k 2. Thus a system that can represent three dynamic instantiations of each relation may have up to nine times as many nodes as a system that can only represent one instantiation per relation. Furthermore, the worst case time required for propagating multiple instantiations of a relation also increases by a factor of k. In view of the additional space and time costs associated with multiple instantiation, and given the necessity of keeping these resources within bounds in the context of reflexive processing, we predict that the value of k during reflexive reasoning is quite small, perhaps no more than 3. B o u n d on t h e d e p t h of t h e c h a i n of reasoning: Consider the propagation of synchronous activity along a chain of role ensembles during an episode of reflexive reasoning. Two things might happen as activity propagates along the chain of role ensembles. First,

27 the lag in the firing times of successive ensembles may gradually build up due to the propagation delay introduced at each level in the chain. Second, the dispersion within each ensemble may gradually increase due to the variations in the propagation delay of links and the noise inherent in synaptic and neuronal processes. While the increased lag along successive ensembles will lead to a "phase shift", and hence, binding confusions, the increased dispersion of activity within successive ensembles will lead to a gradual loss of binding information. Increased dispersion would mean less phase specificity, and hence, more uncertainty about the role's filler. Due to the increase in dispersion along the chain of reasoning, the propagation of activity will correspond less and less to a propagation of role bindings and more and more to an associative spread of activation. For example, the propagation of activity along a chain of rules such as: 1='1(x, y, z) =v P2(x, y, z) =~ . . . Pn(x, y, z) due to a dynamic fact Pl(a, b, c) may lead to a state of activation where all one can say about Pn is this: there is an instance of Pn which involves the entities a, b, and c, but it is not clear which entity fills which role of Pn. In view of the above, it follows that the depth to which an agent may reason during reflexive reasoning is bounded. Thus an agent would be unable to make a prediction (or answer a query)--even when the prediction (or answer) logically follows from the knowledge encoded in the L T K B - - i f the length of the derivation leading to the prediction (or the answer) exceeds this bound. The actual value of this bound depends on values of appropriate physiological parameters. At this time we do not have the relevant data to arrive at a precise value, but we expect this bound to be rather low. Henderson [9] has developed an on-line parser for English using a SSRVTI-like architecture whose speed is independent of the size of the grammar and which can recover the structure of arbitrarily long sentences as long as the dynamic state required to parse the sentence does not exceed the capacity of the parser's working memory. The parser models a range of linguistic phenomena and shows that the constraints on the parser's working memory help explain several properties of human parsing involving long distance dependencies, garden path effects and our limited ability to deal with center-embedding. This suggests that the working memory constraints resulting from SHRUTI have implications for other rapid processing phenomena besides reasoning. 5. M A P P I N G

SHRUTI ONTO REAL MACHINES

Several aspects of SHRUTI suggest that a knowledge representation and reasoning system obtained by mapping SHRUTI onto real machines would be extremely efficient. These include some basic features of structured connectionism as well as constraints on rules and derivations resulting from SHRUTI. As discussed in Section 4, SHRUTI is a structured connectionist model. So it only requires nodes that perform simple computations. Second, unlike neural network models such as multilayer back-propagation networks and Hopfield nets, a SHRUTI network is sparsely connected. So even though a node may be connected to a large number of nodes, it is connected only to a relatively small percentage of nodes in the network. Consequently, only a fraction of the total number of nodes and links in the network participate in any update step. The most important source of SHRUTI'S efficiency however, is that it imposes several

28 constraints on the form of rules and derivations. These constraints were discussed in Sections 4.6 and 4.2. 5.1. M a p p i n g SHRUTI onto parallel machines In order to derive maximum benefit from the parallelism inherent in structured connectionist models, the mapping granularity must be tailored to the computational capabilities of the processors. For most real machines with relatively powerfill processors, knowledgelevel mapping provides a conceptually simple and flexible partitioning scheme. In this scheme, the knowledge base is partitioned at the relatively coarse level of knowledge elements like predicates, concepts, fact, rules and type hierarchy relations. The simplicity of the messages exchanged between nodes supports the use of interprocessor communication schemes which can handle short message packets very efficiently; complex messages and communication protocols are unnecessary. In particular, the information exchanged by nodes in SHRUTI lies in the synchronization - - o r lack t h e r e o f of converging spike trains. Given that nodes in our model are required to discriminate only among a small number of distinct phases, the necessary information can be encoded within a few bits. Consequently, information pertaining to a complete knowledge-level element can be encoded within a small message.

5.1.1. Exploiting constraints imposed by

SHRUTI

The constraints which SHRUTI imposes on the form of rules and type of inferences translates into bounds on system resources and time needed for a reasoning episode-thereby leading to a fast and efficient parallel implementation. These aspects are discussed below: 9 The form of rules and facts that can be encoded is constrained. S H R U T I attains its tractability from this fundamental constraint [21,3] which simplifies the network encoding of the knowledge base and makes it possible to perform efficient inference using spreading activation. 9 The number of distinct entities that can participate in an episode of reasoning is bounded. This restricts the number of active entities, and hence, the amount of information contained in an instantiation. In turn, this limits the amount of information that must be communicated between predicates. 9 Entities and predicates can only represent a limited number of dynamic instantiations. This constrains both space and time requirements. 9 The depth of inference is bounded. This constrains the spread of activation in the network and therefore directly affects response time and resource usage. In mapping S H R U T I onto parallel machines, we exploit these constraints to the fullest extent in order to achieve efficient resource usage and rapid responses with large knowledge bases. Of course, if any of these constraints can be relaxed without paying a severe performance penalty, we would like to obtain a more powerful system by relaxing these constraints. An example of this is the constraint on the number of instantiations of any predicate that can be active simultaneously. Based upon biological considerations, S H R U T I

29 places a limit of around 3 on this number. In our experimentation with the mapping of SHRUTI on the CM-5 we found that the limit could be raised without a major slowdown in inference times.

5.1.2. O t h e r c o n s i d e r a t i o n s The following assumptions also influenced the choices made in mapping machines:

SHRUTI

to real

9 Since the knowledge representation system should support any well-formed que .ry, the source of initial activation and the depth of derivation are unknown. In view of this, it is critical to focus on good average performance. 9 Since episodes of reasoning are expected to be rather brief, dynamic load balancing on a parallel machine is infeasible. Therefore, the static distribution of the knowledge base should be such that it leads to good dynamic load balancing on average. 9 Since the system has to reason with very large knowledge bases, the network size will be large. 5.2. SHRUTI-CM5 In this section, we briefly describe the design and implementation of S H R U T I - C M 5 , a n SPMD asynchronous message passing parallel reflexive reasoning system developed on the Connection Machines CM-5. A more detailed description of SHRUTI-CM5 can be found in [14]. 5.2.1. T h e C o n n e c t i o n M a c h i n e C M - 5 The Connection Machine model CM-5 [27] is an MIMD machine consisting of anywhere from 32 to 1024 powerful processors. 9 Each processing node is a general-purpose computer which can execute instructions autonomously and perform interprocessor communication. Each processor can have up to 32 megabytes of local memory 1~ and optional vector processing hardware. The processors constitute the leaves of a fat tree interconnection network, where the bandwidth increases as one approaches the root of the tree. A low-latency control network provides tightly coupled communications including synchronization, broadcasting, global reduction and scan operations. A high bandwidth data network provides loosely coupled interprocessor communication. The virtual machine emerging from a combination of the hardware and operating system consists of a control processor acting as a partition manager, a set of processing nodes, facilit~c~ for interprocessor communication and a CNTX-like programming interface. A typical user task consists of a process running on the partition manager and a process running on each of the processing nodes. Though the basic architecture of the CM-5 supports MIMD style programming, it is most often used to run SPMD (Single Program Multiple Data) style programs [29]. Both data parallel (SIMD) and message-passing programming on the CM-5 use the SPMD 9In principle, the CM-5 architecture can support up to 16K processors. 1~ amount of local memory is based on 4-Mbit DRAM technology and may increase as DRAMdensities increase.

30

model. If the user program takes a primarily global view of the system--with a global address space and a single thread of control--and processors run in synchrony, the operation is data parallel; if the program enforces a local, node-level view of the system and processors function asynchronously, the machine is used in a more MIMD fashion. We shall consistently use "SPMD" to be synonymous with the latter mode of operation. In this mode, all communication, synchronization and data layout are under the programs' explicit control. 5.2.2. T h e d e s i g n of SHRUTI-CM5 We outline the design and functionality of SHRUTI-CM5. A detailed discussion and justification of the design choices may be be found in [14]. T h e K n o w l e d g e Base. SHRUTI-CM5 supports all of SHRUTI'S legal rules, facts and type hierarchy relations. The type hierarchy can encode both is-a relations (which explicate the subconcept-superconcept relations between entities) and labeled relations which specify that two entities are related by a relation R. Q u e r i e s . SHRUTI-CM5 supports all the legal queries described in Section 4.1 and this includes queries posed to the rule base and/or the type hierarchy. With an appropriate front-end, the system can handle multiple queries, logical combinations of queries, and other variations. G r a n u l a r i t y of M a p p i n g . The individual processing elements on the CM-5 are fullfledged SPARC processors. A subnetwork in the connectionist model can therefore be implemented on a processor using appropriate data structures and associated procedures without necessarily mimicking the detailed behavior of individual nodes and links in the subnetwork. We therefore use knowledge-level partitioning in mapping SHRUTI onto the CM-5. This decision is also motivated by the fact that all the information pertaining to a predicate cluster can be encoded within a single CM-5 active message (see below). A c t i v e M e s s a g e s a n d C o m m u n i c a t i o n . SHRUTI-CM5 uses CMMD library functions [28] for broadcasting and synchronization, while almost all interprocessor communication is achieved using CMAML (CM Active Message Library) routines. CMAML provides efficient, low-latency interprocessor communication for short messages [28,31]. Active messages are asynchronous (non-blocking) and have very low communication overhead. A processor can send off an active message and continue processing without having to wait for the message to be delivered to its destination. When the message arrives at the destination, a handler procedure is automatically invoked to process the message. The use of active messages improves communication performance by about an order of magnitude compared with the usual send/receive protocol. The main restriction on such messages is their size--they can only carry 16 bytes of information. However, given the constraints on the number of entities involved in dynamic bindings (~ 10), there is an excellent match between the size of an active message and the amount of variable binding information that needs to be communicated between predicate instances during reasoning as specified by SHRUTI. SHRUTI-CM5 exploits this match to the fullest extent. P r o c e s s o r A l l o c a t i o n . SHRUTI-CM5 supports two major processor allocation schemes: random processor allocation and q-based processor allocation. Random processor allo-

31

while ( t e r m i n a t i o n condition not met) /* propagate a c t i v a t i o n in the type h i e r a r c h y */ spread bottom-up a c t i v a t i o n ; spread top-doma a c t i v a t i o n ; /* propagate a c t i v a t i o n in the r u l e base , / reverse-propagate c o l l e c t o r a c t i v a t i o n ; check f a c t matches; propagate enabler a c t i v a t i o n by r u l e - f i r i n g ;

Figure 7. The activation propagation loop.

cation involves allocating knowledge elements to random processors, q-based processor allocation allows the user to control the fraction of related elements that are assigned to the same processor. Random processor allocation is actually a special case of q-based allocation with q - ~1, where N is the number of processors in the parallel machine. 5.2.3. E n c o d i n g t h e k n o w l e d g e base The knowledge base is encoded by presenting rules and facts expressed in a human readable syntax like that of first-order logic. Knowledge encoding in SHRUTI-CM5 is a two-part process:

1. A serial preprocessor on a workstation reads the input knowledge base and partitions it into as many chunks as there are processors in the CM-5 partition. 2. During the parallel knowledge base encoding phase, each processor on the CM-5 independently and asynchronously reads and encodes the knowledge base fragment assigned to it by the preprocessor. This input mode parallelizes knowledge encoding and is well suited for large-scale knowledge bases. In addition, SHRUTI-CM5 also provides a direct input mode which by-passes the serial preprocessor, and is useful when small knowledge base fragments need to be added to an existing (large) knowledge base. Input processing results in allocating each knowledge base element 11 to a single processor, and encoding the knowledge using suitable internal data structures. The SHRUTI network is internally encoded by a series of pointers which serve to link predicate and concept representations. A specially designated server processor keeps track of processor assignments. The system is designed in such a manner that the server does not become a bottleneck--it is accessed only when posing a query, and does not come into play during the reasoning process. 12 11Predicates, concepts, facts, rules and is-a relations together constitute knowledge base elements. l~The server is also accessed when encoding a knowledge base in direct input mode.

20 4.1. F o r m of rules, facts, a n d queries S H R U T I encodes knowledge in terms of entities, types (classes of entities), the memberof relation between entities and types, sub- and super-type relations between types, n-ary predicates, facts and rules. The sub- and super-type relations define a partial ordering of types and thus types are organized in the form of a directed acyclic graph (we will refer to it as the T-DAG). It is assumed that there exists a unique type that is a super-type of all types. For convenience we view entities as the leaves of the T-DAG. Rules have the following form: 3Xl:Xl,... ,xp:Xp Vyl:Y1,..., yr:]~r [ P l ( . . . ) A ' " A P , ( . . . ) =V 3 u l , . . . , ut Q(...)] wherein an argument of Pi can be an entity or one of the variables xi and yi. An argument of Q can be an entity or one of the variables xi, y~, and ui. Xis and Yis are types and specify restrictions on the bindings of variables. Facts have the form: 3 x l : X l , . . . , xr:Xr Vyl:Y1,..., ys:Ys P ( . . . ) where arguments of P are either entities or variables xi and Yi. Universally quantified variables are assumed to be distinct (i.e., repeated universally quantified variables are not supported). Observe that facts include ground atomic formulas as well as quantified assertions of the type "All permanent employees receive a bonus" and "there exists an employee who is the boss of all other employees". The form of queries is similar to that of facts except that whereas repeated universally quantified variables can occur in queries, existentially quantified variables are assumed to be distinct. Recently, we have extended SHRUTI to allow negative literals to occur in facts, queries, and rules [24]. The representation language shares the function-free property of Datalog [30] but differs from it in a number of ways. For example, unlike Datalog, our language does not force a dichotomy between extensional and intensional relations and allows both intensional as well as extensional relations to occur in the head of a rule. Thus the language allows rules to define new views (as in Datalog) as well as specify integrity constraints. Furthermore, the occurrence of extensional relations in the head of a rule also allows our system to reformulate a query into a number of alternate, but semantically related queries. While our language is more general than Datalog in its treatment of relations, it does impose a restriction on the form of rules (see below). 4.2. T h e class of reflexive inference A characterization of the class of reflexive inferences is provided in [21]. This characterization is facilitated by the following definitions: [] Any variable that occurs in multiple argument positions in the antecedent of a rule is a pivotal variable. [] A rule is balanced if all pivotal variables occurring in the rule also appear in its consequent. Observe that rules that do not contain pivotal variables are also balanced.

33

QUERY DEPTH vs. RESPONSE TIME I

I

I

I

KB KB KB KB KB

0.6 spmd version 07.5 32 pe cm-5 kbl stats enabled

size size size size size

= = = = =

11.0036 219879 329871 440178 550041

', o :: :c, ;• =c

: : : ; :

0.5

"6" O

0.4

q)

E I-to

0.3

G)

0.2

0.1

0

2

4

6

8

10

Query Depth

Figure 8. SHR.UTI-CM5 query response time for artificially generated knowledge bases of different sizes.

KBSize 110036 219879 329871 440178 550041

Avg. Response Time Depth=0 Depth=l (Retrieval) 0.8 msec 1.4 msec 1.3 msec 2.5 msec 1.9 msec 4.5 msec 2.8 msec 9.2 msec 3.4 msec 15.7 msec

Table 1 Average response time for retrieval (depth 0) and depth 1 queries posed to artificially generated knowledge of different sizes.

34

!

QUERY DEPTH vs. TIME PER RULE FIRED | ,

0.0004

| KB size = 440178 KB size = 550041

spmd version 07.5 32 pe cm-5 k:bl stats enabled

0.0003

g ~0.00025

i

0.0002

!

0

2

4

6

8

Query Depth

Figure 9. Que .ry depth vs. time needed to fire a rule. Kbl on a 32 PE CM-5.

and facts, and (ii) Word Net, a real-world lexical database [16]. In this section we present these experimental results which demonstrate the effectiveness of SHRUTI-CM5 as a realtime reasoning system. Most of the experimentation has been carried out on a 32 node CM-5.

5.3.1. Experiments with artificially generated knowledge bases Part of the experimentation with SHRUTI-CM5 has been carried out using artificially generated knowledge bases. These knowledge bases are constructed automatically from a specification of their gross structure in terms of parameters such as the number of rules, facts, types and entities; the subdivision of the KB into domains; the ratio of interand intra-domain rules; and the depth of the type hierarchy. The specified structure is fleshed out with randomly generated predicates, facts, rules, and types. We have used two types of domains: target domains, which correspond to "expert" knowledge about various real-world domains; and special domains, which represent basic cognitive and perceptual knowledge about the world. A typical structured knowledge base consists of several target domains and a small number of special domains. The predicates within each domain are richly interconnected by rules. Predicates in each target domain are also richly connected by rules to predicates in the special domains. Predicates across different target domains, however, are sparsely connected. Predicates in different special domains are left unconnected. The structure imposed on the knowledge base is a gross attempt to mimic a plausible structuring of real-world knowledge bases. This is motivated by the notion that knowledge about complex domains is learned and grounded in metaphorical mappings from (to) some basic perceptually and bodily grounded domains [13]. Queries for experimenting with each artificial knowledge base were generated using the facts associated with predicates, and the inference dependency graph representing

35 rules interconnecting predicates. For the knowledge bases considered below more than 500 random queries were generated of which some 300 were answered affirmatively and had depths ranging from 0 (fact retrieval) to 8. Each successful query was run 5 times and the resulting data was used to evaluate the performance of SHRUTI-CM5. In the experimental results plotted below, points represent average values, point ranges shown are 95% confidence intervals, while the curves shown are piece-wise best-fits. Figure 8 plots response time for varying query depths and knowledge base sizes. Table 1 highlights the average response times for retrieval queries (these are queries with a derivation depth of 0) and queries with derivation depths of 1. The knowledge bases used for experimentation had 3 special domains, 150 target domains, about 50000 predicates and 50000 concepts. When the knowledge base size is about 200,000 or smaller, the response time for different query depths is essentially linear since activation is, for the most part, confined to the query domain (the domain in which the query was posed) and special domains. As the size of the knowledge base increases, the curve for each knowledge base size can be partitioned into two parts: For depths up to about 3 the response time increases steeply since all the predicates in the query domain and special domains become completely active. Beyond that, the rate at which the response time increases is lower and depends on the number of active predicates in other target domains. The latter in turn depends on the number of rules that link predicates in different target domains. As the knowledge base size increases, the number of inter-domain rules increases, and hence, the response time increases at a higher rate. This is illustrated by the top three curves in the figure. Figure 9 shows the time needed to fire a rule as a function of query depth for two knowledge bases. In general, it was found that for large knowledge bases and que .ry depths greater than 3, the number of rule firings per second on a 32 node CM-5 converged to a relatively constant value of about 125,000. This suggested that the number of rule firings per second per processor was 125.,000 52 , i.e., about 3,900. In other words, T, the time per rule firing per processor on a 32 node CM-5 was found to be about 39-~, i.e., 256 #sec. Experimental results showed that this value of T remained constant over various knowledge base structures and numbers of CM-5 processors. This means that if ]LTKB I is the size of the knowledge base, and a fraction r of this knowledge base becomes active during a reasoning episode, then the expected response time of SHRUTI-CM5 o n a n N processor CM-5 would be T = ILTrRIrT N The above observation suggests a way of designing a real-time reasoning system. Let T,nax be t h e maximum response time that the application can tolerate. The observation suggests that one requires:

Tmax <_ k]LTKB]rT N where k is a factor of safety whose value depends on the severity of the penalty for a response time exceeding Tmax. Based on this equation, we can estimate the largest knowledge base that will provide real-time responses on a given CM-5 partition. Conversely, given a knowledge base, we can determine the size of the CM-5 partition needed in order to obtain real-time performance. This assumes that (i) the dominant component of computation costs is the processing of rules and (ii) the computation is uniformly distributed

35

KB Size 329871 440201 550049

Average Speedup 64 P E CM-5 128 P E CM-5 1.82 2.96 1.67 3.10 1.76 3.39

Table 2 Average speedup for various knowledge base sizes. The maximum speedup possible for 64 and 128 node machines over a 32 node machine is 2 and 4, respectively.

SHRUTI-CM5 Query is-a hypernym (Bird,x)? is-a hyponym (Sparrow,x)? is-a synonym (Bird,x)? is-a hypernym (Entity, Sparrow)?

Response Time 0.029 sec 0.007 sec 0.004 sec 0.275 sec

Serial WordNet Query wn bird-treen wn sparrow -hypen wn bird-synsn -

Response Time 13.27 sec 0.046 sec 0.023 sec -

SHRUTI-CM5 Speedup 457 6.5 5.75 -

Figure 10. Timing data for some WordNet queries.

on all the processors of the machine. We have also experimented with artificially generated knowledge bases on 64 and 128 node CM-5 machines. Table 2 shows average speedups obtained (averaging over all que .ry depths) for a given knowledge base size as compared to the performance on a 32 node CM5. This speedup in suggests that SHRUTI-CM5 does exploit parallelism and can benefit from the availability of even greater parallelism.

Pile .'Udinlalle Pl~pm'Cin .................................................................................

o i-INNI--IN .

N

N ~ N N D D

~

FOe

~.o 3~7.1S

~

,~ E ] D D S D D

,~ E ] D N D N D

,, [Z] I-11-1 [3 D I-] ~. I-I V11-I I-I D [-1 "P~0CES~QR COHPUTATION~LLOAD FOR QUERY is-a

.4Jdmmal~ P i ~ s

~J~.6

,0 V ] D D D ~ N ~. D D V 1 N D D

(Enttt . S p a r ~ ) ? C~l= 0.03)"

~

1

)

1

LOI~DFOR QIJEIRuIs.-a

lr

(Erlt.n . S ~ 7

(

: 0.03)"

Figure 11. CM-5 processor computation (left) and communication (right) load for WordNet query is-a hypernym (Entity, Sparrow)?

37 5.3.2. Experiments with W o r d N e t WordNet is an on-line lexical database which attempts to organize information in terms of word meanings and their inter-relationships [16]. Groups of synonymous words, termed synsets, are used to represent meanings. The meanings of synsets are then defined in terms of their relationships with other synsets. For example, the ent .ry {Cock, Rooster, Chicken, @} has two words, Cock and Rooster, in the synset; this synset is a hyponym (subconcept) of the Chicken synset. WordNet (version 1.4) has about 75,000 synsets classified into nouns, verbs, adjectives and adverbs. WordNet has been mapped to SHRUTI by representing: (i) word-senses and word-forms by nodes; (ii) synsets by a pair of "focal" nodes connected to the appropriate word-sense and word-form nodes, and (iii) lexical relations between words (like antonym) and semantic relations between synsets (like hypernym or superconcept) as labeled is-a relations. The entire database translates into about 880,000 is-a relations. SHRUTI-CM5 supports labeled is-a relations of the form is-a R (A,B), which represents A _+R B. For example, is-a hypernym (Bird, Sparrow) asserts that Bird is a hypernym (or superconcept) of Sparrow. The system supports both specific queries like is-a hypernym (Bird, Sparrow)? ("Is Bird a hypernym of Sparrow?") and enumeration queries like is-a hypernym (Bird, x)? ("Enumerate entities which have Bird as a hypernym"). Even though WordNet exercises only the type hierarchy of SHRUTI-CM5, the similarity of activation propagation in the type hierarchy and rule base ensures that WordNet helps evaluate the effectiveness of SHRUTI-CM5 in terms of its ability to handle large real-world knowledge bases. Figure 10 shows the performance of WordNet on SHRUTI-CM5. For enumeration queries, the figure also compares the performance of SHRUTI-CM5 with the serial command line interface to WordNet. 13 The times shown offer a fair comparison and only take into account the time required to compute the results- they do not include the time required to display or output the results. The timing reported for the serial version is the minimum obtained in six consecutive runs of the same query (to avoid counting time to process cache misses) on a Sun Sparc Station 10. SHRUTI-CM5 timings are averages over ten runs. Figure 11 shows the computation and communication load distribution--i.e., the number of active entities and the number of messages sent and received, respectively--on the CM-5 processors for the query is-a hypernym (Entity, Sparrow)? For a detailed description see [14]. 6. C O N C L U S I O N In this chapter we have shown that there exist interesting and useful classes of inference that can be performed rapidly by neurally motivated and massively parallel computational models. These include a class of inheritance and recognition problems and a class of reflexive first-order inferences. We have also shown that these models can be mapped effectively onto existing hardware platforms and result in a high performance knowledgebased inference system. We believe that the effectiveness of SHRUTI-CM5 validates our belief that situating the knowledge representation and reasoning problem within a neurally motivated computational architecture may not only enhance our understanding of the mind/brain, but it 13We have modified the serial WordNet interface wn to report computation time using Unix timers.

38 may also lead to the development of effective knowledge representation and reasoning systems implemented on existing hardware platforms. In ongoing work we are extending the inferential power of SHRUTI-CM5 by systematically relaxing some of the biologically motivated constraints suggested by SHRUTI. The realization of S H R U T I on real machines offers an interesting set of trade-offs and we are investigating these in order to design a system that can respond rapidly to reflexive queries and at the same time perform reflective inferences at a slower speed. We are also investigating the possibility of integrating the inferential capabilities of SHRUTI-CM5 with the full functionality of existing database and information systems in order to develop a flexible and efficient mediator for accessing large and heterogeneous databases. REFERENCES

1. V. Ajjanagadde and L. Shastri. Rules and variables in neural nets. Neural Computation, 3:121-134, 1991. 2. A. Baddley. Working Memory. Clarendon Press, 1986. 3. P. Dietz, D. Krizanc, S. Rajasekaran, and L. Shastri. A lower bound result for the common element problem and its ,implication for reflexive reasoning. Technical Report MS-CIS-93-73, Department of Computer and Information Science, University of Pennsylvania, 1993. 4. M.P. Evett, W. A. Andersen, and J. A. Hendler. Providing computationally effective knowledge representation via massive parallelism. In L. Kanal and V. Kumar, editors, Parallel Processing for Artificial Intelligence, New York, 1994. Elsevier Science Publication. 5. S.E. Fahlman. NETL: A System for Representing and Using Real World Knowledge. MIT Press, Cambridge MA, 1979. 6. J.A. Feldman and D. H. Ballard. Connectionist models and their properties. Cognitive Science, 6(3):205-254, 1982. 7. J. Geller and C. Du. Parallel implementation of a class reasoner. Journal of Theoretical Artificial Intelligence, 3:109-127, 1991. 8. R . V . Guha and D. B. Lenat. CYC: A mid-term report. AI Magazine, 11(3):32-59, 1990. 9. J. Henderson. Connectionist syntactic parsing using temporal variable binding. Journal of Psycholinguistic Research, 23(5):353-379, 1004. 10. M. A. Just and P. A. Carpenter. A capacity theory of comprehension: Individual differences in working memory. Psychological Review, 99(1):122-149, 1993. 11. H. Kitano. Challenges of massive parallelism. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 813-834, 1993. 12. H. Kitano, H. Hendler, T. Higuchi, D. Moldovan, and D. Waltz. Massively parallel artificial intelligence. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pages 557-562, 1991. 13. G. Lakoff and M. Johnson. Metaphors We Live By. University of Chicago Press, Chicago, 1980. 14. D. R. Mani. The Design and Implementation of Massively Parallel Knowledge Representation and Reasoning Systems: A Connectionist Approach. PhD thesis, Depart-

39 ment of Computer and Information Science, University of Pennsylvania, 1995. 15. D. R. Mani and L. Shastri. Reflexive reasoning with multiple instantiation in a connectionist reasoning system with a type hierarchy. Connection Science, 5(3 & 4):205-242, 1993. 16. G. A. Miller, R. Beckwith, C. Fellbaum, D. Gross, K. Miller, and R. Tengi. Five papers on WordNet. Technical Report CSL-43, Princeton University, July 1990. Revised March 1993. 17. L. Shastri. Massive parallelism in artificial intelligence. Technical Report MS-CIS86-77, University of Pennsylvania, 1986. 18. L. Shastri. Semantic Networks: An Evidential Formulation and its Connectionist Realization. Morgan Kaufmann, San Mateo, CA, 1988. 19. L. Shastri. Default reasoning in semantic networks: A formalization of recognition and inheritance. Artificial Intelligence, 39(3):283-355, 1989. 20. L. Shastri. The relevance of connectionism to AI: A representation and reasoning perspective. In J. A. Barnden and J. B. Pollack, editors, Advances in Connectionist and Neural Computation Theory, Volume 1. Ablex Publishing Corporation, Norwood, N J, 1991. 21. L. Shastri. A computational model of tractable reasoning--taking inspiration from cognition. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, 1993. 22. L. Shastri. Structured connectionist models. In M. A. Arbib, editor, The Handbook of Brain Theory and Neural Networks. MIT Press, 1995. 23. L. Shastri and V. Ajjanagadde. From simple associations to systematic reasoning: A connectionist representation of rules, variables and dynamic bindings using temporal synchrony. Behavioral and Brain Sciences, 16(3):417-494, 1993. 24. L. Shastri and D. J. Grannes. Dealing with negated knowledge and inconsistency in a neurally motivated model of memory and reflexive reasoning. Technical Report TR-95-041, International Computer Science Institute, 1995. 25. W. Singer. Synchronization of cortical activity and is putative role in information processing and learning. Annual Review of Psysiology, 55:349-74, 1993. 26. U. K. Sources, April 1994. National Library of Medicine. 27. TMC. Connection machine CM-5 technical summary. Technical Report CMD-TS5, Thinking Machines Corporation, Cambridge, MA, 1991. 28. TMC. CMMD Reference Manual. Version 3.0. Thinking Machines Corporation, Cambridge, MA, 1993. 29. TMC. CM-5 User's Guide. CMost Version 7.3. Thinking Machines Corporation, Cambridge, MA, 1994. 30. J. D. Ullman. Principles of Database and Knowledge-Bases Systems. Computer Science Press, 1988. 31. T. yon Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser. Active messages: A mechanism for integrated communication and computation. In Proceedings of the Nineteenth International Symposium on Computer Architecture. ACM Press, 1992. 32. D. L. Waltz. Memou-based reasoning. In M. A. Arbib and J. A. Robinson, editors, Natural and Artificial Parallel Computation, pages 251-276. MIT Press, Cambridge, MA, 1990.

40 L o k e n d r a Shastri

Lokendra Shastri is a Member of the AI Group at the International Computer Science Institute (ICSI) Berkeley, CA. His research interests lie at the intersection of Artificial Intelligence, Cognitive Science and Neural Computation. In particular, he is interested in the role of massive parallelism in knowledge representation and reasoning; neurally motivated models of reflexive (i.e., rapid) reasoning and one-shot learning of events and situations; and spatio-temporal networks for pattern recognition. Shastri is on the editorial board of Connection Science and IEEE Expert. Before joining ICSI Berkeley, he was on the faculty of the University of Pennsylvania. He received a Bachelor of Engineering degree in Electronics with distinction from the Birla Institute of Technology and Science, India, an M.S. in Computer Science from the Indian Institute of Technology, Madras, and a Ph.D. in Computer Science from the University of Rochester in 1985. D. R. M a n i

D. R. Mani is a principal research engineer at Thinking Machines Corporation, Bedford, MA. His research interests span the application of massively parallel processing to artificial intelligence, knowledge representation, and database systems. He is currently involved in the design and development of parallel machine learning algorithms for largescale data mining and knowledge discovery systems. Mani has a Bachelor's degree in Electronics and Telecommunication Engineering from the University Visvesvaraya College of Engineering, Bangalore, India, a Master's degree in Computer Science from the Indian Institute of Technology, Kanpur and a Ph.D. in Computer Science from the University of Pennsylvania, Philadelphia, PA.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 1997 Elsevier Science B.V.

41

Massively Parallel Support for Nonmonotonic Reasoning B. Boutsinas, a Y. C. Stamatiou, ~* and G. Pavlides ~ aDepartment of Computer Engineering and Informatics, University of Patras, 26500, Rio, Patras, Greece This work presents how Weighted Inheritance Networks, a new formalism that is capable of representing knowledge under a nonmonotonic multiple inheritance scheme with exceptions, support the design of massively parallel algorithms for problems related to nonmonotonic reasoning. Weighted Inheritance Networks are semantic networks with links that are annotated with symbolic pairs that we call weights. They can handle the inheritance structures described in the literature, including some which the so far proposed formalisms may fail to treat satisfactorily. Weighted Inheritance Networks support the development of both sequential and parallel efficient algorithms, without sacrificing any desired properties that inheritance networks should, in general, possess. We investigate the performance of parallel nonmonotonic multiple inheritance reasoners based on Weighted Inheritance Networks, for two important problems: the goal-directed inheritance reasoning problem, that consists in answering whether an object possesses a particular property and the recognition problem, that consists in finding all the objects possessing a particular set of properties. We present a system that provides massively parallel support for nonmonotonic "path-based" reasoning, called WINBRID, and we analyze its performance using randomly generated knowledge bases. 1. I N T R O D U C T I O N Nonmonotonic reasoning plays an important role in the development of systems that try to mimic commonsense reasoning. Human beings are constantly forced to make decisions and reach conclusions in a fuzzy world. The knowledge that can be acquired by observation is inherently incomplete and may even contain conflicting information as well as exceptions to general rules. Formal logic systems do not seem to provide a satisfactory solution in such an environment (see for example [19,3]) in terms of efficiency. It has been shown, that many of these systems have decision problems that are intractable, or even undecidable, when stated within their context. The algorithmic intractability of the proposed formal systems, along with their weak treatment of incomplete or contradictory knowledge, have led to the use of semantic networks in order to represent knowledge. The inheritance systems that are based on such a representation are capable of nonmonotonic multiple inheritance reasoning. Their semantics are supplied either indirectly, through a translation *This research was partially supported by the European Union ESPRIT Basic Research Projects ALCOM II (contract no. 7141), GEPPCOM (contract no. 9072) and Insight II (contract no. 6019).

42 into an extension of a standard logical formalism (see for example [5]), or directly, through "path-based" theories (see for example [21]). In this work, we propose a form of semantic networks with annotated links called Weighted Inheritance Networks (first defined in [2]), or WINs for short, as an alternative "path-based" approach that is capable of nonmonotonic multiple inheritance reasoning with exceptions. We also present how WINs treat the inheritance structures described in the literature, including those that are supplemented with "redundant generic statements" (see section 3), which the systems proposed so far may fail to treat satisfactorily ([12]). There is a clash of intuitions concerning the treatment of nonmonotonicity (on-path versus off-path preemption), the treatment of competing extensions (forms of skepticism versus forms of credulity) and the direction in which the represented inheritance is followed (upward versus downward view of inheritance) in many "path-based" inheritance systems (see [22]) that have appeared in the literature. It should be stressed that the way in which nonmonotonicity and multiple extensions are treated, is the principal cause for this state of affairs and the confusion that may result thereof. In this work, we give a formal definition of WINs, with these intuitions in mind, and we examine their semantics and algorithmic properties, focusing on the support they can offer for developing massively parallel, multiple inheritance reasoners. Then we examine the support that WINs offer for efficient parallelism when confronted with two well-known problems: the goal-directed inheritance reasoning problem, that consists in answering whether an object possesses a particular property and the recognition problem, that consists in finding all the objects possessing a particular set of properties. Developing efficient algorithms for these two problems is of great importance for many AI applications. A WIN could be, for example, a part of an inference engine of a knowledge-based system or a guiding formalism for structuring the case memory of a learning apprentice that incorporates a case-based learning mechanism. In that latter case, an efficient parallel solution to the recognition problem is highly desirable. This is the case, because case-based systems resort to a knowledge-base (usually of a large size) containing previously stored problems and solution/failure plans in order to recall similar problem situations. Recognition is therefore a frequent operation that, if implemented inefficiently, has a dramatic impact on the performance of the system [15]. WINs support the design of NC algorithms for the above problems. The class NC is the class of problems that are amenable to efficient parallel solutions, that is, solutions that use a polynomial number of processors and take time polynomial in the logarithm of the input size (see [14]). Problems that belong to this class are efficiently parallelizable. If we restrict the number of exceptions to be of the order of log n, where n is a measure of the size of the knowledge base, we obtain NC algorithms for the problems stated above. To the best of our knowledge, no NC algorithm has been given for these problems under other inheritance network formalisms. More specifically, for a reasoner based on a WIN whose underlying graph has n nodes and e edges, we present a parallel algorithm that solves the goal-directed inheritance reasoning problem in O(log 2 n) time, using O(2keM(n)) processors, where k is the maximum number of nodes that represent classes/properties that are cancelled by exceptions, and M(n) is the number of processors needed to multiply two n x n matrices over the ring (Z, +, x) in time O(log n). At the time of this writing, i ( n ) = O(n 2"376) (see [4]). Our model of parallel computation is the Concurrent Read Concurrent Write (CRCW) PRAM

43

shared memory machine (see [13] for a detailed description of this model). Additionally, a hybrid algorithm is also presented (a hybrid algorithm is a sequential one that includes some steps that are executed in parallel) with a time complexity determined by the factor n + e using O(n) processors. As far as the recognition problem is concerned, given a set of p features that must be satisfied and a set of objects, an O(log 2 n) time, O(2kpM(n)) processor, parallel algorithm for the same model of computation is presented. Also, a hybrid algorithm is given for this problem with the same performance as the algorithm for the previous problem. WINBRID (Weighted Inheritance Network hyBRID) is a reasoner capable of nonmonotonic, multiple inheritance reasoning that is based on WINs, that uses the hybrid algorithms in order to handle the two problems mentioned above. It is implemented under the PARIX environment for software development, in Parsytec's GCel-3/512 parallel machine. In what follows, we first present WINs, then we present the algorithms and finally we discuss the architecture of WINBRID.

2. W E I G H T E D

INHERITANCE

NETWORKS

The underlying structure of a Weighted Inheritance Network is a directed acyclic graph. Knowledge is represented by attaching to each node of the graph a label that denotes an object, a class of objects or a property possessed by objects and by establishing the desired inheritance structure through the insertion of properly directed edges. When we insert an edge that emanates from a node and is directed to another, we mean that the class of objects represented by the former node inherits some or all of the defining properties of the class represented by the latter. We believe that such a notion of inheritance is based on the reusability notion. For instance, when human beings think of the class/concept Whale, they first think that a whale "is a fish", because it really looks like a big fish, and then they add, as an afterthought, that it "is a mammal". This is an example of the priority that is apparently given to resemblance over class inclusion. This priority leads us to think that reusability must be an important ingredient in the process through which a human being reaches the above decision, since in order to define the class Whale there are more characteristic parts we can reuse from the definition of the class Fish than from the definition of the average member of the class Mammal. The precedence of resemblance over inheritance is not obvious, since objects (classes or individuals) usually reuse the definition of the class they belong to and, hence, there is the impression that inheritance is based principally on subclass inclusion. Thus, inheritance relates the various classes according to the possibility of reusing all or part of the definition of one in the definition of the other. As an example, if we have defined the class Elephant we can reuse this definition in order to define the object Clyde, the elephant. Also, if we have defined both Republican and Quaker, we can reuse part of each of these definitions to define the object Nixon, the Republican and Quaker. Which part we reuse depends on how the preference over possible conflicting choices is established. To be consistent with our belief that inheritance is based on the reusability notion, we will substitute the usual "isa" ("nisa") hierarchy by the "isl" ("nisl") hierarchy, which stands for "is like" ("is not like"). If an exception exists in reusing the definition of a class in the definition of an object

44 (class or individual), this is indicated by a symbolic pair that we call weight attached to the edge(s) emanating from the node which represents this object and belonging to paths which lead to the node that represents this class. A weight w has the form (s, c) and represents the possibility or the impossibility, depending on whether s = '+' or s = ' - ' , respectively, of reusing the definition of the class c in the definition of the entity from which the edge labeled by w emanates. This possibility or impossibility has higher priority than what can be normally derived by following the edges. In the proposed systems that are capable of representing knowledge under a multiple inheritance scheme with exceptions (e.g., [17,21,11]), if there exists an exception in reusing a definition, this is indicated by an exception link. However, the exception links no longer represent uniformly the reusability notion mentioned above. Instead, these links represent cancellations of the facts derived by the normal process of reusing definitions. Intuitively, this lack of uniformity is the reason why parallel inference algorithms, especially for multiple extensions, have severe limitations (see [21,6,16,12]).

~ ~ ~ --

Clyde

%A

x~ 1.

Clyde

9

-

-

"

RoyalElephant (-,Gray)

/ - -

RoyalElephant

~

o

~

~

~

1 [ .

%-.

Elephant

~

~

9

~

i ~

\~j,.

Gray

x-

x-

/ I .

/ 1 .

Elephant

Gray

Figure 1. An Inheritance Network and a Weighted Inheritance Network

Figure 1 illustrates an example of an inheritance structure that introduces nonmonotonicity. This example is:

isl(Clyde, RoyaIElephant)and isl(RoyaIElephant, Elephant)and isl(Elephant, Gray) and nisl(RoyalElephant, Gray) Here we should be able, of course, to infer nisl(Clyde, Gray). This is achieved by the existence of the weight (-,Gray) along the edge (RoyalElephant, Elephant). This weight is propagated along the path RoyalElephant ~ Elephant ~ Gray, thus excluding "Gray" from the definition of "Clyde", since the exception takes precedence over following the hierarchy links towards the node labeled "Gray". Generally, objects (individuals or classes) are considered to possess a set of properties or characteristics. Therefore, in order to define an object, all of its characteristics should be collected. This is achieved by following all the directed paths passing through the object under consideration, while gathering information about how this object looks like (that

45 is, about the possibility of reusing the characteristics of another class), thus collecting the actual characteristics t h a t it has. This information is represented by both the nodes a n d the weights of a WIN. We denote by C S the set of characteristics of an object that are collected by a reasoner in this way. This set is derived from two other ordered sets: the ordered set LS, whose elements are obtained by following the "looks-like" links and the ordered set W S that consists of the weights that are attached to those links. Notice that new elements are placed at the ends of the sets W S and LS. The elements of W S take precedence over the elements of L S and are ordered in decreasing priority, in the sense that an element has higher priority than its successor in the order they were gathered following the directed paths, according to Touretzky's inferential distance ordering. Therefore, if there are any conflicts among the elements of W S , they are resolved in favor of the element that appears earlier. By removing one of the conflicting elements, taking into account their relative precedences, we obtain the W S - set, the set of relevant weights. The set of characteristics of an object C S is the W S - set augmented, generally, by elements of L S which are not the negation of elements of W S - .

2.1. Formal definitions Before we give the formal exposition of the ideas discussed in the previous section, we will provide some preliminary notations and assumptions. Definition 1 (Weighted Inheritance N e t w o r k ) A Weighted Inheritance Network is an ordered pair W - (G(V, E), ~') where G(V, E) is a directed acyclic graph with V the set of nodes and E the set of edges and jc is a mapping from E to sets of weights and it is specified by the designer of the knowledge base. The set V (the objects) is partitioned into two sets, the set I (the individual entities) and the set C (classes or concepts). The set E is a set of ordered pairs (o, c) where o E V and c C C. The mapping ~ : E ~ 2 H associates each edge with a subset of the set of weights H where H is a set of ordered pairs of the form (s, n) where s E { + , - } and n e C. Since the graph G is acyclic, there is a function f : V ~ N that assigns to each node an integer, such that if node i is an ancestor of node j (in the sense that there exists a directed path from i to j in G), then f ( i ) < f ( j ) (topological ordering). Therefore, if we are given some nodes that lie on the same path, then f induces a total order on them. We observe that if we assign to each edge e -- (u, v) the number that is assigned to u, then a total order (one edge precedes another) is also induced on the edges that lie on the path. In what follows, by S[i] we denote the ith element of the ordered set S.

Definition 2 (inheritance p a t h ) An inheritance path p (from now on called path) from node s to node t, denoted by s ~ t, is a sequence of nodes of the form s = el, e2, . . . , ei = t, i > 1, such that (ei, ej+l) E E. Since G is acyclic, the nodes e l , . . . , ei are distinct. The length of the path, denoted by IPl, is equal to i - 1. Moreover, we require the following important property: Vs, t E V, if p is a path s ~ t : IPl > 1, then (s,t) fg E This property will be referred to as the non-shortcut property and simply states that there is no path whose endpoints are "short-circuited" by an edge. This property can be easily imposed during the construction or the update of a WIN in a preprocessing phase.

45 Definition 3 (looks-like hierarchy) Given an object o E V and a path p that starts from o, we define the set of characteristics of o according to the "is-like" hierarchy to be the ordered (according to f ) set LSo,p = {(+,j)lonpath(j,p)}, where onpath(j,p) is true if the node j belongs to p. For instance, in the example of figure 1, we obtain the following set: n SClyde,C lyde--+Royal Elephant-~ Elephant--+Gray --

{ (+, Clyde), (+, RoyalElephant), (+, Elephant), (+, Gray) } Definition 4 (the set of weights) Given an object o C V, and a path p that starts from o, we define the set of weights to be the ordered set WSo,p - {U(~,.i)eE~((i,j))ionpath(i,p)A onpath(j,p)}. Each element of WSo,p (which is a set of weights), possesses the same order as the edge to which it belongs.

For instance, in the example of figure 1, we obtain the following set: WSClyde,Clyde-+noyalElephant--+ Elephant--+Gray z { (--, Gray)}

Definition 5 (negation of a weight) If x = (s, c) is a weight, then -~x = (s', c) where s'

e

{-, +} - {~}.

Definition 6 (the set of relevant weights) Given an object o c V, and a path p that starts from o, we define the set of relevant weights to be the ordered set W S j , p = {xix E WSo,p[j], for some j A x r WSo,p[i]Vi " i < j A ~x r WSo,p[i]Vi " i < j}.

For instance, in the example of figure 1, we obtain the following set: W SClyde,Clyde_+ noyai Elephant_+ Elephant_+Gray -- { (--, Gray)}

Definition 7 (support of a fact due to an inheritance path) Given an object o E V, a path p that starts from o, an arbitrary set o f f acts S = {(+,k)ik E V} and an object c E C, we define the predicate edg_supo,p(c, S), with the meaning that isl(o, c), as follows: edg_supo,p(C, S) is true iff (+, m) e S (therefore isl(o, m)), where m is the direct predecessor of c along the path p.

For instance, in the example of figure 2, edg-supo,o~M-,c-~D-,E(C, {(+,O), (+, M)}) is true but not the edg_supo,o--~,-,c~r)--~E(E, {(+, 0), (+, M), (+, C)}). Definition 8 (support of a fact due to a positive weight) Given an object o c V, a path p that starts from o, a set of facts S = { ( + , k ) i k C V} and an object c C C, we define the predicate weight_supo,p(c, S), with the meaning that isl(o, c), as follows: weight_supo,p(c,S) is true, if there exists a weight (+,c) E WS~,p that is attached to an edge (k,l) and (+,k) C S but there does not exist a weight ( - , r ) E WS~,p such that it precedes (+, c) and r is both a successor of k and a predecessor of c.

As we will see in the next section, this definition precisely supports generic stability for WINs. For instance, w e i g h t - s u p o , o - ~ r . ~ R ~ T - , c ( C , {(+, O), (+, K), (+,L)}) is true in the example of figure 2, but it would not be true if a weight ( - , R) was attached to the edge (O, K).

47 Definition 9 (support of a fact) Given an object o E V, a path p that starts from o, a

set of facts S = {(+, k)lk e V} and an object c e C, we define the predicate supo,p(C, S), with the meaning that isl(o, c), as follows:

-2) ..... I C" ~176

(-,T .~

........

(+,C)~~~

Figure 2. A WIN

Definition 10 (the set of characteristics)

Given an object o C V, and a path p that starts from o, we define, inductively, the set of characteristics of o to be the maximal ordered set CS as follows:

{+,o}

CSo,, =

{x = (~, c ) l ( x 9

ifi=O

ws:,, A ~ =

-)v

p ))) ((x e LSo,p A-~x r WS~,p) A SUpo,p(c, CSo,~-1

otherwise

For instance, in the example of figure 1, we obtain the following sets: 0 C S Clyde,Clyde-+RoyalElephant--+Elephant-+Vray -- { ('+', Clyde)} 1 C S Clyde,Clyde-+ RoyaI Elephant--+ Elephant-+Gray - { (+, Clyde), (+, RoyalElephant), (-, Gray) )

2 C S Clyde,Clyde-+ Royal Elephant-+ Elephant -+Gray - { (+, Clyde), (+, RoyaIElephant), (+, Elephant), (-, Gray) ) 3 C S Clyde,Clyde-+ RoyalElephant-+ Elephant-+Gray - {(+, Clyde), (+, RoyalElephant), (+, Elephant), ( - , Gray)}

48

3. H A N D L I N G

REDUNDANCY

Since our aim is to simulate commonsense reasoning as closely as possible, it is important for a nonmonotonic reasoner to handle redundant information in a consistent manner. Of course, as long as a reasoner's knowledge about the world remains unchanged in time, this kind of information could be recognized and, subsequently, removed during a preprocessing phase of that reasoner's life cycle. However, if the reasoner's knowledge is continually subjected to revisions, as it is often the case, it is always possible that redundant, or even contradictory, information is introduced as a side effect of a revision process. Under our formalism, a weight is used to represent redundancy, in a way similar to the way a weight is used to represent an exception. This has as an important effect, the preservation of the uniformity of the inheritance network links as well as the preservation of the important "non-shortcut" property.

~ ~

w

o

/-

MaleRoyalElephant RoyalElephant

Clyde

(+,Elephant)

Clyde

~

~_

(-,Gray)

MaleRoyalElephantRoyalElephant

Elephant

Gray

~.

~,~

Elephant

Gray

Figure 3. A WIN with redundancy

Redundant information is closely related to the stability of a reasoner. Suppose that a reasoner, whose knowledge is represented by an inheritance network, supports the conclusion isl(X, ]I). We say that the reasoner is stable if, after inserting into its knowledge base the (redundant) information that isl(X, ]1"), it continues to support all the conclusions it supported before the insertion of this piece of information. Consider the example illustrated in figure 3. This inheritance network is supplemented with the "redundant atomic statement" isl(Clyde, Elephant). Since, the edge (Clyde, MaleRoyalElephant) has the weight (+, Elephant), we obtain the following sets, where o=Clyde and p=Clyde ~ MaleRoyalElephant ~ RoyalElephant --+ Elephant --+ Gray:

9 LSo,p = {(+, Clyde), (+, MaleRoyalElephant), (+, RoyalElephant), (+, Elephant), (+, Gray)}

9 WSo,p = WS~,~ = {(+, Elephant), (-, Gray)}

49 From these two sets, we form the set CS"

9 CS4o,p= {(+, Clyde), (+, MaleRoyalElephant), (+, RoyalElephant), (+, Elephant), (-, Gray)} Note that (+, Gray) is not included in this set, although it belongs to LSo,p and it holds that supo,p(Gray, CS4o,p). This is because (-, Gray) belongs to WSj, p. In shortest-path reasoners (see, for example, [8]), the addition of a redundant link alters the semantics of the inheritance network. In the previous example, a shortest-path reasoner supports the conclusion isl(Clyde, Gray), since the path Clyde -+ Elephant --+ Gray is shorter than the path Clyde --~ MaleRoyalElephant --+ RoyalElephant -~ Gray. If the redundant link were absent, we would conclude that nisl(Clyde, Gray), since the path Clyde -+ MaleRoyalElephant --~ RoyalElephant -/+ Gray would be shorter than the path Clyde -+ MaleRoyalElephant ~ RoyalElephant --+ Elephant -+ Gray). In [12] this kind of stability, called atomic stability, is viewed as an acceptability criterion for an inheritance reasoner. However the other kind of stability, called generic stability, is not viewed as such a criterion. Consider the inheritance network illustrated in figure 4. It shows a WIN supplemented with a "redundant generic statement" ([12]).

.

Clyde

.

.

.

.

-

RoyalElephant x-

Clyde

.

(-,Gray)

RoyalElephant

.

Elephant

Gray

InVisiblelnGrayBackground

x (+,InVisiblelnGrayl~kground) Elephant

Gray

InVisiblelnGrayBackground

Figure 4. A redundant generic statement

In the absence of the redundant information isl(Elephant, In VisibleInGrayBackground) it is (correctly) concluded that nisl(RoyaIElephant, In VisibleInGrayBackground). However, after inserting this redundant information, both shortest-path reasoners and skeptical reasoners conclude that isl(RoyalElephant, In VisibleInGrayBackground). This is, clearly, a false conclusion since royal elephants should always be visible in a gray background because they are not gray, regardless of whether it is directly stated or not that elephants are not visible in a gray background. Similarly, Clyde is not gray regardless of whether it is directly stated or not that Clyde is an elephant. Therefore, in our opinion, both atomic and generic stability should be viewed as acceptability criteria for an inheritance reasoner. WINs satisfy both of these criteria. In the previous example, the fact isl(RoyalElephant, In VisibleInGrayBackground) is not supported because, although isl(RoyalElephant, In VisibleInGrayBackground) belongs to LSo,p (where o=Clyde

50 and p--Clyde -+ RoyalElephant --+ Elephant ~ Gray ~ InVisibleInGrayBackground), SUpo,p(InVisibleInGrayBackground, CS4,p) is not true because neither edge_supo,p(In VisibleInGrayBackground, CSno,p) nor weight_supo,p(InVisibleInGrayBackground, CSno,p) is true. This latter fact is due to the weight (-, Gray) that takes precedence over the weight (+, InVisibleInGrayBackground) according to definition 4, and Gray is both a predecessor of In VisibleInGrayBackground and a successor of Elephant. Of course, one could argue that in the example shown in figure 5 generic stability prevents us from concluding that Moby is an air-breather. We believe that nisl(Moby, AirBreather) is a correct conclusion that is consistent with the knowledge that is represented by the inheritance network shown in this figure, since this network does not represent all of the information a man usually has about whales. The exception that whales, although not land-dwellers, are air-breathers, is not explicitly represented regardless of whether the fact that mammals are air-breathers is directly or indirectly represented.

dh, W

Moby lb,

Moby

Whale

Mammal

Land-Dweller

x_(-,Land-Dweller) x_(+,Air-Breather) ~ Whale

Mammal

Air-Breather ~,~

Land-Dweller

Air-Breather

Figure 5. Another redundant generic statement

The way in which redundancy is handled by WINs, defines the way in which conflicts due to exceptions are resolved. WINs do not support a mechanism analogous to the classical form of on-path preemption as this form is described elsewhere (see [12,18]). The semantics of WINs do not support off-path preemption either. Instead, they support a mechanism analogous to a restricted form of the classical on-path preemption, where the conclusion of a path is preempted only due to an edge (a weight in our formalism) of opposite polarity (sign) which emanates from (is attached to) a node (an edge) belonging to that path. Moreover, according to the classical form of on-path preemption, the conclusion of a path is also preempted due to an edge of opposite polarity which emanates from a node belonging to a path that starts from and ends at adjacent nodes of the path preempted. This second form of on-path preemption as well as off-path preemption are also used to support atomic stability for skeptical reasoners. However, they need not be incorporated in WINs since they support atomic and generic stability in a different way.

51 4. H A N D L I N G

MULTIPLE INHERITANCE

Nonmonotonic multiple inheritance frequently introduces multiple extensions of a theo.ry. We quote from [6]" " . . . a n extension is a set of beliefs which are in some sense "justified" or "reasonable" in light of what is known about a world". In the context of WINs, we associate an extension with a specific object of the represented world or system of beliefs (theory). Therefore, we restrict an extension to a set of beliefs about a world concerning (or obtained through knowing the existence of) that object. Formally, given an object o E V we define an extension E S for this object as follows: ESo,p, = {(+,o)U(s,n)l(s,n) e CSo,p,}, where CSo,p~ is defined for some path Pi that starts from o. Thus VES w.r.t an object o it is E S C_ Th((+,o)), where Th((+,o)) denotes all the properties of o that can be derived. The proposed inference consider all the facts that belong to any possible extension E S w.r.t, an object.

Pacifist (-,Pacifi~)~ Quaker

Republican ~

Nixon

Figure 6. Two conflicting extensions

In WINs, the beliefs obtained by passing through a path are included (either a subset or all of them) in an extension. Therefore, all cases of multiple inheritance are treated as cases of ambiguity, as it is also the case in [21] and [5], and there is no effort to resolve the possible conflicts as it is the case in [19]. The way in which multiple extensions are treated, depends on whether we adopt a credulous or a skeptical view. Consider the WIN shown in figure 6. Since there are two paths that pass through "Nixon", we obtain:

1.

,, LSlo,p, = {(+, Nixon), (+, Republican), (+, Pacifist)} 9 WS1 o,p~ = WSI~,p~ = { (-, Pacifist) }

,, CS12

= {(+,Nixon)(+

Republican)(-Pacifist)}

52 9 LS2o,v2 = {(4, Nixon), (+, Quaker), (4, Pacifist)} 9 WS2o,v ~ = WS2~,v2 = 0 2 9 CS2o,v~ = {(+, Nixon), (+, Quaker), (4, Pacifist)}

From the sets CS12

O,pl

and CS22op~ we obtain the following extensions, w.r.t Nixon: ,

'

1. ESlo,p, = {(+, Nixon), (+, Republican), (+, Quaker), (-, Pacifist)} 2. ES2o,v.~ = {(+, Nixon), (+, Republican), (+, Quaker), (+, Pacifist)}

This is because the conflict between the (-, Pacifist) of CS12,p, and the (4, Pacifist) of 2 CS2o,p~ cannot be resolved in favor of one or the other. This conflict, prevents the two C S sets from being contained both in the same extension. There is another additional feature of the proposed formalism that becomes obvious in the example shown in figure 6: the way in which negative links are treated. A WIN, as defined above, is a unipolar system (see [22] for the basic terminology), in the sense that negative (exception) links are not allowed. Nevertheless, WINs are capable of expressing negative links of classical inheritance networks through negative weights of the form ( - , c). For example, the negative link between Republican and Pacifist is expressed with the positive link (Republican, Pacifist) and the weight (-,Pacifist) attached to it. 5. T W O I M P O R T A N T

PROBLEMS

The sequential complexity of the goal-directed inheritance reasoning problem was examined in [18] when attacked by reasoners based on classical inheritance networks (see [17,21,12]). The problem was proven to be NP-hard for on-path/off-path credulous downward reasoners as well as for on-path upward credulous reasoners. On the positive side, the problem is in P for skeptical reasoners. In this section, we formally define the recognition problem, in the form presented in [19,7], and the recognition problem, for reasoners based on WINs. Definition 11 ( g o a l - d i r e c t e d i n h e r i t a n c e r e a s o n i n g ) Given a Weighted Inheritance Network W = (G(V, E), ~), an object o E I U C and a property/class p E C, decide whether the W I N supports the conclusion isl(o,p), nisl(o, p), neither or both. Definition 12 ( o b j e c t r e c o g n i t i o n p r o b l e m ) Given a Weighted Inheritance Network W = (G(V, E), jc), and a set of properties~classes sp, identify those objects o E I where W supports the conclusion isl(o, p)Vp C sp.

The central idea of the algorithms is to locate the most specific nodes which contain weights relevant to the properties under consideration (the nodes representing these properties can be among these most specific nodes). Intuitively, these nodes are the ones which decide inheritance according to the inferential distance measure, since they represent the most specific information. The fact that we can locate these nodes efficiently in a WIN, is the principal reason behind the good behavior of our algorithms. In the sections to follow, we consider parallel and hybrid algorithms that solve the problems defined above. The hybrid algorithms support both atomic and generic stability as

53

described previously. However, the current version of the parallel algorithms, supports neither atomic nor generic stability. Nevertheless, the parallel algorithms have the property that redundant links do not change previously derived conclusions, as it happens in skeptical or shortest-path reasoners, but additional extensions are derived instead. 5.1. Some definitions and n o t a t i o n Definition 13 An edge e = (i, j) E E is a p-edge if (•

E ~ ( e ) or e = (i,p), where p

is the property under consideration.

The edge (RoyalElephant, Elephant), in the example illustrated in Figure 1, is a p-edge as far as the property G r a y is concerned, since the weight (-, Gray) is attached to it. Definition 14 A p-path is a path containing a p-edge. In the previous example, the path Clyde --+ RoyalElephant ~ Elephant --+ Gray is a p-path because of the p-edge (RoyalElephant, Elephant). Definition 15 We say that an edge e = (i, j) E E or a node v C C is edge-reached by a node o E I U C if there exists a path (regardless of the constraints imposed by the weights of its edges) from o to i in W . These edges and nodes are called active.

Definition 16 We say that an edge e = (i, j) E E is weight-reached by a node o E I U C

iS: i) there exists a path q from o to i in W and ii) (fie' E q such that ( - , w) e ~'(e') and w is the label of an ancestor of i in q) or (~e' E q such that ( - , w) C ~'(e') and w is the label of an ancestor of i in q but 3e" C q such that e" precedes e' and (+, v) e 9V(e") and w precedes v and v precedes i).

Consider the node Gray in figure 1. This node is edge-reached by the node Clyde but it is not weight-reached by this node, since the weight (-, Gray) is attached to the edge (RoyalElephant, Elephant). 6. T H E P A R A L L E L

ALGORITHMS

We now present parallel algorithms for the goal-directed inheritance problem and the recognition problem. The algorithms are designed for the CRCW PRAM model of parallel computation (see [14]). Briefly, a PRAM is a parallel computer that consists of a number of processing elements that share a pool of memory, in addition to having a local memory area. The processing elements operate in lockstep, each executing the same instruction on different data, under global control. In the CRCW flavor of PRAM, any set of processing elements may read from or write into a memory location simultaneously. Bear in mind that the algorithms presented in this section are rather impractical in terms of processor requirements. However, they serve the purpose of illustrating the support that WINs offer for the development of efficient parallel algorithms. The departure point of our algorithm concerning the goal-directed inheritance problem, is to find the most specific p-edges w.r.t, the object o. The algorithm is based on digraph reachability, which is implemented by computing the transitive closure of the digraph

54 induced by the WIN. The reachability process is reduced to a breadth-first search (BFS) numbering of a directed acyclic graph. In [9], a parallel algorithm is given that solves this problem in O(log 2 n) time using O(M(n)) processors. After computing a single-source BFS numbering of the nodes, we can decide whether a node is reachable from the source by examining its distance from the source. The directed acyclic graph induced by the WIN is represented by an n x n adjacency matrix A, where n = II U C I is the number of labels of the WIN, and it is stored in the shared memory, before the beginning of the algorithm. If A[i,j] = 1, then there is a directed edge from node i to node j, whereas if A[i, j] = 0 then there is no such edge. It should be noted that whenever we use a variable with name Ay, we assume that we use a new copy of A, related to y. We have not included the copying procedure in our algorithm, since it can be easily done and its inclusion would only complicate the presentation. As a notational convenience, the statement for each ... is not followed by the processor indices of the processors it activates but, rather, by a symbol that represents the proper set of these processors. The algorithm is shown in figure 7. Before explaining its operation, we will list and explain some auxiliary procedures that are used by the algorithm.

9 Find_Reachable_From(o):

First, the algorithm identifies the subgraph of G that contains nodes reachable by the object under consideration. It uses the copy of the adjacency matrix that is stored in the shared memory.

9 Is_Reachable(x,y,z): This procedure returns true, if y is reachable by x, using the adjacency matrix specified by the parameter z.

9 Is_Active(x): It returns true, if x is an active edge or node (see definition 15). 9 ZeroOutColm(a,s): It write zeroes into the columns of the adjacency matrix a that correspond to nodes in the set s, i.e., it disregards all the edges of the WIN that are directed towards nodes in s.

9 Subset_Of(S): This operation computes all the subsets of the set S. It is used in the algorithm to activate appropriate sets of processors. First, the algorithm identifies the active nodes and edges (line 3). This is effected by a single-source multiple-sink reachability process taken from [9]. From that stage on, the predicate Is_Active(o) that is subsequently used, is well-defined and computable in constant time. In lines 5 through 7, all the plus weights, except those of the form (+, p), are converted to actual edges. Then the algorithm locates all the p-edges (lines 8 through 10). Consider, for example, the inheritance network shown in figure 8. In this particular example, the p-edges are denoted by labels that begin with "a". The next step is the elimination of those p-edges that are reachable by the node o only through other non p-edges. This is achieved in lines 11 through 17 by performing a single-source single-sink reachability process from the node representing the object o that is under consideration to each encountered p-edge. However, before this step is executed, the algorithm excludes from the corresponding copy of the adjacency matrix of each p-edge all the other p-edges (lines 12 and 14). In the previous example, only the edges whose labels begin with "ab" remain.

55 i p r o c e d u r e goal_dir ( W : W I N , o : i n t e g e r , p : i n t e g e r ) 2 begin

3 4

Find._Reachable_From(o) (* ldenti .fy active nodes *) (POS,NEG)+-(0,0)

5

f o r e a c h e--(i,j) E E:|s_Active(e) in p a r a l l e l do

6 7

for each (+,a)E ~'(e):aE V-{p} in p a r a l l e l do A[i,a]=l (* Convert positive weights to edges *) f o r e a c h e=(i,j)EE:ls_Active(e) in p a r a l l e l do if (-t-,p)E .T'(e) o r j = p type[e]='a' (* e is a p-edge *) for each e=(i,j)EE:type[e]='a' in p a r a l l e l do for each g=(k,1):gs~e a n d type[g]='a' in parallel do begin Ae[k,1]=0 if ls_Reachable(o,i,Ae) type[e]='ab' end for each e=(i,j) E E:ls..Active(e) in p a r a l l e l do (* Phase 1 begins *) i f 3 (-,a) E .~'(e):ls_Active(a) A'[i,j]=0 for each h=(k,l)EE:type[h]--'ab'in parallel do if Is_Reachable(o,k,A') begin type[h]='abc'; g o t o l i n e 52 end for each e--(i,j) E E:ls_Active(e) in p a r a l l e l do (* Phase 2 begins *) if 3 (-,a) E ~'(e):ls_Active(a) Zero_Out_Colm(A', { a}) for each h=(k,1)EE:type[h]='ab' in parallel do if ls_Reachable(o,k,A ' ) begin type[h]='abc'; g o t o l i n e 52 end for each e=(i,j) E E:ls_Active(e) in p a r a l l e l do (* Phase 3 begins *) if 3 (-,a) E _~'(e):ls_Active(a) S+-SUa for each X E Subsets_Of(S) in p a r a l l e l do begin for each e=(i,j) E E:ls_Active(e) in parallel do if 3 (-,a) E ~(e):a E X Ax[i,j]--0 (* Remove edge e *) Zero_Out_C olm ( A x , {S-X }) for each h=(k,1)EE:type[h]='ab' in p a r a l l e l do if ls_Reachable(o,k,A x ) begin type[h]='abc'; g o t o l i n e 52 end end for each e=(i,j) E E:type[e]='abc' in parallel do if (-,p)E Y(e) NEG+-I

8

9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 else 56 POS+-I 57 c a s e (POS,NEG) is: 58 (1,0): say 'isl(o,p)' 59 (0,1): say 'nisl(o,p)' 60 (1,1): say 'isl(o,p) and nisl(o,p)' 61 (0,0): say 'nisl(o,p)' (* Under the "closed world" assumption *) 62 e s a c 63 e n d

Figure 7. The parallel algorithm for the goal-directed inheritance reasoning problem

56 The next stage is the elimination of those p-edges that are not weight-reached by o (lines 18 through 51). This elimination is achieved in three phases, each of which is executed in parallel for each p-edge with label "ab". In our example, only the nodes whose label is "abE" remain. In each phase, a different possible kind of path that may connect the nodes o and p is sought. The first possible kind of path (considered in lines 18 to 26), includes no edges with negative weights, that refer to active nodes. In our example, and considering the p-edge emanating from K, the path O --+ E --+ F --+ A --+ C ~ K is exactly such a path. The weight ( - , X) that is in the weight-list of edge (C, K) does not refer to an active node, and thus, it has no effect on the progress of the phase. The existence of such paths is examined by performing a single-source single-sink reachability process from o to the tail of each edge whose label is "ab" (line 22), after eliminating the edges whose weight-lists contain negative weights that refer to active nodes (lines 18 through 20).

~ ~176176176176176

", ",

-,P)a oOo.~176176176

( ,P)ab

A,

/ i

1~.

\

.~176 o.~176176176176

S

(,B)

(-,L)

Figure 8. A WIN

57 The second kind of paths are those that include edges with negative weights but do not include the nodes mentioned in these negative weights. The existence of such paths is checked in lines 27 through 35 by performing a single-source single-sink reachability process from o to the tail of each edge whose label is "ab", after excluding the nodes mentioned in negative weights (lines 27 and 29). In the WIN of figure 8 we can see that the nodes that have to be excluded are A, B, C and L. The path O -+ E -+ G --+ H -~ I ~ D --+ J -+ K is a path of the second kind. Of course, if at least one such path is found, the algorithm does not proceed to the last phase for this particular p-edge. Consider now what would happen if the edges (E,F) and (I,D) did not exist. That is, if there were no paths of the two kinds of paths mentioned above. The third kind of paths are those that include at least one node that is mentioned in at least one negative weight of some active edge that is not included in the path. The existence of such paths is examined in lines 36 to 51 where a single-sink reachability process from o to the tail of each edge whose label is "ab" is performed in parallel for each possible subset of all the nodes mentioned in the negative weights (lines 36 through 38). In the WIN of figure 8 all the subsets of the nodes A, B, C and L are examined against inclusion in the same path. During the examination of a subset, we exclude the following: 9 All the edges with attached negative weights concerning the nodes of the subset. 9 The rest of the mentioned (in weights) nodes that are not included in the subset. In the WIN of figure 8 there is only one subset that exists in a path. This subset consists only of A and the path is O --+ E -~ G ~ H -+ A --+ D -+ J -~ K. Before the examination of this combination, both the edge (E, M) and the nodes B, C and L are excluded. Obviously, if o is reachable from p in any one of the three phases, it is considered to be reachable from p. Finally, in lines 52 through 62, the most-specific p-edges that are weight-reached by the object o provide the answer. As for the complexity of the algorithm, it is not difficult to see that it is dominated by the complexity of the third phase. This phase takes O(log 2 n) time using O(2~eM(n)) processors, since there are e p-edges in the worst case, and for each one of them there are 2~ possible subsets that are examined, performing for each of them a single-source, singlesink reachability process. We observe that there is an exponential term in the processor complexity figure, that depends on the value of k, the number of nodes mentioned by negative weights. This k can be in the order of n. However, in practice, the number of exceptions that are contained in knowledge bases is expected to be rather low. If it happens that k is in the order of log n, then the algorithm is an NC algorithm, i.e. it is technically an efficient parallel algorithm (see [14]). Of course, the above complexity is the worst case complexity. In practice, a lot of heuristics can be applied in order to reduce the complexity of the algorithm. For instance, the number of different combinations can be considerably decreased if we take under consideration that some nodes can not be in the same combination because they are not reachable by each other. The algorithm concerning the recognition problem proceeds in parallel for each property/class p C sp. For each property, it takes time O(log 2 n) using o(2kM(n)) processors.

58 Therefore, the overall complexity of the algorithm is O(log 2 n) using O(2*pM(n)) processors. The steps, for each property/class, are implemented in a way similar to the one described above for the goal-directed inheritance reasoning problem. The marking of the objects which can be weight-reached from the node representing the property/class under consideration is achieved through a three phase single-source multiple-sink reachability process starting from p. The three phases are exactly the same as those described in the algorithm for the goal-directed inheritance reasoning problem. In the end, all objects reachable from the node p in any of the three phases are considered to have the property p. In the processor complexity figure shown above for the recognition problem, we observe that the factor e is absent, although this is a more difficult problem than the goal-directed inheritance reasoning problem. This is because we only consider whether an object has a property or not and we do not use the full definition of the goal-directed inheritance reasoning problem, according to which we also consider if it both has and does not have a property, or there is no conclusion at all. Similarly, if we restricted in the same way the goal-directed inheritance reasoning problem, the processor complexity of the corresponding algorithm would be O(2kM(n)). 7. W I N B R I D :

A HYBRID

REASONER

A hybrid algorithm is a sequential one that includes some steps that are executed in parallel. The hybrid algorithms for both problems defined above, are based on a reachability process that locates the most specific p-edges w.r.t, an object o. We implement this process using a scheme based on depth first search (or DFS for short). In the hybrid algorithm concerning the goal-directed inheritance reasoning problem, the reachability process is guided by the encountered weights and starts from the node representing the object o under consideration. The hybrid algorithm concerning the recognition problem is actually a modified version of the algorithm for the goal-directed inheritance reasoning problem that is repeated in parallel, in the worst case, for every object o E I. It proceeds in parallel for each property/class p C sp. In figure 9, this reachability process for the goal-directed' inheritance reasoning problem is shown (procedure answer). It consists, mainly, of the procedure dfs shown in the same figure, that accepts as inputs a WIN W = (G(V, E), $'), the previously visited node v and the property/class p. In order to implement the main reachability process, we modify the DFS algorithm to propagate the weight-list of the previously visited edge to the weight list of each edge it visits, whenever it is needed (steps 31 through 39). Meanwhile, since it may be the case that an already visited edge has to be visited again via another path, if the algorithm reaches a node and there are also other edges directed towards this node, then the algorithm backtracks temporarily. It will only visit the edges emanating from this node (step 12) after all of them have been visited (steps 54 and 55). Therefore, an edge may hold more than one weight-list. This is because it accepts, during the upward propagation of the weight-lists, the weight-lists of eve .ry edge incident on its head node (step 15). These additional weight-lists, are constructed when the edge is visited (step 30). Before being propagated upwards, the weight-lists of the currently visited edge are examined against the following cases:

59 1 p r o c e d u r e answer ( W : W I N , o : i n t e g e r , p : i n t e g e r ) 2 begin 3 dfs(W,o,p) 4 case ( P O S , N E G ) is: 5 ( N O N E M P T Y , E M P T Y ) : s a y 'iM(o,p)' 6 ( E M P T Y , N O N E M P T Y ) : s a y niM(o,p) 7 ( N O N E M P T Y , N O N E M P T Y ) : say 'i.r a n d ~.i.~l(o,p)' 8 ( E M P T Y , E M P T Y ) : s a y 'ni.~l(o,p)' (* U n d e r t h e "closed world" a s s u m p t i o n "1 9 end 10 p r o c e d u r e dfs ( W : W I N , v : i n t e g e r , p : i n t e g e r / 11 b e g i n 12 f o r e a c h edge ( v , w ) : v i s i t e d [ ( v , w ) ] = f a l s e d o 13 begin 14 visited[(v,w)]=true 15 f o r e a c h wght_llst of (a,v) E E 16 begin 17 ifv r 18 st at e = w g h t _ l ist ((a,v)) [0] 19 else 20 state=true 21 if s t a t e = t r u e (* T h a t is, t h e s t a t e is active "1 22 if i s . p _ e d g e ( ( v , w ) ) 23 begin 24 if ( + , p ) E w g h t _ l i s t ( ( v , w ) ) 25 POS=(v,w) 26 if ( - , p ) E w g h t _ l i s t ( ( v , w ) ) 27 NEG=(v,w) 28 continue 29 end 30 m ake_a_w ght_l i st ((v,w)) 31 if s t a t e = t r u e a n d v ~ o 32 begin 33 for each i 34 if wght_list ((a,v)) [ i ] ! = E M P T Y 35 w g h t . l i s t ( ( v , w ) ) [ i ] 4- wght_list((a,v))[i] 36 end 37 else 38 if state=false a n d v ~ o 39 w g h t _ l i s t ( ( v , w ) ) 4-- w g h t _ l i s t ( ( a , o ) ) 40 if 3 i: w g h t _ i i s t ( ( v , w ) ) [ i ] = ( - , w ) 41 42

43 44 z15 46 47 48 49

50 51 52 53 54 55 56

begin

wght_list ( ( v ,w) [0] = f a l s e faise_because_of=w 1ast.Jxeg_wei g h t = (-,w) end if 3 i: w g h t _ l i s t ( ( v , w ) ) [ i ] = ( + , w ) a n d s t a t e = f a l s e a n d ((+,w)w)) begin wgiat_list ( ( v ,w) [0] = t r u e false_because_of=NULL 1ast_n e g_wei ght = N U L L end end

i f V (u,w) E E v i s i t e d ( ( u , w ) ) = t r u e dfs(W,w,p) end

57 e n d

Figure 9. The reachability process for the goal-directed inheritance reasoning problem

50 1. If the currently visited edge is a p-edge, then if it has a positive sign we set a fixed memory location POS (steps 24 and 25), whereas if it has a negative sign we set a fixed memory location NEG (steps 26 and 27). Then we backtrack (step 28), since DFS ensures that we visit a most specific p-edge w.r.t, to o. 2. In case a ( - , j ) weight exists (clearly j r p) in the weight-list of the currently visited edge e = (i, j) C E, the algorithm enters an "inactive state" since, according to the semantics of WINs, the weight-reachability of the successor nodes of j must be precluded (steps 40 through 45). 3. If the algorithm is in the "inactive state" and the updated weight-list of the currently visited edge e = (i, j) E E contains a ( + , j ) weight, then the algorithm switches to its normal active state except for some special cases where the generic stability should be satisfied (steps 46 through 52). While the algorithm is in the "inactive state", it does not examine the first case (step 21) and simply propagates weight-lists without updating the weight-lists it receives (steps 38 and 39). The validity and completeness of the sequential algorithm follow from the properties of DFS. The algorithm does not write into the locations POS and NEG, unless there exists a p-edge that is reachable by o through, at least, one path containing no other p-edge. The reachability process is constrained by the encountered weights due to switching back and forth to the "inactive state". Visiting the nodes in depth first fashion takes O(n + e) time, assuming that n denotes the number of nodes and e the number of edges of a WIN. During the examination of each edge (v,w) the algorithm processes all the items of a number of weight-lists. This is because it may be the case that it propagates the weight-lists of all the visited edges incident on v. The number of these weights-lists can, in the worst case, be equal to the number of different paths in the WIN under consideration. But, as can be shown by an average case analysis, this number is generally not large. The rest of the operations in the step consist in searching a weight-vector, usually very short, for a specific property. Moreover, we need O(lI I + m) processors in order for the DFS scheme to be executed in parallel for eve .ry object o E I where m is the number of memory-processors (see the next subsection). Therefore, using O(n) processors, in the worst case, the overall performance of the algorithm is determined by the n + e factor. Notice that during the reachability process we locate the p-edges concerning all the properties in sp simultaneously. In case the currently visited edge is a p-edge concerning the property/class Px, if it has a positive sign then we set a corresponding memory location POS[x] whereas if it has a negative sign we set a different memory location NEG[x]. Finally, we backtrack since DFS ensures that we visit a most specific p-edge concerning px w.r.t, to o. In this case, what actually happens is a "virtual backtracking" since the DFS algorithm really proceeds, but without checking for a p-edge concerning p~. Instead, it checks for p-edges concerning the other properties/classes of the sp set. The DFS algorithm will check again for a p-edge concerning Px when it actually backtracks from the current edge.

61 7.1. T h e a r c h i t e c t u r e of W I N B R I D In order to evaluate the hybrid algorithms described above, we have implemented a reasoner, called WINBRID, that uses the algorithms in order to handle the two problems that were previously defined. WINBRID was implemented on the Parsytec GCel-3/512 parallel computer under the PARIX operating system. This is a MIMD machine that consists of 512 processors (T805 transputers) arranged in a 2-dimensional mesh. Considering the computational demands of our algorithm, this number of processors was rather very restrictive. This fact had considerable impact on some of our implementation decisions. The set of the allocated processors is partitioned into two sets: the ones that hold the adjacency matrix of the underlying graph (memory-processors) and the weight lists, and the ones that perform the reachability process for each object o E I (search-processors). As an implementation strategy, we store the weights attached to edges in the node from which the edges emanate, along with information that associates the edges with their weights. Since every node has an identification number, we exploit this number in order to represent negative or positive weights as positive or negative integers. At most n such numbers are included in a weight list, although in practical cases there are only a few of them. Each of the memory-processors, communicates directly with all the search-processors. Whenever a search-processor needs more information in order to proceed with DFS, it requests it from the memory-processor that possesses it. One of the allocated processors acts as the computation coordinator. During a preprocessing phase, it divides the information about the adjacency matrix and the weight lists among the memory-processors. This processor is also responsible for maintaining a balanced workload among the processors performing DFS. This is necessary whenever we have at our disposal fewer processors than individuals. Our tests were aimed at an average case analysis of the behavior of WINBRID. To this end, we used a variety of WINs with a wide range of numbers of nodes and weight-list sizes as well as graph depths. The WINs that were used in our tests were generated randomly, in accordance with the properties that define them, that is non-short cut property and acyclicity. The generation algorithm begins with the maximal acyclic directed graph with n nodes, where n is the number of objects in the WIN. In this graph, there is a directed edge from a node i to a node j only if i < j. Then the algorithm starts discarding edges randomly. A randomly selected edge is discarded only if the resulting graph is still connected and possesses the non-short cut property. This process is continued, until all the edges of the initial graph have been considered. The next step, is to add a weight list for each edge of the resulting graph. This is accomplished in a random manner too. This process is governed by an input parameter defining the maximum size of a weight-list. In figure 10, the speedup is shown as a function of the size of the WIN and the number of processors used. For these tests, we generated WINs of various sizes and ran the algorithm with 50 queries on the average on each of them, using different numbers of processors. We observe that the attained speed-up is very good for a mesh topology, where our experimental work was carried out. In figure 11, the performance of WINBRID is shown as a function of the maximum size of the weight-lists. We ran the algorithm for 50 queries, for a specific WIN with size n + e - 500 and for different maximum numbers of weights. We can conclude that the overall performance of WINBRID is not related to the number

52

45

|

|

|

|

I /

w

,

.i

n+e=9053 -.~.--:" n + e = 2 0 7 1 0 " : - + - - .~

40 I/ /

n,~,,~o2 . ~ z . r},i~/=42455~-~ ..... ...... b , e = ~ -~-.-

.13.S"

25

..............................

....... ..~ . . . . . . . . . . . . . . . . . . . .

15

.::::fir." 10

.-...~ ...... ::=: :..~=,,.,...,~:_-_.':.~ . . . . . . . . . . .

""

~ ........... o....,i ,,o~

o-~ ....

. ....................................

'

o...i

5

I

I number

I of processors

Figure 10. Speedup as a function of the size of the WIN and the number of processors

of weights attached to the edges but it, rather, depends on the number of processors used, as it can be seen in figure 10. In figure 12, the performance of WINBRID is shown for the goal directed inheritance reasoning problem using 64 processors. The time complexity of a query concerning the recognition problem, is reduced to the time complexity of a query concerning the goal directed inheritance reasoning problem, if there are sufficiently many processors. Therefore, figure 12 shows the performance attainable by the algorithm for the recognition problem if there are sufficiently many processors. 8. R E L A T E D

WORK

PARKA is a frame-based knowledge representation system that was first implemented on a SIMD computer and it is designed to provide extremely fast property inheritance and inference capabilities (see [7]). It uses a multiple inheritance structure, the Frame Network, to represent relationships among concepts. Each concept is represented by a suitable frame. PARKA uses a fast hybrid algorithm for the recognition problem with a complexity that depends on the depth of the Frame Network and it is O(n) in the worst case, using O(n 2) processors ([7]). However, there is a new and improved version of PARKA, running on a MIMD computer ([20]) with time complexity O(d, n/k), where d is the depth of the inheritance network, n is the number of frames and k the number of processors of the parallel machine. PARKA does not, currently, support exceptions of inheritance. Moreover, it is shown in [7] that PARKA does not completely implement Touretzky's "inferential distance ordering". Thus, there are some Frame Networks that

63 O.E!

0.25

0.2

0.15

0.I

0.05

0

20

40 Maximum ( W9

60

80

100

/ FAg9

Figure 11. Response time related to the number of weights

can not be handled by this system. In [12], an inheritance reasoner is presented that uses a hybrid algorithm for a PMPM machine. A PMPM (Parallel Marker Propagation Machine) is a SIMD machine which is composed of a number of processing elements each of which is assigned to a node or link of an inheritance network and is capable of executing a number of simple marker manipulation and propagation instructions. The communication structure of the machine is fixed and matches the topology of the inheritance network. The time complexity of this algorithm is O(n 2) in the worst case (although cases that give rise to worst case performance are very rare in practice), using O(n + e) processing elements. Although the reasoner described in that work implements a complete version of the "inferential distance ordering," it may fail to treat satisfactorily some inheritance structures that are supplemented with "redundant generic statements" ([12]). Moreover, the fixed topology of the machine makes it difficult to facilitate the dynamic changes that may be later imposed on the structure of the inheritance network. On the other hand, WINBRID supports multiple inheritance with exceptions, and implements a complete version of the "inferential distance ordering" measure supporting both atomic and generic stability, with a time complexity that it is determined by a factor in the order of n + e, using O(n) processors. Moreover WINPAR, a massively parallel reasoner that is currently under development, will provide massively parallel support for both the goal-directed inheritance reasoning and recognition problems, using the parallel algorithms described above.

54 =

I

0 0

50000

,oo'ooo

,so'ooo

I 200000

I 250000

300000

nodes+edges

Figure 12. Response time (goal-directed inheritance reasoning problem) related to the size of the WIN

9. F U T U R E W O R K The design of more efficient algorithms for both the goal-directed inheritance reasoning and recognition problems is our main research goal. More specifically, we intend to use an extension of some transitive compression technique in order to efficiently manage transitive relationships. There is already a sequential algorithm that uses such a compression scheme for the management of transitive relationships with exceptions (see [1]). This approach is directly compatible with the key idea behind Weighted Inheritance Networks. A parallel algorithm can be based on the parallel scheme presented in [10]. We are also currently adapting the parallel algorithms in order to implement WINPAR, a parallel reasoner capable of nonmonotonic, multiple inheritance reasoning. WINPAR is being implemented under the PARIX environment for software development, on Parsytec's GCel-3/512 parallel machine. Moreover, it is interesting to examine heuristics that may lead to improvements in the performance of the parallel algorithms. As far as the hybrid algorithm described in section 7 is concerned, there are some places within the procedure dfs where more parallelism could help, such as the parallel manipulation of the weight-lists. 10. A C K N O W L E D G M E N T S The authors wish to thank I. Andreopoulos, K. Stathelakos, and A. Tzelaidis for their help during the tests. The authors also wish to thank the High Performance Computer Laborato .ry for the use of their Parsytec GCel-3/512 machine.

55 REFERENCES

1. B. Boutsinas, "Efficient Management of Transitive Relationships with Exceptions", Technical Report, C.T.I., TR.95.11.40, 1995. 2. B. Boutsinas and G. Pavlides, Belief Revision in nonmonotonic multiple inheritance using Weighted Inheritance Networks, Proceedings, AI'93 Workshop on Belief Revision: Bridging the gap between theory and practice, Australia, Melbourne, 141-155, 1993. 3. G. Brewka, Incomplete Knowledge in Artificial Intelligence, Proceedings of 1st Workshop in Information Systems and Artificial Intelligence: Integration Aspects, Ulm, FRG, 11-29, 1990. 4. D. Coppersmith and S. Winograd, Matrix multiplication via arithmetic progressions, Proceedings of 28th Annual A CM Symposium on Theory of Computing, Marina Del Rey, 1-6, CA (1987). 5. D. Etherington and R. Reiter, On inheritance hierarchies with exceptions, Proceedings of the AAAI-83, Washington, 104-108, DC(1983). 6. D. Etherington, Formalizing Nonmonotonic Reasoning Systems, Artificial Intelligence, 31:41-85, 1987. 7. M. Evett, J. Hendler, and W. Andersen, Massively Parallel Support for Computationally Effective Recognition Queries, Proceedings of the AAAI-93, 297-302, 1993. 8. S. Fahlman, NETL: A System for Representing and Using Real-World Knowledge, MIT Press, Cambridge, MA, 1979. 9. H. Gazit and G.L. Miller, An Improved Parallel Algorithm that Computes the BFS Numbering of a Directed Graph, Information Processing Letters, 28:61-65, 1988. 10. E. Lee and J. Geller, Parallel Operations on Class Hierarchies with Double Strand Representations, Proceedings of the third Parallel Processing for Artificial Intelligence Workshop of IJCAI-95, Mondreal, Canada, 107-119, 1995. 11. J. Horty, R. Thomason, and D. Touretzky, A Skeptical Theory of Inheritance in Nonmonotonic Semantic Networks, Proceedings of the AAAI-87, Seattle, Washington, 358-363, 1987. 12. J. Horty, R. Thomason, and D. Touretzky, A Skeptical Theory of Inheritance in Nonmonotonic Semantic Networks, Artificial Intelligence, 42:311-318, 1990. 13. J. J~ Js An Introduction to Parallel Algorithms, Addison Wesley, Reading, Massachusetts, 1992. 14. R.M. Karp and V. Ramachandran, Parallel algorithms for shared-memory machines, in: J. van Leeuwen, ed., Handbook of Theoretical Computer Science (Elsevier, Amsterdam, 1990). 15. B. Kettler, J. Hendler, W. Andersen, and M. Evett, Massively Parallel Support for Case-based Planning, Proceedings of the 9th CAIA, Florida, 3-8, 1993. 16. T. Krishnaprasad, M. Kifer, and D. Warren, On the Declarative Semantics of Inheritance Networks, Proceedings of the IJCAI-89, Detroit, 1099-1103, MI(1989). 17. E. Sandewall, Nonmonotonic Inference Rules for Multiple Inheritance with Exceptions, Proceedings IEEE 7~, 1345-1353, 1986. 18. B. Selman and H.J. Levesque, The Tractability of Path-Based Inheritance, Proceedings of 11th IJCAI, 1989.

55 19. L. Shastri, Default Reasoning in Semantic Networks: A Formalization of Recognition and Inheritance, Artificial Intelligence, 39:283-355, 1989. 20. K. Stoffel, J. Hendler, and J. Saltz, PARKA on MIMD-supercomputers, Proceed-

ings of the third Parallel Processing for Artificial Intelligence Workshop of IJCAI-95, Mondreal, Canada, 132-142, 1995. 21. D. Touretzky, The Mathematics of Inheritance Systems, (Morgan Kaufmann, Los Altos, CA), 1986. 22. D. Touretzky, J. Horty, and R. Thomason, A Clash of Intuitions: The Current State of Nonmonotonic Multiple Inheritance Systems, Proceedings of the IJCAI-87, Milan, 476-482, Italy(1987). 23. W. Woods, What's in a Link: Foundations for semantic networks, in: R. Brachman and H. Levesque (eds.), Readings in Knowledge Representation, 217-241, 1985.

67

Basilis Boutsinas Basilis Boutsinas was born in Cefalonia, Greece, in 1963. He received his diploma in Computer Engineering and Informatics in 1991 from the University of Patras, Greece. He also conducted studies in Electronics Engineering at the Technical Education Institute of Piraeus, Greece and Pedagogics at the Pedagogical Academy of Lamia, Greece. He finished the preparation of his Ph.D. thesis at the University of Patras in 1996. His primary research interests include Knowledge Representation techniques, NonMonotonic Reasoning and Parallel Algorithms. He also works in the application of Knowledge Representation schemes in building educational systems. He is a member of AAAI and AACE. He can be reached at the University of Patras, Department of Computer Engineering and Informatics, 26500, Rio, Patras, Greece. Home Page: http://socrates.ceid.upatras.gr/-vutsinas

Yannis C. Stamatiou Yannis C. Stamatiou was born in Volos, Greece, in 1968. He graduated from the department of Computer Engineering and Informatics, University of Patras, in 1990. He is currently working towards a Ph.D. in computer science. His main research interests are in parallel and distributed algorithms for problems stemming from Artificial Intelligence, and especially Constraint Satisfaction Problems, and techniques for analyzing their expected performance. He is currently working on the problem of the satisfiability threshold for random formulae as well as on the extension of the problem for random Constraint Networks. He is also interested in Artificial Life, Fractals, Philosophy of Science and the roots of Computability and Logic. He is a voting member of ACM, IEEE, the Greek Computer Society, and the Greek Technical Chamber. He can be reached at the University of Patras, Department of Computer Engineering and Informatics, 26500, Rio, Patras, Greece. Home Page: http://fryni.ceid.upatras.gr/-stamatiu

Georgios Pavlides Georgios Pavlides is an associate professor in the Department of Computer Engineering and Informatics of the University of Patras, Greece. His research interests are in Software Engineering and Artificial Intelligence and their applications in Management Information Systems. He received his Ph.D. in Operational Research from the University of Sofia. He can be reached at the University of Patras, Department of Computer Engineering and Informatics, 26500, Rio, Patras, Greece.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

Parallel Operations

69

on Class Hierarchies with Double Strand

Representation Eunice (Yugyung) Lee* and James Gellert Department of Computer and Information Sciences New Jersey Institute of Technology Newark, NJ 07102 This paper continues a series of papers dealing with the problems of (1) fast verification of the existence of a transitive relation in an IS-A hierarchy, and (2) dynamic update of such a hierarchy. As in our previous work, a directed acyclic graph (DAG) of IS-A relationships is replaced by a set of nodes, annotated by number pairs, and stored on a massively parallel computer. In this paper a new mapping of this set of nodes onto the processors is described, called the Double Strand Representation (DSR). The DSR improves the processor usage compared to our previously used Grid Representation (GR). This paper shows IS-A verification and number pair propagation algorithms for the Double Strand Representation. Test runs on a CM-5 Connection Machine a are reported. 1. I N T R O D U C T I O N Most Knowledge Representation systems, as well as all object-oriented languages and databases, use an IS-A hierarchy as their backbone. An efficient hierarchy encoding technique would aid all these areas, especially for large hierarchies. The IS-A hierarchy has been especially important in the KL-ONE family of Knowledge Representation (KR) systems [1-9]. Object-oriented systems, based on SIMULA [13] and Smalltalk [14], always incorporate generalization hierarchies with inheritance behavior. Object-oriented methods have been applied to the design of programming languages, e.g., C + + [15], type systems [16], objectoriented extensions of existing languages, e.g., CLOS [17], and object-oriented database systems such as ORION [~0], o~ [~], and ONTOS [12]. Given the importance of the IS-A hierarchy, one would like to achieve the fastest possible processing for query and update operations in this hierarchy. If we assume that it is known that a Mammal is an Animal, a Dog is a Mammal, and a Collie is a Dog, then we want *This research was (partially) done under a cooperative agreement between the National Institute of Standards and Technology Advanced Technology Program (under the HIIT contract, number 70NANB5H1011) and the Healthcare Open Systems and Trials, Inc consortium. tThis work was conducted using the computational resources of the Northeast Parallel Architectures Center (NPAC) at Syracuse University, which is funded by and operates under contract to DARPA and the Air Force Systems Command, Rome Air Development Center (RADC), Griffiss Air Force Base, NY, under contract# F306002-88-C-0031. aThe Connection Machine is a trademark of Thinking Machines Corp.

70

Thins ( Animal ( Mammal(

M a m ~

(_) Reptile

Dog ( Collie ( )

Collie

k_) Cocker U Spaniel

Figure 1. IS-A Hierarchy

to quickly answer "yes" to the query whether a Collie is an Animal (Figure 1). We also want to be able to quickly update the hierarchy, e.g., when adding the facts that Cocker Spaniels are Dogs, Cats are Mammals, and Reptiles are Animals. In the past we have been especially interested in techniques where the response time for an IS-A que .ry does not depend on the length of the chain of IS-A links that must be traversed to answer the query. Besides these techniques, our tool of choice for achieving fast query and update operations is fine-grained parallelism. This raises the question of how to map the IS-A hierarchy onto the available space of processors. The most obvious intuitive choice is to assign eve .ry class of the hierarchy to a single processor. However, this intuitive choice does not carry over to the links between classes. If the whole hierarchy were known at the beginning of system design, one could opt for a strong form of isomorphism, where every IS-A link is implemented as a hardware link. However, our basic assumption is that Artificial Intelligence is not intelligence at all, if knowledge structures cannot be updated dynamically. Therefore, the isomorphism solution would require dynamic hardware changes as part of any update of the IS-A hierarchy, a solution that is currently still not practical. The idea of custom-made hardware is also not appealing to us. The solution that we have been using in a series of papers [23,25-30] has been to eliminate the need for the IS-A links as much as possible, while still maintaining all the knowledge that is contained in the IS-A hierarchy. In this way, we d o n o t have to worry about the mapping of the IS-A links onto the actual hardware. Like other researchers, we are representing the classes of an IS-A hierarchy by nodes in a graph. The IS-A links are represented by the directed edges of this graph. In our first paper on the subject [26] it was shown that for the special case of a tree-shaped IS-A hierarchy of nodes, the hierarchy could be replaced by a linear order of the same nodes, together with o n e number pair assigned to each node. The assignment of number pairs was based on an encoding due to Schubert [31]. Later on we extended the representation of IS-A hierarchies "without explicit IS-A

71

Figure 2. Three Step Mapping

links" to directed acyclic graphs (DAG). In this representation, several number pairs became necessary at some nodes. The assignment of these number pairs was based on an extension of [31] by Agrawal et al. [18]. While doing this, we were able to prove that the linear order used in [26] is not necessary at all. Rather, a set of nodes with an associated number pair(s) at each node could perfectly represent a DAG-shaped IS-A hierarchy without explicitly maintaining the IS-A links [23]. Unfortunately, the original fast algorithms [26] were possible due to the fact that one n u m b e r pair was assigned to one processor, and not due to the fact that one node was assigned to one processor. So, in order to maintain the speed of processing, at least for queries, it became necessary to change the mapping of nodes onto processors. In [23,28] we mapped each node onto one column in a two-dimensional grid of processors. Every number pair of each node was assigned to a different processor (row) in its column. A pleasant side effect of eliminating explicit links is that the time necessary to traverse them is also eliminated, giving, within certain limitations, constant time responses for transitive closure queries [26]. In other words, by using the Schubert/Agrawal representation, it takes as much time to verify that a Collie is an Animal as it takes to verify that a Collie is a Dog. By adding parallelism, updates can be performed in "almost" constant time. Experimental verification of this claim was provided in [26] for the case of trees. All necessary details of Agrawal et al.'s encoding, our node set representation, and the GR will be described later on in this paper. However, we summarize now that the main feature of our previous work is a three step mapping (Figure 2). In the first step, an IS-A

72 T "l~ng

Mineral

Animal

Domestic-

Siamese

Cheetah

Figure 3. A Class Hierarchy

hierarchy of classes of the real world is mapped onto an isomorphic DAG of nodes, with one class per node. In the second step, the hierarchy of nodes is mapped into a set of those nodes, so that eve~. node is annotated with one or more number pairs. In the third step, this node set and the associated number pairs are mapped onto the processor space of a fine-grained parallel computer. In brief, class hierarchy --4 directed acyclic graph --4 node set + number pairs ~ processor space. In our previous papers, the third step was performed by organizing processors as a twodimensional grid, with one node per column, and one number pair per row. Unfortunately, the GR causes a number of difficulties. We will describe the major problem now, while mentioning some other problems later in the paper. Because some nodes have only one number pair, while others have many number pairs, some columns might be virtually empty, while other columns might run out of processors, disrupting the functioning of our algorithms. Therefore, in this paper, we are showing a different representational approach. The general idea of a three step mapping class hierarchy --4 DA G --4 node set + number pairs --4 processor space is still maintained. Indeed, the first two steps of the mapping are not changed at all. However, the assignment of number pairs to processors is changed in a way that eliminates the main problem of the GR described above. The new representation is called Double Strand Representation (DSR) and forms the main subject of this paper. In addition, we will show parallel algorithms for fast IS-A queries in the DSR. We will also show parallel algorithms for an important operation of Agrawal et al.'s encoding [18], called propagation, in the DSR. Propagation is a necessa .ry part of eve~. update operation on a DAG with number pair annotation. Work related to ours in the symbolic paradigm has been published, e.g., [32,33,39]. The PARKA system, a symbolic approach to combining KR with massive parallelism, has been described there. It is a frame system for handling large amounts of knowledge. It is implemented on the Connection Machine, and its temporal behavior has been extensively

73

Class Name Thing

Animal

Plant Mineral

Mammal

Thing [ 1 10]

Number Pair

[1 [3 [2 [ 10 [5

10] 9] 2] 1O] 8]

Domestic-Animal

[9 9](8 8)

Wild-Animal

[4 4](7 7)

Feline

[6 8]

Siamese

[8 8]

Cheetah

[7 7]

k.~ [10 1 0 ~ Domcstic-~al

[9 9] 0

(8 8 ) ~ ,

8,si o o

,•

[2 2~-J

Mammal ",~ild-Animal 0[4 4] 4 [ ( 7 7) Feline ,

( )[5 8]

[7 7]

w

Figure 4. Node Set Representation for Class Hierarchy

tested. The newer version runs on an IBM SP2 [34]. Neural network approaches close in spirit to ours are, e.g., [40,42-46]. Shastri's work [40,41] combines massive parallelism implemented on a neural network simulator with a well defined, limited inference approach. According to Shastri [42], the distinction between the processes of a special-purpose reasoner and a general-purpose reasoner is akin to two human modes of reasoning, namely, reflexive reasoning and reflective reasoning. Sun [44], on the other hand, presents an intensional neural network approach of reasoning based on the semantic closeness of concepts. His work implements inheritance employing massive parallelism. In Section 2 we discuss the numeric encoding of class hierarchies and the node set representation. In Section 3 the Double Strand Representation is presented. In Section 4 constant time subclass verification in the Double Strand Representation is discussed. Section 5 presents massively parallel propagation algorithms with the Double Strand Representation. We compare the performances of both representations, giving experimental results on the Connection Machine in Section 6. Finally, we conclude this paper in Section 7.

2. N O D E SET R E P R E S E N T A T I O N A N D A G R A W A L / S C H U B E R T E N C O D ING Our work has been based on Agrawal et al.'s [18] extension of Schubert's class hierarchy reasoner [31] towards directed acyclic graphs. Agrawal et al.'s approach [18] makes it possible to verify that A is a subclass of B by comparing a number pair at A to one or more number pairs stored at B. This approach makes no use of the path from A upwards to B at all.

74

!

~g

, Path

Left Part

t Part

! 9

I I

! I

al I

l

I

I I

I I

i Domestic-Anita I

ild-Animal .

.

.

.

.

.

;

. . . . .

!

O ~ Siamese

.

i I

r~F~ine- -

-

"i

i Subtrcc

'I I

I

'i U~ Ch~,etah I

Figure 5. The Four Areas of Spanning Tree

The basic approach of [18] is: (1) Construct an optimal spanning tree of a given b A G such that at every node with multiple parents, we select the link to the parent with the maximum number of predecessors. Predecessors are nodes that are reachable from a node by an "up search." (2) Assign a pair of preorder and maximum number to every node. Preorder numbers are generated by a right-to-left preorder traversal of the spanning tree. The maximum number for every node is the maximum preorder number in the subtree rooted at that node. Tree pairs result from this step. (3) All the arcs that are not part of the optimal spanning tree are used to propagate number pairs upward. This is done so that every transitive IS-A relation in the bAG can be verified, but no redundant pairs are generated. Graph pairs result from this step. In Figure 4 we use the notation [Tr #] for tree pairs and the notation (Tr #) for graph pairs. In this representation a node A is a subclass of a node B iff the tree pair of A is included in or equal to any one of the pairs of B. For instance, a Cheetah is a Feline because [7 7] is a subinterval of [6 8]. A Cheetah is also a Wild Animal, because the tree pair of Cheetah [7 7] is propagated to Wild Animal as a graph pair (7 7). However, a Cheetah is not a Mineral because [6 8] and [10 10] are disjoint. In [23] our incremental massively parallel encoding of bAGs, called "node set representation," was introduced. We proved that the node set representation together with the number pairs is sufficient to represent a class hierarchy. We can operate with a set of nodes because all important update and retrieval operations require only three items at every node: (1) the key item, e.g., Mammal, (2) the number pairs, and (3) the area of the spanning tree where the node is located [23]. It is easy to see that the tree pair at each node N can be used to determine four areas of the graph (Figure 5). Every node N, except for the root, defines a path of spanning tree arcs that connect N to the root. This path divides the spanning tree into four (possibly empty) areas: (1) the path itself; (2) the left part of the path; (3) the right part of the path; (4) the subtree which is rooted at

75

[1 10] [2 2] 0

1

[3 9]

. . . . .

[8 81 lit9 91 ]ltlO ioJll""

2 ..... 7 Tree Pairs Strand

8

9 (Pq;J

|1 t9 911 (8 8) I Ira 41 (7 7) | Maxld-3 Maxld-2 MaxId- 1 Maxld ((I~ Graph Pairs Strand

Figure 6. Double Strand Representation for Figure 4

N. For instance, for Mammal in Figure 5, we can easily define area 1: {Thing, Animal, Mammal}, area 2: {Mineral, Domestic-Animal}, area 3: {Plant, Wild-Animal}, and area 4: {Feline, Siamese, Cheetah}. Many important steps of the update operations treat each of these four areas uniformly, with the same operations being applied to all the nodes in one area. Therefore, if we have area information, we do not need the class hierarchy any more. Details can be found in [23]. 3. D O U B L E S T R A N D

REPRESENTATION(DSR)

Now we will show how the node set is practically mapped onto the processor space of the CM-5. We are interested in a mapping that will represent tree pairs and graph pairs efficiently, so that it is possible to achieve a high degree of parallelism, memory efficiency, and optimal use of available processors. We call the result of our new mapping the "Double Strand Representation" (DSR). In this representation the given processors are divided into two areas: the tree pairs strand and the graph pairs strand (See Figure 6). In the tree pairs strand, every node is represented in a separate processor. The tree pair of a node may be assigned to the tree pairs strand in any order. In the graph pairs strand, pairs of processors are used to store a sequence of pairs, each consisting of a tree pair and a graph pair. Every processor is assigned an address, called its ID. Let source of propagation be a node which propagates its tree pair and let target of propagation be a node to which a number pair is propagated from the source of propagation. The tree pair U stored in a processor with an odd ID x is used to represent the target of propagation. The graph pair V in the processor with ID x + l is used to represent the source of propagation. Let Z be a processor pair (U, V) in the graph pairs strand. Let Y be the set of all Z. Every time a pair V is propagated to a node with tree pair U, we will represent (U, V) in the graph pairs strand. If 1K processors are available, the maximum ID (MaxId) will be 1022 in the DSR, and the first processor pair (U, V) will be stored in the two processors 1021 and 1022. In Figure 6 (representing the hierarchy of Figure 4) the tree pair of Wild-Animal [4 4] occurs as U in the graph pairs strand in processor MaxId - 1 (1021) and a propagated pair (7 7) which is the tree pair of Cheetah occurs in the even processor MaxId (1022) as a graph pair of Wild-Animal. Therefore, we can verify that Cheetah is a subclass of Wild-Animal. (Details will be shown later.)

76

Thing

Plant

Animal

W-Ani

Mammal

Feline Cheetah Siamese D-Ani

o 1[1 lO]I 1[2 2] I [tY'~ 1[4 41 I 115 811 I [6 811 1[7 711 ~ 1 I'

i

I ~

I!

i

11

I I (7 7)1 I

II

IF--G[

II

Mineral

i[9 911 1[1010]1

II

I F-

I (8 8)1 L___]

II

I1

I!

il

Figure 7. Grid Representation of Figure 4

We will now compare the Double Strand Representation to our Grid Representation used in previous research [27]. The Grid Representation is a distributed representation of a node set (Figure 7). Every node is represented as a column. The first row contains the tree pair of the node, while up to k graph pairs are maintained in the other rows. Nodes may be assigned to columns in the order that the system is informed about their existence. As the Double Strand Representation, this order is irrelevant. We have been using a grid of 128 columns and 8 rows. On our Connection Machine 1 2 8 , 8 = 1024 is the minimum number processors that may be used. The choice of 128 columns and 8 rows corresponds to a compromise between having a large node set and permitting a reasonably large number of pairs at each node. Note that these 1024 processors are "virtual," meaning that they are simulated on 32 real processors. While we could use larger sets of virtual processors, this would only slow down real-time results. Therefore, we needed to choose the minimum configuration. While using the grid structure, we encountered some problems. First of all, we have to allocate k processors for graph pairs for eve .ry tree pair. This causes a significant number of processors to be left empty. This can lead us to run out of processors when the number of graph pairs in one column exceeds k while in other columns processors are empty. Secondly, during update of the class hierarchy, we may have an unused processor between two used processors, called a "hole," because of our implementation of Agrawal et al.'s subsumption algorithm. Unfortunately, with the current algorithm for the grid structure, we have not found an efficient technique to reclaim such a hole. Agrawal et al.'s subsumption technique, as we have mentioned earlier, is an algorithm which eliminates subsumed pairs during propagation. For example, if a number pair (7ri pi)is subsumed by another pair [Trj #j] at the same node (column) due to propagation, i.e., 7r.i _< 7ri and #i <_ #j, then discard (7ri #i). It is due to this that some processors are left without pairs and become holes. All these problems have lead us to abandon the Grid Representation and turn to the Double Strand Representation. One question which arises now is: How do we efficiently organize processors into two strands? Suppose that we organize processors (by SW) in two rows; the first row is used to represent tree pairs and the second row for graph pairs. Since these two strands are growing at different rates of speed, we may encounter a case where all the processors in the tree pairs strand are used up while a lot of unused processors remain in the graph pairs strand. This is clearly not a good representation. It is necessary

1

77

Figure 8. Dynamic Storage Management of Double Strand Representation

to design a processor-efficient technique to avoid this problem. The main idea of our storage management is borrowed from the dynamic paradigm of languages such as Pascal that maintain a stack and a heap. In our representation, processors of the tree pairs strand are allocated starting at processor 0 and grow towards higher processor IDs. Processors of the graph pairs strand are allocated starting at the processor with the largest ID and grow towards lower processor IDs. There are two pointers, (I)~ and ~v, to show the borders of both areas. We define jc to be the size of "free space." Jr = ~)~ - ~)r

(1)

~" should be bigger than a certain threshold, say 10% of processor space. If this is not the case, we can take some corrective measures, to be described in [24]. 4. S U B C L A S S V E R I F I C A T I O N Now we will describe how to verify subclass relationships. Suppose that we want to verify whether B IS-A A. There are two cases: (1) A is a tree predecessor of B. (2) A is a graph predecessor of B. The first case can be easily verified by a subsumption test: the tree pair of A subsumes the tree pair of B. For the second case, we have to check whether A has a graph pair propagated from B or from a tree predecessor of B. We now introduce some CM-5 terminology. A parallel variable (pvar) is an array (perhaps multi-dimensional) where every component is maintained by its own processor and all values are usually changed in the same way and in parallel [47]. In the algorithm, variables marked with !! are parallel variables, and operations marked with !! or involving parallel variables are parallel operations. The parallel variable pre!! contains for eve .ry node a preorder number, and the expression max!! stands for a parallel variable that contains for eve .ry node a maximum number. The functions prenum(A), maxnum(A), and tree-pair(A) retrieve the preorder number, maximum number, and the tree pair, respectively, for the given node A. Additionally, the variable g-lb (Graph Strand Lower Bound) represents (I)~ and t-ub (Tree Strand Upper Bound) represents (I)~. The parallel function self-address!! returns IDs of all active processors and oddpll contains TRUE on a processor if the processor's

78 ID is an odd number. The algorithm ACTIVATE-PROCESSORS-WITH consists of two parts. The first part describes a set of processors to be activated. The second part, starting with DO, describes what operations should be performed on all active processors. We now show a function IS-A-VERIFY that performs subclass verification. As we mentioned above, if A is a tree predecessor of B (by IS-A-VERIFY-I) or A is a graph predecessor of B (by IS-A-VERIFY-2), then B IS-A A. ; B is-a A iff I S - A - V E R I F Y returns T. IS-A-VERIFY (B, A: Node)' BOOLEAN return(IS-A-VERIFY-I(B, A) OR IS-A-VERIFY-2(B, m))

IS-A-VERIFY-1 (B, A: Node): BOOLEAN ; If A is a tree predecessor of B, then the tree pair of A subsumes the tree ; pair of B. ACTIVATE-PROCESSORS=WITH prenum(B) >_!! prenum(A) AND!! maxnum(B)
Now we will show how to verify that B IS-A A when A is a graph predecessor of B. Remember that a pair of processors (U, V) in the graph pairs strand is used to represent a graph pair. The tree pair in the odd processor (U) is used to represent a node S and the graph pair in the even processor (V) is used to represent a node which propagates its tree pair to S. Therefore, we are looking for a pair of processors (U, V) such that the tree pair of A is contained in processor U and the graph pair of B or its tree predecessor is contained in processor V. In the following functions the expression mark!![x] "- y means that the pvar mark!! on the processor with the ID x is assigned the value y. IS-A-VERIFY-2 (B, A: Node)" BOOLEAN ; Activate every occurrence of the tree pair of A in the graph pairs strand. ; Set the parallel flag mark!! on the right neighbor processors of the active ; processors. ACTIVATE-PROCESSORS-WITH pre!! <__!! prenum(tree-pair(A)) AND!! max!! __!! maxnum(tree-pair(m))AND!! self-address!!() ___!! g-lb AND!! oddp!! (self-address!!()) DO BEGIN mark!![self-address!!() +!! 1]:= 1 END

79 ; Test whether any marked processor has the tree pair from B or ; from a tree predecessor of B, as a graph pair. If this is the case, return T. ACTIVATE-PROCESSORS-WITH pre!! <_!! prenum(tree-pair(B)) AND!! max!! :>!! maxnum(tree-pair(B))AND!! mark!![self-address!!()] =!! 1 DO BEGIN IF any processor is still active THEN return T END

In our example (Figure 6), Feline is a subclass of Animal because [6 8] is a subinterval of [3 9] (by IS-A-VERIFY-I). IS-A-VERIFY-2 will verify that Siamese is a subclass of Domestic-Animal because the tree pair [8 8] of Siamese will occur as (8 8) together with the tree pair [9 9] in the graph pairs strand. Feline is not a Plant because [6 8] is neither a subinterval of [2 2] nor is there a processor pair ([2 2], (6 8)) in the graph pairs strand. In summary, with the Double Strand Representation, it can be rapidly decided whether a subclass relation exists between two classes. 5. I N C R E M E N T A L

UPDATE

OF C L A S S H I E R A R C H Y

WITH DSR

In a directed acyclic relationship graph, there are two "obvious" incremental update operations: (a) inserting a graph component into another graph component, when both of them are initially disconnected components; (b) adding a new link between two nodes of the same graph component. We call (a) graph insertion and (b) link insertion, while insertion and update may refer to either one of them. We previously presented algorithms for graph insertion in [28] and link insertion in [23]. We need to show how the insertions can be performed with the DSR. The steps that have to be taken for an insertion consist of (1) the global number pair update [23] and (2) the propagation of number pairs. As shown before, in the Double Strand Representation, a class is represented by storing its tree pair in the tree pairs strand. Graph pairs are stored in pairs of processors in the graph pairs strand. Note that this makes the graph pairs strand a list of pairs of number pairs. In Section 2 we pointed out that all important update and retrieval operations require only three items of information at every node. The first two, the key item of the node and the number pairs at the node, need to be stored explicitly. The third item, the tree area, can be determined easily, as will be argued now. The global update operations (1) treat every number pair (whether it is a tree pair or a graph pair) in the two strands uniformly, with the same operations being applied to all the processors. The reason for that is as follows. The change of a graph pair has to mirror the change of the tree pair from which it was created by propagation. But how does the algorithm know which transformation to apply to a tree pair? It makes thisdecision based completely on the tree pair itself! Therefore, the same criteria can be applied to the graph pairs that are identical to a specific tree pair. It should be noted that we are not dealing with every possible transformation resulting from a global number pair update as this is the subject of a separate paper [23].

80 Now, we will discuss a parallel number pair propagation algorithm for (2) in detail. The propagation of number pairs is performed as follows. Suppose that a graph link is inserted from a node C to a node N. We have to propagate the tree pair of C to eve .ry predecessor Ni of N (including N itself). For every processor Ni to which a pair V is propagated we need to generate a new entry for the graph pairs strand. This new ent .ry consists of the tree pair of Ni and V: (Tree-Pair(N1), V), (Tree-Pair(N2), Y), ..., (Tree-Pair(Nk), V), where k is the number of predecessors. These newly generated pairs have to be assigned to 2 * k currently unused contiguous processors to the left of (I)~ (Figure 9(c, d)). If several pairs Vi need to be propagated, this process needs to be done serially. As an example (Figure 9(a, b)), due to inserting the arc from H to E, the tree pair V = [5 5] of H should be propagated to every predecessor of E, and E itself (namely, E, C, B, A). As m has the pair [1 8], we do not need to propagate [5 5] to A because [1 8] subsumes [5 5]. In our terminology, only E, C, and B are targets of propagation. For propagating [5 5], we need to find the appropriate IDs of processors to assign [5 5] to (in parallel). For this, we develop a parallel function to find proper processor IDs for each propagated pair (Figure 9(c)). First, we activate processors in the tree and graph pairs strands that correspond to predecessors Ni to which we want to propagate a specific graph pair V. Second, there is a parallel operation, enumerate!!, on the CM-5 that will assign numbers 0, 1, 2 ... to active processors. Third, we define a parallel function T to compute the processor ID where the processor with the number x (assigned by the enumerate function) should deposit its number pair.

T(x) = r

2(z + 1)

(2)

where x >_ 0 and x < ~ . T computes the odd position, and we generate the pair (T(x), T ( x ) + 1) for (Tree-Pair(N/), V). With these three steps, mapping each predecessor to its corresponding processor ID in the graph pairs strand can be completed. For instance, in Figure 9(a, b), when inserting the arc from H to E, we first activate every tree predecessor of E (C and E itself), but not A. Similarly we activate every graph predecessor of E (just B). Then, we call enumerate!! and assign numbers, 0, 1, and 2, respectively. The tree pairs of C and H are assigned to 1019 and 1020 which are T(0) = (I)~ - 2 9 1 and T(0) + 1 - (I)7 - 2 9 1 + 1. Similarly the tree pairs of E and H are stored at 1017 and 1018, and the tree pairs of B and H, at 1015 and 1016. We will now present our parallel propagation algorithm. During the propagation, we may have to consider two problem cases caused by redundant pairs. Let a pair (Tri #i) be the newly propagated pair and let another pair (n.i pj) be a pair at a target node of propagation. In the first problem case, a pair (Trj #j) at the target is enclosed by the propagated pair ( ~ #~), i.e., ~ < n.i and pj < #~, then the pair (nj #j) must be replaced by (hi pi). In the second problem case, the pair (Ei p.i) encloses the newly propagated pair ( ~ p~), i.e., r.i < ~ and #i _< #j, and we do not need to propagate the pair (~ #~) to this target. In the propagation, we replace the redundant pairs just described. The results of this algorithm correspond to the results of Agrawal et al.'s algorithm. The boolean function evenp!! returns TRUE on a processor if the processor's ID is an even number. In the

81 A[1

8] D[2 5]

D [2 5]

B

B

[8 8] (7 7)

-.

F [3 5] [71~7] I~, (5 5) ~

G [4 5]

G [4 5] ' ( ~ H [ 5 5]

H[551

(b) Graph After (H, E) is inserted.

(a) OriginalGraph

(c) Before Propagation 0

1

2

3

4

5

6

7

1021 1022 MI-1 MI

A I~I~IBI~I~IGI"I

. . . . . . . 9 ..

. .ll~ ~]ltO-,llE~~1 [~~t r,-,t~4~q~ ~!~

V>

< ( [ 6 7] ~

5

)

. . . . . . .

)

~.,~~~----_,...,~

"'" 1

2

3

4

5

6

7/~

[S811(55)![6711(55,1t8,] /~o15 lO1610171018 lO19lO2OlO21lO22 o . .

Tree Pairs Strand

"~

>

....... (d) After Propagation

o

~

~ Free Space ~

Figure 9. Propagation in Double Strand Representation

Graph Pairs Strand

MI

82 algorithm, the expression redundantii stands for a boolean parallel variable that represents any redundant pairs in the predecessors. As before, in the following functions the expression mark!![x] "- y means that the pvar mark!! on the processor with ID x is assigned the value y. Finding tree predecessors will be different from finding graph predecessors because the tree pairs and the graph pairs are stored in a different form in the tree pairs strand and in the graph pairs strand, respectively. The function target-address!! returns addresses of the target processors of the propagated pairs for tree predecessors and graph predecessors uniformly. Mark-Predecessor(N-Pair, M-Pair" Pair) ; Activate every graph predecessor of a node N which is not predecessor ; of the node M, where N is a new parent node of C and M is the tree ; parent of the child node C. The nodes N and M have the tree pairs N-Pair ; and M-Pair, respectively. Then set the flag mark// on the graph predecessors. ACTIVATE-PROCESSORS-WITH pre!! _!! maxnum(N-Pair)AND!! NOT!!(pre!! __!! prenum(M-Pair) AND!! max!! :>!! maxnum(M-Pair)) DO BEGIN mark!![target-address!!()]:- 1 ; s e t predecessors END Note that, due to propagation, redundant pairs could appear in the marked predecessors. As mentioned before, there are two problem cases caused by redundant pairs. In the first case, the problem could occur only in graph pairs because in this step we are dealing with replacing enclosed pairs with enclosing pairs while in the second case it could occur either in tree pairs or in graph pairs. In the following algorithm, we will present the solution for these problems. For the first case, in the IF!! clause, we examine whether any graph pair in the predecessors is subsumed by the newly propagated pairs but only check the even processors in the graph pairs strand using evenp!! because every graph pair is stored at the even processors in the graph pairs strand. In contrast, for the second case, we examine whether any graph pair and any tree pair in the predecessors is subsuming the newly propagated pair because if that is true, we do not have to propagate the new pair any further. In both cases, the boolean pvar redundantii is set and additionally, in the first case, the enclosed pair is replaced with the number pair to be propagated. Redundant-Pair-Elimination(PM-Pair-V : Pair) ; Replace the pair at the target processor with the newly propagated ; pair PM-pair-V in the first case, set the flag redundant// on ; the target processor in both cases. ACTIVATE-PROCESSORS-WITH mark! ![target-address!! ()] = ! ! 1

83 DO BEGIN ; check whether it is the first case of redundant pairs IF!! (pre!! >!! prenum(PM-Pair-V)AND!! max!! ___!! maxnum(PM-Pair-V) AND!! evenp! !(self-address!! ()) AND!! self-address!!() ___!! g-lb) THEN pre!![self-address!!()]:= prenum(gM-Pair-V) ; replace the prenum max!![self-address!!()]:= maxnum(PM-Pair-V) ; replace the maxnum redundant!![target-address!!()]:= 1 ; set the flag ; check whether it is the second case of redundant pairs ELSE IF!! (pre!! _
At this stage, the boolean pvar mark!! is set for every predecessor of the given node and the boolean pvar redundant!! is set for the processors at which redundant pairs could appear due to the number pair propagation. The next step is to enumerate processors which are predecessors without redundant pairs. Order-Strand( ) ; Enumerate the marked predecessors. No parameter is needed, ; because the global variable mark!! is already set on the predecessors. ACTIVATE-PROCESSORS-WITH mark!![self-address!!()] =!! 1 AND!! NOT!!(redundant!![self-address!!()] =!! 1")AND!! self-address!!()
Now every preliminary step for mapping each predecessor to its corresponding processor ID in the graph pairs strand is finished. Finally, using the functions T ( x ) and T ( x + l ) , the propagation is performed in the following two steps. First, the copies of the tree pairs of the target nodes are copied to their destinations on odd processors. Then the unique pair to be propagated, V, is propagated to the corresponding even processors. Pos!! stands for a parallel variable that contains the numbers 0, 1, 2 ... assigned by enumerate!! in Order-Strand. Propagate-Pair(PM-Pair-V : Pair) ; Propagate U~ pairs to the targets of propagation. The processor IDs ; for the targets are calculated by T(x). ; Propagate the same pair PM-Pair-V to the targets of propagation.

84 ; The processor IDs for the targets are calculated by T ( x ) + 1.

ACTIVATE-PROCESSORS-WITH pos!! >_!! 0 DO BEGIN pre!![g-lb- (pos!! + 1) * 2]:= pre!! ;T(x) max![[g-lb- (pos!! + 1) * 2]:- max[! pre!![g-lb- (pos[! + 1) * 2 + 1]:= prenum(PM-Pair-V) ; T(x + 1) max!![g-lb- (pos!! + 1) * 2 + 1]:= m a x n u m ( P U - P a i r - V ) END Now comes the top level propagation algorithm which combines the above algorithms. It propagates every number pair of a node C to the targets of propagation which were defined by the predecessors of N. The node M is the tree parent of the child node C. Propagation(N, M, C : Node) ; N - P a i r and M - P a i r are the tree pairs of a node N and a node M, respectively. ; P M - P a i r - V is a pair at C to be propagated.

Initialize-Pvars() Mark-Predecessor (N-Pair, M-Pair) ; mark tree predecessors of N FOR Each number pair P M - P a i r - V at C DO Redundant-Pair-Elimination (PM-Pair-V); eliminate any redundant pairs Order-Strand() ; enumerate the marked processors Propagate-Pair(PM-Pair-V)

; propagate U and V pairs ; do some house keeping

Unset-Pvars END FOR END 6. P E R F O R M A N C E

IN THE GR AND

THE DSR

6.1. S p a c e a n d r u n - t i m e c o m p l e x i t i e s We now analyze how many processors are required for implementing the GR and DSR. Agrawal et al. proved that / (N+I)~ 4 ] number pairs are required to represent the worst case of a bipartite graph G with N nodes [18]. Let N be the number of nodes (the number of tree pairs) and P be the number of graph pairs in a DAG; then in the worst case t.

=

4

1

- N.

(a)

Let k be a predefined maximum number of graph pairs for the GR. For the GR the total space requirement is O(k * N). In the worst case k can be O(N) and the space complexity for the GR is O(N2). Note that we are currently fixing k to be 8 as a good compromise between processor use and efficiency of the algorithm. In the DSR the space complexity is O ( N + 2 , G) = O

N + 2,

4

- N

= O ( N 2)

(4)

85 i.e., the same space complexity in the worst case. However, the DSR does not have the problem of unused processors. Now we analyze the run-time complexity for the GR and DSR. Our parallel algorithms for subclass verification and propagation were presented in Sections 4 and 5. In order to analyze the time complexity of these algorithms, we need to define the following parameters: T~(N, C) : Parallel time to determine whether the tree pair of N encloses the tree pair of C. Tg(N, C) : Parallel time to determine whether the predecessors of N have a graph pair from C. Td(N) : Parallel time to determine every predecessor of a node N. Trn(N) : Parallel time to determine every tree predecessor of a node N. T~ (N) : Parallel time to determine every graph predecessor of a node N. Tr (N, C) : Parallel time to replace pairs at the predecessors of N with pairs from C or mark the processors where redundant pairs may have appeared. Tp(N) : Parallel time to propagate a number pair to the marked predecessors. P(C) : Average number of number pairs in C. In the subclass verification algorithm, there are two possible cases with IS-A-VERIFY (N, C). If N is a tree predecessor of C, the run-time for this operation is Tt. If N is a graph predecessor of C, the run-time is 7'9. Assuming a unit communication time [21], a commonly made assumption, Tt and Tg are O(1). Therefore, overall run-time complexity for subclass verification is constant. One question which arises now is whether there are any differences in run-time complexity between the GR and DSR. The difference between the two representations is not in the verification processing, but in the graph pair distribution. The run-time complexity of the subclass verification for the DSR is the same as that in the GR. Consequently, we have a constant subclass verification algorithm in both cases. In the propagation algorithm for the Double Strand Representation, there are two processing steps required as mentioned in Section 5: one for tree predecessors and another for graph predecessors. There are three phases: (1) identify the tree predecessors (Tin) and the graph predecessors (Tn), (2) replace any redundant pairs (T~), and (3) enumerate the predecessors and propagate number pairs (Tp). We can formulate the average run-time for the propagation algorithm as follows:

T --- TIn(N) -t- Tn(N) + P(C) 9 (T,.(N, C) + Tp(N)).

(5)

As before Tin, Tn, Tr, and Tp can be regarded as constants because within constant processor set size, these do not grow with increasing knowledge base size. Then, we can simplify the run-time complexity to O(P). Similarly, the run-time of the propagation algorithm in the Grid Representation is

T' = Td(N) + P(C) 9 (T,.(N, C) + Tp(N)). By the same reasoning, it can be simplified to O(P).

(6)

85

Processor Number

Utilization

of Processors

.vs.

Number

of

Nodes

100.0 ':

'

i

80.0

,?'.

..' ..~, ;.'

-

'

:

i ..,-~:: ..:~.:.-.;

9 9

. :,."

:.,:.,-

9::. . . . . .

;" ,,,..

60.0

i..i -"

!

....

."

4~ ,.~. ,:': ,.. , .-" ..... ..'

40.0

...-.... i i / i ; ~;" i. ..

"!J'

1:3----O GR ,:.:...............:: D S R

!i

' / r-4_~" "

'

"

s - 4 3 - 4~" _ E3 _ E3_I:--I-

0.0

,

0.0

i

500.0

,

,

1000.0 Number of Nodes

,

i

1500.0

E3-

,

2000.0

Figure 10. Processors Utilization

6.2. Experimental results In this section we present experimental results of the parallel subclass verification and number pair propagation algorithms for the GR and DSR. The experiments were done on a Connection Machine CM-5, which makes use of groups of virtual processors executing serially on real processors. There are 32 real processors on the NPAC CM-5. Eve .ry processor emulates the activities of at least 32 virtual processors. The CM-5 [47] is programmed in *LISP, a dialect of Common LISP, by mapping parallel variables (pvar) onto distinct processors.

6.2.1. Experiments with random data For our experiments, we are using a random generator for DAGs. The following parameters are supplied as input to this generator: the number of nodes (N), the branching factor of each node (B), and the depth (D). Prelimina .ry experiments with several values of B and D showed that the computation time seems to be unaffectedby B and D. This should be expected as we have eliminated the explicit representation of the IS-A links from the outset. Therefore, we limited D = 9 , . . . , 12 and set B = 5. The effect of graph size on run-time was determined for both representations. The number of nodes was varied from 25 to 2000. Graphs have approximately 20% of graph arcs, e.g., a graph with 2000 nodes has about 400 graph arcs. For the GR, assume that k is 8. Then 1K processors are required for 1 to 128 nodes, 2K for 1 to 2 5 6 , . . . , 16K for up to 2K nodes. Processor utilization is very low, only up

87 Subclass Verification Run 0.014

Time

.vs. Number

i

of

N o d e s

i

|

[E~-E]- -Ea, l:~:]d'

0.012

--r (1)

,JX~.Fa "Q"

"15]- -IE]- -

r-a----El GR (: .......... -:. D S R

0.010

E I--e-

rr

0.008

i

,

'

]

_

. : ...... ...:-:.:'.L....:..~,.....:.:~..

....

0.006

0.004

. .:7 9 , :-:~:,:.:'::-::. :.,:~......~.,::::.:..--.~ ::.-:. r~

0.0

500.0

1000.0 Number

1500.0

2000.0

of Nodes

Figure 11. Run Time for Subclass Verification

to 18%. We also determined that the maximum number of actually used rows in the GR was 5. Our experiments with random graphs showed that the number of graph pairs increased at approximately the same rate as the number of nodes. For instance, 48 graph pairs are generated in a 100-node graph, ..., 900 graph pairs in a 2000-node graph. In our experiments, typically, the number of graph pairs is limited to less than half the number of tree pairs. According to that, for the DSR, approximately 1K processors are required for graphs with up to 0.5K nodes, 2K processors for graphs with up to 1K nodes, and 4K for up to 2K nodes with very high processor utilization (up to 99%). In Figures 11-13, the run-times jump at two critical points, namely at the node numbers 500 and 1000. These jumps are due to doubling of the number of allocated virtual processors, i.e., from 1K to 2K and 2K to 4K. As the number of real processors stays the same, every real processor has to double the number of operations it performs. The DSR shows better performance than the GR in terms of both the amount and utilization of processors with increasing knowledge base size (Figure 10). For the comparative run-time evaluation of DSR and GR with various sizes of the knowledge base, we implemented the graph insertion, link insertion, and subclass verification algorithms. The test data for link insertion makes a number of simplifying assumptions which are based on problems described in [23]. Figures 11-13 show the results of experiments with various sizes of the knowledge base. The figures show the run times in seconds

88 G r a p h Insertion Run

Time

.vs.

|

0.20

Number

of

Nodes

,

i

G----El

GR

C:.-', ........ ~,-:: D S R

0.15

O

....

G-O-G-E3

(.)

E ._ i-r

0.10

,~:~.

n'--

i

: -,

))

i;: '~.......... ~..:.,...,..:.::....<.-.,,..,~,->

:

rn-D

0.05

/ .: . . . .

[]

...:...-

,..-.,..~"::z".'.::..::,,:..",..:.,2::<, --':. :..,:'. ~: ".:~.:..'::

.... ,...>..

,.-...

t~_~-:~.::.~ .....

0.00

,

0.0

,

,

500.0

i

,

1000.0 Number

of

1500.0

2000.0

Nodes

Figure 12. Run Time for Graph Insertion

over the total numbers of nodes in a graph for the subclass verification (Figure 11), graph insertion (Figure 12), and link insertion (Figure 13) in both representations. As can be seen, the computation times of subclass verification and dynamic update algorithms in the DSR grow much more slowly than in the GR. It is interesting that the execution time for increasing numbers of nodes in a graph are almost the same for constant processor set size. In summary, when we are increasing the size of the graphs, the cost of implementing the subclass verification and number pair propagation algorithms in the DSR in terms of processor utilization is much lower than that in the GR. For run-times, the DSR becomes better for over 500 nodes. 6.2.2. E x p e r i m e n t s with I N T E R M E D Now we want to show experimental results using an existing large medical vocabulary. We have tested our verification and update operations in the Grid Representation and the Double Strand Representation using the INTERMED (INTERnet version of the Medical Entities Dictiona .ry) system of CPMC (Columbia Presbyterian Medical Center) [35-38]. The INTERMED system currently has about 2,500 medical terms. These terms are related by general relations such as IS-A, but also by domain specific relations such as PHARMACEUTIC-COMPONENT-OF. The following results show the necessity of the Double Strand Representation as well as

89

Link Run Time 0.40

Insertion .vs. N u m b e r

'

of Nodes

'

'

r

I 0.35

G----El GR C:.......... ","-) D S R

0.30

0.25

[

~.~ m .E_ 0.20 v-= r 0.15

~

,:

~ ~ ~ :

o. 1 o

0.05

o.OOo.o

-El- -El- -E]- -E3- -El- -E:]- -E~

~.:-:~:.:>.--:::-...... :-:...... ..:.-:....~,::.... ..:.:~....~-!-~...<.:-:..

::::.:

IE]E]IE~"

'

~oo.o

'

~ooo.o

Number

'

~soo.o

....

2000.0

of Nodes

Figure 13. Run Time for Link Insertion without Propagation

the efficiency of the Double Strand Representation. First, we extracted information from the INTERMED and then translated it into a format fit for our system. Each term in the INTERMED is treated as a node and we maintain only the IS-A relation. We used 8K processors with the Double Strand Representation while 32k would be necessary with the Grid Representation. The average run-times for a subclass verification, a graph insertion, and a link insertion in the Double Strand Representation are 0.0004 sec, 0.023 sec, 0.139 sec, respectively. Figure 14 shows the run-time results for the link insertion algorithm over the number of pairs propagated. As the number of number pairs to be propagated increases, the run-time for the link insertion algorithm increases. This confirms our claim in Section 6.1 that the run-time for the number pair propagation algorithm is proportional to the number of number pairs to be propagated. In the Double Strand Representation, 2494 tree pairs and 1442 graph pairs are generated. However, some nodes have up to 456 graph pairs. This means 449 graph pairs cannot be represented in the Grid Representation because the Grid Representation restricts the number of rows to 8. As it would be quite unacceptable to extend the Grid Representation to 512 rows, this result shows that with real data the Grid Representation is not practical at all, while the Double Strand Representation performs well.

90

Link Insertion Run 4.0

Time

of Pairs to b e p r o p a g a t e d

.vs. Number w

|

3.0

r G) tv~ G) E .m pe::3 rr

2.0

J

J

J

J

1.0

0.0 10.0

0.0 The

Number

of Number

20.0 Pairs

to be

30.0 propagated

Figure 14. Run Time for Link Insertion with Number Pair Propagation

7. C O N C L U S I O N In this paper, we have introduced a new massively parallel representation for class hierarchies that are constrained to be representable by directed acyclic graphs. We call it the Double Strand Representation. This representation maps the node set representation onto a linear space of processors. Processor space is divided into two strands, the tree pairs strand and the graph pairs strand. We showed how the DSR can be used to quickly verify the existence of an IS-A relationship that is possibly the result of transitive closure, and therefore not explicitly represented. We also showed how to implement the propagation of a number pair in parallel. Propagation is fundamentally necessa .ry for eve.ry update operation based on Agrawal et al.'s encoding. Experimental results show that the processing times of subclass verification and number pair propagation are mainly affected by the number of allocated processors. The DSR achieves high performance compared with the GR in terms of processor utilization and run-time.

Acknowledgments We thank James Cimino for making the MED available to us and for explaining it to us. Scott C. Pearson has written the random graph generator that was used for providing us

91

with random test data. We also thank Mike Halper who has improved the presentation of this book chapter. REFERENCES

1. R. J. Brachman, On the epistemological status of semantic networks, Associative Networks (N. Findler, ed.), pp. 3-50, New York, NY: Academic Press, 1979. 2. R. J. Brachman and J. Schmolze, An overview of the KL-ONE knowledge representation system, Cognitive Science, vol. 9, no. 2, pp. 171-216, 1985. 3. W . A . Woods, What's in a link. foundations for semantic networks, in Representation and Understanding (D. G. Bobrow and A. M. Collins, eds.), pp. 35-82, New York, NY: Academic Press, 1975. 4. W.A. Woods, Knowledge representation: What's important about it?, in The Knowledge Frontier (N. Cercone and G. McCalla, eds.), pp. 44-79, New York, NY: Springer Verlag, 1987. 5. A. Borgida, R. J. Brachman, D. L. McGuinness, and L. A. Resnick, CLASSIC: A structural data model for objects, in Proceedings of the 1989 A CM SIGMOD International Conference on the Management of Data, appeared as SIGMOD, vol. 18, no. 2, pp. 58-67, 1989. 6. R . J . Brachman, R. E. Fikes, and H. J. Levesque, KRYPTON: A functional approach to knowledge representation, IEEE Computer, vol. 16, no. 10, pp. 67-73, 1983. 7. R. M. MacGregor, A deductive pattern matcher, Seventh National Conference on Artificial Intelligence, pp. 403-408, San Mateo, CA: Morgan Kaufmann, 1988. 8. B. Nebel and K. von Luck, Issues of integration and balancing in hybrid knowledge, GWAI-87 (K. Morik, ed.), pp. 114-123, Berlin, Germany: Springer Verlag, 1987. 9. S. Bayer and M. Vilain, The relation-based knowledge representation of King Kong, SIGART Bulletin, vol. 2, no. 3, pp. 15-21, 1991. 10. W. Kim, N. Ballou, H.-T. Chou, and J. F. Garza, Features of the orion object-oriented database system, in Object-Oriented Concepts, Databases, and Applications (W. Kim and F. H. Lochovsky, eds.), Reading, MA: Addison Wesley, 1989. 11. C. Lecluse, P. Richard, and F. Velez, O2 an object-oriented data model, Readings in Object-Oriented Database Systems (S. B. Zdonik and D. Maier, eds.), pp. 227-236, San Mateo, CA: Morgan Kaufmann, 1990. 12. V. Soloviev, An overview of three commercial object-oriented database management systems: ONTOS, ObjectStore and O2, SIGMOD Record, vol. 21, no. 1, pp. 93-104, 1992. 13. O. J. Dahl and K. Nygaard, SIMULA-an ALGOL-based simulation language, Communications of the A CM, vol. 9, no. 9, pp. 671-678, 1966. 14. A. Goldberg and D. Robson, Smalltalk-80: The language and its implementation. Reading, MA: Addison Wesley, 1983. 15. S. C. Dewhurst and K. T. Stark, Programming in C++. Englewood Cliffs, NJ: Prentice Hall, 1989. 16. L. Cardelli and P. Wegner, On understanding types data abstraction, and polymorphism, A CM Computing Surveys, vol. 17, pp. 471-522, 1985. 17. S. E. Keene, Object-Oriented Programming in COMMON LISP. Reading, MA:

92 Addison-Wesley, 1989. 18. R. Agrawal, A. Borgida, and H. V. Jagadish, Efficient management of transitive relationships in large data and knowledge bases, in Proceedings of the 1989 A CM SIGMOD International Conference on the Management of Data, (Portland, OR), pp. 253-262, 1989. 19. H. V. Jagadish, A compressed transitive closure technique for efficient fixed-point query processing, (San Francisco, CA), pp. 209-223, 1988. 20. K. Guh and C. Yu, Efficient management of materialized generalized transitive closure in centralized and parallel environments, IEEE Transactions on Knowledge and Data Engineering, vol. 4, no. 4, pp. 371-381, 1992. 21. W. Wang, S. Iyengar, and L. M. Patnaik, Memo .ry-based reasoning approach for pattern recognition of binary images, Pattern Recognition, vol. 22, no 5, pp. 505-518, 1989. 22. J. Han and W. Lu, Asynchronous chain recursions, IEEE Transactions on Knowledge and Data Engineering, vol. 1, pp. 185-195, 1989. 23. E. Y. Lee and J. Geller, Representing transitive relationships with parallel node sets, in Proceedings of the IEEE Workshop on Advances in Parallel and Distributed Systems (B. Bhargava, ed.), pp. 140-145, Los Alamitos, CA: IEEE Computer Society Press, 1993. 24. E. Y. Lee, Massively Parallel Reasoning in Transitive Relationship Hierarchies PhD dissertation, NJIT, 1996. 25. J. Geller, Upward-inductive inheritance and constant time downward inheritance in massively parallel knowledge representation, in Proceedings of the Workshop on Parallel Processing for AI at IJCAI 1991, (Sydney, Australia), pp. 63-68, 1991. 26. J. Geller and C. Y. Du, Parallel implementation of a class reasoner, Journal of Experimental and Theoretical Artificial Intelligence, vol. 3, pp. 109-127, 1991. 27. J. Geller, Innovative applications of massive parallelism, A A A I 1993 Spring Symposium Series Reports, AI Magazine, vol. 14, no. 3, p. 36, 1993. 28. J. Geller, Massively parallel knowledge representation, in A A A I Spring Symposium Series Working Notes: Innovative Applications of Massive Parallelism, pp. 90-97, 1993. 29. L. Kanal, V. Kumar, H. Kitano, and C. Suttner, Inheritance operations in massively parallel knowledge representation, Parallel Processing for Artificial Intelligence, pp. 95-113, New York: North Holland Publishing, 1994. 30. H. Kitano and J. Hendler, Advanced update operations in massively parallel knowledge representation, in Massively Parallel Artificial Intelligence, pp. 74-101, AAAI/MIT Press, 1994. 31. L. K. Schubert, M. A. Papalaskaris, and J. Taugher, Accelerating deductive inference: special methods for taxonomies colors and times, in The Knowledge Frontier (N. Cercone and G. McCalla, eds.), pp. 187-220, New York, NY: Springer Verlag, 1987. 32. M. Evett, L. Speetor, and J. Hendler, Knowledge representation on the connection machine, in Supercomputing '89, (Reno, Nevada), pp. 283-293, 1989. 33. M. P. Evett, J. A. Hendler, and W. A. Andersen, Massively parallel support for computationally effective recognition queries, in Proceedings of the Eleventh National

93

Conference on Artificial Intelligence, pp. 297-302, Cambridge, MA: MIT Press, 1993. 34. K. Stoffel and J. Hendler, PARKA on MIND-supercomputers, in IJCAI-95 Workshop Program Working Notes, (Montreal, Quebec), pp. 132-142, 1995. 35. J. J. Cimino, G. Hripcsak, S. B. Johnson, and P. D. Clayton, "Designing an introspective, multipurpose, controlled medical vocabulary," in Proceedings of the Thirteenth Annual Symposium on Computer Applications in Medical Care, pp. 513-518, Los Alamitos, CA: IEEE Computer Society Press, 1989. 36. J. J. Cimino, P. L. Elkin, and G. O. Barnett, "As we may think: The concept space and medical hypertext," Computers and Biomedical Research, vol. 25, pp. 238-263, 1992. 37. J. J. Cimino, A. A. Aguirre, S. B. Johnson, and P. Peng, "Generic queries for meeting clinical information needs," Bulletin of the Medical Library Association, vol. 81, no. 2, pp. 195-206, 1993. 38. J. J. Cimino, P. D. Clayton, G. Hripcsak, and S. B. Johnson, "Knowledge-based approaches to the maintenance of a large controlled medical terminology," Journal of the American Medical Informatics Association, vol. 1, no. 1, pp. 35-50, 1994. 39. M. P. Evett, W. A. Andersen, and J. A. Hendler, Massively parallel support for efficient knowledge representation, in Proc. of the 13th Int. Joint Conference on Artificial Intelligence, pp. 1325-1330, San Mateo, CA: Morgan Kaufmann, 1993. 40. L. Shastri, Default reasoning in semantic networks: a formalization of recognition and inheritance, Artificial Intelligence, vol. 39, no. 3, pp. 283-356, 1989. 41. L. Shastri, Semantic Networks: an Evidential Formalization and its Connectionist Realization, Morgan Kaufmann Publishers, (San Mateo, CA),1988. 42. L. Shastri and V. Ajjanagadde, An optimally efficient limited inference system, in Proceedings of IJCAI-90, (Boston, MA), pp. 563-570, 1990. 43. L. Shastri, A computational model of tractable reasoning- taking inspiration from cognition, in Proc. of the 13th Int. Joint Conference on Artificial Intelligence, pp. 202207, San Mateo, CA: Morgan Kaufmann, 1993. 44. R. Sun, An efficient feature-based connectionist inheritance scheme, IEEE Transactions on SMC, vol. 23, no. 2, 1993. 45. R. Sun, Integrating Neural and Symbolic Processes, Connection Science, 1994. 46. R. Sun, Robust Reasoning: Integrating Rule-Based and Similarity-Based Reasoning, Artificial Intelligence, no. 1, 1995. 47. Thinking Machines Corporation, *LISP Reference Manual Version 5.0 edition. Cambridge, MA: Thinking Machines Corporation, 1988.

94 E u n i c e (Yugyung) Lee Eunice (Yugyung) Lee received a BS degree in Computer Science from the University of Washington at Seattle, in 1990. She is currently a Ph.D. candidate, expecting her Ph.D. in the Summer of 1996, at the New Jersey Institute of Technology. Her research has been published in workshops on parallel AI and on parallel and distributed systems. Her current research interests include massively parallel knowledge representation and reasoning, object-oriented modeling, and high performance distributed databases.

J a m e s Geller James Geller received an Electrical Engineering Diploma from the Technical University Vienna, Austria, in 1979. His M.S. degree (1984) and his Ph.D. degree (1988) in Computer Science were received from the State University of New York at Buffalo. He spent the year before his doctoral defense at the Information Sciences Institute (ISI) of USC in Los Angeles, working with their Intelligent Interfaces group. James Geller received tenure in 1993 and is currently associate professor in the Computer and Information Science Department of the New Jersey Institute of Technology, where he is also Director of the AI & OODB Laboratory. Dr. Geller has published numerous journal and conference papers in a number of areas, including knowledge representation, parallel artificial intelligence, and object-oriented databases. His current research interests concentrate on object-oriented modeling of medical vocabularies, and on massively parallel knowledge representation and reasoning. James Geller was elected SIGART Treasurer in 1995. His Data Structures and Algorithms class is broadcast on New Jersey cable TV. Home Page: http://hertz.njit.edu/-geller

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 1997 Elsevier Science B.V.

95

Parka on MIMD-Supercomputers* Kilian Stoffel, James Hendler and Joel Saltz Computer Science Dept./UM Institute for Advanced Computer Studies University of Maryland, College Park We have ported the SIMD Parka knowledge representation system to generic MIMD machines. The system has been recoded in C and supported using runtime optimization packages developed in the High Performance Systems Software Laboratory at the University of Maryland. New "scanning" algorithms have been developed for inheritance and recognition inferences. These algorithms have been tested with both random networks and on a recoding of the ontology of the CYC knowledge base as well as on large planning case-bases. Tests show that the new version is significantly faster than the SIMD system, and that it promises to scale well to knowledge bases orders of magnitude larger than CYC. 1. I n t r o d u c t i o n In the past, research in the field of (very) large knowledge bases has primarily concentrated on defining KR languages and analyzing their complexity. This research has resulted in a wide range of well-studied languages, with much known about their theoretical performance. Less work, however, has been spent examining details of efficiently implementing such languages, or on the scaling properties of these KR systems using large, real-world, applications. Up until recently, this negligence has largely been justified with the excuse that very large KBs don't yet exist, and that the community could postpone dealing with these issues. However, this excuse is no longer valid. Current research has resulted in the development of extremely large KBs including not only the "common sense" KB of Lenat's CYC [15], but also (much larger) KBs including machine-readable dictionaries [14], large ontologies [9], and very large case-bases for AI planning systems [13]. Despite this, most of the current KR systems are not able to accommodate these KBs, which may contain millions of assertions about many thousands of objects. The ParalleL Understanding Systems (PLUS) Laboratory at the University of Maryland has been working for a number of years on parallel support for very large knowledge bases. The main system to come out of this laboratory was the Parka system, which was the SIMD implementation of a frame-based AI KR language on the CM-2. The system was shown to be extremely fast, and to perform extremely well on very large knowledge bases *This research was supported in part by grants from NSF(IRI-9306580), ONR (N00014-J-91-1451), AFOSR (F49620-93-1-0065), the ARPA/Rome Laboratory Planning Initiative (F30602-93-C-0039, DABT-63-95-C-0037), the ARPA I3 Initiative (N00014-94-10907). Dr. Hendler is also affiliated with the UM Institute for Systems Research (NSF Grant NSF EEC 94-02384).

96 (having on the order of 105 frames). Details of the Parka system and its implementation, as well as a comparison of Parka to other efforts in VLKBs, can be found in [8,6]. There were two main problems with Parka in its SIMD form. First, there was the obvious problem that SIMD computers are currently less popular than MIMD platforms. Second, the ability to scale to much larger knowledge bases than what could be maintained in the memory of the CM2 (i.e. we needed at least one (virtual) processor per node) was constrained by the hardware. This led us to reimplement Parka on a more conventional MIMD platform, also moving from *Lisp 2 to C. In this paper, we describe this reimplementation effort. In particular, the goal of this project is to reimplement Parka for generic MIMD computers. To reach this goal, we use the CHAOS-Libra .ry developed by Dr. Saltz and his students at the High Performance Computing Software Systems Laboratory at the University of Maryland [17]. The language is being implemented in as portable a manner as possible so as to work on a number of different MIMD computers. We will first overview the UM Parka KR project and present some of the applications for which we are developing very large KBs. We then describe in the next two sections Parka's basic data structures and their current implementation. In Section 5 we will analyze the potential parallelism in the system. The central element of the Parka system is the inference engine which is described in Section 6. We present some performance results in Section 7 and we conclude with a summary of the contribution of this work and future research directions. 2. P a r k a O v e r v i e w The Parka system is a frame-based AI language (sometimes called a "property/class" system in today's literature) which was designed to be supported efficiently using highperformance computing techniques. The goal of the project is, and has always been, to develop a fairly traditional AI language/tool that can scale to the extremely large size applications mandated by the needs of today's information technology revolution. More specifically, Parka allows the user to define a frame-based knowledge base with class, subclass, and property links used to encode the ontology. Property values can themselves be frames, or alternatively can be string, numeric values, or specialized data structures (used primarily in the implementation). The language allows exceptions, in the form of multiple-inheritance, and provides extremely efficient (and efficiently parallelizable) algorithms for performing inheritance using a true inferential-distance-ordering calculation [11]. 3 Parka has also been shown to effectively compute recognition, and also to handle extremely complex "structure matching" queries - a class of conjunctive queries relating a set of variables and constraints and unifying these against the larger KB. While it is difficult to exactly compare KR languages, a very loose categorization would put Parka as more expressive than Classic [3], due to the presence of exception handling, 2*Lisp is the parallel version of Lisp which runs on the Thinking Machine Corporation CM-2 and CM-5 multiprocessors. 3Although it has been shown that IDO with fully general exceptions (i.e. cancellation links) is exponentially hard, we've demonstrated that IDO with multiple inheritance exceptions can be computed in polynomial time and is efficiently parallelizable [18].

97 although slightly less expressive than Loom [16] due to the lack of extensive numerical capabilities. A full description of the language, and more details on past results can be found in http://www.cs.umd.edu/projects/plus/Parka. Parka is being used in a number of research initiatives at the University of Maryland. Some of the projects using Parka include 9 A Knowledge Discovery in Databases application for a large medical knowledge base. In particular, it has been noted that data mining systems can benefit from the use of KBs to provide semantic information for data mining [4]. We are exploiting this sort of information by using medical knowledge in the data mining of an OB-Gyn patient database containing records on over 20,000 patients. The current Parka KB has over 1.2 million assertions [20].4 9 Several Case-based planning applications including the memory-intensive case-based planning system, Caper, developed by Brian Kettler in his recent doctoral thesis [13]. The Caper KBs, the largest of which contains over 2,000,000 assertions, are described later in this paper (section 4). Parka is also being used to support a logistics planning project, jointly being developed with MITRE Corp. [10], with a KB currently containing over 650,000 assertions. 9 A recent project involves the use of Parka as a hybrid knowledge and database for storing and retrieving designs and process plans for mechanical products, as part of ongoing effort to develop a new hybrid variant/generative approach to process planning in manufacturing. Parka's content-based retrieval of designs and plans is important in order to achieve the desired functionality of the process planning system, and will represent a significant advance over current "variant" approaches to process planning, which (among other things) involves retrieving process plans from databases based on fixedlength alphanumeric keys. The KB for this project is currently under construction, but is expected to dwarf those described previously, since current part databases are extremely large, containing information about the machining of thousands of parts. 3. Parka~s basic d a t a s t r u c t u r e s The need for portability has influenced the choice of the environments to implement Parka: C o m p i l e r : The compiler used was a standard ANSI C one. This compiler exists on all of the computers we currently access, particularly including the IBM SP-2 and the CRAY T3D. C o m m u n i c a t i o n L i b r a r y : Our implementation is based on functionality provided by the CHAOS run-time support package which provides scheduling libraries that have been ported to a wide range of MIMD supercomputers. (Work on the Parka project has actually led to a new scheduling approach which is being integrated into the CHAOS libraries[19]). 4We report KB sizes in "assertions," basically the number of links in the semantic network corresponding to the frames, as this accurately reflects the total number of relations between items in the KB and corresponds directly to the "concepts" in a description logic representation.

98 We have focused on the design and implementation of a very general system that can easily be ported to the different supercomputers. This means that currently, no special optimizations are done using the specific characteristics of any one machine. (Future work will include optimizing on several important architectures.) 3.1. T h e d a t a s t r u c t u r e s of t h e K B To use CHAOS and other generic tools, it was necessary to reimplement the Parka system to be based on arrays, rather than on the linked lists of the earlier *Lisp implementation. There are three major structures used: F r a m e s : The primary data structure used in the new implementation is a frame. A frame is an array of a variable size. Each element of the array points to an element of the Network Array described below. This pointer defines the properties of the frame. If an element points to a "normal" frame then the element represents an ISA-link. If the pointer goes to a frame that is characterized to be a property, then the link defines an explicitly valued property of the frame. This semantic difference allows us to separate the links into two sub-arrays, one for ISAs and one for properties.

A property frame (property) has a different structure than a normal frame. The basic structure of a property frame is again an array. This array contains two subarrays of the same size. For each non-ISA property in every normal frame, there is a pointer to a property frame. In the property frame, there are two links which are created, each in one of the sub-arrays. The link created in the first sub-array points back to the frame for which the property was defined. The second link points to the value for the property of that frame. In addition, a property frame also contains the arrays of a normal frame (see Figure 1). (This allows properties themselves to be frames with hierarchies and other properties, which was not allowed in the earlier SIMD system.) N e t w o r k - A r r a y : The network array is an array which holds all of the normal and property frames in the system as sub-arrays. (This array can be quite large for very large KBs.) F r a m e - T a b l e : The frame table is an array of strings. In each element of the frame table the name of a frame is stored. The name corresponds to the name of the frame located at the same position in the Network Array (e.g. the fifth element of the frame table contains the name of the fifth element of the network). This allows a mapping from "human" to machine descriptors, for use in querying, etc. This representation of the KB has several considerable advantages. 9 The representation of the KB as an array (of arrays) allows us to use standard communication libraries like CHAOS. In addition, computation over array-based data structures is a well-studied area for MIMD Systems. 9 The separation of the properties from the ISA links allows a fast parallel implementation for inheritance algorithms.

99

Frame

I I ISAS I I I I I~1 II portiesII

~

Property

I

II

I

FROM

II

I I I

I I I II ISAS

Frame

I

I I Properties I I I I I

IIIIIIISAS I I Properties I I II

'

I1

I

Figure 1. Frame Arrays in MIMD Parka

9 The replacement of the single property link by a frame permits efficient implementations of different parallel algorithms such as recognition and structure matching. 9 A separate name space (Frame Table) can be used to calculate starting points for scan operations without the presence of the whole KB.

4. C u r r e n t S t a t u s

We have currently reimplemented all of the major algorithms of the Parka system the creation of frame-based semantic networks, the performance of inheritance operations thereon, the execution of recognition queries and structure matching queries, all of which we explain in the following sections. 4.1. C o n s t r u c t i o n of t h e F r a m e s , t h e N e t w o r k a n d t h e F r a m e T a b l e There are three methods for creating and/or loading the networks on a parallel machine. 1. The KB is defined directly by a program using commands such as define_frame, define_property, define_isa and assert. 2. A KB can be read in from an ASCII file containing these same commands. 3. If a KB is once constructed then the memory can be dumped on the disk. The dump files can be read in again later. It is possible to generate either a dump of the whole network or a distributed dump for multiprocessors. In the latter case, a separate dump for each node is created. All three methods can be used to generate one network on a single processor or a network being distributed over the nodes of the parallel computer. There are two predefined

100 methods that distribute the network in blocks or stripes. (It is also possible to use other methods to distribute the network which are predefined or user-defined.) In the first two methods (where a network is created) a consistency check for the whole network is performed. For the third method it is assumed that the network is correct. 4.2. Inheritance For the efficient performance of inheritance, we require a fast algorithm for scanning the ISA hierarchy. The efficiency of this algorithm is crucial to supporting very large knowledge bases because, as we argue in [8], most interesting inferences in large knowledge bases are dependent on the efficiency (and correctness) of the inheritance algorithm. 4.2.1. The Scanning A l g o r i t h m The scanning algorithm is dependent on the distribution of the Network Array among the processors. The current implementation supports the two predefined types (block and cyclic) of distributions. The user is free to define other distributions, but all of these distributions must be static. In principle, there are two approaches as to how the scan algorithm could work: 1. To expand the ISA hierarchy of a frame, a processor has to find out on which node the frame is located. If the frame is on the local node the ISA hierarchy can directly be expanded. If the frame is installed on a remote processor, then the local processor could ask for a copy of the frame from the remote processor and could then expand the hierarchy. This is essentially a "load balancing" solution, as frame information is propagated amongst processors. 2. The local processor has again to locate the frame in question and expand it directly if its local. Otherwise it will inform the remote processor to expand the corresponding frame. This approach (which we use) has more of a scheduling burden (coordinating communication) but needs less explicit load-balancing. In the second approach, the work load is dependent on the distribution of the Network Array. Each frame is expanded on the processor which hosts that frame. The first approach does not contain such an implicit distribution of the work load. Instead, it would require that an explicit distribution of the work must be defined. However, the amount of work to realize during the expansion of a frame is very small (only a few instructions) compared with the communication overhead. It would be very hard to implement an efficient load balancer because of the supplementary network load produced by the balancer. For this reason we chose the second approach for our implementation. Each frame is expanded by the processor on which it is stored. This leads to the following basic scan algorithm as used in this implementation: I. locate the starting frame 2. look up the ISA array of the frame 3. send msgs to all processors which host a frame defined in the ISA array 4. receive all msgs from remote processors

I01

5. f o r each r e c e i v e d frame goto s t e p 2 This program is executed in SPMD mode on each processor of the parallel computer. Based on this scan algorithm it is now possible to define an inheritance algorithm.

4.2.2. Implementation of the Scanning Algorithm Our programming model attempts to have as much of a data-parallel flavor as possible. This places restrictions on the types of task-parallelism we allow. We list these restrictions below. 1. The task-graphs (network array) are static and acyclic. 2. Each data item (frame) is updated by a unique task and all data-items updated by a given task are on the same processor. This allows an owner-computes strategy. 3. The same task-graph is repetitively used for different data-sets. This allows runtime preprocessing optimizations. These restrictions are not severe and the previously described implementation of the Parka system satisfies these restrictions. The nodes for the task-graph are distributed across processors based on a user-specified distribution and the computation is distributed using the owner-computes rule. Since the task-graph computation is iterative, certain optimizations can be performed once in a preprocessing step and reused. One such optimization is to perform a topological sort of the DAG, thus dividing it into levels. As shown in Figure 2a, the synchronization requirements can now be confined to the levels thus increasing the computation granularity. This method has been suggested in [17], and was used to parallelize sparse matrix codes with loop carried dependencies. However global synchronization does not work ve.ry well because the synchronization constraints affect the load balancing. All processors have to wait for the slowest processor at each level. The total computation time is __~. #/=processors Comp (procj, leveli) ) 5 thus z~i=v'#teve~sl(,,,a~j=l The alternative approach, considered in [5] is to use better distribution mechanisms to increase locality and then use low-latency active messages to communicate the data that does need to go off-processor. The arrival of data automatically triggers computation, thus the synchronization is implicit. The synchronization requirements of this method are ve.ry relaxed and thus the load-balance is better. The total computation time is #processors [W-,#levels Comp (procj, level~)). While the computation time is better than max~=l the level-synchronized scheme, the communication/synchronization time may be worse since more messages are sent, even though the amount of data sent is the same. [5] found that high efficiencies could be achieved on the CM-5 using its very low overhead communication support. A drawback is that a data flow programming model must be used. Our approach is to use the level-based data-parallel model but relax the synchronization constraints by using split-phase synchronization mechanisms. This allows processors to continue processing incoming data from the next level while waiting for the slowest 5Where Comp(proci,leveli) is the time used by processor j to compute level i.

102

DO K = 1, num_levels MPI_scatter(level(k-I)) MPI_gather(level(k)) COMPUTE(level(k)) END DO

a) Collective synchronization

DO K = 1, num_levels id = FZY_scatter(level(k-I)) WHILE ( FZY_recv(id, data) LEVEL_NOT.DONE)) COMPUTE(level(k)) END DO

b) Fuzzy synchronization

Figure 2. Collective/Fuzzy synchronization

processor from the previous level. Since each processor does not progresses to the next level until it has processed all nodes at the current level, the skew is limited to a maximum of one level. This is most useful when a processor can have a high load at one level and a low load at the next level. In such a case, that processor can catch up without slowing the others down. Figure 2b shows how a DAG could be parallelized using such a fuzzy barrier. Each processor sends out all the outgoing data at the end of each level, but processes each incoming data-chunk as soon it arrives. The condition NOT__DONE remains true until all the incoming messages from a level have arrived, or when a termination condition has been detected. Thus even if processor A is on level X, processor B can begin to process data on level X + I (until it eventually waits to receive A's message). Though, we do not discuss it here, the termination detection check can be incorporated to implement branchand-bound algorithms. The computation time is thus better than that of the level-based scheme with no extra communication overhead. Figures 3 and 4 shows the results of using fuzzy synchronization and the level-based runtime techniques on the inheritance network. These experiments used an inheritance network with 500,000 classes and instances (i.e. the DAG has 500K nodes) and several links between each (for a total of about 1.5M links). As can be seen from Figure 3, fuzzy synchronization provides significant benefits over using collective synchronization. Figure 4 shows that high efficiencies can be obtained over a wide range of platforms using these techniques. The efficiencies are around 70%. 4.2.3. T h e I n h e r i t a n c e A l g o r i t h m The inheritance algorithm we use is based on the Touretzky IDO algorithm as modified in Parka. The theoretical worst case complexity of this algorithm is O(d 9 n/k) where d is the depth of the ISA-hierarchy, n is the number of frames in the network and k is the number of nodes of the parallel computer. The idea of our algorithm is that in one pass through the ISA-hierarchy, we can simultaneously create a list of all properties (direct and inherited) for each node and also a list of all properties "overwritten" by more specific properties (therefore being eliminated from the first list). The difference of these two lists contains the properties inherited according to IDO. We discuss the efficiency of this algorithm in section 7.2.2 (Table 1).

103 Barrier vs. Fuzzy Synchronization on SP-2

.

.

.

.

.

.

mo, Fu~,-':4--.

|

Ide'~,l -m--. 1

...

..- ..

.o

9-'"'"'"'"

0 0

!

!

2

4

i .... "" . . . . ~....

!

6

8

10

12

14

16

#nodes

Figure 3. Barrier vs. Fuzzy Synchronization

5. P o t e n t i a l P a r a l l e l i s m in K B

In this section, we discuss the kind and degree of MIMD-parallelism we exploit. The parallelism of a KB is hidden in the ontology. In our case the ontology is represented as a directed acyclic graph. In such a graph the parallelism usable by the scanning algorithm is defined by the branch out of each frame in the network (i.e. the number of ISA links per frame). A network in which each frame has only one ISA link defined has no parallelism exploitable by the scanning algorithm. Each frame has to be completed before the descendent frame can be attacked. Such a graph imposes serial execution of the scanning algorithm. On the other hand, a graph with depth two and a large amount of ISAs defined for the root frame has the highest possible degree of parallelism for the scanning algorithm. This algorithm therefore seems to be well adapted for AI KBs. Generally, these are not very deep but often very bushy. The fact that the parallelism is low for narrow ISA hierarchies is not very harmful. In addition, in several types of queries there are multiple scans to perform. These scans can be combined and executed as one single larger scan. Such a method is used to implement recognition queries (see Section 6.1) and in more complex inference methods like the structure marcher (see section 6.2).

6. C o m p l e x I n f e r e n c i n g in M I M D P a r k a As mentioned previously, we have built several other inference classes on top of the inheritance mechanism. In this section, we discuss the two most important: recognition queries and structure matching inferences.

104 ,

i

i

sp2

sp2/mpi -4--.

x.,..~, *'~-..~

4

2

u

1

"

0.5

Paragon -I~--.

~"-.-......

cm5 -~(,.-t3d/pvm -,&---

-.~.,.~.. ~'-...~.,..

"'-.,..... ""'~''" "1~.,..~..,.

9

"~'~"~''~.~.

99~-..~.~._.,.

".-..~.,..

........

-......:::...., .

025

0.125

0.0625

i 2

!

!

4

8

#nodes

Figure 4. Performance of parallelized Inheritance Network code: Timings shown include the native communication systems for the SP2, the Paragon and the CM5 use the native communication system as well as MPI and PVM times for the SP2 and T3D machines respectively.

6.1. Recognition We have implemented a new version of Parka's recognition algorithm - its means for handling conjunctive property queries. 6 The algorithm is based on the special structure of the property frames and again uses the scanning algorithm. Essentially, we handle these queries as a set of inheritance queries and some join operations, thus we can maintain the complexity of O(d 9 n/k). To process the conjunctive property query, we split the conjunction into the conjunctive elements. This produces a list of simple expressions of the form (prop1, x, y) (where x and y can both be variables, or one of the two can be a constant). So we have to find all the frames which have an explicit or an inherited definition for the given property and their value in respect to the property. It is easy to find all explicitly labeled frames for a given property. The sub-array of the property frame with the back pointers contains all the pointers to the frames which are directly labeled for this property. The other sub-array of the property frame holds the pointers to the value for the property. On one hand, it is possible to create a list with all the frames for which the property is directly defined and which have the searched value. On the other hand, a list with all the frames which do not have the value being searched for, but which are able to be "overwritten" with the correct values during the inheritance can also be created. The basic idea of the recognition algorithm is that all the frames which have the required property value are activated and start to send information downward along the ISA links. 6The SIMD algorithm for performing this was one of the main contributions of the original Parka system[7].

105 During this scan, all frames in the ISA hierarchy will inherit the values of the property. If the inherited information arrives on a node which is directly labeled for a property, but with a different value, then all the descendants of this node are commanded to remove the value they inherited from the same node as this node. Thus, it is possible to correctly handle all inherited properties according to IDO in one single scan. We can find the solution to each sub-query defined by the elements of the original conjunctive property query in one single IDO run (scan run). The result of this run is a number of sets containing the frames and their values for the analyzed property. Frames which are elements of all the sets and their corresponding values are the solutions of the original conjunctive query. T The potential degree of parallelism of this algorithm is high, because there are multiple starting points (all the frames with direct correct property values). At the same time it is possible to send multiple properties across the ISA hierarchy during one scan step which results in a large scan graph with a high degree of parallelism.

6.2. Structure Matcher The second even more complex inference method is the structure matching algorithm. It allows general relation-based structures to be retrieved from a knowledge base. The algorithm takes a conjunctive query similar in form to the recognition query, but with binary constraints (i.e. with two variables). A set of these conjuncts defines a graph structure, where variables or constants are the nodes and predicates are the arcs. The algorithm compares the probe structure with the knowledge base and returns a set of all satisfying matches. Our description of the problem of structure matching follows that given in [21]. A knowledge base defines a set, P, of unary and binary predicates. Unary predicates have the form Pi(x) and binary predicates have the form P.~(xl,x2), where each x~ is a variable on the set of frames in the KB. An existential conjunctive expression on these predicates is a formula of the form 3 x l , . . . ,Xm " 1 ' 1 A P 2 A , . . . , APn, where n >_ 1. Our task is to retrieve all structures from memory which match a given conjunctive expression. Therefore, we would like to find all such satisfying assignments for the x~. We can view the problem of matching knowledge structures in two ways. The first is as a subgraph isomorphism problem, s We view variables as nodes and binary predicates as edges in a graph. We want to find structures in memory which "line up" with the graph structure of the query expression. The other way to view the matching problem is as a problem of unification or constraint satisfaction. If we can find a structure in memory which provides a consistent assignment to the variables xi (i.e., unification), then that structure matches the conjunctive expression. 6.3. O v e r v i e w of the algorithm The structure matching algorithm operates by comparing a retrieval probe, P, against a knowledge base (KB) to find all structures in the KB which are consistent with P. This match process occurs in parallel across the entire knowledge base. A Parka KB 7This can be formalized as a set of joins in the relational calculus. SMore specifically, this is a problem of subgraph isomorphism with typed edges, the edges being the relations in the KB between frames.

106 consists of a set of frames and a set of relations (defined by predicates) on those frames. Most relations are only implicitly specified and so must be made explicit by expanding the relation with the appropriate inference method. By computing inherited values for a relation, all pairs defining the relation are made explicit. We currently allow only una .ry and binary relations. 9 A retrieval probe is specified as a graph consisting of a set of variables V(P) and a set of predicates (or constraints) C(P) that must simultaneously hold on frames bound to those variables. The result of the algorithm is a set of k-tuples, where each k-tuple encodes a unique 1 - 1 mapping of frames to variables in V(P), that unifies the description of the structure in memory with C(P). Figure 5 shows a simple example of the structure matching algorithm; given a knowledge base represented as a semantic network and a probe graph, the algorithm finds all matching subgraphs in the network which are isomorphic to the probe.

Probegraph

Fragmentof semanticnetwork color

(~

~

color

jj

color

I

StructureMarcher

/ I

Matching subgraphs

Figure 5. A simple example of structure matching.

The set of frames that can bind to each variable is initially restricted by a set of constraints indicated by unary predicates. Each una .ry constraint may only constrain the gFuture versions of Parka may support higher-order predicates.

107 values of one variable. Examples of these constraints are "X is a dog" or "the color of X is yellow". We allow set theoretic combinations of the,unary constraints; for example "X is a dog and the color of X is yellow", or "X is a dog but X is not yellow." The domains for each variable are maintained throughout the match process and are further restricted as more constraints are processed. Constraints between frames bound to variables are specified by a set of binary constraints. For example, we can say "the color of X must be Y", or "X is a part of Y", for some X and Y in V(P). Binary constraints are processed by "expanding" the binary relation given in the constraint. By expansion we mean that all pairs participating in a relation R in the KB are made explicit by invoking the inference method for the associated predicate. The pairs allowed to participate in the expanded relation are restricted to those in the domains of the variables related by R. For example, a binary constraint may be expressed as (Color X Y). In this case the values for each concept in the domain of X are computed for the color predicate and pairs that have values outside the domain of Y are excluded. Two additional binary predicates, "eq" and "neq," are provided to provide co-designation and non-co-designation of variables. These constraints act as a filter, eliminating any tuples from the result for which the constrained variables are (not) bound to the same frame. (Other binary predicates could be supported as well, such as inequality and other numeric constraints. However, in the current implementation, we do not provide support for these constraints.) The result of a structure match is a set of k-tuples, where each tuple corresponds to a satisfying assignment of the k variables. Alternatively, the result can be viewed as a relation. Initially, the matcher begins with an empty set of relations. During the match, several intermediate relations may be constructed. Simple binary relations result from the expansion of a binary constraint. These are later fused (via a relational join operation) or filtered (via co-designation or non-co-designation) until a single relation remains. The algorithm selects binary constraints to process using a greedy algorithm based on a simple cost model stored in the meta-data. 7. Testing To realize the tests it was necessary first to program a network generator. The networks generated by this program were afterwards used to test the implemented KB.

7.1. Network generator The network generator is able to generate two kinds of KB formats: 1. ASCII file 2. memory dump files There are three characteristics of the network that can be defined: 1. the branching factor in the network 2. the number of levels

108 3. the number of levels with the maximal size Further it is possible to randomly generate properties for different frames. There is a parameter to define the density of these properties. This allows us to duplicate some of the tests made on the original Parka system on the SIMD CM-2 computer (cf. [8]). 7.2. Results on I B M S P 2 (16 processors)

7.2.1. Degree of Parallelism As mentioned above, the performance of the scan algorithm depends on the structure of the ISA hierarchy. To test this, we created two networks, both with 2,500 frames. The first network is 2, 500 levels deep and each level consists of one frame. The second network has two levels, a first where one frame is inserted and and a second with 2, 499 frames. For the sequential algorithm the amount of work is the same for both networks. This is also true for the parallel scanning algorithm but the big difference is that for the first network there is little opportunity for parallelism - one level has to be executed after another. In the second network it is different. All the frames on the second level can be treated in parallel if they are distributed among the processors. To execute this first test, we used 4 processors on an IBM SP2. A cyclic network distribution was used. Thus, processor 1 owned the frames {0, 4, 8, ...}, processor 2 owned the frames {1, 5,9, ...}, etc. The execution of the scan algorithm on the first network took 0.4 s. The scan through the second network only 0.008 s. Note that although the theoretical speedup should be no more than 4, we have a speedup of 50! This is related to the communication system of the SP2 (and most other MIMD machines). For the first network we have to communicate 2499 short messages, in the second one long message. The throughput of the communication system is much higher on a small number of long messages then on a large number of short messages. This result seems to be very promising for real KBs, because most of the existing KBs have a very flat ontology. For example, in a recoding of the ontology of CYC for Parka (see section 7.2.3), the deepest subgraph we found was only 27 levels deep, whereas maximum branch out was over 1,000.

7.2.2. Scaling and Speedup As a second test, we want to analyze how the scanning algorithm performs if the configuration of the computer changes. The results presented here are again based on generated networks. The branching factor was fixed to 3. The depth of the ISA-hierarchy was varied from 9 to 12, generating networks of about 29,500, 88,500, 265,700, and 797,000 frames respectively. (As a point of comparison, our encoding of the ontology of CYC (Section 7.2.3) has about 32, 500 frames.) A property to find was placed on the root node and the inheritance was done by one of the leaf nodes (this corresponds to the worst case for the previous Parka implementation). The absolute values represented in Figure 6 and Table 1 are extremely fast, however we note that they are not so important because the programs were not optimized for the SP2 where the tests were done. What is interesting is the comportment of the algorithms. There are two points that seem to be interesting: 1. the scalability in terms of processor number

109

nbr. nodes/netsize .... 1 0.115 29524 0.345 88573 1.037 265720 797161

2 0.060 0.175 0.523 1.689

4 0.033 0.091 0.265 0.735

8 0.021 0.050 0.137 0.388

16 0.017 0.032 0.088 0.201

Table 1 Inheritance

Scanning-Performance using CYCLIC-distribution 1.6

-

u

1 0.8 ~ 0.6 [ __-.......... 0.4 time [sec]

0.2

--

""

i

""'"

,

"''"'.

~

i

l

,

,

,

,

,

i,

i

,-

29524 ~ 88573 .-F-- ~<. . . . . . . 265720 4~- .... "" x 797161 .x.-

"......+..

" ~ " 9 ,...)

<

1

Figure 6. Scanning Performance

I

I

I

2

3

4 5 nbr. nodes

I

I

I

6

7 8

16

110 2. the scalability in terms of network size

On both measures, we see that the algorithms perform quite well. In addition, we believe further speedup would be available if we optimized the algorithms for this particular platform.

7.2.3. Results for the C Y C Ontology We have performed a number of tests on KBs that were not randomly created. The largest one we had used previously in testing the SIMD version of the system was the CYC ontology. As previously reported in [7], we encoded the ontology of version 8 of MCC's CYC system into Parka. In our version, this ontology has about 32,000 frames and about 150,000 assertions relating them (property and ISA links). The biggest difference between the generated networks and CYC is that the test network had only a single property on the root, thus making all "subgraphs" equal to the entire KB. In CYC the ISA hierarchies under particular properties are only subnetworks of the full 32,000 frames. We hoped to directly compare the SIMD and MIMD versions using queries on the CYC system, but most of the tests we had run previously were, basically, too easy for the new version. In particular, the subnetworks in CYC are all much smaller than the smallest test network described in the previous section. For this reason simple queries like "What is the color of frame X" could not be used to show parallel effects - their time was on the order of 50-100 #seconds on a single processor! Thus, instead of timing single inheritance queries, we generated a new class of recognition-like queries that were designed to stress the new system. To start with, we tried the timing of recognition queries in some of CYC's largest subtrees. The results were quite promising, but again too fast for showing parallel speedups. For example, using the scan algorithms, we executed some recognitions with more than twenty conjuncts and had single processor response times of under 1/100 of a second (compared with about 1 second for the SIMD system). To further stress the scanning algorithms and to explore speedups, we designed a new set of inferences specifically to that purpose. In particular, we used queries that would include great amounts of both inheritance and recognition. These queries were of the following form: "Give me all the frames which have one or more properties in common with frame X." To be sure to get "slow" times, we ran these queries on frames in big ISA sub-hierarchies of CYC. Two of the biggest, plant and animal, were chosen for testing, since they had very large numbers (in the tens of thousands) of other frames which shared at least one property. The execution time for the queries "Give me all the frames with the same properties as Animal" and "Give me all the frames with the same properties as Plant" are presented in Figure 7, which compares these times to the optimal speedup curve. As one can see the recognition algorithm behaves very well. The efficiency is about 75%. That is very high with respect to the small amount of CPU time needed for the inheritance algorithm. The queries represented in the Figure 7 are much more complex than standard queries in CYC.

111

Inheritance-Performance in CYC 8

I

I

I

t~..

Animal ~-Plant + - _

4 2 time [sec] 1 0.5 0.25

1

I

I

I

2

4 nbr. nodes

8

16

Figure 7. Recognition-Performance in CYC

7.3. R e s u l t s for U M - T r a n s l o g A second domain used for testing the MIMD implementation was on knowledge bases that were created as part of doctoral thesis work on the development of a system for memory intensive case-based planning [12]. Part of this project involved the automatic seeding of large case memories by a generative planner. One domain used in this work was the "UM Translog" domain, a logistics planning domain developed for the evaluation and comparison of AI planning systems. Case-bases of various size were created, with each containing a number of plans, derivation information, and planning related ontologies. 1~ To measure the performance of the structure matcher we used the UM-Translog KB in different sizes (10 cases, 100 cases and 200 cases). The results presented in this section are especially interesting for two reasons. On one hand we are presenting the timings for the structure matcher, the most complex retrieval algorithm implemented in Parka to date. On the other hand we are also able to show, that the new Parka system has the capability to handle very large KBs on a single Sparc workstation. The queries we used to test the system are presented in detail in Appendix A. These are all queries used by the case-based planning system during its problem solving. They were chosen at random from a large number of stored queries. The results are summarized in Table 2. Table 2a presents the timings for the six queries on a SPARC 20 using the three different KBs, and Table 2b presents the timings on 1, 8 and 16 nodes of an SP2 using the 200 case KB (the largest of the three). The actual sizes of these KBs are shown l~ full description of the domain, the complete specification of the planning operators used, and the casebases themselves are available on the World-Wide Web at http://www.cs.umd.edu/projects/plus/UMT/.

112

II Query 1 2 3 4 5 6

20 CB 1020 195 225 630 675 405

100 CB 4740 1305 1470 3570 3600 585

200 CB II II Query 6990 1635 1725 4590 4605 645

a: Timings on a SPARC 20 for a 20, 100 and 200 cases KB

1 3041 713 753 1997 2003 284

8 546 129 135 361 360 52

16 313 75 79 205 206 29

1:8 69.6 69.1 70.0 69.1 69.5 68.2

1:16 II 60.7 59.4 56.6 60.9 60.8 61.2

b: Timings on a 1, 8 and 16 nodes of an SP2 on 200 case KB

Table 2 UM-Translog Timings[msec] (serial and parallel).

II cases I frames I structural links 20 11612 26412 100 59915 130558 200 123173 266176

property links II 114359 800481 1620456

Table 3 Sizes of the UM-Translog KBs.

in Table 3, where frames is the number of nodes in the DAG, structural links are those in the ISA ontology, and property links are all others (i.e. the number of the edges in the DAG is equal to the structural links plus the property links). (We believe the 200 case case-base to be the largest meaningful semantic net created to date. It contains about 1.8M assertions concerning over 123,000 entities.) As can be seen, the sequential timings range from under a second for the simplest query to about 7 seconds for the most complex. On the parallel system all queries were executed in under one second, with the simplest query taking only 29 milliseconds on 16 processors, and the most complex taking only 303 millisecond. Table 2b also shows the efficiencies of the parallel algorithm. Even in the current non-optimized form, the efficiency averages about 69.3% for eight processors and 59.9% for 16 processors. 8. F u t u r e work We are currently working on pushing the current implementation and developing new algorithms and applications. Algorithmically, current work is focused on doing empirical research on how these algorithms scale to much larger networks and greater numbers of processors. In particular, the current implementation is done on a small IBM SP2 machine (16 "wide" nodes). The next step will be to test these programs on bigger SP2s to see how the scaling continues in larger networks and on larger machines. We also will do work on optimizing communication patterns for the SP2 implementation (which should

113 significantly improve absolute performance) as well as porting this implementation to the Cray T3D machine, which has significantly faster communication times. We expect to be able to run extremely large networks (potentially billions of frames) extremely efficiently on this machine. First tests of the basic scan algorithm seem to support this assumption. In addition, we are exploring the use of I/O to allow the use of secondary storage for processing extremely large knowledge bases where even the online memory of the large parallel machines is not adequate. The CHAOS runtime support package is being extended to support efficient I/O, and we are also beginning an examination of the special I/O requirements for the storage and caching needs of much larger knowledge bases (containing 100s of millions of assertions). The SP2 at Maryland is being configured to contain about 200 Gigabytes of disk storage, making it uniquely suited for exploring massive knowledge and database applications. In regard to applications, we are currently exploring the use of knowledge based techniques in the guidance of data mining applications. In data mining applications, a knowledge repository is searched for information with the aim of obtaining as much information as possible. One weakness with current data mining systems, however, is their lack of access to semantic knowledge about the domain being searched. Storing such semantics, however, requires much larger knowledge bases that function significantly faster than most of today's systems. Such systems will be needed for doing significant knowledge discovery against large, heterogeneous data corpora such as medical data systems or the World Wide Web. Parka promises to provide support for such knowledge bases. We are currently exploring the development of such hybrid knowledge and databases in the areas of medical informatics, military transportation logistics, and health care financing. 9. C o n c l u s i o n In summary, we have ported the SIMD Parka system to more generic MIMD machines. The system has been recoded in C and supported using runtime optimization packages developed in the high performance computing laboratory at Maryland. New "scanning" algorithms have been developed for inheritance, recognition, and structure matching inferences. These algorithms have been tested with both random networks and on two very large knowledge bases (CYC and the UM-translog case bases). Tests show that the new version is significantly faster than the SIMD system, and that it promises to scale well to knowledge bases of the order of magnitude needed for complex information technology applications. REFERENCES

1. Proceedings of the 12th National Conference of the American Association for Artificial Intelligence. AAAI Press, MIT Press, 1994. 2. W. Andersen, , M. Evett, J. Hendler, and B. Kettler. Massively Parallel Artificial Intelligence, chapter Massively Parallel Matching of Knowledge Structures. AAAI/MIT Press, 1994. 3. A. Borgida and P.F. Patel-Schneider. A semantics and complete algorithm for subsumption in the classic description logic. Journal of Artificial Intelligence Research, 1, 1994.

114

10.

11.

12. 13. 14.

Paolo Bresciani. The challenge of integrating knowledge representation and databases. In KARP-95 Second International Symposium on Knowledge Acquisition, Representation and Processing, pages 16-18. Auburn University, AL, USA, 1995. Extended version in IRST TR 9412-06, IRST, Povo TN (Italy). Fredric T. Chong, Shamik D. Sharma, Eric Brewer, and Joel Saltz. Multiprocessor runtime support for fine-grained, irregular DAGs. Technical Report CS-TR-3266, University of Maryland, Mar 1994. To appear in Parallel Processing Letters. M.P. Evett. PARKA: A System for Massively Parallel Knowledge Representation. PhD thesis, Dept. of Computer Science, University of Maryland, College Park, 1994. M.P. Evett, J.A. Hendler, and W.A. Andersen. Massively parallel support for computationally effective recognition queries. In Proc. Eleventh National Conference on Artificial Intelligence. 1993. M.P. Evett, J.A. Hendler, and L. Spector. Parallel knowledge representation on the connection machine. Journal of Parallel and Distributed Computing, 1994. T.R. Gruber. A translation approach to portable ontology specifications. Knowledge Acquisition, 5(2):199-220, 1993. J. Hendler, K. Stoffel, and A. Mulvehill. High Performance Support for Case-Based Planning Applications. Advanced Planning Technology. MIT/AAAI Press, Menlo Park, CA., USA, May 1996. J.F. Horty, R.H. Thomason, and D.S. Touretzky. A skeptical theroy of inheritance in nonmonotonic semantic networks. Technical Report CMU-CS-87-175, Carnegie Mellon, Department of Computer Science, Pittsburgh, PA, USA, Oct. 1987. B.P Kettler. Case-based Planning with a High-Performance Parallel Memory. PhD thesis, Dept. of Computer Science, University of Maryland, College Park, 1995. Brian B. Kettler. Case-based Planning with High-Performance Parallel Memory. PhD thesis, University of Maryland, College Park, November 1995. K. Knight and S. Luk. Building a large knowledge base for machine translation. In AAAI-94

[~].

15. Douglas B. Lenat and R.V. Guha. Building Large Knowledge-Based Systems. Addison-Wesley, 1991. 16. Robert M. MacGregor. A description classifier for the predicate calculus. In AAAI-94 [1]. 17. Joel H. Saltz, Ravi Mirchandaney, and Kay Crowley. Run-time parallelization and scheduling of loops. IEEE Transactions on Computers, pages 603-611, 1991. 18. K. Stoffel and J. Hendler. An efficient inferential-distance-ordering algorithm for parallel and distributed systems, submitted for publication, 1996. 19. K. Stoffel, J.A. Hendler, and J. Saltz. High performance support for very large knowledge bases. In Proc. Frontiers of Massively Parallel Computing. Feb. 1995. 20. Merwyn Taylor. Hybrid kb-db. In Proceedings of the 13th National Conference of the American Association for Artificial Intelligence. AAAI Press, MIT Press, 1996. to appear. 21. L. Watanabe and L. Rendell. Effective generalization of relational descriptions. In AAAI Eighth National Conference on Artificial Intelligence. 1990.

115 A. UM-Translog Queries ;;; QUERY 1: "find all plans for getting a package from a LOC to a LOC in DIFF city connected by a direct route"

(query! '(:and (#!instanceOf ?plan #!E-Plan) (#!instanceOf ?goal #!CE-Located-At) (#!everyInstanceOf ?pkg #!Package) (#!everyInstanceOf ?oloc #!Place) (#!everyInstanceOf ?dloc #!Place) (#!instanceOf ?ocity #)City) (#!instanceOf ?dcity #!City) (#!instanceOf ?sit #!Initial-Situation) (#!instanceOf ?sval #!Situated-Value) (#!instanceOf ?pkg-pvar #!Plan-Variable) (#!instanceOf ?dest-pvar #!Plan-Variable) (#!everylnstanceOf ?route #!Route) (#!plan-Goal= ?plan ?goal) (#!thing= ?goal ?pkg-pvar) (#!var-Value= ?pkg-pvar ?pkg) (#!location= ?goal ?dest-pvar) (#!var-Value= ?dest-pvar ?dloc) (#!situation-Before= ?plan ?sit) (#!location-sit= ?pkg ?sval) (#!situation= ?sval ?sit) (#!sslot-Val= ?sval ?oloc) (#!in-City ?oloc ?ocity) (#!in-City ?dloc ?dcity) (#!neq ?ocity ?dcity) (#!route-Origin= ?route ?ocity) (#!route-Destin= ?route ?dcity)

)) ;;; QUERY2: "find all plans (query! ' ( :and (# !instanceOf (# !instanceOf (#!everyInstanceOf (# !instanceOf (# !instanceOf (# !vehicle= (# !var-Value= (# !destin= (# !var-Value= (#!action-Of-Plan

using a particular train station, Hub-Ts" Tact ?plan ?train ?p-varp ?pvara Tact ?p-varp Tact ?pvara Tact

# !P-Move-Vehicle) # !E-Plan) #}Train) # !Plan-Variable) # !Plan-Variable) ?p-varp) ?train) ?pvara) # !Hub-Ts) ?plan)

)) ;;; QUERY 3: "find all top-level plans using a Regular truck (uses IDO and transitive inheritance)" (query ! ' ( :and (# !everyInstanceOf ?truck1 # !Regular-Truck) (# !instanceOf ?act # !P-Move-Vehicle) (# !instanceOf ?plan # !E-Plan) (#! instanceOf ?plant #!E-Plan)

116

(#!instanceOf (#!instanceOf (#!var-Value= (#!vehicle= (#!action-Of-Plan (#!tmember-Of (#!plan-of-Case ))

?case ?pvart ?pvart ?act ?act ?plan ?plant

#!Case) #!Plan-Variable) ?truckl) ?pvart) ?plan) ?plant) ?case)

;;; QUERY 4: "find all plans for getting an object from a LOC to a LOC in SAME city" (query! '(:and (#!instanceOf ?plan #!E-Plan) (#!instanceOf ?goal #!CE-Located-At) (#!everyInstanceOf ?pkg #!Package) (#leveryInstanceOf ?oloc #!Place) (#!everyInstanceOf ?dloc #!Place) (#!instanceOf ?city #!City) (#!instanceOf ?sit #!Initial-Situation) (#!instanceOf ?sval #!Situated-Value) (#!instanceOf ?pkg-pvar #!Plan-Variable) (#!instanceOf ?dest-pvar #!Plan-Variable) ?plan ?goal) (#!plan-Goal= ?goal ?pkg-pvar) (#!thing= (#!var-Value= ?pkg-pvar ?pkg) (#!location= ?goal ?dest-pvar) (#!var-Value= ?dest-pvar ?dloc) (#!situation-Before= ?plan ?sit) (#!location-sit= ?pkg ?sval) (#!situation= ?sval ?sit) (#!sslot-Val= ?sval ?oloc) (#!in-City ?oloc ?city) (#!in-City ?dloc ?city) )) ;;; QUERY 5: "find all plans DIFF city" (QUERY! '(:and (#!instanceOf (#!instanceOf (#!everyInstanceOf (#!everyInstanceOf (#!everyInstanceOf (#!instanceOf (#!instanceOf (#!instanceOf (#!instanceOf (#!instanceOf (#!instanceOf (#!plan-Goal= (#!thing= (#!var-Value= (#!location= (#!var-Value=

for getting an object from a LOC to a LOC in

?plan #!E-Plan) ?goal #!CE-Located-At) ?pkg #!Package) ?oloc #!Place) ?dloc #!Place) ?ocity #!City) ?dcity #!City) ?sit #!Initial-Situation) ?sval #!Situated-Value) ?pkg-pvar # !Plan-Variable) ?dest-pvar # !Plan-Variable) ?plan ?goal) ?goal ?pkg-pvar) ?pkg-pvar ?pkg) ?goal ?dest-pvar) ?dest-pvar ?dloc)

11"7 (#!situation-Before= ?plan?sit) (#!location-sit= ?pkg ?sval) (#!situation= ?sval ?sit) (#}sslot-Val= ?sval ?oloc) (#}in-City ?oloc ?ocity) (#!in-City ?dloc ?dcity) (#}neq ?ocity ?dcity) ))

;;; QUERY 6: "find compatible packages and vehicles" (query! '(:and (# !everyInstance0f ?pkg # !Package) (#!ancestor ?ptype #!Package) (#!ancestor ?vtype #!Vehicle) (#!everyInstance0f ?veh #}Vehicle) (#!instance0f ?pkg ?ptype) (#!instance0f ?veh ?vtype) (#!pv-Compatible= ?ptype ?vtype) ))

118 Dr. Kilian Stoffel

Dr. Kilian Stoffel is an affiliate research scientist at the University of Ma .ryland where he has worked in the Information Technology Laborato .ry since November 1994. His main research interests lie in Symbolic AI, Knowledge Representation and High Performance Computing. J a m e s A. H e n d l e r

Dr. Hendler is an associate professor and head of the Advanced Information Technology Laboratory at the University of Maryland, where he has worked since Janua .ry, 1986. He is the author of the book "Integrating Marker-Passing and Problem Solving: An activation spreading approach to improved choice in planning" and is the editor of'Expert Systems: The User Interface," "Readings in Planning" (with J. Allen and A. Tate), and "Massively Parallel AI" (with H. Kitano). Dr. Hendler was the recipient of a 1995 Fulbright Fellowship and has authored over 100 technical articles in the areas of artificial intelligence, high performance computing, and robotics. J o e l Saltz Associate Professor, Computer Science and head of High-Performance Systems Software Laboratory, University of Maryland. Joel Saltz leads a research group at the University of Maryland, College Park whose goal is to develop technology to make it easier for users to develop multiprocessor applications. Projects include the development of runtime support for parallel compilers, development of class libraries, optimizations for high performance I/O and development of techniques to link multiple separately parallelized applications.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

121

A Hybrid Approach to Improving the Performance of Parallel Search* Diane J. Cook Department of Computer Science and Engineering Box 19015 University of Texas at Arlington Arlington, TX 76019 (817) 273-3606 [email protected] Many artificial intelligence techniques and applications rely on performing heuristic search through large problem spaces. Iterative-Deepening-A* (IDA*) search has proven to be effective for large search spaces, because it requires no intermediate state storage and is guaranteed to find optimal solutions. However, the time taken to perform IDA* search on real-world tasks often prevents the everyday usage of AI techniques. Parallel processing can considerably reduce the time spent in search, and can thereby speed up AI applications. This paper describes HyPS, a hybrid parallel window / distributed tree search algorithm. Using this algorithm, the set of available processors is divided into clusters. Each cluster searches simultaneously through the same search space, but to a unique cost threshold. Within each cluster, the search space is divided so that an individual processor will search a fraction of the total search space. Operator ordering and load balancing techniques are used to further improve the performance of HyPS. Results on two real-world and one artificial domain show a substantial performance improvement over serial search algorithms, and indicate an improvement over existing parallel search approaches. In this paper we also describe a mechanism for automatically selecting the optimal number of clusters to use. 1. I N T R O D U C T I O N Heuristic search provides the driving force for many applications of artificial intelligence including problem solving, robot motion planning, concept learning, theorem proving and natural language understanding [9,21]. Computational complexity is a major limitation of search, thus the research community is continually t .rying to develop more efficient search algorithms. Parallel search algorithms significantly increase the size of the search space that can be traversed in a given amount of time. Parallel search techniques have been implemented on MIMD [3,20,21] and SIMD [4,7,8,12,19] architectures. Because of the overwhelming size of real-world AI applications, and because of the increasing power and accessibility *This work is supported by National Science Foundation grants IRI-9308308 and IR1-9502260.

122 of parallel computers, there exists a constant need for improvement of parallel search algorithms. This paper introduces a hybrid parallel search technique (HyPS) that improves the performance of search using a MIMD architecture. The idea behind the approach is to blend the strengths of existing parallel search algorithms. In particular, HyPS merges the power of a parallel window search with that of a distributed tree search. The resulting algorithm offers improvements over either approach used by itself. The addition of load balancing and operator ordering further improves the performance of the HyPS system. The remainder of this paper describes the HyPS system and demonstrates its performance on two well-known AI application areas as well as an artificially-generated search space. Section 2 defines the search problem and describes the basic search techniques employed by HyPS, and section 3 provides an overview of existing parallel search techniques. The following section introduces the HyPS approach along with operator ordering and load balancing extensions. Section 5 demonstrates the performance of HyPS applied to two real-world problems and one artificial search-intensive problem. Section 6 overviews a technique for automatically selecting the number of clusters to use for a given problem. We conclude this paper with observations and a discussion of future research directions.

2. H E U R I S T I C

SEARCH

Heuristic search techniques are used to find a sequence of operators that lead from the initial state (root node) of a problem to a goal state (goal node). An example search tree is given in Figure 1. Search begins by evaluating the root node and generating the children of the root node. At each later step, one of the previously generated child nodes is evaluated and its children are generated. This process continues until a node is evaluated that meets the goal criteria. Search time can be reduced by using estimated distances to the goal (heuristics) to direct the search. A* is a type of heuristic search algorithm which selects a node from the search queue that minimizes the function f(n) - g(n) + h(n). In this function, g(n) represents the cost of the path from the initial state to node n, and h(n) represents the heuristic estimate of the least-cost path from node n to a goal node. If the function h(n) never overestimates the distance to the goal, A* is guaranteed to find an optimal (least-cost) solution. The main drawback of any heuristic search algorithm is that the memory requirement is exponential in the depth of the tree [17]. The resulting demand for memo .ry often exceeds the resources of available machines. Iterative-Deepening-A* (IDA*) search [10] provides a solution to this problem. IDA* performs a series of incrementally-deepening depth-first searches through the search space. In each iteration through the space, the depth of the search is controlled by the cost threshold. The cost of a node is calculated using the A* function f(n) = g(n)+ h(n). The initial cost threshold is computed as the estimated distance from the root node to the goal. Search down any branch of the space terminates when the f(n) value of a generated child node exceeds the cost threshold for that iteration. If a goal node is not found during a given iteration, the search starts back at the root node, but the cost threshold is set to the minimum f(n) value in the search space that exceeded the previous threshold. Successive iterations continue until a goal node is found.

123 INITIAL STATE

I

GOAL STATE

Figure 1. Application search tree

IDA* also guarantees optimal solutions if h(n) does not overestimate the distance to a goal. Although redundant work is performed, the number of nodes that are expanded more than once is a small fraction of the total work. Unlike A* search, the memory requirement for IDA* is linear in the depth of the search space. 3. P A R A L L E L

SEARCH

A number of researchers have explored methods for improving the efficiency of search. These methods include making use of background knowledge, learning search macro operators, and designing parallel search algorithms. Parallel architectures have been shown to be useful in reducing the amount of serial time spent in search by dividing the work among multiple processors. Many of the existing parallel search efforts can be classified as parallel window searches or distributed tree searches. 3.1. P a r a l l e l W i n d o w S e a r c h One popular type of parallel search algorithm is a parallel window search (PWS), introduced by Powley and Korf [20]. Using PWS, each processor is given a copy of the entire search tree and a unique window size or cost threshold. The processors search the same tree to different thresholds simultaneously. If a processor completes an iteration without finding a solution, it is given a new unique threshold (deeper than any threshold yet searched) and begins a new search pass with the new threshold. The first processor to reach a solution informs the rest of the processors. The remaining processors stop their search if this solution is acceptable. If an optimal solution is desired, processors searching at lower thresholds finish their current iteration and the least-cost goal is returned. Given the search tree shown in Figure 1, the division of work using PWS is illustrated in Figure 2. One advantage of parallel window search is that the redundant search inherent in IDA* is not performed serially. A serial algorithm would search the space to threshold t, t h e n

124 Processor 1 I N I T I ~ ~

Processor 2 INITIAL ....

Processor 3 INITIAL ---

~~

O. O 4 O

........GOAL threshold I : h(root)

threshold 2 = thresholdI + i

threshold 3

=

threshold I + 2i

Figure 2. Division of work in parallel window search

would search the same space to threshold t plus some increment (t + i), then to threshold t + 2i, etc., until a goal node is found. On each iteration, all of the nodes expanded in the previous iteration are expanded again. Using multiple processors, this redundant work is performed concurrently. A second advantage of parallel window search is the improved time in finding a first solution. If the solution density is high (there are many solutions in the search space), the depth-first search used by IDA* may find a deep solution much more quickly than an optimal solution. Parallel window search can take advantage of this type of search space. Processors that are searching beyond the optimal threshold may find a solution down the first branch they explore, and can return that solution long before other processors finish their search iteration. This will often result in superlinear speedup in comparison to the serial algorithm, because the serial algorithm will always increment the cost threshold by the least possible amount and not look beyond the current threshold. On the other hand, parallel window search can face a decline in efficiency when the number of processors is significantly greater than the number of iterations required to find an optimal (or a first) solution. This situation will occur when a machine is used that offers many processors, yet few iterations are required because the heuristic estimator is fairly accurate. 3.2. D i s t r i b u t e d Tree Search An alternative approach to parallel search distributes the search space among available processors and is referred to here as distributed tree search (DTS) [11,21]. Using our version of DTS, one processor traverses the search tree until there are at least as many nodes in the queue as there are available processors. Once a sufficient number of nodes have been generated, the first processor passes a unique node from the search queue to the remaining processors. Each processor is thus responsible for the entire subtree rooted at the node it received. The processors perform IDA* on their unique subtrees simultaneously. All processors search to the same threshold. After all processors have

125

~llllliii,liiiiiil|liil|lliiiiliilii,lllliiliiiiiiliii|llil~ i 9 = expanded on all three iterations ~ : | i ~ = expanded on last two iterations i 9 i 0 = expanded on last iteration | I,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,~

i ! GOAL

Figure 3. Division of work in distributed tree search

finished a single iteration, they begin a new search pass through the same set of subtrees using a larger threshold. Given the search tree shown in Figure 1, processors running DTS search the spaces shown in Figure 3. One advantage of DTS is that no processor is performing wasted work beyond the goal depth. As the algorithm searches the space completely to one threshold before starting the search to a new threshold, none of the processors is ever searching at a level beyond the level of the optimal solution. It is possible, however, for DTS to perform wasted work at the goal depth. For example, in Figure 3 processor 3 searches nodes at the goal level that would not be searched in a serial search algorithm moving left-to-right through the tree. A disadvantage of DTS is the fact that processors are often idle. This is because a processor that finishes an iteration quickly must wait for all other processors to finish before starting the next iteration (in order to ensure optimality). This idle time can make the system ve.ry inefficient and reduce the performance of the search application. The efficiency of this approach can be improved by performing load balancing between neighboring processors working on the same iteration. Load balancing itself will generate overhead and communication costs, however. A second disadvantage of DTS is that it does not provide a mechanism for finding a quick first solution. The original DTS algorithm described by Kumar and Rao assigns the root node of the space to the first processor, and load balancing must be performed for other processors to receive initial work. We modified this algorithm to reduce load balancing time by distributing nodes to all processors from the host.

3.3. Other Parallel Search Techniques Both parallel window search and distributed tree search are implemented on MIMD architectures. Additional MIMD and SIMD approaches have been suggested which have

126 not yet been incorporated into the HyPS system. Bidirectional search has been used to search the space forward from the initial state and backward from the goal state concurrently [13]. Variations on the distributed tree search technique have been tailored to specific applications such as robot path planning [2]. Yet another body of research focuses on concurrent random movements through a search space that cover the search space quickly [6]. Ferguson, Powley and Korf describe a SIMD search technique in which, like DTS, the work is distributed among processors and each of the processors search to the same threshold [18,19]. The distribution relies on a series of copies which takes O(log p) steps, where p represents the number of available processors. Another SIMD search techniques which relies on fast information distribution and a distributed tree search is the Massively Parallel IDA* (MIDA*) system [4]. MIDA* also takes O(log p) steps to perform distribution, but search is performed as the information is distributed, so that the amount of work performed during the parallel IDA* step is reduced. A third SIMD search techniques again uses distributed tree search, but focuses on techniques for load balancing between processors during the parallel IDA* stage [12]. 4. H y P S We introduce a new type of parallel search that combines characteristics of existing approaches. We refer to the algorithm as Hybrid Parallel Search (HyPS) [14-16]. In particular, HyPS combines parallel window search with distributed tree search. As the experiments will show, the combination of search techniques outperforms either technique used in ~isolation. HyPS divides t h e set of processors into groups called clusters. Each cluster performs parallel window search - - each cluster receives a copy of the entire search space, and each cluster searches that Space to a unique cost threshold. Within each cluster, the space is distributed among the cluster's processors and the processors perform distributed tree search. In this way, clusters perform parallel window search, and the processors within each cluster perform distributed tree search. HyPS has been implemented in C on a nCUBE and in C* on a Connection Machine 5. The next two subsections describe the distribution and search phases of the HyPS algorithm. The following subsections describe improvements that can be made using operator ordering and. load balancing.

4.1.' DistributiOn Phase During the distribution phase, the host program distributes work among the node processors. The host program first determines Jhow manyclusters will be used and divides the set of processors equally among the cluSters. The search tree is then expanded t o t h e point that distinct subtrees can be distributed to each processor within a cluster. The number of distinct subtrees being explored is equal to the number of processors avaiiable Within a cluster. For example, if we have n = 6 available processors and we wish to use two separate parallel windows, then HyPS will form two clusters, each with n/2 - 3 processors. Hence, the host program will have to divide the initial tree into three distinct subtrees, which in turn are distributed to the three processors in each cluster (see Figure 4). ~ ~

127

Cluster 2

Cluster 1 IN

INITIAL

.......

9

9

O

rocessor

1

Processor

Q

Processor

2

Processor

5

O

Processor

3

Processor

6

O

o

o

4

Figure 4. Space searched by two clusters, each with 3 processors

The host processor stores generated nodes in a queue (Q). The number of generated nodes is dependent upon the number of distinct subtrees required for the distributed tree search. The steps taken in the formation of the queue are detailed below. 1. Initialize Q to contain the root of the search tree. 2. While length(Q) < number of processors in each cluster do 9 Remove node n from head of Q. 9 Evaluate node n. If n is not a goal node, then generate the children of n and append them onto Q. 3. If length(Q) > number of processors in each cluster, compress nodes at end of Q. Note that using a level-by-level search, more nodes may be generated than available processors. In this case, nodes at the end of the search queue are removed from the queue and replaced by their parent nodes. This compression is performed as many times as needed until the length of the queue is less than or equal to the number of processors in each cluster. If the final queue length is less than the number of processors needing information, child nodes from the most recent compression are distributed to the idle processors. In this case, both the parent node and some of its children are being searched by separate processors, yielding some redundancy in the search process. We provide an example of the distribution algorithm for an instance of the Fifteen Puzzle problem where each cluster contains p = 6 processors. The expansion of the search tree is shown in Figure 5. In this picture, invalid nodes (the blank square cannot move beyond the board boundaries and cycles of length two are disallowed) are not shown. The final queue will contain the leaf nodes of this partial tree, and will be distributed

128

INITIAL STATE

, 812 723~

, 8112 I , ~ 21 13

lo 11 13 1~

14

1

9i

i

14

7

2

3

51

7

101~ 1, I1~

1 "9

6

2

5

3

, ~12 ~2 31~

io 11 13

l o 1111311~

14 1 ~ ~] [1! ;:9 6 7

2n

31

4

8 12

5

7

2

3

lo11~3~

immn ilUiNi ININI ImEEE

Figure 5. Distribution of work for a fifteen puzzle problem

to the respective processors in each cluster. Once distribution is complete, the signal is given to begin the search phase. 4.2.

Search

Phase

The search phase is performed in parallel by the processors within each cluster. To begin the search phase, the host sends the nodes defining the subtrees to the appropriate processors within each cluster. The cluster also receives the cost threshold to which it will search. Each processor then begins searching at the root of its subtree. The first cluster receives a cost threshold equal to the heuristic estimate of the distance from the initial state to the goal. All of the remaining clusters are given incrementally larger thresholds. For many problems, the appropriate increment can be predetermined in a way that guarantees no loss of optimality. Figure 4 illustrates the case where n = 6 processors are available and we wish to use two separate parallel windows. By changing the cluster size, a greater or lesser number of windows can be employed. When the number of clusters c is equal to the number of processors available on the machine, HyPS simulates pure parallel window search. When c is equal to 1, HyPS simulates pure distributed tree search. Because clusters search the space to unique depths and processors within each cluster search a unique portion of the space, it is unknown where the first solution found will lie, and whether its cost will be optimal or nonoptimal. HyPS can be used to specifically return an optimal solution or to return the first (optimal or nonoptimal) solution found. 4.3.

Operator

Ordering

Problem solutions can exist anywhere in the search space. Using IDA* search, the children are expanded in a depth-first search from left to right, bounded in depth by the cost threshold. If the solution lies on the right side of the tree, a far greater number of nodes must be expanded than if the solution lies on the left side of the tree. The left-to-right direction of search represents the order in which search operators are

129 Orlglnal Ordering: New Ordering:

0123

1320

0.." i

W Most promising node

Figure 6. Operator ordering example

applied. If information can be found to re-order the operators in the tree from one search iteration to the next, the performance of IDA* can be greatly improved. As an example, one of the tested Fifteen Puzzle problem instances can require expansion of 900,587 search nodes for the best ordering of operators, yet requires expansion of 4,570,051 search nodes for the worst operator ordering. HyPS attempts to find a good operator ordering for each iteration based on information gained from the previous iteration [3]. The reduction in number of expanded nodes can be exponential for some problem instances. HyPS stores the order of operator applications that lead to the most promising node at the bottom of the subtree explored in the previous search pass. The "most promising" node is the one that has the minimum h value. Since all nodes at the bottom of a subtree represent equal f values (by definition of IDA*), nodes with smaller h values are expected to be more accurate. Consider the situation in which a car needs to be fixed. A customer generally feels more confidence about the statement that his car will be ready in 2 days (h = 2) if it has already been in the shop 28 days (g = 28), than he would feel about the statement that his car will be ready in 28 days (h = 28) if the car has only sat in the shop for 2 days (g = 2). If, as in many cases, the error in the heuristic function h grows monotonically with the distance from the goal, nodes with the lowest h values represent the most accurate estimated distances to the goal and contain fewer nodes in their subtrees [20]. The following example illustrates the operator ordering technique. Figure 6 shows the search tree expanded on one iteration of IDA*. The most promising leaf node is starred.

130 illllll

9-: _-

IIIIIIIIIIIIIIIIIIIIII

IIIIIIIIIIIIIII

II

IIIIIII

IIIIi

Clusterl

:

"

......

. ........................

illl

ll

llllllllllllllllllll

Illllllllll

ll

lllllllllllll

llllll

:_.

:"9 C l u s t e r 2

: ,

:

i

i

J

-: . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

, .......................

J

Figure 7. Load balancing of two clusters, each with 4 processors

The current operator order is 0123 (always try operator 0 first, operator 1 next, operator 2 next, and operator 3 last), but the sequence of operators that leads to the most promising node at the current cutoff point is 133202, so the operator order is changed to reflect this search direction. Because operator 1 appears first in the path to the most promising node, operator 1 will always be tried first. The next unique operator in the sequence is operator 3, then operator 2, and finally operator 0. This ordering is updated at the end of each parallel window search iteration. 4.4.

Load

Balancing

Within

a Cluster

In the same way that PWS is aided by the addition of operator ordering, DTS benefits from the addition of load balancing. Distributed tree search suffers greatly due to processor idling. Load balancing can be added to improve the intra-cluster performance by reducing this idle time. Using load balancing, a processor that has finished its current search iteration asks for work from its neighbor and helps that neighbor to complete its search iteration. We have added load balancing to HyPS to improve the overall performance. Load balancing here is implemented using a ring model of communication. Consider the case where we have eight available processors divided into two clusters. Load balancing will be performed within each of these four-processor clusters. If processor Pl finishes work, then Pl asks for work from its right neighbor, P2. Each processor performs a similar type of request except for the last processor in the ring, such as p4, which requests work from Pl. Figure 7 depicts this communication model. When work is requested from a processor, the processor first determines if it has work to spare (the number of nodes on the stack is greater than a specified threshold). If there is work to spare, the donating processor sends information corresponding to the node at the bottom of the stack, or the node closest to the root which has children left to expand. If the processor has no work to spare, then the requesting processor must remain idle until the search is finished or a new threshold is specified. In this load balancing algorithm, communication is restricted to the adjacent processors within the cluster. Load balancing was added to HyPS to see how the performance of DTS and PWS would be affected. To improve the performance further, more advanced load balancing techniques could be employed [1,23]. As the experiments demonstrate, integrating the

131

1

2

3

4

5

6

7

8

9

i0

ii

12

13

14

15

Figure 8. The fifteen puzzle

additional strategies of load balancing and operator ordering improve the performance of HyPS. 5. E X P E R I M E N T S This section demonstrates the performance of HyPS on two real-world applications and one artificially-crafted domain. The experiments on real-world domains reveal that 1) HyPS does provide a substantial speedup over a serial algorithm, 2) operator ordering and load balancing can in many cases improve the performance of HyPS, and 3) the optimal number of clusters varies, verifying that PWS and DTS perform better in combination than in isolation. We will first introduce the two real-world applications and provide the experimental results, then describe our artificial domain experiments. 5.1. T h e F i f t e e n P u z z l e One well-known discrete optimization problem (DOP) is the Fifteen Puzzle. It consists of a 4 x 4 grid containing tiles numbered one to fifteen, and a single empty tile position called the blank tile. A tile can be moved into the blank position from an adjacent position, resulting in the four operators up, down, left, and right. Given the initial and goal configurations, the problem is to find a sequence of moves to reach the goal. An example of a goal configuration is given in Figure 8. In our problem instances, the heuristic estimate function used is the Manhattan Distance Function, which is known to be an admissible heuristic for the Fifteen Puzzle problem. The Fifteen Puzzle problem can generate approximately 240 unique problem states [7], thus this represents a large search problem which can benefit from an efficient parallel algorithm. 5.2. R o b o t A r m M o t i o n Planning Discrete optimization problems belong to the class of NP-hard problems. Parallel processing of these problems cannot reduce their worst-case run time to a polynomial without using an exponentially large number of processors. However, the average-time complexity of heuristic search algorithms for many of these problems is polynomial. Robot motion planning, speech understanding, and task scheduling are some of the DOPs which require real-time solutions. Parallel processing of these problems can bring their real-time

132

Figure 9. Robot arm motion planning problem instance

solutions closer to being realized. Traditional motion planning methods are ve.ry costly when applied to a robot arm, because each joint has an infinite number of angles to which it can move from a given configuration, and because collision checking must be performed for each arm segment. Craig provides a detailed description of the calculations necessa .ry to determine the position of the end effector in the 3D workspace given the current joint angles [5]. For our experiments, we use the parameters defined for the Puma 560 robot arm with six degrees of freedom shown in Figure 9. The size and layout of the room is the same for each of our test problems.

5.3. Results from real-world experiments We tested the performance of HyPS for the Fifteen Puzzle and Robot Arm Motion Planning domains using the basic HyPS approach as well as HyPS with operator ordering and load balancing added. Each of these experiments were run on 64 processors of an nCUBE 2. Table 1 lists the speedup that resulted from each selected number of clusters, averaged over 20 instances of the Fifteen Puzzle problem. Speedup over the corresponding singleprocessor algorithm was generated using basic HyPS, HyPS with operator ordering, HyPS with load balancing, and HyPS with both operating ordering and load balancing. Results were collected for finding a first solution and an optimal solution. In all of the result summaries, the speedup of the best cluster distribution is highlighted. As table 1 indicates, all versions of the program realize significant speedup over the serial algorithm. Speedup is greater when the program stops with a first solution, because

133 Table 1 Speedup for Fifteen Puzzle domain Number of clusters Algorithm Basic Operator Ordering Load Balancing Operator Ordering / Load Balancing

1 2 4 (DTS) ~2.4 ~6.6 24.2 12.4 2 2 . 0 45.9 18.4 43.1 43.1 18.4 4 4 . 2 82.3 17.6 28.2 2 4 . 9 17.6 3 0 . 6 47.1 41.3 2 3 . 1 4 0 . 7 41.3 2 4 . 3 7 9 . 2

8

16

32

64

Avg

Solution

12.2 28.9 19.5 50.9 15.3 33.2 19.9 52.1

Optimal First Optimal First Optimal First Optimal First

(pws) ~6.8 46.4 16.2 85.3 17.3 57.3 19.6 74.8

6.~ 33.4 6.2 53.5 9.2 36.7 6.5 87.3

6.5 20.3 5.4 43.4 7.5 26.8 5.5 41.1

3.0 22.2 3.9 29.5 2.7 16.3 2.7 16.5

the serial algorithm is constrained to always find an optimal solution. In addition, both operator ordering and load balancing improve the speedup of HyPS, and the best speedup is realized when all of these techniques are used. Superlinear speedup may occur when finding a first solution because the comparison is made to a serial algorithm finding an optimal solution. Superlinear speedup may also result when the goal node lies on the right side of the space, because the space is divided evenly among processors and an individual processor will find that solution before a serial algorithm that must search the entire left side of the space before reaching the goal. Table 2 lists the number of problems for which the corresponding number of processors yields the best performance. Note that in no case does pure DTS (number of clusters - 1) or pure PWS (number of clusters - 64) consistently perform best. In general, performance improves with a greater number of clusters when looking for a first solution. This occurs often in domains like the Fifteen Puzzle where the solution density (percentage of nodes representing goals) is high, and increasing the number of windows increases the chance of quickly finding a goal. As expected, operator ordering and load balancing affect the ideal number of clusters. Because operator ordering benefits PWS, the optimal number of clusters increases for these tests. When load balancing is used in isolation, the performance of DTS improves and the optimal number of clusters becomes lower. The greatest number of clusters and the greatest speedup is realized when both operator ordering and load balancing are used to find a first solution. The results for the Robot Arm Motion Planning domain are listed in table 3. The results are averaged over eight problem instances in which only optimal solutions are sought. As before, speedup is obtained over the serial algorithm and performance improves when operator ordering and load balancing are added. However, this domain differs from the Fifteen Puzzle domain in two main features. First, many of the solutions in this domain lie on the far right side of the tree. For this reason, superlinear speedup is often realized when

134

Table 2 Optimal number of clusters for Fifteen Puzzle domain

Number of clusters Algorithm

1

2

4

8

16

32

64

Avg

Solution

Basic

4 1 1 0 5 2 3 0

7 3 6 3 7 3 6 1

3 6 7 7 3 5 6 8

6 6 3 5 4 8 3 6

0 3 1 3 0 1 1 3

0 0 2 1 1 0 1 1

0 1 0 1 0 1 0 1

3.9 10.1 7.3 10.9 3.2 8.6 5.6 11.3

Optimal First Optimal First Optimal First Optimal First

Operator Ordering Load Balancing Operator Ordering / Load Balancing

Table 3 Speedup for Robot Arm Motion Planning domain

Number of clusters Algorithm Basic Operator Ordering Load Balancing Operator Ordering / Load Balancing

1 2 (DTS) 14.7 21.4 13.1 38.7 12.9 21.2

4

8

16

32 2.1 2.0 2.3

64 (PWS) 2.1 2.0 2.1

70.2 230.6 78.0

27.1 891.7 27.0

10.6 1671.1 14.1

12.8 84.9

155.0

1518.4

354.1

Avg 18.5 356.2 19.7

2.8

1.8

266.2

135 Table 4 Optimal number of clusters for Robot Arm Motion Planning domain

Number of clusters Algorithm

1

2

4

8

16

32

64

Avg

Basic Operator Ordering Load Balancing Operator Ordering / Load Balancing

0 0 0

0 0 0

7 0 7

1 7 1

0 1 0

0 0 0

0 0 0

4.5 9.0 4.5

0

0

1

6

1

0

0

8.5

operator ordering is employed. Second, the branching factor is nearly constant throughout the tree, so the tree is very uniform. As a result, the load is balanced and addition of load balancing does not greatly improve performance. In fact, in some cases load balancing can actually degrade performance due to the time taken to query neighboring processors for available work. Table 4 lists the number of problems for which a given number of clusters performed best. As before, operator ordering causes an increase in the ideal number of processors. Because load balancing did not significantly change performance, the ideal number of clusters remains the same when load balancing is added. For both of these problems, the optimal number of clusters was not constant, but varied with characteristics of the problem domain and with the inclusion of operator ordering and load balancing. 5.4. A r t i f i c i a l D o m a i n Intuitively, selecting the number of clusters to use for a given search application seems to be affected by factors such as the branching factor of the tree, the load imbalance, the accuracy of the heuristic function, and the position of the solution in the tree. To isolate the factors that affect the performance of parallel search, we constructed an artificial search space. In this section, we will describe the artificial search space and show results of experiments using this space. The artificial search space is represented as a tree. Instead of storing the entire space in memo~., portions of the space are generated as needed while the algorithm traverses the tree. The following characteristics of the tree can be adjusted by the user. 9 Maximum branching factor. The user can select a maximum branching factor. No node in the tree will have more children than this specified value. 9 Load imbalance. A real number between 0 (perfectly balanced) and 1 (perfectly imbalanced) is used to indicate the amount of imbalance in the tree, and thus in the distributed work. An imbalanced tree contains the greatest number of nodes in the middle of the tree, and the left and right sides contain sparse subtrees. 9 Accuracy of heuristic estimate. A real number between 0 and 1 is used to indicate

136

\ \ m

o~

O 1 2

3 Branching

4

5

Factor

Figure 10. Artificial Experiment #1

the accuracy of the heuristic estimator. Because h should never overestimate the true distance to the goal, h is defined as e r r o r h 9 d i s t a n c e . 9 Cost of optimal solution. Each move is assigned a cost of 1, so the optimal cost is the depth in the tree at which the optimal solution can be found. 9 Position of optimal solution. A sequence of moves is supplied to define the path to the first optimal goal found in a left-to-right search of the tree. 9 Solution density. A real number between 0 and 1 represents the probability that any given node found after the first optimal solution is also a goal node. Using this artificial domain, we can test several hypotheses about the features of a search space that affect the optimal number of clusters to be used. Note that there are no random distributions used in the generation of the tree, because the tree must maintain the same structure for each iteration of IDA*. Instead, eve .ry feature of a node is determined as a function of the node's position in the search tree. All of the experiments described in this section were conducted using 32 processors on an nCUBE 2. The first experiment compares the optimal number of clusters to the average branching factor in the tree. When the average branching factor is low, there is little work to distribute and better performance is obtained by parallelizing iterations of IDA* using PWS. As the branching factor increases, there is more work to distribute and a lower number of clusters is favored. To verify this hypothesis, we ran HyPS on four different artificially-generated search spaces with the average branching factor ranging from 2 to 5. In each case, the tree

137

/

30

r~ S.,

_= u

/

...i

/

0

0.0

0.4

0.8

Imbalance

Figure 11. Artificial Experiment # 2

is perfectly balanced, the heuristic estimate is 0.8 (fairly accurate), the optimal cost is 16, and the first optimal solution is positioned on the far right of the tree. No operator ordering or load balancing is employed. As Figure 10 shows, the hypothesis is verified by our experiment. When the branching factor is at the low end, the ideal number of clusters is 8. As the branching factor increases, the optimal number of clusters decreases eventually down to 1. In the second experiment, we verify that a nonuniform tree requires a greater number of clusters. Figure 11 shows the result of this experiment. We ran HyPS on six trees with load imbalance ranging from 0 to 1. For each tree, the solution is positioned in the middle of the tree, the branching factor is 3, the optimal cost is 15, and no load balancing or operator ordering is used. As the imbalance increases, more windows are required. This is due to the fact that with a high level of imbalance, many of the subtrees are very small and the processors searching these trees will be idle. As the first experiment in this section demonstrated, a high branching factor results in a lower optimal number of clusters. In contrast, a high number of IDA* iterations and a solution positioned on the left side of the tree will result in a higher optimal number of clusters. The third experiment controls the number of IDA* iterations required by varying the accuracy of the heuristic estimator function and by varying the position of the first optimal solution. For each run, the branching factor is 3 and the optimal cost is 15. While the accuracy of h varies from 0.2 to 1.0, the position of the first optimal solution also varies from the far left side of the tree to the far right side of the tree. Figure 12 shows that in fact with a low accuracy and solution on the left, more clusters are needed than when h is accurate and the first optimal solution appears on the right side of the

138

30

20

~= lO

0.2

0.6 Accuracy of h

1.O

Figure 12. Artificial Experiment #3

tree. The last experiment focuses on the effects of employing operator ordering. In experiment four, we demonstrate that operator ordering becomes more effective the farther to the right the solution lies. Each search tree for experiment four has a branching factor of 4, a heuristic estimate of 0.2, a uniform tree, and an optimal cost of 12. The solution is first position 0/5 of the way through the tree (far left), and moved successively further to the right (positions of 1/5, 2/5, 3/5, 4/5, and 5/5 through the tree). For each test case, HyPS is run with and without operator ordering. As Figure 13 verifies, operator ordering causes a greater increase in the ideal number of clusters and a more significant speedup as the solution is positioned farther to the right in the tree. The figure on the left graphs the increase in optimal number of clusters with the addition of operator ordering, and the figure on the right graphs the speedup that results from employing operator ordering. In this section we have used an artificial domain to identify features that contribute to optimizing the performance of HyPS. In the next section, we will describe an architecture that uses these features and a set of rules to automatically select the number of clusters. 6. A U T O M A T I C A L L Y

SELECTING OPTIMAL NUMBER

OF CLUSTERS

In this section we describe our method of automatically selecting the number of clusters that will optimize the performance of HyPS. To make this decision, we note that many characteristics of a search space can be determined by looking at the first few levels of a tree. In particular, by exploring just a few levels down in the tree we can estimate factors such as the average branching factor, the average error of the heuristic estimate function, the possible location of the optimal solution, and the amount of load imbalance.

139

/

|

I _/

/L J

/ / 0.0

Solution P o s i t i o n

0.4

0.8

Solution P o s i t i o n

Figure 13. Artificial Experiment # 4

Our architecture makes use of the host distribution mechanism to collect statistics about the search space. The amount of time required to search the first few levels of the tree is minimal, and valuable information can be gleaned about the nature of the problem. Using results from the real-world and artificial domains, rules were constructed to select the ideal number of clusters. Statistics are gathered about the problem space as the host searches a few levels down in the tree. The statistics are used to calculate the number of clusters for HyPS to use for the rest of the search. This method of gathering information about the search space to make decisions about the parallel search is similar to Suttner's SPS model, in which a small portion of the space is searched serially to determine whether the problem is a good candidate for parallelization, to control redundancy, and to determine an effective distribution scheme [22]. Using this architecture, we reran the artificial domain experiments and compared results to those shown in section 5.4. In particular, Figures 14 through 17 graph the results of experiments 1 through 4 where the number of clusters is selected automatically by the system. In each case, the trend in selected number of clusters is the same as the trend for the verified optimal number of clusters, though actual numbers vary in a few cases. We also reran the two real-world application domains, using this architecture to select the number of clusters for HyPS. We compared the number of clusters pick by our architecture to the optimal number of clusters verified by the experiments in section 5. For the Fifteen Puzzle, the average error in optimal number of cluster selection averaged over all 20 problem instances for the basic HyPS system was 1.1. This represents one selection away from optimal: choosing to use 2 clusters instead of 1, or 4 clusters instead of 8. For these results, the host system searched six levels into the tree. Similarly, the host searched six levels into the search tree for the Robot Arm Motion Planning domain. Averaged over the six problem instances using basic HyPS, the error in optimal number of clusters selection using our architecture was 0.125.

140

8-

16-

\

I..

I=i

\

6-

\

4-

2-

\

"~10_

_ 2

\ 3

J /

-

~14_

4

~

8 _-

5

0.0

Branching Factor

/

f 0.4

0.8

Imbalance

Figure 14. Artificial Experiment #1

Figure 15. Artificial Experiment #2

16-,

1.0k

A

/

ffJ

"•

0.812-

~

-

'~

0.6-

~ ~

~

8-

\

4 0

.~

\

Herror

Figure 16. Artificial Experiment #3

0.4_ -

~11 0 , 2 -

g r~ o.o-

/ !

/ ! !

Solution Position

Figure 17. Artificial Experiment #4

141 7. C O N C L U S I O N S A N D F U T U R E W O R K The goal of this research is to improve search-intensive applications by improving the performance of parallel search algorithms. This paper introduces the Hybrid Parallel Search algorithm HyPS, which improves the performance of existing parallel search techniques such as parallel window search and distributed tree search, by combining the benefits of both approaches. The system combines the two techniques by dividing the set of processors into clusters. The clusters perform parallel window search, while the processors within a cluster perform distributed tree search. The performance of HyPS is improved by the addition of operator ordering (which strengthens inter-cluster performance) and load balancing (which strengthens intra-cluster performance). The results of experiments applied to two problem domains reveal that HyPS outperforms both serial search algorithms and either PWS or DTS used in isolation. An artificial domain was used to identify the features that contribute to the selection of the optimal number of clusters. Finally, we introduced a mechanism for automatically selecting the number of clusters for each search domain that performed well in all three application domains. While the HyPS system has been shown to be effective in these three domains, we are continuing to apply the system to a variety of applications. As more data is gathered, we can use machine learning techniques to generate a set of parameter-setting rules for HyPS that achieves the greatest performance possible for parallel heuristic search.

REFERENCES

1. I. Ahmad and A. Ghafoor. Semi-distributed load balancing for massively parallel multicomputer systems. IEEE Transactions on Software Engineering, 17(10):9871004, 1991. 2. D. Challou, M. Gini, and V. Kumar. Parallel search algorithms for robot motion planning. In Proceedings of the AAAI Symposium on Innovative Applications of Massive Parallelism, pages 40-47, 1993. 3. D . J . Cook, L. Hall, and W. Thomas. Parallel search using transformation-ordering iterative-deepening a*. The International Journal of Intelligent Systems, 8(8), 1993. 4. D. J. Cook and G. Lyons. Massively parallel IDA* search. Journal of Artificial Intelligence Tools, 2(2):163-180, 1993. 5. J.J. Craig. Introduction to Robotics. Addison-Wesley Publishing Company, Inc, 1989. 6. W. Ertel. Massively parallel search with random competition. In Proceedings of the AAAI Symposium on Innovative Applications of Massive Parallelism, pages 62-69, 1991. 7. M. Evett, J. Hendler, A. Mahanti, and D. Nau. PRA*: Massively parallel heuristic search. Journal of Parallel and Distributed Computing, 25:133-143, 1995. 8. G. Karypis and V. Kumar. Unstructured tree search on simd parallel computers. In Supercomputing 92, pages 453-462, 1992. 9. H. Kitano. Massively parallel ai and its application to natural language processing. In Proceedings of the First International Workshop on Parallel Processing for AI, pages 99-105, 1991.

142 10. R. E. Korf. Depth-first iterative deepening: An optimal admissible tree search. Artificial Intelligence, 27:97-109, 1985. 11. V. Kumar and V. N. Rao. Scalable parallel formulations of depth-first search. In Kumar, Kanal, and Gopalakrishan, editors, Parallel Algorithms for Machine Intelligence and Vision, pages 1-41. Springer-Verlag, 1990. 12. A. Mahanti and C. Daniels. Simd parallel heuristic search. Artificial Intelligence, 60(2):243-281, 1993. 13. P. C. Nelson and A. A. Toptsis. Superlinear speedup using bidirectional and islands. In Proceedings of the 12th International Joint Conference on Artificial Intelligence, pages 129-134, 1991. 14. S. Nerur and D. J. Cook. Maximizing the speedup of parallel search using hyps. In

Proceedings of the Third International Workshop on Parallel Processing for Artificial Intelligence, pages 40-51, 1995. 15. S. S. Nerur. A hybrid parallel IDA* search. In Proceedings of the National Conference on Artificial Intelligence, 1994. 16. S. S. Nerur and D. J. Cook. A hybrid parallel-window/distributed tree algorithm for improving th e performance of search-related tasks. In Proceedings of the Sev-

enth International Conference on Industrial and Engineering Application of Artificial Intelligence and Expert Systems, pages 629-637, 1994. 17. J. Pearl. Heuristics. Addison-Wesley, Reading, Massachusetts, 1984. 18. C. Powley, C. Ferguson, and R. E. Korf. Parallel tree search on a simd machine. In Proceedings of the Third IEEE Symposium on Parallel and Distributed Processing, pages 249-256, 1991. 19. C. Powley, C. Ferguson, and R. E. Korf. Depth-first heuristic search on a simd machine. Artificial Intelligence, 60(2):199-242, 1993. 20. C. Powley and R. E. Korf. Single-agent parallel window search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(5), 1991. 21. V. N. Rao, V. Kumar, and K. Ramesh. A parallel implementation of IterativeDeepening-A*. In Proceedings of AAAI, pages 178-182, 1987. 22. C. B. Suttner. Static partitioning with slackness. In Proceedings of the IJCAI Workshop on Parallel Processing for Artificial Intelligence, pages 143-155, 1995. 23. J. Xu and K. Hwang. Heuristic methods for dynamic load balancing in a messagepassing multicomputer. Journal of Parallel and Distributed Computing, 18(1):1-13, 1993.

143

Diane Cook Diane Cook is currently an Associate Professor in the Computer Science and Engineering Department at the University of Texas at Arlington. Her research interests include parallel algorithms for artificial intelligence, machine planning, and machine learning, and she has published over 60 peer-reviewed journal articles, book chapters, and conference papers in these areas. Dr. Cook received her B.S. from Wheaton College in 1985, and her M.S. and Ph.D. from the University of Illinois in 1987 and 1990, respectively. Her current Home Page: http://www-cse.uta.edu/'cook/home.html

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 1997 Elsevier Science B.V.

145

Static Partitioning with Slackness Christian B. Suttner a* aInstitut fiir Informatik TU Miinchen, Germany Email: [email protected] Static Partitioning with Slackness (SPS) is a method for parallelizing search-based systems. Traditional partitioning approaches for parallel search rely on a continuous distribution of search alternatives among processors ("dynamic partitioning"). The SPSmodel instead proposes to start with a sequential search phase, in which tasks for parallel processing are generated. These tasks are then distributed and executed in parallel. No partitioning occurs during the parallel execution phase. The potentially arising load imbalance can be controlled by an excess number of tasks (slackness) as well as appropriate task generation. The SPS-model has several advantages over dynamic partitioning schemes. The most important advantage is that the amount of communication is strictly bounded and minimal. This results in the smallest possible dependence on communication latency, and makes efficient execution even on large workstation networks feasible. Furthermore, the availability of all tasks prior to their distribution allows optimization of the task set not possible otherwise. The paper describes the basic SPS-model, presents general simulation results, provides a worst-case comparison with other parallelization approaches, and discusses the appropriateness of using the model for parallelization. 1. I n t r o d u c t i o n

Combinatorial search (or simply search) denotes the quest for a solution to a problem by systematic trial-and-error. It is a fundamental computational paradigm that can be applied to many problems where the sequence of steps required to obtain a solution is not known in advance. Especially the fields of Artificial Intelligence and Engineering involve many problems where this is the case. Unfortunately, as the minimal number of steps (in terms of ideal search decisions) required for the solution of a problem increases, the size of the associated search space typically grows exponentially or even faster (combinatorial explosion). This rapidly decreases the chances to find solutions for non-trivial problem instances. Therefore, much work is devoted to the improvement of the search capabilities of current systems. A particularly promising approach is parallelization. Parallel search allows a significant increase in the utilized processing power, making it possible to search orders of magnitude larger search spaces. Moreover, while breadth-first search usually is *This research has been funded by the DFG SFB 342 subproject A5 (Parallelization in Inference Systems).

146 infeasible due to the exponential growth in memo .ry requirements, the concurrent exploration of different search alternatives can be used to introduce a breadth-first component into depth-first search which avoids the detrimental effects of wrong search decisions at an early stage of the search, which can result in exponential reductions of the required amount of search. The traditional approach for parallelizing search is to partition the search space in a dynamic fashion. This means that different search alternatives are processed in parallel, and newly arising alternatives are more or less frequently distributed among processors (e.g., [5,1,13]). The advantage of this is that the partitioning can automatically adapt to the search space. Idle processors can receive new work from other processors, and thereby a dynamic load balancing capability is available. However, there are also disadvantages. Dynamic partitioning allows no global control of the partitioning process, and the tasks 2 generated cannot be easily treated with respect to their global priority (because there is no global view). Furthermore, the realization of work migration for dynamic load balancing may result in extensive communication, increasing the parallelization overhead significantly. Finally, dynamic partitioning capabilities in practice are limited due to finite task buffers. In this paper, a parallelization model is proposed in which no partitioning occurs during the parallel computation. Instead, the partitions required for parallel execution are determined in a separate search process, at the beginning. In order to avoid load imbalance, the number of generated partitions may exceed the number of processors (slackness). For distinction from dynamic partitioning, this scheme is called static partitioning with slackness. The application of this model to the parallelization of a high performance sequential theorem prover has lead to substantial performance improvements. A detailed and extensive analysis in [17] shows that significantly more problems become solvable, and good speedup results are achieved. Improvements are also obtained compared to a previous parallelization of the same sequential system, which is based on the random competition model [6]. In the paper here, the general SPS model is presented, and some of the results on its general performance (i.e., independent of a particular application) are given. Section 2 contains a description of the basic model, and discusses the consequences of the model phases for a parallelization. Section 3 treats specifically the load balance issue, containing general simulation results. Section 4 provides a worst-case discussion of the SPS-model compared to two other parallelization approaches. In Section 5, the appropriateness of using the SPS-model is discussed. Section 6 gives an overview of related work. Finally, Section 7 contains a summa .ry of the presented material.

2. P h a s e s of t h e S P S - M o d e l Informally, the execution of a parallel search-based system based on the SPS-model can be separated into three phases as shown in Figure 1. 2We use the term task to denote a subproblem of the original problem. The processing of a task leads to the exploration of some part of the search space of the original problem.

147

Phase 1 Task Generation

1 Phase 2 Task Distribution

1 Phase 3 Parallel Execution

Figure 1. The phases of Static Partitioning with Slackness.

2.1. T h e Task G e n e r a t i o n P h a s e The task generation phase is defined by a set of partitioning rules, a generation strategy, and the desired number of tasks. An initial, finite segment of the search space is explored according to the generation strategy (e.g., depth-first, breadth-first, or best-first search). Depending on the strategy, the search is abandoned at certain points (e.g., by invoking backtracking due to a search bound), and the search states constituting unexplored alternatives are stored as tasks for later parallel execution. The type of tasks generated depends on the employed partitioning rules. Choices for partitioning rules are AND- and OR-partitioning. For AND-partitioning, a given set of subgoals is split such that each subgoal forms a separate task. Each of these AND-tasks needs to be solved to provide a solution to the original problem, and the solutions for the AND-tasks need to be compatible. A special case occurs for independent-AND partitioning: if the individual subgoals are independent of each other (e.g., do not share a variable directly or indirectly), incompatible solutions cannot occur and the subgoals can be solved independently without a final compatibility check. For OR-partitioning, different search alternatives are used to produce a set of OR-tasks, and the solution to any of these tasks is sufficient for a solution to the original problem. The desired number of tasks m is equal to or larger than the number of processors n to be utilized. The term slackness is used to express the relation between the number of tasks and the number of processors. In order to specify a degree of slackness, the slackness m shall denote the average number of tasks (slackness) per processor. parameter spp = -~ C o n s e q u e n c e s for Parallelization The advantages and disadvantages due to an isolated task generation phase before a parallel execution phase can be summarized as follows: 9 Information about the problem-specific search space becomes available. Such information is of high heuristic significance, and can be used to control many aspects of a search-based system. In particular, it allows

148 the identification of problems inadequate for parallelization. Untypically low branching factors or large numbers of search tree nodes without successors may indicate inappropriateness for parallel treatment or suggest that only some fraction of the available processors should be used. Since the initialization of a parallel execution on many processors often causes significant overhead, restricting the virtual machine size or refraining from parallel execution completely for such problems can actually reduce the execution time as compared to a full scale parallelization. Proper treatment of this type of problems is important for achieving good average case performance for applications where no pre-selection of especially hard problems (i.e., problems on which other attempts have previously failed) is done.

-

- redundancy control and significance-driven search space partitioning. In case (independent-)AND-partitioning is employed, it is possible to inspect the generated task set for redundant (e.g., syntactically equivalent) tasks. Due to such analysis, redundant search can be avoided and tasks which are important due to their frequent occurrence can be partitioned further, leading to a significance-driven task generation control. - a dynamic adjustment of system control parameters. Depending on the number of search tree nodes without successors encountered during the generation phase, the potential for load imbalance can be estimated. This can be used to adjust the degree of slackness appropriately. Furthermore, many search-based systems are controlled by a number of parameters. Based on information gained during generation, these parameters can be adjusted in a problem dependent way, for the parallel execution phase (e.g., the bound increment for systems using iterative-deepening search). 9 No parallelism is utilized in the initial phase. The effect of this on the performance depends on the following issues: - For large search problems, the duration of the generation phase is negligible. - In the generation phase, there is often little parallelism available (the search space just starts to "explode"). Parallel treatment of this phase may therefore be of limited effectiveness. - There is no search rate slowdown due to parallelization management (such as communication for task management). -

Within a given computing environment, the fastest available sequential hardware can be utilized, which can be many times faster than the average processor available for a parallel computation.

- Simple problems are solved immediately sequentially (no burden is put on a parallel computing environment); in those cases a parallel computation may require more time than a sequential one due to communication overhead. 2.2.

The

Task

Distribution

Phase

In the task distribution phase, the previously generated tasks are distributed among the available processors. The availability of all tasks prior to their distribution can be utilized

149

beneficially in several ways. First, redundancies among the tasks can be investigated and removed. Second, relationships among the tasks can be used to guide the mapping onto processors, e.g., to reduce the potential for load imbalance. Third, the complete mapping information can be given to each processor. This allows direct processor-toprocessor communication for distributed task management (such as termination control), or in case information exchange between particular tasks is desired (for extensions of the basic SPS-model). Regarding an ideal mapping of tasks to processors, information about the search space (task size) would be required. Unfortunately, such information is usually not available. In that case a good heuristic is to maximize the distance among tasks mapped to the same processor with respect to the number of common ancestors in the search tree. The motivation for this is that tasks which share many common ancestors denote similar situations, and are more likely to span search spaces of similar size than tasks with less commonalities. Indeed, the results in [9] show that such a heuristic performs best among a set of mapping strategies. A simple approximation to this heuristic is easy to realize: Each processor obtains eve .ry n-th task from a task list in which the tasks are ordered in the sequence of their generation (i.e., Taski ~ P(i ,nod n)+l, where Task~ denotes the i-th task generated, i E {1..m}, and Pj denotes processor j,

j e {1..~}).

2.3. T h e Parallel Task E x e c u t i o n P h a s e Finally, in the task execution phase, the tasks are executed independently of each other on their assigned processors. Since usually more than one task has to be processed per processor, a service strategy is required. Possible service strategies are serial execution and processor sharing. For serial processing, one task after the other is executed until a success state is found or all tasks failed. For processor sharing, the processing capacity of a processor is split among the tasks via preemptive execution (e.g., using a round robin strategy, where a so-called processing time-slice is given to one task after the other in a cycle until the computation terminates). Processor sharing is preferable over serial processing, because a task may not terminate within a given runtime limit. In applications involving undecidable search problems (such as automated theorem proving), termination after a finite amount of time cannot even be guaranteed in principle. In that case, serial processing causes incompleteness of the search, since a nonterminating task would infinitely delay all tasks remaining at the respective processor. Even for decidable search problems, a strategy which is fair in the sense that the available runtime is evenly split among the tasks is desirable. Since in practice the search is bounded by a given runtime limit T,,nit, a fair strategy ensures that a solution is found if some task can be solved within T, mit/spp. Altogether, for each task to be processed on a processor, a copy of the original sequential search-based system (extended by the ability to process a task) is started under a time-sharing regime. It is noteworthy that no change of the main search engine of a sequential system is required for SPS-parallelization. All that is required is an interface which allows the processing of a given task (task generation can be done by a different system, or a specifically modified version of the sequential system). After initialization of the task processing, there is no slowdown of the search engine due to program code required for the parallelization.

150 3. L o a d B a l a n c e

In this section we inspect the effect of slackness on the load balance. For measuring the degree of load imbalance quantitatively, we define LI(n) (for Load Imbalance) as

LI(n) = E~=~(Tpe(,) - T~) ( n - 1) • Tpe(n)' where T/denotes the total runtime spent at processor i and Tp~(n) denotes the total system runtime (i.e., Ti < Tpe(n)), for n > 1. The term n - 1 in the denominator represents the largest number of terms which can be different from zero, because there is at least one processor i with T/ = Tpe(n) (Tpe(n) = max,=l...nTi). LI(n) is an absolute measure, not taking into account the best or worst possible balance that can be obtained for a particular set of tasks. It ranges from perfect balance (Ll(n) = 0), which means that all processors finish working at the same time, to maximal imbalance (LI(n) = 1), where exactly one processor is busy during the execution. In order to assess the effects of slackness, a general modeling of search is used. For the simulations, OR-partitioning is assumed. This means that a given problem is solved as soon as one of the tasks is solved. Also, it is assumed that m = n • spp, and that each processor obtains spp tasks. Each computation (consisting of the treatment of a set of tasks) is assumed to be constrained by a user determined runtime limit Zlimi t. Therefore, a computation terminates as soon as a task terminates successfully, or after all tasks failed, or when Zlimi t is reached, whichever occurs first. T h e probability of a task terminating successfully is specified by p (i.e:, a task terminates unsuccessfully with probability 1 - p). The runtimes for all tasks ai'e independently drawn from a uniform distribution which is equal to 0.5 from 0 to 2 and equal to 0 otherwise, resulting in a mean value of 1. Previous experiments [18] have shown little qualitative dependence of the load balancing effect on the particular choice of runtime distribution. The runtime limit is important for the simulation for two reasons. First, externally triggered termination by a runtime limit influences load balance. Early system termination (compared to the average runtime of a task) renders load imbalance unlikely. In the simulation, all tasks have a mean runtime of one unit of time, and runtime limits are issued as multiples of the mean. Second, the actual runtime of a task which is terminated by Zlimi t becomes irrelevant. Such a task might as well have an infinite runtime. Thus, in those cases where tasks are terminated due to Zlimit, the runtime limif allows the extrapolation of the results to distributions which have larger runtimes (for those tasks) and therefore have a larger variance. The system model does not take into account communication or time-sharing overhead. This omission, however, is unlikely to affect the qualitative results: Neglecting the communication overhead is tolerable since communication occurs only at the beginning (distribution of tasks) and at the end (collection of results and global termination)of the parallel computation, and thus mainly depends on the number and physical size of tasks, but not the duration of their computation. The time-sharing overhead is usually low for contemporary operating systems, as long as no swapping occurs. Thus, all processes running on the same processor should fit into the main memo .ry, thereby limiting the degree of slackness. Another limitation for the degree of slackness arises from the time-sharing delay for a task. For these reasons the investigated slackness is limited to 16.

151 Three different modeling variants are presented. There are two options for obtaining different slackness values for a parallel search. The first is to generate different numbers of tasks for a fixed number of processors, and the second is to choose different numbers of processors for a fixed number of tasks. In the first case, another important distinction arises: How does the increasing degree of partitioning influence the overall probability to find a solution (within the given time limit)? In practice, increasing partitioning is likely to increase the overall probability for success. This is modeled by assuming that the probability of success per task p remains constant for different slackness values (i.e., for different numbers of tasks). For the assumption that increasing partitioning does not improve the overall success probability, the value of p is decreased for increasing slackness such that the overall success probability remains constant (and equal to the value obtained for s p p - 1). The load imbalance values for these two options are shown in Figure 2. Figure 3 shows the results for the case where the number of processors is varied instead of the number of tasks. The load imbalance for s p p = 1 is the same in all plots, because in this case in all plots 32 processors and 32 tasks are used. All figures give results for a low success probability p = 0.01 (higher success probability values result in lower load imbalance, since the computation terminates earlier).

p overall(success), 0.01 0.45

............ .

.

.

.

.

i .

.

.

.

.

.

.

A .

.

. .

0.4

p(success), 0.01 . . . . r,~p ,,,- 1 . - , ~ | s p p - 2 --,--

0.35

0.45

0.35

t;, ........

0.3

0.3

777 ....... .....7 .......... 7-

0.25

0.25

0.2

0.2

0.15

/j._

0.1

i ....... -

0.05

0.15 0.1 0.05

.

0 0

5

.

.

,

.

I0

15

T limit

20

25

30

0 0

5

10

15 TJi~

20

25

30

Figure 2. Load imbalance LI for uniform runtime distribution for constant (left plot) and increasing (right plot) overall success probability.

The left plot in figure 2 shows that even in the (worst) case that no advantage is gained by increased partitioning, the load imbalance can be cut down by more than one half with a slackness of 16 tasks per processor. A much larger reduction occurs in the right plot, where L I becomes negligible for s p p >_ 8. This is due to the increasing overall~ success probability, which increases the chances for an early OR-parallel termination. The load imbalance reduction found in Figure 3 is about in between the two previous cases. The experiments show that for all modeling variants, slackness leads to a noteworthy reduction of the load imbalance.

152 p(succem) ,, 0.01

~ 0.4

0.45

0.3 ~176 0.2 0.15~ 0.1 0.05 00

5

.........................................

10

15 T_,mn

20

25

30

Figure 3. Load imbalance LI for uniform runtime distribution with 32 tasks. Different slackness values are obtained by varying the number of processors from 2 to 32.

In [18] a set of experiments regarding slackness has been reported, focusing on the case where the overall success probability increases with increasing slackness. They show that similar results are obtained for quite different runtime distributions (exponential, uniform, and triangle). This suggests that the load imbalance reduction of slackness is largely independent of the shape of the distribution. Those experiments also show that for success probabilities p > 0.01, small slackness values soon reduce load imbalance to negligible values.

4. Worst Case Analysis Regarding worst-case behavior, it may be suspected that an extremely unbalanced search space will render the SPS-model inappropriate as compared to a dynamic scheme, which can adapt to such a situation by dynamically producing new tasks as necessa~.. Although this may be the case in many specific situations, the following considerations surprisingly reveal that the SPS-model performs quite well compared to dynamic partitioning schemes, in some straightforward worst case scenarios. In the following, two general worst case situations are described and analyzed. In the first situation it is assumed that regardless of the parallelization method, the generation of tasks is always possible as desired, but a maximal work imbalance among the tasks occurs. In the second situation no parallelism is inherent in the problem. All discussions are based on the assumption that no solutions exist. This is necessa .ry to make a comparison between the approaches possible. If solutions are allowed, the performance of a parallelization approach depends critically on the number of solutions and their location in the search space (in relation to the parallel search strategy). Situation 1: Maximal Work Imbalance among Tasks. Regardless of the parallelization model employed, assume that at any time of the parallel computation, all but one of the

153 currently available tasks terminate after one search step. The particular task that does not terminate spans the remaining search space, and may be used to generate new tasks depending on the parallelization model. Regarding runtime and accumulated processing time, this scenario describes the worst possible situation that can occur for the SPS-model. It leads to the generation of m ---n • spp tasks (where n is the number of processors), which are distributed once among the processors. Thus, a runtime delay of O(n) (for processor-to-processor communication, assuming that spp tasks fit into one message package) and an accumulated processing time overhead of O ( m ) (for communication and task handling) occurs. As a benefit, n • spp search steps are performed in parallel. Assuming that a single search step takes much less time than distributing a task, the overhead will usually outweigh the benefit from parallel execution. Furthermore, the main work remains to be done by a single task (no further partitioning occurs after the first task generation3). Assuming the search terminates after k search steps (with k >> m), the constant amount of work performed in parallel will be insignificant, and therefore the runtime will be approximately the same as without parallelization. However, it is important to note that no adverse effects (besides the overhead for task generation and distribution of O ( m ) ) occur either. In particular, the single remaining task can be executed without slowdown arising due to the parallelization. Thus, while no speedup is achieved, no relevant slowdown occurs either. The increase in the accumulated processing time depends on the ratio between the time for initializing the processing of a task and the time for performing a single search step. Assuming a low initialization overhead, the accumulated processing time will remain close to the sequential runtime. Let us now turn to the behavior of dynamic search space partitioning approaches. A dynamic partitioning scheme has provisions to generate tasks (or obtain tasks from other processors) whenever some number of processors becomes idle. Thus, in the given scenario, all processors will continue to participate in the computation until termination. Therefore, in contrast to the SPS-model, the number of search steps executed in parallel is not constant, but increases as the search continues. This, in fact, is usually considered as the prima .ry advantage of dynamic partitioning compared to static partitioning. While of course there are situations where this ability will pay off, in the given scenario this is in fact disadvantageous: 9 There is a permanent need for communication. Depending on the parallel hardware platform, the frequent necessity for communication can significantly decrease the performance of the system. In a multiuser system, this can seriously affect other users as well. 9 Assuming that a communication operation together with the task processing initialization takes significantly longer than the execution of a single search step (a realistic assumption for most search-based systems), a large fraction of the accumulated processing time is spent on overhead rather than useful search. 3Note that in the SPS-model parallel execution is avoided in situations where the problem does not provide enough inherent parallelism to generate the desired number of tasks. This advantage is ignored in the analysis.

154 9 There is no single task which runs undisturbed; unless a specific task leads to the generation of all other tasks, fast processing of such a "main task" is not possible. In the described scenario, typical dynamic partitioning schemes will actually run longer than a SPS-based system, for at least n times the cost of a SPS-based system. Both T1 where Tap(,,) = F~'~=IT~ regarding speedup (S(n) = Tpe(, T~ ) ) and productivity9 ( P ( n ) = Tap(n) , is the accumulated processing time of all involved processors), the described scenario is significantly less harmful for the SPS-model than for dynamic partitioning schemes. In particular, the potential for negative effects of dynamic partitioning schemes in multiuser time-sharing systems require precautions which, in effect, can only be achieved by reducing the dynamic partitioning ability of the system, thereby moving towards a static model. Of course, scenarios where dynamic partitioning performs better than the SPS-model exist. For example, if most of the tasks that are generated by the SPS-model terminate immediately unsuccessfully, and the remaining tasks could be partitioned further into equal slices of work, a dynamic partitioning scheme would be advantageous. In fact, this particular situation maximizes the advantage of dynamic partitioning over static partitioning. Altogether, the performance of a parallelization scheme depends not only on the structure of the search space. The size of tasks, the relationship of communication time to task runtime, and the given runtime-limit all influence the adequacy of a partitioning approach, and make an absolute comparison between static and dynamic schemes difficult. The advantages of static partitioning over dynamic partitioning are mainly due to the initial exploration phase lying at the base of the SPS-model. Of course, one may argue that such a phase can be used at the beginning for any dynamic partitioning scheme as well, combining the best of both approaches. This, indeed, can lead to interesting systems. A simulation study which investigates the use of an initial task distribution for a dynamic partitioning scheme is found in [15]. In this study, the use of an initial task distribution increased the efficiency Erei by approximately 15~ when more than about 50 processors were used. It thereby improved the scalability of the employed dynamic partitioning scheme. In general, it is difficult to determine in which cases the additional implementation effort and computational overhead for dynamic partitioning pay off. In practice the unnecessa .ry employment of dynamic partitioning may lead to increased computational costs for little or no benefit (if not slowdown).

Situation 2: No Inherent Parallelism. The worst case with respect to distributing work among processors is that the search tree consists of a single path (i.e., no alternatives for search steps exist). Thus, neither static nor dynamic partitioning is possible. Static partitioning simply performs sequential search in that case, since no parallel tasks are generated. Assuming an appropriate design of the task generation phase, the overhead for potential task generation is negligible. The performance of dynamic partitioning approaches depends on the particular scheme employed. If parallelism is utilized only after an initial set of tasks has been generated, no parallelization will occur and the performance is comparable to the SPS-model. Otherwise, if parallel processes are started independently of the current availability of tasks, a significant overhead may occur.

155

Performance Stability. Another important issue regarding performance is its stability. In multiuser systems the time for a communication operation is not fixed, but depends on the interconnection traffic caused by other users. Similarly, the load of individual processors may change due to processes unrelated to the intended computation. Both factors considerably influence the communication timing (i.e., the order in which communication events occur). In the SPS-model, a communication delay in the worst case causes a prolongation of the computation and an increase of the computational costs both of which are bounded by the order of the delay. The reason that the change is bounded is that the search space partitioning is independent of the order in which the communication events occur. In dynamic partitioning schemes, however, the generation of tasks, and therefore the exploration of the search space, is usually dependent on the order in which tasks are processed. As a consequence, large variations in the runtime may occur (see for example [12]). In general, changes in the communication overhead will lead to an undesirable system operation mode for dynamic partitioning approaches, because such systems are usually tuned to optimize their performance based on knowledge about typical communication performance parameters (for work on such optimization for a parallel theorem prover, see [10]).

Summary of Worst Case Considerations. The fact that in many particular situations dynamic partitioning schemes provide better flexibility for adapting to irregularly structured search problems is obvious. However, the previous discussions show that the overhead incurred with this flexibility leads to a nontrivial trade-off between static and dynamic schemes which is frequently overlooked. In general, the following statements can be made. Disadvantageous scenarios for the SPS-model lead to a strictly limited overhead with acceptable upper bounds. If the worst case occurs, there is no benefit from parallel computation. However, as the possible speedup decreases for more unfortunate situations, the accumulated processing time decreases as well. The computational gain (product of speedup and productivity) for the SPS-model achieves acceptable values even in worst case scenarios, for problems of sufficient size. This is not necessarily the case for dynamic partitioning schemes, for which the worst case overhead cannot be bounded easily. A benefit from parallel computation may not only be lost, it may actually lead to a significant performance decrease for this and, in multiprogramming environments, other computations. This happens because, unlike for the SPS-model, the accumulated processing time increases. As a result the computational gain can drop to very low values in the worst case for dynamic partitioning approaches, regardless of the problem size.

5. Appropriateness of the S P S - M o d e l This section consists of three parts: a discussion of the SPS-model with respect to important design issues arising for the construction of a parallel system; a list of system properties which make the application of the SPS-model particularly interesting; and a summary of the advantages and disadvantages that can arise from using the SPS-model.

156 5.1. D i s c u s s i o n of Suitability In general, the adequacy of a particular parallelization approach depends on many issues. A decision among different approaches requires a detailed specification of the intended goal of parallelization. In order to specify the intended use of parallelism sufficiently, at least the issues described below need to be clarified. For each item, first a discussion regarding parallelism in general is given, followed by remarks regarding the SPS-model. 9 W h i c h t y p e of c o m p u t i n g problems are to be solved? In general: Performance can be optimized for the average case or for specific problems. Especially for parallel systems, this distinction makes a significant difference. A large enough problem size commonly leads to a good scalability of most techniques, and therefore tends to pose little difficulties. The treatment of comparatively small problems, however, often leads to unacceptable overheads due to the unprofitable initialization of a large parallel system. Since the size of a search problem usually is not known prior to its solution, this can result in the inadequacy of a parallelization technique if average case performance improvement is desired. SPS-model: The SPS-model avoids parallelization overhead for problems which are small enough to be solved during the task generation phase. This feature automatically adapts to the machine size: more processors require more tasks, which leads to more search during task generation; thereby more problems become solvable before parallel processing is initiated (in effect, problems need to be more difficult in order to be parallelized on larger machines). Furthermore, it is possible to determine heuristically the number of processors to be utilized, based on information about the search space growth gathered during the task generation phase. It can thereby support the processing of medium-size problems, keeping the initialization costs at a level suitable for the given problem. 9 W h i c h t y p e of parallel m a c h in e will be used? In general: The particular strengths and weaknesses of the intended hardware platform (e.g., memory model, topology, communication bandwidth) significantly influence the suitability of a parallelization technique. Techniques with little requirements on hardware properties are less sensitive to this issue, while some parallelization approaches can be realized efficiently only on specific hardware platforms. SPS-model: The SPS-model has particularly little communication and no memo.ry model requirements, and is therefore suited to all types of MIMD machines, including workstation networks. 9 W h a t is th e intended degree of portability to different parallel ma c hi ne s ? In general: If no portability is required, an implementation can be optimized for combining the parallel model and the hardware. However, such tight bounds limit the lifetime of the system severely. Due to the unavailability of a common machine model for parallel computers (such as the von-Neumann model for sequential computers), successor models often feature major design changes. In that case, a specifically tuned implementation is bound to its first platform, and may be soon outperformed by an improved sequential system on the latest sequential hardware.

157

SPS-model: Due to its little requirements on communication performance, the SPSmodel can be realized efficiently using portable implementation platforms, such as PVM [2], p4 [4], or MPI [7]. This makes an implementation available on a large number of parallel machines 4 as well as on workstation networks. 9 W h a t is t h e m i n i m a l p e r f o r m a n c e increase e x p e c t e d ? In general: A given desired increase (e.g., S > 100) constrains the minimal number of processors, and thereby defines a minimal degree of required scalability. Scalability of search-based systems is application-specific, and can be hard to predict for a particular parallelization method. SPS-model: The SPS-model simplifies scalability prediction and analysis due to the possibility of simple and meaningful sequential simulation before a parallel system is built. 9 W h a t is t h e desired trade-off b e t w e e n s p e e d u p and p r o d u c t i v i t y ? In general: The adequacy of a parallelization technique depends on the relative importance of speedup S T1 and productivity P = T1 5 This can be %e(,) " Tap(,)" expressed by choosing an appropriate tuning parameter r in the definition of a computational gain G = P x S r, r E ]R+. SPS-model: Taking into account the accumulated processing time has been one of the driving motivations for the development of the SPS-model. The static task generation avoids much of the communication and management overhead required for dynamic partitioning schemes. 9 Are t h e r e r e q u i r e m e n t s regarding worst case p e r f o r m a n c e ? In general: The worst case runtime and accumulated processing time vary significantly for different parallelization approaches, and depend heavily on a number of system aspects. SPS-model: As shown in Section 4, the SPS-model has better worst case performance than many other approaches. 9 D o e s t h e search-based s y s t e m d e p e n d on i t e r a t i v e - d e e p e n i n g search? In general: Iterative-deepening is a wide-spread search technique. For parallel systems employing this method, the maintenance of deepening balance is desirable. Dynamic partitioning schemes in principle allow control of the balance, however for impractical costs (the associated communication problem is NP-complete). SPS-model: In [17] it is shown that slackness together with an iterative-deepening diagonalization and a delayed successor start strategy are effective techniques for reducing deepening imbalance (i.e., the differences in the iterative-deepening levels worked on at different processors) and load imbalance, without requiring explicit communication. 4E.g., PVM is available on: Intel iPSC/2, iPSC/860, and Paragon; Kendall Square Research KSR.-1; Thinking Machines CM2 and CM5; BBN Butterfly; Convex C-series; Alliant FX/8; etc. 5The formulas represent relative or absolute speedup and productivity, depending on the definition of 7'1. If T1 equals the runtime of the parallel system on one processor, relative metrics are obtained. If 7"1 equals the runtime of a sequential reference system, absolute metrics result.

158

9 A r e there s y s t e m - e x t e r n a l constraints? In general: In multiuser systems, constraints due to the computing environment arise. The number of processors available for computation may not be constant for all computations. This arises either when an independent system resource management splits a parallel machine into fixed partitions for individual users or when the load on individual nodes discourages additional processing load. SPS-model: It is possible to take such constraints into account within the SPSmodel, and to adjust the search partitioning to the available number of processors. 5.2. Beneficial P r o p e r t i e s for A p p l i c a t i o n There are several properties of search-based systems which render them suitable for the application of the SPS-model: 9 low probability of load i m b a l a n c e Tasks which only span a small search space (and do not contain a solution) cause load imbalance and therefore should be rare. This may be known empirically for an application, or may be ensured for individual tasks by look-ahead during task generation. 9 fast task g e n e r a t i o n A fast task generation reduces the serial fraction of the computation caused by the task generation phase. Useful for this are - a high task generation rate; Only a small amount of search is required to produce the next task. - an efficient task representation. Tasks can be stored and transferred with little overhead. This is generally desirable for all parallelization schemes, because it reduces the parallelization overhead.

5.3. S P S - M o d e l : C o n s e q u e n c e s of its A p p l i c a t i o n A summary, of the consequences of the application of the SPS-model is given below. Appropriate usage of the model promises the following properties: 9 little c o m m u n i c a t i o n is required As a consequence, - the communication overhead is bounded and small. This is important for achieving good productivity. - there are little requirements on hardware communication performance. Therefore, the SPS-model is well suited to the usage of general purpose parallel programming libraries and networks of workstations. - the complexity of communication is low, which simplifies the implementation and maintenance effort required.

159

9 informed d y n a m i c decisions about parallelization and search Based on information gathered during the task generation phase, heuristic decisions can be made regarding parallelization (e.g., appropriate number of processors, appropriate slackness) and search control (e.g., appropriate iterative-deepening increments). See Section 2.1. 9 global search o p t i m i z a t i o n The use of AND-partitioning can lead to a reduction of the amount of search to be done (see Section 2.1). 9 little m o d i f i c a t i o n of the target s y s t e m The search part of a sequential system does not need to be modified. The necessa .ry extensions consist of a means to produce tasks, and the ability to start the search given by a task. 9 efficient integration of A N D - p a r a l l e l i z a t i o n The use of static task generation before parallel execution allows control over the overhead induced by AND-parallelism. 9 m e a n i n g f u l s i m u l a t i o n for any n u m b e r of processors This is possible because the parallel exploration of the search space does not depend on the communication timing. For a simulation, all generated tasks can be processed in a sequence. The results can be combined to obtain results for any slackness between s p p - 1 (number of processors n - m) and s p p - m ( n - 1).

9 c o m b i n a t i o n of different search-based s y s t e m s Different systems can be combined by using one system for task generation, and several copies of one or more different systems for task processing. In particular, it allows a combination of forward- and backward-chaining search. In cases where the SPS-model is inappropriate, the following consequences of its application may occur:

9 no s p e e d u p c o m p a r e d to the original sequential s y s t e m (worst case) However, no significant slowdown occurs either (see also Section 4). 9 th e task g e n e r a t i o n phase b e c o m e s a b o t t l e n e c k For the generation of a large number of tasks, or in systems with a low task generation rate, the overall computation is slowed down due to the serial fraction caused by task generation. The potential for a bottleneck can be reduced by distributing tasks as soon as they become available and by hierarchical task generation.

9 th e task d i s t r i b u t i o n phase b e c o m e s a b o t t l e n e c k This can be avoided by distributing tasks during the generation phase (see previous item) or by hierarchical distribution. 9 a p e r f o r m a n c e discontinuity occurs Whenever the initial exploration phase finishes immediately before a solution could

160

be found by continued sequential search, the communication time to distribute the tasks and collect the results is wasted. In this particular case, a runtime increase compared to the sequential runtime occurs. The runtime decreases as the amount of further search required after the generation phase increases. 6. R e l a t e d W o r k Work related to the SPS-model can be grouped into three topics, namely research on static scheduling, usage of a task pool, and bulk synchronous programming. Static Scheduling. A traditional static scheduling problem is, given a set of independent tasks and the runtime for each, to find a distribution of the tasks such that the overall runtime is minimized. The tasks considered typically form an independent-AND-set, i.e., all given tasks need to be solved, but independence is assumed. This scheduling problem is well-known to be NP-complete [8], and research in this area focuses on efficient approximations to optimal solutions [3,16]. Unfortunately, for search-based systems the runtime is usually neither known nor can it be estimated accurately. Work on static scheduling without given knowledge about task runtimes is rare. However, interesting research with relation to the SPS-model is found in [11]. The authors investigate static task distribution for minimizing the runtime of a set of independentAND tasks. In their model, each processor repeatedly obtains a batch of k subtasks from a central queue and executes them, until all subtasks have been processed. For the case k - r n~, k becomes equivalent to spp. The authors conclude that for many distributions a static allocation provides reasonable performance relative to an optimal scheme. Task Pool. An alternative for the processor sharing of several tasks in the SPS-model is to choose a pool model: only one task is initially distributed to each processor, and the remaining tasks are stored in a central pool. Whenever a processor finishes its task unsuccessfully, it obtains another task from the pool. Obviously, such a scheme obtains a better load distribution due to its dynamic reactivity. For this, it requires additional communication and control. The expected performance of such a scheme for OR-parallel search has been theoretically analyzed in [14], for three different runtime distributions of tasks (constant, exponential, and uniform) and the probability of success as a variable. The case of constant runtime for all tasks (not realistic for search problems), in fact, is identical for the pool model and the SPS-model, if serial execution of the tasks at a single processor is chosen in the SPS-model. The pool model (as well as serial execution of tasks at one processor for the SPS-model), however, is inappropriate for many applications of search. The reason is that any practical search has to be terminated after some finite time, i.e., a runtime limit is posed onto each search process to ensure termination. For difficult search problems, many tasks take longer to terminate than can be allotted by any reasonable means. In fact, for undecidable problems termination itself cannot be guaranteed. Thus, in a model of computation where some tasks are delayed until some other tasks terminate, the ve.ry tasks which may allow a solution to be found quickly (which is the spirit of OR-partitioning) may be executed prohibitively late (or even never). Therefore a pool model is inappropriate. Bulk Synchronous Programming. The SPS-model also bears some relation to the bulk synchronous programming (BSP) model developed by L. Valiant [19,20]. In this model,

161 the programmer writes a program for a virtual number of processors v, which is then executed on a machine with n processors. According to the model, n should be much smaller than v (e.g. v = n log n). Then thisslackness can be exploited by compilers in order to optimize scheduling and communication. As for the SPS-model, a surplus of tasks is used to achieve a load-balancing effect. However, the BSP model is intended as a basic computational model for parallel processing, and assumes compiler and operating system support. It allows communication and dependencies between the tasks, and assumes that all tasks need to be finished for completing a job (AND-parallelism). While the BSPmodel is a model for general computations, the SPS-model is focused specifically towards search-based systems, where more specific assumptions apply.

7. S u m m a r y In this paper, the parallelization scheme static partitioning with slackness has been presented, independent of a particular application. The advantages and disadvantages of a sequential initial search phase for task generation have been discussed. The potential drawback of the model, namely the occurrence of load imbalance due to tasks with finite (and small) search spaces, can be effectively reduced by slackness. A worst case analysis revealed that, unlike for other parallelization approaches, the worst case for the SPS-model is bounded and moderate. Typical design issues occurring for the construction of a parallel search-based system have been considered; then advantageous system properties for applying SPS parallelization and resulting properties of a parallel system were presented. Finally, research related to the SPS-model has been discussed.

REFERENCES 1. K.A.M. Ali and R. Karlsson. The MUSE Or-parallel Prolog Model and its Performance. In Proceedings of the 1990 North American Conference on Logic Programming. MIT Press, 1990. 2. A. Beguelin, J. Dongarra, A. Geist, R. Manchek, and V.S. Sunderam. A User's Guide to PVM: Parallel Virtual Machine. Technical Report ORNL/TM-11826, Oak Ridge National Laborato .ry, 1991. 3. K.P. Belkhale and P. Banerjee. Approximate Algorithms for the Partitionable Independent Task Scheduling Problem. In Proceedings of the 1990 International Conference on Parallel Processing, 1990. 4. R. Butler and E. Lusk. Users Guide to the p4 Programming System. Technical Report ANL-92/17, Argonne National Laboratory, 1992. 5. W.F. Clocksin and H. Alshawi. A Method for Efficiently Executing Horn Clause Programs using Multiple Processors. New Generation Computing, (5):361-376, 1988. 6. W. Ertel. Parallele Suche mit randomisiertem Wettbewerb in Inferenzsystemen, volume 25 of series DISKI. Infix-Verlag, 1993. 7. Message Passing Interface Forum. MPI: A Message-Passing Interface Standard. 1994. 8. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. Freeman, 1979.

162 M. Huber. Parallele Simulation des Theorembeweiser SETHEO unter Verwendung des Static Partitioning Konzepts. Diplomarbeit, Institut fiir Informatik, Technische Universit/it Miinchen, 1993. 10. M. Jobmann and J. Schumann. Modelling and Performance Analysis of a Parallel Theorem Prover. In A CM SIGMETRICS and PERFORMANCE '92, International

Conference on Measurement and Modeling of Computer Systems, Newport, Rhode Islands, USA, volume 20, pages 2 5 9 - 260, SIGMETRICS and IFIP W.G. 7.3, 1992. ACM. 11. C.P. Kruskal and A. Weiss. Allocating Independent Subtasks on Parallel Processors. IEEE Transactions on Software Engineering, SE-11(10):1001-1016, 1985. 12. E. Lusk and W. McCune. Experiments with ROO, a Parallel Automated Deduction System. In Parallelization in Inference Systems, pages 139-162. Springer LNAI 590, 1992. 13. E.L. Lusk, W.W. McCune, and J. Slaney. ROO: A Parallel Theorem Prover. In Proceedings of CADE-11, pages 731-734. Springer LNAI 607, 1992. 14. K. S. Natarajan. Expected Performance of Parallel Search. In International Conference of Parallel Processing, pages 121-125, 1989. 15. J. Schumann and M. Jobmann. Analysing the Load Balancing Scheme of a Parallel System on Multiprocessors. In Proceedings of PARLE 94, LNCS 817, pages 819-822. Springer, 1994. 16. B. Shirazi, M. Wang, and G. Pathak. Analysis and Evaluation of Heuristic Methods for Static Task Scheduling. Journal of Parallel and Distributed Computing, (10):222232, 1990. 17. C.B. Suttner. Parallelization of Search-based Systems by Static Partitioning with Slackness, 1995. Dissertation, Institut fiir Informatik, Technische Universitht Miinchen. Published as volume 101 of series DISKI, Infix-Verlag, Germany. 18. C.B. Suttner and M.R. Jobmann. Simulation Analysis of Static Partitioning with Slackness. In Parallel Processing for Artificial Intelligence 2, Machine Intelligence and Pattern Recognition 15, pages 93-105. Elsevier, 1994. 19. L.G. Valiant. A Bridging Model for parallel Computation. Communication of the A CM, 33(8), August 1990. 20. L.G. Valiant. General Purpose Parallel Architectures. In J. Van Leeuwen, editor, Handbook of Theoretical Computer Science, chapter 18. Elsevier Science Publishers, 1990.

163 Christian S u t t n e r

Christian Suttner studied Computer Science and Electrical Engineering at the Technische Universit~it Miinchen and the Virginia Polytechnic Institute and State University. He received a Diploma with excellence from the TU Miinchen in 1990, and since then he is working as a full-time researcher on parallel inference systems in the Automated Reasoning Research Group at the TU Miinchen. He received a Doctoral degree in Computer Science from the TUM in 1995. His current research interests include automated theorem proving, parallelization of search-based systems, network computing, and system evaluation. Together with Geoff Sutcliffe, he created and maintains the TPTP problem library for automated theorem proving systems and designs and organizes theorem proving competitions. Home Page: http://wwwjessen.informatik.tu-muenchen.de/personen/suttner.html

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

165

Problem Partition and Solvers Coordination in Distributed Constraint Satisfaction P. Berlandier a and B. Neveu b aILOG Inc. 1901 Landings Drive Mountain View, CA 94043, USA bINRIA - CERMICS 2004, Route des Lucioles, B.P. 93 06902 Sophia-Antipolis Cedex, FRANCE This paper presents a decomposition-based distributed algorithm for solving constraint satisfaction problems. The main alternatives for distributed constraint satisfaction are reviewed. An algorithm using a partition of the constraint graph is then detailed, with its parallel version. Experiments on problems made of loosely connected random constraint satisfaction problems show its benefits for under-constrained problems and problems with a complexity in the phase transition zone. 1. I n t r o d u c t i o n Many artificial intelligence problems (e.g. in vision, design or scheduling) may take the shape of constraint satisfaction problems (csPs) [1]. Being NP-complete, these problems are in need of any computational mean that could speed up their resolution. Parallel algorithms and parallel hardware are good candidates to help in this matter. A second motivation for distribution is that there exist csPs where the structure of the constraint graph is naturally close to a union of independent components. We are indeed especially interested in problems that result from the connection of subproblems by a global constraint. In such problems, the partition into subproblems is given and the challenge is to use that partition in a distributed search algorithm in order to speed up the resolution. Such problem structures happen to be quite common in configuration or design problems, where the whole problem consists of an assembly of subparts, each having its own constraints and being connected by a few global constraints on decision variables such as cost, weight or volume. In this paper, we will present a distributed algorithm for solving csPs over several processors using a partition of their constraint graph. 2. Sources of Parallelism in C S P R e s o l u t i o n The usual way of solving a csP alternates problem reduction and variable instantiation [2]. There are several opportunities for introducing some amount of parallelism in

166 these two processes. We give a brief review of these opportunities below. P a r a l l e l i s m in P r o b l e m R e d u c t i o n

Problem reduction is usually achieved by enforcing some level of partial consistency such as arc or path consistency [3] or by using a more limited filtering process such as forward checking [2]. Some operations that are required to enforce partial consistency can be performed independently. First, checking the consistency usually means controlling which possible value combinations are allowed by a constraint. These tests can obviously be conducted in parallel. This source of parallelism is easily exploited in most constraint systems by the use of operations on bit vectors [4]. A coarser grain parallelism is the parallel propagation of constraints: several constraints (or groups of connected constraints) are activated independently. Some synchronization mechanism is needed as different constraints might share the same variable. However, the fact that constraint propagation results in a monotonic reduction of the problem may simplify the synchronization. Parallel propagation [5-8] has received a great deal of attention. Especially, several procedures to achieve parallel arc-consistency have been devised, sometimes exhibiting a supra-linear speedup. However, [9] exhibits some theoretical restrictions on the gain that can be expected from this kind of parallelism. P a r a l l e l i s m in Variable I n s t a n t i a t i o n

Variable instantiation is a tree search process and the independent exploration of the different branches leads to or-parallelism. The introduction of or-parallelism in search procedures has been studied thoroughly, especially inside [10] but also outside [11] the logic programming community. An experiment of exploiting or-parallelism in the CHIP constraint logic programming language is described in [12]. P a r a l l e l i s m based on t h e C o n s t r a i n t G r a p h

Another way to parallelize the resolution is to partition the variable set and to allocate the instantiation of a subset of the variables to a process. The difficulty of this approach is to synchronize the different processes, which are not independent: there exist constraints between variables and conflicts may occur between the processes. A solution is to order the processes [13,14]. Another way to solve this difficulty is to have a central process that is responsible of the inter process conflict resolution. We will detail our approach in the next section. It takes place in that graph-based distributed framework, with a centralized control. 3. D i s t r i b u t e d C o n s t r a i n t Satisfaction

Binary constraint problems can be represented as graphs where variables are mapped to the vertices and constraints are mapped to the edges. For instance, the constraint graph associated with the well-known N-queens constraint problem is a clique: each variable is connected to all the others. But this is not the case for most real-world problems where it is more common to have loosely connected clusters of highly connected variables. Such

167

Constraint-connected subproblems

Variable-connected subproblems

Figure 2. Solving subproblems independently

almost independent subproblems could thus be solved almost independently with parallel processes and their results combined to yield a global solution. This is the approach proposed in the paper. The most important questions are: (1) How can the problem be partitioned "well"? (2) How can the efforts of the separate solvers be coordinated? As shown in figure 1, a problem can be partitioned along the constraints or along the variables. In the first case, the subproblems can be solved independently right from the start. But, when partial solutions are found, they must be tested against interface constraints. If some of these constraints are not satisfied, the resolution of the subproblems connected by these constraints has to be resumed. If the partition is made along the variables, we must start by looking for a consistent instantiation of the interface variables. After such an instantiation is found, each subproblem can be solved with complete independence as illustrated by figure 2. If they all succeed in finding a solution, we have a global solution. If no solution can be found for one of the subproblems, the instantiation of the interface variables should be changed and the resolution of the subproblems concerned by this change has to be resumed.

168 Let us suppose that the problem that we want to solve has n variables and is divided into k subproblems, each with p variables (so that n - kp). Each variable has a domain of d elements. Using a partition along the constraints, the worst case time complexity for finding the first solution is bounded by (dP)k which gives us a complexity of O(d~P). Now, using a partition along the variables and supposing that there are m interface variables, the worst case time complexity for finding the first solution is bounded by kdmd p which yields a complexity of O(dm+p). Of course, if the problem is not partitioned (i.e. k = 1, p = n and m = 0), we find the same usual complexity for both cases, i.e. O(dn). Conversely, when there are several subproblems, the theoretical complexity of the resolution with constraint-connected subproblems is much higher than with variable-connected subproblems. This is why we have chosen to investigate the latter mode of partition, keeping in mind that this choice is valid if and only if one solution is sufficient for our needs. 4. Definitions

Definition 1 (constraint p r o b l e m ) A binary constraint satisfaction problem P is pair of sets (V,C). The set ~; is the set of variables { v l , . . . , v n } . Each variable vi associated with a finite domain di where its values are sought. The set C is the set constraints. Each constraint is a pair of distinct variables {vi, vj} noted cij which associated with a relation rij that defines the set of allowed value pairs.

a is of is

Definition 2 ( s o l u t i o n ) A value assignment is a pair variable-value noted (v~, x~) where xi E di. A substitution a~ is a set of value assignments, one for each element in the set of variables E. A solution to the problem P = (~2, C) is a substitution av that satisfies all the constraints in C.

Definition 3 ( c o n s t r a i n t s u b p r o b l e m ) A subproblem P~ of P is a triple (Zi, ~)i,C~). The set Zi C V is the set of interface variables, ~)~ C )2 is the set of own variables, and Ci C C is the set of own constraints. Given an instantiation of its interface variables Zi, the solution to Pi is a substitution av~ that satisfies all the constraints in Ci. A subproblem has the following properties: P r o p e r t y 1 The sets of interface and own variables are disjoint i.e.: Zi N Vi = 0 P r o p e r t y 2 The set of own constraints is the maximal subset of the problem constraints that connect one own variable with either an own or an interface variable:

c~ = {C.b e Cl(vo, vb) 9 (V, x Z~) U (Z~ x V~) U (V~ x V~)}

Definition 4 (partition) A k-partition II~ of a problem P is a set of subproblems {P1,. 9 Pk} that have the following properties: k P r o p e r t y 3 The sets of own variables of the subproblems are disjoint i.e." N~=~ V~ = 0

169

Property 4 Each variable of the problem is either an interface or an own variable of a subproblem: k

k

(Ur~)u(UI~)=r ~=~ ~=1

k

and

k

(Uv~)n(Uz~)=O ~=~

~=~

Definition 5 (interface problem) The interface problem Pn of a partition I-I~ is a subproblem of P for which the own variable set is the union of the interface variable set of all the subproblems of the partition: k

))n = U Ii and I n = 0 i--1

Property 5 The set of constraints of the interface problem is the maximal subset of the problem constraints that connect any two of its own variables (that is any two interface variables from the other subproblems): c . = {c~b e Cl(v~ vb) e v . x v . }

T h e o r e m 1 Given a partition of P and a solution avn to its interface problem, the union of avn with any solution for all the subproblems of the partition constitutes a solution to the problem P. Proof." From the properties of a partition and the definition of the internee problem, k Gvi) instantiates once and only once each it is easy to show that the union avn U (Ui=I variable in 12 (from properties 1, 3 and 4) and that this union satisfies all the constraints k in 0 = On U (U~=~ 0~) (from properties 2 and 5). The union is therefore a solution to the whole problem P = (1;, 0). [] This theorem allows us to implement the resolution of a constraint problem as the resolution of an internee problem followed by an independent resolution of k subproblems. The following two sections describe shortly how to compute a problem partition and what coordination to implement between the parallel solvers.

5. P r o b l e m Partition Given k available processors and a problem P, the goal is to find a k-partition that best combines the following (possibly conflicting) desiderata: 1. The complexity of solving the different subproblems should be about the same. 2. The number of variables of the interface problem should be kept to a minimum. Of course, a complete exploration of the partition space is out of the question. We thus turn to a heuristics-based algorithm and we use the classic K-way graph partitioning algorithm presented in [15]. For our purposes, the cost of an edge is inversely proportional to the degree of satisfiability of the constraint represented by this edge. Therefore, a constraint that is easy to satisfy (i.e. with a high degree of satisfiability) has a low cost and will be preferred as a separating edge between two subproblems. The weight of a vertex is proportional to the domain size of the variable represented by this vertex. The set of interface variables is chosen as the minimal set of vertices that is connected to all the crossing edges determined by the partitioning algorithm.

170

1 while a new instantiation for interface variables can be found 2 3 4 5

instantiate all the interface variables for each subproblem Pi solve Pi as an independent CSP in case of success:

1,2

and 3}

store the partial solution {variants 2 and ,9}

6 7

{variants

in case of failure:

store the nogood {variants 2 and 3}

8 9

return to step 1 10 a solution is found; end. 11 no solution can be found; end.

Figure 3. Decomposition-based Search Algorithm Schema

6. A D e c o m p o s i t i o n B a s e d S e a r c h A l g o r i t h m The previous section was about how to get a good partition of the constraint problem. We have designed an algorithm for finding one solution that uses that problem partition to t.ry to solve each subproblem independently. The main idea of this algorithm is the following: first find a consistent instantiation for the interface variables in Vn, and solve each subproblem Pi with the so instantiated interface variables. As soon as a subproblem fails, we store the inconsistent instantiation (also known as a nogood) in order not to reproduce it and we look for a new consistent instantiation for the interface variables. The outline of the algorithm is presented in figure 3. From this outline, we can design three variants, depending on the backtracking schema for the interface variables and on the intermediate results we store. We use the following notations in the description of these three variants: 9 d: domain size 9 n: total number of variables 9 m: number of interface variables 9 p: maximum number of variables of a subproblem 9 s: maximum number of interface variables of a subproblem 9 k: number of subproblems 9 V a r i a n t 1: Standard backtracking on interface variables. The first idea is to use a standard chronological backtracking algorithm for the instantiation of the interface variables. No backtracking can occur between two subproblems and the reduction of worst case time complexity is then given by using

171 the constraint graph partition, as mentioned in section 3. For each instantiation of the interface variables, we have in the worst case one c o m p l e t e t r e e search for each subproblem, so the total complexity is in O(kdPd m) = O(kdp+m), instead of a complexity in O(d kp+m) for a global tree search algorithm. For that variant, the space complexity is the same as the global tree search algorithm: no additional storage is required. 9 V a r i a n t 2: Standard backtracking with storage of partial solutions and nogoods. In that second variant, we store the partial solutions and nogoods to ensure that each subproblem Pi will be solved only once for a given instantiation of its interface variables Z~. Then, the time complexity can be reduced to O(d TM + kdPd s) = O(d TM + kdP+S), the space complexity becoming O(kpdS), s being the maximum number of interface variables of one subproblem. 9 V a r i a n t 3: Dynamic backtracking on interface variables. We still store nogoods and partial solutions, as in the previous variant. We can notice that these nogoods can be used to implement a dynamic backtracking schema [16] for the instantiation of the interface variables. When a failure occurs in the solving of Pi, we try to change the instantiation of the interface variables Ii and to keep the instantiation of the other interface variables, which do not take part in the nogood. For that third variant, we have the same time and space complexity as for the preceding one. This variant is the decomposition-based search algorithm (DS) we have implemented. 7. Parallelization

7.1. A Parallel A l g o r i t h m with Centralized Control The DS algorithm we have presented can be naturally transformed into a parallel version PDS with k + 1 processes, one master for the instantiation of the interface variables and k slaves for solving of the subproblems. As for the sequential DS algorithm, given a partition II k, we first have to find a consistent instantiation as of the interface variables. This instantiation becomes the master's current instantiation. Once this instantiation is found, we can initiate the parallel processing of each subproblem. The master process, which is in charge of the interface instantiation, keeps waiting for the result of the slave processes. The result of the resolution of each subproblem can be either a failure or a success. In case of success, the partial solution of P~ is stored by the master in a dictiona~. (line 15). The slave in charge of Pi will then wait until the master restarts it further with another instantiation of its interface variables. In case of failure on a subproblem Pi, the subset ai of the current substitution as, corresponding to the interface variables of Pi is stored as a nogood (lines 6 and 15). The current substitution as of all interface variables is then invalid and it is necessa .ry to find a new combination of values for these variables. In order to not interrupt all the processes and let run some of them, the master will not follow a chronological backtracking, but a dynamic backtracking model, as seen in 6.2. Once this new instantiation a~ is found, we have to first interrupt, if they are running,

172

0 let Q be an empty message queue; 1 until exit from enumeration-end 2 instantiate consistently all the variables in ])n; 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17

Figure 4.

if no possible instantiation then exit from enumeration-end with no solution; else if failure in Q then store the nogood; else start in parallel the slaves processes whose interface variables have changed; tag reception-end while one of the slaves has not responded if Q is empty, then wait for a message; process the first message; in case of success: store the partial solution; in case of failure: store the nogood; exit from reception-end; exit from enumeration-end with solution;

PDS

algorithm for the master process

and then restart with the new instantiation of the interface variables the resolution of eve .ry subproblem Pj such that (lines 7 and 8): 3v 6 :Z'j,as(v) # ds(V ). Before restarting the resolution of a subproblem, the existence of a precomputed solution is looked-up in the dictiona .ry of partial solutions. The algorithm performed by the master process is presented in figure 4. 7.2. C o m m u n i c a t i o n s In this algorithm, the inter-process communications take place only between the master and the slaves:

9 communication from master to slaves The master is controlling the execution of the slaves. A message is an interruption that is immediately taken into account, as soon as the slave receives it. There is no message queue and the only possible message is: "start the resolution of the subproblem with the current instantiation of the interface variables." 9 communication from slave to master Here, a message corresponds to a report about the resolution of a subproblem. Either a partial solution was found, or a failure occurred. These messages are put in a queue, the master is not interrupted and it will process them in sequence. All the failures will be processed before restarting the slaves with another instantiation of their interface variables (lines 5, 6).

173

7.3. M e m o r y The storage of the nogoods and of the p~rtial solutions is done by the master. The slaves do not need to store any results. The memory requirement is the same as for the sequential DS algorithm (variant 3). 7.4. P r o c e s s Load B a l a n c i n g Due to synchronization, some processors can become idle. The master waits until one slave gives a result (line 11). A slave that gave an answer (a partial solution or a failure) will wait until the master relaunches it with another instantiation of its interface variables. If the new instantiation of the interface variables of a slave corresponds to an already solved subproblem, the processor remains idle. In future works, we will study a more efficient version of the coordination, where some idle processors can perform some search in advance, i.e. solve subproblems, with values of their interface variables different from the current values of these variables in the master. 8. E x p e r i m e n t a l E v a l u a t i o n

8.1. Tests on r a n d o m problems In order to study the performances of these decomposition-based algorithms, we have experimented with randomly generated constraint problems. The generation of random problems is based on the four usual parameters [17]: the number n of variables, the size d of the variables' domain, the constraint density cd in the graph and the constraint tightness ct. The constraint density corresponds to the fraction of the difference in the number of edges between a n-vertices clique and a n-vertices tree. A problem with density 0 will show n - 1 constraints; a problem with density 1 will show n(n - 1)/2 constraints. The constraint tightness ct corresponds to the fraction of the number of tuples in the cross-product of the domain of two variables that will not be allowed by the constraint between these two variables. Tightness 0 stands for the universal constraint and tightness 1, the unsatisfiable constraint. In our experiments, each problem we solved was made up of three constraint problems, generated with the same four parameters, and they were coupled by three interface variables, vl, v2, va, one for each subproblem. The coupling constraints were two difference constraints, vl ~- V2 and v2 r v3. We compared 3 algorithms: 1. A global algorithm which is the classical forward-checking with first fail (FC-FF), using a dynamic variable ordering, the smallest domain first. 2. The decomposition-based search algorithm, DS, which corresponds to variant 3 presented in section 6. 3. A simulation of the parallel version of the algorithm, PDS, presented in section 7. In order to have a fair comparison between these 3 algorithms, in the DS and PDS algorithms, the subproblems were solved with the same FC-FF algorithm used to solve the entire problem in the global algorithm. The parallelism of the third algorithm was simulated in one process, the communication time being then reduced to zero. One simulated processor was assigned to each subproblem.

174 8.2. E x p e r i m e n t a l Results We measure for each algorithm the cpu-time. In the case of the simulation of parallelism, the cpu-time we report is the execution time the parallel algorithm would have, if we suppose that the communications take no time. The results in figures 5, 6 and 7 report the time (in sec.) for solving sets of 40 problems, each set composed of subproblems generated with the same parameters. All tests reported here were run on problems made up of 3 subproblems composed of 25 variables with a domain size d equal to 7. In the figure 5, the constraint density cd is 0.2, and the constraint tightness ct is varying between 0.2 and 0.55. In the figure 6, the constraint density cd is 0.4, and the constraint tightness ct is varying between 0.1 and 0.4. In the figure 7, the constraint density cd is 0.6, and the constraint tightness ct is varying between .05 and 0.4. All these tests were run on a SUN Sparc 5 workstation, in a LELIsP implementation, the subproblems being solved with the PROSE [18] constraint toolbox. This toolbox offers some simple primitives to define constraints, build some constraint problems by connecting constraints and variables, and solve those problems. The main tool provided for resolution is parametrized tree search. The search can be adapted by selecting a variable ordering method, a value ordering method and a consistency method which is usually the forward-checking method. We have obtained three behaviors, depending on the difficulty of the problems. These behaviors correspond to under-constrained problems, to over-constrained problems and to problems in the phase transition zone [19,20]. U n d e r - C o n s t r a i n e d P r o b l e m s For the under-constrained problems (ct < 0.4 in figure 5, ct < 0.25 in figure 6, ct < 0.2 in figure 7), the global solution process is quite efficient, the DS algorithm doesn't improve the global algorithm very much. The parallelization is quite interesting in these cases: we have obtained between DS and P D S a speedup from 1.8 to 2.8. P h a s e Transition For the phase transition zone, (ct - 0.4 in figure 5, ct = 0.28 in figure 6, ct = 0.22 in figure 7), where some problems have few solutions and some are over-constrained, there exist problems that the global algorithm could not solve in 10 minutes. These problems, which were very difficult for the global algorithm, were solved by the decomposition-based algorithm in few seconds and the parallelization did not improve the results very much. We can explain that fact by the reduction of complexity thanks to the DS algorithm, which exploits the special structure of the constraint graph, while the standard forward checking algorithm with smallest domain heuristic cannot exploit it. O v e r - C o n s t r a i n e d P r o b l e m s For the over-constrained problems, (ct > 0.4 in figure 5, ct > 0.28 in figure 6, ct > 0.22 in figure 7), the uncoupling and the parallelization are not efficient. We can remark that the global forward checking with first fail algorithm focuses automatically on an unfeasible subproblem, and detecting that a subproblem is unfeasible is enough to deduce that the complete problem has no solution: in that case, the decomposition and the parallelization are useless. Furthermore, the decomposition-based algorithm changes the variable

175 200

|

|

i

|

w

!

run-time

i

"DS" -4--"PDS" --.~"FC-FF" -B--.

150

100

50 /jJ"

.15

.

i

i

!

i

i

0.2

0.25

0.3

0.35

0.4

0.45

0.5

0.55

Figure 5. Runtime comparison for 3 connected subproblems with n = 25, d = 7, cd = 0.2

ordering, beginning with the interface variables and this new ordering is often less efficient than the ordering given by the standard minimum domain heuristics.

8.3. Future Experiments These first results show that our decomposition-based algorithm outperforms the standard forward checking algorithm in the transition phase, and that the parallel version is interesting in the zone of under-constrained problems. Some other experiments should be done, varying the number of connections between the subproblems and the number of subproblems in order to confirm these results. We are now close to completing the implementation of our distributed solution process on a multi-computer architecture (i.e. a network of SUN Sparc 5 computers Connected by Ethernet and using the CHOOE protocol [21]). We will then be ready to apply our approach to some benchmark problems and evaluate correctly the cost of the communications through the network [22] and the possible workload imbalance. REFERENCES 1. E. Tsang. Foundations of Constraint Satisfaction. Academic Press, 1993. 2. B. Nudel. Consistent-labeling problems and their algorithms. Artificial Intelligence, 21:135-178, 1983. 3. A. Mackworth. Consistency in networks of relations. Artificial Intelligence, 8:99-118, 1977. 4. R. Haralick and G. Elliott. Increasing tree search efficiency for constraint satisfaction problems. Artificial Intelligence, 14:263-313, 1980. 5. D. Baldwin. CONSUL: A parallel constraint language. IEEE Software, 6(4):62-69, 1989.

176

,

,

,

,;

i :

.

!

,

"os"

/~

-.--

"PDS" ..4--.

8OO

5OO

3O0

100

0 0.05

0.1

0.15

0.2

0.25

0.3

0.35

0.4

Figure 6. Runtime comparison for 3 connected subproblems with n = 25, d = 7,

16001

'

'

i

'

i/~

i' i ~

'.. "pDDs S" ~-~ "FO-FF......

0.2

0.25

0.3

0.35

run-li rne

,~

I-

6

]~

~

cd

= 0.4

1200

1000

600

4O0

200

0

005

0.1

0.15

0.4

Figure 7. Runtime comparison for 3 connected subproblems with n = 25, d = 7,

cd --

0.6

177

10. 11. 12. 13. 14.

15. 16. 17. 18. 19. 20. 21. 22.

P. Cooper and M. Swain. Domain dependance in parallel constraint satisfaction. In Proc. IJCAI, pages 54-59, Detroit, Michigan, 1989. W. Hower. Constraint Satisfaction via Partially Parallel Propagation Steps, volume 590 of Lecture Notes in Artificial Intelligence, pages 234-242. Springer-Verlag, 1990. J. Conrad. Parallel Arc Consistency Algorithms for Pre-Processing Constraint Satisfaction Problems. PhD thesis, University of North Carolina, 1992. S. Kasif. On the parallel complexity of discrete relaxation in constraint satisfaction networks. Artificial Intelligence, 45:275-286, 1990. D. Warren. The SRI model for or-parallel execution of prolog. In International Symposium on Logic Programming, pages 92-101, 1987. R. Finkel and U. Manber. DIB: A distributed implementation of backtracking. A CM Transactions on Programming Language and Systems, 2(9):235-256, 1987. P. Van Henten .ryck. Parallel constraint satisfaction in logic programming. In Proc. ICLP, pages 165-180, 1989. Q. Y. Luo, P. G. Hendry, and J. T. Buchanan. A hybrid algorithm for distributed constraint satisfaction problems. In Proc EWPC'92, Barcelona, Spain, 1992. M. Yokoo, E. Durfee, T. Hishida, and K. Kuwabara. Distributed constraint satisfaction for formalizing distributed problem solving. In Proc of 12th IEEE International Conference on Distributed Computing Systems, pages 614-621, 1992. W. Kernighan and S. Lin. An efficient heuristic procedure for partitioning graphs. The Bell System Technical Journal, 49(1):291-307, 1970. M. Ginsberg. Dynamic backtracking. Journal of Artificial Intelligence Research, 1:25-46, 1993. D. Sabin and E. Freuder. Contradicting conventional wisdom in constraint satisfaction. In Proc. ECAI, pages 125-129, Amsterdam, Netherlands, 1994. P. Berlandier. PROSE : une boite A outils pour l'interpr~tation de contraintes : guide d'utilisation. Rapport Technique 145, INRIA Sophia Antipolis, 1992. P. Prosser. Binary constraint satisfaction problems : Some are harder than others. In Proc. ECAI, pages 95-99, Amsterdam, the Netherlands, 1994. B. Smith. Phase transition and the mushy region in constraint satisfaction problems. In Proc. ECAI, pages 100-104, Amsterdam, Netherlands, 1994. F. Lebastard. CHOOE: a distributed environment manager. Technical Report 93-22, CERMICS, Sophia-Antipolis (France), D~cembre 1993. P. Crandall and M. Quinn. Data partitioning for networked parallel processing. In Proc. 5th Symposium on Parallel and Distributed Processing, pages 376-380, Dallas, TX, 1993.

178 Pierre Berlandier Pierre Berlandier received his Ph.D. in computer science from INRIA and the University of Nice (France) in 1992. His research interests are focused on various aspects of constraint programming such as constraint satisfaction algorithms, consistency maintenance techniques and constraint-based languages design. He is now a senior consultant for ILOG Inc. in Mountain View, CA where he is working on constraint-based design and scheduling applications.

Bertrand Neveu

Bertrand Neveu graduated from Ecole Polytechnique and Ecole Nationale des Ponts et Chaussees. He worked as a research scientist at INRIA Sophia Antipolis since 1984 on the Smeci expert system shell project. He has then been leading the Secoia research project that focused on the design of AI tools using object oriented and constraint based knowledge representations. He is currently in charge of a constraint programming research team in the CERMICS laboratory in Sophia-Antipolis.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 1997 Elsevier Science B.V.

181

Parallel Propagation in the Description-Logic System FLEX* Frank W. Bergmann a and J. Joachim Quantz b aTechnische Universit~it Berlin, Projekt K I T - V M l l , FR 5-12, Franklinstr. 28/29, D-10587 Berlin, Germany, e-mail: [email protected] bTechnische Universit~it Berlin, Projekt K I T - V M l l , FR 5-12, Franklinstr. 28/29, D-10587 Berlin, Germany, e-mail: [email protected] In this paper we describe a parallel implementation of object-level propagation in the Description-Logic (DL) system FLEX. We begin by analyzing the parallel potential of the main DL inference algorithms normalization, subsumption checking, classification, and object-level propagation. Instead of relying on a parallelism inherent in logic programming languages, we propose to exploit the application-specific potentials of DLs and to use a more data-oriented parallelization strategy that is also applicable to imperative programming languages. Propagation turns out to be the most promising inference component for such a parallelization. We present two alternative PROLOG implementations of paralle!ized propagation on a loosely coupled MIMD (Multiple Instruction, Multiple Data) system, one based on a .farm strategy, the other based on distributed objects. Evaluation based on benchmarks containing artificial examples shows that the farm strategy yields only poor results. The implementation based on distributed objects, on the other hand, achieves a considerable speed-up, in particular for large-size applications. We finally discuss the impact of these results for real applications. 1. I N T R O D U C T I O N In the last 15 years Description Logics (DL) have become one of the major paradigms in Knowledge Representation. Combining ideas from Semantic Networks and Frames with the formal rigor of First Order Logic, research in DL has focussed on theoretical foundations [1] as well as on system development [2] and application in real-world scenarios

[3-5]. Whereas in the beginning it was hoped that DL provide representation formalisms which allowed efficient computation, at least three trends in recent years caused efficiency problems for DL systems and applications: 9 a trend towards expressive dialects; *This work was funded by the German Federal Ministry of Education, Science, Research and Technology (BMBF) in the framework of the Verbmobil Project under Grant 01 IV 101 Q 8. The responsibility for the contents of this study lies with the authors.

182 9 a trend towards complete inference algorithms; 9 a trend towards large-scale applications. With the current state of technology it seems not possible to build a DL system for largescale applications which offers an expressive dialect with complete inference algorithms. The standard strategy to cope with this dilemma is to restrict either expressivity, or completeness, or application size. In this paper we investigate an alternative approach, namely a parallelization of Description Logics. Due to physical limitations in performance gains in conventional processor architectures, parallelization has become more and more important in recent years. This comprises parallel structures inside processors as well as outside by scaling several processors to parallel systems. Several fields of high-performance computing already adopted to this new world of paradigms, such as image processing [6], finite element simulation [7], and fluid dynamics [8]. We expect that in the future parallelism will become a standard technique in the construction of complex AI applications. A standard approach to parallelization in the context of logic programming concentrates on the development of parallel languages that exploit the parallelism inherent in the underlying logic formalism ([9,10] and many more). In this paper we will follow a rather different approach which analyzes a particular application, namely Description Logics. The parallelization we propose uses explicit parallelism based on the notion of processes and messages that is programming language independent. In the next section we give a brief introduction into Description Logics. Section 3 then pi'esents the main inference components of the DL system FLEX and investigates their parallel potential. In Section 4 we describe two different strategies of parallelizing objectlevel propagation in DL systems. The corresponding implementations are evaluated in detail in Section 5 based ' On benchmarks containing artificial examples. Section 6 finally discusses the impact of these results on real applications. 2. D E S C R I P T I O N

LOGICS

In this section we give a brief introduction into Description Logics. Our main goal is to provide a rough overview over DL-based knowledge representation and DL systems. In the next section we will then take a closer look at inferences in the DL system FLEX and their respective parallel potential. 2.1. T h e R e p r e s e n t a t i o n L a n g u a g e In DL one typically distinguishes between terms and objects as basic language entities from which three kinds of formulae can be formed: definitions, descriptions, and rules (see the sample model on page 3 below). A definition has the form 'tn:= t' and expresses the fact that the name tn is used as an abbreviation for the term t. A list of such definitions is often called terminology (hence also the name Terminological Logics). If only necessary but no sufficient conditions of terms are specified a definition has the form 'tn:< t', meaning that 'tn' is more specific than 't'. Terms introduced via ' : - ' are called defined terms, those introduced via ':<' are primitive terms.

183 All DL dialects provide two types of terms, namely concepts (una .ry predicates) and roles (binary predicates), but they differ with respectto the term-forming operators they contain. Common concept-forming operators are: conjunction (cl and c2), disjunction (cl or c2), and negation (not(c)), as well as quantified restrictions such as value restrictions (all(r,c)), which stipulate that all fillers for a role r must be of type c, or number restrictions (atleast(n,r,c), atmost(n,r,c)), stipulating that there are at least or at most n role-fillers of type c for r. Role-forming operators are, besides conjunction, disjunction, and negation, role composition (rl comp r2), transitive closure (trans(r)), inverse roles (inv(r)) and domain or range restrictions (domain(c), range(r,c)). In a description, an object is described as being an instance of a concept (o :: c), or as being related to another object by a role (ol :: r:o2). Rules have the form ' c 1 - > c2' and stipulate that each instance of the concept cl is also an instance of the concept c2. In general, the representation language is defined by giving a formal syntax and semantics. Note that DL are subsets of First-Order Logic (with Equality), which can be shown easily by specifying translation functions from DL formulae into FOL formulae [11,12]. Just as in FOL there is thus an entailment relation between (sets of) DL formulae, i.e. a DL model can be regarded as a set of formulae F which entails other DL formulae (F ~ 7)- Depending on the term-forming operators used in a DL dialect this entailment relation can be decidable or undecidable and the inference algorithms implemented in a DL system can be complete or incomplete with respect to the entailment relation. 2.2. A S a m p l e M o d e l In order to get a better understanding of DL-based knowledge representation let us take a look at the modeling scenario assumed for applications. An application in DL is basically a domain model, i.e. a list of definitions, rules, and descriptions. Note that from a system perspective a model or knowledge base is thus a list of tells, from a theoretical perspective it is a set of DL-formulae F. Consider the highly simplified domain model below, whose net representation is shown in Figure 1. One role and five concepts are defined, out of which four are primitive (only necessa .ry, but no sufficient conditions are given). Furthermore, the model contains one rule and four object descriptions. product chemical product biological product company produces chemical company some(produces,chemical product) toxipharm biograin chemoplant toxiplant

:< :< :< :< :< := => :: :: :: ::

anything product product & not(chemical product) some(produces,product) domain(company) company & all(produces,chemical product) high risk company chemical product biological product chemical company atmost(1,produces) & produces:toxipharm

As mentioned above such a model can be regarded as a set of formulae and the service provided by DL systems basically is to answer queries concerning the formulae entailed

184

1..in

company

O

produces

high risk

~

~

9 fr

chemical ~ ( product j , . r _ . . ~ ~

biological product

chemical company

chemoplant

toxiplant

toxipharm

biogmin

produces

Figure 1. The net representation of the sample domain on page 3. Concepts are drawn as ellipses and are arranged in a subsumption hierarchy. Objects are listed below the most specific concepts they instantiate. Roles are drawn as labeled horizontal arrows (annotated with number and value restrictions) relating concepts or instances. The dashed arrow relates the left-hand side of a rule with its right-hand side ('conc_l' is the concept 'some(produces,chemical product)'). The flashed arrow between 'chemical product' and 'biological product' indicates disjointness.

185 by such a model. The following list contains examples for the types of queries that can be answered by a DL system: 9

tl ?< t2 Is a term tl more specific than a term t2, i.e., is tl subsumed by t27 In the sample model, the concept 'chemical company' is subsumed by 'high risk company', i.e., eve .ry chemical company is a high risk company. 2

9 t l and t2 ?< nothing Are two terms tl and t2 incompatible or disjoint? In the sample model, the concepts 'chemical product' and 'biological product' are disjoint, i.e., no object can be both a chemical and a biological product. .o?:c Is an object o an instance of concept c (object classification)? In the sample model, 'toxiplant' is recognized as a 'chemical company'. 9 o17:r:o2

Are two objects ol,o2 related by a role r, i.e., is 02 a role-filler for r at o1? In the sample model, 'toxipharm' is a role-filler for the role 'produces' at 'toxiplant'. 9 Which objects are instances of a concept c (retrieval)? In the sample model, 'chemoplant' and 'toxiplant' are retrieved as instances of the concept 'high risk company'. 9 o::cfails

Is a description inconsistent with the model (consistency check)? The description 'chemoplant :: produces:biograin' is inconsistent, with respect to the sample model, i.e., 'biograin' cannot be produced by 'chemoplant'. 3 This very general scenario can be refined by considering generic application tasks such as information retrieval, diagnosis, or configuration. 2.3. S y s t e m I m p l e m e n t a t i o n s From the beginning on, research in DL was praxis-oriented in the sense that the development of DL systems and their use in applications was one of the primary interests. In the first half of the 1980's several systems were developed that might be called in retrospection first-generation DL systems. These systems include KL-ONE, NIKL, KANDOR, KL-TWO, KRYPTON, MESON, and SB-ONE. In the second half of the 1980's three systems were developed which are still in use, namely BACK, CLASSIC, and LOOM. The LOOM system [13] is being developed at USC/ISI and focuses on the integration of a variety of programming paradigms aiming at a general purpose knowledge representation system. CLASSIC [2] is an ongoing 2'chemical company' is defined as a 'company' all whose products are chemical products; each 'company' produces some 'product'; thus 'chemical company' is subsumed by 'some(produces,chemical product)' and due to the rule by 'high risk company'. 3Object tells leading to inconsistencies are rejected by DL systems.

186 AT&T development. Favoring limited expressiveness for the central component, it is attempted to keep the system compact and simple so that it potentially fits into a larger, more expressive system. The final goal is the development of a deductive, object-oriented database manager. BACK [14] is intended to serve as the kernel representation system of AIMS (Advanced Information Management System), in which tools for semantic modeling, defining schemata, manipulating data, and que .rying, will be replaced by a single high-level description interface. To avoid a "tool-box-like" approach, all interaction with the information reposito .ry occurs through a uniform knowledge representation system, namely BACK, which thus acts as a mediating layer between the domain-oriented description level and the persistency level. The cited systems share the notion of DL knowledge representation as being the appropriate basis for expressive and efficient information systems [15]. In contrast to the systems of the first generation, these second generation DL systems are full-fledged systems developed in long-term projects and used in various applications. The systems of the second generation take an explicit stance to the problem that determination of subsumption is at least NP-hard or even undecidable for sufficiently expressive languages: CLASSIC offers a ve.ry restricted DL and almost complete inference algorithms, whereas LOOM provides a ve~. expressive language but is incomplete in many respects. Recently, the KRIS system has been developed, which uses tableaux-based algorithms and provides complete algorithms for a ve.ry expressive DL [16]. KRIS might thus be the first representative of a third generation of DL systems, though there are not yet enough experiences with realistic applications to judge the adequacy of this new approach. 4 In the following section we describe the FLEX System, which can be seen as an extension of the BACK system. FLEX is developed at the Technische Universit~it Berlin within the project KIT-VM11, which is part of the German Machine-Translation project VERBMOBIL. 3. T H E F L E X S Y S T E M Having sketched some of the general characteristics of DL we will now turn our attention towards a specific DL system, namely the FLEX system [18]. Compared to its predecessor, the BACK System, FLEX offers some additional expressivity such as disjunction and negation, term-valued features, situated descriptions, and weighted defaults. In the context of this paper another characteristic of FLEX is more important, however, namely the one giving rise to its name, i.e. flexible inference strategies. 3.1. D L I n f e r e n c e s Given the specification of DL in the previous Section, the inference algorithms have to answer queries of the form o tl

?: ?<

c t2

with respect to a knowledge base containing formulae of the form 4The missing constructiveness of the refutation-oriented tableaux algorithms (see next section) leads to problems with respect to object recognition and retrieval (see [17]).

187 tn tn

:---:(

El

=~

0

::

t t C2 C

Two things are important to note: 1. The FLEX system already performs inferences when reading in a model. There are two major inference components, namely the classifier and the recognizer. The classifier checks subsumption between the terms defined in the terminology and thus computes the subsumption hierarchy. The recognizer determines for each object which concepts it instantiates and thus computes the instantiation index. 2. For answering both kinds of queries, the same method can be used, namely subsumption checking. Thus when answering a query 'o ?: c', the system checks whether the normal form derived for 'o' is subsumed by 'c'. Though the recognizer thus uses the classifier to perform its task, there is an important difference between the two components. Whereas the classifier performs only "local" operations, the recognizer has to perform "global" operations. This distinction can be illustrated by briefly sketching the algorithmic structure of both components. Given a list of definitions, the classifier takes each definition and compares it with all previously processed definitions, thereby constructing a directed acyclic graph called the subsumption hierarchy. Thus the concept classifier is a function 'Concept x DAG --+ DAG', where the initial DAG contains the nodes 'anything' and 'nothing'. Locality here means that classifying a concept has no impact on previous classification results, i.e. classifying concept Ca has no impact on the subsumption relation between cl and c2. Recognition, on the other hand, has global effects. Basically, the recognizer processes a list of descriptions and computes for each object which concepts it instantiates, i.e. it is a function 'Description x Index x DAG--+ Index'. Nonlocality here means that recognition triggered by a description 'ol :: c' can lead to changes in the instantiation index for some other object 02, as exemplified by Ol Ol

:: ::

r:o2 all(r,c)

Here processing the second description includes the derivation of '02 :: c' Note that another distinction between classification and recognition is thus that there is exactly one definition for each term in the terminology, whereas objects can be described incrementally, i.e. we can have several descriptions for an object in a situation. In the following we will briefly describe the algorithmic structure of normalization, subsumption checking, and object-level propagation. Before doing so, however, we will first present the proof-theoretical basis of these algorithms.

3.2. Tableaux-Based Algorithms vs. Normalize-Compare Algorithms The first classifiers for DL were specified as structural subsumption algorithms [11]. The basic idea underlying structural subsumption is to transform terms into canonical normal forms, which are then structurally compared. Structural subsumption algorithms

188 are therefore also referred to as normalize-compare algorithms. Note that there is a general tradeoff between normalization and comparison: the more inferences are drawn in normalization, the less inferences have to be drawn in comparison, and vice versa. There is one severe drawback of normalize-compare algorithms--though it is in general straightforward to prove the correctness of such algorithms there is no method for proving their completeness. In fact, most normalize-compare algorithms are incomplete which is usually demonstrated by giving examples for subsumption relations which are not detected by the algorithm [19]. At the end of the 1980's tableaux methods, as known from FOL (cf. [20, p. 180fl]), were applied to DL (e.g. [1,21]). The resulting subsumption algorithms had the advantage of providing an excellent basis for theoretical investigations. Not only was their correctness and completeness easy to prove, they also allowed a systematic study of the decidability and the tractability of different DL dialects. The main disadvantage of tableaux-based subsumption algorithms is that they are not constructive but rather employ refutation techniques. Thus in order to prove the subsumption 'cl :< c2' it is proven that the term 'cl and not(c2)' is inconsistent, i.e. that 'o :: Cl and not(c2)' is not satisfiable. Though this is straightforward for computing subsumption, this approach leads to efficiency problems in the context of retrieval. In order to retrieve the instances of a concept 'c', we would in principle have to check for each object 'o' whether F U {o :: not(c)} is satisfiable, where F is the knowledge base. 5 In most existing systems, on the other hand, inference rules are more seen as production rules, which are used to pre-compute part of the consequences of the initial information. This corresponds more closely to Natural Deduction or Sequent Calculi, two deduction systems also developed in the context of FOL. A third alternative, combining advantages of the normalize-compare approach and tableaux-based methods has therefore been proposed in [12]. The basic idea is to use Sequent Calculi (SC) instead of tableaux-based methods for the characterization of the deduction rules. Like tableaux methods, Sequent Calculi provide a sound logical framework, but whereas tableaux-based methods are refutation based, i.e. suitable for theorem checking, sequent calculi are constructive, i.e. suitable for theorem proving. By rewriting SC proofs of FOL translations of DL formulae we obtain sequent-style DL inference rules like rl :< r2, cl :< c2 ~ cl and c2:< nothing, h : < r2, p > 0 ~

all(r2,cl) :< all(rl,c2) all(r2,c2) and atleast(p,rl,cl) :< nothing

The first rule captures the monotonicity properties of the all operator [23, 24]. If 'has_daughter' is subsumed by 'has_child' and 'computer_scientist' is subsumed by 'scientist', then 'all(has_child,computer_scientist)' is more specific than 'all (has_daughter,scientist)'. The second rule means that a qualified minimum restriction combined with a value restriction which is disjoint from the concept used in the minimum restriction is incoherent. If 'european' and 'american' are disjoint concepts and 'workshop_participant' 5See [22] for tableaux-based algorithms for object-level reasoning and [17] for a discussion of efficiency problems.

189 subsumes 'workshop_speaker', then the concept 'all(workshop_participant,american) and some(workshop_speaker,european)' is incoherent, i.e. is subsumed by 'nothing'. Note that this format is sufficient for a theoretical characterization of a deduction system, i.e. given a set of inference schemata E we can define a least fixed point r by taking the logical closure of a set of formulae F under E. We can then say that F k-z "7 iff e r ~ Though we can study formal properties like soundness or completeness, i.e. the relation between F k-z -y and F ~ % on the basis of this characterization, we need an additional control strategy for turning the deduction system into an algorithm. The main reason for this is that r is not finite. The sequent-style approach thus falls into two separate phases: 1. Derivation of a complete axiomatization by systematically rewriting FOL proofs. 2. Specification of a control strategy to turn the complete axiomatization into an algorithm. In the second phase we have to turn the sequent-style inference rules into normalization rules and subsumption rules. As it turns out, some inference rules can be straightforwardly encoded as both normalization and subsumption rules, while others can only be represented as normalization or subsumption rules, respectively. The idea of a flexible inference strategy then means that each inference rule can be used during normalization, during subsumption or switched off completely. Note that this "customization" of the inference engine of FLEX should be performed by the system developers according to the inference requirements arising in a particular application. 3.3. N o r m a l i z a t i o n For each object, concept, and role, the FLEX system computes a normal form. The basic building blocks of normal forms are so-called atoms, which correspond to the termbuilding operators of the DL, e.g. 'all(R,C)', 'fills(R,O)'. Note that the R's and C's occurring in the atoms are themselves normal forms of roles and concepts. One way of formally defining atoms and normal forms is thus by means of a parallel inductive definition, as done in [12]. For languages not containing negation or disjunction, a normal form is simply a set of atoms. Since the FLEX system supports both disjunction and negation, however, we use the format of disjunctive normal forms proposed in [26]. In the context of this paper, however, it is sufficient to consider the simpler case in which normal forms are represented as sets of atoms. When reading in a concept or a role the parser already performs certain kinds of normalization, such as eliminating negation. Thus given a concept definition 'cn:= c' or an object description 'o :: c', the system transforms the term 'c' into a normal form. This normal form is then further processed by applying normalization rules. The basic idea of normalization is to make information implicitly contained in a normal form explicit, i.e. NORMALIZE: NF -~ NF 6See [25] for a formal definition.

190 The general format of a normalization rule is: 7 0/1,

.

.

.

,

Ot n

~

O~

The application of such a normalization rule is triggered when a normal form contains the atoms a l , . . . , an, in which case a is added to the normal form. The parallel potential of normalization consists in the possibility to apply all normalization rules in parallel. However, the results of applying normalization rules then have to be synchronized in a second step. 3.4. S u b s u m p t i o n C h e c k i n g The task of subsumption checking is to decide whether a normal form subsumes another normal form, i.e. SUBSUMES: NF x NF -~ BOOL As has been mentioned above, there is a general trade-off between normalization and subsumption checking. The more inferences are performed during normalization, the less inferences have to be drawn during subsumption checking. In principle it is possible to produce normal forms which guarantee that subsumption checking only has to test subsumption between individual atoms. Such vivid normal forms have been presented for a simple fragment of DL in [12, Sect. 5]. The disadvantage of such vivid normal forms is that they require the application of many normalization rules, making information explicit which might never be needed. Performing inferences during subsumption checking on the other hand guarantees that inferences are only drawn when actually needed. 8 Basically, subsumption checking between normal forms is ultimately reduced to subsumption checks between atoms, but includes also special subsumption rules for disjunctive parts, non-disjunctive parts, etc. This reduction of subsumption offers the possibility to perform subsumption tests between the atoms in the normal forms in parallel. It should be noted, however, that these atomic subsumption tests are rather fine-grained and unevenly distributed tasks. 3.5. Classification As already sketched above, the primary task of the classifier is to compute a subsumption hierarchy for the terms defined in the terminology. Thus given the normal form of a term, the classifier has to find its direct subsumers and its direct subsumees in the hierarchy. This can obviously achieved by comparing the normal form with all normal forms contained in the hierarchy, i.e. by checking the respective subsumption relations. The number of subsumption checks can be reduced, however, by taking into account the information already contained in the hierarchy. Thus if 'tl' subsumes 't2' and the subsumption test between the normal form and 't2' fails, there is no need to test subsumption between the normal form and 'tl'. Such optimization techniques are employed in all DL systems and are discussed in some detail in [16]. TThis is a further simplification since there are more complicated normalization rules involving disjunctions of atoms. Sin general, a mixed strategy is needed, which ensures both efficient performance and detection of inconsistency already during normalization.

191 The FLEX system uses different forms of classification which are based on a single algorithm, however. When processing an object description, for example, the object's normal form is first classified only with respect to a subset of the subsumption hierarchy, namely the left-hand sides of rules. Moreover, this classification step only computes direct subsumers of the object's normal form, since there is no need to determine the subsumees of objects. The parallel potential of classification obviously is given by the possibility to perform the comparisons of a normal form with all the normal forms contained in the hierarchy in parallel. However the more efficient algorithms utilize the fact that subsumption tests within a classification task are not independent from each other. Consequently, algorithms using this fact pose an order on the subsumption tests and thus loose the property of being straightforwardly parallelizable. 3.6. O b j e c t - L e v e l R e a s o n i n g

As already indicated above, object-level reasoning is inherently non-local and it is therefore useful to distinguish between a local phase and a non-local phase in object-level reasoning. In the local phase we determine for an object the most specific concept it instantiates. This can be done by using the standard normalize and compare predicates. Thus we normalize the description of an object thereby obtaining a normal form and compare it with the normal forms of the concepts in the hierarchy. In addition to this standard classification we also have to apply rules when processing objects. This is achieved by applying all rules whose left-hand sides subsume the object's normal form. After this application the normal form is again normalized and classified until no new rules are applicable [27]. In the non-local phase we have to propagate information to other objects. There are five basic rules for propagation, whose effects we will illustrate below with examples:

01 ::.all(r,c), 01 :: r:02 -~

o2 :: c

01 :: all(r,oneof([ol,...,on])), 02 :: c,...,on :: c --~ 01 :: all(r,c) 01 :: r:02, o2 :: c -~

01 :: (r and range(c)):02

01 :: h:02, o2 :: r2:03 -~ o2 :: (rlcomp r2):03 01 :: r:02 - -

o2:: inv(r):01

(1) (2) (3) (4) (5)

Rule (1) is usually called forward propagation since the value restriction is propagated to each individual filler of a role. Rule (2) is called backward propagation or abstraction over closed role-filler sets--if all fillers of a role are known the most specific value restriction for that role can be abstracted. The other three rules are related to role-forming operators. Rule (3) says that a filler for a role 'r' which is an instance of a concept 'c' is automatically a filler for the role 'r and range(c)'. Rules (4) and (5) capture the semantics of role composition and inversion, respectively. The following examples illustrate the effects of these propagation rules:

192

~(5) (3) (2)

(4) (1)

has_child has_parent has_daughter has_grandchild father happy_father john john mary mary john mary john mary mary john ma.ry

:< := := := := := :: :: :: :: :: :: :: :: :: :: ::

domain(human) and range(human) inv(has,child) has_child and range(female) has_child comp has_child male and some(has_child) father and all(has_daughter,married) male and has_child:ma.ry and exactly(1,has_child) father has_parent:john female has_daughter:mary married happy_father has_daughter:kim has_child:kim has_grandchild:kim female

Note that this sequent-style characterization of object-level reasoning is sufficient for a theoretical characterization of a deduction system (see Section 3.2). The actual propagation algorithm of the FLEX system consists of these rules together with data structures and predicates which trigger the appropriate rules. Basically, whenever an object is processed, we collect propagations of the form 'O :: c' resulting from the five propagation rules. First, it should be noted that object-level propagation is a rather coarse-grained task, since adding information to an object involves normalization and classification (and hence subsumption checking). Furthermore, processing propagations in parallel does not require any additional synchronization of propagation results. Processing a propagation can trigger additional propagations, but these can be processed by the same parallel mechanism. Finally, object-level propagation poses the main performance problem in most DL applications. In Section 6 we sketch some mechanisms to reduce the complexity of objectlevel reasoning in the sequential FLEX system and compare their effects to the speed-up obtained by our parallel implementation. 3.7. P a r a l l e l P o t e n t i a l of Inference A l g o r i t h m s in F L E X To summarize this section on the FLEX system we briefly recapitulate the results concerning the parallelization potential of the different inference algorithms. Figure 2 shows which inference components are used by which other inference components. First, we note that normalization, subsumption checking and classification share the following negative properties with respect to parallelization: 1. they are rather fine-grained tasks; 2. the distribution of the task length is rather widespread; 3. they involve a synchronization step that is not required in the parallelization of object-level propagation.

193 Object-Level Propagation

Classification

~

USeS

Normalization

Subsumption Checking

Figure 2. Dependencies between the DL Inference components.

Object-level propagation, on the other hand, has the following positive characteristics with respect to parallelization: 1. propagation involves coarse-grained tasks; 2. many applications involve a high number of propagations; 3. propagations are independent from each other, monotonic, and invariant with respect to the order of execution; 4. synchronization is only required before an object query ist stated (to make sure that all propagations have terminated); 5. the propagation algorithm tion/communication ratio.

can

be

implemented

with

a

high

computa-

We therefore concentrate on the parallelization of object-level propagation in the following. 4. P A R A L L E L I Z A T I O N S T R A T E G I E S In the remainder of this paper we investigate two parallel implementations of objectlevel propagation. The following properties of propagation should be kept in mind: 1. the length of propagation tasks is not known a priori; 2. the number of propagation tasks is not known a priori; 3. the "direction" of propagation flow is not known a priori. We will briefly illustrate these properties by analyzing the propagation data flow.

194

initial propagation

Figure 3. A group of objects interchanging propagations.

Hems

I000I

/

~

i00

Pending Propagations

/

N

k Iz

I0 Propagation o

2

4

6

8

1o

12

14

16

18

I

l

Figure 4. Exponential increase of propagations.

20

Steps

195

lime

Figure 5. Timing of the farm communication scheme.

4.1. F L E X D a t a F l o w We begin by noting several relevant properties of object-level propagation. As already indicated above propagation of information from one object to another can cause additional propagation to other objects. This kind of 'ping-pong' interaction terminates only when a 'fixed point' is reached and no new information is produced. Since propagation in Description Logics is monotonic, we can execute propagations in an arbitra .ry order, always ending up with the same result. We will refer to this property as confluence. For illustration consider the artificial example in Figure 3. The initial propagation affects object '02' and processing this object yields propagations to objects 'o1', '06', and 'o8', etc. The propagation at '02' thus creates a "chain reaction", i.e. the number of pending propagations increases exponentially. After some time the new information is integrated into the network and the pending propagations decrease until the fixed point is reached. For a parallel processing of propagations we thus obtain the work distribution qualitatively sketched in Figure 4. In the initial phase there are only few propagations and hence idle processors, in the middle phase there are more pending propagations than processors, and in the final phase there are again only few propagations and hence idle processors. Thus the propagation steps in the middle part will take longer to process since they cannot be completely distributed to the limited number of processors available. Given the analysis of the FLEX data flow, we consider two parallel paradigms as potential candidates for an implementation: The farm paradigm and the distributed objects paradigm. In the remainder of this section we briefly present these two alternatives. Theoretical considerations and numerical calculations towards efficiency and scalability can be found in the detailed analysis in [28].

196

tim

Figure 6. Communication events and workload distribution during the first two propagation stages.

4.2. Farm Parallelism The .farm communication structure shown in Figure 5 is widely used in industrial applications such as image processing [6] and finite element simulation [7]. It is theoretically well known and there exists a variety of strategies to distribute workload evenly across a network. A farm consists of several parallel processes with a fixed structure: one process is called 'master' and is responsible to distribute tasks to a group of 'worker' processes which perform their tasks in parallel and return control and results back to the master. Farm structures are frequently used to parallelize applications that can be split into subtasks with a priori known duration. Examples are image processing or finite element systems. From a theoretical point of view, there are two potential sources of inefficiency in this architecture: 1. uneven distribution of workload and 2. a communication bottleneck created by the centralized position of the master.

4.3. C o m m u n i c a t i n g Objects Parallelism In the communicating-objectsparadigm the central management institution (master) of the farm parallelism is replaced by (local) knowledge of all objects about the 'addresses' of

197

,s+r.o,,..~..~ . ~ . .

_

:

-_

. . - - _ - _ - - - . -

-

-.

gN, g,,l : :,

Topology Processors Workers

2xl 3 1

3xl 4 2

2x2 5 3

", -:%

3x2 7 5

3x3 10 8

4x3 13 11

4x4 17 15

Figure 7. Hardware configuration and working nodes.

their neighbors. Objects communicate directly with each other, in contrast to the centered communication scheme of the farm. This helps to avoid communication bottlenecks in a network. The general difference between a farm and a network of communicating objects is the different perspective of parallelism: Within a farm, tasks are distributed; within the distributed objects scheme, objects are distributed. This approach appears to be similar to the agent-based paradigm developed by distributed AI research [29]. In contrast to this approach, objects within FLEX have to be considered elements of a distribution strategy rather then independently interacting entities. With respect to the definition given in [30] we have to subsume our efforts here under the field of 'distributed problem solving'. For an effective balancing of workload, certain assumptions about tasks and the computational environment have to be made. In our case, all processors can be assumed to behave identical and the statistical distribution of the task length is assumed to be narrow. Uneven distributions of workload can finally be treatedby special load balancing algorithms (see below).

5. E X P E R I M E N T A L

RESULTS

In this section we describe the hardware and the software used for implementing and evaluating our parallel implementations of FLEX. We also present benchmark results for artificial examples. In the next section we discuss the connection between these benchmark evaluations and real applications.

198 cl c2 c3

:< :< :<

all(r,c2) all(r,c3) all(r,cl)

ol o2 03 04 05 06 07 08 ol

:: :: :: :: :: :: :: :: ::

r:o3 r:o4 r:o7 r:ol r:o2 r:ol r:o3 r:o7 cl

and and and and and and and and

r:o2 r:o7 r:o2 r:o8 r:o7 r:o7 r:o8 r:o4

and and and and and and and and

r:o8 r:o2 r:ol r:o6 r:o8 r:o5 r:o4 r:o6

Figure 8. A sample benchmark with 8 objects, 3 concepts and fan-out 3.

5.1. H a r d w a r e and Software We chose the 'Parsytec Multicluster II' machine as base for the parallel implementation of FLEX. It consists of 16 processing nodes that each contain an INMOS T800 Transputer with 4 MByte of RAM. Each Transputer consists of a RISC processing kernel, a memo .ry interface and 4 DMA driven serial interfaces, called 'links'. Each link has a transfer rate of approximately 1.2 MByte/s and all 4 links can run independently together with the RISK kernel, hardly: affecting processing performance (communication delays could be neglected). This hardware platform is especially suitable to serve as a testbed for parallel implementations due to its simple architecture and the availability of comfortable development environments. However, it does not provide state of the art computational performance and suffers substantially from memo .ry restrictions. Figure 7 shows the topologies used for the tests in this section and the number of available worker nodes. The overhead of 2 processors is due to memo .ry limitations. Processor I could not be used because its 1 MByte RAM is not sufficient to hold the FLEX code. Processor 2 is used to hold the 'shell' process that synchronizes the generation of new objects. Normally this process can be located somewhere in the network and would not consume any computing performance, but in this case it had to be separated due to memo .ry restrictions. The language used to implement Distributed FLEX is a Prolog dialect called Brain Aid Prolog (BAP)[31]. It represents a standard Prolog with parallel libra .ry extensions, implementing a scheme similar to CSP [32]. Parallelism and synchronization is expressed explicitly using the notion of processes and messages. A process in BAP is a single and independent Prolog instance with a private database. A message is any Prolog term that is exchanged between two processes. Messages are sent and received using the 'send_msg(Dest, Msg)' and 'rec_msg(Sender, Msg)' predicates. Message sender and destination are identified by their process id' (PID). Messages are routed transparently through the network. The order of messages is maintained when several messages are sent from the same sender to the same destination. When a message reaches its destination process, it is stored in a special database, called 'mailbox'. Each process owns its private

199

el 0_3_2 c20_3_2

(seq) (58) (177)

1 78 253

2

3

5

63 185

55 159

56 160

8 58 162

Figure 9. Execution times (seconds) for the farm parallelization.

mailbox in which the messages are stored FIFO. Although the development of parallel FLEX was greatly simplified by the way BAP expresses parallelism, it is possible to apply the same parallel techniques to future FLEX implementation in programming languages such as LISP or C. 5.2. B e n c h m a r k s

To evaluate our parallel implementations we developed a benchmark generator that is capable of generating randomly connected networks of objects. The basic structure of these benchmarks is shown in Figure 8, which contains a benchmark consisting of 8 objects, 3 concepts, and a fan-out of 3. Thus, the final object tell 'ol :: cl' triggers propagation to three objects, namely 'o2', '03', and 'o8'. Each of these propagations again triggers 3 more propagations, etc. It should be obvious that the average fan-out, i.e. the number of additional propagations triggered by each propagation is a measure to describe the "avalanche" effect and is the major factor for the system's performance. 9 Figure 9 shows the execution times for the farm benchmarks. The first row contains the benchmark names that are composed by three numbers that indicate the number of objects, concepts and fan-out respectively (for example 'c20_5_3' means that the benchmark consists of 20 objects, 5 concepts and a fan-out of 3). The following rows give the execution times with respect to the number of processors. The '(seq)' fields gives the reference time of the (original) sequential version of FLEX. The parallelization of FLEX using the farm paradigm showed very poor results. This can be explained by the rather high costs to distribute the system state out to the workers and to integrate the computation results back into the system state. Both activities have to be done sequentially, thus slowing down the parallel execution. Although there is some potential for optimizing the FARM implementation, we stopped the development and focused on the distributed-object version of FLEX. The parallelization of FLEX using the distributed objects paradigm turned out to be a lot more promising. Figure 10 shows the absolute execution times of the considered benchmarks. The names of the benchmarks are composed as in Figure 9. Note that the execution times in Figure 10 are measured with an accuracy of • seconds. The sequential execution times (entries in the '1' row) for several benchmarks are not available due to the memory limitations. This means that it is not possible to calculate the relative speed-up in such a line (future tests on Transputer machines with 9[28] analyzes quantitatively the influence of the avalanche exponent on the applicability of parallel execution.

200

c10_3_2 c10_3_3 c10_3_4 c20_3_2 c20_3_3 c20_3_4 c20_5_3 c20_5_4 c40_3_2 c40_3_3 c40_3A c80_3_2 c80_3_3 c80_3_4

1.0 1.0 1.0 1.0

1 59 43 327 179

1.9 1.5 2.0 1.8

30 28 159 97 145 240 527 344 314

1.8 1.5 2.3 3.0

3 32 28 141 59 129 173 355 453 176 258 569 1032

3.3 2.4 3.1 3.1

5 18 18 105 58 58 164 173 411 137 231 467 665

3.3 2.7 3.8 4.0

8 18 16 87 45 56 70 155 185 105 141 264 225 443

947

2.8 2.5 4.4 4.8

11 21 17 74 37 66 74 141 126 111 111 319 200 336 662

3.3 2.9 4.4 4.5

15 18 15 73 40 51 68 160 189 72 87 230 181 266 583

Figure 10. Benchmark Execution Times.

more memo .ry will fill these gaps). This is the reason why we omited the speed-up figures in all but the first 4 benchmarks. The table shows high speed-ups (efficiencies > 80~163for all benchmarks, if the number of objects exceeds the number of processors by a certain factor (between 5 and 10). This result can be interpreted by the perspective of Section 3, where we saw that network efficiently is dependent on the number of pending propagations in the network. If this number is too low, few processing nodes are busy, resulting in a bad network efficiency. Within [28] the quantitative analysis shows that the propagation-processor ratio is more relevant to system performance than the overhead caused by message passing. 1~ It also indicates how these problems can be overcome, allowing for even larger networks. A major problem for all distributed systems is the balance of load inside the network. Within distributed FLEX each object represents a potential load. Unfortunately, the presence of objects is only a statistical measure for load, while the actual distribution depends on runtime conditions. The illustration in Figure 11 depicts execution of a benchmark with an uneven load distribution. The Transputers 2 and 4 slow down the overall performance. It is easy to see that the network is quite busy during the first half of the execution time (ca. 75~163 efficiency). At the second half, all object servers have terminated, except two (ca. 25% efficiency). This leads to a reduction of the overall efficiency to ca. 50~163 and explains the variation of the results in Figure 10. The necessa~, optimization of the uneven distribution of processor load over the network can be achieved by temporarily 'relocating' objects to other processors. Such a mechanism would be capable of reducing overhead time created by loads remaining on a '~ is valid for Transputer systems with 2..256 processors, 2D matrix topology and shortest path routing algorithm.

201

Load (%)

Time l )

o, ~

~

c:) eq

3

Procuaor

Figure 11. Runtime behavior of distributed FLEX within a 3x3 Network.

few processors. We are currently implementing this optimization. 6. A P P L I C A T I O N S C E N A R I O S In the previous section we have presented evaluation results for benchmarks based on artificial examples. These results have shown a considerable speed-up for the parallelization based on distributed objects. In this section we discuss the impact of these results for real applications. First, it should be noted that the "optimal speed-up" is achieved for examples involving both 1. many propagation steps; 2. many pending propagations. If there are only few pending propagations, there are no tasks to distribute; if there are only few propagation steps, the phase in which many processors are used is rather short (cf. Figure 4). Obviously, such examples pose severe performance problems for sequential propagation algorithms and several strategies are possible to avoid these problems, such as 1. restricting the expressivity of the representation language; 2. using incomplete inference algorithms;

202 3. simplifying the structure of the model. As can be seen from the examples in Section 3 one source of propagations are roleforming operators such as composition, inversion, or range. Banning such operators from the representation language or using them carefully in the model is thus one way to avoid performance problems due to propagation. Let us illustrate this strategy by briefly sketching the use of the sequential FLEX system in the VERBMOBIL project. VERBMOBIL is a project concerned with face-to-face dialogue interpreting [33] and focuses in its current phase on the translation of appointment scheduling dialogues from German and Japanese to English. The FLEX system is used in the component for semantic evaluation which takes the semantic representation of German utterances and performs contextual disambiguation in order to provide additional information needed by the transfer component. A central task of semantic evaluation is to determine the dialogue act performed by an utterance, e.g. suggest_date, reject_date, request_comment [34]. This is achieved by using the approach of preferential disambiguation [4,35] where preference rules are encoded as weighted defaults [36] in FLEX. Semantic evaluation in the current VERBMOBIL system thus contains three steps: 1. translating the semantic representation into FLEX object tells; 2. processing these objects tells; 3. applying the weighted defaults to determine the preferred interpretation. The basic idea of representing utterances in FLEX is to introduce objects for the linguistic signs used in the utterance. Thus a simple sentence like Ich schlage Freitag vor. (I propose Friday.) would be represented as ol 02 oa

:: :: ::

per_pron and per:l and num:sg freitag vorschlagen and argl:ol and arg2:o2 and tense:pres

In order to achieve acceptable performance we minimized the number of object tells and the number of propagations. This is possible by using a flat representation instead of a structured one. Furthermore, we used a feature of the flexible inference strateKy implemented in the FLEX system. This feature allows the user to specify for each role which propagation rules are to be applied. The user can thus control the amount of propagation in the model and in particular can avoid propagations which do not add any significant information. This approach thus yields the performance advantages of incomplete algorithms while offering at the same time a control of the incompleteness to the user. In the current implementation of the semantic-evaluation component we thus achieve an average processing time of 5 seconds per utterance on a Sparc 10. In the prototype of the VERBMOBIL system, which will be released in October '96, we will use a C-t-+

203 Processors 1 2 3 4

Seconds 33 22 23 17

Figure 12. Processing time for a real-world example from NLP.

reimplementation of the FLEX system and thereby hope to reach processing times of 1 second per utterance. It should be obvious, however, that the constraints posed on the modeling and on finetuning the propagation control by hand is far from satisfying. Achieving the performance gains by parallel algorithms instead would be much more adequate for the user. Since the memory restrictions of the transputer hardware did not allow to test our parallel implementation with our real application, we reimplemented parallel FLEX on a cluster of SUN workstations connected by Ethernet. First evaluation results show that communication between processors is 20 times faster on transputers than on SUNs, whereas computation is 50 times faster on SUNs than on transputers. The computation/communication ratio is thus 1000 times better on transputers than on SUNs. Figure 12 shows the first evaluation result for processing an utterance of the VERBMOBIL scenario. Note that all propagation rules are active at all roles in this example, hence the 33 seconds execution time on a single processor. As can be seen, the speed-up is not as good as for our benchmarks which is due to the comparatively low number of pending propagations and propagation steps, as well as to the communication overhead in the SUN cluster. Nevertheless, this result shows that pursuing the parallelization of propagation is a promising alternative to the current strategies of dealing with performance problems, i.e. incomplete algorithms, simplified modeling structures, or restricted expressiveness. 7. C O N C L U S I O N The results of the parallel implementation of FLEX are in general very satisfying. We achieved high speed-ups with benchmarks that are structurally similar to a real-world applications in natural language processing (> 80% for benchmarks that fit the size of the network). The efficiency of execution rises mainly with the propagation/processor ratio and thus with the application size. This is an important result because especially large applications are to be considered candidates for a parallel implementation. Theoretical considerations [28] show that there are only few technical limits to the scalability of the distributed objects implementation. We have to state that the Transputer system under consideration here is not applicable to real world problems due to its poor overall performance and its memo .ry restrictions. Ideal candidates for such implementations are parallel computers with large (local)

204 memory resources and high communication bandwidth. Alternatively, shared-memo .ry multiprocessor workstations fulfill all requirements for an efficient parallelization. First evaluation results obtained on a cluster of workstations for an example taken from the VERBMOBIL application confirm the benchmark results, even though the speed-up is not quite as high. We assume that the communication structure of FLEX is similar to many other applications in Artificial Intelligence. In particular, applications involving complex, forwardchaining inferencing are potential candidates for a parallelization based on the distributedobjects approach presented in this paper. REFERENCES

1. F.M. Donini, M. Lenzerini, D. Nardi, W. Nutt, "The Complexity of Concept Languages", KR'91,151-162. 2. R. Brachman, D.L. McGuiness, P.F. Patel-Schneider, L. Alperin Resnick, A. Borgida, "Living with CLASSIC: When and How to Use a KLONE-like Language", in J. Sowa (Ed.), Principles of Semantic Networks: Explorations in the Representation of Knowledge, San Mateo: Morgan Kaufmann, 1991, 401-456. 3. J.R. Wright, E.S. Weixelbaum, G.T. Vesonder, K.E. Brown, S.R. Palmer, J.I. Berman, H.H. Moore, "A Knowledge-Based Configurator that Supports Sales, Engineering, and Manufacturing at AT&T Network Systems", Proceedings of the Fifth Innovative Applications of Artificial Intelligence Conference, 1993, 183-193. 4. J.J. Quantz, B. Schmitz, "Knowledge-Based Disambiguation for Machine Translation", Minds and Machines 4, 39-57, 1994. 5. The KIT-MIHMA Project, Technische Universitiit Berlin, Projektgruppe KIT, http: //www. cs.t u- berlin, de/..-kit / mihma, ht ml. 6. H. Burkhard, A. Bienick, R. Klaus, M. NSlle, G. Schreiber, H. Schulz-Mirbach, "The Parallel Image Processing System PIPS" in R. Flieger, R. Grebe (eds), Parallelrechner Grundlagen und Anwendung IOS Press, Amsterdam, Netherlands, 1994, 288-293. 7. R. Diekmann, D. Meyer, B. Monien, "Parallele Partitionierung unstrukturierter Finite Elemente Netze auf Transputernetzwerken" in R. Flieger, R. Grebe (eds), Parallelrechner Grundlagen und Anwendung IOS Press, Amsterdam, Netherlands, 1994, 317-326. 8. Strietzel, "Large Eddy Simulation turbulenter StrSmungen auf MIMD-Systemen" in R. Flieger, R. Grebe (eds), Parallelrechner Grundlagen und Anwendung, IOS Press, Amsterdam, Netherlands, 1994, 357-366. 9. K. Clark, S, Grego.ry, "PARLOG: Parallel Programming in Logic" in E. Shapiro (ed), Concurrent Prolog The MIT Press, Cambridge, Massachusetts, 1987, 8 4 - 139. 10. E. Pontelli, G. Gupta, "Design and Implementation of Parallel Logic Programming Systems", Proceedings of ILPS'9~ Post Conference Workshop 1994. 11. J. Schmolze, D. Israel, "KL-ONE: Semantics and Classification", in BBN Annual Report, Rep.No. 5421, 1983, 27-39. 12. V. Royer, J.J. Quantz, "Deriving Inference Rules for Terminological Logics", in D. Pearce, G. Wagner (eds), Logics in AI, Proceedings of JELIA-92, Berlin: Springer, 1992, 84-105.

205 13. R. MacGregor, "Using a Description Classifier to Enhance Deductive Inference", in Proceedings SeventhIEEE Conference on AI Applications, Miami, Florida, 1991; 141147. 14. T. Hoppe, C. Kindermann, J.J. Quantz, A. Schmiedel, M. Fischer, BACK 1/5 Tutorial ~ Manual, KIT Report 100, Technische Universit~it Berlin, 1993. 15. P.F. Patel-Schneider, An Approach to Practical Object-Based Knowledge Representation, Technical Report 68, Schlumberger Palo Alto Research, 1987. 16. F. Baader, B. Hollunder, B. Nebel, H.J. Profitlich, E. Franconi, "An Empirical Analysis of Optimization Techniques for Terminological Representation Systems", KR-92, 270-281. 17. C. Kindermann, Verwaltung assertorischer Inferenzen in terminologischen Wissensbanksystemen, PhD Thesis, Technische Universit~it Berlin, 1995. 18. J.J. Quantz, G. Dunker, F. Bergmann, I. Kellner, The FLEX System, KIT Report 124, Technische Universit~it Berlin, 1995. 19. B. Nebel, Reasoning and Revision in Hybrid Representation Systems, Berlin: Springer, 1990. 20. G. Sundholm, "Systems of Deduction", in D. Gabbay, F. Guenthner (eds), Handbook of Philosophical Logic, Vol. I: Elements of Classical Logic, Dordrecht: Reidel, 1983, 133-188. 21. F.M. Donini, M. Lenzerini, D. Nardi, W. Nutt, "Tractable Concept Languages", IJCAI-91, 458-463. 22. A. Schaerf, Query Answering in Concept-Based Knowledge Representation Systems: Algorithms, Complexity, and Semantic Issues, Dissertation Thesis, Dipartimento di Informatica e Sistemistica, Universit~ di Roma "La Sapienza", 1994. 23. D. Westersts "Quantifiers in Formal and Natural Languages", in D. Gabbay, F. Guenthner (eds), Handbook of Philosophical Logic, Vol. IV: Topics in the Philosophy of Language, Dordrecht: Reidel, 1989, 1-131. 24. J.J. Quantz, "How to Fit Generalized Quantifiers into Terminological Logics", ECAL 9P, 543-547. 25. V. Royer, J.J. Quantz, "On Intuitionistic Query Answering in Description Bases", in A. Bundy (Ed.), CADE-94, Berlin: Springer, 1994, 326-340. 26. B. Kasper, "A Unification Method for Disjunctive Feature Descriptions", A CL-87, 235-242. 27. B. Owsnicki-Klewe, "Configuration as a Consistency Maintenance Task", in W. Hoeppner (Ed.), Proceedings of GWAI'88, Berlin: Springer, 1988, 77-87. 28. F.W. Bergmann, Parallelizing FLEX, KIT Report in preparation, TU Berlin. 29. A.C. Dossier, "Intelligence Artificielle Distribuee", Bulletin de I'AFIA, 6, 1991. 30. A. Bond, L. Gasser, "Readings in Distributed Artificial Intelligence", Morgan Kaufmann, Los Angeles, CA, 1988. 31. F.W. Bergmann, M. Ostermann, G. von Walter, "Brain Aid Prolog Language Reference" Brain Aid Systems, 1993. 32. C. A. R. Hoare, "Communicating Sequential Processes" Prentice Hall, Englewood Cliffs, N.J., USA, 1985. 33. M. Kay, J.M. Gawron, P. Norvig, Verbmobil: A Translation System/or Face-to-Face Dialog, CSLI Lecture Notes 33 August 1991.

206 34. B. Schmitz, J.J. Quantz, "Dialogue Acts in Automatic Dialogue Interpreting", TMI95, 33-47. 35. J.J. Quantz, "Interpretation as Exception Minimization", IJCAI-93, 1310-1315. 36. J.J. Quantz, S. Suska, "Weighted Defaults in Description Logics--Formal Properties and Proof Theo .ry", in B. Nebel, L. Dreschler-Fischer (eds), KI-94: Advances in Artificial Intelligence, Berlin: Springer, 1994, 178-189.

207

Frank Bergmann Frank Bergmann is a researcher in the Department of Computer Science at the Berlin University of Technology. He received a Diploma in electronic engineering from the RWTH Aachen in 1995. He developed several parallel Prolog systems at Parsytec GmbH, Aachen and Brain Aid Systems GbR. He is now a researcher in the project KIT-VM11 which is part of the German Machine Translation project VERBMOBIL. His current research interests include Robust Natural Language Processing and Parallel Logic Programming.

J. Joachim Quantz J. Joachim Quantz is a researcher in the Department of Computer Science at the Berlin University of Technology. He received a Diploma in computer science in 1988, a Master's degree in linguistics and philosophy in 1992, and a Ph.D. in computer science in 1995. From 1988 to 1993 he worked as a researcher in an ESPRIT project on Advanced Information Management Systems. Since 1993 he is leader of the project KIT-VM11 which is part of the German Machine Translation project VERBMOBIL. His current research interests include Robust Natural Language Processing, Natural Language Semantics, Machine Translation, Nonmonotonic Reasoning and Description Logics.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

209

An Alternative Approach to Concurrent Theorem-Proving Michael Fisher Department of Computing Manchester Metropolitan University Manchester M1 5GD United Kingdom EMAIL:

M. F i s h e r @ d o c .mmu. ac. uk

We present an alternative mechanism for representing concurrent theorem-proving activity which primarily relies upon massive parallelism and efficient broadcast communication. This model of distributed deduction can be utilised in order to provide more dynamic, flexible and open systems. In addition to the representation of deduction in classical propositional and first-order logics framework, we provide correctness results for the approach, and consider the practical aspects of the system's implementation. The approach to concurrent theorem-proving we propose is based upon the use of asynchronously executing concurrent objects. Each object contains a particular set of formulae and broadcasts messages corresponding to specific information about those formulae. Based upon the messages that an object receives, it can make simple inferences, transforming the formulae it contains and sending out further messages as a result. Communication can be organised in such a way that, if a formula, distributed across a set of objects, is unsatisfiable then at least one object will eventually derive a contradiction. In addition to representing simple deduction in this framework, we indicate how, by utilising the underlying computational model, information not directly associated with the formulae being manipulated may be encoded. In particular, complex control behaviour can be provided, allowing not only the implementation of a range of proof strategies, including opportunistic, competitive and cooperative deduction, but also providing the basis for the development of simple agent societies. 1. I n t r o d u c t i o n

Theorem-proving is a complex activity and, as such, is a natural application area for techniques in concurrent programming. Indeed, it could be argued that the practical viability of large-scale theorem-proving is dependent upon the effective utilisation of concurrent systems. This observation has led to a range of concurrent theorem-provers, for example [26,19,21,20]. Although these approaches have succeeded to some extent, it has become clear that a more dynamic and open approach to concurrent theorem proving will be required in the future. The majority of concurrent theorem-proving systems developed so far have been based upon the idea of concurrent processes with centralised control, for example a tree structure

210 of communication and control within a set of currently active processes. While there have been notable exceptions to this, for example [22,7,4], these have still been based on a fairly standard model of deduction. (Note that a comparison with such related work will be provided in w In this paper, we propose an alternative view of concurrent theorem-proving where the formulae are distributed amongst autonomous concurrent objects (sometimes referred to as agents). These objects have control over their own execution (during which deduction takes place), as well as their own message-passing behaviour (during which information is shared with other objects). Further, since broadcast message-passing is used as the basic communication mechanism, other objects are able to view (and utilise) the intermediate deductions produced by each object. The purpose of our approach is not only to provide a framework in which cooperative and competitive deduction can take place, and which is open (the object space is unconstrained) and dynamic (objects can be dynamically created), but also to view deduction in a radically different operational manner. Consider a logical formula represented, for example, in clausal form. If the generation of an explicit contradiction is attempted, for example using resolution, information is effectively passed from one clause to another on each resolution step, with new clauses being generated in the process. While standard resolution systems have centralised control mechanisms which match clauses together and perform resolution, the approach we propose is based upon a data-driven view of this deduction mechanism. Here, the clauses themselves pass information around and decide how to proceed on the basis of information received. 1.1. M o t i v a t i o n Assuming we have a computational model based upon sets of communicating objects, we can define the basic concurrent theorem-proving activity by distributing formulae across objects and by providing an execution mechanism within objects based upon a suitable logical deduction mechanism. To ensure that communication between objects corresponds to appropriate proof steps, we impose the following constraints on execution within, and on communication between, objects.

1. Objects containing single positive literals should pass the values of these literals on to the other objects, since these literals represent basic information that the object is sure of. (In our model this transfer of information is achieved via broadcast message-passing.) Note that the use of just positive literals does not restrict the logical power of the deduction mechanism. 2. Objects should transform their formulae (for example, using a resolution-like procedure) on the basis of the information received, again passing on any new positive literals generated. 3. Objects that derive a contradiction should notify other objects of this fact. As an example, consider the following set of propositional Horn Clauses.

211 1. p 2. -~p V q V -~r 3. -~p V -~q V -~r 4.

-~p V r.

Now, assume that each clause is stored in an object of the type outlined above. As the objects begin executing, the first object, containing only the positive literal p, broadcasts the fact that p is true to the other objects. Once p has been received by the other objects, the object containing clause 4 is transformed to a positive literal r. This again causes new information to be broadcast, in this case the fact that r is true. Once the r broadcast above reaches all the other objects, then the objects containing clauses 2 and 3 will both be transformed to literals. Finally, q is broadcast and the object containing clause 3 generates a contradiction. In spite of the simplicity of this proof, we can see how the communication patterns of the concurrent system match the proof steps in the sequential refutation. There are several important points to notice about this computation, as follows. 9 As objects execute asynchronously, each object can process and send messages at its own speed. Note that the deduction is in no way dependent upon the order of the messages. If the set of clauses is unsatisfiable, then as long as the messages eventually arrive, false will be generated. The fastest that the above system can generate a contradiction, given the initial configuration of objects, is the total time for the three messages (i.e., p, q, and r) to be sent. Note that this still holds even if we add a large number of other objects dependent upon these messages. Thus, the branching factor that is common in proof search is exhibited in our model by additional broadcasted messages. As objects execute concurrently, several objects can broadcast information at the same time. In particular, if we imagine a set of clauses such as ].

p.

2. q. 3. -~p v -~q. then messages 'p' and 'q' may be broadcast simultaneously. Note that such concurrency effectively represents a form of hyper-resolution, a variety of resolution whereby more than two clauses are resolved together in one resolution step. This correspondence is exhibited by the fact that the object containing clause 3 may consume both the p and q messages, transforming its clause to false in one step. Although, in the above example, we allocated one clause to each object, efficiency can be improved by clustering clauses together within objects. We discuss this approach further in w Further, there is the potential for deleting objects as those whose information has been broadcast are no longer needed (although, for efficiency, we may wish to retain these objects rather than requiring that objects record potentially large histories). It is this framework for concurrent theorem-proving that we outline in the rest of the paper.

212 1.2. S t r u c t u r e of the Paper The structure of this paper is as follows. In w we outline the computational model on which our concurrent theorem-proving activity is based. In w we present the mechanisation of classical propositional logic based upon this model, while in w we present the correctness arguments for the approach. In w we extend the theorem-proving activity to first-order logic and consider practical aspects relating to the implementation of the approach, particularly those highlighted by the first-order extension. In w we indicate how this computational model allows a range of multi-agent theorem-proving activities to be represented. Finally, in w concluding remarks are provided and future work is outlined.

2. A Computational Model for Communicating Objects The computational model we use combines the two notions of objects and concurrency and is derived from that utilised in [13]. Objects are considered to be self contained entities, encapsulating both data and behaviour, able to execute independently of each other, and communicating via message-passing. In addition, we specify the following properties. 1. The basic method of communication between objects is broadcast message-passing. 2. Each object has an interface (see below) defining messages that it will r e c o g n i s e an object can dynamically change its own interface. 3. Messages are sent concurrently, and there may be an arbitrary, but finite, delay between a message being broadcast and it being received. 4. The object-space can be structured by grouping appropriate objects together and by restricting broadcast communication across group boundaries. Thus, rather than seeing computation as objects sending mail messages to each other, and thus invoking some activity (as in the actor model [16,1]), computation in a collection of objects can be visualised as independent entities listening to messages broadcast from other objects. This model is both general and flexible, and has been shown to be useful in a variety of dynamic distributed systems (see, for example [11]).

Object Interfaces Since objects communicate via broadcasting messages and individual objects only act upon certain identified messages, a mechanism is provided for each object to filter out messages that it wishes to recognise, ignoring all others. The definition of those messages that an object recognises, together with a definition of the messages that an object may itself produce, is provided by the interface definition for that particular object. The interface definition for an object, for example 'stack', is defined as follows. s t a c k (pop, push) [popped, s t a c k f u l l ] . Here, {pop, push} is the set of messages the object recognises, while {popped, s t a c k f u l l } is the set of messages the object might produce itself. These sets of messages need not

213 be d i s j o i n t - an object might broadcast messages that the object itself recognises. Note also that many distinct objects may broadcast and recognise the same messages. In the case of concurrent theorem-proving described in this paper, the only messages that an object is interested in are those relating to the negative literals within that object's clauses. In this way, the object's interface can be automatically generated from the set of clauses it contains. Thus, the interface definition of, for example, an object (called obj2) containing the clause ~p V q V ~r has the interface definition o b j 2 ( p , r ) [q]. Here, obj2 will respond to 'p' and 'r' messages, ignoring all others. It will only send 'q' messages.

3. Automating Deduction in Propositional Logic We now consider the mechanization of classical propositional logic within our model of concurrent theorem-proving. Throughout, we will assume that the original propositional formula that is to be checked for satisfiability has been rewritten as a set of clauses. We note that this is a general, and relatively cheap, operation [25]. 3.1. N o n - H o r n C l a u s e s Whereas the use of Horn clauses within objects (such as in the example provided in w ensured that a contradiction could be found simply by passing messages representing positive literals between objects, the switch to full propositional logic implies a more complex type of message. Consider the following set of clauses

1. 2. 3. 4.

~pyq ~qvp ~p v ~q pvq

While the first three clauses are Horn clauses, the fourth is not and, as there is no clause containing a single positive literal, no simple starting message can be broadcast. Thus, we extend the model so that Horn clauses themselves can be broadcast.

3.2. Notation Rather than using the standard clausal form in order to represent formulae distributed amongst objects, we introduce a simple notation that will be useful for representing not only the basic logical information but also the mechanism by which objects perform execution and communication. In addition, since a single clause may serve several different purposes within the system, it is useful to represent the clause as several different entities, characterizing each of these behaviours. The basic atoms in our notation are Horn clauses, with clauses such as ~aV-~bv

~cv d

214 being represented by Such atoms can now be passed between objects and are used in themselves. These rules are of the form

rules within

the objects

A[Xi]ci =~ [Y]d 4

where ci and d are positive literals and simply corresponds to the formula

Xi and

Y are sets of positive literals. Such a rule

V

V V - xvc 4 xE Xi

yv,t.

yEY

Often we will use the notation

[x,p]r in order to represent the Horn clause

(p^ Aq)

r

qEX

Note that syntactic elements such as X here represent universally quantified variables that will be matched against sets of positive literals later. 3.3. G e n e r a t i n g R u l e s f r o m Clauses To show how to map clauses in the original set to rules of the above form, we consider a general form of clause and present the rules produced. (Note that one clause may generate several rules.) The general clause we take is n

m

Vp, v V qj 4=1

j=l

where P4 and qj are propositions. The rules produced are (A A[. X ~ =] qI j

4=lA[X'Pi]false) ~

[X]false

and, for each i 9 {1... n}, (j~I[X]qj)

=~ [X, Pi]Pi

where, if m = O, then X = 0. Note that '[true]' is equivalent to '[]' and atoms of the form '[true]p' are often abbreviated simply by 'p'. Note that, in certain circumstances, we may wish to transform the rules generated in this way in order to increase/decrease either communication or internal execution. For example, a rule such as

[X]m

A [X,l]false =~ [X]false

215 may be transformed into the (equivalent) rule

[X]m ~ [X]t This second rule requires less information to be activated in that no IX,/]false atom is required. However, in certain circumstances, the second rule may generate extraneous communication. For example, if the set of messages {[a]m, [b]m, [c]m} are received then {[all, [b]/, [c]/} will be broadcast, regardless of whether a, b or c are known to provide a contradiction with I. The first rule is thus more constrained but also requires more complex matching on sets of messages received.

3.4. Object Execution Given that objects contain sets of rules of the above form, and that they communicate by passing atoms of the above form, we can define the execution and communication mechanism in order to match the required deduction. Regarding execution, if appropriate atoms are available to the object (having either generated them itself or having received them via communication), then new atoms can in turn be generated using the following transformation (cast as an inference rule). [X1]pl

[X~]p~ n

A[Y]p, =~ [Y]q i--1

[Z]q Here, Z must be equal to n

Yu

UX, i--1

The relationship between this and the classical resolution rules is provided in Theorem 2. Relating to communication, whenever the object generates an atom, that atom is broadcast to the other objects. Note that this includes atoms containing t r u e or false. 3.5. E x a m p l e To show both the rules produced from a given set of clauses, and how execution might proceed within these rules, we consider the following set of clauses. 1. 2. 3. 4. 5.

pVqVr -~pV qV r pV-~qVr pVqV-~r -~p V -~q V r

6.

-~p V q V -~r

7. p V -,q V -~r 8.

-~p Y -~q V ~ r

216 We will split these clauses between 8 objects, expanding the clauses into our rule form as follows.

Object 1: a) b)

[p]false A [q]false A [r]false

=v [true]false

[pip [q]q [r]r

c)

d)

[x]p ~ [X,q]q [X]p =~ IX, r]r

Object 2: a)

b) c)

[X]p

A [X,q]false A [X,r]false

Object 3: a)

b) c)

=v [X]false

[X]q =v [X,p]p [X]q =~ [X, r]r [X]q A [X,p]false A [X,r]false =~ [X]false

Object 4: a)

[X]r ~

b) c)

[X]r

A

[X,p]false

Object 5: a)

[X]p

A

[X]q =~ [X]r

Object 6: a)

[X]p

A

[X]r ==~ [X]q

Object Z: a)

[X]q

A

[X]r ==~ [X]p

Object 8: a)

[X]p

A

[X]q

[X,p]p

[X]r =~ [X, q]q

A

A

[X,q]false =~ IX]false

[X]r ==:> [X]false

Recall that the rules in object 1 correspond to the facts that p is true under the assumption p, q under the assumption q, r under the assumption r, and that if assuming p generated a contradiction, assuming q generated a contradiction and assuming r generated a contradiction, then a general contradiction can be produced. We now describe a possible execution of the system, identifying the messages that an object broadcasts, together with those messages it received which caused it to broadcast

217 in such a way. Object 1 2 3 4 5 6 7

2 3 4 1

Received

[p]p [q]q [r]r [p, q]p, [p, q]q [p, r]p, [p, r]r [q, r]q, [q, rlr [19,q]p, [19,q]q, [p, q]r [p, r]p, [p, r]q, [p, r]r [q, rlp, [q, r]q, [q, r]r [p, q]false, [p, r]false, [p]p [p, q]false, [q, r]false, [q]q [p, r]false, [q, r]false, [r]r [p]false, [q]false, Jr]false

[

Rulesfired b, c, d a, b a, b a, b a a a a a a c c c a

I

Broadcasted

[pip, [qlq, [r]r [p, q]q, [p, r]r [p, q]p, [q, rlr [p, r]p, [q, r]q [p, q]r [p, r]q [q, r]p [19,q]false [p, r]false [q, r]false [p]false [q]false [r]false [true]false

Thus, if all objects execute synchronously, a contradiction is found in 4 steps of the system. The corresponding resolution proof is given below.

1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11.

pvqyr -Tyqyr pV~qVr p V q V-~r -~p y -~q y r -~p y q y -~r p V -~q V -~r -~p V -~q V -~r ~p V -~q ~p V -~r ~q Y -~r

[5 + 8] [6+8] [7+8] 12. ~p [2 + 9 + 10] 13. ~q [3 + 9 + 11] 14. -~r [4 + 10 + 11] 15. false [1 + 12 + 13 + 14] Notice how, in the execution, the first set of messages broadcast just provide basic information about which literals the one positive clause (in Object 1) contains. This is necessa~, as it will be this object that finally generates a contradiction. Thus, in general, positive clauses start the computation by 'seeding' the message space with information about certain propositions, then waiting until contradicto .ry information about all those literals has been generated, at which point a general contradiction is produced. 4. C o r r e c t n e s s Each object processes incoming messages via a form of resolution between these messages and the rules the object contains. This may, in turn, generate new messages and

218 transform the rules within the object. By interpreting formulae such as [Y]r directly as Y =v r, we can prove results regarding the representation of clauses and the execution within objects. In particular, we can show that this approach represents a complete proof mechanism. Note, however, that although we use resolution as the mechanism for computation within objects, this need not be the case. We could use any similar proof method. For example, in w we consider the possibility of having a variety of different proof mechanisms within objects in the system. We first examine the relationship between general clauses and the rules whose generation was described in w

Theorem 1 (Correspondence between C l a u s e s a n d R u l e s ) As a clause is transformed into a set of rules, satisfiability is preserved. Proof To establish this, we first note that, as 'IX, pi]pi' is valid, the second form of rule is valid, while the first form of rule can be derived from the original clause using satisfiability preserving transformations, as follows. Begin with the general clause form and rewrite this as

q~ A A --'pi 3=1

=~ false

i=1

Introduce 'X =~... ', giving

Since ':=v ' distributes over 'A ', this can be rewritten to X =:v qj) A A ( z =~ ~pi) 9"--

==~ ( x =v false)

i--1

Now move negative literals over implications (since X =~ -"Pi is equivalent to ( X A Pi) =~ false) and rewrite to '[]' notation, giving

A[X]qj A j=l

A[X, pi]false

~ [X]false

"=

which is exactly the form of the first rule. We can now establish the soundness of the object's internal execution mechanism by recasting rules and atoms as clauses and showing that the execution rule is equivalent to the standard resolution rule.

Theorem 2 (Soundness) Any application of the object's execution rule preserves satisfiability.

219 P r o o f For simplicity, we consider only a binary version of the execution rule. Recasting atoms such as [X]p as formulae such as '-~X V p' (recall that ' X ' represents a conjunction of positive literals, so '-~X' represents a disjunction of negative literals), the ezecution rule becomes ~XVp -~Y V ~p V q ~ZVq As each X , Y and Z is a conjunction of positive literals, then the side condition ensures that Z ~ (Y A X ) and so the above rule is equivalent to the standard binary resolution rule.

We now consider the completeness of the deduction mechanism. We first assume that all the clauses have been allocated to a single object. This simplifies the proof in that aspects relating to object communication can be ignored. Theorem 3 (Completeness f a l s e can be generated.

Single Object) If a set of clauses is unsatisfiable, then

P r o o f We establish this result by induction on the number of clauses in the set. Note that as clauses and rules are equivalent, we choose to describe this proof mainly in terms of clauses for simplicity, although it can be fully framed in terms of rules. * Base:

If the set of clauses contains only one clause and the set is to be unsatisfiable, then the clause must be ~alse' (i.e. the empty clause).

9 Induction: We assume that for an unsatisfiable set of clauses of size n, a contradiction can be generated, and consider an unsatisfiable set of clauses of size n + 1 (call this An+l). Since the set An+l is unsatisfiable, it must contain at least one purely positive clause. Choose one of these positive clauses and call it C, where C = pl v p2 v ... v Pm Now, for each 1 <_ i <_ m, then each of the sets (/,,,,+, - { c } ) u {p,} must also be unsatisfiable. As, in each of these sets, p~ is just a positive literal, then we can resolve the Pi 's with the rest of the clauses in the set, removing the Pi clause. This gives m sets of clauses, each unsatisfiable and each of size n. Now, by the induction hypothesis, a contradiction (i.e. false) can be generated for each of these sets. Thus, assuming each proposition pi in turn, false is generated and so m

A [pi]false 4=1

220 However, since from clause C we will have generated the rule m

A[pi]false =~ [true]false i--1

then [true]false is generated representing a contradiction. has been generated for sets of clauses of size n + 1.

Thus, a contradiction

In summary, the method is complete since for a particular positive clause, each positive literal in the clause is assumed in turn and a contradiction is generated under each of these assumptions. Once all these sub-proofs have been carried out, a contradiction for the whole set of clauses can be generated. Note that, during these sub-proofs further assumptions may be required. For example, if we have assumed 'p' and encounter the clause q V r (or its rule representation), then we will require two further sub-proofs, one assuming p and q, the other assuming p and r. Once atomic formulae and rules are distributed amongst objects, we require the following lemma.

Lemma 3.1 (Communication) Once an atom is generated by an object, that atom will eventually appear in all objects. Given this lemma, we can prove distributed completeness. T h e o r e m 4 ( C o m p l e t e n e s s - - D i s t r i b u t e d ) Given a set of clauses that is unsatisfiable, distributed amongst a set of objects, then false will eventually be generated by one of the objects.

Proof If the set of clauses is unsatisfiable, then there is a sequence of applications of the object execution steps that would generate a contradiction if the clauses all appeared within one object. We can simply distribute these steps across multiple objects once the clauses have been distributed. By Lemma 3.1, we can be sure that the atoms required will eventually reach each object.

5. Automating Deduction in First-Order Logic We can extend our approach to first-order logic simply by extending the syntax so that propositions are replaced by predicates (which take terms as arguments) and quantification is added. If we again assume that the formula is represented in clausal form (to which skolemization has been applied in order to remove existential quantifiers), then Horn clauses of the form

VX. V-~pi(X)V q(X) are represented by rules such as A pi()() ~ i

q(X)

221 Non-Horn clauses are again represented by utilising assumptions. We note that, given a rule such as

p(X) ~ q(X) then if a message p(a) is received, then q(a) is broadcast. If p(a) is received by an object containing

p(a) ~ q(X) then the message q(X) is broadcast (rather than individual messages such as q(t) for each term t). Although this extension from propositional to first-order logic is relatively straightforward, this more expressive logic does highlight potential drawbacks for our approach. In order to focus on these aspects, we next consider two specific examples. The first shows the approach in a positive light, while the second is deliberately designed to cause problems for the method.

Example 1 Consider the set of clauses

1. 2. 3. 4. 5. 6. 7.

p(3) -~p(x) ~p(x) -~p(x) -~p(x) -~q(Y) -~(z)

v v v v V v

q(X) ~(x) ~(x) t(x) ~r(Y) V w(Y) -~t(z) v -~o(z)

If each clause is placed in a separate object, then the computation can proceed as follows. 1. The message p(3) is broadcast. 2. Objects 2, 3, 4, and 5 all receive p(3), derive positive literals and broadcast q(3), r(3), s(3) and t(3), respectively. 3. Object 6 receives q(3) and r(3), derives a positive literal and broadcasts w(3). 4. Object 7 receives s(3), t(3) and w(3), and derives false. Thus, with relatively few messages, and with as little as four global steps, a contradiction is generated. Example 2 Consider the set of clauses 1. p(3)

2. ~p(x) v p( f (x) ) 3. ~p(x) v p(g(x)) 4. -~p(X) Y p(h(X))

5. ~p(g(g( h( h( f (g(3) ) ) ) ) ) )

222 Here, p(3) is broadcast initially and, for eve .ry message of the form 'p(...)', the number of messages increases threefold on each 'cycle'. Thus, there is a severe danger of the system being flooded with messages. While this is potentially a problem in practice, we note that 9 examples where the search process is exponential can be generated for all such proof methods - - in standard resolution such problems would occur from different examples, 9 by utilising the grouping of clauses into objects (see w and of objects into groups (see w together with suitable heuristics, such problems may be alleviated, and, 9 if certain objects are identified as causing an excess of broadcast messages, their execution can either be suspended or 'throttled' to allow the communication medium to 'recover'. However, the above assumes that having a large amount of broadcast activity is undesirable. It may be that the architecture on which this is to be implemented provides particularly efficient broadcast mechanisms, in which case this will be less of a problem. We next consider various practical aspects relating to our approach, in particular those elements that can be used to improve efficiency.

5.1. Multi-Clause Objects So far, in distributing clauses amongst objects, we have only considered the strateKy of allocating one clause to each object. In practice, however, it will be necessa~, to cluster several clauses within each object. This not only reduces the number of objects in the system but, if the clauses are allocated carefully, may also reduce the number of messages sent (and thus alleviate some of the problems outlined above). To see this, let us consider again the example presented in w Let clauses 1 and 4 remain in separate objects, called A and C respectively, but now let clauses 2 and 3 be allocated to a single object, B. Given this scenario, the execution might proceed as follows. 1. Initially, the configuration is Object A:

p

Object B:

{-~p v q v ~r, ~p V -~q V -~r}

Object C:

~p v r

2. p is broadcast from Object A, and is received by both Object B and Object C: Object B:

{q V ~r,

Object C:

r

~q V -~r}

So, now, r is also broadcast. 3. Once r is received by Object B the system becomes Object B:

{q,

-~q}

and Object B can broadcast the fact that it has generated a contradiction.

223 In this example, there is an important point to note. While, in general, objects 'listen' for messages relating to all the negative literals they contain, this need not always be the case. In particular, Object B above can be defined so that it need not listen for, or indeed broadcast, messages relating to the proposition q. This is because all the clauses containing either the literal q or the literal ~q appear within that object. Thus, there can never be additional information relating to these literals that is required by, or available from, other objects. This, for example, might also be used to alleviate the 'p(X) :::v p ( f ( X ) ) ' problem by ensuring that all clauses referring to p occur in a single object. As the computation proceeds, the set of messages that an object is interested in might decrease in size. Thus, the need for the objects to have the ability to dynamically (and autonomously) change their interfaces. We can also see from this that the allocation of clauses to objects is important for improving the efficiency and size of the system. Careful static analysis, together with dynamic re-allocation, are required in order to provide a practical large-scale system of this form. Indeed, some of our future work is to attempt to derive heuristics which will usefully partition clauses amongst objects. Finally, note that the allocation of multiple clauses to an object in no way affects the completeness of the approach.

5.2. Grouping While above we considered assigning several clauses to each object, we now briefly outline a mechanism by which communication can be further limited. This is based upon the idea of grouping. Each object may be a member of several groups. When an object sends a message, that message is, by default, broadcast to all the members of its group(s), but to no other objects. Alternatively an object can select to broadcast only to certain groups (of which it is a member). This mechanism allows the development of complex structuring within the object space. This group mechanism is derived from that used by Maruichi et al. [23] to model DAI systems, and the concept of process groups used within distributed operating systems [3]. Thus, in addition to the allocation of clauses to objects, described above, heuristics for constructing groups of objects are being developed. When such groups are constructed successfully, then the communication between groups will be more restrained than the communication within groups. In this case, it would be natural to implement such a scheme by limiting each group to one processor if possible. While grouping has many practical advantages, it can obviously lead to incompleteness. This is analogous to resolution systems where the set of clauses can be partitioned - - if the refutation requires elements from more than one partition, completeness may be lost. Again, some of our future work is concerned with the safe use of the group mechanism.

5.3. Partial Subsumption The process of subsumption, whereby clauses can be removed if more general ones exist, is widely used in automated reasoning systems. Although full subsumption can not be achieved in this framework as there is no way to compare clauses once they have been distributed amongst objects, we can provide simple mechanisms which attempt to avoid regenerating information. The basic approach is for an object only to broadcast an atom when it has not seen it before. Thus, it must not have generated the atom previously and

224 must not have received it via a communication. While this refinement obviously requires that the object store more historical information, the resources consumed in this process can be varied by altering the amount of subsumption available. Additionally, this form of subsumption solely involves internal object computation and does not have an adverse effect on the object's external communication. As an atom will not be broadcast if it has already been seen, and since, if one object sees an atom, all the others are guaranteed to see it eventually, then this refinement does not affect completeness.

5.4. Implementation We have developed a prototype implementation of the propositional framework described above using Concurrent METATEM, a language based upon the execution of temporal formulae [13]. The model of computation in Concurrent METATEM is exactly that described in w so in order to implement the system we need only code the deduction mechanism as a set of temporal formulae. We will not describe this implementation here (see [9] for details), but merely note that there is a simple and direct transformation from our rules to the temporal formulae executed in Concurrent METATEM. Although this implementation is only a sequential simulation, initial results suggest that the approach is feasible, though more work is obviously required in order to assess its practicality for large-scale and truly parallel systems. We have initiated investigation into the implementation of this model on a class of massively parallel architectures, namely Virtual Shared-Memory machines [10]. Again initial results are promising. Finally, we note that while broadcast communication has traditionally been considered as being inefficient, not only are we able, by utilising a form of group structuring, to retain many of the advantages of full broadcast while avoiding some of its drawbacks, but low-level mechanisms for efficient broadcast have been developed in many computer systems [2]. Further, not only is broadcast one of the basic communication mechanisms on local area networks [24], but the advent of novel parallel architectures (e.g. data parallelism [17]) has meant that more powerful programming techniques based upon broadcast communication are beginning to be developed.

6. Multi-Agent Theorem-Proving So far, we have motivated and outlined our approach to concurrent theorem-proving. In this section, we suggest how the fact that broadcast communication is fundamental, together with the potential for group structuring and object creation, can be used to enhance this model further. In particular, we outline how these features can be utilised in order to provide behaviour that is extraneous to the refutation being undertaken. While it is initially counter-intuitive to add extra complexity of the reasoning agents, it will become clear that this additional behaviour is required in order not only to refine the behaviour of individual agents, but also to carry, out organisation within the system as a whole. In this way, we will see that the communicating objects can truly be considered as agents, communicating together to achieve a proof. We only provide a brief description of these attributes, though further details can be found in [14].

225

6.1. Alternative Proof Mechanisms We have described a framework for concurrent theorem-proving where each agent utilises the same basic deduction mechanism. However, this need not necessarily be the case. A range of proof methods can be implemented within agents of this form and, indeed, a heterogeneous system can be developed with agents being implemented in a variety of languages. Thus, in the coarsest scenario, we could develop a set of different theorem-provers all working concurrently on the same problem. A more fine-grained approach would be to develop, within agents, alternative approaches for generating new sub-proofs. As we will see below, in order to utilise cooperative and competitive theorem-proving, it is often useful to have a range of agents implementing, if not different algorithms, then the same algorithm with a slightly different emphasis.

6.2. Cooperation The approach described earlier in this paper is inherently cooperative. Intermediate results (in the form of positive atoms) are broadcast to all agents, thus providing sharing of information (and, indeed, a form of common knowledge [8,15]). The use of cooperative techniques has not only been successful in the field of DAI, but work on cooperative search [18] can be seen to have direct relevance to theorem-proving. If, as above, a range of theorem-proving techniques were to be implemented, then sharing of important results could also be implemented. However, with a variety of proof mechanisms implemented across the agent space, the structuring of agents into groups would also be appropriate. For example, a formula generated during a refutation proof might not be useful to a forward-chaining theorem-prover. Thus, proof mechanisms utilising mutually beneficial approaches might be grouped together so that intermediate results are only shared between those relevant agents. When more important results, for example complete sub-proofs, are generated, these can be passed on to other groups to be used as part of their deduction.

6.3. Competition If either alternative proof techniques, or alternative strategies within the same technique, are employed in different agents, then competitive behaviour can also be developed. As has been shown by work on genetic algorithms, systems containing an element of competition can often solve problems that traditional approaches find difficult. Thus, we could use such techniques to develop, for example, systems whereby agents that successfully find a proof are rewarded. In particular, if an agent is successful, it can create a new agent implementing the same proof technique (possibly with slight modifications in strategy). In this way, successful proof techniques abound, while less ones will not. With competition of this form comes the ability to adapt, in a limited fashion, to changing circumstances. If certain proof techniques have been successful, they will be very common. However, if the type of problems being posed changes radically, so that those techniques are no longer appropriate, then other, more relevant, proof methods will begin to be successful.

226 6.4. A g e n t Societies Combining the facets of cooperation and competition described above, together with both the dynamic grouping of agents and the dynamic creation of new agents, allows us to develop simple agent societies [12]. For example, agents within a particular group might be cooperating in order to achieve a certain proof at the same time as the group, as a whole, competes with other groups tackling similar problems. As both group structuring and agent creation are dynamic, these simple societies can be ve~. fluid, with groups (and sub-groups) growing, appearing and being destroyed and agents moving between a range of group structures. Thus, this very flexible model potentially allows the development of a range of societal architectures and, in the future, we envisage this model forming the basis for dynamic, evolving and adaptable systems performing automated deduction. 6.5. F a u l t - T o l e r a n c e In many models of concurrent theorem-proving, the processes defined are not faulttolerant, i.e., if one process was to 'die' unexpectedly (e.g., by a processor crash), then the whole proof would fail. As we use broadcast as the underlying communication mechanism, we are able to utilise fault-tolerant approaches from distributed operating systems to safeguard the proof process. One common approach is that of 'process pairs' [5] where an agent 'shadows' another, particularly important, agent and is able to take over its role at any time. This approach might be appropriate for some key agents where it would provide a measure of fault-tolerance. 7. S u m m a r y We have described a framework for concurrent theorem-proving based upon asynchronously executing objects communicating by broadcast message-passing, and have outlined several advantages of this approach, as follows. 9 Each object need only recognise and record only a small number of message types. 9 Execution within each object is ve~. simple. Only the matching of incoming messages to rules is required. Perhaps the most complex part of this process is the matching required between sets of assumptions when a rule is to be executed. 9 Delays and the order of message sending are all less important - - regardless of these, computation is still as slow as the longest 'critical' path. 9 In spite of the use of broadcast message-passing, the system need not be 'flooded' with messages, especially if the objects are structured so that certain literals only occur within one object. The partitioning of clauses across objects is vital the observation that by grouping clauses containing the same literal together, we avoid much message-passing, is also important. 9 Branching in the search space is now replaced by additional broadcast messages. Thus, in architectures where broadcast is expensive, this approach will obviously be inefficient. However, many architectures now provide efficient broadcast mechanisms and, indeed, many distributed operating systems are based upon this mechanism

227 (e.g. [3]). In other concurrent theorem-proving systems, the branching search results in the spawning of new processes or threads, while our approach generates additional messages (which are generally 'cheaper' than processes). 7.1. R e l a t e d W o r k In this section we consider related work in the area of distributed (and multi-agent) theorem-proving. While other systems share some features with our approach, the particularly close link between the operational model, communication and deduction, the use of grouping and clustering, and the openness of the system makes it significantly different from those described below. In the DARES distributed reasoning system [22], agents cooperate to prove theorems. As in our model, information (a set of clauses) is distributed amongst agents and local deduction occurs purely within an agent. If an agent finds that it can proceed no further, it can make requests of other agents for information to continue. This request is based upon broadcast and the agent then uses any replies to continue local deduction. By contrast to our approach, not only is the number of agents fixed, but the opportunity for more sophisticated structuring of the agent space within the DARES system is absent. Further, the broadcast mechanism is not pervasive throughout the DARES m o d e l - it is only used to solicit new data when an agent stalls. However, results from that system do confirm how the distribution of clauses amongst agents is a crucial factor. While the agents within the DARES system are all of the same type, one of the suggestions of that work was to consider different 'specialist' agents within the system. This form of system has been developed using the Teamwork approach [7], a general framework for the distribution of knowledge-based search [6]. While the number of agents within such a system is more fluid than in the DARES system, and more sophisticated structuring is provided through the concept of 'teams', the control within the system is constrained particularly through the use of 'supervisor' agents. Also, in contrast to our model, less reliance is placed on broadcast communication. The Clause Diffusion approach [4] also partitions sets of clauses amongst distributed objects. Unlike our system, new clauses generated may be allocated (using a particular algorithm) to other objects. Thus, while the new information generated in our approach is distributed by broadcast message-passing, this is achieved in Clause Diffusion via the migration of clauses. In contrast to our approach, Clause Diffusion is not primarily intended as a basis for open and dynamic systems. 7.2. F u t u r e W o r k Part of our future work involves investigating the extension of this approach, including both theoretical results and implementations, to more powerful logics such as modal and temporal logics. We also intend to carry out a range of larger scale case studies in order to assess the viability of the approach when applied to more complex systems, and to provide a full (and optimised) implementation of this approach. Although we have undertaken an initial investigation of the agent-based aspects of the model, as outlined in w we will continue working on this, incorporating the benefits of cooperation and cooperative search (e.g. [18]) to concurrent theorem-proving. Finally, while this model of distributed deduction has been outlined in the context of concurrent theorem-proving, we are also extending it to provide the basis for general models of logic-based computation,

228 in particular extensions of logic programming.

Acknowledgements The author wishes to thank Clare Dixon, Robert Johnson and Michael Wooldridge for their helpful comments and suggestions. This work was partially supported by EPSRC under grant GR/J48979.

REFERENCES 1. G. Agha. Actors - A Model.for Concurrent Computation in Distributed Systems. MIT Press, Cambridge MA, 1986. 2. K. Birman and T. Joseph. Reliable Communication in the Presence of Failures. A CM Transactions on Computer Systems, 5(1):47-76, Februa .ry 1987. 3. K. Birman. The Process Group Approach to Reliable Distributed Computing. Techanical Report TR91-1216, Department of Computer Science, Cornell University, July 1991. 4. M. Bonacina and J. Hsiang. The Clause-Diffusion Methodology for Distributed Deduction. Fundamenta Informaticae, 24:177-207, 1995. 5. A. Borg, J. Baumbach, and S. Glazer. A Message System Supporting Fault Tolerance. In Proceedings of the Ninth A CM Symposium on Operating System Principles, pages 90-99, New Hampshire, October 1983. ACM Press. 6. J. Denzinger. Knowledge-based distributed search using teamwork. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS), pages 81-88, San Francisco, USA, June 1995. 7. J. Denzinger and M. Kronenburg. Planning for Distributed Theorem-Proving: The Teamwork Approach. SEKI-Report SR-94-09, University of Kaiserslautern, Kaiserslautern, Germany, 1994. 8. M. Fischer and N. Immerman. Foundations of knowledge for distributed systems. In Theoretical Aspects of Reasoning about Knowledge: Proceedings of the 1986 Conference, pages 171-185, Monterey, California, March 1986. Morgan Kaufmann Publishers, Inc. 9. M. Fisher. Applying Executable Temporal Logic Implementing Concurrent Theorem-Proving. In IJCAI Workshop on Executable Temporal Logics, Montreal, Canada, August 1995. 10. M. Fisher and J. Keane. Realising a Concurrent Object-Based Programming Model on Parallel Virtual Shared Memo .ry Architectures. In Giloi, Jahnichen, and Schriver, editors, Programming Models for Massively Parallel Computers, pages 88-97. IEEE Computer Society Press, Los Alamitos CA, 1995. 11. M. Fisher and M. Wooldridge. Executable Temporal Logic for Distributed A.I. In Twelfth International Workshop on Distributed A.L, Hidden Valley Resort, Pennsylvania PA, pages 131-142, May 1993.

229 12. M. Fisher and M. Wooldridge. A Logical Approach to the Representation of Societies of Agents. In N. Gilbert and R. Conte, editors, Artificial Societies. UCL Press, London, 1995. 13. M. Fisher. Concurrent METATEM - - A Language for Modeling Reactive Systems. In Parallel Architectures and Languages, Europe (PARLE), Munich, Germany, June 1993. (Published in Lecture Notes in Computer Science, volume 694, pages 185-196, Springer-Verlag). 14. M. Fisher. An Agent-Based Approach to Concurrent Theorem-Proving. (Unpublished manuscril~t), 1996. 15. J. Halpern and M. Vardi. Reasoning about Knowledge and Time in Asynchronous Systems. In Proceedings of A CM Symposium on the Theory of Computing (STOC), pages 53-65, ACM Press, 1988. 16. C. Hewitt. Control Structure as Patterns of Passing Messages. In P. H. Winston and R. H. Brown, editors, Artificial Intelligence: An MIT Perspective (Volume P), pages 433-465. MIT Press, Cambridge MA, 1979. 17. W. Daniel Hillis and Guy L. Steele. Data Parallel Algorithms. Comm. A CM, 29( 12):1170-1183, December 1986. 18. T. Hogg and C. Williams. Solving the Really Hard Problems with Cooperative Search. In Proceedings of Eleventh National Conference on Artificial Intelligence (AAAI-93), pages 231-236. MIT Press, Cambridge MA 1993. 19. A. Jindal, R. Overbeek, and W. Kabat. Exploitation of parallel processing for implementing high-performance deduction systems. Journal of Automated Reasoning, 8:23-38, 1992. 20. R. Johnson. Parallel Analytic Tableaux Systems. PhD thesis, Department of Computer Science, Queen Mary. and Westfield College, London, 1996. 21. F. Kurfefl. Parallelism in Logic. Vieweg, 1991. 22. D. Macintosh, S. Con .ry, and R. Meyer. Distributed Automated Reasoning: Issues in Coordination, Cooperation, and Performance. IEEE Transactions on Systems, Man and Cybernetics, 21(6):1307-1316, November/December 1991. 23. T. Maruichi, M. Ichikawa, and M. Tokoro. Modelling Autonomous Agents and their Groups. In Y. Demazeau and J. P. Muller, editors, Decentralized AI 2 - Proceedings of the 2nd European Workshop on Modelling Autonomous Agents and Multi-Agent Worlds (MAAMAW '90). Elsevier/North Holland, Amsterdam 1991. 24. R. M. Metcalfe and D. R. Boggs. Ethernet: Distributed Packet Switching for Local Computer Networks. Comm. ACM, 7(19), July 1976. 25. D. A. Plaisted and S. A. Greenbaum. A Structure-Preserving Clause Form Translation. Journal of Symbolic Computation, 2(3):293-304, September 1986. 26. J. Schumann and R. Letz. PARTHEO: A High-Performance Parallel Theorem Prover. In Lecture Notes in Computer Science, volume 449, pages 40-56. Springer-Verlag, Heidelberg, 1990.

230 Michael Fisher Michael Fisher is a Reader in the Department of Computing at the Manchester Metropolitan University. He has over ten years research experience relating to the fundamentals of temporal reasoning, the development and application of multi-agent systems, and the specification, verification and implementation of concurrent and distributed systems. Dr. Fisher is the organiser of two international workshops on executable temporal logics (in 1993 and 1995), is an editor of "Executable Modal and Temporal Logics" (SpringerVerlag, 1995) and "The Imperative Future" (Research Studies Press, 1996), and is the principal editor of a special issue of the Journal of Symbolic Computation devoted to this subject. In addition, he is a programme committee member for several international conferences and workshops relating to temporal logics, automated reasoning, high-level programming languages and multi-agent systems. Home Page: http://www.doc.mmu.ac.uk/STAFF/M.Fisher

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

231

SiCoTHEO Simple Competitive Parallel Theorem Provers based on SETHEO* Johann Schumann

Institut fiir Informatik, Technische UniversitSt Miinchen email: s chumann@inf ormat ik. tu-muenchen, de In this paper, we present SiCoTHEO, a collection of parallel theorem provers for first order predicate logic. SiCoTHEO is based on the sequential prover SETHEO. Parallelism is exploited by competition: on each processor, an identical copy of SETHEO tries to prove the entire formula. However, certain parameters which influence SETHEO's behavior are set differently for each processor. As soon as one processor finds a proof, the entire system is stopped. Three different versions of SiCoTHEO are presented in this paper. We have used competition on completeness mode (parallel iterative deepening, SiCoTHEO-PID), on completeness bounds (a parameterized combination of bounds, SiCoTHEO-CBC), and on the search mode (top-down combined with bottom-up, SiCoTHEO-DELTA). The experimental results were obtained with a prototypical implementation, running on a network of workstations. This parallel model is fault-tolerant and does not need communication during run-time (except start and stop messages). We found that only little efficiency is gained for SiCoTHEO-PID, which reaches peak performance with only 4 processors. SiCoTHEO-CBC and SiCoTHEO-DELTA, however, showed significant speed-up, and improved performance up to the 50 processors used. 1. I n t r o d u c t i o n Automated theorem provers, like many other AI tools must explore large search spaces. This leads to long (and often too long) run-times of such systems. One possibility to reduce run-time is the exploitation of parallelism. Automated theorem proving in general seems to be suitable for parallelization, as many implemented parallel theorem provers show (see [12] for an extensive survey). Current trends lead away from special purpose, tightly coupled multiprocessor systems, often with shared memory. Much more interesting and feasible seem to be networks of workstations, connected by a local area network (e.g., Ethernet, ATM). Such a hardware configuration is readily available in many places. It features processing nodes with high processing power and (comparatively) large resources of local memory and disk space. The operating system (mostly UNIX) allows multi-tasking and multi-user operation, features *This work is supported by the Deutsche Forschungsgemeinschaft within the Sonderforschungsbereich 342, Subproject A5: PARIS (Parallelization in Inference Systems).

232 not necessarily available on a multiprocessor machine. The underlying communication principle is message passing. Common data (e.g., the formula to be proven) can be kept in file-systems which are shared between the processors (e.g., by NFS). However, the bandwidth of the connection between the workstations is comparatively low and the latency for each communication is rather high. Models of parallelism which are ideally suited for such networks of workstations must therefore obey the following requirements: small (or ideally no) necessary communication between the processors and no dependency on short latencies. A parallel model which fulfills these requirements is competition: each processor tries to solve the entire problem, using different methods or parameters. As soon as one processor finds a solution, the entire system can be stopped. In a competitive system there is no need for communication except for the start and stop messages. Competitive parallel models have been studied in various approaches (cf. [3] and [12] for competitive parallel theorem provers). A competitive parallel theorem prover based on the sequential prover SETHEO [7,6] is RCTHEO (Random Competition) [2]. Here, the search is controlled by a pseudo-random number generator which is initialized with a different number on each processor. Therefore, when the system has been started, the search space is processed in a different way on each processor. A detailed evaluation of the RCTHEO system can be found in [3]. In this paper, we will focus on models where competition is accomplished by a different setting of parameters of the proof algorithm for each processor. This is in contrast to RCTHEO which exploits parallelism by randomizing the proof algorithm. Also, we only discuss systems, which are based on one sequential prover (homogeneous competition). In a heterogeneous system, different theorem provers (e.g., OTTER, METEOR, SETHEO, ... ) could compete for a proof. ~ This paper proceeds as follows: First, we will give a short introduction to the sequential theorem prover SETHEO which is the basis for SiCoTHEO. Next, we define competition between homogeneous processes and discuss important properties of competitive theorem provers, such as efficiency, scalability, soundness and completeness. Then, we will describe which parameters are suitable for competition for SETHEO, and we will sketch the basics of the prototypical implementation of all SiCoTHEO systems on a network of workstations. Finally, we describe in detail the different SiCoTHEO provers and present results of experiments. In the conclusions, we will summarize the paper and give an outlook on future work.

2. S E T H E O

SETHEO is a sequential theorem prover for first order predicate logic. Since it is the basis of SiCoTHEO, we will shortly describe SETHEO's proof procedure and the parameters which influence the search. For details on SETHEO, the reader is referred to [7,5,6]. The proof calculus underlying SETHEO is Model Elimination [8], a sound and complete tableau based calculus. The input formula for SETHEO is a set of clauses consisting of literals (e.g., p(X), ~q(a)) which in turn are negated or unnegated atoms. An example is shown below in Figure 1.

233 The system tries to refute the given set of clauses by constructing a tableau, a tree with nodes labeled by literals of the formula, as shown in the example below. The root node of the tableau is always marked by c, comprising the empty tableau. All direct child-nodes of a given inner node belong to one clause. In order to see, if a refutation has already been found, we have to look for complementary pairs of literals. Two literals are complementary, if they have opposite signs, but are otherwise equal to each other. A branch in the tree is said to be closed, if the path from the root to the current leaf node contains at least one pair of complementa .ry literals. A formula is unsatisfiable, if and only if there exists a tableau with all its branches closed. Given a set of clauses, SETHEO tries to construct a closed Model Elimination tableau by continuously applying the following rules: S t a r t S t e p Given an empty tableau (with only the root node e), one may select an arbitra .ry clause ("start clause") and add its literals as the children of the root node to the tableau. Per default, SETHEO behaves like PROLOG and selects such start clauses with negative literals only (e.g., clause (1) in Figure 1). E x t e n s i o n S t e p Given is a tableau with an open leaf node L ("subgoal"). If there exists a literal K in one of the clauses of the formula which is unifiable with L, yielding a complementary pair of literals, then all literals of that clause may be appended to the tableau as the children of L, and the leaf K is marked closed. Note, that all other literals of that clause (if there are any) cause new open branches in the tableau. R e d u c t i o n S t e p If there exists a leaf node L and an ancestor K of it, such that L and K are unifiable and yield a complementary pair, then the leaf node can be marked as closed. The substitutions which are necessa .ry in the extension and reduction steps are applied to the entire tableau. Figure 1 shows a formula and a corresponding closed tableau. Variables are written as upper-case characters (e.g., X, Y). In this example, variable X is substituted by a, and Y by b.

Formula: ~p(X, Y) V -~q(Y, X)

p(a, b) v -~q(b, a) q(b, a) V ~p(a, b)

p(~, b) v q(b, ~) Substitution:

x\~,Y\b

A

(1)

^

(2)

A

(3)

(4)

p(a, b) q(b, ~) *

~q(b, a) p(~, b) *R

q(b, a)

-p(a, b) p(a, b) *

q(b, ~) *R

Figure 1. A closed Model Elimination tableau. An asterisk indicates a closed branch; a "*R" marks a branch closed by a reduction step.

234 Of course, there may be more than one way to construct a tableau out of a given formula, since there are many possibilities for selecting a rule or clause for extension. Some of these tableaux can eventually be closed, whereas others cannot be closed at all. This shows t h a t we have to search for a closed tableau. Formally, we can depict the search by a tree, the Model Elimination OR-search tree (OR-tree for short). Each node of the tree is labeled by a tableau, the root is the empty tableau. If we select an open node of a tableau T, we may construct new tableaux out of T by applying all possibilities for an extension or reduction step. These new tableaux become the child nodes of T. If such a rule leads to a closed tableau, we may stop with "proof found". If no extension or reduction step is possible at this point, the node will be marked by a "FAIL". Within SETHEO, the OR-tree and its tableaux are constructed in a depth-first, left-toright manner and its nodes are traversed by backtracking. This search, however, can be varied by setting parameters of SETHEO (e.g., which clause to try next for an extension step). In order to obtain completeness, we apply depth-first iterative deepening. This means that we set a certain bound which must not be violated during the search. Typical bounds in S E T H E O are the maximal depth of a tableau (A-literal depth), or the maximal number of leaves in a tableau (inference bound). If a proof cannot be found with the given bound, the bound is increased and the search starts again. In general, the size of the search space increases tremendously when the bounds are incremented. Therefore, a variety of techniques have been implemented which t.ry to reduce the size and complexity of the search space. These techniques contain pruning methods (realized as constraints), and additional inference rules in order to use intermediate results ("lemmata", "fold-up"). For a detailed description of these techniques which can be turned on and off individually via parameters see [6].

3. Parallelism by Competition Given is a sequential theorem proving algorithm 2 fl.(7~1,..., 7~) where Pi are parameters which may influence the behavior of the system and its search (e.g., completeness bounds or pruning methods). Then, a homogeneous competitive theorem prover running on P processors is defined as follows: on each processor p (1 _ p < P ) a copy of the sequential algorithm . A ( P [ , . . . , P~) tries to prove the entire given formula. Some (or all) parameters P~' are set differently for each processor p. All processors start at the same time. As soon as one processor finds a solution, it is reported to the user and the other processors are stopped ("winner-takes-all strategy"). The efficiency of the resulting competitive system strongly depends on the influence the respective parameter setting has on the search behavior of the system. The larger the difference, created by the values of 7~, the higher the probability that one processor finds a proof very fast (assuming, of course that there exists a proof). If the influence of the parameters on the search is only weak, all processes will have a run-time (i.e., the time needed to find a proof) which is quite similar. Then, the efficiency will be ve.ry low (the 2The definition of a parallel competitive system can easily be generalized to any search algorithm. An algorithm, suitable for competition takes a problem as its input and tries to solve it. If a solution exists, the algorithm eventually must terminate with a message "solution found". Furthermore, the algorithm must be controlled by parameters which influence its behavior.

235 speed-up is about 1). Good scalability and efficiency can be obtained only, if there are enough different values for a parameter, and if no good default estimation to set that parameter is known. Only then, a large number of processors can be employed reasonably. A good default estimation for a parameter will in general result in poor speed-up values for a competitive system. Then, a different model of parallelization, (e.g., partitioning as in PARTHEO [10] or SPTHEO [13]) will be appropriate. When we want to develop and evaluate competitive parallel theorem provers, we additionally have to address the following issues:

S o u n d n e s s : although our parallel theorem provers are based on the sound sequential prover SETHEO, care must be taken that we obtain only correct proofs with SiCoTHEO. This means in particular that only such combinations of parameter settings are allowed which retain the correctness of the proof algorithm.

C o m p l e t e n e s s : the entire system must be complete, i.e., we must not loose any proofs which can be found with the sequential prover. Since our system is intended to run on a network of workstations (where processors or communication links may fail), the competitive system should be complete even in cases with a reduced number of processors (fault-tolerance). This can be easily accomplished by using complete search strategies on each processor. Note that with a standard partitioning scheme, like OR-parallelism (e.g., PARTHEO [10]), a failure of processing elements would lead to an incomplete system.

In order to evaluate the parallel system, we compare the run-times of the parallel system with the sequential one for a given formula to be proven. As the sequential reference, we use SETHEO (V3.2) with its default parameters. Its run-time for a given example is denoted by Tseq. The run-time of the parallel system Tll (for the same formula) is the time until one of the processors has found a proof, Til = minp(Ta(~,~,.....~,~)). Based on the run-times, we can define the speed-up s and efficiency rl of SiCoTHEO for a given formula in the usual way: s = Tseq/TiI and ~ = s/P where P is the number of processors. In contrast to many parallel algorithms (e.g., numeric computations), the speed-up s obtained varies strongly from example to example. This is due to the complex behavior of the search algorithm which depends on the input formula. Therefore, a reasonable evaluation of a parallel system can be given only, if measurements with a large number of different examples are made. In this paper, we present a graphical representation of the ratio 711 over T~eq for each measurement; a representation which allows to make reasonable estimates of the system's behavior even in cases of varying speed-ups. In general, it is rather difficult to give a good estimation of a mean value for the speedup over a set of examples, especially in cases where the speed-up shows a high variance. Therefore, different definitions of mean values yield quite different results (see e.g. [4]). For our measurements, we give four common mean values where si is the speed-up obtained

236 with example #i: ga = ~1 ~ i si

gg = r

...-sn

Sh = v~nl/s,

(arithmetic mean) (geometric mean) (harmonic mean)

#t = ~-,i Tseq(i)/ F-,i TII(i) (waiting times)

The arithmetic mean is often too optimistic, resulting in a mean value too large, because a few large values of si are taken into account too much. On the other hand, the harmonic mean gives rather low values, because examples with speed-up values near 1 are considered too much. Often, the geometric mean is considered appropriate, yielding results between the other two mean values (s-a _> s-g > S-h). #t relates the time needed to solve all problems (one after the other) with one processor to the time needed with p processors. This measure is especially useful for applications of the theorem prover, where one proof obligation after the other is to be solved. Then #t represents the ratio of the "waiting time" for the user, before the prover has finished all examples. The run-times given in this paper are those of the SETHEO Abstract Machine, including the time to load the compiled formula. The times needed to start and stop the system are not considered here. For a discussion of these times, see Section 5. All proof attempts (sequential and parallel) have been aborted after a maximal run-time of Tm~ = 300s (for SiCoTHEO-PID, Tmax = 300s is used). All times are CPU-times and are measured on HP-750 workstations with a granularity of 1/60 seconds. 4. P a r a m e t e r Competition for S E T H E O The Model Elimination Calculus and SETHEO's proof procedure can be parameterized in several ways. Table I shows a number of typical ways for modifying the basic algorithm. For each parameter, common values are shown. Values which are default for SETHEO are given in bold-face. The selection function determines which clause and literal is to be taken next, the search mode determines in which way the OR-tree is explored. Additional inference rules ( "fold-up" and "unit-lemmata") allow SETHEO to use intermediate results (lemmas). Finally, completeness mode and completeness bound determine SETHEO's search strategy.

parameter Selection function Search mode addt'l inference rules completeness modes completeness bounds

values as in formula/random/heuristically ordered top-down/bottom-up/combination

none/fold-up/unit-lemmata i t e r a t i v e d e e p e n i n g / o t h e r fair strategies depth/#inferences/#copies/combinations

Table 1 Basic parameters for SETHEO's calculus and proof-procedure. Values shown in bold-face are default values for SETHEO.

237 Given the parameters and their possible values from Table 1, a number of different paralleltheorem provers , based on competition could be designed. , In the following, we will focus on three parallel competitive systems, based on SETHEO. Since they compete on rather simple settings of parameters, the system is called SiCoTHEO (Simple Competitive provers based on SETHEO). The three systems compete via different completeness modes ("parallel iterative deepening", SiCoTHEO-PID), via a combination of completeness bounds (SiCoTHEO-CBC), and a combination of top-down and bottomup processing (SiCoTHEO-DELTA). Before we go into details of each prover, we sketch the common prototypical implementation for all SiCoTHEO provers. 5. P r o t o t y p i c a l Implementation SiCoTHEO is running on a (possibly heterogeneous) network of UNIX workstations. The control of the proving processes, the setting of the parameters and the final assembly of the results is accomplished by the tool pmake [1]. This implementation of SiCoTHEO has been inspired by a prototypical implementation of RCTHEO [2]. Pmake is a parallel version of make, a software engineering tool used commonly to generate and compile pieces of software given their source files. Pmake exploits parallelism by exporting as many independent jobs as possible to other processors. Hereby it assumes that all files are present on all processors (e.g., via NFS). Pmake stops, as soon as all jobs are finished or an error occurs. In our case, however, we need a "winner takes all strategy" which stops the system, as soon as one job is finished. This can be accomplished easily, by adapting SETHEO so that it returns "error" (i.e., a value :/: 0), as soon as it found a proof. Then pmake aborts all actions per default. In contrast, the implementation of RCTHEO had to transfer the output generated by all provers to the master processor. There, a separate process searched for success messages. This resulted in heavy network traffic and long delays. A critical issue in using pmake is its behavior w.r.t, the load of workstations: as soon as there is activity (e.g., keyboard entries) on workstations used by pmake, the current job will be aborted (and restarted later). Therefore, the number of active processors (and even the start-up times) can vary strongly during a run of SiCoTHEO.

6. Evaluation and Results In this section we look in detail at the results, obtained with the three different versions of SiCoTHEO. The experiments have been carried out on a network of HP-750 workstations, connected via Ethernet. All formulae for the experiments have been taken from the TPTP-problem libra .ry [11].

6.1. S i C o T H E O - P I D Parallel iterative deepening is one of the simplest forms of competition: each processor explores the search space to a specific bound. Assume we have to perform iterative deepening over the A-literal depth as the completeness bound. Then processor i (1 < i _< P) explores the search space to a depth i. If that processor could not find a proof with the given bound i, it starts the search again with bound i + P, i § 2P, and so on. This parallel scheme for iterative deepening (written in a C-like notation below) assures

238 completeness with a limited number of processors: all values for the depth bound are used by a processor eventually, while no two processors work with the same bound. f o r i = 1 , 2 . . . . . P i n p a r a l l e l do on p r o c e s s o r i do f o r k = 0 , 1 , 2 . . . . do depth_bound = i + k,P; s e t h e o (depth_bound) Due to time and resource restrictions, results on SiCoTHEO-PID have been obtained by evaluating existing run-time data 3 of SETHEO. Figure 2A shows the resulting mean values of the speed-up for different numbers of processors. As can be seen immediatedly, the variance of the speed-up values is extremely high. This fact results in a high arithmetic mean, whereas the gg, .Oh and st are ve.ry close to 1. This behavior can also be seen in Figure 2B which shows the ratio of 711 over Tseq for each example, using 5 processors. The speed-up s is always _> 1 since the entire search space (which has to be searched in the sequential case) is partitioned.

A

~

s-a

21

700

.o

B

600

or

500

16

m"

T, 400

g 11

...................

300

~

200

:4-h

100

Ce 9

,P

.0.o .s

,I

.m

...........

6

1 1

.......... 1 2 3 4 5 6 7 8 9 P

0

~ , ,. ~,o 0

"

.....~~~

200 300 460 56o 660 760

T~eq

Figure 2. SiCoTHEO-PID: A: mean speed-up values for different numbers of processors P (x for ffa, o for fig, 9 for fib, o for gt). The dotted line marks linear speed-up s = P. B: parallel run-time 711 over sequential run-time Tseq for P - 5. The dotted line corresponds to s --- 1, the solid line to s -- P.

Furthermore, it is evident from Figure 2A that SiCoTHEO-PID is not scalable. The speed-up values reach a saturation level already with 4 processors. Adding more processors 3The data have been obtained by running SETHEO (V3.1) on all examples of the TPTP [11] with Tmax = 1000s [13]. For our experiments, we have selected all examples which have a run-time Tseq <_ 1000s on a HP-750.

239 does not increase the speed-up any more. This behavior is obvious, since about two third of the examples (67%) could be solved with a depth bound of 3 or 4. The number of examples which need a higher depth (and thus can sensefully occupy more processors) is rather low, as the histogram in Figure 3 shows.

1000

8O0 48% 19% 12% 10% 6OO number of samples 4OO

2OO

0

, 3

i 4

i ] l 5 6

I

I

,........__1

7

I

I

I

8

9

10

I

|

I

|

11 > 12

depth

Figure 3. Number of examples with a proof found with A-literal depth d over the depth d. Numbers are % of the total number of 858 samples.

Although in many cases, high speed-up values can be obtained, SiCoTHEO-PID should be used in applications only where deep proofs are expected. 6.2. S i C o T H E O - C B C

The completeness bound which is used for iterative deepening determines the shape of the search space and therefore has an extreme influence on the run-time the prover needs to find a proof. There exist many examples, for which a proof cannot be found using iterative deepening over the depth of the proof tree within reasonable times, whereas iterative deepening over the number of inferences almost immediatedly reveals a proof, and vice versa. 4 In general, during iterating over the depth of the proof-tree balanced trees are preferred. The growth of the search space per iteration level, however, is extremely high. On the other hand, the inference bound first tries rather unbalanced and deep trees. Here, the search often reaches areas with unmanageable search spaces. In order 4This dramatic effect can be seen clearly in e.g. [7], Table 3.

240 to level both extremes, R. Letz 5 proposed to combine the A-literal-depth bound d with the inference bound imoz. When iterating over depth d, the inference bound i m ~ is set according to imoz = d o where r/is the mean length of the clauses. For our experiments, however, we take a slightly different approach by using a quadratic polynomial:

imam, = ad 2 + fld where a,/3 E R+. 6 This polynomial approximates the structure of a tableau: a tableau (a tree) with a given depth do has do < i < #ao inferences (leaf nodes), where # is the maximal number of literals per clause. Hence, we estimate the number of inferences in the tableau to be i = x a~ for some x < # and allow the prover to search for tableaux with at most i inferences by setting ima~ "= i. A Taylor development of this formula leads to i = 1 + ~ dko(logx)k/k!. Since, in most cases, x is ve.ry close to 1, we only use the linear and quadratic terms, finally obtaining our quadratic polynomial. SiCoTHEO-CBC (ComBine Completeness bounds) explores a set of parameters (or,/3) in parallel by assigning different values to each processor. For the experiments we selected 0.1 < c~ < 1 and 0 < / 3 < 1. In our first experiment (Experiment 1) we used 50 processors with the following parameter settings: Pl" (0.1,0.0)

(0.1,0.2)

...

(0.1,0.S)

(0.2, 0.0)

(0.2, 0.2)

...

(0.2, 0.s)

(1.0,0.0}

(1.0,0.2}

...

Ps0" (1.0,0.8)

Note, that this grid does not reflect the architecture of the system. It rather represents a two-dimensional arrangement of the parameter values. For Experiment 2, Experiment 3 and Experiment 4, the number of processors was reduced to 25, 9, and 4 respectively by equally thinning out the grid. For all experiments, a total of 92 different formulae from the T P T P have been used. 48 examples show a sequential run-time Tseq of less than one second. 36 of the remaining examples have a sequential run-time which is higher than 100 seconds. Although measurements have been made with all examples, we do not present the results for those with run-times of less than one second. In that case, the resulting speed-up (ga = 1.57 for P = 50) is by far outweighted by the time, SiCoTHEO needs to export proof tasks to other processors. In a real application, this problem could be solved by the following strategy: first, start one sequential prover with a time-limit of I second. If a proof cannot be found within that time, SiCoTHEO-CBC would start exporting proof tasks to other processors. Table 2 (first group of rows) shows the mean values for all three experiments. These figures can be interpreted more easily when looking at the graphical representation of the ratio between Tseq and 7il , as shown in Figure 4. Each 9 represents a measurement with one formula. The dotted line corresponds to s = 1, the solid line to s = P, where P is the number of processors. The area above the dotted line contains examples where the parallel system is slower than the sequential prover, i.e., s < 1. Dots below the solid 5Personal communication. ~For a - 0,/~ = 1 we yield inference-bounded search, a = oo, ~ = c~ corresponds to depth-bounded search.

241 line (with a gradient of l/P) represent experiments which yield a super-linear speed-up

s>P. Figure 4 shows that even for few processors a large number of examples with superlinear speed-up exist. This encouraging fact is also reflected in Table 2 which exhibits good average speed-up values for 4 and 9 processors. For our long-running examples and P = 4 or P = 9, ~g is even larger than the number of processors. This means that in most cases, a super-linear speed-up can be accomplished. Table 2 furthermore shows that with an increasing number of processors, the speedup values are also increasing. However, for larger numbers of processors (25 or 50), the efficiency 77 = siP is decreasing. This means that SiCoTHEO-CBC obtains its peak efficiency with about 15 processors and thus is only moderately scalable.

Experiment ..

SiCoTHEO-CBC SiCoTHEO-CBC SiCoTHEO-CBC SiCoTHEO-CBC Experiment

s-a s-g S-h gt ..

4 P=4 61.21 5.92 2.12 2.06 7

,,

P=4 ,,

SiCoTHEO-DELTA SiCoTHEO-DELTA SiCoTHEO-DELTA SiCoTHEO-DELTA Experiment

s-a

SiCoTHEO-DELTA+ SiCoTHEO-DELTA+ SiCoTHEO-DELTA+ SiCoTHEO-DELTA+

s-a s-g S-h ~t

gg 8-h

gt ..

18.31 4.49 1.34 1.42 10 P=5 18.39 5.15 1.43 1.83

3 P=9 77.30 12.38 3.37 2.93 6 P=9 63.78 10.89 2.00 2.45 9 P=10 63.84 12.07 2.92 3.47

2 P=25 98.85 18.18 4.34 3.84 5

1 P=50 101.99 19.25 4.41 3.88

P=25 76.46 15.96 2.76 3.16 8 P=26 76.50 16.97 3.71 4.13

Table 2 SiCoTHEO: Mean values of speed-up for different numbers of processors P. The number of examples is 44.

6.3. S i C o T H E O - D E L T A

The third competitive system which will be considered in this paper affects the search mode of the prover. SETHEO normally performs a top-down search. Starting from a goal, Model Elimination Extension and Reduction steps are performed, until all branches of the tableau are closed. The DELTA iterator [9], on the other hand, generates small tableaux, represented as unit clauses in a bottom-up way during a preprocessing phase.

242 P - 9

300

:

200

.....:

9 9 9

P -- 25

300

...:

P - 50

300

..:

9

200

200

7il 100

,s ..."

. 9 o

9

100

s

711 100

s""

~~ ~

9

o ..-'"

."

9

9

..,----

' ~ ' ~ - - ' : : - " = "~ 0 100 200 300

T~q

o 9

0~ 0

_~o~,~_,

100

200

T.~q

9

0

300

9

,

0

100

200

--,

300

T~q

Figure 4. Parallel run-time 711 over sequential run-time Tseq for SiCoTHEO-CBC and different numbers of processors.

For example, the left subtree of Fig. 1 corresponds to the clause p(a, b). These unit clauses are added to the original formula. Then, in the main proving phase, SETHEO works in its usual top-down search mode. The generated unit clauses now can be used to close open branches of the tableau much earlier, thus combining top-down with bottom-up processing. This decrease of the proof size can reduce the amount of necessary search dramatically. On the other hand, adding new clauses to the formula increases the search space. In cases, where these clauses cannot be used for the proof, the run-time to find a proof increases (or a proof cannot be found within the given time limit). Thus, adding too many (or useless) clauses has a strong negative effect. The DELTA preprocessor has various parameters to control its operation. Here, we focus on two parameters: the number of iteration levels l, and the maximally allowable term depth td. l determines how many iterations the preprocessor executes. The number of generated unit clauses increases monotonically with 1. The term depth td of a term is the maximal nesting level of function symbols in that term. E.g., td(a) -- 1, td(f(a, f(b, c))) -3. In order to avoid an excessive generation of unit clauses, the maximal term depth td of any term in a generated unit clause can be restricted. Furthermore, DELTA is configured in such a way that a maximum number of 100 unit clauses are generated to avoid excessively large formulas. For our experiments, we use competition on the parameters l and td of DELTA. The resulting formula is then processed by SETHEO, using standard parameters (iterative deepening over A-literal depth). Hence, execution time in the parallel case consists of the time needed for the bottom-up iteration TDELTAplus that needed for the subsequent topdown search Ttd. As before, the overall execution time of the abstract machine, including the time to load the formula is used. With I E { 1 , 2 , . . . , 5 } and td E ( 1 , 2 , . . . , 5 } , a total of 25 processors are used. Figure 5 shows the ratio between Tseq and TII for all examples. Again, we va.ry the number of processors (4, 9, 25). Our experiment has been carried out with the same examples as in the previous section (Tseq > ls). Table 2 (middle section, Experiments 5-7) shows the

243 numeric values for the obtained speed-up.

P=4 200

P=9

.....""""

t ....i

200

.....i.""""*

TH 1oo

..,.. .-'"

0

P=25

....

$

200

Tll .

1oo

100

200

Zseq

300

.-""

m...~..---'-~

0 L-..~-~-= ~ . . . . 0 100 200 300 9

0

:,.

Zseq

Figure 5. Parallel run-time TII over sequential run-time different numbers of processors.

0 0

:

.

i!" 100

.. 200

300

Zseq

Tseq for

SiCoTHEO-DELTA and

In general, the speed-up figures obtained with SiCoTHEO-DELTA show a similar behavior as those for SiCoTHEO-CBC. This can be seen in Figure 5 and Table 2 (Experiments 5-7). Here, however, there are several cases in which the parallel system is running slower than the sequential one. The reason for this behavior is that the additional unitclauses are not useful for the proof and increase the search space too much. This negative effect can easily be overcome by using an additional processor which runs the sequential SETHEO with default parameters. The resulting speed-up figures are shown in Table 2 (Experiments 8-10, rows marked by SiCoTHEO-DELTA+). It is obvious that in this case the speed-up will always be greater or equal to 1. Although the arithmetic mean is not influenced dramatically, we can observe a considerable increase in the geometric mean. This fact indicates a "smoothing effect" when the additional processor is used. The scalability of SiCoTHEO-DELTA is also relatively limited. This is due to the coarse controlling parameters of DELTA. The speed-up and scalability could be increased, if one succeeded in producing a greater variety of preprocessed formulas. 7. C o n c l u s i o n s In this paper, we have presented a parallel theorem prover based on the theorem prover SETHEO. Parallelism is exploited by homogeneous competition. Each processor in the network is running SETHEO and tries to prove the entire formula. However, on each processor, a different set of parameters influence the search of SETHEO. If this influence results in large variations of the run-time, good speed-up values can be obtained. In this work, we compared three different systems based on this model: SiCoTHEO-PID performs parallel iterative deepening, SiCoTHEO-CBC combines two different completeness bounds using a parameterized function, and SiCoTHEO-DELTA combines the traditional top-

244 down search of SETHEO with the bottom-up preprocessor DELTA, where the parameters of DELTA are the basis for competition. Since the search space is partitioned by the prover, the speed-up values of SiCoTHEOPID are always larger than one. However, only little efficiency could be obtained, and SiCoTHEO-PID's peak performance is reached with only 4 processors. In general, good efficiency and reasonable scalability can be obtained only, if there are enough different values for a parameter, the parameter setting strongly influences the behavior of the prover, and if there is no good default estimation for that parameter. Both the combination of search-bounds and the combination of search modes have been shown to be appropriate for competition. The scalability of both systems is still relatively limited. The implementation of SiCoTHEO, using pmake combines simplicity and high flexibility (w.r.t. the network and modifications) with good performance. In many cases, superlinear speed-up could be obtained. Future enhancements of SiCoTHEO will certainly incorporate ways to control DELTA'S behavior more subtly. Furthermore, a combination of SiCoTHEO-DELTA with SiCoTHEO-CBC and heuristics control using Neural Networks will increase the overall efficiency and scalability of SiCoTHEO substantially. Finally, experiments with SiCoTHEO can reveal, how to set the parameters of sequential SETHEO and DELTA in an optimal way. REFERENCES

1. A. de Boor. PMake - A Tutorial. Berkeley Softworks, Berkeley, CA, Janua~. 1989. 2. W. Ertel. OR-Parallel Theorem Proving with Random Competition. In Proceedings of LPAR '92, pages 226-237, St. Petersburg, Russia. Springer LNAI 624, 1992. 3. W. Ertel. Parallele Suche mit randomisiertem Wettbewerb in Inferenzsystemen. Series DISKI 25. Infix, St. Augustin, 1993. 4. W. Ertel. On the Definition of Speedup. In PARLE, Parallel Architectures and Languages Europe, pages 289-300. Springer, 1994. 5. C. Goller, R. Letz, K. Mayr, and J. Schumann. SETHEO V3.2: Recent Developments (System Abstract) . In Proc. of Conference on Automated Deduction (CADE) 12, pages 778-782. Springer LNAI 814, 1994. 6. R. Letz, K. Ma.yr, and C. Goller. Controlled Integration of the Cut Rule into Connection Tableau Calculi. Journal of Automated Reasoning, 13:297-337, 1994. 7. R. Letz, J. Schumann, S. Ba.yerl, and W. Bibel. SETHEO: A High-Performance Theorem Prover. Journal of Automated Reasoning, 8:183-212, 1992. 8. D.W. Loveland. Automated Theorem Proving: a Logical Basis. North-Holland, 1978. 9. J. Schumann. D E L T A - A Bottom-up Preprocessor for Top-Down Theorem Provers. System Abstract. In Proc. of Conference on Automated Deduction (CADE) 12, pages 774-777. Springer LNAI 814, 1994. 10. J. Schumann and R. Letz. PARTHEO: a High Performance Parallel Theorem Prover. In Proc. of Conference on Automated Deduction (CADE) 10, pages 40-56. Springer LNAI 449, 1990. 11. G. Sutcliffe, C.B. Suttner, and T. Yemenis. The TPTP Problem Library.. In Proc. of Conference on Automated Deduction (CADE) 12, pages 252-266. Springer LNAI 814,

245 1994. 12. C.B. Suttner and J. Schumann. Parallel Automated Theorem Proving. In Parallel Processing for Artificial Intelligence, pages 209-257. Elsevier, 1994. 13. C.B. Suttner. Static Partitioning with Slackness. Series DISKI. Infix, St. Augustin, 1995.

246 Johann Schumann

Johann Schumann studied computer science at the Technische Universit~it Miinchen (TUM) from 1980 to 1986. Then he joined the research group "Automated Reasoning" at the TUM and worked on the development of sequential and parallel theorem proving. In 1991 he obtained his doctoral degree with a thesis on efficient theorem provers based on an abstract machine. 1991-1992 he worked in industry as project manager in the area of network management and control systems. His current research interests include sequential and parallel theorem proving and application of automated theorem provers in the area of Software Engineering.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

Low-Level Computer

249

Vision Algorithms: Performance Evaluation on

Parallel and Distributed Architectures G. Destri and P. Marenzoni Dipartimento di Ingegneria dell'Informazione Universit~ di Parma Viale delle Scienze, 1-43100 Parma, Italy Tel. +39-521-905708 Fax +39-521-905723 e-mail: {destri,marenz}@CE.UniPR.IT 1. I N T R O D U C T I O N Computer Vision (CV) is a valuable tool in many fields: from robotics to geophysics, from medicine to industrial quality control. The recognition of objects and their configuration in a scene is allowed by the identification and extraction of significant information from raw image data. Several paradigms can be used to classify CV methods. However, regardless of the chosen approach, a "low-level processing" is always necessary, namely processing which does not alter the image data structure (pixel array), but only changes individual pixel values. Early filtering, image smoothing, noise reduction, edge detection and region partitioning are some examples of low-level processing. In many cases low-level CV methods are not completely satisfactory, and it is necessary to introduce also at this first level the knowledge of scene contents into the automated system [1,2]. Other forms of computation can have output data structures different from the input ones. For example, a high-level image descriptor can output the number of regions with a particular label, obtained through an image segmentation algorithm [1]. This paper is about the performance evaluation of low level CV algorithms operating in parallel and distributed environments. Typically, the number of pixels to be processed in the images can range from several thousands to a few millions. This fact is more relevant in applications such as remote sensing, where realistic sizes range from 1024 x 1024 up to 4096 x 4096 pixels, or more. The necessity to limit processing times to reasonable values (e.g., in real-time systems), or the large memory requirements, force the user to go beyond the single machine hardware limits, exploring parallel or distributed approaches. In lowlevel CV algorithms most of the operations are local, that is, the new value associated with each pixel depends only on the values coming from a well defined and limited neighborhood of that pixel. Therefore, low-level CV problems are well suited to be ported on a parallel or distributed environment, since they show also the most balanced behavior from the point of view of the computation versus communication ratio. The Cellular Neural Network (CNN) paradigm [3,4] is very appropriate to describe this kind of computation, because it embodies, as special cases, all CV problems solved with algorithms involving local

250 operations. Hence, the use of CNNs to evaluate performance of CV applications in parallel environments is the best choice, since CNNs are both a superset of all local low-level CV algorithms and well suited for parallelization. A measurement taken with respect to the most general CNN formulation can give an effective value of the lower performance bound offered by a given platform for CV applications. Indeed, from a computational point of view, the chosen algorithm is comparable to the most intensive CV operators. In this work a complete performance evaluation of a low-level CV distributed algorithm, based on the CNN paradigm, is presented. This analysis can be considered as a performance measure, on the distributed platforms tested, of the whole class of CV algorithms which may be expressed by means of the CNN formalism and parallelized following the proposed scheme. A coarse-grained parallelization scheme is used in task and data partitioning. Each processor operates on a horizontal slice of the image, the communications being limited to slice borders updating [5]. The resulting communication pattern involves only the exchange of large packets, therefore performance is not penalized by hardware and software communication latencies. To obtain satisfactory performance speed-up with this CV algorithm it is not essential to run on a dedicated parallel architecture, in particular when large images must be processed and the computational weight is more important than the communication weight. Clusters of high performance workstations can provide a significant memo .ry extension, maintaining high efficiencies. Versions of the program have been implemented both on parallel architectures, a Connection Machine CM-5, a Cray T3D, and an IBM SP2, and on workstation clusters, adopting the available Fortran languages, in order to achieve the best performance in the most intensive computations on any parallel platform. Fortran environments are, in fact, better supported than C ones in many parallel architectures, and guarantee higher performance of the programs. While a portable version is implemented with the public domain PVM library, the CM-5, T3D and SP2 ones are optimized to achieve the best performance on each parallel architecture, exploiting the most effective message passing environment with respect to the underlying hardware. A wide set of tests have been performed, in order to complete a detailed performance evaluation of this CNN-based CV algorithm. Processing times have been taken, at increasing image sizes, on 32 nodes of a Connection Machine CM-5, 32 nodes of a Cray T3D, and 32 thin nodes of an IBM SP2 [6], comparing them with sequential CNN runs carried out on a SPARC-20 workstation. Homogeneous comparisons have been obtained also, running the PVM version on a cluster of SPARC-IPX workstations and the sequential version on a single identical machine.

2. C N N P A R A D I G M 2.1. C N N Basics Cellular Neural Network (CNN) [3] is a computational paradigm defined in discrete regular N-dimensional spaces. The building block element of such a paradigm is a unit (or cell), corresponding to a point in the N-dimensional space, that performs both arithmetic and logic operations. Typically, in CV applications each unit corresponds to a pixel. CNN's main characteristic is the locality of the connections between the units. The most important difference between CNN and other Neural Network paradigms is the fact that

251

information is directly exchanged only between neighboring units. In Hopfield networks, for example, the units can be distributed on the nodes of a regular lattice, but each node is connected to all the other nodes of the network. Furthermore, in the widely diffused multilayer perceptrons each unit is connected to all the units of the previous layer. CNN locality, however, does not prevent global processing. Communications between non directly connected (remote) units are obtained by means of consecutive moves through other units, with several algorithm iterations. Generally, to give a measure of the neighborhood size, the chessboard distance convention [1] is used, expressed by the equation:

d~ = m~x(Ix~ - xjl, ly~ - yjl).

(~)

CNN cells are multiple input-single output "processors," each one described by its own parametric functional. A cell is characterized by an internal state variable, usually not directly observable from outside the cell itself. The CNN cell grid can be a planar array with rectangular, triangular or hexagonal geometry, a 2-D or 3-D torus, a 3-D finite array, or a 3-D sequence of 2-D arrays (layers) [3,4]. A CNN dynamic system can operate both in continuous or discrete time. It is possible to consider the CNN paradigm as an evolution of the Cellular Automata paradigm [7], and it is also possible to exploit the existing Cellular Automata rules to design a CNN system [3]. Moreover, it has been demonstrated in [8] that the CNN paradigm is universal, being equivalent to the Turing Machine. 2.2. C e l l u l a r A u t o m a t a A Cellular Automaton (CA) is a discrete dynamical system. Space, time, and states of the system are discrete. Each point in a regular spatial lattice, called cell, can have one of a finite number of states. The states of the lattice cells are updated according to a local rule. That is, the state of a cell at a given time depends only on both its own state at the previous time step and on the states of its nearby neighbors at the previous time step. All cells are updated synchronously. Thus, the state of the entire lattice advances in discrete time steps. A CA A is formally described by a quadruple:

A=< S,d,V,f > . S is a finite set of labels or numbers, d is the dimension of the CA, V is the neighborhood and f is the evolution rule. The dimension d defines a d-dimensional lattice L - Z d, where Z is the set of all integers. A point x C L is called cell. The underlying space of A is St'; an element c E S L is a map from L to S, associating a label s C S with every point x E L:

c:L~S;

c(x)=s;

c E S r"

A label s associated with a cell is called a state, and c is called a configuration of A. The set of the finite configurations is the domain of the "local" function f:

f :S v ~S

252 the function f associates a label from the set S with every finite configuration. The "global" evolution function G is the application of the local function f to the finite configuration of the neighborhood of every point in L: the new value of G(c) at a point x is the value of f applied to the neighborhood of x. The function G is spatially invariant on S L. Repeated applications of G give the dynamics of a CA; i.e., a sequence of configurations, each obtained by applying G to the previous one. The time evolution of a CA is fully determined by the local function f. From the above definitions and considerations it is more evident why CNNs can be considered as an evolution of CAs. The states are continuous values (i.e., real numbers), the functions acting on these values can be as complex as required, and different in each lattice point. Moreover, external inputs can also participate in the time evolution of the system. 2.3. C N N E q u a t i o n s In this work only discrete time CNNs are considered, since we are interested in them as a formal model for the software algorithms. A formal mathematical description of the discrete time case is:

xj(t~+~) =

Aj[yk(t,~), pA] + y~ Bi[uk(tn),P~]

g[xj(t,~)] + 5 + keNt(j)

yj(t.) = f[xj(t.)],

(2)

kENs(j)

(a)

where xj is the internal state of a cell, Y.i is its output, uj is its external input and Ij is a local value called bias. Aj and Bj are two generic parametric functionals, also called templates, and pjA and P f are the two arrays of parameters. The two functionals can be, for example, linear combinations, nonlinear functions (e.g., exponentiations), or polynomial functions, while the parameter arrays play the role of the involved coefficients (e.g., in the case of linear connections pA and P f are the sets of connection weights). At each iteration, the y and u neighbor values are collected from the cells present in the neighborhoods Nr (for the feedback functional A) and Ns (for the control functional B). The two neighborhoods may be different, and they can include the cell itself, that is, the cell input value uj (in the B functional) or its output value yj(tn-1) (in the A functional) can be arguments to the functional itself. Then the two templates are computed, generating the internal state x with the addition of the bias I. Finally, the activation function f generates the output from the internal state, f is typically a Gaussian, a linear function with saturation (amplifier model), a quantizer, a single step or a sigmoid (also called logistic function). The instantaneous local feedback function g, often not used, expresses the possibility of an immediate feedback effect. In many cases the system is non-Markovian, that is the future system evolution can depend not only on the present but also on the past history. The system represented in equations 2 and 3 is, of course, strictly Markovian. Only the linear_connection subset of two-dimensional CNNs will be considered, since most CV algorithms can be expressed on the basis of this assumption. The A and B functionals become linear combinations of the neighborhood values, where the parameters are the connection weights [9]. The operation expressed in functionals A and B becomes

253 a convolution with a weight mask, usually represented as an h • h 2-D matrix, also called linear template (see block scheme in Figure 1). In this way, the equation 2 becomes, for the lattice point j:

xj(t~+l) = g[xj(t~)] + b +

~ kENt(j)

ak(.j) "y~(t~) +

~

bk(j) "uk(t,~)

(4)

kE Ns(j)

where ak(j) and bk(j) are the parameters, acting as weights of the connections, that is, coefficients of the linear combination of the neighborhood Nr(j) of the point j. In the following r will denote the template radius, i.e. the chessboard distance defining the neighborhood limit (for example, an r = 1 template is equivalent to a 3 x 3 template), and it will be considered constant in the whole lattice for both control and feedback templates. The f function will be considered space-invariant, but with different coefficients in each lattice cell (e.g., the gain and saturation limits in the linear case). Since our goal is to analyze the actual performance of several parallel or distributed architectures with respect to low-level CV applications, based on a CNN implementation, we have chosen the most penalizing case, from the point of view of computational cost and memory requirements. Therefore, even if always performing the linear combination of the neighborhood values, the A and B templates will not be constant along the image, but rather they will be space-variant, namely with different coefficients in each lattice point. This allows us to perform different operations on different image pixels, while executing everywhere the same machine sequence of instructions. Data and CNN parameters in this framework are floating-point numbers. 3. W H Y C N N s F O R CV? Given the formalism described in the previous section, in what follows we will analyze several concrete applications of CNNs to the CV domain, in this manner justifying the use of CNNs to "parameterize" the general behavior of low-level CV algorithms.

3.1. Elementary operators and C N N s Low-level CV algorithms generate as output 2-D arrays, i.e., a data type conformal to the input. The pixel transformations can be classified into three main categories.

9 Point functions, where the output of each pixel value depends only on the local input value of the pixel itself. Value rescaling and thresholding operations represent typical examples. 9 Local functions, where in each pixel the output value depends on the input values coming from a limited neighborhood of that pixel. Derivative operators, local averaging and convolution-based algorithms are some examples of these functions [10,5]. Some matching techniques require, however, enhanced values of the template radius (e.g. 33 • 33 neighborhoods), with a much higher computational weight. 9 Global functions, where the output value in each pixel depends on the whole image. Complex image transformations (e.g. spatial Fourier transform) or global histogrambased techniques [1] belong to this category.

254 Control Template

I

J 3

Local Input

Bias

f-k

I

_I f

:~Output

Feedback Template Input from Neighborhood

! I I

!.11 Feedback from Neighborhood

Figure 1. Scheme of a complete CNN iteration.

Both the first and the second transformation kinds can be immediately expressed in terms of the equations 2 and 3. In particular, the CNN becomes a point function when the feedback template and the bias are set to zero, and the control template has all coefficients equal to zero, except for the local cell, the input image being the external input. Moreover, without feedback template and performing just one iteration, the CNN becomes a simple convolution-based algorithm (a typical local function). In a similar way, the simple averaging algorithm can be expressed by a linear feedback template, by setting: 1

xj(t + 1)= -6 Z

y~(t),

(5)

kENt(j)

that is, a linear combination of the 3 • 3 pixel neighborhood with equal weights, while the control template and the bias are set to zero, and the f function is simply a multiplication by a rescaling coefficient. The well-known Sobel or Kirsch operators [1] can also be obtained by properly setting the control template weights and the f internal function, the control input variable u3 being the input image itself. Generally, a single iteration of a CNN has a computational cost greater than or at least equal to a local operator and, in this case the roles of the control and the feedback template are completely equivalent from a computational point of view. It must be observed that, for example, one CNN iteration has the same computational cost as a gradient extraction obtained through the combination of two spatial derivative operators. The internal function plays the role of this combinator (e.g., the maximum choice).

255

0.05 0.1 0.05

0.1 0.05

0

0

0

0.1

0

0.5

0

0.1 0.05

0

0

0

0.44

Figure 2. A noise cleaning CNN operator, suitable for clustering. (left) Feedback template. (right) Control template. The bias parameter is a function of the local average luminance. This operator is typically iterated from two to eight times.

A CNN-based noise-reduction operator, derived from [9], is shown in fig. 2. The same operator, with a quantizer as internal function f, was successfully used in [11] for the clustering of a noisy image [12]. In particular, this algorithm is based on the combination of the clustering with a luminance correction obtained through an appropriate function, based on a 7 x 7 CNN collecting the average value of luminance in the neighborhood of the point. Figure 3 shows the application of the operator to a real world noisy image (a road). In a similar way, we can express the operations dedicated to edge detection [9]. Other significant results of CNN applications to CV are texture image segmentation (e.g., [13, 14]), feature detection (e.g., [15,16]), and object tracking and recognition (e.g., [17]). Modular convolution-based algorithms are other interesting applications (e.g. the wellknown Canny Edge Detector [18], which can be expressed as a multilayer single-step iteration CNN, in which the operations are consecutively performed by the layers). Many global functions (e.g., diffusion algorithms [5]) can also be obtained through a CNN with appropriate coefficients, iterating a sufficient number of times to ensure the information transmission along the whole image. Generally, CNN parameters defining the system behavior can be chosen by the programmer, in such a way as to define "mathematically" the function to be performed by the network [4]. The possibility to impose a machine learning of these coefficients may become a necessity, especially in case of a desired complex behavior. In some cases the applicability of "traditional" techniques for Neural Network learning, such as back-propagation has been successfully demonstrated [14]. Sometimes, the characteristics of some nonlinear activation functions, widely used in CNNs, create many problems to the application of these learning techniques. To overcome these limits, a new kind of training, based on Genetic Algorithm techniques [19], has been developed in [20] and [15]. An application of these techniques to the design of filters for image clustering, presented in [21], is shown in figure 4 All the previous considerations and examples support the use of the CNNs as a performance evaluation "paradigm" for CV, since measurements taken with CNN can give a realistic image of the actual performance of a given architecture with respect to the whole class of local CV operators.

256

Figure 3. An example of CNN clustering algorithm: (a) original image, (b) processed image.

257

Figure 4. An advanced example of CNN clustering algorithm: the filter has been obtained by means of a genetic approach. (Top-left) original image, (top-right) filtered image after 4 CNN iterations, (bottom-left) filtered image after the application of a step threshold and (bottom-right) the edges of the regions, shown for clarity.

258

3.2. C N N s for e x p e c t a t i o n - d r i v e n algorithms One of the characteristics of low-level processing of images is that it is a data-driven process, this means that generally no global knowledge about the image is required to process it. Nevertheless, an iterative CV processing can "extract" part of its evolution rules from a priori knowledge. For example an expectation-driven algorithm has been applied in [22] and [23], with a "synthetic" image playing a guide role. In CNN-based algorithms the "synthetic" input becomes the control input, that can also be variable in time, while the initial state is the image to be processed, and the algorithm acts in several iterations. 3.3. C N N s for m i ddl e level functions The algorithm we describe here has the main goal to detect presence and position of some a-priori known shapes in a real world image [24]. The algorithm must be very robust with respect to noise and imperfection of the image. The method is based on the matching approach [5], enhanced by means of some a-priori knowledge. Given the a-priori knowledge of the shape and of its approximate size, with an appropriate tolerance, the algorithm is obtained by means of a single-iteration CNN, where the control input is the "ideal" representation of the shape to be searched, and the internal initial state is a subwindow of the image. We want to know whether the central pixel of this subwindow (i.e. the pixel where the CNN is applied) is a good candidate to be the center of the shape searched for. The response is a sort of quantitative measurement of the goodness to this matching. First the image is processed by means of a gradient operator to extract the border candidate pixels of the image. This result is not binarized, since we want to maintain a quantitative measurement of the edge intensity, and to exploit it to enhance the correctness of the matching process. Since a tolerance with respect to the size is also desired, the algorithm is iterated in successive steps, driven by appropriate thresholds, in a hysteresis approach [18]. 9 The first matching takes place between the gradient image subwindow and the ideal representation of the shape (see fig. 5 and 6); only in the pixels in whose neighborhood the gradient intensity exceeds an appropriate threshold the matching takes place; 9 if the value of the error obtained between the two images is smaller than the first threshold we have found the shape searched for; 9 if the value of the error is greater than the second value this point cannot be the center of the shape; 9 if the value is between the two thresholds, then the matching is performed with the "enlarged" ideal shape (see fig. 5 and 6), where new pixels have a smaller value than the original ones; 9 the iteration is repeated for two or three times. The tests, performed on sample real world images obtained in gray tones by means of a normal video-camera and a digitizer, demonstrated that this simple method allows to

259

Figure 5. The CNN-based matching algorithm for shape detection: the "ideal" form to be compared with the real image subwindow. In the successive steps the circular shape is "enhanced" by the adding of internal pixels. The darkness of the pixels is proportional to the intensity of match: the internal pixels contribute with a lower weight to the matching value.

obtain about 96% of correct results, and often to find the shape searched for also where it is difficult to find even for a human observer. 4. W H Y " P A R A L L E L " C N N s ? Parallel architectures can be classified into two main categories. The systems where a single code execution takes place and the Processing Elements (PEs) only operate on different data sets are called SIMD (Single Instruction stream over Multiple Data streams) parallel computers. The systems where each processor is able to run its own code asynchronously on its data set are the MIMD (Multiple Instruction streams over Multiple Data streams) parallel computers. Regular grid-oriented algorithms are best suited to be easily ported to any parallel or distributed platform, both SIMD and MIMD. Grid applications, in fact, allow all processors to execute the same work, the load balancing being always perfectly even, therefore minimizing the delay phases in MIMD environments. Moreover, many grid-oriented applications show the locality property, that is the evolution in time of each point (i.e., pixel) only depends on its nearest neighbors. This feature allows on a parallel platform to avoid general interprocessor communications, and to achieve the maximum efficiency of the algorithm. As a matter of fact, many parallel architectures dedicated to low-level image processing have been developed, and a number of parallel algorithms have been designed for that purpose. Moreover, most of the algorithms discussed in the previous section have been successfully parallelized [5]. CNNs combine both characteristics - the regularity and the locality- that are typically shown by CV applications. Therefore, the use of CNNs to "parameterize" the behavior and to analyze the performance of low-level CV applications on distributed platforms can be considered very adequate to obtain results of general validity. 5. M A C H I N E

CHARACTERISTICS

In this section we will briefly review some main characteristics of the parallel architectures we have used to test and analyze our general CNN a l g o r i t h m - the Connection

260

Figure 6. The CNN-based matching algorithm for shape detection" four step of the matching algorithm. (Top-left) Original Image, (top-right) gradient image, (bottom-left) thresholded gradient image with marker on shape position, (bottom-right) original image with marker on shape position.

261 Machine CM-5, the Cray T3D, and the IBM SP2. 5.1. C o n n e c t i o n M a c h i n e CM-5 The Connection Machine CM-5 is a multiuser, MIMD, timeshared massively parallel system [25] comprising many processing nodes, each with its own memory., supervised by a control processor, the partition manager (PM). A CM-5 PE consists of a RISC processor, a Network Interface chip, 4 memo .ry units of 8 MBytes RAM, and 4 Vector Unit (VU) arithmetic accelerators connected through a 64-bit M-bus. The RISC processor (a SPARC-2 chip with a clock rate of 32 MHz) acts as a control and scalar processing resource for the VU. The SPARC-2 performs address calculations, loop controls and instruction fetches, and executes the "scalar" portion of the node application. The microprocessor sends all the "parallel" operations (i.e. the vectorizable code) to the VU for execution. The VU accelerators use deep pipelines and large register files (128 32-bit registers per VU) to improve peak computational performance. The peak floating-point performance rate is 128 MFlop/s per node with VU. PM and PEs are connected to two communication networks, organized in a fat tree architecture: the data network, used for bulk data transfers in which each item has a single source and destination, and the control network, used for operations that involve all the nodes at once, such as synchronization, broadcasting and combining. In the 4-ary fat tree implementation of the data network each PE is a leaf, and data routers are all the internal nodes, each connection providing a bandwidth rate of 20 MBytes/s in each direction. In the first two levels of the tree, however, each router uses only two parent connections to the next higher level and only starting from the third level all routers use four parent connections. The CM-5 system supports several high level languages for the message passing programming model: CM-Fortran, C* and the standard C, C + + and Fortran77. Only CMFortran and C*, however, can support the VU hardware. C, C + + and Fortran77, instead, can program only the SPARC-2 microprocessors. All the languages can be exploited for message passing programming, integrating them by means of the CMMD library. All our programs, running on a CM-5, are written in CM-Fortran version 2.2 [26], with CMMD libra,, version 3.2 [27]. 5.2. C r a y T 3 D The Cray T3D [28] is a MIMD massively parallel system [25], connected to a host computer, that provides support for applications running on the T3D. All applications are compiled on the host system but run on the Cray T3D system. Each node contains two PEs, a local memory, a Network Interface and a block transfer engine. A T3D PE is a RISC DEC Alpha microprocessor, performing arithmetic and logical operations on 32 integer and 32 floating-point 64-bit registers, with a clock rate of 150 MHz. The microprocessor contains an internal instruction cache memory and a data cache memory each storing 256 lines of data or instructions. Each line is four 64-bit words wide. The size of local memory is 64 MBytes per PE. The block transfer engine is an asynchronous direct memory access controller that redistributes system data. The peak floating-point performance per PE is 150 MFlop/s. The interconnection network forms a three-dimensional matrix of paths and is composed of communication links and network routers, allowing a bidirectional maximum transfer rate of 300 MBytes/s.

262 Two compilers are available on the T3D: CRAFT Cf77 Fortran and C. The message passing programming model is supported on both compilers by the public domain PVM libra .ry for the message passing primitives. Moreover, Cray provides some extensions to the PVM libra .ry, in order to speed-up interprocessor data transfers. On the T3D the virtual shared memo .ry mechanism is also available. In a T3D system, in fact, memo .ry is physically distributed among processors, but is globally addressable. Any PE can address any memo .ry location in the system, providing communications much faster than PVM. All the high level compilers available on the T3D can directly take advantage of this shared memory mechanism. All the applications tested on the T3D are written in Cf77 Fortran version 6.2 [29], with shared memory primitives for message exchanges. 5.3. I B M S P 2 The IBM SP2 is a general-purpose scalable parallel system [6], based on a distributed memo .ry MIMD architecture. The POWER2 RISC System/6000 processors constitute SP2 nodes, each with its own private memory and its own copy of the AIX operating system. The SP2 provides two node types: wide node and thin node. Both nodes have two fixed-point units and two floating-point units (each capable of a mult-add every cycle), running at 66.7 MHz of clock rate, for a peak floating-point performance of 267 MFlop/s per node. An SP2 wide node can have up to 2 GBytes of memory, with a bandwidth of 2.1 GBytes per second, and a 256 KBytes four-way set associative cache. The SP2 thin nodes are similar to the wide nodes, but have a less robust memory hierarchy and I/O configurability. SP2 nodes are interconnected by a High-Performance Switch. The topology of the switch is an any-to-any packet-switched multistage or indirect network similar to an Omega network. This allows the bisection bandwidth to scale linearly with the size of the system, a critical aspect for system scalability. A consequence of the High-Performance Switch topology is that the available bandwidth between any pair of communicating nodes remains constant irrespective of where in the topology the two nodes lie. The parallel Message Passing Library (MPL) is the native communication library supporting explicit message passing for the XLF Fortran77 or C languages, tuned and optimized for the underlying communication hardware [30]. PVMe, the IBM optimized version of the public domain PVM library, is also available on the SP2. All our programs are written in XLF version 3.2 [31], with MPL support for message passing.

6. M E S S A G E

PASSING CNN ALGORITHM

There are two programming models used to parallelize programs in a distributed environment - the data parallel model and the message passing model. Data parallelism refers to a situation where the same operation is synchronously executed on a large array of data (operands) and the elementary unit viewed by the programmer is the array element, rather than the machine processor. In this environment only one code is running and the compiler is responsible of both data distribution and communications. The message passing model instead requires the user point of view to be the single processing unit, where a copy of the user program is asynchronously running. Therefore, given the problem to be parallelized, the programmer must explicitly code data distribution and data exchanges among processors.

263 Since our aim is to implement a general purpose CNN-based program, the only programming model universally supported is message passing. The algorithm has been conceived in order to run on any workstation cluster, therefore reserving particular attention to minimize communications. 9 Data partitioning. Two different strategies can be implemented when distributing grid-oriented problems over a multiprocessor platform. In particular, for a two dimensional image (Figure 7(a)), one can cut the corresponding 2-D data structure along both dimensions, producing a number of small squares [32], or one can subdivide the image into only horizontal or vertical stripes [5]. For the proposed message passing CNN implementation the second solution (i.e. the coarse-grained parallelization scheme) has been chosen, partitioning the two dimensional image along a single dimension. The resulting partial "windows" assigned to each task are then simply horizontal slices (Figure 7(b)). 9 Data

movement

scheme.

Great care must be taken, when designing a general purpose parallel algorithm, to the optimization of data exchanges between processors. This is even more important on a generic workstation cluster, where dedicated interconnection networks are not present. In our CNN algorithm, remote data items must be accessed only during the template computation, when operating over border pixels. A significant performance optimization can be obtained with a complete separation, in the source program, of computations and communications. The whole computation phase, thus, can be performed only on private (local) data. Moreover, to minimize the impact of the latency overheads on the communication primitives, the processors should exchange the largest allowed packet sizes, for the minimum number of times. Therefore, at each message passing phase, r complete rows of each image border are exchanged between logically adjacent P Es. 9 Data

rearrangement.

Each partial window of the image, assigned to a PE, is extended by means of two dummy stripes, placed at the cut borders. The image pixels necessary during the border computation, and belonging to the logically adjacent partial windows, are duplicated in the local PE memory and stored in the dummy stripes (Figure 7(c)). In this way, the whole computation phase can be completed without communications. In fact, a specific procedure called at each iteration is dedicated to rearrange (through a sort of shift operation) the border pixels between logically adjacent windows (i.e. PEs), filling the dummy stripes through specific communication primitives (Figure 7(d)). When processing an L x L image over Np PEs, each PE stores an L x L/Np partial window. Adding the two further border stripes, to collect the top and bottom r neighborhoods, the total window stored in each P E assumes an L x (L/Np + 2 x r) shape. Similar image partitioning schemes have already been proposed in [5], applied to a Meiko multiprocessor architecture.

264

Figure 7. Image partitioning across four PEs. (a) Original image. (b) Logical image subdivision in four windows. (c) Dummy borders enhanced. (d) Border exchanges between two logically adjacent windows.

The parallelization scheme described above, a coarse-grained technique from the point of view of the communication design, is specifically conceived to run on few nodes of loosely coupled machines, with large amounts of data stored on each processor, and with not too frequent exchanges of large packets. When running on generic workstation clusters, the major impact on the performance is determined by the ve.ry high communication latencies to be paid, due to the absence of high speed networks and of optimized dedicated communication protocols. The whole communication weight at each iteration is constituted only by two send primitives per running task, each one involving r x L data pixels for an L x L image. Therefore, latency overheads to be paid for each data transfer would be negligible, compared to transfer times. On the other hand, the increased amount of memo .ry required in each PE by the two dummy image stripes can be neglected, if the size of each local subimage is large enough. The complete flow chart of the general purpose distributed algorithm is sketched in Figure 8. After the spawning of the slave processes by the host task, at each iteration the function dedicated to border data movement is executed, then the two template procedures are performed, operating only on local data, and finally the internal function computes the cell outputs. No explicit synchronization is introduced among the processors, an implicit synchronization among the running tasks being guaranteed by the communication step.

265

BEGIN

I SPAWN PROCESSES

L COMPUTE CONTROL TEMPLATE

1 COMPUTE FEEDBACK TEMPLATE

1

INTERNAL FUNCTION

1 COMMUNICATION SHIFF

END

Figure 8. Complete flow chart diagram of the implemented CNN algorithm.

266

6.1. Optimized versions Four versions of the message passing CNN algorithm have been developed. Three versions are specifically optimized for the Connection Machine CM-5, the Cray T3D, and the IBM SP2 parallel machines, exploiting the most efficient message exchange capabilities offered by the architectures. Furthermore, a general purpose portable version also exists, efficiently running over any workstation cluster. All the programs are written in Fortran. More precisely, on the CM-5 we have used the CM-Fortran version 2.2 [26], on the T3D the CRAFT Cf77 version 6.2 [29], and on the SP2 the XLF version 3.2 [31]. The general purpose version has been coded using the standard Fortran77 language. The portable code is written in a master-slave model and exploits only public domain PVM message passing primitives. In the three dedicated parallel versions the master program must not be supplied by the user, since the slave executable copies are automatically spawned at run time by the operating systems. These programs are written in the so called host-less programming model. The CM-5 CNN version exploits the CMMD_send_and_receive message passing function. In the T3D version, the virtual shared memory mechanism, supported by Cray, is directly exploited. More precisely, the shmera_put functions are adopted in the communication procedure to write image borders on the memory of another PE, providing communication transfers much faster than the standard PVM primitives [33]. Finally, the SP2 code uses for message passing the optimized MPL_SHIFT primitive. 6.2. I m a g e I / O The image I/O problem has two possible solutions, depending on the environment. The first one, always available in a massively parallel environment, is the simultaneous accesses to the same file by many tasks, managed by the parallel operating system. Each task "knows" that it must read only a well-defined portion of file, that is, the assigned subwindow with the dummy borders. In a workstation cluster environment the same approach is possible only if a shared file system is available. Another solution can be a master task reading the image file and performing the image partition among the other tasks with the creation of the dummy borders and the collection of final results. 7. P E R F O R M A N C E

ANALYSIS

A number of experiments have been carried out in order to analyze the performance of this distributed CNN-based implementation of low-level CV applications, on all the platforms tested. The performance evaluation issue has been addressed following two main perspectives. 9 The first aspect concerns the study of the improvements in computing times allowed by the distributed implementations with respect to sequential ones. Therefore, runs have been performed over the three parallel architectures (CM-5, T3D, and SP2), comparing the results with a high performance workstation running a sequential CNN program. Furthermore, some measurements have been taken on homogeneous workstation clusters, comparing the computational times with the sequential implementation, running on a single identical workstation.

267 9 The second main performance parameter characterizing a distributed algorithm concerns the speed-up and the efficiency achieved at increasing number of PEs. This analysis has been carried out both on the parallel architectures and on different configurations of the workstation cluster, running the PVM portable CNN version. In what follows several performance measurements will be reported, with r = 1, as a function of both the image size and the number of processors (or workstations) used. Only timings referring to the representative linear internal function with saturation will be discussed. In this manner, the computation of the total number of elementary floating-point operations (Flops) executed is easier to be parameterized. Processing times measured with other functions are not substantially different from the reported ones and do not significantly modify performance considerations and speed-up behaviors.

time [sec]

5

/

/

sequential

/

1

T3D

0.5

,i

/.//

...I

---..t-,,.--

0.01 0.005 0.05 0.001

16o

260

560

1o'oo 20'o0

M-Flop/s 1000. ~.--

500

"

--

200.

SP2

.- -- ~ -._..-"- - - 7 . . ~ . _ ~. / --

/

CM-5

................

/ I00.

T3D

/ /

5O.

/

20. lO.

sequential 100.

200.

500.

1000.

2000.

50005

Figure 9. Processing times (top) and corresponding MFlop/s (bottom), as a function of the image size L, measured running a sequential CNN with r = 1 on a SPARC-20 workstation and the distributed CNN on a 32 PE CM-5, a 32 PE T3D, and a 32 PE SP2.

268

7.1. Performance comparison Figure 9 (top) reports the processing times of the three parallel CNN implementations, on a 32 PE CM-5, a 32 PE T3D, and a 32 thin node SP2, measured with L • L images (times refer to a single CNN iteration and are expressed in seconds). In the figure the execution times of a serial CNN implementation running on an high performance workstation are also plotted (a SPARC-20 workstation with 64 MBytes of memory). The comparison shows that an improvement by at least one order of magnitude is achieved by the message passing algorithm on all the parallel systems. The improvement is more important at large image sizes, because of the lower impact of communications on the performance of the parallel architectures. The SP2, and particularly the CM-5, show, in fact, a performance degradation at low L values. For the T3D, instead, the time scaling with L 2 is almost linear, owing to the very efficient communications capabilities allowed by the shared memory machine configuration. As a matter of fact, with L = 64, the CM-5 runs 6 times better than the SPARC-20, the SP2 30 times better, and the T3D 45 times better. Moreover, to give an exact evaluation of the different computational capabilities of the four systems under consideration, Table 1 reports the sec/cell required by the algorithm on the platforms tested, as resulting from all the measures, that is, even when more realistic problem sizes are processed (such as L = 1024). Figure 9 (bottom) reports the MFlop/s corresponding to the previous computational times. The number of elementary floating-point operations to be performed at each CNN iteration, as a function of the template radius r, considering the linear function with saturation, is: Ni,op = 2 - ( 2 - ( 2 r + 1) 2 - 1) + 5

(6)

While the sequential implementation assures always about 10 MFlop/s of performance, the three parallel implementations achieve, respectively for the CM-5, the T3D and the SP2, a maximum performance of 865, 425, and 1110 MFlop/s. It should be noted that, while CM-5 and SP2 performance increases with L, the T3D is not able to maintain the performance sustained at low L values, owing to the direct mapped architecture of the cache memo .ry present on the T3D Alpha processors, which causes an increasing number of cache miss memo .ry accesses, when the total memo .ry required by the problem increases. Another important aspect to be taken into account is the increase of the image sizes that can be be processed, due to the much larger memory available on the parallel systems. The maximum image size fitting the memory of the sequential workstation is L = 1024. On the contrary, with 32 nodes of the Massively Parallel Processor (MPP) machines, image sizes up to L -- 4096 can be processed, given the r = 1 template. Since the availability of MPP architectures is not yet widespread, interesting results can arise also from a performance evaluation of the distributed CNN algorithm on a workstation cluster. In fact, clusters of workstations are the most common type of "parallel machine" available, and it is important to focus on how CV programs perform on such a platform. Figure 10 reports the computational times (top) of the distributed CNN algorithm on a cluster of eight SPARC-IPX workstations with 32 MBytes of memo .ry, as a function of L. The results of the sequential runs carried out on an identical single workstation of the cluster are also plotted. The improvement allowed by the distributed

269 time [sec] sequential 5

2

,,- f "

PVM

11 l

0,5

z t

0.2

z z

z

0.i I

0.05

,." ""

16o

260

560

M-FI o p / s

lo'oo

~

2000 L

PVM

I0. .4

7 f /

5

/ / / /

2 1.5

----______ s e q u e n t i al

1do'.' i'd6~6'6~30"o. 5do.Tdo:lObO:lSbO:

L

Figure 10. Processing times (top) and corresponding MFlop/s (bottom), as a function of the image size L, measured running a sequential CNN with r = 1 on a SPARC-IPX workstation and the distributed PVM version on 8 identical workstations.

implementation is significant only at large image sizes, and asymptotically it reaches almost a factor of eight of speed-up, when the communication weight is smaller compared to the computational weight of the problem. On the contrary, at low L values communications become a significant bottleneck, due to the low bandwidth and high latency ethernet network and TCP/IP protocol. This consideration is confirmed by the analysis of the corresponding performance (in MFlop/s) obtained as a function of L, Figure 10 (bottom). While the single workstation is able to sustain almost 2 MFlop/s at low L and slightly decreases its performance with L (due to the increasing number of page faults), the distributed PVM version scales with L from 3 up to almost 14 MFlop/s of performance. However, the sequential implementation is limited at L - 512 by memory bounds, while images up to L = 1024 can be processed using the whole cluster. Hence, owing to the coarse-grained parallelization scheme followed for the distributed

270

CNN implementation, the program is able to perform very well even on loosely coupled workstation clusters, given that the problem complexity is high enough to keep the weight of the communications low compared to the computational cost.

speed-up ..- T3D

15

12.5

..'"/

10

,.,,-"

7.5

.--

CM-5

~,.~ ..,. --,-

14"

2.5 Ff

5

10

15

20

25

30

speed-up T3D

15 -/

i SP2

12,5

10 7.5 5

p' f

2,5

i j 0

'

' " " " ' 5

" " ".L' 10

'

" . . . . . . . . . . . . . . 15 20 25

'

'-.'.." 30

" "

P

Figure 11. Speed-up of the distributed CNN algorithm, as a function of the available processors P with a 3 • 3 template and 240 • 240 (top) and 480 • 480 (bottom) images, on the three MPP machines.

7.2. S p e e d - u p and efficiency In order to study the scaling behavior of this prototype of CNN application, as a function of the number of available processors, several experiments have been carried out, at a variable number P of PEs on the three parallel machines. Figure 11 plots the performance speed-up Sp achieved by increasing P from 2 to 32 (assuming Sp = 1 for P = 2), measured with two representative image sizes: a larger one of 480 • 480 (bottom) and a smaller one of 240 • 240 (top).

271 The key parameter in this test is also the communication weight. It increases either when the number of available processors increases, or when the image size decreases. In both cases the partial image size stored on each PE decreases, providing a larger weight of communications. Very impressive is the behavior observed on the T3D. This architecture seems to be only slightly sensitive to the increasing communication weight, due to the very fast interprocessor bandwidth available (more than 100 MBytes/s) [34]. Its speed-up scaling is almost linear with P, up to a large number of processors, particularly with L - 480. However, decreasing the grid size has a negligible impact on the measured speed-up. An important index, directly related to the degree of parallelism achieved as a function of P, is the efficiency Ep, defined as Sp/P (in our case Ep =~Sp 9 2/P, as the starting point is P = 2). Ep is plotted in Figure 12, for both L = 240 and L - 480: at P = 32, Ep on the T3D is 98% with L - 480, and 93% with L = 240.

eFFiciency - " --,. ,~.~- . , . ~ . . . . . . . . . . . . . . -..,,-,~ .,~.

T3"D

0.8 ' < ~ . -., . _ . . . . . . . .

SP2

.,.,_

0..6

"- C M - 5

0..4

0.2

0

....

5 ....

1"0 . . . .

1"5 . . . .

2"0 . . . .

2"5 . . . .

:3"0

" "

P

eFFiciency 1 --.

.

.

.

................

. -'-

~=-

T3D

.~_... ~.,_-_

_

--0.8

__

_~- -.- - s P 2 -CM-5

0.6

0.4

0.2

0

....

5

....

1 '0 . . . .

1"5 ....

2"0 ....

2'5 . . . .

3'0"

" "

P

Figure 12. Efficiency of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the three M P P machines.

272 The CM-5 efficiency is much more penalized by communications. Whereas for the larger image the speed-up is not too far from linearity (the corresponding efficiency is Ep = 79% at P -- 32), on the smaller one the machine efficiency significantly slows down (Ep = 58% at P = 32). A better behavior than the CM-5, even if worse than the T3D, is shown by the SP2, whose speed-up curve is satisfactory even with a large number of processors and with small problem sizes. The corresponding efficiencies are 83% (L - 480) and 71% (L = 240), with P = 32. These high efficiency values are explained both by the extremely high performance interconnection networks available on the most recent parallel architectures and by the favorable ratio between computations and communications of the presented CNN algorithm.

speed-up T3D

,.,-'/,',

,-'/

3.5

SP2

,.,-,,,.

/

2.5

//

,//

/

1.5

I

/"t /

]]] / s'//

" '

2

4

6

' i0

8

speed-up 4

p

. T3D .,,~ SP2

z",/

,,'/ ,.,/

3.5

,,I

cluste~

2.5

:1..5 ,

,

..

.

.

.

.

,

.

.

.

.

,

.

.

.

.

.

,

,

,

,

,

p

10

Figure 13. Speed-up of the distributed CNN algorithm, as a function of the available processors P with a 3 x 3 template and 240 x 240 (top) and 480 x 480 (bottom) images, on the SPARC-IPX workstation cluster.

Important measurements are provided also by the evaluation of the performance speed-

273 up achieved when running on an increasing number of workstations in a cluster, in order to study the range of applicability of CV codes even on a generic distributed platform, without dedicated interconnection networks. Figure 13 reports, as a function of the number of workstations P, the performance speed-up Sp (up to P = 8) achieved running the portable PVM code on a cluster of SPARC-IPX, with L - 240 (top) and L = 480 (bottom). The above Sp values obtained by the T3D and by the SP2 are also reported in the figure, in order to compare the cluster results with the two best MPP results previously analyzed. The corresponding efficiencies Ep are reported in Figure 14. Key points, to be carefully considered when running on workstation clusters, are the load of the machines, which can heavily affect computation performance and subtract memory resources of the running processes, and the load of the interconnection network, which can affect communication times. Our measures are taken in single user mode and with low network traffic, considering, as the overall execution time, the best wall clock time among a series of several runs. The measurements show that, at least with up to four or six workstations (when communications do not introduce relevant overheads), the algorithm assures a nearly linear speed-up in the number of processors. The efficiencies are about 90% with both L = 240 and L = 480 image sizes. On the contrary, in the final regions of the graphs, the cluster speed-up becomes not so good, and the efficiency drops to less than 80% with L = 240, while for the two parallel machines Ep is still more than 90% with P = 8. These considerations emphasizes the fact that, when dedicated fast interconnection networks are not present, the number of processors should not become too high, in order to maintain satisfactory degrees of efficiency, even though the algorithm design is low in communication cost. Table 1 Computational times in s/cell/iteration over a L x L image, with a 3 x 3 template. Machine L = 64 L = 128 L - 256 L = 512 L = 1024 CM-5 8.25-10 -7 2.61- 10 -7 1.10-10 -7 6.71-10 -s 5.11- 10 -s T3D 9.23- 10 -8 9.85-10 -s 1.05-10 -7 1.63-10 -7 2.06- 10 -7 SP2 1.30- 10 -7 6 .5 7 . 1 0 -s 4.73-10 -s 3 . 8 9 . 1 0 -s 3.52-10 -8 SPARC-20 3.90-10 -6 3 . 9 1 . 1 0 -6 3.97-10 -6 4.04-10 -6 7.45-10 -6

8. CONCLUSIONS The CNN paradigm plays a crucial role in the CV framework. Most of the low-level CV algorithms can be expressed in terms of the general CNN equation. Moreover, middle-level CV functions too (e.g., object detection and recognition) can be efficiently implemented by means of this powerful formalism. Many CV applications involve the processing of large images, requiring the designer to overcome the limits imposed by the single workstation bounds. Parallel or distributed platforms, providing a large computational power and large memory availability, can significantly speed-up CV computations and extend the range of images to be processed. In this work a complete performance analysis of a

274

eFFiciencw 1

~

T3D

O. 9

SP2

0.8

cluster

0.7 0.6

0o5 2

. . . .

,~ . . . .

6

. . . .

~} . . . .

10

P

e~iciency

~

0.9

T3D SP2

cluster

0.8 0.7 0.6 o.s--2

. . . .

,i

. . . .

6

. . . .

e

. . . .

1'o P

Figure 14. Efficiency of the distributed CNN algorithm, as a function of the available processors P with a 3 • 3 template and 240 • 240 (top) and 480 • 480 (bottom) images, on the SPARC-IP workstation cluster.

coarse-grained distributed implementation of a general CNN algorithm for CV has been presented. Four program versions were presented: three optimized for the Connection Machine CM-5, the Cray T3D and the IBM SP2 parallel machines and a general purpose version for any workstation cluster supporting PVM. The processing times of the message passing algorithm on the parallel machines measured with increasing image sizes are proven at least 20-30 times faster than on a SPARC20 workstation. Moreover, significant improvements can be obtained in the size of the images which can be processed. The favorable coarse-grained parallelization scheme adopted in the design of the distributed algorithm, however, allows effective performance results even when running on loosely coupled workstation clusters, given that large enough image sizes are processed. The communication bottleneck becomes important on the clusters of workstations only when running with very high numbers of processors or when too small problem sizes are stored on each processor.

275 ACKNOWLEDGMENTS The authors warmly thank Prof. Leon O. Chua of Berkeley University, Prof. Gianni Conte, Prof. Giovanni Adorni of Parma University and Dr. Pietro Sguazzero and Dr. Carla Conci of IBM for the helpful suggestions and the encouragement. This work has been made possible by the cooperation of IPG in Paris allowing us to use the Connection Machine CM-5 and CINECA in Bologna (Italy), allowing us to use the Cray T3D and the IBM SP2.

REFERENCES

10. 11.

12.

13.

14.

D. H. Ballard and C. M. Brown. Computer Vision. Prentice-Hall, Englewood Cliffs, 1982. D. Marr. Vision. Prentice-Hall, Englewood Cliffs, 1982. L.O. Chua and L. Yang. Cellular Neural Network: Theory. IEEE Transactions on Circuits and Systems, 35:1257-1272, 1988. L.O. Chua and T. Roska. The CNN Paradigm. IEEE Transactions on Circuits and Systems- I, 40:147-155, 1993. Ioannis Pitas editor. Parallel Algorithms for Digital Image Processing, Computer Vision and Neural Networks. John Wiley & Sons, New York, 1993. T. Agerwala amd J.L. Martin, J.H. Mirza, D.C. Sadler, D.M. Dias, and M. Snir. SP2 System Architecture. IBM System Journal, 34(2):152-184, 1995. Tommaso Toffoli and Norman Margolous. Cellular Automata Machines. MIT Press, Cambridge, MA, 1987. T. Roska and L.O. Chua. The CNN is Universal as the Turing Machine. IEEE Transactions on Circuits and Systems - I, 40:289-291, 1993. L.O. Chua and L. Yang. Cellular Neural Network: Applications. IEEE Transactions on Circuits and Systems, 35:1273-1290, 1988. W. K. Pratt. Digital Image Processing. John Wiley & Sons, New York, 1978. G. Adorni, V. D'Andrea, and G. Destri. A Massively Parallel Approach to Cellular Neural Networks Image Processing. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 423428, Rome, Italy, December 1994. P. Saint-Marc and G. Medioni. Adaptive smoothing for feature extraction. In Proceedings of Image Understanding Workshop, pages 1100-1113, Cambridge, MA, 1988. MIT Press. A. Kellner, H. Magnussen, and J.A. Nossek. Texure Classification, Texture Segmentation and Text Segmentation with Discrete-Time Cellular Neural Networks. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-94, pages 243-248, Rome, Italy, December 1994. IEEE Press. H. Magnussen, G. Papoutsis, and J.A. Nossek. Continuation-Based Learning Algorithm for Discrete-Time Cellular Neural Networks. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA94, pages 171-176, Rome, Italy, December 1994. IEEE Press.

276 15. F. Dellaert and J. Vandewalle. Automatic Design of Cellular Neural Networks by means of Genetic Algorithms: Finding a Feature Detector. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-9~, pages 189-194, Rome, Italy, December 1994. IEEE Press. 16. N.N. Aizemberg, I.N. Aizemberg, and T.P. Belikova. Extraction and Localization of Important Features on Gray-Scale Images: Implementation on the CNN. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-9~, pages 207-212, Rome, Italy, December 1994. IEEE Press. 17. M. Balsi and N. Racina. Automatic Recognition of Train Tail Signs Using CNNs. In Proceedings of The Third IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-9~, pages 225-230, Rome, Italy, December 1994. IEEE Press. 18. J. Canny. A computational approach to edge detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 8:679-698, November 1986. 19. D.G. Goldberg. Genetic Algorithms in Search, Optimization and Machine Learning. Addison Wesley, Readings, MA, 1989. 20. T. Kozek, T. Roska, and L.O. Chua. Genetic Algorithms for CNN Template Learning. IEEE Transactions on Circuits and Systems - I, 40:392-402, 1993. 21. G. Destri. Discrete Time Cellular Neural Networks Construction Through Evolution Programs. In Proceedings of The Fourth IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-96, pages 473-478, Seville, Spain, June 1996. 22. A. Broggi and G. Destri. Expectation-driven segmentation: A Pyramidal Approach. In Proceedings of The International Conference on Image processing: Theory and Applications 1993, pages 147-150, SanRemo, Italy, June 1993. 23. Alberto Broggi and Simona Berth. Vision-Based Road Detection in Automotive Systems: a Real-Time Expectation-Driven Approach. Journal of Artificial Intelligence Research, 3:325-348, 12 1995. 24. G. Adorni, V. D'Andrea, G. Destri, and M. Mordonini. Shape Searching in Real World Images: a CNN-Based Approach. In Proceedings of The Fourth IEEE International Workshop on Cellular Neural Networks and their Applications CNNA-96, pages 213218, Seville, Spain, June 1996. 25. K. Hwang. Advanced Computer Architecture: Parallelism, Scalability, Programmability. McGraw-Hill, Inc., New York, 1993. 26. Thinking Machines Corporation. CM-5 CM Fortran Language Reference Manual Version 2.1., 1992. 27. Thinking Machines Corporation. CMMD Reference Manual Version 3.2, 1992. 28. Crag T3D System Architecture Overview. Technical report, Cray Research Inc., 1994. 29. Cray Research Inc. MPP Fortran Programming Model, 1994. 30. IBM Corporation. Parallel Programming Subroutine Reference, 1995. 31. IBM Corporation. AIX XL Fortran Compiler//6000 Language Reference, Version 3.1, 1994. 32. G. Destri, V. D'Andrea, and M. Pontremoli. Using a 3-D mesh massively parallel computer for Cellular Automata Image Processing. In Proceedings of First Italian

277

Workshop on Cellular Automata for Research and Industry A CRI-94, pages 191-200, Rende (CS), Italy, September 1994. 33. G. Destri and P. Marenzoni. Cellular Neural Networks: A Benchmark for Lattice Problems on MPP. In Proceedings of ParCo '95, Gent, Belgium, September 1995. 34. P. Marenzoni. Performance Analysis of Cray T3D and Connection Machine CM-5: a Comparison. In Proceedings International Conference "High-Performance Computing and Networking HPCN'95", pages 110-117. Springer-Verlag, May 1995.

278 Giulio Destri

Giulio Destri received his Laurea degree in Electronic Engineering from the University of Parma in 1992, discussing a Master's Thesis about the implementation of Image Processing techniques on Massively Parallel Architectures. He is currently a Ph.D. candidate in Computer Engineering at the Dipartimento di Ingegneria dell'Informazione of the University of Parma. His research interests include the study of parallel paradigms such as Cellular Neural Networks, and parallel and distributed processing, and the application of them to computer vision. He has been involved in the Eureka project PROMETHEUS, an EEC project for improving traffic safety. From April to June 1996 he has been a TRACS visitor at the Edinburgh Parallel Computing Centre and at the AI department of the University of Edinburgh, Scotland. Giulio Destri is a member of AI*IA, IEEE, and the IEEE Computer Society. Home Page: http://www.ce.unipr.it/people/destri

Paolo Marenzoni

Paolo Marenzoni received his Laurea degree in Physics from the University of Parma in 1992, discussing a Master's Thesis about the parallel implementation of Monte Carlo simulations on the Connection Machine CM-2. He is currently a Ph.D. candidate in Computer Engineering at the Dipartimento di Ingegneria dell'Informazione of the University of Parma. His major interests cover the field of parallel and distributed algorithms and programming languages. In particular, he has been interested in the study of the parallel solutions of Petri Nets, the application of computational paradigms to distributed image processing and the implementation and optimization of communication protocols in distributed environments. Paolo Marenzoni is a member of IEEE and the IEEE Computer Society. Home Page: http://www.ce.unipr.it/people/marenz

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

279

Decision Trees on Parallel Processors Richard Kufrin National Center for Supercomputing Applications University of Illinois at Urbana-Champaign 405 N. Mathews Ave., Urbana, Illinois 61801, U. S. A. rkufrin 9 u i u c . edu A framework for induction of decision trees suitable for implementation on sharedand distributed-memory multiprocessors or networks of workstations is described. The approach, called Parallel Decision Trees (PDT), overcomes limitations of equivalent serial algorithms that have been reported by several researchers, and enables the use of the very-large-scale training sets that are increasingly of interest in real-world applications of machine learning and data mining. 1. I n t r o d u c t i o n One of the most active areas of machine learning research over the past several years has been the development of algorithms for supervised learning (or learning from examples) [13]. Numerous techniques have been proposed, implemented, and extended by researchers from several disciplines; all have the primary goal of deriving a concept description from pre-classified examples - or equivalently, of inducing generalized descriptions of one or more classes given a set of examples representing instances of each class. The ability to classify unseen examples is applicable to a wide variety of real-world tasks. Sample applications include medical diagnosis~ financial analysis and forecasting, engineering fault diagnosis, and information retrieval [11]. There has been a great deal of interest in the emerging field of knowledge discovery in databases (KDD). Classification of examples extracted from real-world databases can be expected to involve huge amounts of training data; hence the ability to cope with extremely large training sets efficiently is an active research topic within the KDD community [4,6]. Most studies of machine learning algorithms to date have involved training sets of small to moderate size (for example, the mean size of training sets in the UCI Machine Learning Repository [15] is less than 2500 examples per d a t a b a s e - a figure that drops by approximately one-third if one excludes the two largest databases in the repository). To effectively deal with increasingly-large real-world databases, machine learning algorithms that are both space- and time-efficient are needed. Concurrent to advances in machine learning algorithms, development of hardware and software technologies that enable the application of multiple processors to the solution of problems have brought massive parallelism from the prototype phase to the production environment. Massively-parallel architectures are now in routine use in commercial settings for scientific, engineering, and business applications. Even more pervasive has

280 been the development of extremely high-powered workstations that now bring computeand memory-capacities once available only through supercomputer-class systems to the desktop. Further, the introduction of enabling software such as PVM, Express, and Linda (among others) that allow the creation of "virtual" parallel machines composed of networks of workstations has enabled cost-effective parallelism to be exploited by individual research groups for whom the purchase of tightly-coupled multiprocessors would be impossible. In the coming decade, we can expect the incorporation of new network technologies (notably ATM) to provide a distributed network computing environment for parallel applications in which inter-processor transfer rates of hundreds of megabytes per second will be possible [23]. The following sections describe Parallel Decision Trees (PDT) - a strategy for implementing a class of symbolic inductive inference algorithms within a parallel computing framework available today in shared- and distributed-memory multiprocessors or networked workstations. Section 2 gives an overview of this class of algorithms and approaches to parallelization, Section 3 presents the details of the PDT algorithm, Section 4 presents empirical results, and Section 5 describes additional modifications to the algorithm for improving performance. Section 6 discusses the incorporation of this parallelization approach to other inductive learning programs, and Section 7 provides a summary and offers suggestions for future work.

2. Supervised Learning Several different paradigms for supervised learning are in common use today, including neural networks, instance-based, genetic algorithm, rule induction, and analytic methods [10]. Numerous investigators have conducted empirical comparisons of the performance of representative systems from each of these classes (for a recent example, see [14]). Although no consensus has been reached regarding the relative accuracy of classification among these methods across problem domains, it is clear that methods such as decision trees require far less CPU time to induce a classifier due in part to the greedy algorithms employed by these techniques. However, recent studies of the scalability of symbolic methods have indicated that, for extremely large training sets, even decision tree algorithms can require an inordinate amount of CPU time to complete [3,17,12]. This work is concerned with decision tree algorithms and their application to very large training sets. We begin with a brief review of terminology; for a thorough description of methods for inducing decision trees, see [20,1,24]. Notation and Terminology The fundamental task of supervised learning algorithms is to find some representation of one or more classes (or concepts), denoted {C1, C 2 , . . . , CN}, from a training set T of preclassified examples (or cases), denoted {XI, X2,... ,Xm}. Each example Xi in T consists of a set of k attribute values described by a vector Xi = {xl,x2,... ,xk} and an associated class label c. Attributes in X may be categorical, where the domain of x~ is finite and has no inherent ordering, continuous (i.e., real-valued), or ordinal. Ordinal attributes, like continuous attributes, have a well-defined order among elements but are restricted to a countably infinite domain. These attributes are often the result of a discretization step

281

>

CL

2

= IIIGlt

I CLASS2

CI..~~S3 lO7

,q~ lO.5

GLASS4

CLASS1

Figure 1. A decision tree.

applied to a continuous attribute. A split, denoted S(T, xi), defines a partitioning of the examples X in T according to the value of attribute x~ for each example. The result of split S is two or more subsets of T, denoted {T1,T2,..., TD(x~)}, where D(x~) denotes the cardinality of attribute xi (in the case of continuous attributes where S enforces a binary split of T, D(xi) 2). =

2.1. D e c i s i o n T r e e s Decision tree methods are a family of learning algorithms that use a divide-and-conquer approach to inducing a tree-based representation of class descriptions. Among the most well-known decision tree algorithms are Quinlan's ID3 and its successor C4.5 and the CART method of Breiman et al. For consistency, we focus on ID3 methods hereafter; see Section 6 for comments regarding general applicability of the parallelization strategy described. Figure 1 shows a decision tree classifier induced from a four-class training set with three attributes: vl, v2, and v3. vl is ordinal with the possible values high, raed, or low, v2 is continuous, and v3 is categorical with values red, green, or blue. To classify an unseen case using this tree, a path is traced from the root of the tree to a leaf according to the value of attribute x encountered at each internal node. When a leaf is encountered, the class label associated with that leaf becomes the predicted class for t h e new case. For example, to classify a case with attribute vector X = {raed, 25.1, blue}, the path v3 = blue, v2 < 50, vl = reed leads to the leaf labeled c l a s s 3, so this class is chosen as the predicted value of the new case. Note that, in the decision tree of Figure 1, attribute

282

vl has been treated as if categorical in type, with separate branches created for each possible value. In practice, ordinal attributes are often treated as continuous, so that internal nodes associated with vl are labeled with relational tests such as "vl < reed" rather than tests of equality as shown here (we will return to this issue in Section 3.2). Having described the procedure for classifying unseen cases with an existing decision tree, we turn our attention to the issue of training, that is, determining the structure of a decision tree, given a particular pre-classified training set. Top-down decision tree algorithms begin with the entire set of examples and repeatedly subdivide the training set according to some heuristic until the examples remaining within a subset represent only a single class (or, if the available attributes do not sufficiently discriminate between classes, when no further discrimination is possible). A great many variations on this approach have been investigated, but in general, these algorithms follow a recursive partitioning scheme with the following outline: 1. Examine the examples in T. If all examples belong to the same class Cj, create a leaf node with label Cj and terminate. 2. Evaluate potential splits of T according to some "measure of goodness" H and select the "best" split, S(T, x~). If all attribute values are identical within examples in T or if no potential split appears beneficial, determine the majority class Cj represented in T, create a leaf node with label Cj and terminate. 3. Divide the set of examples into subsets according to the split S selected in step 2, creating a new child node for each subset. 4. Recursively apply the algorithm for each child node created in step 3. Decision tree algorithms can be themselves classified by how they address the following issues [1,16]: 9 restrictions on the values of xi (i.e., categorical, ordinal, continuous), 9 methods for constructing (and restrictions on) candidate partitions (S), 9 measures for evaluating candidate partitions (H), 9 approaches for coping with missing values, 9 approaches for pruning the resulting tree to avoid overfitting, and 9 strategies for dealing with noisy attributes or classifications. With respect to these issues, ID3 (as originally proposed): 9 accepts categorical and continuous data, 9 partitions the data based on the value of a single attribute, creating branches for each possible value,

283 9 uses the information-theoretic criterion uating a candidate partition S(T,x~),

gain

as the heuristic H for a means of eval-

9 provides for missing attribute values when evaluating H by assigning examples in T with unknown values for x~ in proportion to the relative frequency of known values for xi across C. Examples with unknown values for the most promising split are then discarded before the recursive call to the algorithm. The gain criterion evaluates the weighted sum of entropy of classes conditional on the selection of variable xi as a partition; at the core of this calculation is the determination of entropy (also known as info):

info(T) = - ).~

freq(Cj, T)

(freq!Cr * log2 k

IT I

where freq(Cj, T) represents the number of examples of class Cj among the total examples in T, and ITI is the total number of examples in T. By using the gain criterion as the heuristic for evaluating potential splits of the training set, ID3 attempts to judiciously select those attributes that discriminate among the examples in the training set so that, on average, the impurity of class membership at each node is reduced as quickly as possible. C4.5 is a descendant of ID3 that incorporates several additions to improve the capabilities and performance of the parent system. These improvements include: 9 refined information gain criterion to adjust for apparent gain attributable to tests with many attribute values, 9 modified approach to handle missing values during training and classification where examples with unknown values for a partitioning criterion are "fragmented" among child nodes {T1, T2,..., Tr)(x~)}, 9 methods for pruning to compensate for noise and to avoid overfitting the training set, and 9 providing for

value groups,

which merge subsets of attribute values.

In the remainder of this paper, references to ID3 should be taken to include the decision tree induction component of the C4.5 system, except where noted. Unlike categorical attributes, an infinite number of candidate splits are applicable to continuous attributes. ID3 (like CART) attempts to create binary splits for continuous attributes of the form x~ < = K, where K is some constant. Although there are an infinite number of possibilities for the choice of K, ID3 examines only m - 1 candidates, which are exactly those represented in the training set. The information gain is computed for each of the m - 1 candidates and is used (as in the categorical case) to evaluate possible splits.

284

Computational Requirements To evaluate the information gain associated with a split of T based on attribute xi, we must determine the class frequency totals for: 1. all examples in T, and 2. each subset T~ based on partitioning T according to possible values of xi. Quinlan [19] notes that the computational complexity of ID3 (for categorical attributes) at each node of the decision tree is O ( ] N [ , [A[), where N is the number of examples and A is the number of attributes examined at the node. A separate analysis that focused on the effect of continuous attributes on ID3's time requirements concludes that the total cost of the algorithm is over-quadratic in the size of the training set [17]. Clearly, the use of continuous data greatly expands the domains for which ID3 is useful, however it also significantly increases the computational time required to build a decision tree. To speed the examination (and associated frequency sum calculation) of the candidates, ID3 first sorts the training examples using the continuous attribute as the sort key. The sorting operation, which increases the computational requirements to O(m log2 m), contributes to potentially exorbitant CPU time for large training sets. In empirical studies involving very large training sets, Catlett [3] writes: ... as training sets become enormous, error rates continue to fall slowly, while learning time grows with a disturbingly large exponent . . . . Profiling on large training sets shows that most of the learning time is spent sorting the values of continuous attributes. The obvious cure for this would be not to sort so many continuous values, provided a way could be found of doing this that does not affect the accuracy of the trees, which may hinge on very precise selection of critical thresholds. Catlett's solution to the above problem is called peepholing; the basic idea is to discard a sufficient number of candidates for threshold values so that the computational expense of sorting is lessened. It is (approximately) an intelligent sampling of the candidates that aims to create a small "window" of threshold values; this window is then sorted as usual. Empirical results showed that peepholing produced significant improvements over the traditional ID3 algorithm for several large training sets, however there is no guarantee that this approach will perform with consistent accuracy over all possible domains.

Pruning Although not the focus of the present work, simplification of decision trees through pruning techniques is an important component of any decision tree algorithm. It is sufficient to note that several methods have been developed, some of which estimate error rates using unseen examples or cross-validation techniques while other approaches base simplification decisions on the examples used to induce the tree. In either case, we need only to obtain misclassification totals for the (training or test) set in order to predict error rates for the purposes of pruning. No aspect of the algorithm presented here precludes following an appropriate pruning algorithm as the entire training set is available throughout execution.

285

Figure 2. A model-driven parallelization of decision tree induction.

2.2. Approaches to Parallel Induction of Decision Trees For training sets of small to moderate size, ID3 is computationally inexpensive - it is unnecessary to apply parallelism to gain a benefit in execution time. However, when applied to massive quantities of data, eventually the sheer size of the training set can be expected to require non-trivial amounts of computation. Additionally, one can employ the aggregate available memory of distributed-memory multiprocessors or workstation clusters to accommodate ever-increasing sizes of training sets that may not be feasible on individual machines. M o d e l - d r i v e n Figure 2 shows a model-driven parallelization strategy for decision tree induction, which may seem to be the most natural strategy of assigning processing elements to nodes of the decision tree and reflects the "divide and conquer" nature of the algorithm. Although appropriate to many search strategies such as branchand-bound, the limitations of this approach when applied to decision tree induction become apparent. It is difficult to partition the workload among available processors (as the actual workload is not known in advance) - if the partitioning of Figure 2 is chosen, clearly the processor assigned the middle branch of the root of the tree will complete first and will idle. Alternatively, a "master-worker" scheme for task assignment, where available processors are assigned the task of determining the best attribute for splitting of a single node and then is returned to a "waiting pool" may exhibit excessively fine-grained parallelism the overall computation time may be dominated by the overhead of task assignment and bookkeeping activities. In both approaches, potential speedup is limited by the fact that, on a per-node basis, the root of the decision tree requires the largest computational effort as all m examples and k attributes must be examined to determine the initial partitioning of the full training set. Finally, this approach assumes global access to the training set, preventing efficient implementation on distributed-memory parallel platforms.

286

x1 x2 x3 x4

Training Set

~.

Figure 3. An attribute-based parallelization of decision tree induction.

A t t r i b u t e - b a s e d Shown in Figure 3, attribute-based decomposition is another strategy that associates each of p processing elements with kip independent subsets of the available attributes in X so that the evaluation of gain for all k attributes can proceed concurrently. This approach has the benefit of simplicity as well as achieving excellent load-balancing properties. Although this strategy does not require global access to the full training set, at least two limitations of attribute-based parallelism should be noted. The first involves potential load imbalance at lower nodes of the decision tree when data sets include a significant number of categorical attributes that are selected at higher nodes of the tree for splitting. Secondly, the potential for concurrent execution p is bounded by k, the total number of available attributes. D a t a - p a r a l l e l A data-parallel decomposition strategy, as shown in Figure 4, assigns "blocks" of training examples to processors, each of which executes a SIMD (singleinstruction/multiple-data) program on the examples assigned locally. A straightforward adaptation of a serial decision tree algorithm for data-parallel execution must still enable global access to the complete training set, discouraging development of implementations with this strategy. However, the PDT algorithm, described in Section 3, is a modified data-parallel approach that offers a solution to this limitation. Pearson [18] evaluated the performance of a combination of the "master-worker" approach and attribute-based decomposition. His experiments, implemented using the coordination language Linda on a Fujitsu cellular array processor, involved relatively complex strategies for processor assignment to tasks in order to compensate for rapidly-decreasing workloads in lower levels of the decision tree and the accompanying increase in the ratio of parallel overhead to "useful" work. Pearson's conclusion that "none of the results show a decrease in speed [ commensurate ] with the possible parallel computation" underscores the drawbacks of this strategy.

287

I I!

I

I I

I I

I

!

.... T r a i n i n g S e t

Figure 4. A data-parallel approach to decision tree induction.

3. T h e P D T A l g o r i t h m Returning to the data-parallel approach shown in Figure 4, we see that the motivation behind this decomposition strategy arises from the observation that most decision tree induction algorithms rely on frequency statistics derived from the data itself. In particular, the fundamental operation in ID3-1ike algorithms is the counting of the attribute value/class membership frequencies of the training examples. Parallel Decision Trees (PDT) is a strategy for data-parallel decision tree induction. The machine model employed assumes the availability of p processing elements (PE), each with associated local memory. The interprocessor communication primitives required are minimal: each PE must be able to send a contiguous block of data to its nearest neighbor; additionally, each PE must be able to communicate with a distinguished "host" processor. This machine model is general enough so that the strategy may be employed on currently-available massively-parallel systems as well as networks of workstations. Because the communication patterns involved are regular, with the bulk of transfers involving only nearest-neighbor PEs, the additional overhead incurred due to inter-processor communication is kept to a minimum (certain optimizations may be employed if the underlying machine supports them; these are described in Section 5). 3.1. D a t a D i s t r i b u t i o n PDT partitions the entire training set among the available PEs so that each processor contains within its local memory at most [m/p 1 examples from T. This partitioning is

288 static throughout induction and subsequent pruning. No examples are allocated to the host processor, which is instead responsible for: 1. Receiving frequency statistics or gain calculations from the "worker" PEs and determining the best split. 2. Notifying the PEs of the selected split at each internal node. 3. Maintaining the data structures for the decision tree itself. As attributes are chosen for splitting criteria associated with internal nodes of the decision tree, the host broadcasts the selected criterion to worker processors that use this information to partition training events prior to the recursive call to the algorithm at lower levels of the tree. 3.2. P a r a l l e l E v a l u a t i o n of C a n d i d a t e Splits In PDT, the evaluation of potential splits of the active training subset T proceeds ve.ry differently according to the type of attribute under consideration. We turn our attention first to the simpler case. Categorical Attributes The calculation of class frequency statistics for categorical variables is straightforward: each processor independently calculates partial frequency sums from its local data and forwards the result to the host processor. For an n-class training set, each of the a categorical attributes xi that remain under consideration will contribute n , D(xi) values to the host processor (where D(xi) again denotes the cardinality of attribute xi). These intermediate results are combined at the host processor, which can evaluate the required "measure of goodness" H. Each PE requires O ( m / p , a) time to complete the instance-count additions for its data partition; the information gain calculations (still computed on a single processor) remain the same. Communication between the host and PEs is now required, restricting the potential speedup to less than optimal (we do not consider effects of hierarchical memories that may lessen the penalties of the host/PE communication). Because the host processor is responsible for combining the partial frequency sums obtained from each PE, no communication between individual PEs is required for these tasks.

Continuous Attributes Continuous attributes pose at least two challenging problems for a data-parallel implementation. First, as in the serial implementation, we have no a priori knowledge of the candidate splits present in the training set. Since the data is distributed among the PEs, a parallel sort is required to allow a scan and update of the thresholds if we adhere strictly to the serial version of the algorithm. Although PRAM formulations of distributed sorts have been described that exhibit a parallel run time of O(log2N), implementations on more realistic parallel models are far less scalable and can vary depending on run-time conditions. Even if a distributed sort is available, the subsequent frequency count update step across PEs is not fully parallelizable due to dependencies on frequency counts from

289 preceding PEs in the sorted sequence. Second, it is likely that the calculation of information gain associated with all possible thresholds for continuous attributes will consume much more time than for categorical attributes if we concentrate all of the information gain calculations at the host processor. By following a different approach in the case of continuous attributes, we can significantly reduce the time complexity associated with these attributes while still evaluating all possible thresholds. The key observation is that it is not necessary to produce a single sorted list of all training examples. As mentioned earlier, we are only interested in frequency statistics gathered from the data sorting merely enables gathering of these frequencies in a single pass. A second observation is that, while the calculation of information gain for categorical attributes is most conveniently done at the host processor, we would do better to evaluate all m potential gain calculations for continuous attributes in parallel at the level of the worker PEs. The solution is to incorporate a three-phase parallel algorithm as shown in Figures 5 and 6. The strategy for evaluating candidate splits associated with continuous attributes in P DT can be summarized as follows: 1. L o c a l p h a s e . Perform p concurrent sorts of the partitions of data local to each PE. As in the serial ID3 implementation, determine the frequency statistics for each (local) candidate threshold as usual. Note that ID3 chooses data values present in the training set for candidate thresholds while other algorithms choose midpoints between data values - - either approach is suitable in this framework. 2. Systolic phase, p - 1 exchanges of each PE's local thresholds and associated frequency statistics ensue. Each PE receives the thresholds and frequencies calculated in step 1 from its neighbor; as these are already sorted, they can be merged with the current partial sums in a single pass. After all p - 1 exchanges have occurred, all PEs contain the frequency statistics required for information gain calculations, that are then calculated (locally) within-processor. 3. Reduction phase. Each PE determines the "best" candidate within its assigned subset. The candidate threshold and associated information gain value are sent to the host processor from all p workers; the host selects the best threshold and updates the decision tree once all requisite gain calculations are obtained for all candidate attributes.

Ordinal Attributes and Example Compression An important distinction between thresholds and examples should be noted. As depicted in Figures 5 and 6, it may appear that processors executing the P D T algorithm must exchange the entire active set of examples during the systolic phase of the algorithm. In fact, what must be shared are not examples, but thresholds. In the case of continuous (real- or integer-valued) attributes, these may be identical. However, in the case of ordinal attributes, it is possible that the domain of the attribute is far less than the number of representative examples of that attribute in the data set. More precisely, for an ordinal attribute that may assume one of d values, we can reduce the amount of data that must

290

I~-~176 ~176~o~1

I~o~i~.~ ~o~~o~!

I~176 ~~ ~~ ~~ ~~

~,~ Io.~I~.~I~.~I~.~I~.oI I!.oI~.~Io-oI~.oI~.oI I~.~I.... q~.ol ,, ~,'I ~I~I ~I~I ~I I~I~I ~I ~I ~I I ~I ~I ~I ~I ~I '~,~1, ol ot~t~l~q Io, I otol~t~ Io, ! ~1 ~i~l~

l~.~l~176176 I~I ~I~I ~I ~I I;l~, I~l~l~

~, i~.oi~.~i~.~io.~I~.oI ~,~I~I~I~121~i SORT

~,~ I~ ~.~I~.~I~.~I~176 ~,' i '~ I~ I~ I~ I;I PARTIAL FREQUENCY CALCULATION

SYSTOLIC SHIFT

1

s~, 1o.~l~.~l~.~-1~.4o.o~i~.~t2.~l~.ol~.ol~oo~ - I~i~i~.ol~.~l~i~l~.o~t~-.~l~.olo-~l~.~t~.~l.

.s x, ol01 l l llololol l I 10101~i~i0q

i.0i01~l,l~q

I I l 101 ohll l l l l h I. I lliq

I. i111

ql

Figure 5. Parallel formulation of information gain calculation for continuous attributes. Values of continuous attributes for each example are labeled Xi; associated classes are labeled C(X~). S(X~) indicates a sorted list of attribute X~. L(S(X~)) denotes the frequency count of examples less than or equal to the candidate threshold - - a similar count is maintained for examples greater than the threshold. For clarity, the algorithm is shown as applied to a two-class problem; extension to multi-class training sets is straightforward.

291

LOCAL GAIN CALCULATION

s~,)lo.~il.~l~~~l~.sl,.ol 11.512.716.ol7.ol9.o I 12.91~.ol,.118.519., I 12.11,.616.117.919.91 ~ , ) , t oil~ott~~.

t,o i o l o t ~ l ~

t,o t ~ l ~ i ~ l . ~

t,~ l ~ l ~ t ~ l ~

,ololol~l~--t, ol~l~l~l,~-~,~l~l~l~l~-~,olol~l~l ~

~,'~,))t,olot~t~

i,ol~lotst~o~

t,~t~tol~oll~n t,~t~tTl~i~

"('~('~-~,~I.~q."ol-~l.~71.-I

I.~71.~71-~1.-I.-I

I."o1-~1--I.~71-~1

I.~1.~1.~1.-I.-I

[!,,~i].,71.~1.,i., i

tii~i.~,l.,i. ~71.~1

H.-I.-I.-I.991

GLOBAL REDUCTION OF GAIN

,c,~,.,,:,, i. ~l,,i~]. 9~1.~71.,i

Figure 6. Parallel formulation of information gain calculation for continuous attributes (cont'd). Lg(S(X~)) contains the accumulated global frequency counts for thresholds (S(X~)). After the p - 1st shift, information gain H(S(Xi)) can be calculated locally. In the final stage, each PE submits its local "best" threshold (indicated by shaded boxes in the lower figure) to the host processor, which selects the "winner" from the p candidates.

292 be communicated between processors during each step of the systolic phase of PDT by a factor of I - ~--~/. This factor represents the amount of example compression, which can contribute greatly"" to improved performance of the algorithm in practice. Note that an alternative approach for treatment of ordinal attributes would likely produce superior performance improvements, both for sequential and parallel implementations of ID3 [8]. In this approach, class membership totals for ordinal attributes are gathered as if categorical (requiring no sorting of individual examples), after which the information gain associated with each binary split is calculated. The current version of PDT does not implement this strategy, instead treating ordinal and continuous attributes identically.

3.3. Training Set Fragmentation The recursive partitioning strategy of divide-and-conquer decision tree induction inevitably transforms a problem well-suited to data-parallel execution into a subproblem in which the overhead of parallelism far outweighs the potential benefits of parallel execution. At each internal node of the tree, the number of available examples that remain "active" decreases according to the chosen split; this is referred to as training set fragmentation. At the same time, the overhead associated with parallelism remains constant and is proportional to the number of processors p (recall that the systolic phase of PDT requires p - 1 steps). This overhead can be expected to quickly dominate the processing time, particularly in situations where the training set is evenly split at each internal node. Early experiments with P DT showed that parallel execution required an order of magnitude more processing time than serial execution on identical training sets; virtually all of the additional time was consumed by communication of small subsets of examples at lower nodes of the decision tree. The simplest remedy for coping with the effect of training set fragmentation on parallel execution is to monitor the size of the subset of training examples that remain active at each step of the induction process. When this subset reaches a user-selected minimum size threshold, the remaining examples are collected at the host processor which assumes responsibility for further induction associated with the example subset. Parallel execution is suspended until the host processor has induced the complete subtree, after which all processors resume execution with the complement of the fragmented set of examples. This approach is used in the current implementation of the algorithm and is discussed further in Section 4.

3.4. Combining Data-Parallel and Attribute-Based Parallelism While the basic PDT algorithm provides an approach for data-parallel decision tree induction, clearly the overhead associated with communication can be substantial. Specifically, the systolic phase of the PDT algorithm requires ( p - 1) 9 k communication steps to collect all the information required before determining the most promising split so that an increase in the number of processors and/or the number of attributes causes a corresponding increase in the time required at each node of the decision tree. An extension to the basic PDT algorithm involves a combined data-parallel/attribute-based approach in which the pool of available processors is divided into j subsets, called processor groups, each responsible for evaluating potential gain for k/j attributes, concurrently executing j independent instances of the PDT algorithm (note that, when j = p, this strategy is effectively a pure attribute-based decomposition). For induction problems where both a

293 significant amount of compute power (in terms of available processors) and a moderate-tolarge problem dimensionality (in terms of attributes) is present, such an approach offers a solution that may ameliorate the problem of increased communication costs. 4. E x p e r i m e n t s To evaluate the performance of PDT on representative data sets, the algorithm was implemented in ANSI C using message-passing for inter-processor communication. In order to conduct experiments that would permit evaluation under differing architectures, two compute platforms were chosen. The first is a workstation cluster consisting of eight Hewlett-Packard (HP) 9000-735/125 workstations, each configured with 128 MB of memo.ry. The workstations are interconnected with a 100 Mb/sec fiber optic network based on an ATM switch. The second platform is a Silicon Graphics (SGI) Challenge multiprocessor with 8 MIPS R4400 processors and I GB of memory. The message-passing library used was Parallel Virtual Machine (PVM) software, a freely-available package for programming heterogeneous message-passing applications from the University of Tennessee, Oak Ridge National Laboratory, and Emory University [7]. Although PVM is most frequently used as a message-passing layer utilizing UDP/TCP or vendor-supplied native communication primitives, recent enhancements support message-passing on shared-memory multiprocessors using IPC mechanisms (shared memory segments and semaphores) to increase efficiency. The shared-memory version of PVM was used for experiments conducted on the SGI. The application was programmed in a single-program, multiple data (SPMD) style in which a distinguished processor acts both as "host" and "worker", while the remaining processors perform the "worker" functions exclusively. No particular optimizations were applied to reduce the communication overhead of this software except for specifying direct TCP connections between communicating tasks through the PVM PvmRouteDirect request. The application can run either in single-processor mode or in parallel. Care was taken to avoid executing "useless" portions of the application in the single-processor case so as not to penalize serial performance with parallel bookkeeping and overhead. D a t a Sets

Two data sets were used in the experiments. The first is from a biomedical domain, specifically the study of sleep disorders. In this data set, each example represents a "snapshot" of a polysomnographic recording of physical parameters exhibited by a patient during sleep. The continuous measurements are divided into thirty-second epochs so that a full sleep study of several hours in duration produces several hundred examples per study. Over 120 individual sleep studies were available, ranging in size from 511 to 1051 examples. These studies were combined into a single data set-of 105,908 examples, 102,400 of which were used for parallel benchmarks. Each example consists of 13 ordinal attributes of cardinality 11 and a class label of cardinality 5. The task is to identify the sleep stage (i.e., awake, light/intermediate/deep sleep, or rapid eye movements). For a more complete description of this domain and the classification task, see [21,2]. The second data set, constructed specifically for these experiments, is a synthetic data set (SDS) consisting of 6 continuous attributes and a class label of cardinality 3. Three of the

294 attributes are relevant to the classification task; the others are irrelevant. Classification noise was introduced in 15% of the examples (i.e., in 15% of the training set, the class label was chosen randomly). A total of 1 million synthetic examples were generated for this training set.

Baseline Timings PDT was compared with three implementations of the ID3 algorithm (one public domain, two commercial) to benchmark the sequential run time of PDT. Although the standard release of C4.5 was not the fastest, a minor modification to the C4.5 source resulted in run times consistently faster than the other implementations and approximately equal to P D T (for splits associated with continuous attributes, C4.5 performs a scan to determine the largest value present in the training set that lies below each split so that reported threshold values are actually represented in the training set; the modification removes this scan, resulting in splits that correspond to midpoints as in CART - the resulting tree structure is unchanged). Table 1 summarizes the results on the experimental data sets. It appears that neither PDT nor modified C4.5 hold a clear advantage in execution time; for the sleep data set PDT required approximately 15% less time to complete, while for the synthetic data set, C4.5 showed an improvement of nearly 6% over PDT.

Table 1 CPU time comparison (in seconds) of C4.5 and PDT. Data Set Training Set Size C4.5 sleep 105,908 2876 synthetic 1,000,000 13652

C4.5 (modified) 155 2342

PDT 133 2480

Speedup and e]ficiency are metrics commonly used to evaluate the performance of parallel algorithms and/or implementations [9]: S(p) =

E(p) =

T1 S(p)

P

Speedup (S) is defined as the ratio of the serial run time (T1) of an application to the time required for a parallel version of the application on p processors (Tp). We distinguish between apparent speedup, which measures the speedup of a given parallel application with respect to the same application run on a single processor, and absolute speedup, which measures the speedup of a parallel application to the best-known sequential implementation. Efficiency (E) measures the effective use of multiple processors by an application as a fraction of run time so that an efficiency of one indicates ideal parallel execution. Based on the similar execution time shown in Table 1, speedup and efficiency measures are reported with respect to the serial run-time of P DT.

295 140

i

|

|

i

pTotal /Local --,---/ Other - e - Sort -~(......

120

100 ,+

~ ........ m . _ . . ~ . . . . . . . . . . . . :=:'_':'-':h~':

. ..... . - - " ...----x-

4"

12800

25600

i 51200 Training Set Size

i 102400

Figure 7. Single-processor (HP) benchmark for the sleep data set.

|

|

2500

|

~Total / Local / Sort Other

2000

1500

1000 .....,e/

500

250000

500000 Training Set Size

1000000

Figure 8. Single-processor (SGI) benchmark for the synthetic data set.

-~--4--. -e-.-~.--

296 Figure 7 shows the single-processor run time of PDT on the sleep data set (HP workstation), varying the training set size from 12,800 to 102,400 examples. Similar results are shown in Figure 8 using the synthetic data set with training set size varying from 250 thousand to 1 million examples (using the SGI machine as the full synthetic data set could not be run on the HP workstation due to insufficient memory). For the purpose of both sequential and parallel benchmarks (to follow), timing results are broken down as follows: t o t a l The total execution time measured on the host processor, which consistently requires the most CPU time due to the additional work assigned to the host after examples are gathered when the active examples in the training set fall below the minimum fragmentation threshold. This time includes the overhead of spawning p - 1 worker tasks on non-host processors. local The total time on the host processor for local operations such as collecting and updating frequency counts, calculating information gain for candidate thresholds, performing example compression, and updating the decision tree. s o r t Total time on the host processor executing the library routine q s o r t prior to determining frequency statistics for candidate thresholds. o t h e r Total time on the host processor doing all other tasks such as I/O, initialization of internal data structures, memory management, etc.

Results from parallel runs of PDT also include: c o m m u n i c a t i o n Total time on the host processor spent sending and receiving threshold values and frequency statistics during the PDT systolic phase, broadcasting selected splits during induction, and receiving gathered attributes from worker processors after reaching the training set fragmentation limits. As is evident from Figures 7 and 8, the majority of time is spent in local operations unrelated to sorting. A further breakdown shows that, for the sleep data set, 75% of the time spent in local operations is due to counting of frequency statistics (47~ and copying of data (28%) as a prelude to sorting. A similar breakdown for the synthetic data set reveals that the most time-consuming local operation is entropy calculation (46%), followed by data copying (24%), with frequency statistic counting requiring a considerably smaller percent of time (12%). These differing rankings for components of local operations are primarily due to the nature of the data sets; recall that all attributes in the sleep data set have a limited domain and therefore require relatively few calculations to evaluate the partitioning heuristic. Surprisingly, sorting accounted for only 10% and 12% of total execution time for the sleep and synthetic data sets, respectively. Parallel Benchmarks

Figures 9 and 10 display the best obtained times for the sleep benchmark for 1 to 8 processors on the HP and SGI. CPU times, speedups, and efficiencies are presented in Table 2. As noted on the figures, in both cases processor group attribute subsets

297

140 Total Local -4--. C o m m u n i c a t i o n -la--. Sort ..~ .... O t h e r ..,i,..-

120

100

§

b ..................................................... .4~

..............................

9

9- . . . . . . . . . . . . . . . . .

0

I 1

I

,

,,

2

~

,,

,

4

I

8

Number of Processors

Figure 9. Benchmark results (HP) for the sleep data set (m = 102,400). eight-processor runs specified two and four processor groups, respectively.

140

,

i

1

Four- and

|

Total -4-Communication

120

L o c a l -~---Bo-. Sort --~-.. Other - ~ . -

100

....................

I

I

I

2

I

4

"4"

,,I,,

.

8

Number of Processors

Figure 10. Benchmark results (SGI) for the sleep data set (m = 102,400). Four- and eight-processor runs specified two and four processor groups, respectively.

298 Table 2 Speedup and efficiency of PDT (sleep data set, m - 102,400). Processors Machine CPU Time Speedup 1 HP 134 2 HP 120 1.12 4 HP 96 1.40 8 HP 88 1.52 1 SGI 133 2 SGI 112 1.19 4 SGI 101 1.32 8 SGI 95 1.40

Efficiency 0.558 0.349 0.190 0.594 0.329 0.175

provided the best timings. Interestingly, the efficiency of the workstation cluster (HP) on this benchmark was slightly superior to the multiprocessor (SGI) for 4 and 8 processors, with the workstation cluster requiring less time for all components of the computation except for communication. However, it is difficult to draw any clear conclusions from these tests; in practice, the time required to induce a decision tree from this data set is minimal, therefore the potential for gains through parallelism is quite small.

Table 3 Speedup and efficiency of PDT (synthetic data set, m - 500,000 for HP; m - 1,000,000 for SGI). Processors Machine CPU Time Speedup Efficiency 1 HP 1629 2 HP 1227 1.33 0.664 4 HP 1072 1.52 0.380 8 HP 915 1.78 0.223 1 SGI 2480 2 SGI 1645 1.51 0.754 4 SGI 1288 1.93 0.481 8 SGI 1130 2.20 0.274

Figures 11 and 12 present the best obtained times for the synthetic data set benchmark for 1 to 8 processors on the workstation cluster and multiprocessor. CPU times, speedups and efficiencies are presented in Table 3. As noted previously, the training set size is limited to 500,000 examples for the cluster due to memory constraints (PDT was run successfully with the full 1 million example training set on 4 and 8 processor configurations of the cluster, however the timings do not appear due to inability to present accurate speedups and efficiencies). The results show improved efficiency (versus the smaller sleep data set) for all processor totals. Although the SGI timings in Figure 12 were obtained with processor groupings for attributes (as in the sleep benchmarks), the HP numbers shown use only a single processor group (strict data-parallel execution) to assist in un-

299

,

|

|

i

-~ -+---

Total Local

Communication -B--.

1600

Sort ..~--..

Other ..a,.1400

1200

._~

1000

I-o

800

600

400

200

B-

I

I

1

2

I

I

4

8

N u m b e r of Processors

Figure 11. Benchmark results (HP) for the synthetic data set (m = 500,000). All runs specified one processor group (default PDT algorithm).

2500

|

| Total - ~ - Local -+--C o m m u n i c a t i o n .n... Sort -.~.... O t h e r -,~---

2000

1500

"-,,. ",11.....

+

500

.......................

---"

. . . . . . :==2

I

I

1

2

~.

........ ':':':"=::tt

I

I

4

8

N u m b e r of Processors

Figure 12. Benchmark results (SGI) for the synthetic data set (m = 1,000,000). Fourand eight-processor runs specified one and four processor groups, respectively.

300 =

=

|

i

|

|

8192

co

P=8 P=4

--4--.

P-2 P=I

-B--. ........

4096

I-Q.. 0

...B.

"§

bm O

Q

"~ o.

2048

lO24

,

,

1

4

's 1

"~'.,. ..... B ..... "-§

'

'

64

256

02 1

4

Training Set Fragmentation Threshold

Figure 13. Effect of various training set fragmentation thresholds on CPU time (SGI) for 2, 4, and 8-processor runs using the synthetic data set (m - 1,000,000). For comparison, single-processor run time shown as horizontal line.

derstanding the extent of performance degradation if the attribute-based dimension of parallelism is not exploited. As can be seen in Figure 11, the pure data-parallel approach leads to rapidly-increasing communication overhead, although these costs do not appear to dominate the total execution time until reaching 8 processors, at which point communication costs exceed those for local operations- improved interprocessor networks would allow data-parallel execution on greater numbers of processors. Figure 13 provides a closer look at the effect of various thresholds of training set fragmentation for 2, 4, and 8 processors. It appears that the optimal level for fragmentation (at least for this combination of hardware and problem definition) lies near the 1000example threshold; choosing smaller values causes communication overhead to adversely affect CPU time, while larger values concentrate an excessive amount of CPU time at the host processor, which is responsible for inducing the subtree corresponding to the gathered examples. Another view of the algorithm's behavior is shown in Figure 14, in which the total communication requirements for various "configurations" of PDT is shown. The leftmost points shown correspond to execution with a single processor group (data-parallel execution) - for the synthetic data set, the total communication volume is equivalent to sending the entire training set between processors over 30 times! Not unexpectedly, the volume is considerably less for the sleep data set due primarily to the effects of example compression, as discussed in Section 3.2. The benefits of attribute-based execution are

301

Synthetic, P = Synthetic, P = SynthetiC,P = Sleep, P = Sleep, P = Sleep, P =

Communication ~Bytes)

10000000000

8 4 2 8 4 2

-e.--+---a--. -M--~-.-~--

1000000000 100000000 ~

10000000

?

xL. " a k : "

100o000 100000

,.--"---~-.~-~.-"2~.~~ ""~

9 ~=(~-"

""-

= ~ ~ - 5 - - - - - - ~ ....... -~.% ".:.:.. -:~... "--<--=4 .-..~.. ~:',::.. "~-:.."r....":-S'...~. "%k-.j "~:" "'"r

10000

"" %;"

~ ~

51~~..L...:

Min. Frag. Threshold

"'e

"''~:~,~

. . . . . . . . . . . . . . . . . . . . . :::::::-.::;~-~.......i ................L . . ~ .... "-.-.,.:. ,,,

1

2

4

Processor Groups

Figure 14. Communication volume for PDT on the sleep and synthetic data sets, varying the minimum fragmentation threshold and number of processor groups.

clearly shown here, with the communication volume for the synthetic data set decreasing dramatically for pure attribute-based parallelism (where the number of processors equals the number of processor g r o u p s - interestingly, these configurations did not result in the minimum execution times presented previously). Finally, although larger values for the minimum fragmentation threshold reduce the communication volume somewhat, the total effect is clearly not as pronounced as that produced by the use of processor groups. 5. A d d i t i o n a l P e r f o r m a n c e I m p r o v e m e n t s There are additional opportunities for improvements to the overall performance of the P D T strategy. The first two enhancements are dependent on the capabilities of the underlying communications network and as such may not be possible in all implementations. The first optimization strategy stems from the observation that data communicated between each PE during the systolic phase associated with continuous attributes is readonly - - in essence, threshold values and associated frequency statistics simply "visit" all other PEs which do not "own" the examples used to generate the data. Because each PE that receives this block of data does not modify it in any way, we can take advantage of a s y n c h r o n o u s communication primitives that are available in several multiprocessor systems to overlap communication with computation. Specifically, before each PE updates its own partial frequency counts, it injects the just-received block of data into the communications network and initiates an asynchronous receive of the data expected from its other neighbor. Once these communications operations have been started, the PE can

302 process the current resident block of data as the actual transfer proceeds. Secondly, we can exploit the capabilities of multiprocessors that support cooperative communication primitives such as reduction operations. Sum-reductions, which sum the values contained in several PE's to produce a single sum, are applicable to the combining of frequency counts of categorical attributes, while min-reductions, which select the minimum of a group of values contributed by PEs, can be used to efficiently determine the best threshold value from the gains calculated by each PE during the processing of continuous attributes. The improved strategies for dealing with ordinal attributes alluded to in Section 3.2 are also likely to provide significant gains in performance. As shown in Figure 14, the treatment of ordinal attributes with a limited domain as continuous can contribute to large inter-processor communication requirements. Handling such attributes in a different manner should not only alleviate communication overhead, but allow for a reduced sorting task in these cases. The implementation of PVM software on shared-memory systems, although extremely convenient for code portability, introduces a subtle overhead that contributes to reduced performance for PDT. The act of sending a message from task A to task B first involves copying the message data from task A's private data space to a buffer within task A's shared memo .ry segment, then copying a small "message pending" header to task B's shared memo~, segment. When task B is ready to receive the message, the header is retrieved, then the message data is copied from task A's shared memory segment to task B's private data space. While this strategy is a reasonably efficient general-purpose messagesending protocol, it involves two data-copying steps that are not necessary in the case of the PDT algorithm. It is possible to avoid this overhead by direct management of sharedmemo .ry, leading to overall performance improvements and improved load-balancing (by enabling the use of a reduced minimum fragmentation threshold). Finally, the most promising optimization may come from the incorporation of the modeldriven parallelization strategy discussed in Section 2.2. In one sense, the approach for dealing with fragmented training sets that associates a processor with a related subset of training examples is related to this strategy, with the different aim of eliminating communication overheads, rather than exploiting parallel execution of independent subtasks. Perhaps the data-parallel approach of PDT, yielding to the more coarse-grained model-driven strategy after the training set has been divided into partitions that can be comfortably accommodated by the resources of individual processors, will provide for the optimal parallel solution.

6. E x t e n s i o n s The strategy outlined here is of potential benefit to other types of decision tree algorithms; we close with three examples. For consistency, we have focused on ID3 and derivatives; however, several other strategies have been proposed as alternatives to selection criteria based on entropy. For example, Fayyad and Irani [5] present arguments regarding the suitability of entropy-based selection criteria; several other measures are used in [1,16]. Note that the PDT approach is suitable for any algorithm that evaluates potential splits based solely on class frequency

303 information. ID3-Iike algorithms induce axis-parallel decision trees in which each test of a single attribute value is equivalent to a hyperplane parallel to an attribute dimension in feature space. Recently, techniques for induction of oblique decision trees such as Oblique Classifier 1 (OC1) [16] have been developed that have the potential of capturing a more general class of concepts more succinctly and accurately than their axis-parallel counterparts. OC1 employs hill-climbing and randomization to identify oblique splits of the training data. Because OCI's search starts at an axis-parallel split that is subsequently perturbed into oblique orientations to find better partitions, the strategy outlined by PDT can benefit algorithms such as OC1. ID3 is a non-incremental (or batch) inductive procedure; all training examples are required prior to induction of the classifier. Several modifications to ID3 have been proposed to provide for an incremental variant of the algorithm. Utgoff's ITI [22] is an example of an incremental decision tree method that augments the decision tree with frequency counts associated with each non-leaf node. The information necessary to construct the initial tree is available in this parallel formulation, therefore the "bootstrap" procedure can benefit from parallel construction using the PDT method, and the resulting tree can be modified as new examples are presented. 7. C o n c l u s i o n The PDT algorithm appears to be a viable approach to induction of decision trees from massive data sets, although there is clearly ~/great deal of room for improvement and much work remains to establish the properties'of this algorithm in practice. In particular, although the results presented here are encouraging, the experiments did not exercise the algorithm on the truly massive data sets for which it was designed. Both the sleep and synthetic data sets contain only a small number of attributes; while the synthetic data set contains a large (not massive) number of examples, the sleep data set is only moderate in size. As such, the potential for performance gains through parallelism is quite small; this is confirmed by the experimental results. However, the sleep data set represents a real-world training set that has been previously studied and both data sets allow direct comparison between the PDT algorithm executed serially and with multiple processors. Ideally, future experiments would involve millions of examples with hundreds of attributes. The focus of work to date has been on continuous attributes, which have been previously identified by several researchers as presenting challenging problems for large-scale decision tree induction. However, a truly general induction algorithm will accept not only continuous (and ordinal) attributes, but also categorical attributes. As described, the PDT algorithm will accommodate categorical attributes, but implementation and evaluation remains as future work. Also left unaddressed is the problem of coping with missing values. As noted previously, various approaches have been proposed, one of the most common being example fragmentation as used in C4.5. While this strategy can be readily incorporated into the PDT framework, it is not as clear whether approaches such as CART's "surrogate splits" are feasible; this remains an open issue.

304 The performance of PDT on tightly-coupled distributed-memo .ry multiprocessors is of considerable interest. The choice to target the current software development to allow experiments on a network of workstations was deliberate, so that a challenging, lessthan-ideal platform would be used. I expect that these types of systems have the most widespread appeal and availability and hope that these results will encourage use of workstation clusters for decision tree induction. However, truly massively-parallel systems remain available and their use with algorithms such as PDT demands reasonable scalability; work to evaluate the scalability of PDT on massively-parallel systems both empirically and through formal analysis of the algorithm's behavior is an interesting future research area. The goal of applying parallelism to achieve a reduced time-to-solution for a particular problem size has been the focus of much previous work on parallel and distributed computing. An equally important pursuit is the application of multiple CPU/memo .ry resources to solve extremely large-scale problems previously unapproachable - perhaps this work will assist in providing future solutions to such problems.

Acknowledgements I am grateful to Sylvian Ray for encouragement and comments during the development of this work and for providing the sleep stage data used for evaluation of the PDT algorithm. Many suggestions by Christian Suttner and Ronny Kohavi improved the content and presentation greatly. Ross Quinlan clarified technical points regarding the implementation of C4.5. Finally, I would like to thank Klaus Schulten and the NIH Resource for Concurrent Biological Computing at the University of Illinois for providing access to the computational facilities used to perform benchmarks. Dorina Kosztin and Aritomo Shinozaki were particularly helpful in providing technical advice in the use of these facilities; their assistance is greatly appreciated.

REFERENCES 1. Leo Breiman, Jerome H. Friedman, Richard A. Olshen, and Charles J. Stone. Classification And Regression Trees. Wadsworth International Group, Belmont, CA, 1984. 2. Jason Catlett. Megainduction: Machine Learning on Very Large Databases. PhD thesis, Basser Department of Computer Science, University of Sydney, June 1991. 3. Jason Catlett. Peepholing: Choosing attributes efficiently for megainduction. In Proceedings of the Ninth International Workshop on Machine Learning, pages 49-54, 1992. 4. Usama M. Fa.yyad, Grego.ry Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy. Advances In Knowledge Discovery and Data Mining. AAAI Press/The MIT Press, Menlo Park, CA, 1996. 5. Usuma M. Fayyad and Keki B. Irani. The attribute selection problem in decision tree generation. In Proceedings of the Tenth National Conference on Artificial Intelligence, pages 104-110, San Jose, CA, 1992. 6. William J. Frawley, Gregory Piatetsky-Shapiro, and Christopher J. Matheus. Knowledge discovery in databases: An overview. AI Magazine, pages 57-70, Fall 1992.

305 Reprint of the introductory chapter of Knowledge Discovery in Databases, AAAI/MIT Press, 1991. 7. A1 Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert Manchek, and Vaidy Sunderam. P VM: Parallel Virtual Machine - A Users' Guide and Tutorial for Networked Parallel Computing. The MIT Press, Cambridge, Massachusetts, 1994. 8. Ronny Kohavi. Personal communication. 9. Vipin Kumar, Ananth Grama, Anshul Gupta, and George Karypis. Introduction to Parallel Computing: Design and Analysis of Algorithms. Benjamin/Cummings, Redwood City, CA, 1994. 10. Pat Langley. Elements of Machine Learning. Morgan Kaufmann, San Francisco, CA, 1996. 11. Pat Langley and Herbert A Simon. Applications of machine learning and rule induction. Communications of the ACM, 38(11):55-64, November 1995. 12. J. Kent Martin and Daniel S. Hirschberg. The time complexity of decision tree induction. Technical Report UCI ICS-TR-95-27, University of California at Irvine, 1995. 13. R. S. Michalski, J. G. Carbonell, and T. M. Mitchell. Machine Learning: An Artificial Intelligence Approach. Tioga, 1983. 14. D. Michie, D. J. Spiegelhalter, and C. C. Taylor. Machine Learning, Neural and Statistical Classification. Ellis Horwood, 1994. 15. Patrick M. Murphy and David W. Aha. UCI repository of machine learning databases. University of California, Department of Computer and Information Science. Irvine, CA, 1994. Machine-readable data repository. 16. Sreerama Murthy, Simon Kasif, and Steven Salzberg. A system for induction of oblique decision trees. Journal of Artificial Intelligence Research, 2:1-32, 1994. 17. Georgios Paliouras and David S. Bree. The effect of numeric features on the scalability of inductive learning programs. In Proceedings of the European Conference in Machine Learning. Springer-Verlag, Berlin 1995. 18. Robert A. Pearson. A coarse grained parallel induction heuristic. In Hiroaki Kitano, Vipin Kumar, and Christian B. Suttner, editors, Parallel Processing for Artificial Intelligence 2, volume 15 of Machine Intelligence and Pattern Recognition, chapter 17, pages 207-226. North-Holland, 1994. 19. J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. 20. J. R. Quinlan. C~.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo, CA, 1993. 21. S. R. Ray, W. D. Lee, C. D. Morgan, and W. Airth-Kindree. Computer sleep stage scoring - an expert system approach. International Journal of Biomedical Computing, 19:43-61, July 1986. 22. Paul E. Utgoff. Decision tree induction based on efficient tree restructuring. Technical Report 95-18, Department of Computer Science, Univerity of Massachusetts, Amherst, MA, March 1995. 23. Ronald J. Vetter. ATM concepts, architecutures, and protocols. Communications of the A CM, 38(2):31-38, Feb 1995. 24. Sholom M. Weiss and Casimir A. Kulikowski. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Nets, Machine Learning, and Expert Systems. Morgan Kaufmann, San Mateo, California, 1991.

306 Richard Kufrin

Richard Kufrin is a senior member of the technical staff at the National Center for Supercomputing Applications (NCSA) at the University of Illinois at Urbana-Champaign. He has worked extensively with parallel applications and algorithms, primarily in the areas of machine learning and computational biology. He was instrumental in establishing the parallel processing program at NCSA in 1988 to apply massively-parallel computers to the solution of large-scale problems in computational science and engineering. More recently, his research interests involve the application of parallel processing to data mining algorithms for knowledge discovery in very large databases. Home Page: http://www.ncsa.uiuc.edu/People/rkufrin/

Parallel Processing for Artificial Intelligence3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

Application Development

307

u n d e r ParCeL-1

Yannick Lallement a and Thierry Cornu b and St~phane Vialle c aHuman-Computer Interaction Institute, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh PA 15213, USA bSwiss Federal Institute of Technology EPFL DI-LITH, Parallel Computing Research Group, 1015 Lausanne, Switzerland r 2 rue Edouard Belin, 57078 Metz cedex 3, France In this paper, we present several kinds of programs developed in a new parallel language, ParCeL-1. This language is based on autonomous actors that compute concurrently as virtual processors. The applications we present here cover various domains of interest to Artificial Intelligence, especially tree search and connectionist programming. We present general methods to develop such kinds of algorithms in ParCeL-1. Then we emphasize several rules that should be followed to write efficient parallel programs. Finally, we describe the performances of these applications on two parallel computers. 1. I N T R O D U C T I O N Programming languages are a key issue for Artificial Intelligence. Many languages have been designed so far to make the development of AI applications easier. Most of these languages, such as Lisp, Prolog or Smalltalk are slanted towards symbolic AI rather than connectionist AI. For quick neural network programming, dedicated languages can be used, but these languages lack symbolic abilities. A possible way out is to use the C language, but its low level makes it more difficult to use than symbolic- or connectionistoriented languages. The aim of ParCeL-1 is to offer a high-level parallel language suited both for classical and connectionist AI applications. Versatility is an issue of increasing importance, especially because of the emergence of a new field in AI, namely hybrid symbolic-connectionist systems, that strive to achieve both symbolic and connectionist capabilities [14]. A language able to represent both symbolic and numeric modes of computation is suitable for this kind of systems. Parallelism is relevant too, since both types of applications demand more and more computer time, and thus efficiency is a key issue. ParCe[-1 is implemented on a multiprocessor architecture and the performance measures we obtained so far are encouraging. The target applications of ParCeL-1 have a medium or fine grain of parallelism: typically neural networks and semantic networks. ParCeL-1 is currently implemented on three parallel architectures (Cray T3D and two Transputer-based architectures: Telmat T-Node and Parsytec MTMSun board) and an implementation on Intel Paragon is under way.

308 In this paper, we give an overview of the ParCeL-1 language: the next section deals with the underlying computational model. Section 3 describes the language itself, gives a complete example of a simple ParCeL-1 program, and presents how the language is implemented on parallel computers. In section 5 several kinds of algorithms and applications for which ParCe[_-I is suitable are presented. The performance of these applications on different parallel architectures is assessed in section 6, which also gives a set of programming guidelines to get good performance from ParCeL-1. 2. T H E C E L L C O M P U T A T I O N A L

MODEL

2.1. A n a g e n t - b a s e d f o r m a l i s m The applications ParCeL-1 is dedicated to are explicitly distributed. To adapt to this framework, an agent-oriented computational model has been adopted [7]. The term agent, here and throughout the rest of this paper, will be used as a generic term to refer to both the objects of concurrent object languages and the actors of actor-based languages. The agents of ParCeL-1 are somewhat inspired by the actor model of computation. However, their behavior exhibits several significant differences from traditional actor models, so that we found it useful to give them a special name. Agents in the ParCeL-1 computational model are called cells, because of some similarities with cellular automata and cellular network models. The basic idea of the cell computational model is to allow the representation of each neuron of a neural network system (and also of each agent of a Distributed Artificial Intelligence system) with a separate instance of a cell. If the language makes this method feasible, programming can then be convenient and straightforward: - explicitly parallel systems such as neural network systems do not need to be reparallelized; - the required data structures are expressed in a natural way with the primitives of the language; - the explicitly distributed format of the resulting program makes it easier to map it automatically to a multi-processors architecture. In the remainder of this section, we will explain the ParCeL-1 design choices and the behavior of the cells. 2.2. A u t o n o m o u s versus reactive agents We will make a distinction between systems based on autonomous agents and reactive agents. In reactive agent systems, a computation is triggered by the reception of a message by the agent. On the contrary, in autonomous agent systems, each agent is able to initiate computations on its own. In the context of parallelism, the difference between autonomous and reactive agent systems becomes fundamental: in autonomous agent systems, concurrent tasks (or processes) are associated with the agents; conversely in reactive agent systems, they can be considered as attached to the messages traveling between the agents.

309 In ParCeL-1 an autonomous agent model was preferred to a reactive one. Indeed, in typical ParCeL-1 applications (especially neural networks), the reception of multiple messages is often necessa .ry to initiate even the simplest calculation. Therefore, making a single process responsible for supervising multiple receptions is a more efficient implementation method. This is only possible if processes are bound to the receiving agent rather than to the received messages.

2.3. Synchronous versus asynchronous concurrency Parallel programming formalisms may be divided into two classes [1]: synchronous and asynchronous models. In the first one, a central clock drives the concurrent tasks and defines for the whole system a common time scale. In asynchronous systems, on the contra .ry, no global time is defined. Combining an agent based approach with a synchronous working mode is one of the original features of the ParCel_-I computational model. This is motivated once more by the usual need of collecting many messages before initiating a computation. In contrast, an asynchronous model would require complex consistency checking, e.g.: Have all necessary data been received? Do all input data correspond to the same time interval or should one of them be stored for use at a subsequent time? Multiple deadlocks could also result from an asynchronous formalism, with different cells waiting for each other to proceed. In our cell computational model, computing proceeds synchronously in a cyclic way. A cycle consists of three main phases: computation, management and communication (see section 3.2 and Figure 1).

2.4. Degree of dynamic reconfiguration capability In agent based distributed systems, three degrees of dynamic reconfigurability can be considered:

static systems where the number of concurrent agents (or tasks) and their intercommunication patterns are fixed before run-time;

systems allowing dynamic resource allocation where new agents may be created at run time;

systems allowing dynamic code creation where new types of agents may be created at run-time. Several learning mechanisms of neural network algorithms imply topological modifications (pruning of connections, addition of new neurons, etc). Distributed AI models also require dynamic changes in most cases. Therefore, some degree of reconfigurability is needed to implement them. The ParCel-1 system supports dynamic resource allocation: cells are dynamically created instances of statically compiled cell types. Dynamic code creation is not supported due to its potentially very complex implementation on distributed systems.

2.5. Expression of the communication pattern between the agents In actor models of computation [1], communication is performed by sending messages to mailboxes associated with each actor. The communication pattern between the agents is therefore implicit. No reconfiguration is needed when the communication pattern changes.

310 Though the ParCeL-I system is also dynamically reconfigurable, the communication scheme is expressed in an explicit way: each cell has input and output channels in addition to its local variables. These channels have to be explicitly connected, using a connection operation. Once established, a connection can be used as long as necessary, until destroyed by another explicit instruction. A single output channel may be connected to an unlimited number of input channels of different cells. Two reasons led to this choice. In typical ParCeL-1 applications, like neural networks, the same connection between two agents is usually used more than once. Thus, establishing a connection once and for all before using it several times is more efficient. Also, in neural network algorithms, the pattern of connections is always described explicitly and is an important characteristic of the algorithm. So it is interesting to have it appear in a program source code.

2.6. Agent naming system In the cell computational model, new cells are created and connected by other cells. The control instructions, by which a cell can manipulate another cell, use a naming system, providing a unique label for eve.ry existing cell instance. A set of three different control instructions is sufficient to make possible the creation of any cell network topology: creation of a new cell, destruction of a cell, connection between two cells. A cell always knows its own label and the labels of all cells it created. In addition, labels can be exchanged through connection channels to establish new control relations and to make it possible to create any topology of connections. 3. ParCel-1 D E S C R I P T I O N

AND IMPLEMENTATION

In this section, we describe ParCeL-1 syntax and present a complete sample program. Then we give a quick overview of the parallel implementation of the language.

3.1. ParCeL-1 syntax A ParCeL-1 program is a list of declarations and definitions of functions and cell types. Figure 3 shows a sample ParCeL-1 program that will be explained in section 3.2. Functions and cell types can have parameters and local variables. In addition, cell types can have local functions and communication channels. Cell types are declared and defined with the keyword t y p e c e l l . One of them has to be named main. Communication channels are declared with the 'in' or 'out' prefixes, and are typed as any other variable. The core of a cell type is a set of rules. Figure 3 shows the definition of two cell types with output and input channels and rules. The syntax of the rules is: ==> The condition part can be either a boolean expression or one of the special keywords i n i t i a l l y or f i n a l l y (see Figure 3). These keywords make it possible to define initial and final rules, that will be executed respectively on creation and destruction of the cell. The action part is a sequence of instructions with a C-like syntax. The ParCel-1 language is strongly typed. The available types are: i n t , double, boolean, symbol and cell registration: r e g i s t r a t i o n (formerly immat). For example, r e g i s t r a t i o n f i l t e r f in Figure 3 declares the f variable with the type: registration

311

Computation phase: cell outputs updating Management phase: network topology modifications Communication phase: broadcasted routing of the channel contents

Figure 1. Main stages of a ParCeL-1 cycle.

of cell of type filter. As shown in the example program below, cell registrations are used to refer to given cells (not unlike pointers), in order, for example, to connect or kill them. 3.2. E x a m p l e of ParCeL-1 p r o g r a m In this paragraph, we explain the execution of a ParCeL-1 program with a simple example: the prime number generation program listed in Figure 3. When a program starts, a cell of type main is automatically created which will be in charge of creating the other ones. In the cell computational model, new cells are created and connected by other cells. At each cycle, every existing cell selects and executes at most one of its rules. The selection is done as follows: the initial rule is executed on creation of the cell; the final rule is executed on death of the cell; otherwise the highest priority rule is selected among those having their condition verified. Once a rule is selected, an additional primitive, called sub-rule, makes it possible to fix the evolution of the cell for several cycles (see [21]). Sub-rules are not used in the example program. The propagation from the output channels to the input channels is postponed until the communication phase at the end of the cycle. Thus, input channel values do not change during the computation phase (see Figure 1). A set of specific instructions, called requests, is used for the cell network management. Three of these requests are sufficient to create any cell network topology: creation of a new cell, destruction of a cell, connection between two cells, the cells being referred by their registrations (see table 1). The execution of the requests is postponed until the end of the computation phase of the cycle, thus the network cell topology will not change during the computation phase (see Figure 1). The basic principle of the prime number program (see Figure 3) is to use a series of filters, each filter being initialized by a prime number, and filtering each number divisible by that prime. The filters are arranged in a pipe-line (see Figure 2) and run concurrently. The main cell is a generator of integers, output on its new_int channel. A first f i l t e r cell is created in the i n i t i a l l y rule of the main cell (line: f = c r e a t e f i l t e r ( 9 ) ) , which will filter any integer divisible by 2. This filter will let odd numbers pass. The first non

312

filter(2) ~'~

._~"filter(3)

filter(5)

=~

o

Figure 2. Communication graph of the prime number program.

#include #define MAX_INT I000 typecell filter (int init) { /, cell type definition with one param ,/ in int input; /* one integer input channel ,/ out int output; /, one integer output channel ,/ registration filter next; /, one registration on a filter cell initially ==> { printi(init); printnl(); next = NULL;

}

,/

/* rule executed on creation of the cell ,/ /* printing of the init parameter */ /* new line */ /* no next cell for now ,/

TRUE ==> { /* new rule. always executed ,/ if (input ~ init != 0 ) /* if the input is not divisible by init ,/ if (next != NULL) /* and there is a next filter cell... ,/ output = input; /* then input is transmitted to this cell*/ else { /* if there is no next filter cell... ,/ next = create filter(input); /, we create a new one which */ connect input of next /, we connect to ourselves */ to output of self;

} else output = 0;

} t y p e c e l l main () { out i n t new_int; registration filter f ;

/, main cell type definition /* one integer output channel, new_int

,/ */

/* a registration on a filter cell

,/

initially ==> { /* rule executed on creation of the cell,/ f = create filter(2); /* creation of a filter cell... ,/ connect input of f /, and connection to this main cell ,/ to new_int of self; new_int = 2;

}

new_int < MAX_INT ==> new_int += I;

/, increase new_int at each cycle

,/

new_int = MAX_INT ==> halt;

/* stop when MAX_INT is reached

,/

Figure 3. ParCeL-1 prime number program.

313 Table 1 Main available ParCeL-1 requests. action syntax cell creation <x> = c r e a t e f i l t e r ( 9 ) ; cell removal k i l l <x>; cells connection connect of <x> to of ;

filtered number is three, thus three is a prime number, and a new cell is created that will filter all the numbers divisible by three. The process is iterative: each time a new f i l t e r cell is created, it prints the number it will filter, which is prime, and filters all the multiples of that prime number. The communication pattern between the cells is shown in Figure 2. The output channel new_int of the main cell is connected to the i n p u t channel of the first f i l t e r cell. Then, the o u t p u t channel of each f i l t e r cell is connected to the i n p u t channel of the next f i l t e r cell.

3.3. Parallel implementation of ParCeL-1 ParCeL-1 is implemented on a dedicated virtual machine: the Virtual Cellular Machine (VCM-1). This virtual machine implements three main operations. First, it manages the cyclic and synchronous functioning of all cells. Second, it executes all the cell requests and resolves possible conflicts. Third, it manages all the communications needed by ParCeL-1 on an MIMD 1 architecture, i.e. the routing of the requests and channel contents. Thus, in order to port ParCeL-1 to a new architecture, only the virtual machine needs to be re-implemented, using the communication instructions available on this architecture. 4. R E L A T E D L A N G U A G E S This section situates ParfieL-1 among several kinds of programming languages it is closely related to: concurrent object-oriented languages, actor languages, and connectionist languages.

Concurrent object-oriented l a n g u a g e s Object-oriented programming was initially developed without any relationship to parallel programming models. The main distinct feature of object-oriented languages is inheritance and not concurrency. However, object programming introduces a natural partitioning of data and tasks, so that the concept of an object provides a good metaphor for parallel computing models. Therefore parallel extensions have been proposed for most object-oriented languages (see for instance [17,9,3,6]). In concurrent object-oriented languages, the concept of inheritance as such is completely independent from the semantics of parallelism. Nearly all models based on concurrent objects use asynchronous execution models, i.e., models where computation and message passing are not synchronized. Consequently, 1Multiple Instructions Multiple Data

314 communication has to be synchronized explicitly and each transaction between two objects must be implemented as a separate communication. Mail boxes or message queues have to be managed by the underlying system. As a result, it is not easy to implement efficient communication, and concurrent object systems are often restricted to coarse grain parallelism, for performance reasons. As an additional result, concurrent object programming, while finding increased acceptance for implementing distributed systems over wide area networks [5], is still seldom used in massively parallel high performance computers. ParCeL-1 was designed with high performance computing in mind. It does not provide any inheritance mechanism. However, since objects in ParCeL-1 are statically typed, it would be feasible to extend it with inheritance functionalities similar to those of compiled object languages such as C + + .

Actor languages Actor-based programming traces back to the computational model proposed by Hewitt [10] and later improved by Clinger and Agha [1]. Actors are conceived as active objects, i.e., objects with their own resources and activities. Actor languages may or may not provide inheritance. When available, inheritance does not directly influence the way concurrency is handled. Most actor languages provide not only inter-object parallelism, i.e., concurrency between different objects, but also often intra-object parallelism, i.e., objects are able to process several requests simultaneously [1]. In the ACT language [20], eve .ry single computation (for instance the addition of two numbers) implies transactions between several actors. The communication and synchronization protocols of actor languages may prove tricky to implement efficiently on multiprocessor machines; as a matter of fact, actual multiprocessor implementations of actor languages appeared only after 1990. The HAL language based on the CHARM virtual machine [12] is a system in the lineage of the work of Hewitt, Clinger and Agha. The successive variants of the ABCL language [22], were developed by the team of A. Yonezawa. In ABCL, several conservative design choices were made, in an attempt to facilitate an efficient parallel implementation. ABCL/1 is a parallel extension of Common-Lisp. Simple data structures and operations are implemented in Lisp without involving any actor-based mechanism. Similarly, an actor can process only one request at a time; however, this processing may be interrupted to switch to another task. Several versions of ABCL actually exist as multiprocessor implementations. ParCeL-1 is similar to actor-based languages, but with a synchronous computational model. Several multiprocessor implementations of ParCeL-1 are fully operational. The underlying VCM-1 virtual machine is available since 1992, and the first implementation of ParCeL-1 itself since 1994.

Connectionist languages Three kinds of tools can be used to program neural networks: simulators; dedicated languages; general purpose languages. The easier to use, the more restricted a tool is. In this paragraph we compare ParCeL-1 viewed as a connectionist language to other such languages. A language can be more or less versatile, according to the variety of networks that can be built, and the variety of agents that can compose the network. Some languages, such as

315 Aspirin [15] or NeuDL [19], implement only one kind of model (usually back-propagation networks). Other languages, such as NIL [4] propose only one type of elementa .ry agents (the neuron). ParCeL-1 is completely unconstrained, that is, any kind of neural network can be implemented using any kind of basic agents. Synchronization mechanisms also differ from one language to another. In CuPit [18] or Condela-3 [2], a central agent applies a given function to a given set of agents (e.g. neurons) at each cycle. NIL's agents [4] follow a CSP-like [11] rendez-vous mechanism, and are activated when they receive data. ParfieL-l's agents are activated at each cycle, thus, when necessa .ry, synchronization must be implemented explicitly. For example, it is possible to create a manager cell connected to and controlling the other cells. Finally, parallelism is another key issue: connectionist applications are very demanding for computing power, but parallel languages are still rare. CuPit [18] relies on SIMD 2 parallelism on MasPar architectures, by triggering several identical agents at the same time. ParCeL-1 is implemented on MIMD architectures, and different agents can perform different computations at the same time. 5. A P P L I C A T I O N

PROGRAMMING

I N ParCeL-1

In this section, we describe several examples of ParCeL-1 programs: first some connectionist applications, then some general numeric computation applications, and finally a tree search application. We conclude by a set of possible methods to write programs in ParCeL-1. 5.1. K o h o n e n s e l f - o r g a n i z i n g f e a t u r e m a p As an extended programming example, we will explain here the implementation of a Kohonen self-organizing map [13]. Among the neural networks using unsupervised training, the Kohonen self-organizing map is the best known. Its applications are in the field of vector quantization and mapping of multi-dimensional spaces. A Kohonen map is a network of N neurons, usually arranged on a two-dimensional grid. With each neuron is associated a weight vector of a length equal to the number of inputs of the neural network. Training is performed iteratively. At each iteration t, an input vector is chosen randomly in the training set. For each input, its distance from each neuron's weight vector is measured. Then the neuron with the smallest distance (the winner neuron) is determined and all weights are updated according to a given learning law. The ParCeL-1 implementation of a Kohonen map uses two types of cells (Figure 4): N cells of type neuron and one supervising cell. The supervising cell first creates the neuron cells, using the c r e a t e instruction. It is then responsible for inputing training vectors to the neuron mesh, getting back the corresponding neuron activations, finding the minimal output value (and thus the winner), and broadcasting to every neuron the identity of the winner neuron. Figure 5 shows the declaration of the neuron cell type, of which the neurons will be instances. Two parameters are passed to each neuron cell when it is created: the (i, j) coordinates of the neuron on the feature map (we choose a classical square-grid topology). The body of the declaration starts with variable declarations, followed by input and output channel

2Single Instruction Multiple Data

316 activations

-y $ti

neural networkinputs

Figure 4. Kohonen feature maps in ParCel-l: implementation principle

declarations. First come the input vector and the output value of the neuron. Then, additional input channels contain information coming from the supervisor cell, which are useful for training: the index t of the current iteration and the (i, j) coordinates of the current winner neuron. The actions performed by the neurons are specified in a list of action rules. The first rule, with condition i n i t i a l l y , fires during the first cycle of the neuron life. Its function is to initialize the weights to random values. Then a new rule, with condition TRUE, fires iteratively starting at the second cycle until the death of the cell. The execution of this rule spreads over three cycles, thanks to the subrule operator -->. 9 first cycle: the distance between the input vector and the weight vector is computed; the result is sent to the supervisor cell. 9 second cycle: an empty cycle to wait for the identity of the winner neuron, computed by the supervisor. 9 last cycle: the Kohonen training formula is applied to the weight vector; functions a l p h a and ahood have been defined beforehand as global functions. To make the Kohonen program complete, a supervisor cell must now be built. For the sake of simplicity, the supervisor will be the main cell that is created at startup. Thus the whole program will comprise only two different cell types. Figure 6 shows the declaration of the m a i n cell type. The input and output channels are the ones also appearing in the neuron cell type: the value of input vector, the neuron output values (i.e. distance between input and weight vectors), and information related to the training process (iteration index and identity of the winner neuron). At first (rule i n i t i a l l y ) the main cell creates all neuron cells and connects them to itself. The next rule fires iteratively as long as t is smaller than the desired number of iterations. A new input vector is first sent to the neurons. For this test program, uniform random distribution is used as the training set. This is the usual type of input to test a Kohonen program. In a real application, we would read data from a file or from an array in memory instead. Once a new input has been sent, one empty cycle is

317 typecell neuron(int my_i, int my_j) { in double input [INPUTS]; /, the input vector */ out double output; /, the output value of the neuron */ in int t, winner_i, winner_j; /* time, coordinates of the winner cell ,/

double weight[INPUTS]; double p o t , d i s t ;

/* the internal weights of the c e l l / , temporary v a r i a b l e s

,/ ,/

int i ; initially ==> { /* randomly initialize the weights for(i = O; i < INPUTS; i += I) weight[i] = frand(); } TRUE •ffi> { for(i = O, pot = 0.0; i < INPUTS; i += I) pot += sqr(weight [i] - input [i]) ; /* compute distances

} --> { }

--> {

output = pot;

,/

,/

/* output the r e s u l t to the supervisor */ /* wait until the winner's identity /* is computed by the supervisor cell

/* update the weights f o r ( i = O; i < INPUTS; i += 1) { d i s t = abs(my i - winner i) + abs(my_j - winner j)

,/ */

,/

weight[i] += alpha(~) * nhood(t, dist) 9 (input[i] - weight[i]); )

}

Figure 5. The neuron cell source code of the Kohonen program

performed while the neurons compute the distances. Finally during the third cycle, the neuron with minimal distance to the input vector is identified. When t is greater than the number of iteration, the last rule is selected, and the program is stopped. This program uses a particular programming method that can be called supervised. The processing of a set of cells (the neuron cells) is sequenced by a single master cell (the main cell). In the following, we will more briefly describe another application using this kind of programming method, as well as other applications using two other programming methods. 5.2. T e m p o r a l O r g a n i z a t i o n M a p Temporal Organization Map (TOM) [8] is a temporal connectionist system that is used for speech recognition. The goal of this architecture is to detect sequences of patterns in a temporal phenomenon in a connectionist fashion. A set of super-units is used to encode the acoustic features of the speech, and is trained using a Kohonen-like algorithm. At the end of this training phase, each super-unit reacts to some particular acoustic event. To take into account the flow of acoustic events, each super-unit contains a set of units. During the learning phase, units are created and connected with each other into chains representing the succession of acoustic events. The training algorithm is robust enough to deal with the fuzziness and the variability of speech.

318

typecell main() { out int t, winner_i, winner_j; /* time, coordinates of the winner cell*/ out double proto[INPUTS] ; /* the input vector for the neurons */ in double activation[WIDTH] [HEIGHT]; /* activations of the neurons */ registration neuron neuron[WIDTH] [HEIGHT]; /, addresses of the neuron cells ,/ int i,j,k; double smallest; /, current smallest distance */ initially ==> { /, creation of the neural network */ for(i = O; i < WIDTH; i += I)

f o r ( j = 0; j < HEIGHT; j += 1) { n e u r o n [ i ] [j] = c r e a t e n e u r o n ( i , j ) ; /* c r e a t e a n e u r o n /* p e r f o r m e v e r y c o n n e c t i o n t o t h e n e u r o n c o n n e c t w i n n e r _ i of n e u r o n [ i ] [j] t o w i n n e r _ i of s e l f ; c o n n e c t winne r _ j of n e u r o n [ i ] [j] t o winner_j of s e l f ; f o r ( k = 0; k < INPUTS; k += 1) c o n n e c t i n p u t [ k ] of n e u r o n [ i ] [j] t o p r o t o [ k ] of s e l f ; c o n n e c t a c t i v a t i o n [ i ] [j] of s e l f t o o u t p u t of n e u r o n [ i ] [ j ] ; c o n n e c t t of n e u r o n [ i ] [j] t o t of s e l f ; } t = O;

/* initialize iteration counter

*/ */

*/

} t < ITERATIONS ==> { /* compute new prototype(random distribution)*/ for(i = O; i < INPUTS; i += I) proto[i] = frand() ; } --> { }

/, wait for the neurons to compute distance

--> { /* find the winner neuron smallest = MAX_FLOAT; winner_i = O; winner_j = O; for(i = O; i < WIDTH; i += I) for(j = O; j < HEIGHT; j += I) if(activation[i][j] < smallest) { smallest = activation[i] [j] ; winner_i = i; winner_j = j ; } t += I; /* this was one more iteration

}

TRUE

==>

halt;

/, otherwise,

s t o p t h e program

F i g u r e 6. T h e m a i n cell source code of the Kohonen p r o g r a m

*/ */

*/

,/

319

Figure 7. Cellular network for the TOM program

TOM has an intrinsically parallel functionality: each unit and each super-unit can be updated simultaneously, thus TOM's parallelism can be easily expressed in ParCeL-1. The first implementation possibility consists of creating one cell type for the units, and one cell type for the super-units. However, the units are very small processing elements, and are strongly interconnected with each other; thus, it is faster and more efficient to represent them as data structures in the super-units, rather than as independent cells. Therefore, the implementation of TOM in ParCeL-1 involves only two types of cells: a supervisor cell that is in charge of sending the inputs and collecting the results, and the super-unit cells, in charge of the actual processing (see Figure 7). TOM uses the same kind of programming method as the Kohonen programs: the processing of a set of computing cells is managed by a single supervisor cell. 5.3. G e n e r a l n u m e r i c c o m p u t a t i o n Many scientific computing applications may be expressed as the iterative computation of a set of variables: each new value of a variable is a function of the variables (or a subset of them) at the previous iteration. Two examples have been implemented: the Jacobi relaxation and the N-body simulation. The Jacobi relaxation program [16] iteratively solves the Laplace differential equation:

02v 02v x--OZ + ~ = 0 using a finite difference technique. An application is, for instance, the computation of the voltage v(x, y) at any point (x, y) of a two-dimensional conducting metal sheet. The N-body program [16] is a simulation of the trajectories of N weighted punctual objects in the three dimensional space. Each object has an instantaneous position, speed and acceleration and its trajectory is influenced by the position of the other bodies, due to a long range interaction (typically gravitational or electro-magnetic interaction). The computational model of ParCeL-I makes it well suited for the implementation of such algorithms. Typically, each cell is responsible for periodically calculating one or several variables, using the output values of its partner cells that are in charge of one or several other variables. The program essentially uses one type of cell, and as many cells as

320

Figure 9. Cellular network for the N-queens program

the number of subsets of variables have to be created (see Figure 8). These applications use another kind of programming method than the Kohonen and TOM programs: the cells compute concurrently without supervision, and return their results after a pre-determined number of cycles. This kind of programming method can be called iterative programming. 5.4. Tree s e a r c h As an example of tree search, which is a fundamental algorithm used in AI, we have implemented the N-queens problem. Solving this problem consists of exploring a tree in which the nodes are positions on the chess board. On the first level of the tree, only one queen is on the board; on the second level, two queens are on the board, etc. The basic principle to implement this kind of algorithm in ParCeL-1 is to divide the tree into several branches, and process the branches concurrently. In the case of the 8-queens, the tree can be easily divided into 8 branches, each branch fixing a particular position for the first queen. In ParCeL-1 this is implemented using 8 cells of a single cellular type that will each process one of the 8 branches of the tree. This division can go further, by fixing more levels of the tree, for example, the first two, that is, the first two queens. Then, 8 x 7 branches can be developed concurrently (see Figure 9). We have here yet another kind of programming method: each cell processes its own branch of the tree without regard to what the other cells do. The termination of each cell is independent of all the others: we can call this kind of programming independent processing.

321 Table 2 Performance of several ParCeL-1 programs on T-node with 9 processors v s ParCeL-1 and C versions on 1 processor Test program basic speed-up: speed-up vs C: efficiency: Tp.,C,L-- 1 (9)

9

Jacobi relaxation

TP,,C,L-- 1

4.2

1.5

17%

N-body

5.0

4.1

45%

6.0

2.3

25%

7.2

4.4

49%

8.6

7.1

79%

30 x 30, 5000 iter.

N

, 5000 iter.

Kohonen 1 neuron/cell 9 0 0 neur'ons, 5 inputs, 1000 Kohonen 25 neurons/cell

iter.

9 0 0 neur'ons, o inputs, l v u v iter.

N-queens

= 13, 132 cells

5.5. Conclusion: ParCeL-1 programming overview We have given some examples of possible programming methods that can be used in ParCel-1 programs: supervised (case of the Kohonen and TOM programs), iterative programming (case of the numeric computation programs), independent processing (case of the N-queens). These methods can be combined; we have developed an application called Resyn [14], that implements a hybrid symbolic-connectionist database, and interacts with the user by means of a a command interpreter. This command interpreter acts as a global supervisor for the program, and can sometimes order an iterative relaxation phase. Resyn also implements a delegation mechanism [1]: when the command interpreter receives a read-file instruction, the parsing of the file is delegated to a specialized cell. Resyn emphasizes the versatility of ParCel-1 and its computational model, since several kinds of programming methods can coexist in the same program. 6. A P P L I C A T I O N

PERFORMANCES

In this section, we present some measurements we have collected for several of the applications presented above, and we suggest a few guidelines for writing efficient ParCeL1 programs.

6.1. Methodology The performance measurements we obtained are shown in table 2. In each case, we compared the execution time of the ParCeL-1 program on a T-node machine with 9 processors to the execution time on one processor of either the ParCel-1 program or the corresponding C program. This last comparison is more significant for the user, because the language giving the best execution times for these applications on sequential computers is C. The speed-up is the sequential execution time divided by the parallel execution time, and the efficiency is the speed-up divided by the number of processors, that is the fraction of the processor's computational power actually used.

6.2. Experimental results and programming guidelines In order to obtain efficient parallel programs, two conditions should be observed. First, the load balance of the processors must be as good as possible: because of the cyclic

322 225

iti nHiiHiit r Hi iiHHHil ~::::!'~::i::ii~_.:!g~i~ii~#!:i!!i!!i!i@i!i::::i!:..;,!ii~ :::::-::~:;~.~.~:::::~::i:~.~ui::::.:::.~.~i~::::~::i:~.~..

1

10

100 1000 Number of cells

10000

Figure 10. Optimum number of cells for the 12-queens program on T-node with 9 processors: close to 1000

computational model, the least loaded processors will have to wait for the most loaded processor to complete its computation phase before a new cycle can start. Thus, a good load balance is essential. Second, the overhead due to the cell management and the communication time must be minimized. Two parameters are important to meet these conditions: the number of cells, and the granularity of the cells. In order to meet the first condition, the number of cells must be at least equal to the number of processors, to install one cell on each processor. If all the cells are identical, then the load balance is good. More generally, that is, if several kinds of cells exist in the program, a greater number is suitable to ensure a good statistical distribution of the cells on the processors; ten times the number of processors seems to be a minimum. For example, Figure 10 shows that, on a T-node with 9 processors, the optimal number of cells in the case of the 12-queens algorithm, is close to 1000 (100 times the number of processors). However, the performance obtained with only 100 cells (10 times the number of processors) is already ve~. close to the optimum. The performance degradation beyond the optimum number of cells is explained below. In order to meet the second condition, it is necessary to create cells with a granularity (essentially their computation time) that is large enough, compared both to their communication time and to the cell management overhead due to ParCeL-1 itself. If this condition is not met, the cell management time or communication time will be important compared to their computation time, and the overall performance will decrease. In the case of the 12-queens program (Figure 10), the performance decreases when too many smaller and smaller cells are active. In the case of the Kohonen program, it is necessa .ry to associate more than one neuron with each cell to get cells of a sufficiently coarse grain of parallelism, and thus to improve the performance. Table 2 shows different results for the Kohonen program according to the granularity of its cells, that is, the number of neurons

323

Table 3 Performance of the 15-queens program on Cray T3D and Intel Paragon Test program basic speed-up: speed-up v s C: efficiency: Tp.,C.L_,(1)

15-queens on T3D, 8 proc. time

speed-up

Tp,,,Cek--1 (n)

n

7.8

6.8

84%

29.7

25.7

80%

vs

= 13.6 s

'

~5-queenson T3D, 128 proc.

109.3

89.2

70%

15-queens on T3D 256 proc.

205.7

167.8

66%

7.9

7.1

88%

29.8

26.8

82%

ime

time

= J.y

s

= 2.1 s

15-queens on Paragon, 8 proc. time

-- 1 6 2 s

15-queens on Paragon, 32 proc. time

C

= 51.8 s

15-queens on T3D 32 proc. time

Tc(1)

Tp,,,CeL-l(n)

= 33 s

a given cell is in charge of. The Kohonen program with one neuron per cell is the one presented above. Grouping several neurons per cell consists basically of adding a loop in one of the rules of the n e u r o n cell, and does not result in a much more complex program. The source code for this Kohonen program is ve.ry close to the one we showed. Besides, the Jacobi relaxation program also has a too fine granularity, hence poor efficiency, but the granularity could be augmented as in the Kohonen program. Programs conforming to these rules (optimal number of cells and optimal granularity of the cells) show a rather good performance: the Kohonen program with 25 neurons per cell and the N-body program show a speed-up close to 4.5 on 9 processors, t h a t is, an efficiency close to 50%. The N-queens program also supports these rules, and its cells communicate very little, hence the excellent speed-up of 7.1 on 9 processors. These results have been obtained on a computer based on T-800 processors. These processors were released before 1990, and are now aging. Therefore, even if the speed-ups are good, the execution times remain higher than on modern workstations (e.g. Sparcstation 20 or DEC Alpha). We have implemented ParCeL-1 on two state of the art MIMD architectures: Cray T3D and Intel Paragon. This last implementation is ve.ry recent: optimizations and benchmarks are under way. Of course, on these architectures, the execution times are dramatically lower than on T-node and modern workstations: table 3 gives some samples of execution times and speed-ups on T3D and Paragon for the 15-queens program (generating 2184 cells). 7. C O N C L U S I O N We have presented a new language dedicated to AI programming. ParCeL-1 has proven its efficiency on several types of applications, on the connectionist and on the symbolic side of AI. ParCeL-1 is closely related to and benefits from many features of actor-based and connectionist-oriented languages. The performance tests we carried out so far on parallel implementations resulted in good speed-ups. Some small programs (e.g. Nqueens) were written efficiently in a short time by students, indicating that one can easily master ParCeL-1 and its computational model. Finally, ParCeL-1 can be used for many types of programming, even if its predilection domain remains networks composed of small

324 computing elements--such as neural or semantic networks. Its versatility and its parallel implementation make it especially attractive as a connectionist language. From a parallel programming viewpoint, ParCeL-1 seems to be a good compromise between an automatic parallelization of the source code, still out of reach, and an architecturedependent parallel style of programming. Compared to other concurrent object and actorbased systems, ParCeL-1 is more adapted to applications with very dense communication patterns, like neural network programs and other similar applications. A low-level language may give better results in terms of pure performance, but the masking of the parallel architecture and of the communication layers make ParCeL-1 suitable for a quicker development of portable programs - a single ParCeL-1 program can be executed without modification on several multi-processor architectures. The ongoing work on the ParCeL-1 project follows several directions: extensive programming experiments including real size applications, assessment and performance measurements of the parallel implementation, further development of the language itself to include higher-level functionalities and porting to new multi-processor architectures. REFERENCES

1. G. Agha. ACTORS, a model of concurrent computation in distributed systems. MIT Press, 1986. 2. N. Almassy, M. K5hle, and F. SchSnbauer. Condela-3: A language for neural networks. In International Joint Congress on Neural Networks, pages 285-290, San Diego, 1990. 3. I. Attali, D. Caromel, and S. Ehmety. Une s~mantique op~rationnelle pour le langage eiffel//. In Journdes du GDR Programmation, Grenoble, 22, 23 et 24 novembre 1995. 4. A. S. Bavam. Nps: A neural network programming system. In International Joint Congress on Neural Networks, pages 143-148, San Diego, 1990. 5. J . P . Briot and R. Guerraoui. A classification of various approaches for object based parallel and distributed programming. Technical report, University of Tokyo and Swiss Federal Institute of Technology, 1996. 6. A. Chien, U. Reddy, J. Plevyak, and J. Dolby. ICC++ A C + + dialect for high performance parallel computing. Lecture Notes in Computer Science, 1049:76-??, 1996. 7. Thierry Cornu and St(!phane Vialle. A framework for implementing highly parallel applications on MIMD architectures. In J. R. Davy and P. M. Dew, editors, Abstract Machine Models for Highly Parallel Computers, Oxford Science Publications, pages 314-337. Oxford University Press, 1995. 8. S. Durand and F. Alexandre. Learning speech as acoustic sequences with the unsupervised mod el, TOM. In NEURAP, 8th International Conference on Neural Networks and their Applications, Marseilles, France, 1995. 9. A. Grimshaw. Easy-to-use object-oriented parallel processing with Mentat. Computer, 26(5):39-51, May 1993. 10. C. Hewitt, P. Bishop, and R. Steiger. A universal modular actor formalism for artificial intelligence. In IJCAI-73, pages 235-245, 1973. 11. C. A. R. Hoare. Communicating Sequential Processes. Prentice Hall, 1985.

325 12. W. Kim and G. Agha. Compilation of a highly parallel actor-based language, pages 1-15. Lecture notes in computer science 757. Springer-Verlag, 1993. 13. T. Kohonen. Self-Organization and Associative Memory, volume 8 of Springer Series in Information Sciences. Springer-Verlag, 1989. 14. Y. Lallement, T. Cornu, and S. Vialle. An abstract machine for implementing connectionnist and hybrid systems on multi-processor architectures. In V. Kumar H. Kitano and C. B. Suttner, editors, Parallel Processing for Artificial Intelligence, 2, Machine Intelligence and Pattern Recognition Series, pages 11-27. Elsevier Science Publishers, 1994. 15. R. R. Leighton. The aspirin/migraines neural network software, user's manual. Technical Report MP-91W00050, MITRE Corporation, 1992. 16. Bruce. P. Lester. The Art of Parallel Programming. Prentice Hall, 1993. 17. J. Pallas and D. Ungar. Multiprocessor Smalltalk: A case study of a multiprocessorbased programming environment. In Conference on programming language design and implementation, pages 268-277, Atlanta, June 1998. 18. L. Prechelt. Cupit a parallel language for neural algorithms: Language reference and turotial. Technical report, Uni. Karlsruhe, Germany, 1994. 19. S. 3. Rogers. Neudl: Neural-network description language. Available by ftp at cs.ua.edu, file/pub/neudl/neuDLver0.2.tar.gz, August 1993. 20. D. G. Theriault. Issues in the design and implementation of act 2. Technical Report AI-TR-728, Massachusetts Institute of Technology, A.I. Lab., Cambridge, Massachusetts, 1983. 21. S. Vialle, T. Cornu, and Y. Lallement. Parcel-I, user's guide and reference manual. Technical Report R-10, Supelec Metz campus, SUPt~LEC, 2 rue Edouard Belin, F57078 Metz Cedex 3, November 1994. 22. A. Yonezawa, S. Matsuoka, M. Yasugi, and K. Taura. Implementing concurrent object-oriented languages on multicomputers. IEEE parallel and distributed technology, 1(2):49-61, May 1993.

326 Yannick Lallement Yannick Lallement was born in France in 1968. He obtained the masters degree in computer science in 1990 from the University of Metz, and the Ph.D. in Computer Science from the University of Nancy I in 1996. His research interests include parallel computation, hybrid connectionist-symbolic models, and cognitive modeling. He is currently a research scientist in the Soar group at Carnegie Mellon University.

Thierry Cornu Thierry Cornu was born in France in 1966. He obtained the engineering degree from Sup61ec in 1988 and the Ph.D. in Computer Science in 1992 from University of Nancy I. Since 1993, he has been a lecturer and a research scientist at the Computer Science Department of the Swiss Federal Institute of Technology (EPFL), first with the MANTRA Research Centre for Neuro-Mimetic Systems, and, since 1996, with the Parallel Computing Research Group of the EPFL. His research interests include parallel computation, performance prediction, neural network algorithms, their engineering applications and their parallel implementation.

St~phane Vialle St6phane Vialle was born in France in 1966. He graduated from the institute of technology of Grenoble (IUT 1) in electrical engineering in 1987. He obtained the engineering degree from Supdlec in 1990 and the Ph.D. in Computer Science in 1996 from the University of Paris XI. He has been a lecturer and research scientist at Supdlec since 1990. His research interests include parallel languages and parallel algorithmics, and their application to multi-agent systems.

Parallel Processing for Artificial Intelligence 3 J. Geller, H. Kitano and C.B. Suttner (Editors) 9 1997 Elsevier Science B.V. All rights reserved.

327

A I A p p l i c a t i o n s of M a s s i v e P a r a l l e l i s m : A n E x p e r i e n c e R e p o r t David L. Waltz NEC Research Institute Princeton, NJ and Brandeis University Waltham, MA For nearly ten years my group and I at Thinking Machines Corporation worked at selling massively parallel computers for a variety of applications that fall broadly in the area now called "database mining." We had an amazing team of scientists and engineers, saw trends far ahead of the rest of the world, and developed several great systems. However, we began as novices in the business arena. Sometimes we made sales, sometimes we did not; but we learned a great deal in either case. This chapter recounts the sales process and a brief histo .ry, mostly in the form of "war stories" mixed with technical details, and attempts to summarize some messages to take away, based on what we learned. 1. I N T R O D U C T I O N : TANT?

W H A T IS D A T A M I N I N G A N D W H Y IS I T I M P O R -

Database mining is the automatic (or semi-automatic) extraction of information - e.g. facts, patterns, trends, rules, etc. - from large databases. Relevant basic methods for database mining include: statistics, especially various types of regression and projection pursuit; decision trees, CART (Classification And Regression Trees), and rule induction methods; neural nets; genetic algorithms and genetic programming; and memory-based reasoning (MBR) and other nearest neighbor methods. An ideal database mining system can identify interesting and important patterns quickly in current data, with little or no direct guidance from humans. For example, over a weekend a database mining system might collect sets of "factoids" and rules that characterize the behavior of customers or retailers, identify trends and likely causal factors, or find latent semantic categories in text databases. Data mining and learning are important for two main reasons: 1) the explosive growth in on-line data, and 2) the costs of developing software products. Let me explain: the amounts of data that we must cope with are growing explosively - even faster than computing power per dollar. Moreover, few if any people have much intuition about what patterns are in (or responsible for) this data. Automatic processing- sifting, refining, searching for patterns and regularities, etc. - is necessary and inevitable: hand analysis and human hypothesis and model generation are too uncertain, too time consuming, and too expensive. Since 1950, the cost of a transistor has fallen by eight orders of magnitude; the cost per line of (assembly code) software has fallen by at best one order of magnitude, perhaps not at all (see figure 1). Why such a discrepancy? The answer is that hardware is generated by a highly automated process that borrows from photolithography and other technologies, while software is still a labor-intensive process. Most of the gain in software

328

$1000

Cost per small black & white TV

-

$100

--)

T

$10 $1 $.1

"..

-

Cost per ofcode

"'.

line

$.01 $.001

m

$.0001

-

$.00001

Cost per transxstor

Ordinary rate ot laborintensive engineering progress

'.. ...

m

,

$.00(K~I $.0000001 I 1950

I

I

1960

I 1970

I

I 1980

I

I 1990

I

I 2000

'", 2010

Figure 1. Trends in Software and Hardware Costs from Software Developer's Perspective

(if there is one) is due to the availability of higher-level languages and better editing and debugging tools. The software we developed demonstrated that it is possible to break out of this cycle, and realize software productivity gains as dramatic as the gains in hardware costperformance. Figure 2 shows actual results from two projects (two application systems built for the US Bureau of the Census, described in more detail below). First a confession - the exact cost per line of code could not be easily calculated, so the placement of the dotted line is somewhat arbitra .ry, though we believe that by using expert system techn o l o g y - in effect a very high-level programming language- it was possible to develop the application for less cost than the usual cost per line of code. The critical point is that, by using learning and data discove .ry methods, we were able to build a "memo .ry-based reasoning" system that learned the application at a much lower cost (4 person-months vs. 200 person-months) - and performed with significantly higher accuracy to boot. Note that the slope of the line connecting these two points parallels the slope of the transistor cost curve. It is an open question is whether learning technologies will allow us to extend this trend; the answer to this question has huge economic consequences. 2. T H E S A L E S P R O C E S S I went to Thinking Machines Corporation at the beginning of 1984. (At the time I agreed to go in October 1983, there were only about 15 employees.) TMC's first hardware

329

$1000

$100

C o s t per line o f c o d e

_

$10 $1 $.1

"', _

Expert ~ystem

' -

$.01

Memo.ry Based Reasoning

% % %

$.001

% %

C o s t per transistor

$.0001

"o...~~

Cost trend

%

%

for learned software?

%

$.00001

% %

$.000001 $.00000Ol I

1950

I

1960

I

I

1970

I

I

1980

I

I

1990

I

I

2000

9

I

v

2010

Figure 2. Strategy: Use Hardware Power to Reduce Software Engineering Costs

product - the 64,000 processor CM-1 - went on sale in 1985. While smaller machines were introduced later, all hardware offered by Thinking Machines was in the $1/2M - $20M range. As a consequence, sales took a minimum of 9 months from first contact, often two years or more. Typically 3-5 FTE people were involved for the entire sales cycle. Profitable prospects were thus limited to large organizations and even then the Corporate Board would generally need to approve a purchase of this magnitude. (Thinking Machines also made a number of "sales" to universities and government laboratories. The university sales were always at steep discounts, and were not a significant source of revenue. Government sales were reasonably lucrative, but often had strings attached. While these sales would also make interesting stories, from here on in this chapter I will describe only commercial sales attempts.) Virtually all commercial (and many other) sales fell into one of two modes: 1) Benchmarking-based sales, or 2) Novel functionality-based sales. In benchmarkingbased sales "technical marketing" staff ported customer's existing applications (or key kernels), and demonstrated the level of performance possible with TMC hardware. In novel functionality-based sales, our R&D staff wrote novel, speculative software prototypes, exploiting the power and features of TMC's hardware, and we then attempted to convince customers to adopt proprietary software embodying the new functionality, which required buying our hardware. Virtually all the stories in this chapter are of the novel functionality type. The goal in this type of sale is daunting. In essence it is necessary to convince customers that they

330 can't live without something that they've been living without forever! In such sales the customer has to balance the commercial advantage in being first or fastest vs. the risk of finding no customers or being unable to recover the cost of the system in savings or sales.

3. P R O J E C T

1: A U T O M A T I C

BACK-OF-THE-BOOK

STYLE INDEXER

My group was responsible for the first product of Thinking Machines: TMC Indexer, a software product, built in 1985, while the CM-1 was still in development. The key ideas for the Indexer were the due to Dr. Howard Resnikov, one of the founders of Thinking Machines. The Indexer ran on Symbolics Lisp Machines, then a hot product in what was a thriving AI market. The TMC Indexer generated a back-of-the-book style index from a formatted text input. It used a novel natural language processing technology, with simple heuristics for finding indexable materials. The TMC Indexer concentrated on locating noun phrases. It had pattern recognizers for proper noun phrases (names of people, places, companies, and agencies), as well as for content noun phrases. The TMC Indexer automatically generated an alphabetized list of entries plus page numbers. The technology consisted of lists of proper nouns; noun phrase parsers; and simple rules for inverting phrases: e.g., "processing, parallel" as well as "parallel processing"; "Reagan, Ronald"; etc. The Indexer could index an entire book in less than one hour; it required only about 1/2 day total to complete the index (including hand editing) for Danny Hillis' MIT Press book "The Connection Machine" vs. 2-3 tedious days for hand indexing a book of similar length. With a prototype in hand, we approached publishers as a natural target market. While everyone seemed impressed and interested, we made no progress in selling any copies. This was really puzzling, since indexing was known to be time-consuming and expensive, and the quality of indexes often left much to be desired. Eventually we learned why we were having no success. The manager at a publishing house responsible for producing indexes told us that he (and other people in parallel positions) had "black books" of p e o p l e - mostly highly educated mothers at h o m e - that they used as a workforce for producing indexes. For these managers, having a list of (human) indexers was their prima .ry job security. He argued that, if this process were automated, he would be out of a job, and therefore he - and all our other potential customers - would never buy the product, regardless of the savings or quality advantage. While I cannot say for certain that he was correct in his analysis, we never did sell a copy to a publisher, despite a considerable effort to do so. We did sell one copy, the first sale of any sort by Thinking Machines. The customer was a "Beltway bandit" (i.e. a Washington, DC, area consulting firm working primarily with the US government), employing about 5000 people. Each of their technical employees was an expert in one or more areas, such as nuclear waste disposal, spread spectrum radar, or computational fluid dynamics. When generating a bid for a project listed in the Commerce Business Daily or responding to a request from an agency, this company typically had to find a suitable multi-person team, e.g. to do a nuclear waste disposal study. In the past, the company had formed teams by using a phone search to find those who they thought might be able to respond. This process was ve.ry time-consuming and spotty in its results. After seeing a demo of the TMC Indexer during a visit (for a different topic) someone realized that

331 the Indexer might help improve the team selection process: each of the 5000 technical employees had a text biography on file, including education, previous work experience, areas of expertise, previous projects, etc. By using the Indexer, this company was able to easily identify groups by topic, expertise, geographic area, previous company or agency contacts. They used this system successfully for several years, well into the period of obsolescence of the Lisp Machines running the application. This was our only sale of the TMC Indexer. The Indexer (and the prospect for all other software-only projects) was killed when the CM-1 was announced. In a nutshell, Thinking Machines wasn't big enough for both the VP who championed software and the VP who championed the CM-1. The CM-1 won; the Software VP left; the TMC Indexer died. 4. P R O J E C T 2: C M D R S ( C O N N E C T I O N M A C H I N E D O C U M E N T TRIEVAL SYSTEM)

RE-

In 1986, Thinking Machines was approached by an intelligence agency that was interested in whether the CM-1 offered any potential advantage for information retrieval tasks. As a result of this contact, an experiment was set up, benchmarking a task that at the time was being done on mainframes. This task involved searching for text documents against a ve.ry large number of query terms, corresponding to the batched questions of a number of analysts. Questions were batched for efficiency. Answers were returned in a batch, and then sorted out into bins corresponding to each analyst's question. Craig Stanfill and Brewster Kahle wrote a prototype system [8] that searched the database in serial chunks. A chunk of the database that exactly filled the memo .ry of a CM-1 was loaded; because of the massively parallel architecture of the CM, the memory was distributed evenly among the 64K processors. The query terms were then serially "broadcast" to all of the processors, which then searched their local memory for hits, a process that the CM performed very quickly. All hits could then be appended to the answer list, and the next section of the database loaded, etc. The CM-1 had ve~. slow I/O (only through its controlling front end), but the CM-2 had a fast I/O system designed into it. Following this experiment, Stanfill and Kahle asked the question: how might this fast search apply to commercial information retrieval? In commercial IR, users typically submit ve.ry small queries, so there is no advantage to the scheme above, which requires time serial in the number of memo .ry loads needed go through the entire database. Even though the memory of the CM-2 was ve.ry large by the standards of the day (32 MB), most commercial databases were much larger. Stanfill and Kahle devised a way of generating 1000-bit signatures for 100-word sections of a database. This allowed about 8:1 compression, so that potentially about 1/4 GB of (compressed) documents could be stored and searched (probabilistically) in the memo~, of one 1988-vintage CM-2 (1988 hardware cost: about $8M). Stanfill and Kahle also noted that, since many que~. terms could be searched for in a short time (less than 1/2 second, even for 100 search terms), a user could generate que .ry terms by pointing to a "good" document, once one was found, and using ALL the terms of the document as a que~.. They found experimentally that this led to high-quality search (high precision-recall product, where "precision" is the fraction of high-ranking documents actually relevant to one's query, and "recall" is the fraction of

332

all the relevant documents retrieved within the high-ranked set of documents). Thus they inadvertently reinvented the idea of "relevance feedback," initially described by Salton

[7]. This work led to the first commercial system based on "relevance feedback." Here's how it came about. Based on the earlier prototype, my group generated a demo using a Symbolics Lisp Machine with a fancy (for the time) mouse-driven interface. After typing in an initial query - which could be an English question or a "word salad," with no boolean operators required - a user would get back a ranked list of documents. Thereafter, the user could point and click at relevant documents or drag a mouse over relevant sections of a document, and search without further typing. The visionary President of Dow Jones Information Services, saw a demo and was wildly enthusiastic. Through his efforts, and despite resistance from his staff, he pushed through a sale of the system, and a joint project to build a commercial product, eventually known as CMDRS at Thinking Machines, and marketed as DowQuest. Thinking Machines wrote the back end - a memory-free-transaction system (i.e. one that kept no record of individual sessions or users, but simply took a set of query terms and returned pointers to a set of documents). Next-generation memories allowed two 32K processor CM-2's to each search signatures corresponding to 1/2 GByte of raw text in roughly 200 msec. (The final compression rate of text was only about 4:1 in the deployed system, since we found that adding common word pairs, e.g. White House, New Mexico, Artificial Intelligence, etc., dramatically improved search performance, offsetting the cost of handling only a smaller database.) Dow Jones produced the user interface, concentrating first on dumb lineoriented terminals, using a (ve .ry clunky) menu-driven interaction scheme, since that's what most users had in 1988. The plan was to eventually build a PC version as well, but Down Jones never followed through on it. Unfortunately the visiona .ry President - and through him Dow Jones marketing people - misunderstood CMDRS. They considered it a natural language system, and in the advertising and user manuals emphasized ability to use sentences rather than Boolean queries. In fact, CMDRS simply stripped out content words and ignored other words. (It had a list of all words in the database). Meanwhile the visiona .ry President lost interest in the product once the decision was made to buy, and moved on to investigating chaos and neural nets. In addition to technical and marketing problems, TMC - used to working with government agencies - was "taken to the cleaners" in negotiations with Dow Jones. During negotiations it became clear that Thinking Machines would not make much if any money on the Dow Jones sale and development project. However, the sale was viewed by TMC as very important p u b l i c i t y - too important to let a low profit margin interfere with it and we decided to go ahead, based on the idea that the development costs would be amortized over a number of other sales, made more likely by the visibility of the Dow Jones system, and the use of Dow Jones as a reference account. By many measures, this was a very successful project, constructed within the (very aggressive) schedule agreed to in negotiations by a team led by Bob Millstein. CMDRS was honored as Online Magazine product of the year (1989), and remained in service from 1989 through 1996, well past the obsolescence of the two 32K processor CM-2's with VAX 11/780 front ends. (Two systems were built to provide a "hot backup" in case of a failure -

333 of the prima .ry machine, but eventually, with the high reliability of the system, the two machines were loaded with different data, to provide a larger searchable database.) Throughout its lifetime, Dow Jones claimed that the DowQuest service was not profitable, and regularly threatened to turn the system o f f - thereby gaining concessions on hardware and software upgrades. The shortage of profits was also cited as excuse for not upgrading the user interface for PCs/workstations, insuring awkward operation and limited success of the service as PCs became widespread. We frequently asked ourselves: Would Dow Jones ever have bought the system if the demo had been of the system they ended up deploying? We think not. Starting in about 1989, Brewster Kahle and a small team built a PC interface for WAIS, an acronym for "Wide Area Information Server." This was the PC interface that Dow Jones had wanted to build, but never did. In addition, WAIS embodied the idea that the text servers on the Internet would be distributed, presaging the later World Wide Web explosion. In 1992, Brewster Kahle left Thinking Machines, along with a few other employees who had worked on this project, and founded WAIS, Inc. to commercialize a PC/Workstation version of CMDRS, for the most part developed while Kahle et al. were at Thinking Machines. In 1995, WAIS, Inc., with about a dozen employees, was purchased by America Online for about $15M.

5. P R O J E C T

3: L E G A L D A T A B A S E V E R S I O N O F C M D R S

We made several attempts to sell CMDRS to other database vendors. In one experiment, we quickly built a legal search demo for XYZ, Inc. (a well-known v e n d o r - not its real name). For this demo, we first built a list of legal technical terms by comparing word counts of news and legal databases, and keeping words that occurred much more often in the legal text as our legal lexicon. (We eliminated some common words that didn't seem to belong in a legal dictionary, only to discover later that most of them should have been kept.) Based on brief experiments, we also built special recognizers for case names and statute designators ("Massachusetts vs. Smith," "New Jersey HR-3224.05," etc.). Our test data was taken from real on-line interactions by legal users. We were given a large database, and the lists of documents retrieved for each query. Queries were made to our system by stripping out all the Boolean connectives, and searching on just the lists of terms from the queries. We brought in a legal expert to judge the results of our experiments, and to tune the system for maximum performance. In the judgement of our expert, we found a significant number of relevant cases (perhaps 50% more - my memory is hazy at this point) that had not been found by the existing Boolean search system; we also missed perhaps 5% of the articles found by the existing system. We - and our legal expert - were very excited by the quality of this result, which took on the order of two person-months total to achieve. We sent the results to the potential customer with high expectations. Our first warning should have come when the task of evaluation was given to the writers of the customer's existing legal search system. This group would naturally be expected to view its own credibility as being at stake in the comparison. The verdict from the customer was indeed grim, but the reason given was astounding: in their opinion, we had simply gotten the wrong answers. The "right answers" would have been an exact match

334

with what the Boolean system produced- it was viewed as the Gold Standard. No matter that the Gold Standard missed about 1/3 of the relevant articles. Our system didn't get the right answers. We didn't make the sale. As a postscript, all legal services, including the customer above, now proudly offer search services at least superficially identical to what we demoed to them in 1988. 6. P R O J E C T 4: P A C E : A U T O M A T I C C L A S S I F I C A T I O N O F U.S. C E N S U S LONG FORMS In 1989, we received a contact from Rob Creecy, a scientist at the U.S. Bureau of the Census. Creecy had seen the paper that Craig Stanfill and I had written on Memo .rybased reasoning [9], and felt that our method might be applicable to the task of classifying Census Long Forms into the appropriate Occupation and Industry categories of the respondents. Long Forms are given to 10% of the population, and have free text fields for respondents to describe their occupations and the industries in which they work. Through 1980, these returns were assigned to about 500 occupation and 250 industry categories by human operators, trained in the task, and working with a reference book, moved in 1980 to a computer terminal used by each operator. Starting in about 1983, M.V. Appel and others at the Census Bureau built a rule-based expert system to automate the classification task. They kept careful records on their effort, which required nearly 200 person-months to build, test, and tune the system. By the end of the project, the expert system, called AIOCS, assigned classification categories and confidence levels for each return. For each occupation and indust .ry category, a threshold was selected, and all classifications with confidence levels below that threshold were given to humans for hand classification. The outcome was a system that performed with the same accuracy that had been obtained by humans: AIOCS could classify about 47% of the input data at human levels of accuracy [1,2]. At some point after AIOCS was completed, Rob Creecy had tried to write a memo .rybased reasoning (MBR) system to do the same task. The basic idea of MBR is to use the classification of the nearest-neighbor (of k-nearest neighbors) to decide how a novel item should be classified. Rob's system worked, but not as well as AIOCS. Craig Stanfill and I had proposed a metric for judging nearness that applies to both symbolic and numeric data, and, with lots of computing power, had the possibility of trying lots of experiments in a short time to build and tune a system. The Census Bureau prides itself on being forward-looking- it was the first customer for a commercial c o m p u t e r - but it had fallen far behind the times. All processing was still mainframe-based. Creecy argued within the Bureau for an experiment and small budget to car .ry it out. He/we received approval. We produced a very successful benchmark in a very short time: About 61% of long forms were handled at human levels of accuracy vs. 47% for expert system. Moreover, the entire project required only 4 person-months to develop vs. 200 for AIOCS! A rough calculation showed that deploying the system would have saved more money (in salaries for human classifiers) than the purchase price of the Connection Machine hardware.

335 So this should have been a success and should have led to a sale. We had an in-house champion, an impressive demonstration, a very favorable cost-benefit analysis, and there was no Not-Invented-Here Syndrome at work. So was this an obvious win? Alas, no. We did get a nice paper out of this [3], but there was no sale. Why? Contractual agreements had already been made with human classifiers before a purchase decision could be made. No budget savings would be possible. We could try again in ten years... 7. O T H E R P R O J E C T S Over the years we completed a number of projects with customers, some of which led to sales, some to papers; all were great learning experiences (no sarcasm intended). One set of experiments with a credit-card issuing bank helped us to pioneer many tasks that have now been deployed as data mining applications. These included experiments in learning to recognize good credit card customers (for the sake of retaining them), and for rooting out bad customers (so that their credit lines would not be increased, or that cards would not be issued to them). We tried and compared many different learning methods. For example, in one experiment, a neural net predicted a set of cardholders ten times more likely than the general cardholder population to "attrite" (i.e. not renew their cards), and in another experiment CART and K-nearest neighbors outperformed ten other methods tested to find people about to miss a payment. In targeting marketing experiments for catalog sales customers, we used simulated annealing and genetic engineering-like mating methods to generate optimal catalogs and mailing lists. By "optimal" I mean that the solution found the right number of catalogs, catalog sizes, and contents of the catalogs to maximize net return, based on prior buying behavior, after mailing costs (variable to reflect the different sizes of catalogs) were subtracted. In yet another experiment, we showed that we could perform automatic classification of news articles with near-human accuracy. Using a nearest neighbor method that called CMDRS as a subroutine, our system was able to assign keywords to 92% of articles with a correctness performance equal to human editors. (As in the Census application, our system referred articles that fell below a confidence threshold to human experts.) This work is described in [6]. Based on these experiments, and on parallel relational database prototypes, Thinking Machines sold two large systems to American Express and Epsilon, Amex's subsidia .ry for mailing list generation and software development. These systems replaced several mainframes and cut turnaround time for answering marketing questions from several weeks to less than a day. 8. T H E F I N A L C H A P T E R

(~11)

A number of factors conspired to doom the old Thinking Machines: 9 the end of the cold war 9 big cuts in federal research funds

336

9 competition from "killer micros," shared-memory multiprocessors, and other MPP manufacturers. ("Killer micros" is Eugene Gross's phrase. It refers to the overwhelming effect of ever-cheaper and ever-more-powerful commodity hardware. MPP hardware required special software, and over the three years or more required to bring an entire MPP system to market, the PC's and workstations would have increased their cost-performance by a factor of four or more, making any MPP look bad in cost-performance terms.) The MPP competitors were very aggressive; some had deep pockets (Intel, IBM) and others lacked scruples (Kendall Square Research has been embroiled in court cases over misrepresenting income. KSR is charged with claiming income for donated computers, artificially boosting their bottom line, and thus receiving artificially inflated prices for their stock. It was hard to compete against a company that offered computers - quite excellent ones - for much less than cost! [4]). 9 bad (or unfortunate) technical decisions: in 1989, TMC chose to go with the SPARC chip for the CM-5 rather than the MIPS chip. In retrospect, MIPS delivered several generations of faster, compatible chips before even one faster SPARC generation arrived, and therefore MIPS would have been the better choice by far. DMA memo .ry should have been designed into the CM-5 but was not, with the net effect that communication and computation could not be overlapped. 9 To top this off, there were management power struggles and cases of less-thanoptimal decision-making in response to all the factors above. In the end, Thinking Machines survived, but in a much changed and smaller form. It is now a software vendor with about 200 employees vs. about 650 at its peak. It is possible that Thinking Machines would have avoided calamity if it had wholeheartedly embraced the data mining/commercial data processing goals at the time I and people around me began pushing for this (about 1988). However, there was vast inertia and momentum in direction of scientific computing- floating-point oriented scientific computing would have let the company enter the mainstream, whereas the original dream of making AI possible using NETL marker-passing methods on the non-floating-point equipped CM-1 had yielded mostly university interest. By the time we began lobbying for increasing TMC's commercial thrust, the people associated with commercial applications were outnumbered by at least 10 to 1 within the company, and the net management, sales, and marketing attention given to these non-scientific applications was in about the same ratio. To be fair, "the attack of the killer micros" would almost certainly have doomed TMC's hardware business in any case, but the end would have been less sudden, giving the company a better opportunity to shift its focus without mass layoffs. A Thinking Machines team under Steve Smith completed "Darwin," a package of data mining tools, in 1995. Darwin has been coverted to run on a wide variety of platforms, and is being sold to commercial customers. In the end, TMC may have won some battles but it lost the war. Data mining has become mainstream and "hot." But the data mining pieces have been picked up not so much by Thinking Machines, but by IBM, Sun, Dun & Bradstreet, Amex, and perhaps 100 other companies, many of them small. IBM SP-2's are hot sellers as database mining

337 engines and mainframe replacements; ironically, IBM had argued repeatedly throughout the 80's and early 90's that MPPs would never replace mainframes. 9. O V E R A L L M E S S A G E S So what can one take from all this? Here is an attempt to sum up some of the lessons we've learned, which apply to sales of any large system or to sales of systems that introduce big changes in customer operations: Good applications must show cost savings, but only 1) ve.ry large installations, or 2) highly replicated applications (i.e. a mass market) can support the high costs of development. (Most of the broad applications so far are generic - e.g. Oracle, SAS, SQL - and do not have novel functionality - e.g. Darwin.) Libraries of standard applications would be very useful, but they are chicken-or-egg problems - very costly to build, and no one may be willing to pay to develop them until there is some guarantee of costperformance. But cost savings do not guarantee sales, as in the case of the Census Bureau case above. Successful organizations inherently resist change; unsuccessful organizations can't afford new projects. Perceived risks must be addressed: the probability of technical success, job loss, retraining, user acceptance, scalability, maintenance and updating, migration to future platforms, etc. Customer confidence is important. Reference accounts can help, once a business gets rolling. But especially for a small start-up, it is difficult to overcome customer fears about whether the company will be around next year or the year after that. This gives established companies a huge advantage. All sales involve solving people problems, never just technical problems. To succeed, it is important for the vendor to understand the customer's organization, operations, decision making, power structure, individual motivations, etc. Sales of large systems are unlikely unless the is an internal champion. But big-ticket items also need an internal consensus. Overall, it is critical to offer clear benefits with manageable risks. The not-invented-here syndrome is often a problem. Involving the customer, e.g., with joint projects, can help get past this problem.

Acknowledgments I would like to thank the wonderful people at Thinking Machines who worked on and supported the projects listed above: Craig Stanfill, John Mucci, Bob Millstein, Marvin Denicoff, Sheryl Handler, Steve Smith, Gordon Linoff, Brij Masand, Franklin Davis, Kath Durant, Michael Berry, Lily Li, Kurt Thearling, Mario Bourgoin, Ga.ry Drescher, Ellie Baker, Tracy Shen, Chris Madsen, Danny Hillis, Paul Mott, Shaun Keller, Howard Resnikoff, Brewster Kahle, and Paul Rosenbloom. It is impossible to list everyone, and I apologize to those I've left out. I would also like to thank especially Bill Dunn, formerly of Dow Jones, and Rob Creecy of the US Bureau of the Census.

REFERENCES 1. M.V. Appel, Automated indust .ry and occupation coding, Development of Statistical

338

2.

3.

4. 5. 6.

7. 8. 9. 10.

Tools Seminar on Development of Statistical Expert Systems (DOSES), Luxembourg, December 1987. M.V. Appel and E. Hellerman, Census Bureau experiments with automated indust .ry and occupation coding, Proceedings of the American Statistical Association, 1983, 32-40. Robert Creecy, Brij Masand, Stephen Smith and David Waltz, Trading MIPS and Memo .ry for Knowledge Engineering, Communications of the A CM, 35, 8, August 1992, 48-64. Josh Hyatt, Kendall Square plans to restate '92 fiscal results, Boston Globe, first page, business section, December 3, 1993. Danny Hillis, The Connection Machine, Cambridge, MA: MIT Press, 1985. Brij Masand, Gordon Linoff, and David Waltz, Classifying News Stories Using Memo .ry Based Reasoning, Proceedings of the 15th Annual A CM/SIGIR Conference, Copenhagen, Denmark, 1992, 59-65. Gerald Salton, The SMART Retrieval System - Experiment in Automatic Document Classification, Cambridge, MA: MIT Press, 1971. Craig Stanfill and Brewster Kahle, Parallel free test search on the Connection Machine, Communications of the A CM, 29, 12, December 1986, 1229-1239. Craig Stanfill and David L. Waltz, Toward Memory-Based Reasoning, Communications of the ACM 29, 12, December 1986, 1213-1228. David L. Waltz, Massively Parallel AI, International Journal of High Speed Computing, 5, 3, 1993, 491-501.

339 David W a l t z David Waltz is Vice President of the Computer Science Research Division of NEC Research Institute in Princeton, N J, and Adjunct Professor of Computer Science at Brandeis University in Waltham, MA. From 1984 to 1993, he was Senior Scientist and Director of Advanced Information Systems at Thinking Machines Corporation in Cambridge, MA, and Professor of Computer Science at Brandeis. From 1974 to 1983 he was a Professor of Electical and Computer Engineering at the University of Illinois at Urbana-Champaign. Dr. Waltz received SB, SM, and Ph.D. degrees from MIT, in 1965, 1968, and 1972 respectively. His research interests have included constraint propagation, especially for computer vision, massively parallel systems for both relational and text databases, memory-based and case-based reasoning systems, protein structure prediction using hybrid neural net and memory-based methods, connectionist models for natural language processing, and natural language front ends for relational databases. He is President-Elect of the American Association of Artificial Intelligence and was elected a fellow of AAAI in 1990. He was President of ACM SIGART from 1977-9, served as Executive Editor of Cognitive Science from 1983-6, AI Editor for Communications of the ACM 1981-4, and is a senior member of IEEE, and a member of ACM, ACL (Association for Computational Linguistics), AAAI, and the Cognitive Science Society. Home Page: http://www.neci.nj.nec.com/homepages/waltz.html

341

APPENDIX

This appendix contains references to all the papers that originally appeared in four workshops: 1. PPAI91 - Workshop for Parallel Processing in Artificial Intelligence, IJCAI 1991, Sydney, Australia. 2. SSS93- Stanford Spring Symposium on Massively Parallel Artificial Intelligence, 1993, Stanford, CA. 3. PPAI93 - Workshop for Parallel Processing in Artificial Intelligence - 2, IJCAI 1993, Chambery, France. 4. PPAI95 - Workshop for Parallel Processing in Artificial Intelligence - 3, IJCAI 1995, Montreal, Canada. REFERENCES

D. Abramson, J Abela. A Parallel Genetic Algorithm for Solving the School Timetabling Problem. PPAI91. Emmanuel D. Adamides. Celluar Objects for Cellular Computers. SSS93. Ali M. A1Haj, Eiichiro Sumita, and Hitoshi Iida. A Parallel Text Retrieval System. PPAI95. Ed P. Andert Jr. and Thomas Bartolac. Parallel Neural Network Training. SSS93. Jean-Marc Andreoli, Paolo Ciancarini, and Remo Pareschi. Rarallel Searching with Multisets-as-Agents. SSS93. Ulrich Assman. A Model for Parallel Deduction. PPAI93. Tito Autrey and Herbert Gelernter. Parallel Heuristic Search. SSS93. Frank W. Bergmann and J. Joachim Quantz. Parallelizing Description Logics. PPAI95. Pierre Berlandier. Problem Partition and Solvers Coordination in Distributed Constraint Satisfaction. PPAI95. 10. Mark S. Berlin. Toward An Architecture Independent High Level Parallel Programming Model For Artificial Intellingence. PPAI93. 11. B. Boutsinas, Y. C. Stamatiou, and G. Pavlides. Parallel Reasoning using Weighted Inheritance Networks. PPAI95. 12. Jon Bright, Simon Kasif, and Lewis Stiller. Exploiting Algebraic Structure in Parallel State-Space Search (Extended Abstract). SSS93. 13. Daniel J. Challou, Maria Gini, and Vipin Kumar. Toward Real-Time Motion Planning. PPAI93. 14. Daniel J. Challou, Maria Gini, and Vipin Kumar. Parallel Search Algorithms for Robot Motion Planning. SSS93. 15. C.-C. Chu, J.C. Aggarwal. An Experimental Parallel Implementation of a Rule-Based Image Interpretation System. PPAI91.

342 16. Diane J. Cook. Fast Information Distribution for Massively Parallel IDA* Search. SSS93. 17. Diane J. Cook and Shubha Nerur. Maximizing the Speedup of Parallel Search Using HyPS. PPAI95. 18. J.-P. Corriveau. Constraint Satisfaction in Time-Constrained Memo .ry. PPAI91. 19. Van-Dat Cung and Lucien Gotte. A First Step Towards the Massively Parallel GameTree Search: a SIMD Approach. PPAI93. 20. R.F. DeMara, H Kitano. The MP-1 Benchmark Set for Parallel AI Architectures. PPAI91. 21. J. Denzinger. Parallel Equational Deduction by Team Work Completion. PPAI91. 22. G. Destri and P. Marenzoni. Performance Evaluation of Distributed Low-Level Computer Vision Algorithms. PPAI95. 23. Rumi M. Dubash and Farokh B. Bastani. Decentralized, Massively Parallel Path Planning and its Application to Process-Control and Multi Robot Systems. SSS93. 24. Wolfgang Ertel. Random Competition: A Simple, but Efficient Method for Parallelizing Inference Systems. PPAI91. 25. Wolfgang Ertel. Massively Parallel Search with Random Competition. SSS93. 26. Matthew P. Evett, William A. Anderson, and James A. Hendler. Massively Parallel Support for Computationally Effective Recognition Queries SSS93. 27. Scott E. Fahlman. Some Thoughts on NETL, 15 Years Later. SSS93. 28. M. Factor, S. Fertig, D.H. Gelernter. Using Linda to Build Parallel AI Applications. PPAI91. 29. Michael Fisher. An Open Approach to Concurrent Theorem-Proving. PPAI95. 30. U. Furbach. Splitting as a source of Parallelism in Disjunctive Logic Programs. PPAI91. 31. Edmund Furse and Kevin H. Sewell. Automatic Parallelisation of LISP program. PPAI93. 32. J.-L. Gaudiot, C.A. Montgomery, R.E. Strumberger. Data-Driven Execution of Natural Language Parsing. PPAI91. 33. James Geller. Upward-Inductive Inheritance and Constant Time Downward Inheritance in Massively Parallel Knowledge Representation. PPAI91. 34. James Geller. Massively Parallel Knowledge Representation. SSS93. 35. G. Grot]e. Actor Coordination in Parallel Planning. PPAI91. 36. L.O. Hall, D.J. Cook, W. Thomas. Parallel Window Search using TransformationOrdering Iterative-Deepening A*. PPAI91. 37. Sanda M. Harabagiu and Dan I. Moldovan. A Marker-Propagation Algorithm for Text Coherence. PPAI95. 38. R. Hasegawa, H. Fujita, M. Fujita. A Parallel Model-Generation Theorem Prover with Ramified Term-Indexing. PPAI91. 39. James A. Hendler. Massively-Parallel Marker-Passing in Semmantic Networks. PPAI91. 40. James A. Hendler. The Promise of Massive Parallelism for AI. SSS93. 41. Dominik Henrich. Initialization of Parallel Branch-and-bound Algorithms. PPAI93. 42. Tetsuya Higuchi, Tatsuya Niwa, Toshio Tanaka, Hitoshi Iba, Tatsumi Furuya. A Parallel Architecture for Genetic Based Evolvable Hardware. PPAI93.

343 43. Lothar Hotz. An Object-oriented Approach for Programming the Connection Machine. PPAI93. 44. Walter Hower. Parallel Global Constraint Satisfaction. PPAI91. 45. Walter Hower and Stephan Jacobi. Parallel Distributed Constraint Satisfaction, PPAI93, SSS93. 46. Ken Jung, Evangelos Simoudis, and Ramesh Subramonian. Parallel Induction Systems Based on Branch and Bound. SSS93. 47. George Ka .rypis, Vipin Kumar. Unstructured Tree Search on SIMD Parallel Computers: Experimental Results. SSS93. 48. P. Kefalas, T.J. Reynolds. Hill-Climbing and Genetic Algorithms coded using ORparallel Logic Plus Meta-Control. PPAI91. 49. S. Keretho, R. Loganantharaj, V. N. Gudivada. Parallel Path-Consistency Algorithms for Constraint Satisfaction PPAI91. 50. Hiroaki Kitano. Massively Parallel AI and its Application to Natural Language Processing PPAI91. 51. Hiroaki Kitano. Massively Parallel Artificial Intelligence and Grand Challenge AI Applications. SSS93. 52. Richard Kufrin. Decision Trees on Parallel Processors. PPAI95. 53. Deepak Kumar. An AI Architecture Based on Message Passing. SSS93. 54. F. KurfeB. Massive Parallelism in Inference Systems. PPAI91. 55. Franz Kurfefl. Massive Parallelism in Logic. SSS93. 56. Yannick Lallement, Thierry Cornu, and St~phane Vialle. An Abstract Machine for Implementing Connectionist and Hybrid Systems on Multi-processor Architectures. PPAI93. 57. Yannick Lallement, Thierry Cornu, and St~phane Vialle. Application development under ParCeL-1. PPAI95. 58. Trent E. Lange. Massively-Parallel Inferencing for Natural language Understanding and Memory Retrieval in Structured Spreading-Activation Networks. SSS93. 59. Eunice(Yugyung) Lee and James Geller. Parallel Operations on Class Hierarchies with Double Strand Representations. PPAI95. 60. Q.Y. Luo, P.G. Hend .ry, and J.T. Buchanan. Comparison of Different Approaches for Solving Distributed Constraint Satisfaction Problems. SSS93. 61. E.L. Lusk, W.W. McCune, J.K. Slane. ROO- a Parallel Theorem Prover. PPAI91. 62. A. Mahanti, C.J. Daniels. A SIMD Approach to Parallel Heuristic Search. PPAI91. 63. Takao Mohri, Masaaki Nakamura, and Hidehiko Tanaka. Weather Forecasting Using Memo .ry-Based Reasoning. PPAI93. 64. D. Moldovan, W. Lee, C. Lin. A Marker Passing Parallel Processor for AI. PPAI91. 65. Petri Myllymaki and Henry Tirri. Bayesian Case-Based Reasoning with Neural Networks. SSS93. 66. P.C. Nelson, A.A. Toptsis. Superlinear Speedup Using Bidirectionalism and Islands. PPAI91. 67. J. Thomas Ngo and Joe Marks. Massively Parallel Genetic Algorithms for Physically Correct Articulated Figure Locomation. SSS93. 68. T. Nishiyama, O. Katai, T. Sawaragi, T. Katayama. Multiagent Planning by Distributed Constraint Satisfaction. PPAI91.

344 69. Katsumi Nitta. Experimental Legal Reasoning System on Parallel Inference Machine. PPAI91. 70. Katsumi Nitta, Stephen Wong. The Role of Parallelism in Parallel Inference Applications. SSS93. 71. Kozo Oi, Eiichiro Sumita, Osamu Furuse, Hitoshi Iida, Hiroaki Kitano. Toward Massively Parallel Spoken Language Translation. PPAI93. 72. R. Oka. Parallelism for Heterarchical Aggregation of Knowledge in Image Understanding PPAI91. 73. Gdrald Ouvradou, Aymeric Poulain Maubant, and Andr~ Th~paut. Hybrid systems on a Multi-Grain Parallel Architecture. PPAI93. 74. Robert A. Pearson. A Coarse Grained Parallel Induction Heuristic. PPAI93. 75. G. Pinkas. Constructing Proofs using Connectionist Networks. PPAI91. 76. Curt Powley, R.E. Korf. IDA* on the Connection Machine. PPAI91. 77. Curt Powley, Richard E. Korf, and Chris Ferguson. Parallelization of Tree-Recursive Algorithms on a SIMD Machine. SSS93. 78. S. Rangoonwala, G. Neugebauer. Distributed Failure Production: Sequential Theorem Proving on a Parallel Machine. PPAI91. 79. Thilo Reski and Willy B. Strothmann. A Dense, Massively Parallel Architecture. PPAI93. 80. J. Riche, R. Whaley, J. Barlow. Massively Parallel Processing and Automated Theorem Proving. PPAI91. 81. James D. Roberts. Associative Processing: A Paradigm for Massively Parallel AI. SSS93. 82. Ian N. Robinson. PAM: Massive Parallelism in Support of Run-Time Intellignece. SSS93. 83. Satoshi Sato. MIMD Implementation of MBT3. PPAI93. 84. James G. Schmolze, Wayne Snyder. Using Confluence to Control Parallel Production Systems. PPAI93. 85. J. Schumann and M. Jobmann. Scalability of an OR-parallel Theorem Prover on a Network of Transputer- A Modelling Approach. PPAI93. 86. Johann Schumann. SiCoTHEO-Simple Competitive parallel Theorem Provers based on SETHEO. PPAI95. 87. S. Sei, N. Ichiyoshi. Experimental Version of Parallel Computer Go-Playing System "GOG". PPAI91. 88. R.V. Shankar, S. Ranka. Parallel Processing of Sparse Data Structures for Computer Vision. PPAI91. 89. Lokendra Shastri. Leveraging Massive Parallelism for Tractable Reasoning- taking inspiration from cognition. SSS93. 90. Kilian Stoffel, Ian Law, and B~at Hirsbrunner. Fuzzy Logic controlled dynamic allocation system. PPAI93. 91. Kilian Stoffel, James Hendler, and Joel Saltz. PARKA on MIMD-supercomputers. PPAI95. 92. Salvatore J. Stolfo, Hasanat Dewan, David Ohsie, Mauricio Hernandez and Leland Woodbury. A Parallel and Distributed Environment for Database Rule Processing: Open Problems and Future Directions. SSS93.

345 93. S.Y. Susswein, T.C. Henderson, J.L. Zachary. et.al. Parallel Path Consistency. PPAI91. 94. G. Sutcliffe. A Parallel Linear UR-Derivation System. PPAI91. 95. Christian B. Suttner. Competition versus Cooperation. PPAI91. 96. Christian B. Suttner and Manfred R. Jobmann. Simulation Analysis of Static Partitioning with Slackness. PPAI93. 97. Christian B. Suttner. Static Partioning with Slackness. PPAI95. 98. H. Tomabechi, H. Iida, T. Morimoto, A. Kurematsu. Graph-based CP in MassivelyParallel Memo~.: Toward Massively-Parallel NLP. PPAI91. 99. Dave Waltz. Innovative Massively Parallel AI Applications. SSS93. 100A. Martin Wildberger. Position Statement: Innovative Application of Massive Parallelism. SSS93. 101Stefan Winz and James Geller. Methods of Large Grammar Representation in Massively Parallel Parsing Systems. SSS93. 102Z4.J. Wise. Introduction to PMS-Prolog: A Distributed Coarse-Grain-Parallel Prolog with Processes, Modules and Streams. PPAI91. 103Andreas Zell, Niels Mache, Markus Huttel, and Michael Vogt. Massive Parallelism in Neural Network Simulation. SSS93. 104.Y. Zhang, A.K. Mackworth. Parallel and Distributed Constraint Satisfaction. PPAI91.