Lecture Notes in Artificial Intelligence Subseries of Lecture Notes in Computer Science Edited by J. G. Carbonell and J. Siekmann
Lecture Notes in Computer Science Edited by G. Goos, J. Hartmanis, and J. van Leeuwen
2636
Berlin Heidelberg New York Barcelona Hong Kong London Milan Paris Tokyo
Eduardo Alonso Daniel Kudenko Dimitar Kazakov (Eds.)
Adaptive Agents and Multi-Agent Systems Adaptation and Multi-Agent Learning
Series Editors Jaime G. Carbonell, Carnegie Mellon University, Pittsburgh, PA, USA Jörg Siekmann, University of Saarland, Saarbrücken, Germany Volume Editors Eduardo Alonso City University Department of Computing London EC1V 0HB, UK E-mail:
[email protected] Daniel Kudenko University of York Department of Computer Science Heslington, York YO10 5DD, UK E-mail:
[email protected] Dimitar Kazakov University of York Department of Computer Science Heslington, York, YO10 5DD, UK E-mail:
[email protected] Cataloging-in-Publication Data applied for A catalog record for this book is available from the Library of Congress. Bibliographic information published by Die Deutsche Bibliothek Die Deutsche Bibliothek lists this publication in the Deutsche Nationalbibliografie; detailed bibliographic data is available in the Internet at
. CR Subject Classification (1998): I.2.11, I.2, D.2, C.2.4, F.3.1, D.3.1, H.5.3, K.4.3 ISSN 0302-9743 ISBN 3-540-40068-0 Springer-Verlag Berlin Heidelberg New York This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer-Verlag. Violations are liable for prosecution under the German Copyright Law. Springer-Verlag Berlin Heidelberg New York a member of BertelsmannSpringer Science+Business Media GmbH http://www.springer.de © Springer-Verlag Berlin Heidelberg 2003 Printed in Germany Typesetting: Camera-ready by author, data conversion by PTP-Berlin GmbH Printed on acid-free paper SPIN: 10929377 06/3142 543210
Preface
Adaptive Agents and Multi-Agent Systems is an emerging and exciting multidisciplinary area encompassing computer science, software engineering, biology, and cognitive and social sciences. When designing agent systems, it is impossible to foresee all the potential situations an agent may encounter and specify the agent's behavior optimally in advance. Agents therefore have to learn from and adapt to their environment. This task is even more complex when the agent is situated in an environment that contains other agents with potentially different capabilities and goals. Multiagent learning, i.e., the ability of the agents to learn how to co-operate and compete, becomes central to agency in such domains.

In 2000 E. Alonso and D. Kudenko organized the First Symposium on Adaptive Agents and Multi-Agent Systems (AAMAS, not to be confused with the Joint International Conference on Autonomous Agents and Multi-Agent Systems launched a year later) as part of the 2001 convention of the Society for the Study of Artificial Intelligence and the Simulation of Behaviour (SSAISB). The main goals that this symposium aimed to achieve were to:

– increase awareness and interest in adaptive agent research in the artificial intelligence community and encourage further research;
– encourage collaboration between machine learning experts and agent systems experts;
– give a representative overview of the current research in the area of adaptive agents world-wide.

Fifteen papers from authors all around the world (Taiwan, UK, France, The Netherlands, Portugal, USA, Austria, and Turkey) were presented at this symposium, held in York, UK in March 2001. The success of this first symposium encouraged the chairs to make it an annual event. A Second Symposium on Adaptive Agents and Multi-Agent Systems (AAMAS-2), this time also co-chaired by D. Kazakov, was held at Imperial College, London, UK as part of the Annual SSAISB Convention in April 2002. There were 16 papers presented from Canada, France, Portugal, UK, USA, Belgium, and The Netherlands. This initiative continued with the organization of the Third Symposium on Adaptive Agents and Multi-Agent Systems (AAMAS-3) held in Aberystwyth, Wales, in April 2003. The created momentum also led to the establishment of a Special Interest Group on Agents that Learn, Adapt and Discover (ALAD SIG) within the European Network of Excellence on Agent-Based Computing,
AgentLinkII. The success of the symposia and related initiatives strengthens our belief that the relatively young research area of adaptive agents will continue to grow and attract increasing attention in the future.

The volume you have in your hands is a compilation of the best AAMAS and AAMAS-2 papers. Two more papers based on the AAMAS and AAMAS-2 invited talks have been added, those of E. Plaza (IIIA – Institut d'Investigació en Intel·ligència Artificial, Spanish Scientific Research Council) and S. Džeroski (Jožef Stefan Institute, Department of Intelligent Systems, Slovenia). The volume has been completed with contributions by leading researchers in the area of adaptation and multi-agent learning. We have structured the volume into three main sections: Learning, Cooperation, and Communication; Emergence and Evolution in Multi-Agent Systems; and Theoretical Foundations of Adaptive Agents.

No doubt, the ability to communicate and cooperate in multi-agent systems where groups of agents try to get coordinated to achieve common goals is of great importance. Agents need to continuously adapt their communication policies and cooperation strategies as they interact with each other. The first section of this volume consists of six papers on this issue. E. Plaza and S. Ontañón introduce a framework, Cooperative Multiagent Learning, to analyze the benefit of case exchanges in distributed case-based learning in multi-agent systems. The second paper, by S. Kapetanakis, D. Kudenko and M.J.A. Strens, reports on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. In particular, they focus on two approaches, one based on a new selection strategy for Q-learning, and the other on model estimation with a shared action-selection protocol. L. Nunes' and E. Oliveira's paper describes a technique that enables a heterogeneous group of learning agents to improve its learning performance by exchanging advice. The evolution of cooperation and communication as a function of the environmental risk is studied in the fourth paper, by P. Andras, G. Roberts and J. Lazarus. M. Rovatsos, G. Weiß and M. Wolf present an interaction learning meta-architecture, InFFrA, as one possible answer to multi-agent learning challenges such as diversity, heterogeneity and fluctuation, and introduce the opponent classification heuristic ADHOC. The section finishes with a paper by H. Brighton, S. Kirby and K. Smith, who use multi-agent computational models to show that certain hallmarks of language are adaptive in the context of cultural transmission.

The second section is dedicated to emergence and evolution in multi-agent systems. In these systems, individuals do not only learn in isolation, and in their own life cycle, but also evolve as part of a group (species) through generations. The evolutionary pressure in a changing environment leads to a trade-off
between the skills that are inherited, and the individual’s ability to learn. An evolving multi-agent system resembles in many ways a genetic algorithm, as both draw their inspiration from nature. However, the former has the advantage of being capable of discriminating between the genotype (inherited features) of an individual, and its phenotype, or what becomes of it in life under continuous interaction with the environment. A multi-agent system, whether shaped by evolution or not, can reach a degree of complexity at which it is not possible to accurately predict its overall behaviour from that of one of its components. Examples of such emergent behaviour can be found in various sciences, from biology to economics and linguistics. In all of them, multi-agent simulations can provide a unique insight. On the other hand, these fields provide inspirations and often lend some of their tools to the software engineering approach based on adaptive and learning agents.
In the first paper of the second section, P. De Wilde, M. Chli, L. Correia, R. Ribeiro, P. Mariano, V. Abramov and J. Goossenaerts investigate the repercussions of maintaining a diversity of agents, and study how to combine learning as an adaptation of individual agents with learning via selection in a population. L. Steels's paper surveys some of the mechanisms that have been demonstrated to be relevant to evolving communication systems in software simulation or robotic experiments. In the third paper, by G. Picard and M.-P. Gleizes, groups of agents are considered as self-organizing teams whose collective behavior emerges from interaction. P. Marrow, C. Hoile, F. Wang and E. Bonsma describe experiments in the DIET (Decentralized Information Ecosystem Technologies) agent platform that uses evolutionary computation to evolve preferences of agents in choosing environments so as to interact with other agents representing users with similar interests. H. Turner and D. Kazakov's paper assesses the role of genes promoting altruism between relatives as a factor for survival in the context of a multi-agent system simulating natural selection. A paper by S. van Splunter, N.J.E. Wijngaards and F.M.T. Brazier closes this section. Their paper focuses on automated adaptation of an agent's functionality by means of an agent factory. The structure of the agent is based on the dependencies between its components. We have included the paper by van Splunter et al. in this section because a component-based structure of an agent can be understood as a (holonic) multiagent system, and its adaptation, thus, as the emergence of behaviors through the interaction of its components.
The first two sections of this volume focus on the description of learning techniques and their application in multi-agent domains. The last section has a different flavor. No doubt, designing and implementing tools is important. But they are just tools. We also need sound theories on learning multi-agent systems, theories that would guide our future research by allowing us to better analyze our applications. The last section, Theoretical Foundations of Adaptive Agents, consists, as do the previous sections, of six papers.
The first one, by J.M. Vidal, introduces some of the most relevant findings in the theory of learning in games. N. Lacey and M.H. Lee show the relevance of philosophical theories to agent knowledge base (AKB) design, implementation and behavior. In the third paper, P.R. Graça and G. Gaspar propose an agent architecture where cognitive and learning layers interact to deal with real-time problems. W.T.B. Uther and M. Veloso describe the Trajectory Tree, or TTree, algorithm that uses a small set of supplied policies to help solve a Semi-Markov Decision Problem (SMDP). The next paper, by C.H. Brooks and E.H. Durfee, uses Landscape Theory to represent learning problems and compares the usefulness of three different metrics for estimating ruggedness of learning problems in an information economy domain. Last, but not least, S. Džeroski introduces relational reinforcement learning, a method that, by working on relational representations, can be used to approach problems that are currently out of reach for classical reinforcement learning approaches.

All in all, an attempt has been made to produce a balanced overview of adaptive agents and multi-agent systems, covering both theory and practice, as well as a number of different techniques and methods applied to domains such as markets, communication networks, and traffic control. Indeed, the volume includes papers from both academics and industry in Spain, UK, Portugal, Germany, The Netherlands, Belgium, USA, and Slovenia. We would like to acknowledge all the contributors for their hard work and infinite patience with the editors. Also, this volume would not exist without the commitment of the "adaptive and learning agents and multi-agent systems" community. In particular, we are thankful to the ALAD community and all those involved in the organization of the AAMAS symposia. We would also like to thank the SSAISB and AgentLinkII for their support.
London, March 2003
Eduardo Alonso Daniel Kudenko Dimitar Kazakov
Reviewers
Chris Child – City University, London
Kurt Driessens – Catholic University of Leuven
Pete Edwards – University of Aberdeen
Michael Fisher – Manchester Metropolitan University
Christophe Giraud-Carrier – University of Bristol
Lyndon Lee – British Telecom Laboratories, Ipswich
Michael Luck – University of Southampton
David Mobach – Vrije Universiteit, Amsterdam
Eugénio Oliveira – University of Porto
Ana Paula Rocha – University of Porto
Michael Schroeder – City University, London
Kostas Stathis – City University, London
Sander van Splunter – Vrije Universiteit, Amsterdam
Niek Wijngaards – Vrije Universiteit, Amsterdam
Table of Contents
Learning, Co-operation, and Communication

Cooperative Multiagent Learning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
   Enric Plaza, Santiago Ontañón

Reinforcement Learning Approaches to Coordination in Cooperative
Multi-agent Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
   Spiros Kapetanakis, Daniel Kudenko, Malcolm J.A. Strens

Cooperative Learning Using Advice Exchange . . . . . . . . . . . . . . . . . . . . . . . . 33
   Luís Nunes, Eugénio Oliveira

Environmental Risk, Cooperation, and Communication Complexity . . . . . 49
   Peter Andras, Gilbert Roberts, John Lazarus

Multiagent Learning for Open Systems: A Study in Opponent
Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
   Michael Rovatsos, Gerhard Weiß, Marco Wolf

Situated Cognition and the Role of Multi-agent Models in
Explaining Language Structure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 88
   Henry Brighton, Simon Kirby, Kenny Smith
Emergence and Evolution in Multi-agent Systems

Adapting Populations of Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
   Philippe De Wilde, Maria Chli, L. Correia, R. Ribeiro, P. Mariano,
   V. Abramov, J. Goossenaerts

The Evolution of Communication Systems by Adaptive Agents . . . . . . . . . 125
   Luc Steels

An Agent Architecture to Design Self-Organizing Collectives:
Principles and Application . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
   Gauthier Picard, Marie-Pierre Gleizes

Evolving Preferences among Emergent Groups of Agents . . . . . . . . . . . . . . . 159
   Paul Marrow, Cefn Hoile, Fang Wang, Erwin Bonsma

Structuring Agents for Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
   Sander van Splunter, Niek J.E. Wijngaards, Frances M.T. Brazier

Stochastic Simulation of Inherited Kinship-Driven Altruism . . . . . . . . . . . . 187
   Heather Turner, Dimitar Kazakov
Theoretical Foundations of Adaptive Agents

Learning in Multiagent Systems: An Introduction from a
Game-Theoretic Perspective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
   José M. Vidal

The Implications of Philosophical Foundations for Knowledge
Representation and Learning in Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
   N. Lacey, M.H. Lee

Using Cognition and Learning to Improve Agents' Reactions . . . . . . . . . . . 239
   Pedro Rafael Graça, Graça Gaspar

TTree: Tree-Based State Generalization with Temporally Abstract
Actions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 260
   William T.B. Uther, Manuela M. Veloso

Using Landscape Theory to Measure Learning Difficulty for
Adaptive Agents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 291
   Christopher H. Brooks, Edmund H. Durfee

Relational Reinforcement Learning for Agents in Worlds with Objects . . . 306
   Sašo Džeroski
Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 323
Cooperative Multiagent Learning

Enric Plaza and Santiago Ontañón

IIIA – Artificial Intelligence Research Institute
CSIC – Spanish Council for Scientific Research
Campus UAB, 08193 Bellaterra, Catalonia (Spain)
Vox: +34-93-5809570, Fax: +34-93-5809661
{enric,santi}@iiia.csic.es
http://www.iiia.csic.es
Abstract. Cooperation and learning are two ways in which an agent can improve its performance. Cooperative Multiagent Learning is a framework to analyze the tradeoff between cooperation and learning in multiagent systems. We focus on multiagent systems where individual agents are capable of solving problems and learning using CBR (Case-based Reasoning). We present several collaboration strategies for agents that learn and their empirical results in several experiments. Finally we analyze the collaboration strategies and their results along several dimensions, like number of agents, redundancy, CBR technique used, and individual decision policies.
1 Introduction
Multiagent systems offer a new paradigm to organize AI applications. Our goal is to develop techniques to integrate lazy learning into applications that are developed as multiagent systems. Learning is a capability that, together with autonomy, is always defined as a feature needed for full-fledged agents. Lazy learning offers the multiagent systems paradigm the capability of autonomously learning from experience. In this paper we present a framework for collaboration among agents that use Case-based Reasoning (CBR) and some experiments illustrating the framework. A distributed approach for lazy learning in agents that use CBR makes sense in different scenarios. Our purpose in this paper is to present a multiagent system approach for distributed case bases that can support these different scenarios. A first scenario is one where the cases themselves are owned by different partners or organizations. These organizations can consider their cases as assets and they may not be willing to give them to a centralized "case repository" where CBR can be used. In our approach each organization keeps its private cases while providing a CBR agent that works with them. Moreover, the agents can collaborate with other agents if they keep the case privacy intact, and they can improve their performance by cooperating. Another scenario involves scalability: it might be impractical to have a centralized case base when the data is too big.
Our research focuses on the scenario of separate case bases that we want to use in a decentralized fashion by means of a multiagent system, that is to say a collection of CBR agents that manage individual case bases and can communicate (and collaborate) with other CBR agents. From the point of view of Machine Learning (ML) our approach can be seen as researching the issues of learning with distributed or "partitioned" data: how to learn when each learning agent is able to see only a part of the examples from which to learn. This approach is related to the work in ML on ensembles or committees of classifiers (we explain this relationship later in § 6). The main difference is that ensembles work on collections of classifiers that see all the data but treat them differently, while we focus on a collection of agents each having a view of only part of the data (which in the extreme case can be completely exclusive). In this paper we show several strategies for collaboration among learning agents and later we analyze their results in terms of ML concepts like the decomposition of the error into bias plus variance and the "ensemble effect". From the point of view of agent systems, we focus on multiagent systems and not on distributed applications. In distributed applications there are some overall goals that govern the different parts performing distributed processing, and their coordination is decided at design time; it is not decided by the constituent parts. In a multiagent system, agents have autonomy—i.e. they have individual goals that determine when it is in their interest to collaborate with others, and when not. In our approach, the agents have autonomy given by individual data (the cases from which they learn) and individual goals (solving problems and improving their performance), and they only collaborate when it can further their goals.
2 Collaboration Strategies
A collaboration strategy in a MAC system establishes a coordination structure among agents where each agent exercises individual choice while achieving an overall effect that is positive both for the individual members and the whole system. Specifically, a collaboration strategy involves two parts: interaction protocols and decision policies. The interaction protocols specify the admissible pattern of message interchange among agents; e.g. a simple protocol is as follows: agent A can send a Request message to agent B and then agent B can reply with an Accept message or a Reject message. Interaction protocols specify interaction states whose meaning is shared by the agents; in our example, agent A knows that it is up to agent B to accept or not the request, and agent B knows that agent A is expecting an answer (usually in a time frame specified in the message as an expiration time). An interaction state then requires some agent to make a decision and act accordingly: the decision policies are the internal, individual procedures that agents use to take those decisions following individual goals and interests. In the following sections we will show several strategies for collaboration in the framework of interaction protocols for committees of agents. Since interaction
protocols for committees are quite similar, we will focus on the different individual decision policies that can be used while working in committees.

2.1 Multiagent CBR
A multiagent CBR (MAC) system M = {(A_i, C_i)}_{i=1...n} is composed of n agents, where each agent A_i has a case base C_i. In this framework we restrict ourselves to analytical tasks, i.e. tasks (like classification) where the solution is achieved by selecting from an enumerated set of solutions K = {S_1, ..., S_K}. When an agent A_i asks another agent A_j for help to solve a problem, the interaction protocol is as follows. First, A_i sends a problem description P to A_j. Second, after A_j has tried to solve P using its case base C_j, it sends back a message that is either :sorry (if it cannot solve P) or a solution endorsement record (SER). A SER has the form ⟨{(S_k, E_k^j)}, P, A_j⟩, where the collection of endorsing pairs (S_k, E_k^j) means that the agent A_j has found E_k^j cases in case base C_j endorsing solution S_k—i.e. there are a number E_k^j of cases that are relevant (similar) for endorsing S_k as a solution for P. Each agent A_j is free to send one or more endorsing pairs in a SER record.

2.2 Voting Scheme
The voting scheme defines the mechanism by which an agent reaches an aggregate solution from a collection of SERs coming from other agents. The principle behind the voting scheme is that the agents vote for solution classes depending on the number of cases they found endorsing those classes. However, we do not want agents having a larger number of endorsing cases to have an unbounded number of votes regardless of the votes of the other agents. Thus, we will define a normalization function so that each agent has one vote that can be cast for a unique solution class or fractionally assigned to a number of classes depending on the number of endorsing cases. Formally, let A^t be the set of agents that have submitted their SERs to agent A_i for problem P. We will consider that A_i ∈ A^t and that the result of A_i trying to solve P is also reified as a SER. The vote of an agent A_j ∈ A^t for class S_k is

$$Vote(S_k, A_j) = \frac{E_k^j}{c + \sum_{r=1}^{K} E_r^j}$$

where c is a constant that in our experiments is set to 1. It is easy to see that an agent can cast a fractional vote that is always less than 1. Aggregating the votes from different agents for a class S_k we have the ballot

$$Ballot^t(S_k, A^t) = \sum_{A_j \in A^t} Vote(S_k, A_j)$$

and therefore the winning solution class is

$$Sol^t(P, A^t) = \arg\max_{k=1...K} Ballot^t(S_k, A^t)$$
i.e., the class with the most votes in total. We will now show two collaboration policies that use this voting scheme.
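To make the aggregation step concrete, the following sketch implements the voting scheme in Python. It is an illustration only: the SER is reduced to a dictionary mapping solution classes to the number of endorsing cases E_k^j, and the retrieval step that produces those counts is not modeled here.

```python
from collections import defaultdict

C = 1.0  # the constant c of the voting scheme (set to 1 in the experiments)

def vote(ser):
    """Turn one SER {S_k: E_k} into fractional votes: E_k / (c + sum_r E_r)."""
    total = sum(ser.values())
    return {cls: e / (C + total) for cls, e in ser.items()}

def ballot(sers):
    """Aggregate the votes of all submitted SERs per solution class."""
    tally = defaultdict(float)
    for ser in sers:
        for cls, v in vote(ser).items():
            tally[cls] += v
    return dict(tally)

def committee_solution(sers):
    """Committee policy: the winning class is the one with the most votes."""
    tally = ballot(sers)
    return max(tally, key=tally.get)

# Hypothetical example: three agents endorse classes for the same problem P.
sers = [{"Astrophorida": 3, "Hadromerida": 1},   # SER of agent A1
        {"Hadromerida": 2},                      # SER of agent A2
        {"Astrophorida": 1, "Axinellida": 1}]    # SER of agent A3
print(committee_solution(sers))                  # -> 'Astrophorida'
```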
3 Committee Policy
In this collaboration policy the member agents of a MAC system M are viewed as a committee. An agent A_i that has to solve a problem P sends it to all the other agents in M. Each agent A_j that has received P sends a solution endorsement record ⟨{(S_k, E_k^j)}, P, A_j⟩ to A_i. The initiating agent A_i uses the voting scheme above upon all SERs, i.e. its own SER and the SERs of all the other agents in the multiagent system. The final solution is the class with the maximum number of votes. The next policy, Bounded Counsel, is based on the notion that an agent A_i tries to solve a problem P by itself, and if A_i "fails" to find a "good" solution then A_i asks counsel of other agents in the MAC system M. Let E_P^i = {(S_k, E_k^i)} be the endorsement pairs the agent A_i computes to solve problem P. For an agent A_i to decide when it "fails" we require that each agent in M has a predicate Self-competent(P, E_P^i). This predicate determines whether or not the solutions endorsed in E_P^i allow the agent to conclude that there is a good enough solution for P.

3.1 Bounded Counsel Policy
In this policy the agents that are members of a MAC system M try first to solve the problems they receive by themselves. Thus, if agent A_i receives a problem P and finds a solution that is satisfactory according to the termination check predicate, the solution found is the final solution. However, when an agent A_i assesses that its own solution is not reliable, the Bounded Counsel Policy tries to minimize the number of questions asked to other agents in M. Specifically, agent A_i asks counsel of only one agent, say agent A_j. When the answer of A_j arrives, agent A_i uses the termination check. If the termination check is true, the result of the voting scheme at that time is the final result; otherwise A_i asks counsel of another agent—if there is one left to ask; if not, the process terminates and the voting scheme determines the global solution. The termination check works, at any point in time t of the Bounded Counsel Policy process, upon the collection of solution endorsement records (SERs) received by the initiating agent A_i at time t. Using the same voting scheme as before, agent A_i has at any point in time t a plausible solution given by the winner class of the votes cast so far. Let V_max^t be the votes cast for the current plausible solution, V_max^t = Ballot^t(Sol^t(P, A^t), A^t); the termination check is a boolean function TermCheck(V_max^t, A^t) that determines whether there is enough difference between the majority votes and the rest to stop and obtain a final solution. In the experiments reported here the termination check function is the following:
$$TermCheck(V_{max}^t, A^t) = \frac{V_{max}^t}{\max\left(1, \sum_{k=1}^{K} Ballot^t(S_k, A^t) - V_{max}^t\right)} \geq \eta$$

i.e. it checks whether the majority vote V_max^t is η times bigger than the rest of the ballots. After termination, the global solution is the class with the maximum number of votes at that time.

Table 1. Average precision and standard deviation for a case base of 280 sponges pertaining to three classes. All the results are obtained using a 10-fold cross validation.

Policy      3 Agents       4 Agents       5 Agents       6 Agents       7 Agents
            µ     σ        µ     σ        µ     σ        µ     σ        µ     σ
Isolated    83.2  6.7      82.5  6.4      79.4  8.4      77.9  7.6      75.8  6.8
Bounded     87.2  6.1      86.7  6.5      85.1  6.3      85.0  7.3      84.1  7.0
Committee   88.4  6.0      88.3  5.7      88.4  5.4      88.1  6.0      87.9  5.9
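As an illustration of the Bounded Counsel decision loop, the sketch below reuses the vote and ballot functions from the previous listing; the value of η, the self-competence test, and the order in which agents are asked are placeholder assumptions, not values taken from the paper.

```python
ETA = 2.0  # hypothetical threshold; this section does not fix a value for eta

def term_check(tally, eta=ETA):
    """True when the majority ballot is at least eta times the sum of the rest."""
    if not tally:
        return False
    v_max = max(tally.values())
    rest = sum(tally.values()) - v_max
    return v_max / max(1.0, rest) >= eta

def bounded_counsel(own_ser, other_agents, self_competent):
    """Solve individually; if not self-competent, ask agents one by one
    until the termination check holds, then return the voted solution."""
    sers = [own_ser]
    if not self_competent(own_ser):
        for agent in other_agents:        # assumed order, e.g. random
            ser = agent.solve()           # the asked agent returns its SER
            if ser:                       # ignore ':sorry' answers
                sers.append(ser)
            if term_check(ballot(sers)):
                break
    tally = ballot(sers)
    return max(tally, key=tally.get)
```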
3.2 Experimental Setting
In order to compare the performance of these policies, we have designed an experimental suite with a case base of 280 marine sponges pertaining to three different orders of the Demospongiae class (Astrophorida, Hadromerida and Axinellida). The goal of the agents is to identify the correct biological order given the description of a new sponge. We have experimented with 3, 4, 5, 6 and 7 agents using LID [1] as the CBR method. The results presented here are the average of 5 runs of 10-fold cross validation. Therefore, as we have 280 sponges in our case base, in each run 252 sponges form the training set and 28 form the test set. In an experimental run, training cases are randomly distributed to the agents (without repetition, i.e. each case will belong to only one agent's case base). Thus, if we have n agents and m examples in the training set, each agent will have about m/n examples in its case base. Therefore, increasing the number of agents in our experiments decreases their case-base size. When all the examples in the training set have been distributed, the test phase starts. In the test phase, for each problem P in the test set, we randomly choose an agent A_i and send P to A_i. Thus, every agent will only solve a subset of the whole test set. When testing the isolated agents scenario, A_i will solve the problem by itself without help from the other agents, and when testing any of the collaboration policies, A_i will send P to some other agents. We can see (Table 1) that in all the cases we obtain some gain in accuracy compared to the isolated agents scenario. The Committee policy is always better than the others; however, this precision has a higher cost since a problem is always solved by every agent. If we look at the Bounded Counsel policy we can see it is much better than the isolated agents, and slightly worse than the Committee policy—but it is a cheaper policy since fewer agents are involved. A small detriment of the system's performance is observable when we increase the number of agents. This is due to the fact that the agents have a more
reduced number of training cases in their case bases. A smaller case base has the effect of producing less reliable individual solutions. However, the global effect of reduced accuracy appears for Bounded Counsel but not for the Committee policy. Thus, the Committee policy is quite robust to the diminishing reliability of individual solutions due to smaller case bases. This result is reasonable since the Committee policy always uses the information available from all agents. A more detailed analysis can be found in [10]. The Bounded Counsel policy then only makes sense if we have some cost associated to the number of agents involved in solving a problem that we want to minimize. However, we did some further work to improve the Bounded Counsel policy, resulting in an increase of accuracy that reaches that of the Committee with a minimum number of agents involved. Although we will not pursue this here, the proactive learning approach explained in [7] uses induction in every agent to learn a decision tree of voting situations; the individually induced decision tree is used by the agent to decide whether or not to ask counsel of a new agent.
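For concreteness, the case-distribution and test routine of the experimental setting described in Sect. 3.2 can be sketched as follows; the agent interface (a case base plus a solve method) is an illustrative assumption, not the original experimental code.

```python
import random

def distribute_cases(training_cases, agents, rng=random.Random(0)):
    """Deal the training cases out to the agents without repetition,
    so each of the n agents ends up with about m/n cases."""
    cases = list(training_cases)
    rng.shuffle(cases)
    for i, case in enumerate(cases):
        agents[i % len(agents)].case_base.append(case)

def test_phase(test_cases, agents, rng=random.Random(0)):
    """Send each test problem to a randomly chosen initiating agent."""
    correct = 0
    for problem, solution in test_cases:
        initiator = rng.choice(agents)   # solves alone or via a collaboration policy
        if initiator.solve(problem) == solution:
            correct += 1
    return correct / len(test_cases)
```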
4 Bartering Collaboration Strategies
We have seen that agents perform better as a committee than working individually when they have a partial view of data. We can view an individual case base as a sample of examples from all examples seen by the whole multiagent system. However, in the experiments we have shown so far these individual samples were unbiased, i.e. the probability of any agent having an example of a particular solution class was equal for all agents. Nonetheless, there may be situations where the examples seen by each agent can be skewed due to external factors, and this may result in agents having a biased case base: i.e. having a sample of examples where instances of some class are more (or less) frequent than they are in reality. This bias implies that individual agents have a less representative sample of the whole set of examples seen by a MAC. Experimental studies showed that the committee collaboration strategy decreased accuracy when the agents have biased case bases compared to the situation where their case bases are unbiased. In the following section we will formally define the notion of case base bias and show a collaboration strategy based on bartering cases that can improve the performance of a MAC when individual agents implement decision policies whose goal is to diminish their individual case base bias.

4.1 Individual Case Base Bias
Let d_i = {d_i^1, ..., d_i^K} be the individual distribution of cases for an agent A_i, where d_i^j is the number of cases with solution S_j ∈ K in the case base of A_i. Now, we can estimate the overall distribution of cases D = {D_1, ..., D_K}, where D_j is the estimated probability of the class S_j:

$$D_j = \frac{\sum_{i=1}^{n} d_i^j}{\sum_{i=1}^{n} \sum_{l=1}^{K} d_i^l}$$

To measure how far the case base C_i of a given agent A_i is from being a representative sample of the overall distribution, we will define the Individual Case Base (ICB) bias as the square distance between the distribution of cases D and the (normalized) individual distribution of cases obtained from d_i:
$$ICB(C_i) = \sum_{l=1}^{K} \left( \frac{d_i^l}{\sum_{j=1}^{K} d_i^j} - D_l \right)^2$$

Figure 1 shows the cosine distance between an individual distribution and the overall distribution; the square distance is simply the distance between the normalized vectors shown in Fig. 1.

Fig. 1. Individual case base bias: ICB bias(C_i) measures the angle α between the agent's individual distribution and the estimate of the overall distribution.
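Computationally, the ICB bias only needs the per-class case counts of each agent; a minimal sketch (case bases summarized as dictionaries mapping classes to counts) is shown below.

```python
def class_counts_to_distribution(counts, classes):
    """Normalize per-class case counts into a probability vector."""
    total = sum(counts.get(c, 0) for c in classes)
    return [counts.get(c, 0) / total for c in classes]

def icb_bias(individual_counts, all_counts, classes):
    """Squared distance between an agent's normalized class distribution
    and the overall distribution D estimated from all agents' counts."""
    overall_total = sum(sum(c.get(k, 0) for k in classes) for c in all_counts)
    D = [sum(c.get(k, 0) for c in all_counts) / overall_total for k in classes]
    d = class_counts_to_distribution(individual_counts, classes)
    return sum((di - Di) ** 2 for di, Di in zip(d, D))

# Hypothetical example with three classes and two agents:
classes = ["A", "B", "C"]
agents = [{"A": 8, "B": 1, "C": 1}, {"A": 2, "B": 4, "C": 4}]
print([round(icb_bias(a, agents, classes), 3) for a in agents])  # [0.135, 0.135]
```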
4.2 Case Bartering Mechanism
To reach an agreement for bartering between two agents, there must be an offering agent A_i that sends an offer to another agent A_j. Then A_j has to evaluate whether the offer of interchanging cases with A_i is interesting, and accept or reject the offer. If the offer is accepted, we say that A_i and A_j have reached a bartering agreement, and they will interchange the cases in the offer. Formally, an offer is a tuple o = ⟨A_i, A_j, S_k1, S_k2⟩ where A_i is the offering agent, A_j is the receiver of the offer, and S_k1 and S_k2 are two solution classes, meaning that the agent A_i will send one of its cases (or a copy of it) with solution S_k2 and A_j will send one of its cases (or a copy of it) with solution S_k1. The interaction protocol for bartering is explained in [8] but essentially provides an agreed-upon pattern for offering, accepting, and performing barter actions. An agent both generates new bartering offers and assesses bartering offers received from other agents. Received bartering offers are accepted if the result of the interchange diminishes the agent's ICB. Similarly, an agent generates new bartering offers that, if accepted, will diminish the agent's ICB—notice, however, that this effect occurs only when the corresponding agent also accepts the offer, which implies that the ICB value of that agent will also diminish.
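The individual decision policy can therefore be sketched as a simple comparison of ICB values before and after the hypothetical exchange; the copy/non-copy switch anticipates the discussion in Sect. 5.4, and the estimate D of the overall distribution is assumed to be available to every agent. This is an illustrative sketch, not the protocol implementation of [8].

```python
def icb(counts, D, classes):
    """ICB bias of a case base summarized by per-class counts, given the
    estimated overall distribution D (a list aligned with classes)."""
    total = sum(counts.get(c, 0) for c in classes)
    d = [counts.get(c, 0) / total for c in classes]
    return sum((di - Di) ** 2 for di, Di in zip(d, D))

def counts_after_barter(counts, give_class, receive_class, copy_mode):
    """Per-class counts after giving one case and receiving one in return."""
    new = dict(counts)
    new[receive_class] = new.get(receive_class, 0) + 1
    if not copy_mode:                 # in non-copy mode the sent case is lost
        new[give_class] = new.get(give_class, 0) - 1
    return new

def accept_offer(my_counts, offer, D, classes, copy_mode=True):
    """offer = (A_i, A_j, S_k1, S_k2): the receiver A_j gives a case of S_k1 and
    receives a case of S_k2. Accept iff the exchange lowers the receiver's ICB."""
    _, _, s_k1, s_k2 = offer
    after = counts_after_barter(my_counts, s_k1, s_k2, copy_mode)
    return icb(after, D, classes) < icb(my_counts, D, classes)
```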
Fig. 2. Artificial problem used to visualize the effects of Case Bartering.
In the experiments we performed, the bartering ends when no participating agent is willing to generate any further offer, and the final state of the multiagent system is one where:

– all the individual agents have diminished their respective ICB bias values, and
– the accuracy of the committee has increased to proficient levels (as high as the levels shown in § 3).

The conclusion of these experiments is that the individual decision making (based on the bias estimate) leads to an overall performance increment (the committee accuracy). Moreover, it shows that the ICB measure is a good estimate of the problems involved with the data, since "solving" the bias problem (diminishing the case base bias) has the result of solving the performance problem (the accuracy levels are restored to the higher levels we expected). In order to gain an insight into the effect of bartering on the agents' case bases, we have designed a small classification problem for which the agents' case bases can be visualized. The artificial problem is shown in Figure 2. Each instance of the artificial problem has only two real attributes, which correspond to the x and y coordinates in the two-dimensional space shown, and can belong to one of three classes (A, B or C). The goal is to guess the right class of a new point given its coordinates. Figure 3 shows the initial case bases of five agents for the artificial problem. Notice that the case bases given to the agents are highly biased. For instance, the first agent (leftmost) has almost no cases of class B in its case base, and the second agent has almost only cases of class A. With a high probability, the first agent will predict class A for most of the problems for which the right solution is class B. Therefore, the classification accuracy of this agent will be very low. Finally, to see the effect of bartering, Figure 4 shows the case bases for the same agents as Figure 3 but after applying the Case Bartering process. Notice in Fig. 4 that all the agents have obtained cases of the classes for which they had few cases before applying Case Bartering.
Fig. 3. Artificial problem case bases for 5 agents before applying Case Bartering.

Fig. 4. Effect of the Case Bartering process in the artificial problem case bases of 5 agents.

For instance, we can see how the first agent (leftmost) has obtained a lot of cases of class B, by losing some of its cases of class A. The second agent has also obtained some cases of classes B and C in exchange for losing some cases of class A. Summarizing, each agent has obtained an individual case base that is more representative of the real problem than before applying the Case Bartering process, while following an individual, self-interested decision making process.
5 The Dimensions of Multiagent Learning

5.1 Bias Plus Variance Analysis
Bias plus variance decomposition of the error [6] is a useful tool to provide insight into learning methods. Bias plus variance analysis breaks the expected error into the sum of three non-negative values:

– Intrinsic target noise: this is the expected error of the Bayes optimal classifier (a lower bound on the expected error of any classifier).
– Squared bias: measures how closely the learning algorithm's prediction matches the target (averaged over all possible training sets of a given size).
– Variance: this is the variance of the algorithm's prediction for the different training sets of a given size.

Since the first value (noise) cannot be measured, the bias plus variance decomposition estimates the values of squared bias and variance. In order to estimate these values we are using the model presented in [6]. Figure 5 shows the bias plus variance decomposition of the error for a system composed of 5 agents using Nearest Neighbor.
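As an illustration of how such estimates can be obtained, the sketch below computes squared bias and variance for 0/1 loss from the predictions a classifier makes over several training sets, in the spirit of the decomposition of [6] and assuming a noise-free target; the exact estimator used to produce Figure 5 may differ in its details.

```python
from collections import Counter

def bias_variance_01(predictions, true_labels):
    """predictions[r][i]: class predicted for test example i when trained on
    the r-th training set; true_labels[i]: the (assumed noise-free) target.
    Returns (squared_bias, variance); their sum equals the average 0/1 error."""
    n_runs = len(predictions)
    bias2 = variance = 0.0
    for i, y_true in enumerate(true_labels):
        counts = Counter(run[i] for run in predictions)
        probs = {y: c / n_runs for y, c in counts.items()}
        p_true = probs.get(y_true, 0.0)
        # squared bias: 0.5 * sum_y (1[y == y_true] - P(prediction = y))^2
        bias2 += 0.5 * ((1.0 - p_true) ** 2 +
                        sum(p ** 2 for y, p in probs.items() if y != y_true))
        # variance: 0.5 * (1 - sum_y P(prediction = y)^2)
        variance += 0.5 * (1.0 - sum(p ** 2 for p in probs.values()))
    n = len(true_labels)
    return bias2 / n, variance / n
```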
Fig. 5. Bias plus variance decomposition of the classification error for a system with 5 agents both solving problems individually and using the Committee collaboration policy.
The left-hand side of Figure 5 shows the bias plus variance decomposition of the error when the agents solve the problems individually, and the right-hand side shows the decomposition when the agents use the Committee collaboration policy to solve problems. Three different scenarios are presented for each one: unbiased, representing a situation where the agents have unbiased case bases; biased, representing a situation where the agents have biased case bases; and bartering, where the agents have biased case bases and they use case bartering. Comparing the Committee collaboration policy with the individual solution of problems, we see that the error reduction obtained with the Committee is due only to a reduction in the variance component. This result is expected, since a general result of machine learning tells us that we can reduce the classification error of any classifier by averaging the predictions of several classifiers when they make uncorrelated errors, due to a reduction in the variance term [4]. Comparing the unbiased and the biased scenarios, we can see that the effect of the ICB bias on the classification error is reflected in both the bias and the variance components. The variance is the one that suffers the greater increase, but bias is also increased. If the agents apply case bartering they can greatly reduce both components of the error—as we can see comparing the biased and the bartering scenarios. Comparing the bartering scenario with the unbiased scenario, we can also see that case bartering can make agents in the biased scenario achieve greater accuracies than agents in the unbiased scenario. Looking in more detail, we see that in the bartering scenario the bias term is slightly smaller than the bias term in the unbiased scenario. This is due to the increased size of the individual case bases¹ because (as noted in [11]) when the individual training sets are smaller the bias tends to increase.
¹ Bartering here is realized with copies of cases, and the result is an increment in the total number of cases in the case bases of the agents. The difference between bartering with or without copies is analyzed in § 5.4.
Fig. 6. Accuracy achieved by random bartering among 3 agents and 5 agents.
The variance term is also slightly smaller in the bartering scenario than in the unbiased scenario. Summarizing, the Committee collaboration policy is able to reduce the variance component of the error. Case Bartering can make a system with biased case bases achieve greater accuracies than a system with unbiased case bases for two reasons: 1) as the ICB bias is reduced, the accuracy of a system with unbiased case bases is recovered, and 2) as the size of individual case bases is slightly increased, the bias term of the error is reduced and thus the accuracy can be greater than in the unbiased scenario.

5.2 The Effect of Individual Policies
One dimension that is interesting to assess is the effect of a specific individual decision policy inside a given collaboration strategy. In this section we shall examine the effect of the policy of diminishing ICB inside the bartering collaboration strategy. For this purpose, we have set up an experiment to assess the difference between using the ICB policy and using a "base" (uninformed) decision policy, both with the same initial average ICB value. In the "base" experiments, the individual agents just barter cases randomly: every agent randomly chooses a percentage α of the cases in its case base and sends each one to another agent (also chosen at random). In these experiments, α = 0.15 means every agent selects at random 15% of the cases in its case base and randomly sends each one to another agent, α = 1 means the agent sends all of its cases (one to each agent), and α = 2 means the agent sends all of its cases twice. Figure 6 shows the accuracy of the Committee for different α values on two MAC systems with 3 and 5 agents. First, notice that random bartering improves the accuracy—and the more cases are bartered (the greater the α) the higher the accuracy of the Committee. This experiment gives us the baseline utility of bartering cases in the biased scenario. However, the second thing to notice is that it does not increase the accuracy as much as bartering with the ICB policy.
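The uninformed baseline is straightforward to state as code. The sketch below covers α ≤ 1 in copy mode (the sender keeps its cases); it illustrates the baseline behaviour only, not the actual protocol implementation.

```python
import random

def random_bartering(case_bases, alpha, rng=random.Random(0)):
    """Each agent sends a random alpha-fraction of its cases, each one to a
    randomly chosen other agent, as copies. Returns the new case bases."""
    n = len(case_bases)
    new_bases = [list(cb) for cb in case_bases]
    for i, cb in enumerate(case_bases):
        k = int(round(alpha * len(cb)))
        for case in rng.sample(cb, min(k, len(cb))):
            receiver = rng.choice([j for j in range(n) if j != i])
            new_bases[receiver].append(case)     # copy mode: sender keeps the case
    return new_bases
```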
Fig. 7. Accuracy achieved by Committee using Nearest Neighbor and LID for values of R (redundancy) from 0% to 100%.
Figure 6 shows that for the same quantity of bartered cases the accuracy of the Committee is higher with the ICB policy. Moreover, notice that even when random bartering keeps exchanging more cases (increasing α) it takes a great quantity of them to approach the accuracy of the ICB policy. The conclusion, thus, is that the ICB policy is capable of selecting the cases that are useful to barter among agents. The process of random bartering introduces a lot of redundancy in the multiagent system data (a great number of repeated cases in individual case bases). This is the dimension we analyze in the next section.

5.3 Redundancy
When we described the experiments in the Committee collaboration framework, an assumption we made was that each case in our experimental dataset was adjudicated to one particular agent case base. In other words, there was no copy of any case, so redundancy in the dataset was zero. The reason we performed the experiments on the Committee under the no-redundancy assumption is simply that this is the worst individual scenario (since the individual agent accuracy is lower with smaller case bases), and we wanted to see how much the committee collaboration strategy could improve from there. Let us define the redundancy R of a MAC system as follows:

$$R = \frac{\left(\sum_{i=1}^{n} |C_i|\right) - M}{(n-1)\,M} \cdot 100$$

where |C_i| is the number of cases in agent A_i's case base, n is the number of agents in the MAC system, and M is the total number of cases. Redundancy is zero when there is no duplicate of a case, and R = 100 when every agent has all (M) cases.
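The redundancy measure follows directly from the case-base sizes; a small sketch, checked against the two boundary situations just described:

```python
def redundancy(case_base_sizes, total_cases):
    """R = ((sum_i |C_i|) - M) / ((n - 1) * M) * 100 for n agents and M cases."""
    n = len(case_base_sizes)
    return 100.0 * (sum(case_base_sizes) - total_cases) / ((n - 1) * total_cases)

# 5 agents, 280 distinct cases and no duplicates -> R = 0
print(redundancy([56, 56, 56, 56, 56], 280))   # 0.0
# every agent holding all 280 cases -> R = 100
print(redundancy([280] * 5, 280))              # 100.0
```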
To analyze the effect of redundancy on a MAC system we performed a suite of experiments, shown in Fig. 7, with agents using Nearest Neighbor and LID as CBR techniques. The experiments set up a Committee with a certain value of R in the individual case bases. We show in Fig. 7 the accuracy of the Committee for different R values, and we also plot the individual (average) accuracy for the same R values. The accuracy plot named "Base" in Fig. 7 is that of a single agent having all cases (i.e. a single-agent scenario). We notice that as redundancy increases the accuracy of the individual agent, as expected, grows until reaching the "Base" accuracy. Moreover, the Committee accuracy grows faster as the redundancy increases, and it reaches or even exceeds the "Base" accuracy; this fact (the Committee outperforming a single agent with all the data) is due to the "ensemble effect" of multiple model learning [5] (we discuss this further in § 6). The ensemble effect states that classifiers with uncorrelated errors perform better than any one of the individual classifiers. The ensemble effect, in terms of bias plus variance, reduces the variance: that is why the Committee accuracy is higher than the individual accuracy. On the other hand, individual accuracy increases with redundancy because bias is reduced. The combined effect of reducing bias and variance boosts the Committee accuracy to reach (or even exceed) the "Base" accuracy (for R between 50 and 75). When redundancy is very high (for R higher than 90) the individual agents are so similar in the content of their case bases that the Committee strategy cannot reduce much variance, and the accuracy drops to reach the "Base" accuracy (a Committee of agents having all cases is identical to the "Base" scenario with a single agent having all cases).

5.4 Redundancy and Bartering
Redundancy also plays a role during bartering. Usually in bartering one case is exchanged for another, but since cases are simply information the barter action may involve an actual exchange of original cases or an exchange of copies of cases. Let us define copy mode bartering as the exchange of case copies (where the bartering agents end up with both cases) and non-copy mode bartering as the exchange of original cases (where each bartering agent deletes the offered case and adds the received case). The non-copy mode clearly maintains the MAC system redundancy R while the copy mode increases R. We performed bartering experiments both in the copy and non-copy modes, and Figures 8 and 9 show the results with agents using the CBR techniques of Nearest Neighbor and LID, respectively. Comparing now the two modes, we see that in the non-copy mode the MAC obtains lower accuracies than in the copy mode. But, on the other hand, in the non-copy mode the average number of cases per agent does not increase, while in the copy mode the size of the individual case bases grows. Therefore, we can say that in the copy mode (when the agents send copies of the cases without forgetting them) the agents obtain greater accuracies, but at the cost of increasing the individual case base sizes. In other words, they improve the accuracy by allowing case redundancy in the contents of individual case bases (a case may be contained in more than one individual case base), while in the non-copy mode the agents only reallocate the cases, allowing only a single copy of each case in the system.
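The difference between the two modes reduces to whether the sender deletes the case it gives away; a minimal sketch (case bases represented as Python lists):

```python
def execute_barter(case_base_i, case_base_j, case_from_i, case_from_j, copy_mode):
    """Carry out an agreed barter between agents A_i and A_j.

    In copy mode both agents keep the cases they send and simply add the
    received copies; in non-copy mode each agent deletes the case it sends."""
    if not copy_mode:
        case_base_i.remove(case_from_i)
        case_base_j.remove(case_from_j)
    case_base_i.append(case_from_j)
    case_base_j.append(case_from_i)
```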
Fig. 8. Accuracy in bartering using Nearest Neighbor when copying cases is allowed and disallowed.
Fig. 9. Accuracy in bartering using LID when copying cases is allowed and disallowed.
In terms of bias plus variance, we can see that the copy mode helps the individual agents to improve accuracy (since they have more cases) by decreasing the bias. This individual accuracy increment is responsible for the slight increase in accuracy of the copy mode versus the non-copy mode. Notice that the danger here for the Committee is that the "ensemble effect" could be reduced (since increasing redundancy increases error correlation among classifiers). Since bartering provides a strategy, focused by the ICB policy, to exchange just the cases that are most needed, the redundancy increases only moderately and the global effect is still positive.
6 Related Work
Several areas are related to our work: multiple model learning (where the final solution for a problem is obtained through the aggregation of solutions of individual predictors), case base competence assessment, and negotiation protocols. Here we will briefly describe some relevant work in these areas that is close to ours. A general result on multiple model learning [5] demonstrated that if uncorrelated classifiers with error rate lower than 0.5 are combined then the resulting
error rate must be lower than the one made by the individual classifiers. The BEM (Basic Ensemble Method) is presented in [9] as a basic way to combine continuous estimators, and since then many other methods have been proposed: Bagging [2] or Boosting [3] are some examples. However, all these methods do not deal with the issue of "partitioned examples" among different classifiers as we do—they rely on aggregating results from multiple classifiers that have access to all the data. Their goal is to use a multiplicity of classifiers to increase the accuracy of existing classification methods. Our intention is to combine the decisions of autonomous classifiers (each one corresponding to one agent), and to see how they can cooperate to achieve a better behavior than when they work alone. A more similar approach is the one proposed in [15], where a MAS is proposed for pattern recognition: each autonomous agent is a specialist recognizing only a subset of all the patterns, and the predictions are then combined dynamically. Learning from biased datasets is a well-known problem, and many solutions have been proposed. Vucetic and Obradovic [14] propose a method based on a bootstrap algorithm to estimate class probabilities in order to improve the classification accuracy. However, their method does not fit our needs, because they need the entire test set available to the agents before starting to solve any problem in order to make the class probability estimation. Related work is that of case base competence assessment. We use a very simple measure comparing the individual with the global distribution of cases; we do not try to assess the areas of competence of (individual) case bases—as proposed by Smyth and McKenna [13]. That work focuses on finding groups of cases that are competent. In [12] Schwartz and Kraus discuss negotiation protocols for data allocation. They propose two protocols, the sequential protocol and the simultaneous protocol. These two protocols can be compared respectively to our Token-Passing Case Bartering Protocol and Simultaneous Case Bartering Protocol, because in their simultaneous protocol the agents have to make offers for allocating some data item without knowing the others' offers, and in the sequential protocol the agents make offers in order, and each one knows what the offers of the previous ones were.
7 Conclusions and Future Work
We have presented a framework for Cooperative Case-Based Reasoning in multiagent systems, where agents use a market mechanism (bartering) to improve the performance both of individuals and of the whole multiagent system. The agent autonomy is maintained, because each agent is free to take part in the collaboration processes or not. For instance, in the bartering process, if an agent does not want to take part, it simply does nothing, and when the other agents notice that one agent is not following the protocol they will ignore it during the remaining iterations of the bartering process.
In this work we have shown a problem arising when data is distributed over a collection of agents, namely that each agent may have a skewed view of the world (the individual bias). Comparing empirical results in classification tasks, we saw that both the individual and the overall performance decrease when bias increases. The process of bartering shows that the problems derived from data distributed over a collection of agents can be solved using a market-oriented approach. Each agent engages in a barter only when it makes sense for its individual purposes, but the outcome is an improvement of the individual and overall performance. The naive way to solve the ICB bias problem would be to centralize all data in one location or to adopt a completely cooperative multiagent approach where each agent sends its cases to other agents and they retain what they want (a "gift economy"). However, these approaches have some problems; for instance, having all the cases in a single case base may not be practical due to efficiency problems. Another problem of the centralized approach is that when the agents belong to organizations that consider their case bases as assets, they are not willing to donate their cases to a centralized case base. Case Bartering tries to interchange cases only to the amount that is necessary and not more, to keep the redundancy from increasing very much. As a general conclusion, we have seen that there are avenues to pursue the goal of learning systems, in the form of multiagent systems, where the training data need not be centralized in one agent nor duplicated in all agents. New non-centralized processes, such as bartering, can be designed that are able to correct problems in that distributed allocation of training data. We have seen that the "ensemble effect" of multi-model learning also takes place in the multiagent setting, even in the situation where there is no redundancy. Finally, we have focused on lazy learning techniques (CBR) because they seemed easier to adapt to a distributed, multiagent setting; however, the same ideas and techniques should be able to work for multiagent systems that learn using eager techniques like induction. We plan to investigate inductive multiagent learning in the near future, starting with classification tasks and decision tree techniques.
References

1. E. Armengol and E. Plaza. Lazy induction of descriptions for relational case-based learning. In 12th European Conference on Machine Learning, 2001.
2. Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
3. Yoav Freund and Robert E. Schapire. Experiments with a new boosting algorithm. In Proc. 13th ICML, pages 148–156. Morgan Kaufmann, 1996.
4. Jerome H. Friedman. On bias, variance, 0/1-loss, and the curse-of-dimensionality. Data Mining and Knowledge Discovery, 1(1):55–77, 1997.
5. L. K. Hansen and P. Salamon. Neural network ensembles. IEEE Transactions on Pattern Analysis and Machine Intelligence, (12):993–1001, 1990.
6. Ron Kohavi and David H. Wolpert. Bias plus variance decomposition for zero-one loss functions. In Lorenza Saitta, editor, Machine Learning: Proceedings of the Thirteenth International Conference, pages 275–283. Morgan Kaufmann, 1996.
Cooperative Multiagent Learning
17
7. S. Onta˜ n´ on and E. Plaza. Learning when to collaborate among learning agents. In 12th European Conference on Machine Learning, 2001. 8. S. Onta˜ n´ on and E. Plaza. A bartering aproach to improve multiagent learning. In 1st International Joint Conference in Autonomous Agents and Multiagent Systems, 2002. 9. M. P. Perrone and L. N. Cooper. When networks disagree: Ensemble methods for hydrid neural networks. In Artificial Neural Networks for Speech and Vision. Chapman-Hall, 1993. 10. E. Plaza and Onta˜ no ´n S. Ensemble case-based reasoning. Lecture Notes in Artificial Intelligence, 2080:437–451, 2001. 11. Richard A. Olshen and L. Gordon. Almost sure consistent nonparametric regression from recursive partitioning schemes. Multivariate Analysis, 15:147–163, 1984. 12. R. Schwartz and S. Kraus. Bidding mechanisms for data allocation in multi-agent environments. In Agent Theories, Architectures, and Languages, pages 61–75, 1997. 13. B. Smyth and E. McKenna. Modelling the competence of case-bases. In EWCBR, pages 208–220, 1998. 14. S. Vucetic and Z. Obradovic. Classification on data with biased class distribution. In 12th European Conference on Machine Learning, 2001. 15. L. Vuurpijl and L. Schomaker. A framework for using multiple classifiers in a multiple-agent architecture. In Third International Workshop on Handwriting Analysis and Recognition, 1998.
Reinforcement Learning Approaches to Coordination in Cooperative Multi-agent Systems

Spiros Kapetanakis¹, Daniel Kudenko¹, and Malcolm J.A. Strens²
¹ Department of Computer Science, University of York, Heslington, York, YO10 5DD, UK
{spiros,kudenko}@cs.york.ac.uk
² Guidance and Imaging Solutions, QinetiQ, Ively Road, Farnborough, Hampshire GU14 0LX, UK
[email protected]
Abstract. We report on an investigation of reinforcement learning techniques for the learning of coordination in cooperative multi-agent systems. Specifically, we focus on two novel approaches: one is based on a new action selection strategy for Q-learning [10], and the other is based on model estimation with a shared action-selection protocol. The new techniques are applicable to scenarios where mutual observation of actions is not possible. To date, reinforcement learning approaches for such independent agents did not guarantee convergence to the optimal joint action in scenarios with high miscoordination costs. We improve on previous results [2] by demonstrating empirically that our extension causes the agents to converge almost always to the optimal joint action even in these difficult cases.
1 Introduction

Learning to coordinate in cooperative multi-agent systems is a central and widely studied problem (e.g., [5,1,2,6,7,11]). In this context, coordination is defined as the ability of two or more agents to jointly reach a consensus over which actions to perform in an environment. We investigate the case of independent agents that cannot observe one another's actions, which often is a more realistic assumption. In this investigation, we focus on reinforcement learning, where the agents must learn to coordinate their actions through environmental feedback. To date, reinforcement learning (RL) methods for independent agents [9,7] did not guarantee convergence to the optimal joint action in scenarios where miscoordination is associated with high penalties. Even approaches using agents that are able to build predictive models of each other (so-called joint-action learners) have failed to show convergence to the optimal joint action in such difficult cases [2]. We investigate two approaches to reinforcement learning in search of improved convergence to the optimal joint action in the case of independent agents. The first approach is a variant of Q-learning [10] where we introduce a novel estimated value function in the Boltzmann action selection strategy. The second technique is based on a shared action-selection protocol that enables the agents to estimate the rewards for specific joint actions.
We evaluate both RL approaches experimentally on two especially difficult coordination problems that were first introduced by Claus and Boutilier in [2]: the climbing game and the penalty game. The empirical results show that the convergence probability to the optimal joint action is greatly improved over other approaches, in fact reaching almost 100%. Our paper is structured as follows: we first introduce the aforementioned common testbed for the study of learning coordination in cooperative multi-agent systems. We then introduce each reinforcement learning technique and discuss the experimental results. We finish with an outlook on future work.
2 Single-Stage Coordination Games

A common testbed for studying the problem of multi-agent coordination is that of repeated cooperative single-stage games [3]. In these games, the agents have common interests, i.e. they are rewarded based on their joint action and all agents receive the same reward. In each round of the game, every agent chooses an action. These actions are executed simultaneously and the reward that corresponds to the joint action is broadcast to all agents. A more formal account of this type of problem was given in [2]. In brief, we assume a group of n agents α1, α2, . . . , αn, each of which has a finite set of individual actions Ai, known as the agent's action space. In this game, each agent αi chooses an individual action from its action space to perform. The action choices make up a joint action. Upon execution of their actions all agents receive the reward that corresponds to the joint action. For example, Table 1 describes the reward function for a simple cooperative single-stage game. If agent 1 executes action b and agent 2 executes action a, the reward they receive is 5. Obviously, the optimal joint action in this simple game is (b, b) as it is associated with the highest reward of 10.

Table 1. A simple cooperative game.
             Agent 1
              a    b
Agent 2  a    3    5
         b    0   10

Our goal is to enable the agents to learn optimal coordination from repeated trials. To achieve this goal, one can use either independent or joint-action learners. The difference between the two types lies in the amount of information they can perceive in the game. Although both types of learners can perceive the reward that is associated with each joint action, the former are unaware of the existence of other agents whereas the latter can also perceive the actions of others. In this way, joint-action learners can maintain a model of the strategy of other agents and choose their actions based on the other participants'
perceived strategy. In contrast, independent learners must estimate the value of their individual actions based solely on the rewards that they receive for their actions. In this paper, we focus on individual learners, these being more universally applicable. In our study, we focus on two particularly difficult coordination problems, the climbing game and the penalty game. These games were introduced in [2]. This focus is without loss of generality since the climbing game is representative of problems with high miscoordination penalty and a single optimal joint action whereas the penalty game is representative of problems with high miscoordination penalty and multiple optimal joint actions. Both games are played between two agents. The reward functions for the two games are included in Tables 2 and 3:
Table 2. The climbing game table.
              Agent 1
              a    b    c
Agent 2  a   11  -30    0
         b  -30    7    6
         c    0    0    5

In the climbing game, it is difficult for the agents to converge to the optimal joint action (a, a) because of the negative reward in the case of miscoordination. For example, if agent 1 plays a and agent 2 plays b, then both will receive a reward of −30. Incorporating this reward into the learning process can be so detrimental that both agents tend to avoid playing the same action again. In contrast, when choosing action c, miscoordination is not punished so severely. Therefore, in most cases, both agents are easily tempted by action c. The reason is as follows: if agent 1 plays c, then agent 2 can play either b or c to get a positive reward (6 and 5 respectively). Even if agent 2 plays a, the result is not catastrophic since the reward is 0. Similarly, if agent 2 plays c, whatever agent 1 plays, the resulting reward will be at least 0. From this analysis, we can see that the climbing game is a challenging problem for the study of learning coordination. It includes heavy miscoordination penalties and "safe" actions that are likely to tempt the agents away from the optimal joint action.

Another way to make coordination more elusive is by including multiple optimal joint actions. This is precisely what happens in the penalty game of Table 3. In the penalty game, it is not only important to avoid the miscoordination penalties associated with actions (c, a) and (a, c). It is equally important to agree on which optimal joint action to choose out of (a, a) and (c, c). If agent 1 plays a expecting agent 2 to also play a so they can receive the maximum reward of 10, but agent 2 plays c (perhaps expecting agent 1 to play c so that, again, they receive the maximum reward of 10), then the resulting penalty can be very detrimental to both agents' learning process. In this game, b is the "safe" action for both agents since playing b is guaranteed to result in a reward of 0 or 2, regardless of what the other agent plays. Similarly to the climbing
Table 3. The penalty game table.
              Agent 1
              a    b    c
Agent 2  a   10    0    k
         b    0    2    0
         c    k    0   10
game, it is clear that the penalty game is a challenging testbed for the study of learning coordination in multi-agent systems.
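For concreteness, the games above can be represented directly as payoff matrices. The following sketch (ours, not code from the paper; names and structure are illustrative) encodes the climbing and penalty games exactly as in Tables 2 and 3 and returns the common reward for a joint action.

```python
# Payoff matrices indexed as GAME[agent2_action][agent1_action],
# mirroring the row/column layout of Tables 2 and 3.
CLIMBING_GAME = {
    "a": {"a": 11, "b": -30, "c": 0},
    "b": {"a": -30, "b": 7, "c": 6},
    "c": {"a": 0, "b": 0, "c": 5},
}

def penalty_game(k):
    """Penalty game of Table 3 for a given (non-positive) penalty k."""
    return {
        "a": {"a": 10, "b": 0, "c": k},
        "b": {"a": 0, "b": 2, "c": 0},
        "c": {"a": k, "b": 0, "c": 10},
    }

def play_round(game, action1, action2):
    """Both agents receive the same reward for the joint action."""
    return game[action2][action1]

print(play_round(CLIMBING_GAME, "a", "b"))       # -30: a heavily penalised miscoordination
print(play_round(penalty_game(-50), "a", "c"))   # -50: miscoordination between the two optima
```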
3 A Q-Learning Approach to Learning of Coordination

A popular technique for learning coordination in cooperative single-stage games is one-step Q-learning, a reinforcement learning technique. In this section, we first introduce the general approach, followed by a discussion of the novel FMQ heuristic for action selection. We end the section with empirical results and a discussion of limitations of the FMQ approach.

3.1 Basics
Since the agents in a single-stage game are stateless, we need a simple reformulation of the general Q-learning algorithm such as the one used in [2]. Each agent maintains a Q value for each of its actions. The value Q(action) provides an estimate of the usefulness of performing this action in the next iteration of the game, and these values are updated after each step of the game according to the reward received for the action. We apply Q-learning with the following update function:

Q(action) ← Q(action) + λ(r − Q(action))

where λ is the learning rate (0 < λ < 1) and r is the reward that corresponds to choosing this action. In a single-agent learning scenario, Q-learning is guaranteed to converge to the optimal action independent of the action selection strategy. In other words, given the assumption of a stationary reward function, single-agent Q-learning will converge to the optimal policy for the problem. However, in a multi-agent setting, the action selection strategy becomes crucial for convergence to any joint action. A major challenge in defining a suitable strategy for the selection of actions is to strike a balance between exploring the usefulness of moves that have been attempted only a few times and exploiting those in which the agent's confidence in getting a high reward is relatively strong. This is known as the exploration/exploitation problem.

The action selection strategy that we have chosen for our research is the Boltzmann strategy [4] which states that agent αi chooses an action to perform in the next iteration
of the game with a probability that is based on its current estimate of the usefulness of that action, denoted by EV(action) (in [4], the estimated value is introduced as expected reward, ER):

P(action) = e^(EV(action)/T) / Σ_{action′ ∈ Ai} e^(EV(action′)/T)
In the case of Q-learning, the agent's estimate of the usefulness of an action may be given by the Q values themselves, an approach that has been usually taken to date. We have concentrated on a proper choice for the two parameters of the Boltzmann function: the estimated value and the temperature. The importance of the temperature lies in that it provides an element of controlled randomness in the action selection: high values in temperature encourage exploration since variations in Q values become less important. In contrast, low temperature values encourage exploitation. The value of the temperature is typically decreased over time from an initial value as exploitation takes over from exploration until it reaches some designated lower limit. The three important settings for the temperature are the initial value, the rate of decrease and the number of steps until it reaches its lowest limit. The lower limit of the temperature needs to be set to a value that is close enough to 0 to allow the learners to converge by stopping their exploration.

Variations in these three parameters can provide significant difference in the performance of the learners. For example, starting with a very high value for the temperature forces the agents to make random moves until the temperature reaches a low enough value to play a part in the learning. This may be beneficial if the agents are gathering statistical information about the environment or the other agents. However, this may also dramatically slow down the learning process. It has been shown [8] that convergence to a joint action can be ensured if the temperature function adheres to certain properties. However, we have found that there is more that can be done to ensure not just convergence to some joint action but convergence to the optimal joint action, even in the case of independent learners. This is not just in terms of the temperature function but, more importantly, in terms of the action selection strategy. More specifically, it turns out that a proper choice for the estimated value function in the Boltzmann strategy can significantly increase the likelihood of convergence to the optimal joint action.
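The baseline learner described above can be sketched in a few lines of code. This is our own minimal illustration, assuming EV(action) = Q(action) as in the baseline experiments; class and method names are not from the paper.

```python
import math
import random

class IndependentLearner:
    """Stateless Q-learner with Boltzmann action selection (baseline: EV = Q)."""

    def __init__(self, actions, learning_rate=0.9):
        self.q = {a: 0.0 for a in actions}
        self.lr = learning_rate

    def estimated_value(self, action):
        # Baseline choice; the FMQ heuristic of the next subsection replaces this function.
        return self.q[action]

    def select_action(self, temperature):
        # Boltzmann action selection: P(a) proportional to exp(EV(a) / T).
        weights = [math.exp(self.estimated_value(a) / temperature) for a in self.q]
        return random.choices(list(self.q), weights=weights, k=1)[0]

    def update(self, action, reward):
        # Q(action) <- Q(action) + lambda * (r - Q(action))
        self.q[action] += self.lr * (reward - self.q[action])
```

In a repeated game, each agent would call select_action with the current temperature, execute the action, and feed the broadcast reward into update.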
3.2 FMQ Heuristic
In difficult coordination problems, such as the climbing game and the penalty game, the way to achieve convergence to the optimal joint action is by influencing the learners towards their individual components of the optimal joint action(s). To this effect, there exist two strategies: altering the Q-update function and altering the action selection strategy. Lauer and Riedmiller [5] describe an algorithm for multi-agent reinforcement learning which is based on the optimistic assumption. In the context of reinforcement learning, this assumption implies that an agent chooses any action it finds suitable expecting the
other agent to choose the best match accordingly. More specifically, the optimistic assumption affects the way Q values are updated. Under this assumption, the update rule for playing action α defines that Q(α) is only updated if the new value is greater than the current one.

Incorporating the optimistic assumption into Q-learning solves both the climbing game and the penalty game every time. This fact is not surprising since the penalties for miscoordination, which make learning optimal actions difficult, are neglected as their incorporation into the learning tends to lower the Q values of the corresponding actions. Such lowering of Q values is not allowed under the optimistic assumption so that all the Q values eventually converge to the maximum reward corresponding to that action for each agent. However, the optimistic assumption fails to converge to the optimal joint action in cases where the maximum reward is misleading, e.g., in stochastic games (see experiments below). We therefore consider an alternative: the Frequency Maximum Q Value (FMQ) heuristic.

Unlike the optimistic assumption, which applies to the Q update function, the FMQ heuristic applies to the action selection strategy, specifically the choice of EV(α), i.e. the function that computes the estimated value of action α. As mentioned before, the standard approach is to set EV(α) = Q(α). Instead, we propose the following modification:

EV(α) = Q(α) + c ∗ freq(maxR(α)) ∗ maxR(α)

where:

➀ maxR(α) denotes the maximum reward encountered so far for choosing action α.
➁ freq(maxR(α)) is the fraction of times that maxR(α) has been received as a reward for action α over the times that action α has been executed.
➂ c is a weight that controls the importance of the FMQ heuristic in the action selection.

Informally, the FMQ heuristic carries the information of how frequently an action produces its maximum corresponding reward. Note that, for an agent to receive the maximum reward corresponding to one of its actions, the other agent must be playing the game accordingly. For example, in the climbing game, if agent 1 plays action a, which is agent 1's component of the optimal joint action (a, a), but agent 2 doesn't, then they both receive a reward that is less than the maximum. If agent 2 plays c then the two agents receive 0 and, provided they have already encountered the maximum rewards for their actions, both agents' FMQ estimates for their actions are lowered. This is due to the fact that the frequency of occurrence of maximum reward is lowered. Note that setting the FMQ weight c to zero reduces the estimated value function to EV(α) = Q(α).

In the case of independent learners, there is nothing other than action choices and rewards that an agent can use to learn coordination. By ensuring that enough exploration is permitted in the beginning of the experiment, the agents have a good chance of visiting the optimal joint action so that the FMQ heuristic can influence them towards their appropriate individual action components. In a sense, the FMQ heuristic defines a
model of the environment that the agent operates in, the other agent being part of that environment.
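As a rough illustration of the bookkeeping behind the heuristic (our sketch, with illustrative names; action selection itself would still use the Boltzmann strategy of Sect. 3.1 with this estimated value):

```python
class FMQStatistics:
    """Per-action statistics needed to compute the FMQ estimated value."""

    def __init__(self, actions, c=10.0):
        self.c = c
        self.max_reward = {a: None for a in actions}  # maxR(a); unknown until a is tried
        self.max_count = {a: 0 for a in actions}      # times maxR(a) has been received
        self.count = {a: 0 for a in actions}          # times a has been executed

    def record(self, action, reward):
        self.count[action] += 1
        current_max = self.max_reward[action]
        if current_max is None or reward > current_max:
            self.max_reward[action] = reward
            self.max_count[action] = 1
        elif reward == current_max:
            self.max_count[action] += 1

    def estimated_value(self, q_value, action):
        # EV(a) = Q(a) + c * freq(maxR(a)) * maxR(a); falls back to Q(a) before a is tried.
        if self.count[action] == 0:
            return q_value
        freq = self.max_count[action] / self.count[action]
        return q_value + self.c * freq * self.max_reward[action]
```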
3.3 Experimental Results
This section contains our experimental results. We compare the performance of Q-learning using the FMQ heuristic against the baseline experiments, i.e. experiments where the Q values are used as the estimated value of an action in the Boltzmann action selection strategy. In both cases, we use only independent learners. The comparison is done by keeping all other parameters of the experiment the same, i.e. using the same temperature function and experiment length. The evaluation of the two approaches is performed on both the climbing game and the penalty game.

Temperature Settings. Exponential decay in the value of the temperature is a popular choice in reinforcement learning. This way, the agents perform all their learning until the temperature reaches some lower limit. The experiment then finishes and results are collected. The temperature limit is normally set to zero, which may cause complications when calculating the action selection probabilities with the Boltzmann function. To avoid such problems, we have set the temperature limit to 1 in our experiments (this is done without loss of generality). In our analysis, we use the following temperature function:

T(x) = e^(−sx) ∗ max temp + 1

where x is the number of iterations of the game so far, s is the parameter that controls the rate of exponential decay and max temp is the value of the temperature at the beginning of the experiment. For a given length of the experiment (max moves) and initial temperature (max temp) the appropriate rate of decay (s) is automatically derived. Varying the parameters of the temperature function allows a detailed specification of the temperature. For a given max moves, we experimented with a variety of s, max temp combinations and found that they didn't have a significant impact on the learning in the baseline experiments. Their impact is more significant when using the FMQ heuristic. This is because setting max temp at a very high value means that the agent makes random moves in the initial part of the experiment. It then starts making more informed moves (i.e. moves based on the estimated value of its actions) when the temperature has become low enough to allow variations in the estimated value of an action to have an impact on the probability of selecting that action.

Evaluation on the Climbing Game. The climbing game has one optimal joint action (a, a) and two heavily penalised actions (a, b) and (b, a). We use the setting max temp = 500 and vary max moves from 500 to 2000. The learning rate λ is set to 0.9. Figure 1 depicts the likelihood of convergence to the optimal joint action in the baseline experiments and using the FMQ heuristic with c = 1, c = 5 and c = 10. The FMQ heuristic outperforms the baseline experiments for all settings of c. For c = 10, the FMQ heuristic converges to the optimal joint action almost always even for short experiments.
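A small sketch of this temperature schedule follows (ours). The rule for deriving s automatically is not spelled out in the text; one plausible choice, assumed here, is to make the exponential term decay to 1 by the end of the experiment, which for max temp = 500 and max moves = 1000 gives s ≈ 0.0062, close to the s = 0.006 reported for the later experiments.

```python
import math

def temperature(x, s, max_temp):
    """T(x) = exp(-s * x) * max_temp + 1, with lower limit 1."""
    return math.exp(-s * x) * max_temp + 1

def decay_rate(max_moves, max_temp):
    """Assumed derivation of s: require exp(-s * max_moves) * max_temp = 1."""
    return math.log(max_temp) / max_moves

s = decay_rate(max_moves=1000, max_temp=500)
print(round(s, 4))                            # 0.0062
print(temperature(0, s, 500))                 # 501.0 at the start of the experiment
print(round(temperature(1000, s, 500), 2))    # 2.0, close to the lower limit of 1
```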
[Figure 1 plots the likelihood of convergence to the optimal joint action (y-axis, 0 to 1) against the number of iterations (x-axis, 500 to 2000) for FMQ with c = 10, c = 5, c = 1 and for the baseline.]
Fig. 1. Likelihood of convergence to the optimal joint action in the climbing game (averaged over 1000 trials).
Evaluation on the Penalty Game. The penalty game is harder to analyse than the climbing game. This is because it has two optimal joint actions (a, a) and (c, c) for all values of k ≤ 0. The extent to which the optimal joint actions are reached by the agents is affected severely by the size of the penalty. However, the performance of the agents depends not only on the size of the penalty k but also on whether the agents manage to agree on which optimal joint action to choose. Figure 2 depicts the performance of the learners for k = 0 for the baseline experiments and with c = 1 for the FMQ heuristic.
[Figure 2 plots the likelihood of convergence to the optimal joint action (y-axis, 0 to 1) against the number of iterations (x-axis, 500 to 2000) for FMQ with c = 1 and for the baseline.]
Fig. 2. Likelihood of convergence to the optimal joint action in the penalty game k = 0 (averaged over 1000 trials).
As shown in Figure 2, the performance of the FMQ heuristic is much better than the baseline experiment. When k = 0, the reason for the baseline experiment’s failure is not the existence of a miscoordination penalty. Instead, it is the existence of multiple optimal joint actions that causes the agents to converge to the optimal joint action so
infrequently. Of course, the penalty game becomes much harder for greater penalties. To analyse the impact of the penalty on the convergence to optimal, Figure 3 depicts the likelihood that convergence to optimal occurs as a function of the penalty. The four plots correspond to the baseline experiments and to Q-learning with the FMQ heuristic for c = 1, c = 5 and c = 10.
[Figure 3 plots the likelihood of convergence to the optimal joint action (y-axis, 0 to 1) against the penalty k (x-axis, −100 to 0) for FMQ with c = 1, c = 5, c = 10 and for the baseline.]
Fig. 3. Likelihood of convergence to the optimal joint action as a function of the penalty (averaged over 1000 trials).
From Figure 3, it is obvious that higher values of the FMQ weight c perform better for higher penalty. This is because there is a greater need to influence the learners towards the optimal joint action when the penalty is more severe.

3.4 Further Experiments
We have described two approaches that perform very well on the climbing game and the penalty game: FMQ and the optimistic assumption. However, the two approaches are different, and this difference can be highlighted by looking at alternative versions of the climbing game. In order to compare the FMQ heuristic to the optimistic assumption [5], we introduce a variant of the climbing game which we term the partially stochastic climbing game. This version of the climbing game differs from the original in that one of the joint actions is now associated with a stochastic reward. The reward function for the partially stochastic climbing game is included in Table 4. Joint action (b, b) yields a reward of 14 or 0 with probability 50%. The partially stochastic climbing game is functionally equivalent to the original version. This is because, if the two agents consistently choose their b action, they receive the same overall value of 7 over time as in the original game.

Using the optimistic assumption on the partially stochastic climbing game consistently converges to the suboptimal joint action (b, b). This is because the frequency of occurrence of a high reward is not taken into consideration at all. In contrast, the FMQ heuristic shows much more promise in convergence to the optimal joint action. It also
Table 4. The partially stochastic climbing game table.
              Agent 1
              a     b     c
Agent 2  a   11   -30     0
         b  -30   14/0    6
         c    0     0     5
compares favourably with the baseline experimental results. Tables 5, 6 and 7 contain the results obtained with the baseline experiments, the optimistic assumption and the FMQ heuristic for 1000 experiments respectively. In all cases, the parameters are: s = 0.006, max moves = 1000, max temp = 500 and, in the case of FMQ, c = 10. Table 5. Baseline experimental results.
     a    b    c
a  212    0    3
b    0   12  289
c    0    0  381

Table 6. Results with optimistic assumption.

     a     b    c
a    0     0    0
b    0  1000    0
c    0     0    0

Table 7. Results with the FMQ heuristic.

     a    b    c
a  988    0    0
b    0    4    0
c    0    7    1

The final topic for evaluation of the FMQ heuristic is to analyse the influence of the weight (c) on the learning. Informally, the more difficult the problem, the greater the need for a high FMQ weight. However, setting the FMQ weight at too high a value can be detrimental to the learning. Figure 4 contains a plot of the likelihood of convergence to optimal in the climbing game as a function of the FMQ weight.
[Figure 4 plots the likelihood of convergence to optimal (y-axis, 0 to 1) against the FMQ weight (x-axis, 10 to 100).]
Fig. 4. Likelihood of convergence to optimal in the climbing game as a function of the FMQ weight (averaged over 1000 trials).
From Figure 4, we can see that setting the value of the FMQ weight above 15 lowers the probability that the agents will converge to the optimal joint action. This is because, by setting the FMQ weight too high, the probabilities for action selection are influenced too much towards the action with the highest FMQ value which may not be the optimal joint action early in the experiment. In other words, the agents become too narrow-minded and follow the heuristic blindly since the FMQ part of the estimated value function overwhelms the Q values. This property is also reflected in the experimental results on the penalty game (see Figure 3) where setting the FMQ weight to 10 performs very well in difficult experiments with −100 < k < −50 but there is a drop in performance for easier experiments. In contrast, for c = 1 the likelihood of convergence to the optimal joint action in easier experiments is significantly higher than in more difficult ones.

3.5 Limitations of the FMQ Approach
The FMQ heuristic performs equally well in the partially stochastic climbing game and the original deterministic climbing game. In contrast, the optimistic assumption only succeeds in solving the deterministic climbing game. However, we have found a variant of the climbing game in which both heuristics perform poorly: the fully stochastic climbing game. This game has the characteristic that all joint actions are probabilistically linked with two rewards. The average of the two rewards for each joint action is the same as the original reward from the deterministic version of the climbing game, so the two games are functionally equivalent. For the rest of this discussion, we assume a 50% probability. The reward function for the stochastic climbing game is included in Table 8.

It is obvious why the optimistic assumption fails to solve the fully stochastic climbing game. It is for the same reason that it fails with the partially stochastic climbing game: the maximum reward is associated with joint action (b, b), which is a suboptimal action. The FMQ heuristic, although it performs marginally better than normal Q-learning, still doesn't provide any substantial success ratios.
Table 8. The stochastic climbing game table (50%).
               Agent 1
              a       b      c
Agent 2  a  10/12   5/-65   8/-8
         b  5/-65   14/0    12/0
         c  5/-5    5/-5    10/0
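One way to read Table 8 is that each cell lists two equiprobable payoffs whose mean recovers the deterministic climbing game. A small sketch (ours; names are illustrative) of sampling rewards from this table:

```python
import random

# Fully stochastic climbing game (Table 8): two equiprobable payoffs per cell,
# indexed as [agent2_action][agent1_action].
STOCHASTIC_CLIMBING = {
    "a": {"a": (10, 12), "b": (5, -65), "c": (8, -8)},
    "b": {"a": (5, -65), "b": (14, 0), "c": (12, 0)},
    "c": {"a": (5, -5), "b": (5, -5), "c": (10, 0)},
}

def sample_reward(game, action1, action2):
    """Draw one of the two payoffs of the joint action with probability 50% each."""
    return random.choice(game[action2][action1])

# The expected payoff of each joint action equals the deterministic game, e.g. (a, a):
print(sum(STOCHASTIC_CLIMBING["a"]["a"]) / 2)   # 11.0
```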
In the following section, we present a different reinforcement learning technique that solves the fully stochastic climbing game.
4 A Protocol-Based Reinforcement Learning Approach

In games with stochastic payoffs it is difficult to distinguish between the two sources of variation in the observed payoff for some action (the stochasticity of the payoff itself and the unobserved action choices of the other agents). It would be useful to have a protocol that allows two or more agents to select the same joint action repeatedly in order to build up a model for the stochastic payoff distribution. This section describes a new approach for achieving this. The basic idea is that agents follow a shared action selection policy that enables them to estimate the payoffs for each joint action. The action selection policy is based on the following idea: if an agent chooses an action at time i, then the agent is required to choose the same action at specific future time points, defined by a commitment sequence. Note that this approach does not require agents to observe each other's actions. The only assumption that the commitment sequence approach makes is that all agents share the same global clock and that they follow a common protocol for defining sequences of time-slots.
4.1 Commitment Sequences
A commitment sequence is some list of "time slots" (t_1, t_2, . . .) for which an agent is committed to taking the same action. If two or more agents have the same protocol for defining these sequences, then the ensemble of agents is committed to selecting a single joint action for every time in the sequence. Although each agent does not know the action choices of the other agents, it can be certain that the observed payoffs will be statistically stationary and represent unbiased samples from the payoff distribution of some joint action.

In order to allow a potentially infinite number of sequences to be considered as the agent learns, it is necessary that the sequences are finite or have an exponentially increasing time interval δ_i ≡ t_{i+1} − t_i between successive time slots. A sufficient condition is δ_{i+1} ≥ γ δ_i with γ > 1 for all i > i_0 (for some pre-defined constant i_0). In the results given here, sequences are infinite with γ = 8/7. The successive increments are generated as follows:

δ_{i+1} = ⌊(8 δ_i + 6)/7⌋

where ⌊·⌋ indicates rounding down to an integer value. To ensure that no two sequences select
the same time slot, a simple mechanism is introduced. Denote the next time slot for sequence j by t_j. At time t, if all t_j are greater than t, an exploratory action is chosen. Otherwise the first match (the smallest j for which t_j = t) is selected to determine the exploitative action. For sequence j, the increment defined above is used to update t_j. However, t_k is additionally incremented by one for all sequences except k = j. (As an alternative, it is possible to only increment by one the t_k for which k > j. This is a better way to keep the ratio of successive increments close to γ.) For example, using the above function, the first commitment sequence starts with (1, 3, 6, 10, 15, 21, 28, . . .). The second sequence therefore starts at time slot 2 with (2, 5, 9, 14, 20, 27, . . .).
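A minimal sketch (ours) of the increment rule follows. The initial increment of 2 is inferred from the example above, and the +1 adjustments that a sequence receives whenever an earlier sequence claims a slot are omitted here; applying those adjustments to a sequence starting at time slot 2 turns its raw slots (2, 4, 7, 11, . . .) into the (2, 5, 9, 14, 20, 27, . . .) of the example.

```python
def next_increment(delta):
    """delta_{i+1} = floor((8 * delta_i + 6) / 7); successive increments grow
    by a ratio close to gamma = 8/7."""
    return (8 * delta + 6) // 7

def commitment_slots(start, n_slots, first_delta=2):
    """Raw time slots of one commitment sequence (assumed initial increment of 2),
    before any +1 collision adjustments caused by earlier sequences."""
    slots, t, delta = [start], start, first_delta
    for _ in range(n_slots - 1):
        t += delta
        slots.append(t)
        delta = next_increment(delta)
    return slots

print(commitment_slots(1, 7))   # [1, 3, 6, 10, 15, 21, 28], as in the example
```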
4.2 Finding the Exploitative Action
For time i suppose the agents chose actions (a_i1, a_i2, . . . , a_im) (where m is the number of agents). Then an estimate of the value of this joint action is available as the average payoff received during the part of the sequence that has been completed so far. Longer sequences provide more reliable estimates. To reason about the true expected payoff, we must make some assumptions about the possible form of the stochastic payoff for each joint action: for example it must have finite variance. Here we use a Gaussian model and estimate its mean and variance from the observations. If n payoffs are observed with empirical average m and sum of squares S, we obtain estimates for the population mean µ and its variance σ_µ:

µ̂ = m
σ̂_µ² = (S + σ_0²)/n² − m²/n

σ_0 is a parameter to the algorithm and should be based on the expected variance of payoffs in the game; in all our experiments σ_0 = 10. In order to prefer longer sequences (more reliable estimates), a pessimistic estimate µ̂ − N_σ σ̂_µ is used to provide a lower bound on the expected return for each sequence. At any given time, the exploitative behaviour for an agent is to choose the action corresponding to the sequence with the greatest lower bound. Large values of N_σ reduce the risk that an optimistic bias in the payoff estimate from a short sequence will affect the choice of action. However, smaller values may give faster initial learning. In the results below, N_σ = 4.
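The variance estimate above behaves like a regularised estimate of the variance of the mean: it reduces to σ_0² for a single observation and approaches the usual sample variance of the mean for long sequences. A small sketch (ours; names are illustrative) of the resulting pessimistic lower bound:

```python
import math

def pessimistic_value(payoffs, sigma0=10.0, n_sigma=4.0):
    """Lower bound mu_hat - N_sigma * sigma_hat on the expected payoff of the
    joint action behind one commitment sequence, from the payoffs seen so far."""
    n = len(payoffs)
    m = sum(payoffs) / n                      # empirical average
    s = sum(x * x for x in payoffs)           # sum of squares
    var_mu = (s + sigma0 ** 2) / n ** 2 - m ** 2 / n
    return m - n_sigma * math.sqrt(max(var_mu, 0.0))

print(pessimistic_value([11.0]))          # -29.0: a single sample is judged cautiously
print(pessimistic_value([11.0] * 100))    # 10.6: a long consistent run tightens the bound
```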
4.3 Exploration Policy
Each agent must choose an action at the start of each sequence. A new sequence starts whenever no existing sequence is active in the current time slot. There are two obvious ways to select the new action: either explore (select the action randomly and uniformly) or exploit (select the action currently considered optimal). The simple approach used here is to choose randomly between exploration and exploitation for each sequence. For a 2-agent system, we choose the exploration probability to be 1/√2. This ensures that both agents select an exploratory action with probability 1/2. As an exception, the first N_init sequences (where N_init ≥ 1) must be exploratory to ensure that an exploitative action can be calculated. In the results below, N_init = 10.
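A sketch of this choice (ours). The generalisation of the exploration probability to n agents, (1/2)^(1/n), is an assumption; the text only fixes the two-agent case, 1/√2.

```python
import random

def new_sequence_action(actions, best_action, sequences_started,
                        n_agents=2, n_init=10):
    """Pick the action committed to for a newly started sequence."""
    explore_prob = 0.5 ** (1.0 / n_agents)   # 1/sqrt(2) for two agents (assumed generalisation)
    if sequences_started < n_init or random.random() < explore_prob:
        return random.choice(actions)        # explore: uniform random action
    return best_action                       # exploit: greatest pessimistic lower bound so far

print(new_sequence_action(["a", "b", "c"], best_action="a", sequences_started=12))
```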
Table 9. Results for partially stochastic climbing game
     a    b    c
a  995    0    0
b    0    5    0
c    0    0    0

Table 10. Results for stochastic climbing game
     a    b    c
a  992    0    0
b    0    4    4
c    0    0    0
4.4 Experimental Evaluation
The commitment sequence method was successful for all the problems described in the previous section, including the stochastic climbing game. We tested the method over 1000 trials, with the number of moves per trial being restricted to either 500 or 1000. In the climbing game, the likelihood of convergence to the optimal exploitative action reached 0.985 after 500 moves, i.e. the exploitative action after 500 moves was optimal in 985 of the 1000 trials. This increased to 1.000 when the number of moves was increased to 1000. In the penalty game with 1000 moves, the commitment sequence approach always converged to an optimal joint action for all values of k between −100 and 0. For the partially stochastic climbing game, the convergence probability to the optimal joint action after 1000 moves was 0.995, and for the stochastic climbing game it was 0.992.
5 Outlook

We have presented an investigation of two techniques that allow two independent agents that are unable to sense each other's actions to learn coordination in cooperative single-stage games, even in difficult cases with high miscoordination penalties. However, there is still much to be done towards understanding exactly how the action selection strategy can influence the learning of optimal joint actions in this type of repeated game. In the future, we plan to investigate this issue in more detail. Furthermore, since agents typically have a state component associated with them, we plan to investigate how to incorporate such coordination learning mechanisms in multi-stage games. We intend to further analyse the applicability of various reinforcement learning techniques to agents with a substantially greater action space. Finally, we intend
to perform a similar systematic examination of the applicability of such techniques to partially observable environments where the rewards are perceived stochastically.
References

1. C. Boutilier. Sequential optimality and coordination in multiagent systems. In Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99), pages 478–485, 1999.
2. Caroline Claus and Craig Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Proceedings of the Fifteenth National Conference on Artificial Intelligence, pages 746–752, 1998.
3. Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, Cambridge, MA, 1998.
4. Leslie Pack Kaelbling, Michael Littman, and Andrew W. Moore. Reinforcement learning: A survey. Journal of Artificial Intelligence Research, 4, 1996.
5. Martin Lauer and Martin Riedmiller. An algorithm for distributed reinforcement learning in cooperative multi-agent systems. In Proceedings of the Seventeenth International Conference on Machine Learning, 2000.
6. Sandip Sen and Mahendra Sekaran. Individual learning of coordination knowledge. JETAI, 10(3):333–356, 1998.
7. Sandip Sen, Mahendra Sekaran, and John Hale. Learning to coordinate without sharing information. In Proceedings of the Twelfth National Conference on Artificial Intelligence, pages 426–431, Seattle, WA, 1994.
8. S. Singh, T. Jaakkola, M. L. Littman, and C. Szepesvári. Convergence results for single-step on-policy reinforcement-learning algorithms. Machine Learning Journal, 38(3):287–308, 2000.
9. Ming Tan. Multi-agent reinforcement learning: Independent vs. cooperative agents. In Proceedings of the Tenth International Conference on Machine Learning, pages 330–337, 1993.
10. C. J. C. H. Watkins. Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England, 1989.
11. Gerhard Weiss. Learning to coordinate actions in multi-agent systems. In Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, volume 1, pages 311–316. Morgan Kaufmann Publ., 1993.
Cooperative Learning Using Advice Exchange

Luís Nunes¹,² and Eugénio Oliveira¹

¹ Laboratório de Inteligência Artificial e Ciência de Computadores (LIACC) – Núcleo de Inteligência Artificial Distribuída e Robótica (NIAD&R), Faculdade de Engenharia da Universidade do Porto (FEUP), Av. Dr. Roberto Frias, 4200-465 Porto, Portugal
² Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE), Av. Forças Armadas, Edifício ISCTE, 1649-026 Lisboa, Portugal
[email protected], [email protected]
Abstract. One of the main questions concerning learning in a Multi-Agent System’s environment is: “(How) can agents benefit from mutual interaction during the learning process?” This paper describes a technique that enables a heterogeneous group of Learning Agents (LAs) to improve its learning performance by exchanging advice. This technique uses supervised learning (backpropagation), where the desired response is not given by the environment but is based on advice given by peers with better performance score. The LAs are facing problems with similar structure, in environments where only reinforcement information is available. Each LA applies a different, well known, learning technique. The problem used for the evaluation of LAs performance is a simplified traffic-control simulation. In this paper the reader can find a summarized description of the traffic simulation and Learning Agents (focused on the advice-exchange mechanism), a discussion of the first results obtained and suggested techniques to overcome the problems that have been observed.
1 Introduction

The objective of this work is to contribute towards a credible answer to the following question: "(How) can agents benefit from mutual interaction during the learning process, in order to achieve better individual and overall system performances?" The objects of study are the interactions between the Learning Agents (hereafter referred to as agents for the sake of simplicity) and the effects these interactions have on individual and global learning processes. In Multi-Agent Systems (MAS), interactions that affect the learning process can take several forms. These forms of interaction can range from the indirect effects of other agents' actions (whether they are cooperative or competitive), to direct communication of complex knowledge structures, as well as cooperative negotiation of a search policy or solution. The most promising way in which cooperative learning agents can benefit from interaction seems to be by exchanging (or sharing) information regarding the learning process itself. As observed by Tan [1], agents can exchange information regarding several
aspects of the learning process: a) the state of the environment, b) episodes (state, action, reward triplets), or c) internal parameters and policies.

Exchanging environment states may be interpreted as if each agent has extra sets of sensors spread out in the environment, being able to have a more complete view of the external state. This larger view of the state-space may require either pre-acquired knowledge on how to interpret this information and integrate it with its own view of the environment's state, or simply be considered as extra input providing a wider range of information about the state. Techniques that use this strategy may be adequate to solve the problem of partially observable states in environments where this situation creates serious problems for learning or cooperation amongst agents.

Episode exchange requires that the agents are (or have been) facing similar problems, requiring similar solutions, and may lead to large amounts of communication if there are no criteria regulating the exchange of information. In the limit case, where all agents share all the episodes, this process can also be seen as a single learning system, and produces very little new knowledge. In fact, the exchange of too much data may lead all the agents to follow the same path through the search space, wasting valuable exploration resources. Nevertheless, the exchange of information has proved to be beneficial if used with care, as shall be demonstrated.

Sharing internal parameters requires that agents have similar internal structures, so that they can easily map their peers' internal parameters into their own, or that they share a complex domain ontology. This type of information exchange can be very effective if there are no restrictions to communication, and the user can be sure that a particular learning algorithm is more suitable than others to solve the problem at hand, or if there is a reliable way of mapping the internal parameters of the solution acquired by one agent to its peers.

The question of exchanging information during learning is not only "what type of information to exchange?" but also "when to exchange information?", "how much information is it convenient to exchange?", "how to use shared information?" and "what is the reliability of each piece of information?". When considering human cooperative learning in a team, a common method to improve one's skills is to ask for advice at critical times, or to request a demonstration of a solution to a particular problem from someone who is reputed to have better skills in the subject. During this process several situations may occur:

• The advisee evaluates the capacity of the elements of a group of potential advisors to provide advice on a specific situation, then selects an advisor and explains the problem.
• The advisor seeks a solution, selects which information to give and tries to present it in a format that is understandable by the advisee. The advisor can also give meta-information regarding the validity and limitations of that advice.
• The advisee pre-evaluates the advice based on his past experience and on the trust he has in the advisor, interprets the advice and applies it to his own situation. Then he evaluates the results, and finally updates his opinion concerning the advisor's ability to help in a given situation.
This process (or selected parts of it) is what we are attempting to translate into the realm of Multi-Agent Systems Learning (MASL). Another important point in human learning is that different people specialize in different things, either because of their differing talents or as a consequence of their working experience, which forced them to work more on certain problems. This means that some people will be more efficient at dealing with some specific problems than others. In MAS, especially when considering agents with different learning capabilities, this is another point that can be explored. It is common that some kind of specialization occurs, either because a particular learning technique is more fit to solve a certain problem or simply because the dynamics of the environment have caused one agent to have more experience than others in certain areas of the problem.

The initial work on MASL reported here is mainly concerned with the effect of exchanging advice in a heterogeneous group of agents, where each one is dealing with problems with a similar structure, but in which the actions of one agent do not have an impact on the state sensed by other agents. This scenario can be found in several application domains in which agents need to solve similar problems but do not need to share resources to solve them. One example might be internet web-crawlers, which can learn from the experience of their peers but, apart from the information exchanged amongst them, have little impact on the state observed by their peers. In this paper's future work section several extensions to this problem are previewed.

The main difference from other related work is in the use of agents with different learning strategies, as opposed to the study of cooperative Q-Learners, which is the most common approach in the related work. The authors believe that the heterogeneity of the group can help to overcome the problems of the "No Free Lunch Theorem" and provide better response to difficult distributed control tasks in which learning can be advantageous. In the experiments, agents selectively share episodes by requesting advice for given situations from other agents whose score is, currently, better than their own in solving a particular problem. The advisors are not pre-trained experts. All agents are started at the same time and run synchronously.

Considering the several possibilities for exchanging information regarding the learning process, discussed above, this option seemed the most promising for the following reasons: a) sharing of episodes does not put heavy restrictions on the heterogeneity of the underlying learning algorithms and may be achieved using a simple communication mechanism; b) having different algorithms solving similar problems may lead to different forms of exploration of the same search space, thus increasing the probability of finding a good solution; c) it is more informative and less dependent on pre-coded knowledge than the exchange of environment states.

Experiments were conducted with a group of agents embedded in a simplified simulation of a traffic-control problem to test the advantages and problems of advice-exchange during learning. Each agent applies a different learning mechanism, unknown to the others, and uses a standard version of a well-known, sub-symbolic, learning algorithm (so far the set of algorithms used is restricted to: Random Walk, Evolutionary Algorithms, Simulated Annealing, and Q-Learning).
The heterogeneous nature of the group makes communication of internal parameters or policies difficult to use, since sharing this information would require agents to translate their internal knowledge to a common format. Despite the fact that this is an interesting question, it is beyond the scope of the current research. The exchanged information is: the current state (as seen by the advisee); the best response that can be provided to that state (by the advisor agent); and the current and best scores, broadcast at the end of each training stage (epoch).

The problem chosen to test the use of advice-exchange has, as most problems studied in MASL, the following characteristics: a) analytical computation of the optimal actions is intractable; b) the only information available to evaluate learning is a measure of the quality of the present state of the system; c) the information regarding the quality of the state is composed of a local and a global component; d) the same action executed by a given agent may have different consequences at different times, even if the system is (as far as the agent is allowed to know) in the same state; e) the agent has only a partial view of the problem's state.

The simplified traffic-control problem chosen for these experiments requires that each agent learn to control the traffic-lights in one intersection under variable traffic conditions. Each intersection has four incoming, and four outgoing, lanes. One agent controls the four traffic lights necessary to discipline traffic in one intersection. In the experiments reported here, the crossings controlled by each of the agents are not connected to each other. The learning parameters of each agent are adapted using two different methods: a reinforcement-based algorithm, using a quality measure that is directly supplied by the environment, and supervised learning using the advice given by peers as the desired response. Notice that the term "reinforcement-based" is used to mean "based on a scalar quality/utility feedback", as opposed to supervised learning, which requires a desired response as feedback. The common usage of the term "reinforcement learning", which refers to variations of temporal difference methods [2], is a subclass of reinforcement-based algorithms, as are, for instance, most flavours of Evolutionary Algorithms.

In section 2 the reader can find a review of related work. Section 3 contains a brief description of the experimental setup, focused on the advice-exchange algorithm. Section 4 concerns the discussion of the initial results, and finally section 5 presents some conclusions and a preview of the future work to be done in this direction.
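As a rough illustration of the exchange just described (our sketch; message fields, names and the advisor-selection rule shown are illustrative, not a specification from the paper), the advisee sends its current state to a better-scoring peer and receives that peer's best response, which can then be used as the desired output of a supervised (backpropagation) update:

```python
from dataclasses import dataclass
from typing import List, Sequence

@dataclass
class EpochScores:
    """Broadcast by every agent at the end of each training stage (epoch)."""
    agent_id: str
    current_score: float
    best_score: float

@dataclass
class AdviceRequest:
    advisee_id: str
    state: Sequence[float]              # the advisee's view of its environment state

@dataclass
class AdviceReply:
    advisor_id: str
    advised_response: Sequence[float]   # the advisor's best response to that state

def pick_advisor(my_current_score: float, peers: List[EpochScores]):
    """Ask the peer whose current score is better than our own (best peer first);
    return None if no peer is currently doing better."""
    better = [p for p in peers if p.current_score > my_current_score]
    return max(better, key=lambda p: p.current_score).agent_id if better else None
```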
2 Related Work

The work on cooperative learning had some important contributions in the early nineties with the results published by Whitehead [3], Lin [4] and Tan [1]. All these works focused on cooperation of Learning Agents that use variations of Q-Learning [5]. Whitehead experimented with two cooperative learning mechanisms: Learning with an External Critic (LEC) and Learning By Watching (LBW). The first (LEC) is based on the use of an expert automated critic that provides feedback concerning the agent's actions more frequently than the environment would, while the second (LBW) learns vicariously by watching other agents' behaviour (which is
equivalent to sharing state, action, quality triplets). This work proves that the complexity of the search mechanisms of both LEC and LBW is inferior to that of standard Q-Learning for an important class of state-spaces. Experiments reported in [6] support these conclusions.

Lin uses an expert teacher to improve the performance of two variants of Q-Learning. This work reports that the "advantages of teaching should become more relevant as the learning task gets more difficult". Results in variants of the maze problem show that teaching does improve learning performance in the harder task, although it seems to have no effect on the performance on the easier task.

Tan addressed the problem of exchanging information during the learning process amongst Q-Learning agents. This work reports the results of sharing several types of information amongst a group of agents in the predator-prey problem. Experiments were conducted in which agents shared policies, episodes (state, action, quality triplets), and sensation (state). Although the experiments use solely Q-Learning in the predator-prey domain, the author believes that "conclusions can be applied to cooperation among autonomous learning agents in general". Conclusions point out that "a) additional sensation from another agent is beneficial if it can be used efficiently, b) sharing learned policies or episodes among agents speeds up learning at the cost of communication, and c) for joint tasks, agents engaging in partnership can significantly outperform independent agents, although they may learn slowly in the beginning". The results reported by Tan also appear to point to the conclusion that sharing episodes with peers is beneficial and can lead to a performance similar to that obtained by sharing policies. Sharing episodes volunteered by an expert agent leads to the best scores in some of the presented tests, significantly outperforming most of the other strategies in the experiments.

After these first, fundamental works, several variants of information-sharing Q-Learners appeared, reporting good results from the mixture of some form of information sharing and reinforcement learning. Matarić [7] reports on the use of localized communication of sensory data and reward as a way to overcome hidden state and credit assignment problems in groups of Reinforcement Learning agents involved in a cooperative task. The experiments conducted in two robot problems (block pushing and foraging) show improvements in performance in both cases.

Several researchers investigated the subject of using an expert automated teacher. Baroglio [8] uses an automatic teacher and a technique called "shaping" to teach a Reinforcement Learning algorithm the task of pole balancing. Shaping is defined as a relaxation of the evaluation of goal states in the beginning of training, and a tightening of those conditions in the end. Clouse [9] uses an automatic expert trainer to give a Q-Learning Agent actions to perform, thus reducing the exploration time. Brafman and Tennenholtz [10] use an expert agent to teach a student agent in a version of the "prisoner's dilemma". Both authors use variations of Q-Learning. Price and Boutilier [11] have demonstrated that the use of learning by imitating one (or several) expert agent(s) produces good results, in variants of the maze problem. Again, Q-Learning agents are used. Berenji and Vengerov [12] report analytical and experimental results concerning the cooperation of Q-Learning agents by sharing quality values amongst them.
Experiments were conducted in two abstract problems. Results point out that
limitations to cooperative learning described in [3] can be surpassed successfully, under certain circumstances, leading to better results than the theoretical predictions foresaw.

Learning joint actions has also been investigated by several research groups. Claus and Boutilier [13] and Kapetanakis and Kudenko [14] have worked on this problem using Q-Learning agents. Using a human teacher to improve the learning performance of an agent at a given task has also been a topic to which some researchers have devoted their attention. Maclin and Shavlik [15] use human advice, encoded in rules, which are acquired in a programming language that was specifically designed for this purpose. These rules are inserted in a Knowledge Based Neural Network (KBANN) used in Q-Learning to estimate the quality of a given action. Matarić [16] reports several good results using human teaching and learning by imitation in robot tasks. Experimental results can be found in [17] [18] [19].

In the area of Machine Learning (ML), some interesting experiments have also been conducted that are related to this work. Provost and Hennessy [20] use cooperative learning, partitioning the data amongst a group of Distributed Rule Learners (each performing general-to-specific beam search) to speed up learning for tasks with large amounts of data. Hogg and Williams [30] have experimented with cooperative search using a mixture of methods (depth-first, backtracking search and heuristic repair) to solve hard graph coloring problems. The learners exchange information on partial solutions, and the results report that "even using simple hints they [, the learners,] can improve performance". Simultaneous uses of Evolutionary Algorithms [21][22] and Backpropagation [23] are relatively common in the ML literature, although in most cases Evolutionary Algorithms are used to select the topology or learning parameters, and not to update weights. Some examples can be found in [24] and [25]. There are also reports on the successful use of Evolutionary Algorithms and Backpropagation simultaneously for weight adaptation [26][27][28]. Most of the problems in which a mixture of Evolutionary Algorithms and Backpropagation is used are supervised learning problems, i.e., problems for which the desired response of the system is known in advance (not the case of the problem studied in this paper). Castillo et al. [29] obtained good results in several standard ML problems using Simulated Annealing and Backpropagation, in a way similar to the one applied in this work. Again, this was used as an add-on to supervised learning to solve a problem for which there is a well-known desired response. The use of learning techniques for the control of traffic-lights can be found in [31], [32] and [33].
3 Experimental Setup
This section briefly describes the internal details of the traffic simulation, the learning techniques, and the advice-exchange algorithm. A more detailed description of the traffic simulation and learning techniques can be found in [34].
3.1 The Traffic Simulator
The traffic simulator environment is composed of lanes, lane-segments, traffic-lights (and the corresponding controlling agents), and cars. The latter are not realistically modeled, having infinite braking capabilities and being unable to perform any illegal maneuver. Cars are inserted at the beginning of each lane with a probability that varies in time according to a function with different parameters for each lane, and are removed when they reach an extremity of any of the outgoing lane-segments, after having passed through the scenario. At the beginning of each green-yellow-red cycle, the agents that control each crossing observe the local state of the environment and decide on the percentage of green-time (g) to attribute to the North and South lanes; the percentage of time attributed to the East and West lanes is automatically set to (1 – g). Yellow-time is fixed. The environment is represented as a real-valued vector. The first four components represent the ratio of the number of incoming vehicles in each lane relative to the total number of incoming vehicles in all lanes. The four remaining values represent the lifetime of the incoming vehicle that is closest to the traffic-light. This state representation is similar to the one that was reported to have produced some of the best results in the experiments conducted by Thorpe [32] for the same type of problem (learning to control traffic-lights at an intersection). The quality of service of each traffic-light controller is inversely proportional to the average time cars take to cross the traffic-light, counted from their creation at the beginning of the lane. The car generation parameters in the traffic simulator proved difficult to tune. Slight changes led to simulations that were either too difficult (neither the heuristic nor any learned strategy was able to prevent major traffic jams), or to problems in which both simple heuristics and learned strategies were able to keep a normal traffic flow within very few learning steps.
3.2 Learning Agents
This section contains a summarized description of the learning algorithms used by each of the agents involved in the traffic-lights control experiments, as well as the heuristic used for the fixed-strategy agent.
3.2.1 Stand-Alone Agents
The stand-alone versions of the learning agents are used to provide results with which the performance of advice-exchanging agents could be compared. The stand-alone agents implement four classical learning algorithms: Random Walk (RW), which is a simple hill-climbing algorithm, Simulated Annealing (SA), Evolutionary Algorithms (EA) and Q-Learning (QL). A fifth agent was implemented (HEU) using a fixed heuristic policy. As the objective of these experiments was not to solve this problem in the most efficient way, but to evaluate advice-exchange for problems that have characteristics
similar to this one, the algorithms were not chosen or fine-tuned to produce the best possible results for traffic-control. The choice of algorithms and their parameters was guided by the goal of comparing the performance of a heterogeneous group of learning agents using advice-exchange to that of a group whose elements learn individually. All agents, except QL and HEU, adapt the weights of a small, one-hidden-layer neural network. The Random Walk (RW) algorithm simply disturbs the current values of the weights of the neural network by adding a random value of a magnitude that is decreased throughout the training. At the end of an epoch, the new set of parameters is kept if the average quality of service in the controlled crossing during that epoch is better than the best average quality achieved so far. Simulated Annealing (SA) [35] works in a similar way to Random Walk, but it may accept the new parameters even if the quality has diminished. The probability of acceptance is given by a Boltzmann distribution with decaying temperature. Evolutionary Algorithms (EA) [21] were implemented in a similar way to the one described in [36], which is reported to have been successful in learning to navigate in a difficult variation of the maze problem by updating the weights of a small Recurrent Artificial Neural Network. This implementation relies almost entirely on the mutation of the weights, in a way similar to the one used for the disturbance of weights described for RW and SA. Each set of parameters (specimen), which comprises all the weights of a neural network of the appropriate size, is evaluated during one epoch. After the whole population is evaluated, the best n specimens are chosen for mutation and recombination. An elitist strategy is used by keeping the best b <
Table 1. Steps of the advice-exchange sequence for an advisee agent (i) and an advisor agent (k).
1. Agent i: receives the best average quality (bqj) from all other agents (j ≠ i). The current quality for Agent i is cqi.
2. Loop: Agent i: gets state s for evaluation.
3. Agent i: calculates k = arg maxj(bqj), for all agents (j ≠ i).
4. Agent i: if cqi < d·bqk, where 0 < d ≤ 1, requests advice from Agent k for state s.
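As a concrete illustration of steps 1–4, the advice-request test can be sketched as follows. This is a minimal Python reconstruction rather than the authors' implementation; the function and variable names are ours, and only the decision of whether (and whom) to ask is covered.

```python
def choose_advisor(agent_id, current_quality, best_qualities, d=0.8):
    """Return the id of the advisor to ask for advice, or None.

    best_qualities maps every agent j to its advertised best average quality
    bq_j; current_quality is cq_i for the requesting agent i. The threshold
    d (0 < d <= 1) controls how far behind the best peer an agent must be
    before it requests advice (d = 0.8 in the experiments reported below).
    """
    peers = {j: bq for j, bq in best_qualities.items() if j != agent_id}
    if not peers:
        return None
    k = max(peers, key=peers.get)        # step 3: peer with best advertised quality
    if current_quality < d * peers[k]:   # step 4: advisee is lagging behind
        return k                         # request advice from agent k for state s
    return None
```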
Q(a|s) = (1 − α)Q(a|s) + α ( r + β Σ_{s′∈Ssa} p(s′|a, s) max_{a′∈As′} Q(a′|s′) ),   (1)

where α is the learning rate, r is the estimated reward achieved by taking that action, β is the discount parameter, and p(s′|a, s) is the probability of a transition to state s′ given that action a is executed at state s. This probability is calculated based on previous experience, as the quotient between the number of transitions (ntsas′) to state s′ when performing
action a at the current state s, and the total number of transitions from the current state by action a, i.e.,

p(s′|a, s) = ntsas′ / Σ_{i∈Ssa} ntsai,   (2)
where Ssa is the set of states reachable from state s by action a, and As′ is the set of actions available at state s′. The estimated reward of the advised action will be the average reward advertised by the advisor agent. After updating its internal parameters with the advised information, the advisee agent gives the appropriate response to the system following the normal procedure for each particular algorithm. A different type of adaptation when learning from advice is required in the Q-Learning agent because the state of the system after the proposed action is unknown. The advisor has not actually performed the action and, possibly, neither will the advisee, so the utility of the next state needs to be estimated by a weighted average of the utilities of all the possible following states when executing the given action.
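A hedged sketch of this advised update is shown below. It assumes a tabular Q-function and per-pair transition counts gathered during normal learning; the names (Q, nt, advised_q_update) are ours, and r stands for the average reward advertised by the advisor, as described above.

```python
def advised_q_update(Q, nt, s, a, r, alpha=0.1, beta=0.9):
    """Update Q(a|s) for an advised action a in state s, following Eqs. (1)-(2).

    Q  : dict mapping (state, action) -> quality estimate
    nt : dict mapping (state, action) -> {next_state: transition count}
    r  : estimated reward, here the average reward advertised by the advisor
    """
    counts = nt.get((s, a), {})
    total = sum(counts.values())
    expected_future = 0.0
    if total > 0:
        for s_next, n in counts.items():
            p = n / total  # Eq. (2): empirical transition probability
            best_next = max((q for (st, _), q in Q.items() if st == s_next),
                            default=0.0)
            expected_future += p * best_next
    # Eq. (1): blend the old estimate with the advised reward and expected utility
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * (r + beta * expected_future)
    return Q[(s, a)]
```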
4 Discussion of the Initial Results
Before starting the experiments, some results were expected, namely:
a) After the beginning of advice-exchange, fast, step-like increases in quality of response. This should happen as soon as one of the agents found a better area of the state space and drove other agents that had worse performances to that area.
b) Faster initial learning rates due to parallel exploration and the sharing of information.
c) Final convergence on better quality values than in tests where no advice is exchanged.
d) Problems of convergence to local optima when using an excess of advice, or high learning rates when processing advice.
In the first set of experiments the parameter that controls the decision of requesting advice from a peer based on the difference between the current quality of the advisee's actions and the best advertised quality, referred to as d in Table 1, was set to 1. The consequence of this setting is that every time an agent has an average quality within the epoch that is lower than the best average quality achieved by its peers in previous epochs, it will request advice. Obviously, this situation happens most of the time, and this has two bad consequences: 1) the time (and resources) required to run the experiments increases several times, compared to other experiments where the value of d is lower; 2) as soon as advice-exchange is allowed, all algorithms seem to be attracted to the same area of the search-space, and even the ones that use random disturbance have some difficulty leaving this area (because they are very often advised by an algorithm that is fixed on a solution in that area). This is particularly bad since it throws away one of the most promising characteristics of this type of system, which is
parallel exploration. It was found that for values of d equal to or smaller than 0.8 this effect disappears. This was the value used in the experiments reported here. Table 2 shows a summarized comparison of the main aspects of the training for the two groups of agents involved: Stand-Alone and Advice-Exchanging agents. Each experiment consisted of running both groups with the same initial conditions. In each of these two runs one group controlled the four traffic-lights available. Each set of traffic-lights that controls one crossing was controlled by an agent with a different learning strategy. In these experiments the crossings were not interconnected. The results are based on a set of 19 experiments, each running for 1600 epochs, with different parameterizations for the introduction of cars at the beginning of each lane in each experiment. A greater number of experiments was performed, but in some of these experiments learning stopped within the first 200 epochs. In these experiments one of two situations occurred: either all the agents climbed to a good quality value within 200 epochs, easily keeping a regular flow of traffic, or neither the learning agents nor the heuristic managed to prevent heavy traffic jams. These inconclusive experiments were not considered in the results presented in Table 2.
Table 2. Summary of first results for the traffic-light control experiment for the Stand-Alone and Advice-Exchanging groups.
              Better initial score   Best final score
Stand-Alone           5%                   11%
Advice               58%                   47%
Tie                  37%                   42%
The first column shows the percentage of trials in which a group of agents had better scores than its counterparts in the initial phase of the trial. A group of agents is considered a winner if after 500 epochs all its elements are within 10% of the best score achieved so far (regardless of the group whose member has achieved the best score) and if the same condition does not hold for the other group. All other situations are considered a tie. The second column shows the results for best final score using the same method to determine the winning group. The actual results observed differ in some respects from expectations. In more than half of the experiments, agents that use advice do climb much faster to a reasonable quality plateau in the initial stage of the training, as foreseen in (a), but, contrary to expectations, they do not achieve better final scores as often as foreseen: only in 21% of the cases did a member of the Advice-Exchanging group achieve the best final score. In some cases learning was much slower in the Advice-Exchange group after these fast early increases in quality, and the high initial quality value obtained by advice-exchanging agents was gradually equaled, or even surpassed, by their Stand-Alone counterparts. Figure 1 shows a detail of the initial phase of a trial where we can see a typical situation of a fast increase in quality at an early stage caused by advice exchange. The Q-Learning and Simulated Annealing agents jump to higher quality areas and "pull",
Fig. 1. Detail of the initial phase of a trial. The vertical axis represents the best average quality obtained by each agent up to the moment and the horizontal axis represents the number of epochs elapsed. In this case, advice given first by the Q-Learning agent (AQL) and later by Simulated Annealing (ASA) led most of the elements in the group to a sudden climb of more than 10% in the quality of service. The Evolutionary Algorithms agent also benefited from this jump, but the climb was less steep and started from a lower point.
by advice-exchange, their peers into those areas in a few epochs. In this experiment the advice-exchanging agents did not stop at this quality plateau, being able to obtain better final scores than their counterparts, although learning slowed from this point on. The Evolutionary Algorithm agent, not seen in the figure, does eventually get to that quality plateau, but it takes longer to show the effects of advice-exchange. This is a common observation, and it is due to the distribution of advice among different specimens, which slows down the effects of advice-exchange in this type of agent. One of the most interesting problems observed was that of ill advice. It was observed that some agents, due to a "lucky" initialization and exploration sequence, never experience very heavy traffic conditions; thus, their best parameters are not suited to deal with this problem. When asked for advice regarding a heavy traffic situation, their advice is not only useless, but also harmful, because it is stamped with the "quality" of an expert. In Q-Learning this was easy to observe because there were situations, far into the trials, for which advice was being given concerning states that had never been visited before. In the next section some measures to prevent this problem will be discussed. One important characteristic that was observed was that advice-exchanging agents fall into local optima considerably less often than their counterparts, being able to cope with bad initial parameters better than the members of the Stand-Alone group. This is due
to the fact that the supervised learning that occurs based on the advice given by peers helps these algorithms to move into more promising regions of the search-space.
5 Conclusions and Future Work
As mentioned in the previous section, advice-exchange has proved, in these initial trials, to have some advantages over the use of stand-alone agents, speeding up the initial learning stages and making the behavior of the group of learners more robust in the face of bad initialization. This seems to be a promising way in which agents can benefit from mutual interaction during the learning process. However, this is just the beginning of a search, where a few questions were answered and many were raised. A thorough analysis of the conditions in which this technique is advantageous is still necessary. It is important to discover how this technique performs when agents are not just communicating information about similar learning problems, but attempting to solve the same problem in a common environment. The application of similar methods to other types of learning agents, as well as to other problems, is also an important step in the validation of this approach. It is fundamental to assess to what extent heterogeneity and advice-exchange, each on its own, are responsible for the better performance of a cooperative learning system. For the time being, experiments in the predator-prey problem and a more realistic traffic environment are under development. This new model for traffic behavior is based on the Nagel-Schreckenberg model, described in [39]. The authors hope that this new formulation provides a richer environment in which advice-exchange can be more thoroughly tested. One of the main problems observed with advice-exchange is that bad advice, or blind reliance, can hinder the learning process, sometimes beyond recovery. One of the major hopes to deal with this problem is to develop a technique in which advisors and/or advisees can measure the quality of the advice, and agents can develop trust relationships, which would provide a way to filter bad advice. This may be especially interesting if an evaluation of the capability of an agent to advise can be associated with agent-situation pairs. This will allow the advisee to determine who is the expert on the particular situation it is facing. Work on the development of "trust" relationships amongst a group of agents has been reported recently in several publications, one of the most interesting for the related subject being [40]. Another interesting issue arises from the fact that humans usually offer advice in limit situations. Either great new discoveries, or actions that may be harmful to the advisee, seem to be of paramount importance in the use of advice. The same applies to the combination of advice from several sources. These techniques may require an extra level of skills: more elaborate communication and planning capabilities, long-term memory, etc. These capabilities fall more into the realm of symbolic systems. The connection between learning layers, which has also been an interesting and rich topic of research in recent years, may play an important role in taking full advantage of some of the concepts outlined above.
Our major aim is to derive, through a set of experiments, principles and laws under which learning in the Multi-Agent Systems framework proves to be more effective than, and inherently different from, simply having agents learn as individuals.
References
1. M. Tan. Multi-Agent Reinforcement Learning: Independent vs. Cooperative Agents. Proc. of the Tenth International Conference on Machine Learning, Amherst, MA, 330–337, 1993
2. R. S. Sutton and A. G. Barto. A Temporal-Difference Model of Classical Conditioning. Tech. Report GTE Labs. TR87-509.2, 1987
3. S. D. Whitehead. A Complexity Analysis of Cooperative Mechanisms in Reinforcement Learning. Proc. of the 9th National Conference on Artificial Intelligence (AAAI-91), 607–613, 1991
4. L.-J. Lin. Self-improving reactive agents based on reinforcement learning, planning and teaching. Machine Learning 8:293–321, Kluwer Academic Publishers, 1992
5. C. J. C. H. Watkins, P. Dayan. Technical note: Q-learning. Machine Learning 8, 3:279–292, Kluwer Academic Publishers, 1992
6. S. D. Whitehead, D. H. Ballard. A study of cooperative mechanisms for faster reinforcement learning. TR 365, Computer Science Department, University of Rochester, 1991
7. M. J. Matarić. Using Communication to Reduce Locality in Distributed Multi-agent Learning. Technical Report CS-96-190, Brandeis University, Dept. of Computer Science, 1996
8. C. Baroglio. Teaching by shaping. Proc. of the ICML-95 Workshop on Learning by Induction vs. Learning by Demonstration, Tahoe City, CA, USA, 1995
9. J. A. Clouse. Learning from an automated training agent. In: Gerhard Weiß and Sandip Sen, editors, Adaptation and Learning in Multiagent Systems, Springer Verlag, Berlin, 1996
10. R. I. Brafman, M. Tennenholtz. On partially controlled multi-agent systems. Journal of Artificial Intelligence Research, 4:477–507, 1996
11. B. Price, C. Boutilier. Implicit imitation in Multiagent Reinforcement Learning. Proc. of the Sixteenth International Conference on Machine Learning, 325–334, Bled, SI, 1999
12. H. R. Berenji, D. Vengerov. Advantages of Cooperation Between Reinforcement Learning Agents in Difficult Stochastic Problems. Proc. of the Ninth IEEE International Conference on Fuzzy Systems (FUZZ-IEEE '00), 2000
13. C. Claus, C. Boutilier. The Dynamics of Reinforcement Learning in Cooperative Multiagent Systems. Proc. of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), 746–752, July 1998
14. S. Kapetanakis, D. Kudenko. Reinforcement learning of coordination in cooperative multi-agent systems. Proc. of the Eighteenth National Conference on Artificial Intelligence (AAAI-02), 326–331, American Association for Artificial Intelligence, 2002
15. R. Maclin, J. Shavlik. Creating advice-taking reinforcement learners. Machine Learning 22:251–281, 1997
16. M. J. Matarić. Learning in behaviour-based multi-robot systems: policies, models and other agents. Journal of Cognitive Systems Research 2:81–93, Elsevier, 2001
17. O. C. Jenkins, M. J. Matarić, S. Weber. Primitive-based movement classification for humanoid imitation. Proc. of the First International Conference on Humanoid Robotics (IEEE-RAS), Cambridge, MA, MIT, 2000
18. M. Nicolescu, M. J. Matarić. Learning and interacting in human-robot domains. K. Dautenhahn (Ed.), IEEE Transactions on Systems, Man, and Cybernetics, special issue on Socially Intelligent Agents – The Human in the Loop, 2001
19. M. J. Matarić. Sensory-motor primitives as a basis for imitation: linking perception to action and biology to robotics. C. Nehaniv & K. Dautenhahn (Eds.), Imitation in Animals and Artifacts, MIT Press, 2001
20. F. J. Provost, D. N. Hennessy. Scaling Up: Distributed Machine Learning with Cooperation. Proc. of the Thirteenth National Conference on Artificial Intelligence, 1996
21. J. H. Holland. Adaptation in Natural and Artificial Systems. University of Michigan Press, 1975
22. J. R. Koza. Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA, 1992
23. D. E. Rumelhart, G. E. Hinton, R. J. Williams. Learning internal representations by error propagation. Parallel Distributed Processing: Exploration in the Microstructure of Cognition, vol. 1: Foundations, 318–362, Cambridge, MA: MIT Press, 1986
24. R. Salustowicz. A Genetic Algorithm for the Topological Optimization of Neural Networks. PhD Thesis, Tech. Univ. Berlin, 1995
25. X. Yao. Evolving artificial neural networks. Proceedings of the IEEE, 87(9):1423–1447, 1999
26. A. P. Topchy, O. A. Lebedko, V. V. Miagkikh. Fast learning in multilayered neural networks by means of hybrid evolutionary and gradient algorithms. Proc. of the International Conference on Evolutionary Computation and Its Applications, Moscow, 1996
27. K. W. C. Ku, M. W. Mak. Exploring the effects of Lamarckian and Baldwinian learning in evolving recurrent neural networks. Proc. of the IEEE International Conference on Evolutionary Computation, 617–621, 1997
28. W. Erhard, T. Fink, M. M. Gutzmann, C. Rahn, A. Doering, M. Galicki. The Improvement and Comparison of Different Algorithms for Optimizing Neural Networks on the MasPar MP-2. Neural Computation (NC)'98, ICSC Academic Press, Ed. M. Heiss, 617–623, 1998
29. P. A. Castillo, J. González, J. J. Merelo, V. Rivas, G. Romero, A. Prieto. SA-Prop: Optimization of Multilayer Perceptron Parameters Using Simulated Annealing. Proc. of IWANN99, 1999
30. T. Hogg, C. P. Williams. Solving the Really Hard Problems with Cooperative Search. Proc. of the Eleventh National Conference on Artificial Intelligence (AAAI-93), 231–236, 1993
31. C. Goldman, J. Rosenschein. Mutually supervised learning in multi-agent systems. Proc. of the IJCAI-95 Workshop on Adaptation and Learning in Multi-Agent Systems, Montreal, CA, August 1995
32. T. Thorpe. Vehicle Traffic Light Control Using SARSA. Masters Thesis, Department of Computer Science, Colorado State University, 1997
33. E. Brockfeld, R. Barlovic, A. Schadschneider, M. Schreckenberg. Optimizing Traffic Lights in a Cellular Automaton Model for City Traffic. Physical Review E 64, 2001
34. L. Nunes, E. Oliveira. On Learning by Exchanging Advice. Symposium on Adaptive Agents and Multi-Agent Systems (AISB/AAMAS-II), Imperial College, London, April 2002
35. S. Kirkpatrick, C. D. Gelatt, M. P. Vecchi. Optimization by Simulated Annealing. Science, Vol. 220:671–680, May 1983
36. M. Glickman, K. Sycara. Evolution of Goal-Directed Behavior Using Limited Information in a Complex Environment. Proc. of the Genetic and Evolutionary Computation Conference (GECCO-99), July 1999
37. R. S. Sutton. Integrated architectures for learning, planning and reacting based on approximating dynamic programming. Proc. of the Seventh International Conference on Machine Learning, 216–224, Morgan Kaufmann, 1990
39. K. Nagel, M. Schreckenberg. A Cellular Automaton Model for Freeway Traffic. J. Physique I, 2(12):2221–2229, 1992
40. S. Sen, A. Biswas, S. Debnath. Believing others: Pros and Cons. Proc. of the Fourth International Conference on Multiagent Systems, 279–286, 2000
Environmental Risk, Cooperation, and Communication Complexity
Peter Andras 1, Gilbert Roberts 2, and John Lazarus 2
1 School of Computing Science, 2 School of Biology
University of Newcastle, Newcastle upon Tyne, NE1 7RU, UK
Abstract. The evolution of cooperation and communication in communities of individuals is a puzzling problem for a wide range of scientific disciplines, ranging from evolutionary theory to the theory and application of multi-agent systems. A key issue is to understand the factors that affect collaboration and communication evolution. To address this problem, here we choose the environmental risk as a compact descriptor of the environment in a model world of simple agents. We analyse the evolution of cooperation and communication as a function of the environmental risk. Our findings show that collaboration is more likely to rise to high levels within the agent society in a world characterised by high risk than in one characterised by low risk. With respect to the evolution of communication, we found that communities of agents with high levels of collaboration are more likely to use less complex communication than those which show lower levels of collaboration. Our results have important implications for understanding the evolution of both cooperation and communication, and the interrelationships between them.
1 Introduction
Understanding how cooperation between unrelated individuals can arise in animal and human societies has puzzled evolutionary theorists. Early solutions to the problem were found in terms of reciprocal altruism; the mutual exchange of benefits between pairs of individuals ([2],[3],[13],[16]). More recently, 'indirect reciprocity', in which individuals who are seen to be more generous receive more help from others, has been proposed as an additional route to cooperation ([1],[8],[10]). In these models individuals have information about the past behaviour of others on which to base decisions but there is no communication of intentions: individuals simply act; cooperating, defecting or declining to interact. (This is in contrast to the evolutionary modelling of competitive behaviour in which signalling has played a central role: e.g. [6],[9].) Although there seems to be little theoretical work on intentional signalling in the context of cooperation, arbitrary signals correlating with altruism (the 'green beard effect', [5]), tags indicating individual identity ([11]) and signalling of partner quality ([7]) have
been considered. While existing evolutionary models of cooperation are important in examining the minimal conditions for the evolution of cooperation they are also impoverished - at least for the human case - in excluding the possibility that, intentionally or unintentionally, individuals may communicate their intention to cooperate before interacting. In the development and maintenance of human relationships cooperation is accompanied by signals of short-term intentions and longer term commitment. Honest communication, and consequent trust, are of the greatest importance for the development of a stable collaborative relationship; deceit and mistrust are inimical to it ([4]). In this paper we develop an agent world model to examine the evolution of cooperation when individuals communicate their intentions. This communication helps individuals to decide whether to enter into cooperation with another: it allows partner choice. Many of the earlier models provide no such choice, so that cheats can be avoided only if individuals have information on their past behaviour. When the behaviour of others is to some extent predictable, however, individuals can derive the intentions of others and cooperators can choose to interact with other cooperators while cheats can be ostracized ([12]). We ask how cooperation and communication respond to variation in the risk or complexity of the environment. Risk and complexity are important general properties of the environment that impact on an individual’s success in ways that can be influenced by cooperating with others. For example, resources may be predictable (low risk) or unpredictable (high risk), and threats from predators or competitors may vary in a similar way. Resource acquisition, the avoidance of predators and success in competition can all be enhanced by collaboration with others. The second focus of the paper is how the complexity of communication itself evolves in this context and whether it differs between cooperators, cheaters and those who decline to interact at all. This interest in the communication of intentions follows other work in artificial societies (e.g. [14] ), which investigate how collecting information about the intentions of other agents can enhance the development of collaboration. Our agents communicate by sequences of signals, each of which informs the potential partner about the cooperative intentions of the agent. The reliability of the signals differs between agents, who use the information from communication in present and past interactions to make decisions about whether to share resources with a potential partner. The interaction between agents occurs within a risky environment where risk refers to the variability of the gains that result from cooperative behavior. Our results will be applicable to agent societies, animal societies, and human social systems. The rest of the paper is structured as follows. First, we describe the agent world model. Next, we present the simulation results. Finally, we discuss the implications of our results.
2 The Agent World
In this section we describe the world of our agents. We start with the description of the environment, followed by the description of the agents, communication processes, resource management, and the offspring generation rules, and we close with the description of the evolution of the agent society. We also present some arguments to support our choices with respect to the implemented principles, rules and methods.
2.1 The Environment
The environment of our agent world is characterized by a given risk. The environmental risk represents in a compact way the complexity of the surrounding environment. It is implemented as the variance of the agents' resource regeneration process.
2.2 The Agents
Our agents dispose of some generic resources that they use to maintain themselves and to reproduce. Each agent speaks the same communication language, consisting of the symbols '0', 's', 'i', 'y', 'n', 'h' and 't'. The meanings of the communication symbols are as follows: '0' - no intention of communication, 's' - start of communication, 'i' - maintaining the communication, 'y' - indication of the willingness to engage in resource sharing, 'n' - indication of no further interest in communication, 'h' - effective sharing of the resources, 't' - not sharing the resources after an indication of willingness to engage in sharing. The last two symbols, 'h' and 't', actually denote the resource-sharing and no-resource-sharing actions. The first four symbols are ranked according to their positive contribution towards engagement in sharing (the least positive is '0' and the most positive is 'y'). The agents generate communication units (i.e., one of the above symbols) when they engage in communication with another agent. Each agent has its own realization of the language. This language is represented in the form of a two-input probabilistic automaton (i.e., it is the equivalent of a probabilistic push-down automaton). The language units are production rules of the form

L : Ucurrent, U'current →p1 Unew,1
    Ucurrent, U'current →p2 Unew,2
    ...
    Ucurrent, U'current →pk Unew,k

where Ucurrent is the last communication unit produced by the agent, U'current is the last communication unit produced by the partner of the agent, Unew,1, Unew,2, ..., Unew,k are the new communication units that can be produced by the agent,
and p1, p2, ..., pk are the probabilities of production of these communication units, with p1 + p2 + ... + pk = 1 (an example of such a rule is L : i, i →0.4 y; i, i →0.5 i; i, i →0.1 n, which means that after producing the symbol 'i', and receiving a symbol 'i' from the communication partner, the agent will produce the symbol 'y' with probability 0.4, the symbol 'i' with probability 0.5, and the symbol 'n' with probability 0.1). The language units obey intention consistency rules, i.e., if U0, U1, U2, and U3 are communication units, and U2 is equally or more positive than U1, and U3 is equally or more positive than U2, and L1 is a language unit that produces U3 after U1 and receiving U0 (i.e., the rule says that if U1 was the last signal produced by the agent, and U0 is the most recent signal received from the other agent, then the new signal produced by the agent is U3), and L2 is a language unit that produces U3 after U2 and receiving U0, then the probability of producing U3 using L2 is equal to or higher than the probability of producing U3 using L1. Similarly, if L1 is a language unit that produces U3 after U0 and receiving U1, and L2 is a language unit that produces U3 after U0 and receiving U2, then the probability of producing U3 using L2 is equal to or higher than the probability of producing U3 using L1. In other words, the intention consistency rules mean that more positive inputs are more likely to lead to positive outputs than are less positive inputs. This choice of the intention consistency rules is in agreement with human and animal behavior, where the expression of friendly signals is more likely to be followed by further friendly signals from the same individual than non-friendly signals. Each agent has a characteristic intention, which indicates the extent to which it is willing to share resources with other agents. This sharing intention determines the probability of the y, y → h production rule. The agents are equipped with a memory. The memory of the agents can store the experiences of collaboration with the last M different partners (M=10 in our implementation). The memory of the agents also fades with time, and if they do not meet an old partner for a long time they forget their memories about this partner. For each memorized partner the agent keeps the score of the successful and unsuccessful meetings (i.e., successful means a meeting that led to getting shared resources from the partner). The agents are located on a two-dimensional plane, and they may change their location. The location of an agent determines the neighbourhood of the agent, which consists of the N (N=10) closest agents. The agents live for T (T=60) time units. In each time unit they try to find a collaboration partner in their neighbourhood. At the end of their lifetime the agents produce their offspring.
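A minimal sketch of one agent's realization of this language, and of sampling a reply from it, is given below. The rule table and its probabilities are invented for illustration only (each agent inherits and mutates its own table); the probability attached to the (y, y) → h rule plays the role of the sharing intention described above.

```python
import random

# One agent's realization of the common language: for each pair
# (last symbol produced by the agent, last symbol received from the partner)
# a distribution over the next symbol to produce.
LANGUAGE = {
    ('s', 's'): {'i': 0.7, 'y': 0.2, 'n': 0.1},
    ('i', 'i'): {'y': 0.4, 'i': 0.5, 'n': 0.1},   # the example rule from the text
    ('i', 'y'): {'y': 0.6, 'i': 0.3, 'n': 0.1},
    ('y', 'y'): {'h': 0.55, 't': 0.45},           # share / withhold (sharing intention)
}

def produce_symbol(language, own_last, partner_last, rng=random):
    """Sample the agent's next communication unit from its production rule."""
    rule = language.get((own_last, partner_last))
    if rule is None:
        return 'n'  # no applicable rule in this sketch: break off the communication
    symbols, probs = zip(*rule.items())
    return rng.choices(symbols, weights=probs, k=1)[0]

# e.g. produce_symbol(LANGUAGE, 'i', 'i') yields 'y' with probability 0.4
```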
2.3 Communication Processes
After selecting a collaborating partner the agents may engage in a communication process. The communication process starts properly after both agents communicated the ’s’ symbol. We set a limit (L1) for the preliminary communication. If the two agents do not reach the proper start of the communication in
a communication of length L1 we consider that they stop their communication at this moment. During the communication process the agents use their own realization of the common language to produce communication units. The communication process ends either with the communication of an 'n' symbol (i.e., signalling no further interest), or with the communication of the 'y' symbol by both partners. After this, each agent decides whether or not to share its resources with the other agent by producing the action symbol 'h' or 't'. We impose a communication length limit (L2) on the proper collaboration-oriented communication. If the agents do not reach the stage of communicating the 'y' symbols in L2 communication steps, we consider that they stop their communication. At the start of each communication process, the agents update their language unit probabilities according to their memories of the agent they are currently interacting with (note that if there has been no previous interaction with this agent there is no update). The updated version of their language applies only to the present communication process. In the case of more positive experiences (those that led to sharing) the probabilities leading to more positive symbols are increased, while in the case of negative experiences these are decreased. The probabilities for each language unit are normalized after effecting the above changes (i.e., the probabilities of producing the allowed new symbols always sum to one for each language unit). During each communication process, as an agent produces equally or more positive symbols its willingness to share increases. We note that although this increase happens in all agents, those who have a very low intention to collaborate will increase an originally low probability, which means that they will not necessarily share at the end of the communication process. We adopted the above-formulated principle of increasing willingness for collaboration in conformity with human and animal behavior, where a sequence of expressions of friendly signals increases the likelihood of a friendly ending of the interaction, even if the original intentions were less friendly.
2.4 Resource Management
The agents dispose of their own resources, which they use to maintain themselves and reproduce. In each turn the agents use their available resources to produce new resources. If they manage to find a partner who is willing to share its resources, they can use the combined resources to generate the new resources for the next turn. The mean resource generating function is a squashing function of the form

Rnew = a · 1 / (1 + e^(−R+R0)),
where R is the amount of available resources, and R0 and a are parameters. Operating at the convex half of the squashing function (i.e., R < R0 ) means that using more added resources is more advantageous than using the resources separately.
Resource generation happens in a probabilistic manner. The environmental risk specifies the variance of the resource regeneration process. The amount of new resources is found by taking a sample from a uniform distribution that has the calculated mean and the variance specified by the environmental risk. We use the notation N(R) for the amount of new resources generated by using R amount of available resources. The variance of the resource regeneration increases with the time spent in negotiation about resource sharing. This risk-increase principle is in agreement with how environmental risk changes in the real world. To exemplify it we consider an example from business. If two companies start negotiations about a joint business, lengthy negotiations may proceed while a competitor enters the market, and the final gain of the two collaborating companies will be reduced. At the same time, lengthy negotiations may lead to a well-designed contract that makes it possible to avoid future impasses, making the collaboration more profitable. If the negotiations are short, the deal is made quickly, and the companies may start gaining some new market share. At the same time they may run into some unregulated disputes that may slow down their cooperation and the increase of their market share. As we can see from this example, if the communication process is short the variance of the expected benefits is likely to be smaller than in the case when the communication process becomes lengthy. When two agents meet, having resources R1 and R2, and they both decide to share their resources, they may receive extra new resources. The extra resources for both partners are calculated as half of the difference N(R1 + R2) − (N(R1) + N(R2)). Such agents are called collaborators. If an agent is engaged in a communication process, convinces its partner to share, but then withholds its own resources from sharing, it is called a cheater. The gain of a cheater is the whole amount of the difference N(R1 + R2) − (N(R1) + N(R2)). In such a case the one that is cheated generates only N(R2) new resources for itself, i.e., it does not benefit from the sharing. If two agents select each other as communicating partners, but they do not manage to decide about the sharing of their resources (i.e., their communication ends with an 'n' symbol) we call them non-collaborators. If an agent does not reproduce enough resources to maintain itself (i.e., the maintenance costs are higher than the amount of its own resources) the agent reaches the zero resource level and dies.
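The following sketch ties together the squashing function, the risk-dependent noise and the collaborator/cheater payoffs of this subsection. It is only an approximation under our own assumptions: the parameter values are arbitrary, and scaling the uniform noise by the risk level is one possible reading of "variance specified by the environmental risk".

```python
import math
import random

def mean_regeneration(R, a=10.0, R0=8.0):
    """Mean new resources for input R: Rnew = a / (1 + exp(-(R - R0)))."""
    return a / (1.0 + math.exp(-(R - R0)))

def regenerate(R, risk, rng=random):
    """Sample N(R): uniform around the mean, with a spread controlled by the risk."""
    mean = mean_regeneration(R)
    spread = risk * mean  # illustrative choice: risk scales the variability
    return max(0.0, rng.uniform(mean - spread, mean + spread))

def sharing_gain(R1, R2, risk, cheat=False, rng=random):
    """Extra resources obtained by pooling R1 and R2.

    Collaborators each receive half of N(R1+R2) - (N(R1) + N(R2));
    a cheater keeps the whole difference, and the cheated agent gets nothing extra.
    """
    diff = regenerate(R1 + R2, risk, rng) - (regenerate(R1, risk, rng) +
                                             regenerate(R2, risk, rng))
    return diff if cheat else diff / 2.0
```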
2.5 Offspring Generation
When the agents reach the end of their lifetime they generate their offspring. The number of the offspring depends on the available resources of the agent.
If the agent has R resources, and the mean amount of resources in the agent society at that moment is Rm, and the standard deviation of the resources is Rs, then the number of offspring of the agent is calculated as

n = α · (R − (Rm − β · Rs)) / Rs + n0,
where α, β, n0 are parameters. If n is negative or R = 0 we consider that the agent produces no offspring. If n > nmax, where nmax is the allowed upper limit of offspring, we cut n back to nmax. In order to avoid strong generational effects the newly generated offspring have random ages between 1 and A0. The offspring of an agent inherit from their parent its language and collaboration intention, with small random modifications. They also inherit the resources of their parent, divided equally between the offspring. Note that in our case the language and collaboration intention are part of the specification of an agent; in other words, they are the equivalent of the genetic material of the agent.
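A small sketch of this reproduction rule, with the clamping behaviour described in the text, follows; the example parameter values and the truncation to an integer are our own choices.

```python
def offspring_count(R, R_mean, R_std, alpha=1.0, beta=1.0, n0=1, n_max=5):
    """n = alpha * (R - (R_mean - beta * R_std)) / R_std + n0, clipped to [0, n_max]."""
    if R <= 0 or R_std <= 0:
        return 0
    n = alpha * (R - (R_mean - beta * R_std)) / R_std + n0
    if n < 0:
        return 0
    return min(int(n), n_max)
```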
2.6 The Evolution of the Agent Society
At the beginning we start with randomly initialised agents, i.e., the transition probabilities of their language units, their collaboration intentions, initial resources and initial positions are set randomly. The agent society evolves through the interaction and reproduction of the agents. The agents search for collaborating partners. They try to share their resources, or to cheat the collaborating partner, or they may not manage to make the decision about sharing. In each turn each agent may choose one partner from its neighbours. After each turn the agents make a random move, changing their position, and possibly finding a new neighbourhood. The agents regenerate their resources alone or in collaboration with another agent in each turn, and they pay a part of their resources to maintain themselves. At the end of their lifetime the agents generate their offspring if they have enough resources.
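The spatial side of this loop, i.e. picking a partner from the N closest agents and then moving randomly, can be sketched as follows; the step size and the uniform displacement are our assumptions, not taken from the model specification.

```python
import math
import random

def neighbours(agent_idx, positions, N=10):
    """Indices of the N agents closest to agent_idx on the two-dimensional plane."""
    x0, y0 = positions[agent_idx]
    others = [i for i in range(len(positions)) if i != agent_idx]
    others.sort(key=lambda i: math.hypot(positions[i][0] - x0,
                                         positions[i][1] - y0))
    return others[:N]

def random_move(position, step=1.0, rng=random):
    """Random displacement applied after each turn, possibly changing the neighbourhood."""
    x, y = position
    return (x + rng.uniform(-step, step), y + rng.uniform(-step, step))

# In each turn agent i picks one partner from neighbours(i, positions) and,
# after interacting, moves: positions[i] = random_move(positions[i]).
```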
3 Simulation Results
This section presents our simulation results. The objective of these simulations was twofold: first, to determine how environmental risk affects both the level of collaboration and the complexity of communication; second, to examine how the various strategists (collaborators, cheaters, non-collaborators) differ in the complexity of their communications in an evolving society. The results are presented in this order, after examining some general effects of risk level on the agent society. We selected five levels of risk in the range 0.1–0.9 (the risk levels were 0.1, 0.2, 0.5, 0.7, and 0.9). We measured communication complexity by measuring the average length of the communication processes.
Fig. 1. The evolution of the number of agents in agent societies living in environments with different risk levels.
Fig. 2. The evolution of the average amount of resources of collaborators in agent societies living in environments with different risk levels.
For each risk level we ran 20 simulations to obtain valid estimates of the average values and variances of the measured variables. The number of agents in each simulation was the same at the beginning (1500). We ran each simulation for 400 time units, or until the agent population died out or reached the maximum allowed number of individuals (5000).
3.1 General Effects of the Environmental Risk
To see the general effects of environmental risk on the agent society we looked at how the number of agents and the resource level varied with time. We considered the average amount of resources separately for collaborators, cheaters and non-collaborators. Figure 1 shows the average number of agents in the agent societies for the five risk levels. Figures 2–4 show the change over time of the average amount of resources of collaborators, cheaters and non-collaborators. The initial dropping segment in the graphs represents the period when the randomly initialised population selects those who are able to survive. This segment corresponds to one generation (i.e., around 60 time units). Following the initial drop the societies start to grow in number and in average amount of resources in all cases. The graphs show that this growth happens much faster in agent societies living in low risk environments than in those which live in high risk environments. These results confirm the standard expectation that the average level of populations and their available resources is lower in high risk environments than in low risk environments.
3.2 Environmental Risk and the Level of Collaboration
To analyse the effect of the environmental risk on the level of collaboration we looked at the percentage of collaborators, cheaters and non-collaborators within the society (note that the percentage of those who were cheated is the same as the percentage of cheaters). The change of these percentages over time is shown in Figures 5–7. The figures show that the level of collaboration increases in all conditions. After the first generation (i.e., around 60 time units) the level of collaborators increases steadily until it stabilizes (above 50%). In the case of cheaters and non-collaborators there is a corresponding decline to stabilization at below 18% for cheaters and below 12% for non-collaborators. The figures show that the stable level of collaborators is lower in low risk conditions than in high risk conditions, and that the levels of cheaters and non-collaborators are higher in low risk conditions than in high risk conditions. This indicates that agent societies living in a high risk environment are more likely to achieve a high level of collaboration than those which live in low risk environments. We also note that in high risk environments it is more likely that the population dies out than in low risk environments.
3.3 The Complexity of Communications
We analysed the complexity of communications by measuring the average length of communication processes within the whole society. Figure 8 shows the evolution over time of the communication complexity in the whole society.
Fig. 3. The evolution of the average amount of resources of cheaters in agent societies living in environments with different risk levels.
Fig. 4. The evolution of the average amount of resources of non-collaborators in agent societies living in environments with different risk levels.
The figure shows that there is no clear ordering of the stable levels of communication complexity as a function of the level of environmental risk. The same is true when each of the three agent strategies is examined independently.
Fig. 5. The evolution of the average percentage of collaborators within the agent societies living in environments with different risk levels.
Fig. 6. The evolution of the average percentage of cheaters within the agent societies living in environments with different risk levels.
3.4 Collaboration and Communication Complexity
First we analysed the correlation between the levels of collaborators, cheaters and non-collaborators and the average complexity of communications within the society. These correlations are shown in Tables 1–3.
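The entries of Tables 1–3 can be reproduced from the logged time series with a computation of the kind sketched below (Pearson correlation over per-time-unit averages; the variable names are ours and stand for the logged simulation data).

```python
import numpy as np

def strategy_vs_complexity_correlation(strategy_share, avg_comm_length):
    """Pearson correlation between the percentage of agents following a strategy
    (collaborators, cheaters or non-collaborators) and the average communication
    length, both recorded once per time unit over a simulation run."""
    share = np.asarray(strategy_share, dtype=float)
    length = np.asarray(avg_comm_length, dtype=float)
    return float(np.corrcoef(share, length)[0, 1])
```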
Fig. 7. The evolution of the average percentage of non-collaborators within the agent societies living in environments with different risk levels.
Fig. 8. The evolution of the average communication complexity within the whole agent societies living in environments with different risk levels.
These results indicate that the proportion of collaborators is strongly negatively correlated with the complexity of communications at all risk levels. In the case of cheaters we see that their proportion is moderately positively correlated with the average complexity of communications at low risk levels, and
Fig. 9. The evolution of the average communication complexity in the groups of collaborators, cheaters and non-collaborators living an environment characterized by risk level r = 0.2.
Fig. 10. The evolution of the average communication complexity in the groups of collaborators, cheaters and non-collaborators living an environment characterized by risk level r = 0.9.
that the correlation gets much stronger for high risk levels. In the case of non-collaborators, their percentage is strongly positively correlated with the average communication complexity at all risk levels. These results together suggest that
those who collaborate tend to communicate in shorter sequences, while those who cheat or do not collaborate are likely to use longer communication sequences. To analyse this suggestion directly, we examined how communication complexity evolves in the three different groups at given risk levels. Figures 9 and 10 show two examples for risk levels 0.2 and 0.9. These figures confirm the suggestion that those who collaborate are likely to use less complex communication between themselves, and those who do not collaborate use more lengthy communication processes.

Table 1. The correlation between the average percentage of collaborators and the average complexity of communications within societies living in environments with various risk levels.

Risk         0.1    0.2    0.5    0.7    0.9
Correlation  -0.94  -0.98  -0.99  -0.99  -0.99

Table 2. The correlation between the average percentage of cheaters and the average complexity of communications within societies living in environments with various risk levels.

Risk         0.1    0.2    0.5    0.7    0.9
Correlation  0.15   0.74   0.93   0.95   0.97

Table 3. The correlation between the average percentage of non-collaborators and the average complexity of communications within societies living in environments with various risk levels.

Risk         0.1    0.2    0.5    0.7    0.9
Correlation  0.97   0.98   0.99   0.99   0.99
4 Discussion
First, we interpret our results from the more general point of view of evolution of collaboration and communication in societies of individuals. Second, we discuss the implications of the presented results for agent worlds and multi-agent systems.
4.1 Evolution of Collaboration and Communication in Societies of Individuals
Under the assumptions of our model society, cooperation can thrive and its frequency increases with environmental risk, while both cheating and non-cooperation decline. The increase in cooperation with environmental risk probably
comes about because cooperation can be crucial for survival when resources are very low and/or provides particularly large rewards (compared to cheating) when resources are very high. This is because while cheating is profitable in the short term (for a few interactions), in the longer term cheaters fail to find other agents who will interact with them. The fact that population size and average resource level decline as risk increases supports the conclusion that cooperation becomes increasingly advantageous in difficult or harsh environments, as measured by risk. The prediction that cooperation is more likely in risky environments can be tested in animal societies, in human experimental groups and in the real world of human social and economic behaviour, at the level of both individuals and of groups such as firms and nations. For example, this prediction might help to explain the phenomenon of increased feelings of community during wartime. It also suggests that cooperation might be enhanced by increasing the risk or complexity of the problem at hand. Although there may also be costs associated with increasing risk, if perceived risk increased while objective risk remained unchanged then cooperation might be enhanced without cost. However, as perceived risk increases the population of those willing to participate is likely to decline. A possible application area here is communication on the Internet, although there would be ethical issues involved in deceiving users about the risk or complexity of the Internet environment. In contrast to its effect on cooperation, environmental risk in the model had no clear effect on the length of communication. This may have been because the model language was too simple, varying between only 4 and 6 elements at the outset. For a richer language we predict that communications will be shorter as risk increases since: (1) there is a positive correlation between collaboration level and risk, and negative correlations between cheating and non-collaboration levels and risk (Figures 5-7), and (2) there is a negative correlation between collaboration level and language complexity, and positive correlations between cheating and non-collaboration levels and language complexity (Tables 1-3). Those who collaborated had shorter communication strings than those who cheated or failed to collaborate. This is because if a collaborator meets an agent for whom it has a memory biased towards collaboration then it has higher probabilities of production for positive communication symbols and therefore moves more quickly (i.e., with fewer communication steps) into an interaction that is likely to be collaborative. Thus collaborating agents, by positive feedback, build an increasingly cooperative relationship with each other, in a manner analogous to that described by Roberts & Sherratt [13]. The direct complement of this process is that meeting a past cheater for which an agent has a memory increases that agent’s likelihood of cheating in the present interaction after a longer series of communication symbols (see sub-section 2.3 on the communication process). Collaboration thus brings with it the bonus of a saving on communication effort. Such effort may be trivial, but it may also be considerable, as in some forms of human negotiation. The prediction that collaboration simplifies the communication process (compared to both cheating and avoiding interaction)
can be tested in the scenarios already described for examining the relationship between cooperation and risk.
4.2 Collaboration and Communication in Agent Worlds
We see two directions in which our work has implications in the context of agent worlds and multi-agent systems. First, our results indicate that an appropriate setting of the environmental risk factors of an agent world can determine to a significant extent the level of collaboration within the agent world. This may have applications in the design of multi-agent systems where the developers wish to achieve some desired mix of collaborative/non-collaborative behavioural patterns that fits the objective of the system. It is important to note that purely collaborative behavior in an open agent world may pose significant risks to the proper working of that world, as malignant agents may appear and may abuse the default benevolent behavior of other agents. This means that some level of non-cooperative behavior should be allowed in an open agent system ([15]). Second, our results suggest that it is possible to predict the expected level of collaboration and communication complexity in an agent world, if enough information is available about the environmental risk factors characterizing this world. Such predictions can form the basis for checks of the validity of risk-factor assumptions, and for corrective actions aimed at keeping the agent world within the desired range of macro parameters.
References

1. R.D. Alexander. The Biology of Moral Systems. Aldine de Gruyter, New York, 1987.
2. R. Axelrod. The Evolution of Cooperation. Basic Books, New York, 1984.
3. R. Axelrod and W.D. Hamilton. The evolution of cooperation. Science, 211:1390–1396, 1981.
4. S.D. Boon and J.G. Holmes. The dynamics of interpersonal trust: resolving uncertainty in the face of risk. In: R.A. Hinde and J. Groebel (eds.) Cooperation and Prosocial Behaviour, pp. 190–211. Cambridge University Press, Cambridge, 1991.
5. R. Dawkins. The Selfish Gene. Oxford University Press, Oxford, 1976.
6. M. Enquist. Communication during aggressive interactions with particular reference to variation in choice of behaviour. Animal Behaviour, 33:1152–1161, 1985.
7. O. Leimar. Reciprocity and communication of partner quality. Proceedings of the Royal Society of London Series B: Biological Sciences, 264:1209–1215, 1997.
8. O. Leimar and P. Hammerstein. Evolution of cooperation through indirect reciprocity. Proceedings of the Royal Society of London Series B: Biological Sciences, 268:745–753, 2001.
9. J. Maynard Smith. The theory of games and the evolution of animal conflicts. Journal of Theoretical Biology, 47:209–221, 1974.
10. M.A. Nowak and K. Sigmund. Evolution of indirect reciprocity by image scoring. Nature, 393:573–577, 1998.
11. R.L. Riolo, M.D. Cohen, and R. Axelrod. Evolution of cooperation without reciprocity. Nature, 414:441–443, 2001.
12. G. Roberts. Competitive altruism: From reciprocity to the handicap principle. Proceedings of the Royal Society Series B: Biological Sciences, 265:427–431, 1998.
13. G. Roberts and T.N. Sherratt. Development of cooperative relationships through increasing investment. Nature, 394:175–179, 1998.
14. M. Schillo, P. Funk and M. Rovatsos. Using trust for detecting deceitful agents in artificial societies. Applied Artificial Intelligence, 14:825–848, 2000.
15. T.N. Sherratt and G. Roberts. The role of phenotypic defectors in stabilizing reciprocal altruism. Behavioural Ecology, 12:313–317, 2001.
16. R.L. Trivers. The evolution of reciprocal altruism. Quarterly Review of Biology, 46:35–57, 1971.
Multiagent Learning for Open Systems: A Study in Opponent Classification

Michael Rovatsos, Gerhard Weiß, and Marco Wolf
Department of Informatics, Technical University of Munich, 85748 Garching bei München, Germany
{rovatsos,weissg,wolf}@in.tum.de
Abstract. Open systems are becoming increasingly important in a variety of distributed, networked computer applications. Their characteristics, such as agent diversity, heterogeneity and fluctuation, confront multiagent learning with new challenges. This paper presents the interaction learning meta-architecture InFFrA as one possible answer to these challenges, and introduces the opponent classification heuristic AdHoc as a concrete multiagent learning method that has been designed on the basis of InFFrA. Extensive experimental validation proves the adequacy of AdHoc in a scenario of iterated multiagent games and underlines the usefulness of schemas such as InFFrA specifically tailored for open multiagent learning environments. At the same time, limitations in the performance of AdHoc suggest further improvements to the methods used here. Also, the results obtained from this study allow more general conclusions regarding the problems of learning in open systems to be drawn.
1 Introduction

The advent of the Internet brought with it an increasing interest in open systems [8, 10]. Real-world applications such as Web Services, ubiquitous computing, web-based supply chain management, peer-to-peer systems, ad hoc networks for mobile computing etc. involve interaction between users (humans, mobile devices, companies) with different goals whose internal structure is opaque to others. Agent populations change dynamically over time. Drop-outs, malicious agents and agents who fail to complete tasks assigned to them may jeopardise the robustness of the overall system. Centralised authorities and "trusted third parties" are (if they exist at all) not always trustworthy themselves, since ultimately they are also serving someone's self-interest.

Obviously, these phenomena confront multiagent systems (MAS) [19] research with new challenges, because many assumptions regarding the knowledge agents have about each other are no longer realistic. While the MAS community as a whole has been concentrating on relatively closed systems for a long time, multiagent learning researchers have long dealt with the problems of openness. This is because machine learning [12] approaches in general look at the problem of acquiring useful knowledge from observation in complex domains in which this knowledge is not readily available. Thus, it is only natural to take a learning approach to build intelligent, flexible
and adaptive agents that can operate in open systems – in fact, the amount of potentially "missing" knowledge offers a seemingly endless source of new learning problems that need to be tackled. Mostly, the purpose of applying learning techniques in the construction of adaptive agents is to learn how to predict the behaviour of the system one way or the other. As far as multiagent learning is concerned, this means learning how to predict the behaviour of other agents, e.g. by learning their preferences [2], their strategies [3,6,17], the outcomes of joint actions or all of these [15], i.e. in some way to acquire a model of the opponent¹. Most of the time, these models are also used to derive an optimal strategy towards this opponent at the same time, such as in [5,3,6,15].

The majority of these multiagent learning approaches adopt a heavily cognition-biased view of learning, which aims at extracting as much information from observation as possible about an individual. However, in large-scale, open MAS, in which agents have only occasional encounters with peers they are acquainted with, learning models of individual peer agents may not be feasible. This is because the cost of acquiring and maintaining an adequate model of the other normally outweighs its potential benefits if the probability of interacting with that same agent in the future is not very high. This problem has led us to investigate a more social view of multiagent learning that is more adequate for open systems, which, at its core, focuses on the idea of learning the behaviour of agents in certain classes of interactions rather than the specific properties of particular agents. At the level of architectures for social learning of this kind, we have developed InFFrA [14], which provides a meta-model for developing appropriate learning algorithms. In this paper, we present a concrete implementation of this model for the problem of learning in multi-agent games called AdHoc (Adaptive Heuristic for Opponent Classification) and illustrate its usefulness by providing extensive experimental results. These confirm our initial intuition that the road to the development of new multiagent learning methods for open systems is long, but that our methodology is a first step in the right direction.

The remainder of this paper is structured as follows: we first introduce our general intuitions about learning in open systems (section 2). Subsequently, section 3 presents an abridged introduction to the InFFrA social learning architecture. Section 4 then introduces AdHoc, an application of InFFrA to learning in multi-player games, section 5 describes the iterated multiagent game scenario used for its evaluation, and section 6 presents and discusses extensive experimental results obtained from AdHoc simulations. Section 7 concludes.
2 Open Systems: A New Challenge for Multiagent Learning

One of the most prominent problems multiagent learning (MAL) research deals with is how to build an agent who can learn to perform optimally in a given environment. We will restrict the scope of our analysis to a sub-class of this problem, in which we are more specifically concerned with building an optimal "goal-rational agent" by assuming
¹ We will make frequent use of the terms opponent, adversary, peer, partner etc. They are all intended to mean "other agent" without any implications regarding the competitive or cooperative nature of the interactions.
that an agent has preferences regarding different states of affairs and that she deliberates in order to reach those states of the world that seem most desirable to her (usually, these preferences are expressed by a utility function or an explicit set of goals/tasks). The environment such an agent is situated in is assumed to be co-inhabited by other agents that the agent interacts with. We assume that, in general, agents' actions have effects on each other's goal attainment, so that agents have to coordinate their actions with each other to perform well. In the absence of omnipotence, it is useful and often necessary for agents to organise their experience in a cause-and-effect model of some sort in order to be able to adapt to the environment and to take rational action; only if the rules that govern the world are discovered can an agent actively pursue its goals through means-ends reasoning, because such reasoning requires making predictions about the effects of one's own actions. Such a model of the world should not only capture the properties of the physical environment, but also describe (and predict) the behaviour of other agents, as they also influence the outcome of certain activities. In our analysis, we concentrate on the latter problem and neglect all problems associated with learning to behave optimally in a (passive) environment. That is to say, our focus is on agents that model other agents in order to predict their future behaviour.

While the activity of modelling other agents forms part of virtually any "socially adaptive" agent design, open MAS have some special characteristics that add to the complexity of this modelling activity:
1. Behavioural diversity: In open systems, agents are usually granted (deliberately or inevitably) more degrees of freedom in their behaviour. For example, they might be untruthful, malicious or not behave rationally from the modelling agent's perspective. Generally speaking, an agent that is trying to model other agents can make fewer a priori assumptions about the internals or the behaviour of other agents (such as the benevolence and rationality assumptions in the above examples). In machine learning terms, this means that the hypothesis space [12] used when learning about others will be larger than it is in closed systems.
2. Agent heterogeneity: Open systems allow for a larger variety of possible agent designs than closed systems do. One problem this leads to is that an agent who is building models of other agents will need to maintain many different models. Worse still, it might be the case that the modelling agent has to apply different learning methods for different peers (for example, a logic-based adversary might be better modelled using inductive logic programming, while decision tree learning might be better when modelling a reactive opponent).
3. Agent fluctuation: If agents may freely enter and leave the system, it is not clear how the agent should assess the value of learning a model of them. This makes it very hard to estimate how much effort should be spent on learning models of particular agents (given that the agent has only limited resources, some of which it also needs for perception, planning and execution), considering that information about certain agents might become useless at any time. A side-effect of this is also that it becomes much harder to develop a reasonable "exploration vs. exploitation" strategy, not knowing which partners deserve to be "explored".
Taken together, these phenomena might be seen as different aspects of what can be called the contingent acquaintances problem (CAP) of modelling other agents: the problem that the behaviours of known peers are unrestricted, that there are many different agents the modelling agent is acquainted with, and that there is uncertainty about the value of information obtained from learning more about these acquaintances. Although opponent modelling (OM) has received considerable attention from the multiagent learning community in recent years, we feel that this problem has been largely ignored. Our own research has focused on a specific approach to the CAP, called interaction frame learning. Essentially, it is based on the idea of learning different categories of interactions instead of particular models of individual adversaries, while combining learned patterns of behaviour appropriately with information about some specific individual agent whenever such information is available.
3 InFFrA: A Meta-architecture for Social Learning
InFFrA (the Interaction Frames and Framing Architecture) is a sociologically informed framework for building social learning and reasoning architectures. It is based on the concepts of "interaction frames" and "framing", which originate in the work of Erving Goffman [9]. Essentially, interaction frames describe classes of interaction situations and provide guidance to the agent about how to behave in a particular social context. Framing, on the other hand, signifies the process of applying frames in interaction situations appropriately. As Goffman puts it, framing is the process of answering the question "what is going on here?" in a given interaction situation – it enables the agent to act in a competent, routine fashion.

In a MAL context, interaction frames can be seen as learning hypotheses that contain information about interaction situations. This information should suffice to structure interaction for the individual that employs the frames. Also, frames describe recurring categories of interaction rather than the special properties of individual agents, as most models learned by OM techniques do. Thus, learning interaction frames is suitable for coping with the class of learning problems described in the previous section. A brief overview of InFFrA will suffice for the purposes of this paper, and hence we will not go into its details here. More complete accounts can be found in [14] and [16].

In their computationally operationalised version, frames are data structures which contain information about
– the possible interaction trajectories (i.e. the courses the interaction may take in terms of sequences of actions/messages),
– roles and relationships between the parties involved in an interaction of this type,
– contexts within which the interaction may take place (states of affairs before, during, and after an interaction is carried out) and
– beliefs, i.e. epistemic states of the interacting parties.

Figure 1 shows a graphical representation of the interaction frame (henceforth "frame") data structure. As examples of how the four "slots" of information it provides might be realised, it contains graphical representations of groups (boxes) and relationships
(arrows) in the "roles and relationships" slot and a protocol-like model of concurrent agent actions in the "trajectories" slot. The "contexts" slot embeds the trajectory model in boxes that contain preconditions, postconditions and conditions that have to hold during execution of the frames. The "beliefs" slot contains a semantic network and a belief network as two possible representations of ontological and causal knowledge, where shaded boxes define which parts of the networks are known to which participant of the frame. A final important characteristic of frames is that certain attributes of the above must be assumed to be shared knowledge among interactants (so-called common attributes) for the frame to be carried out properly, while others may be private knowledge of the agent who "owns" the frame (private attributes). Private attributes are mainly used by agents to store their personal experience with a frame, e.g. utilities associated with previous frame enactments and instance values for the variables used in generic representations that describe past enactments ("histories"), inter-frame relationships ("links") etc.

Fig. 1. Interaction frame data structure.
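As a minimal sketch of the four slots and the common/private split described above, a frame record might look as follows; all field names and types are illustrative assumptions rather than InFFrA's actual representation.

```python
# Illustrative sketch of an interaction frame record with the four slots
# described above; field names and types are assumptions, not InFFrA's
# actual representation.
from dataclasses import dataclass, field
from typing import Any, Dict, List

@dataclass
class InteractionFrame:
    # Common attributes (assumed to be shared knowledge among interactants)
    trajectories: List[List[str]] = field(default_factory=list)   # admissible courses of interaction
    roles: Dict[str, List[str]] = field(default_factory=dict)     # role name -> agents that may fill it
    contexts: Dict[str, Any] = field(default_factory=dict)        # pre-/post-/sustainment conditions
    beliefs: Dict[str, Any] = field(default_factory=dict)         # epistemic states assumed of participants
    # Private attributes (kept only by the agent that "owns" the frame)
    histories: List[Dict[str, Any]] = field(default_factory=list) # instance values of past enactments
    utilities: List[float] = field(default_factory=list)          # rewards experienced with this frame
    links: Dict[str, "InteractionFrame"] = field(default_factory=dict)  # inter-frame relationships
```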
Apart from the interaction frame abstraction, InFFrA also offers a control flow model for social reasoning and social adaptation based on interaction frames, through which an InFFrA agent performs its framing. Before describing the steps that are performed in this reasoning cycle, we need to introduce the data structures on which they operate. They are
– the active frame (the unique frame currently activated),
– the perceived frame (a frame-wise interpretation of the currently observed state of affairs),
– the difference model (containing the differences between perceived frame and active frame),
– the trial frame (the current hypothesis when alternatives to the current frame are sought for),
– and the frame repository, a (suitably organised) frame database used as a hypothesis space.

The control flow model consists of the following steps:
1. Matching: Compare the current interaction situation (the perceived frame) with the frame that is currently being used (the active frame).
2. Assessment: Assess the usability of the current active frame.
3. Framing decision: If the current frame seems appropriate, retain the active frame and continue with 6. Else, proceed with 4. to find a more suitable frame.
4. Re-framing: Search the frame repository for more suitable frames. If candidates are found, "mock-activate" one of them as a trial frame and go back to 1; else, proceed with 5.
5. Adaptation: Iteratively modify frames in the frame repository and continue with 4. If no candidates for modification can be found, create a new frame on the grounds of the perceived frame.
6. Enactment: Influence action decisions by applying the active frame. Return to 1.
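The six steps can be read as a simple control loop. The sketch below is one hedged way of arranging them in code, under the simplifying assumption that a frame reduces to a trajectory pattern and that matching and assessment collapse into a single test; none of this is InFFrA's actual implementation.

```python
# Hedged sketch of the framing cycle; the frame representation and the
# matching criterion are simplified stand-ins, not InFFrA's modules.
from typing import List, Optional

class Frame:
    def __init__(self, pattern: str):
        self.pattern = pattern          # crude stand-in for a trajectory model

def matches(frame: Frame, perceived: str) -> bool:
    # Steps 1-2 collapsed: does the frame's trajectory pattern cover the situation?
    return perceived.startswith(frame.pattern)

def framing(active: Optional[Frame], repository: List[Frame], perceived: str) -> Frame:
    # 3. Framing decision: retain the active frame if it still fits
    if active is not None and matches(active, perceived):
        return active
    # 4. Re-framing: search the repository for a more suitable (trial) frame
    for trial in repository:
        if matches(trial, perceived):
            return trial
    # 5. Adaptation: nothing fits, so create a new frame from the perceived situation
    new_frame = Frame(perceived)
    repository.append(new_frame)
    return new_frame

# 6. Enactment would then bias action selection by the returned frame.
repo: List[Frame] = [Frame("greet"), Frame("negotiate")]
print(framing(None, repo, "negotiate-price").pattern)   # -> "negotiate"
```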
Figure 2 visualises the steps performed in this reasoning cycle. It introduces a functional module for each of the above steps and shows how these modules operate on the active/perceived/trial frame, difference model and frame repository data structures. Also, it links the sub-social reasoning (e.g. BDI [13]) level to the InFFrA layer by suggesting that the agent's goals and preferences are taken into account in the frame assessment phase.

Fig. 2. Detailed view of the framing-based agent architecture. The main line of reasoning between perception and action (shown as a shaded arrow) captures both the sub-processes involved and the temporal order in which they occur.

InFFrA offers a number of advantages regarding the design of social learning techniques for open learning environments:
1. The frame abstraction combines information about all the relevant aspects of a certain class of interaction. It contains information about who is interacting, what they are expected (not) to do when they interact, when they will apply this frame and what they need to know to carry out the frame appropriately. At the same time, it is left to the particular algorithm designed with InFFrA to specify which of these aspects it will focus on and how they will be modelled. For example, beliefs can be completely disregarded in scenarios in which behavioural descriptions are considered sufficient.
2. The framing procedure addresses all issues relevant to the design of interaction learning algorithms. It assists the designer in the analysis of
– what knowledge should be captured by the frames and which level of abstraction should be chosen for them,
– how perception is to be interpreted and matched against existing conceptions of frames,
– how to define an implementable criterion for retaining or rejecting a frame,
– which operators will be used for retrieving, updating, generating and modifying frames in the repository,
– how the concrete behaviour of the agent should be influenced by frame activation (in particular, how local goal-oriented decision-making should be combined with social considerations).
3. It provides a unifying view of various perspectives on social learning at the interaction level. By touching upon classical machine learning issues such as classification, active learning, exploration vs. exploitation, case-based methods, reinforcement and the use of prior knowledge, InFFrA provides a complete learning view of interaction. This enables us to make the relationship of specific algorithms to these issues explicit, if the algorithms have been developed (or analysed) with InFFrA.

In the following section, we will lay out the process of applying the InFFrA framework in the design of a MAL system for opponent classification.
4 AdHoc
InFFrA has been successfully used to develop the Adaptive Heuristic for Opponent Classification AdHoc, which addresses the problem of learning opponent models in the presence of large numbers of opponents. This is a highly relevant problem in open systems, because encounters with particular adversaries are only occasional, so that an agent will usually encounter peers it knows nothing of. Therefore, hoping that
opponents will only use a limited set of strategies is the only possibility of learning anything useful at all, and hence it is only natural to model classes of opponents instead of individuals. Remarkably, this issue has been largely overlooked by research on opponent modelling. Yet, this area abounds in methods for learning models of particular individuals' strategies (cf. [2,5,3,6,15]). Therefore, proposing new OM methods was not an issue by itself in the development of AdHoc. Instead, the classification method was designed to be parametrised with some OM method in concrete implementations. In fact, AdHoc does not depend on any particular choice of OM method, as long as the opponent models fulfil certain criteria.

For our experimental evaluation in an iterated multiagent game-playing scenario, we combined AdHoc with the well-known US-L* algorithm proposed by Carmel and Markovitch [4,3]. This algorithm is based on modelling opponent strategies in terms of deterministic finite automata (DFA), which can then be used to learn an optimal counter-strategy, e.g. by using standard Q-learning [18]. In explaining the system, we will first give an overview of AdHoc, then explain the underlying models and algorithms, and finally describe how it can be combined with the US-L* algorithm.
4.1 Overview
AdHoc creates and maintains a bounded, dynamically adapted number of opponent classes C = {c1, . . . , ck} in a society of agents A = {a1, . . . , an}, together with a (crisp) agent-class membership function m : A → C that denotes which class any known agent aj pertains to from the perspective of a modelling agent ai. In our application of AdHoc to multiagent iterated games, each of these classes will consist of (i) a DFA that represents the strategy of opponents assigned to it (this DFA is learned using the US-L* algorithm), (ii) a Q-value table that is used for evolving an optimal counter-strategy against this behaviour and (iii) a set of most recent interactions with this class. This instance of AdHoc complies with the more general assumptions that have to be made regarding any OM method that is used: it allows for the derivation of an opponent model, and it makes the use of this model possible in order to achieve optimal behaviour towards this kind of opponent.

AdHoc assumes that interaction takes place between only two agents at a time in discrete encounters e = (s0, t0), . . . , (sl, tl) of length l, where sk and tk are the actions of ai and aj in each round, respectively. Each pair (sk, tk) is associated with a real-valued utility ui for agent ai². The top-level AdHoc algorithm operates on the following inputs:
– an opponent aj that ai is currently interacting with,
– the behaviour of both agents in the current encounter e (we assume that AdHoc is called after the encounter is over) and
– an upper bound k on the maximal size of C.
² Any encounter can be interpreted as a fixed length iterated two-player normal-form game [7]; however, the OM method we use in our implementation does not require that ui be a fixed function that returns the same payoff for every enactment of a joint action (sk, tk) (in contrast to classical game-theoretic models of iterated games).
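For concreteness, a hedged sketch of the objects the heuristic manipulates, an encounter as a sequence of joint moves plus the bounded class set and membership map, is given below; all names are assumptions made for illustration only.

```python
# Minimal sketch of the data AdHoc operates on; names are illustrative.
from typing import Dict, List, Optional, Tuple

Move = str                              # e.g. "C" or "D" in the Prisoner's Dilemma
Encounter = List[Tuple[Move, Move]]     # [(s_0, t_0), ..., (s_l, t_l)]

class OpponentClass:
    """Placeholder for one class c_i; a concrete instance would wrap whatever
    opponent model OM(c) the chosen OM method provides."""
    def __init__(self, name: str):
        self.name = name

class AdHocState:
    """State maintained by the modelling agent a_i."""
    def __init__(self, k: int):
        self.k = k                                                  # upper bound on |C|
        self.classes: List[OpponentClass] = []                      # C, initially empty
        self.membership: Dict[str, Optional[OpponentClass]] = {}    # m, initially undefined

state = AdHocState(k=80)
encounter: Encounter = [("C", "C"), ("C", "D"), ("D", "D")]
```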
It maintains and modifies the values of
– the current set of opponent classes C = {c1, . . . , ck} (initially C = ∅) and
– the current membership function m : A → C ∪ {⊥} (initially undefined (⊥) for all agents).
Thus, assuming that an OM method is available for any c = m(aj) (obtained via the function OM(c)) which provides aj with methods for optimal action determination, the agent can use that model to plan its next steps.

In InFFrA terms, each AdHoc class is an interaction frame. In a given encounter, the modelling agent matches the current sequence of opponent moves with the behavioural models of the classes in C (situation interpretation and matching). It then determines the most appropriate class for the adversary (assessment). In AdHoc, this is done using a similarity measure S between adversaries and classes. After an encounter, the agent may have to revise its framing decision: if the current class does not cater for the current encounter, the class has to be modified (frame adaptation), or a better class has to be retrieved from C (re-framing). If no adequate alternative is found or frame adaptation seems inappropriate, a new class has to be generated that matches the current encounter. In order to determine its own next action, the agent applies the counter-strategy learned for this particular opponent model (behaviour generation). Feedback obtained from the encounter is used to update the hypothesis about the agent's optimal strategy towards the current opponent class.

A special property of AdHoc is that it combines the two types of opponent modelling previously discussed, i.e. the method of learning something about particular opponents versus the method of learning types of behaviours or strategies that are relevant for more than one adversary. It does so by distinguishing between opponents the agent is acquainted with and those it encounters for the first time. If an unknown peer is encountered, the agent determines the optimal class to be chosen after each move in the iterated game, possibly revising its choice over and over again. Else, the agent uses its experience with the peer by simply applying the counter-strategy suggested by the class that this peer had previously been assigned to. Next, we will describe how all this is realised in AdHoc in more detail.

4.2 The Heuristic in Detail

Before presenting the top-level heuristic itself, we first have to introduce OptAltClass, a function for determining the most suitable class for an opponent after a new encounter, which is employed in several situations in the top-level AdHoc algorithm. A pseudo-code description of this function is given in algorithm 1. The OptAltClass procedure proceeds as follows: if C is empty, a new class is created whose opponent model is consistent with e (function NewClass). Else, a set of maximally similar classes Cmax is computed, the similarity of which with the behaviour of aj must be at least b (we explain below how this threshold is used in the top-level heuristic). The algorithm strongly relies on the definition of a similarity measure S(aj, c) that reflects how accurate the predictions of c regarding the past behaviour of aj are.
Algorithm 1 Procedure OptAltClass
inputs: Agent aj, Encounter e, Set C, Int k, Int b
outputs: Class c
begin
  if C ≠ ∅ then
    {Compute the set of classes that are most similar to aj, at least with similarity b}
    Cmax = {c | S(aj, c) = maxc′∈C S(aj, c′) ∧ S(aj, c) ≥ b}
    if Cmax ≠ ∅ then
      {Return the "best" of the most similar classes}
      return arg maxc∈Cmax Quality(c)
    else
      {Create a new class, if |C| permits; else, the "high similarity" condition is dropped}
      if |C| < k then
        return NewClass(e)
      else
        return OptAltClass(C, k, −∞)
      end if
    end if
  else
    return NewClass(e)
  end if
end
In our prototype implementation, the value of S is computed as the ratio between the number of encounters with aj correctly predicted by the class and the number of total encounters with aj (where only entirely correctly predicted action sequences count as "correct" predictions, i.e. a single mis-predicted move suffices to reject the prediction of a particular class). As we will see below in the description of the top-level algorithm, this definition of S does not require entire encounters with aj to be stored (which would contradict our intuition that less attention should be paid to learning data associated with individual peers). Instead, the modelling agent can simply keep track of the ratio of successful predictions by incrementing counters.

If Cmax is empty, a new class has to be generated for aj. However, this is only possible if the size of C does not exceed k, because, as mentioned before, we require the set of opponent classes to be bounded in size. If C has already reached its maximal size, OptAltClass is called with b = −∞, so that the side-condition of S(aj, c) ≥ b can be dropped if necessary. If Cmax is not empty, i.e. there exist several classes with identical (maximal) similarity, we pick the best class according to the heuristic function Quality, which may use any additional information regarding the reliability or computational cost of classes. In our implementation, this function is defined as follows:

Quality(c) = α · #CORRECT(c)/#ALL(c) + β · #correct(c)/#all(c) + γ · #agents(c)/#known agents + (1 − α − β − γ) · 1/Cost(c)

where
– #ALL(c) is the total number of all predictions of class c in all past games,
– #CORRECT(c) is the total number of correct predictions of class c in all past games,
– #all(c) is the total number of all predictions of class c in the current encounter,
– #correct(c) is the total number of correct predictions of class c in the current encounter,
– #agents(c) = |{a ∈ A | m(a) = c}|,
– #known agents is the number of known agents,
– Cost(c) is a measure for the size of the model OM(c) and
– α + β + γ ≤ 1.
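A hedged restatement of this weighted sum in code is given below; the per-class counters are assumed to be tracked elsewhere, the field names are illustrative, and the default weights merely respect the usual choice α < β mentioned in the text.

```python
# Sketch of Quality(c) = a*#CORRECT/#ALL + b*#correct/#all
#                        + g*#agents/#known_agents + (1-a-b-g)*1/Cost.
# The counters are assumed bookkeeping maintained by the modelling agent.
from collections import namedtuple

ClassStats = namedtuple("ClassStats",
                        "correct_all all_preds correct_now all_now num_agents cost")

def quality(cls: ClassStats, known_agents: int,
            alpha: float = 0.2, beta: float = 0.4, gamma: float = 0.2) -> float:
    past = cls.correct_all / cls.all_preds if cls.all_preds else 0.0      # accuracy over all past games
    current = cls.correct_now / cls.all_now if cls.all_now else 0.0       # accuracy in the current encounter
    support = cls.num_agents / known_agents if known_agents else 0.0      # how many agents the class covers
    cheapness = 1.0 / cls.cost                                            # small models are preferred
    return (alpha * past + beta * current + gamma * support
            + (1 - alpha - beta - gamma) * cheapness)

print(quality(ClassStats(80, 100, 9, 10, 5, 12), known_agents=20))
```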
Thus, we consider those classes to be most reliable and efficient that are 1. accurate in past and current predictions (usually, α < β); 2. that account for a great number of agents (i.e. that have a large “support”); 3. that are small in size, and hence computationally cheap. It should be noted that the definition of a Quality function is not a critical choice in our implementation of AdHoc, since it is only used to obtain a deterministic method of optimal class selection in case several classes have exactly the same similarity value for the opponent in question (a rather unlikely situation). If the similarity measure is less fine-grained, though, such a “quality” heuristic might significantly influence the optimal class choice. Given the OptAltClass function that provides a mechanism to re-classify agents, we can now present the top-level AdHoc heuristic. Apart from the inputs and outputs already described in the previous paragraphs, it depends on a number of additional internal parameters: – an encounter comprehension flag ecf(c) that is true, whenever the opponent model of some class c “understands” (i.e. would have correctly predicted) the current encounter; – an “unchanged” counter u(c) that counts the number of past encounters (across opponents) for which the model for c has remained stable; – a model stability threshold τ that is used to determine very stable classes; – similarity thresholds δ, ρ1 and ρ2 that similarities S(a, c) are compared against to determine when an agent needs to be re-classified and which classes it might be assigned to. After completing an encounter with opponent aj , the heuristic proceeds as presented in the pseudo-code description of algorithm 2. At the beginning, we set the current class c to the value of the membership function for opponent aj . So in case of encountering a known agent, the modelling agent makes use of her prior knowledge about aj . Then, we update all classes’ current similarity values with respect to aj as described above, i.e. by dividing the number of past encounters with aj that would have been correctly predicted by class c (correct(aj , c)) by the total number of past encounters with aj (all(aj )). In InFFrA terms, the similarity function represents the difference model, and thus, AdHoc keeps track of the difference between all opponents’ behaviours and all frames simultaneously. Therefore, the assessment phase in the framing procedure simply consists of consulting the similarity
values previously computed. In other words, the complexity of assessing the usability of a particular frame in an interaction situation is shifted to the situation interpretation and matching phase.

Algorithm 2 AdHoc top-level heuristic
inputs: Agent aj, Encounter e, Integer k
outputs: Set C, Membership function m
begin
  c ← m(aj)
  {The similarity values of all classes are updated depending on their prediction accuracy regarding e}
  for all c′ ∈ C do
    S(aj, c′) ← correct(aj, c′) / all(aj)
  end for
  if c = ⊥ then
    {Unknown aj is put into the best sufficiently similar class that understands at least e, if any; else, a new class is created, if k permits}
    m(aj) ← OptAltClass(C, k, aj, 1)
    if m(aj) ∉ C then
      C ← C ∪ {m(aj)}
    end if
  else
    {c is incorrect wrt aj or very stable}
    if S(aj, c) ≤ δ ∨ u(c) ≥ τ then
      {Re-classify aj to a highly similar class, if any; else create a new class if k permits}
      m(aj) ← OptAltClass(C, k, aj, ρ1)
      if m(aj) ∉ C then
        C ← C ∪ {m(aj)}
      end if
    else
      {The agent is re-classified to the maximally similar (if also very stable) class}
      c′ ← OptAltClass(C, k, aj, ρ2)
      if c′ ∈ C ∧ u(c′) > τ then
        m(aj) ← c′
      end if
    end if
    Om-Learn(m(aj), e)
    if ecf(m(aj)) = false then
      {Model of m(aj) was modified because of e: reset similarities for all non-members of c}
      for all a′ ∈ A do
        if m(a′) ≠ c then
          S(a′, c) ← 0
        end if
      end for
    end if
    C ← C − {c′ | ∀a. m(a) ≠ c′}
  end if
end
If aj has just been encountered for the first time, the value of m(aj) is undefined (this is indicated by the condition c = ⊥). Quite naturally, aj is put into the best class that correctly predicts the data in the current encounter e, since the present encounter is the only available source of information about the behaviour of aj. Since only one sample e is available for the new agent, setting b = 1 in OptAltClass amounts to requiring that candidate classes correctly predict e. Note, however, that this condition will be dropped inside OptAltClass, if necessary (i.e. if no class correctly predicts the encounter). In that case, that class will be chosen for which Quality(c) is maximal. Again, taking an InFFrA perspective, this means that a (reasonably general and cheap) frame that is consistent with current experience is activated. If no such frame is available, the current encounter data is used to form a new category unless no more "agent types" can be stored. It should be noted that this step requires the OM method to be capable of producing a model that is consistent at least with a single observed encounter.

Next, consider the case in which m(aj) ≠ ⊥, i.e. the case in which the agent has been classified before. In this case, we have to enter the re-classification routine to improve the classification of aj, if this is possible. To this end, we choose to assign a new class to aj if the similarity between agent aj and its current class c falls below some threshold δ or if the model c has remained stable for a long time (u(c) ≥ τ), which implies that it is valuable with respect to predictions about other agents. Also, we require that candidate classes for this re-classification be highly similar to aj (b = ρ1). As before, if no such classes exist, OptAltClass will generate a new class for aj, and if this is not possible, the "high similarity" condition is dropped – we simply have to classify aj one way or the other. In the counter-case (high similarity and unstable model), we still attempt to pick a new category for aj. This time, though, we only consider classes that are very stable and very similar to aj (ρ2 > ρ1), and we ignore classes output by OptAltClass that are new (by checking "if c′ ∈ C . . ."). The intuition behind this is to merge similar classes in the long run so as to obtain a minimal C.

After re-classification, we update the class m(aj) by calling its learning algorithm Om-Learn and using the current encounter e as a sample. The "problem case" occurs if e has caused changes to model c because of errors in the predicted behaviour of aj (ecf(m(aj)) = false), because in this case the similarity values of m(aj) to all agents are no longer valid. Therefore, we choose to set the similarities of all non-members of c with that class to 0, following the intuition that since c has been modified, we cannot make any accurate statement about the similarity of other agents with it (remember that we do not store past encounters for each known agent and are hence unable to re-assess the values of S). Finally, we erase all empty classes from C.

To sum up, the heuristic proceeds as follows: it creates new classes for unknown agents or assigns them to best-matching classes if creating new ones is not admissible. After every encounter, the best candidate classes for the currently encountered agent are those that are able to best predict past encounters with it. At the same time, good candidates have to be models that have been reliable in the past and low in computational cost.
Also, classes are merged in the long run if they are very similar and very stable.
Table 1. Prisoner's Dilemma payoff matrix. Matrix entries (ui, uj) contain the payoff values for agents ai and aj for a given combination of row/column action choices, respectively. C stands for each player's "cooperate" option, D stands for "defect".

           aj
  ai      C        D
  C     (3,3)    (0,5)
  D     (5,0)    (1,1)
With respect to InFFrA, assigning agents to classes and creating new classes defines the re-framing and frame adaptation details. Merging classes and deleting empty classes, on the other hand, implements a strategy of long-term frame repository management. As far as action selection is concerned, a twofold strategy is followed: in the case of known agents, agent ai simply uses OM (m(aj )) when interacting with agent aj , and the classification procedure is only called after an encounter e has been completed. If an unknown agent is encountered, however, the most suitable class is chosen for action selection in each turn using OptAltClass. This reflects the intuition that the agent puts much more effort into classification in case it interacts with a new adversary, because it knows very little about that adversary. In the following section we introduce the scenario we have chosen for an empirical validation of the heuristic. This section will also present the details of combining AdHoc with a concrete OM method for a given application domain.
5 Application to Iterated Multiagent Games

To evaluate AdHoc, we chose the long-studied domain of iterated multiagent games, which captures a number of interesting interaction issues. More specifically, we implemented a simulation system in which agents move on a toroidal grid and play a fixed number of Iterated Prisoner's Dilemma [7,11] games whenever they happen to be in the same caret with some other agent. The matrix for the single-shot Prisoner's Dilemma game in normal form is given in Table 1. If more than two agents meet, every player plays against every other player where couplings are drawn in random order.

As stated before, we extend the model-based learning method US-L* proposed by Carmel and Markovitch [4,3], which learns opponent behaviour in terms of a DFA, with classification capabilities. To this end, we model the opponent classes used in AdHoc as follows: let C = {ci = ⟨Ai, Qi, Si⟩ | i = 1, . . . , k} such that Ai is a DFA that models the behaviour of opponents in ci and Qi a Q-table [18] that is used to learn an optimal strategy against Ai. The state space of the Q-table is the state space of Ai, and the Q-value entries are updated using the rewards obtained during encounters. We also store a set of samples Si with each ci. These are recent fixed-length sequences of game moves of both players used to train Ai. They are collected whenever the modelling agent plays against class ci. It is important to see that they may stem from
games against different opponents, if these opponents pertain to the same class ci , because this means that we do not need to store samples of several encounters with the same peer in order to learn the automaton; it suffices to interact with opponents of the same “kind”. Further, as required by AdHoc, a similarity measure σ : A × C → [0; 1] between adversaries and classes is maintained, as well as a membership function m : A → C that describes which opponent pertains to which class. Again, looking at InFFrA, this choice of opponent classes implies a specific design of interaction frames. We can easily describe how the concept of interaction frames can be mapped to the opponent classes: – trajectories – these are given by the behavioural models of the DFA, describing the behaviour of the opponent depending on the behaviour of the modelling agent (which is not restricted by the trajectory model); – roles & relationships – each frame/class is defined in terms of two roles, where one (that of the modelling agent) is always fixed; the m-function defines the set of agents that can fulfil the “role” captured by a particular opponent class, while the S-function keeps track of similarities between agents and these roles across different opponent classes. – contexts — preconditions, frame activation conditions and frame deactivation conditions are trivial: agents can in theory apply any frame in any situation, provided that they are in the same caret with an opponent; framing terminates after a fixed number of rounds has been played. The post-conditions of encounters are stored in terms of reward expectations as represented by the Q-table. – beliefs – these are implicit to the architecture: both agents know their action choices (capabilities), both know the game has a fixed length, both know that the other’s choices matter. The training samples and Q-table values constitute the private attributes associated with a frame. They capture experienced rewards and a (bounded-memory) history of experiences with every frame, respectively. To recall how the AdHoc classification fits into the framing view of social reasoning with this particular OM method, we need to look at the individual aspects of framing once more. Thereby, we have to distinguish between (i) the case in which the current opponent aj has been encountered before and (ii) the case in which we are confronted with an unknown adversary. Let us first look at the case in which aj has been encountered before: 1. Situation interpretation: The agent records the current interaction sequence (which becomes the perceived frame) and stores it in Sm(aj ) . It also updates the entries in the Q-table according to recent payoffs. 2. Frame matching: Similarity values are updated for all frames with respect to aj . As observed before, this is a complex variant of matching that moves some of the complexity associated with frame assessment to the matching phase. 3. Frame assessment and re-framing: These occur only after an encounter, since the classification procedure is only called after an encounter. As a consequence, the framing decision can only have effects on future encounters with the same agent. The framing decision itself depends on whether the current sequence of opponent
moves is understood by the DFA in m(aj ) or not. After calling the OM learning procedure, the opponent model currently used for aj may have been modified, i.e. a frame adaptation has taken place. Unfortunately, the US-L* does not allow for incremental modifications to the DFA. This means that if a sample challenges the current DFA, the state set of the DFA and its transitions are completely overthrown – the algorithm is not capable of making “minor” modifications to the previous DFA. Therefore, none of the difference models represented implicitly by the similarity function S is adequate after such a modification to the DFA. Thus, it is very risky to modify the model of a class. 4. Trial instantiation: Instead of testing different hypotheses and “mock-activating” them as suggested by the general InFFrA view, AdHoc performs a “greedy” search for an adequate frame by using OptAltClass. Hence, this is a very simple implementation of the trial instantiation phase. 5. Frame enactment: This is straightforward – the InFFrA component uses the Qtable associated with the frame/opponent class m(aj ) for action selection and uses the classical Boltzmann strategy for exploration. Since there is no other level of reasoning to compete with, the agent can directly apply the choices of the InFFrA layer. A notable speciality of the US-L*-AdHoc variant of InFFrA is that the DFA impose no restrictions on the action selection mode of the modelling agent itself, it is free to do anything that will help optimise its utility. In the case of unknown opponents, frame assessment and re-framing is implemented as in the previous case. The differences lie in matching and in making framing decisions, which occurs after each round of the encounter (and not only after the entire encounter). After each round, the modelling agent activates the most similar class with respect to the current sequence of moves and uses this class for enactment decisions (according to the respective Q-table).
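Before moving on to the experiments, a hedged sketch of how one opponent class ⟨Ai, Qi, Si⟩ from this section might be laid out in code is given below: a DFA as a transition table over the modelling agent's moves, a Q-table keyed by the DFA's states, and a bounded store of recent samples. The class and field names are assumptions for illustration; this shows the data layout only, not the US-L* learning algorithm.

```python
# Illustrative data layout for one opponent class <A_i, Q_i, S_i>.
from collections import deque
from typing import Deque, Dict, List, Tuple

Move = str  # "C" or "D"

class OpponentClassModel:
    def __init__(self, max_samples: int = 6):
        # DFA A_i: state -> (modelling agent's last move -> next state), plus the
        # opponent move predicted in each state. Starts as a trivial one-state automaton.
        self.transitions: Dict[int, Dict[Move, int]] = {0: {"C": 0, "D": 0}}
        self.predicted_move: Dict[int, Move] = {0: "C"}
        # Q-table Q_i over the DFA's state space, one entry per (state, own action).
        self.q: Dict[Tuple[int, Move], float] = {}
        # Recent samples S_i used to (re)train the automaton, bounded in size.
        self.samples: Deque[List[Tuple[Move, Move]]] = deque(maxlen=max_samples)

    def predict(self, my_moves: List[Move]) -> Move:
        """Run the DFA on the modelling agent's moves and return the opponent
        move this class predicts next."""
        state = 0
        for m in my_moves:
            state = self.transitions[state].get(m, 0)
        return self.predicted_move[state]

model = OpponentClassModel()
print(model.predict(["C", "D", "C"]))   # -> "C" with the initial one-state automaton
```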
6 Experimental Results

In the first series of experiments we conducted, one AdHoc agent played against a population of opponents with fixed strategies chosen from the following pool of strategies:
– "ALL C" (always cooperate),
– "ALL D" (always defect),
– "TIT FOR TAT" (cooperate in the first round; then play whatever the opponent played in the previous round) and
– "TIT FOR TWO TATS" (cooperate initially; then, cooperate iff the opponent has cooperated in the two most recent moves).
Using these simple strategies served as a starting point to verify whether the AdHoc agent was capable of performing the task in principle. If AdHoc proved able to group a steadily increasing number of opponents into these four classes and to learn optimal strategies against them, this would show that AdHoc can do the job. An important advantage of using these simple and well-studied strategies was that it is easy to verify whether the AdHoc agent is playing optimally, since optimal counter-strategies can be analytically derived.
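Since these strategies are fully specified, they can be written down directly. The sketch below also includes the Table 1 payoffs; the function names are illustrative, and TIT FOR TWO TATS is implemented in its usual reading (defect only after two consecutive opponent defections).

```python
# Straightforward rendering of the four fixed opponent strategies and the
# Table 1 payoffs; names are illustrative.
from typing import Dict, List, Tuple

PAYOFF: Dict[Tuple[str, str], Tuple[int, int]] = {
    ("C", "C"): (3, 3), ("C", "D"): (0, 5),
    ("D", "C"): (5, 0), ("D", "D"): (1, 1),
}

def all_c(opponent_history: List[str]) -> str:
    return "C"                                        # always cooperate

def all_d(opponent_history: List[str]) -> str:
    return "D"                                        # always defect

def tit_for_tat(opponent_history: List[str]) -> str:
    # Cooperate first, then mirror the opponent's previous move.
    return opponent_history[-1] if opponent_history else "C"

def tit_for_two_tats(opponent_history: List[str]) -> str:
    # Cooperate initially; defect only after two consecutive opponent defections.
    if len(opponent_history) >= 2 and opponent_history[-2:] == ["D", "D"]:
        return "D"
    return "C"

print(tit_for_tat(["C", "D"]), tit_for_two_tats(["C", "D"]))   # -> D C
```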
Fig. 3. Number of agent classes an AdHoc agent creates over time in contrast to the total number of known (fixed-strategy) opponents (which is increased by 40 in rounds 150, 300 and 450). As can be seen, the number of identified classes converges to the actual (four) strategies.
To obtain an adequate and fair performance measure for the classification heuristic, we compared the performance of the AdHoc agent to that of an agent who learns one model for every opponent it encounters ("One model per opponent") and to that of an agent who learns a single model for all opponents ("Single model"). The results of these simulations are shown in figures 3 and 4. Plots were averaged over 100 simulations on a 10 × 10 grid. Parameter settings were: δ = 0.3, τ = 15, ρ1 = 0.6, ρ2 = 0.9, k = 80 and l = 10. Six samples were stored for each class in order to learn opponent automata. The plots show that the agent is indeed capable of identifying the existing classes, and that convergence to a set of opponent classes is robust against the entry of new agents into the system. In terms of scalability, this is a very important result, because it means that AdHoc agents are capable of evolving a suitable set of opponent classes regardless of the size of the agent population, as long as the number of strategies employed by adversaries is limited (and in most applications this will be reasonable to assume).

A look at the performance results with respect to utility reveals even more impressive results: the AdHoc agent not only does better than the "single model" agent (which is quite natural), but also significantly outperforms an allegedly "unboundedly rational" agent that is capable of constructing a new opponent model for each adversary! Even though the unboundedly rational agent's performance is steadily increasing, it remains below that of the AdHoc agent even after 40000 encounters. This sheds new light on the issues discussed in section 2, because it means that designing learning algorithms in a less "individual-focused" way can even speed up the progress in learning. In technical terms, the reason for this is easily found. The speed-up is caused by the fact that the AdHoc agent obtains much more learning data for every class model it maintains by forcing more than one agent into that model, thus being able
to learn a better strategy against every class within a shorter period of time. This illustrates an aspect of learning in open systems that we might call "social complexity reduction": the ability to adapt to adversaries quickly by virtue of creating stereotypes. If opponents can be assumed to have something in common (their strategy), learning whatever is common among them can be achieved faster if they are not granted individuality.

Fig. 4. Comparison of cumulative rewards between AdHoc agent, an agent that maintains one model for each opponent and an agent that has only a single model for all opponents in the same setting as above.

Another issue that deserves analysis is, of course, the appropriate choice of the upper bound k for the number of possible opponent classes. Figure 5 shows a comparison between AdHoc agents that use values 10, 20, 40 and 80 for k, respectively, in terms of both number of opponent classes maintained and average reward per encounter. Even though there is not much difference between the time needed to converge to the optimal number of opponent classes, there seem to be huge differences with respect to payoff performance. More specifically, although a choice of k = 40 instead of k = 80 seems to have little or no impact on performance, values of 10 and 20 are certainly too small. In order to interpret this result, we have to consider two different aspects of the US-L*-AdHoc system:
– The more models are maintained, the more exploration will be carried out per model in order to learn an optimal strategy against it.
– The fewer models we are allowed to construct, the more "erroneous" these models will be in predicting the behaviour of adversaries that pertain to them.
The first aspect is simply due to the fact that as new classes are generated, their Q-tables have to be initialised and it takes a number of Q-value updates until a reasonably good counter-strategy is learned. On the other hand, a small upper bound on the number of classes forces us to use models that do not predict their members' behaviours correctly, at least until a large amount of training data has been gathered.
Fig. 5. Comparison between AdHoc agents using different k values. The upper plot shows the number of opponent classes the agents maintain, while the plot below shows the average reward per encounter. The number of opponent classes remains stable after circa 5000 rounds.
This contrasts with our previous observation about the potential of creating stereotypes. Although it is certainly true that we have to trade off these two aspects against each other (both extremely high and extremely low values for k seem to be inappropriate), our results here are in favour of large values for k. This suggests that allowing some diversity during learning seems to be crucial for achieving effective learning and interaction.

The second series of experiments involved simulations with societies that consist entirely of AdHoc agents. Here, unfortunately, AdHoc agents fail to evolve any useful strategy; they exhibit random behaviour throughout. The reason for this is that they have fairly random initial action selection distributions (when Q-tables are not yet filled and the automata are unsettled). Hence, no agent can strategically adapt to the strategies of others (since the others do not have a strategy, either). A slight modification to the OM method, however, suffices to produce better results. We simply add the following rule to the decision-making procedure:
If the automaton of a class c is constantly modified during r consecutive games, we play some fixed strategy X for r games; then we return to the strategy suggested by OM(c).

The intuition behind this rule can be re-phrased as follows: if an agent discovers that her opponent is not pursuing any discernible strategy whatsoever, she takes the initiative to come up with a reasonable strategy herself. In other words, she tries to become "learnable" herself in the hope that an adaptive opponent will develop some strategy that can be learned in turn. An important design issue is which strategy X to use in practice, and we investigated four different methods:
1. Use TIT FOR TAT on the grounds that it is stable and efficient in the IPD domain.
2. Use a fixed strategy that is obtained by generating an arbitrary DFA.
3. Pick the strategy from the pool of known opponent strategies C that a) has the highest Quality value, or b) provides the largest payoff.

The results of these simulations are shown in figure 6.

Fig. 6. Comparison of agent performance in "AdHoc vs. AdHoc" simulations and different selection methods for the strategy chosen if the opponent exhibits random behaviour.

Quite clearly, the choice of TIT FOR TAT outperforms the other methods. More precisely, the AdHoc agents who have already met before will converge to the (C, C) joint action once at least one of them has played the fixed strategy for a while (and has hence become comprehensible to its opponent). Of course, newly encountered opponents may still appear to be behaving randomly, so that sub-optimal play with those agents cannot be avoided in many situations. The superiority of TIT FOR TAT in these experiments confirms the long-lived reputation of this strategy [1], because it is known to be a very safe strategy against a variety of counter-strategies. In terms of deriving a generic method of choosing such a strategy that does not depend on knowledge about the particular interaction problem, this is certainly not an optimal solution, and it is surely an issue that deserves further investigation. Still, our experiments prove that AdHoc is at least capable of "trying out" new strategies without jeopardising long-term performance, although this is certainly not optimal in decision-theoretic terms.
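Returning to the fallback rule stated at the beginning of this series of experiments, the following is a minimal sketch of how it might be wired into move selection, treating each call as one game for simplicity and using TIT FOR TAT as the fixed strategy X; the stub class and its counters are assumptions for illustration only.

```python
# Hedged sketch of the fallback rule: if a class's automaton has kept changing
# for r consecutive games, play the fixed strategy X (TIT FOR TAT here) for the
# next r games, then return to the counter-strategy suggested by OM(c).
def tit_for_tat_move(opp_history):
    return opp_history[-1] if opp_history else "C"

class ClassStub:
    """Minimal stand-in for an opponent class with the bookkeeping this needs."""
    def __init__(self):
        self.unstable_streak = 0   # consecutive games in which the automaton changed
        self.fallback_left = 0     # remaining games in which to play the fixed strategy X
    def counter_strategy_move(self, opp_history):
        return "D"                 # placeholder for the strategy derived from OM(c)

def next_move(cls: ClassStub, opp_history, r: int = 10):
    if cls.fallback_left > 0:              # currently in the X-phase
        cls.fallback_left -= 1
        return tit_for_tat_move(opp_history)
    if cls.unstable_streak >= r:           # automaton modified during r consecutive games
        cls.fallback_left = r - 1
        cls.unstable_streak = 0
        return tit_for_tat_move(opp_history)
    return cls.counter_strategy_move(opp_history)

c = ClassStub()
c.unstable_streak = 10
print([next_move(c, ["C"]) for _ in range(12)])  # ten games of TIT FOR TAT, then back to OM(c)
```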
7 Conclusion
Open systems are becoming increasingly important in real-world applications, and they pose new problems for multiagent learning methods. In this paper, we have suggested a novel way of looking at the problem of learning how to interact effectively in such systems: instead of focusing on the behaviour of individual co-actors, learning concentrates on recurring patterns of interaction that are relevant regardless of the particular opponent. We presented InFFrA, a framework for designing and analysing such learning algorithms based on the sociological notions of frames and framing, and applied it to the design of a heuristic for opponent classification. A practical implementation of this heuristic in the context of a multiagent IPD scenario was used to conduct extensive empirical studies which demonstrated the adequacy of
[Figure 6 plot: reward per 100 games (y-axis) against interactions in IPD games (x-axis), comparing the Tit For Tat, Random Class, Maximum Quality, and Highest Payoff fallback strategies.]
Fig. 6. Comparison of agent performance in “AdHoc vs. AdHoc” simulations and different selection methods for the strategy chosen if the opponent exhibits random behaviour.
our approach. This system, called US-L*-AdHoc, not only constitutes a first application of InFFrA; it also extends an existing opponent modelling method with the capability of classification, rendering that method usable for systems with more than just a few agents. For a particular domain of application, US-L*-AdHoc successfully addresses the problems of behavioural diversity, agent heterogeneity and agent fluctuation. It does so by allowing for different types of opponents and deriving optimal strategies for acting towards these opponents, while at the same time coercing data of different opponents into the same model whenever possible, so that the learning effort pays off even if particular agents are never encountered again. The work presented here is only a first step toward the development of learning algorithms for open multiagent systems, and many issues remain to be resolved, of which we mention but a few that are directly related to our own research. One of these is the use of communication: in our experiments, the main reason cooperative interaction patterns fail to evolve in the "AdHoc vs. AdHoc" setting (i.e. the interesting case), unless TIT FOR TAT is used as a "fallback" strategy, is the utility pressure that causes agents to use ALL D whenever in doubt (after all, it is the best strategy if nothing is known about the other's strategy). If agents were able to indicate which strategy they are going to play without endangering their utility performance every time, we expect the prospects for cooperation to emerge to be much greater. Another interesting issue to explore is the application of US-L*-AdHoc to more realistic multiagent domains. Finally, a theoretical framework for analysing the trade-off between learning models of individuals and learning models of recurring behaviours would surely contribute to a principled development of learning algorithms for open systems.
References
1. R. Axelrod. The evolution of cooperation. Basic Books, New York, NY, 1984.
2. H. Bui, D. Kieronska, and S. Venkatesh. Learning other agents' preferences in multiagent negotiation. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 114–119, Menlo Park, CA, 1996. AAAI Press.
3. D. Carmel and S. Markovitch. Learning and using opponent models in adversary search. Technical Report 9609, Technion, 1996.
4. D. Carmel and S. Markovitch. Learning models of intelligent agents. In Thirteenth National Conference on Artificial Intelligence, pages 62–67, Menlo Park, CA, 1996. AAAI Press/MIT Press.
5. C. Claus and C. Boutilier. The dynamics of reinforcement learning in cooperative multiagent systems. In Collected Papers from the AAAI-97 Workshop on Multiagent Learning, pages 13–18. AAAI, 1997.
6. Y. Freund, M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, and R. E. Schapire. Efficient algorithms for learning to play repeated games against computationally bounded adversaries. In 36th Annual Symposium on Foundations of Computer Science (FOCS'95), pages 332–343, Los Alamitos, CA, 1995. IEEE Computer Society Press.
7. D. Fudenberg and J. Tirole. Game Theory. The MIT Press, Cambridge, MA, 1991.
8. L. Gasser. Social conceptions of knowledge and action: DAI foundations and open systems semantics. Artificial Intelligence, 47:107–138, 1991.
9. E. Goffman. Frame Analysis: An Essay on the Organisation of Experience. Harper and Row, New York, NY, 1974. Reprinted 1990 by Northeastern University Press.
10. C. Hewitt. Open information systems semantics for distributed artificial intelligence. Artificial Intelligence, 47:79–106, 1991.
11. R. D. Luce and H. Raiffa. Games and Decisions. John Wiley & Sons, New York, NY, 1957.
12. T. Mitchell. Machine Learning. McGraw-Hill, New York, 1997.
13. A. S. Rao and M. P. Georgeff. BDI agents: From theory to practice. In Proceedings of the First International Conference on Multi-Agent Systems (ICMAS-95), pages 312–319, 1995.
14. M. Rovatsos. Interaction frames for artificial agents. Research Report FKI244-01, AI/Cognition Group, Department of Informatics, Technical University of Munich, 2001.
15. M. Rovatsos and J. Lind. Hierarchical common-sense interaction learning. In E. H. Durfee, editor, Proceedings of the Fifth International Conference on Multi-Agent Systems (ICMAS-00), Boston, MA, 2000. IEEE Press.
16. M. Rovatsos, G. Weiß, and M. Wolf. An approach to the analysis and design of multiagent systems based on interaction frames. In M. Gini, T. Ishida, C. Castelfranchi, and W. L. Johnson, editors, Proceedings of the First International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS'02), Bologna, Italy, 2002. ACM Press.
17. J. M. Vidal and E. H. Durfee. Agents learning about agents: A framework and analysis. In Collected Papers from the AAAI-97 Workshop on Multiagent Learning, pages 71–76. AAAI, 1997.
18. C. Watkins and P. Dayan. Q-learning. Machine Learning, 8:279–292, 1992.
19. G. Weiß, editor. Multiagent Systems: A Modern Approach to Distributed Artificial Intelligence. The MIT Press, Cambridge, MA, 1999.
Situated Cognition and the Role of Multi-agent Models in Explaining Language Structure

Henry Brighton, Simon Kirby, and Kenny Smith

Language Evolution and Computation Research Unit, Department of Theoretical and Applied Linguistics, The University of Edinburgh, Adam Ferguson Building, 40 George Square, Edinburgh EH8 9LL
{henry, simon, kenny}@ling.ed.ac.uk
Abstract. How and where are the universal features of language specified? We consider language users as situated agents acting as conduits for the cultural transmission of language. Using multi-agent computational models we show that certain hallmarks of language are adaptive in the context of cultural transmission. This observation requires us to reconsider the role of innateness in explaining the characteristic structure of language. The relationship between innate bias and the universal features of language becomes opaque when we consider that significant linguistic evolution can occur as a result of cultural transmission.
1 Introduction
There must be a biological basis for language. Animals cannot be taught language. Now imagine having a thorough knowledge of this capacity: a detailed explanation of whatever cognitive processes are relevant to learning, understanding, and producing language. Would this understanding be sufficient for us to predict universal features of language? Human languages exhibit only a limited degree of variation. Those aspects of language that do not vary are termed language universals. The assumption of contemporary linguistics and cognitive science is that these hallmarks can shed light on the cognitive processes underlying language. In the discussion that follows we reflect on the reverse implication, and argue that language universals cannot be fully explained by understanding biologically determined aspects of cognition. The relationship between the two is opaque, and mediated by a cultural dynamic in which some linguistic forms are adaptive [23]. In addressing this question one must reconsider the traditional practice in cognitive science of, first, isolating a competence from its cultural context and then, secondly, attempting to understand that competence such that its behaviour can be fully explained. This practice is questioned by the proponents of embodied cognitive science [12,44,11,4,32]. We examine the claims of embodied
cognitive science, specifically the principle of situatedness, and relate this enterprise to recent work in the field of computational evolutionary linguistics [1,24,2,26]. We note that these two approaches share a methodological assumption, one that singles out cultural context as being a theoretically significant consideration. In the discussion that follows we show how this notion of cultural context can be modelled using multi-agent computational models. In short, we aim to show how multi-agent systems can be used to shed light on some fundamental issues not only in linguistics but also in cognitive science in general. First we discuss alternative standpoints in explaining why, as a cognitive process, language exhibits certain designs. We argue that situatedness must form part of any explanation – a thorough understanding of linguistic competence cannot lead to a thorough explanation for the universal aspects of language structure. To flesh out this claim we present work on an agent-based framework for studying the evolution of language: the iterated learning model. In particular, we focus on compositionality in language. Insights gained from these models suggest that language designs cannot be explained by understanding language in terms of a detached individual's knowledge of language. An argument for this stance is presented in Section 4, where we make explicit the foundational principles that underlie our approach to understanding the characteristic structure of language.
2 Explaining Universal Features of Language
Take all the world's languages and note the structural features they have in common. On the basis of these universal features of language, we can propose a universal grammar, a hypothesis that circumscribes the core features of all possible human languages [7]. On accepting this hypothesis, we should ask: Why is linguistic form subject to this set of universal properties? More precisely, how and where is this restricted set of structures specified? The discussion that follows will address the manner in which this question is answered. The hunt for an explanation of universal features is traditionally mounted by arguing that universal grammar is an innate biological predisposition that partially defines the manner in which language is learned by a child. Through the process of learning, the linguistic stimulus a child faces, be it Chinese or Spanish, results in a knowledge of language. For Chomsky, learning is "better understood as the growth of cognitive structures along an internally directed course under the triggering and partially shaping effect of the environment" [9]. So an innate basis for language, along with the ability to learn, permits the child to arrive at a knowledge of language. The degree to which language is specified innately is a matter of heated debate. At one extreme, we can imagine a highly specialised "language instinct" [33]; at the other, we can imagine a domain-general learning competence which serves language as well as other cognitive tasks [14].
2.1 The Object of Study
For a moment, let us stand back from this debate and examine the vocabulary of explanation we have employed to answer the original question: How and where are universal features of language specified? We notice that an explanation of a population-level phenomenon – language – has been reduced to the problem of an individual's knowledge of language. Languages vary greatly, but we are specifically interested in the features common to all languages. Universal properties of language, to a greater or lesser extent, are specified innately in each human. This de-emphasis of context, culture and history is a recurring theme in cognitive science, as Howard Gardner notes: "Though mainstream cognitive scientists do not necessarily bear any animus [...] against historical or cultural analyses, in practice they attempt to factor out these elements to the maximum extent possible." [15]. Taking this standpoint helps in mounting a practical investigation into a possible answer to the question. The universal aspects of language we see in the world are strongly correlated with an individual's act of cognition, which is taken to be biologically determined. Now we have isolated the real object of study. Understanding the innate linguistic knowledge of humans will lead us to an understanding of why language is the way it is. For the purposes of this study, let us characterise this position.
Principle 1 (Principle of detachment.) A total explanation of the innate basis for language, along with an explanation of the role played by the linguistic stimulus during the language acquisition process, would be sufficient for a thorough explanation for the universal properties of language.
Now the problem is to account for a device that relates input (linguistic stimulus) to output (knowledge of language). For example, Chomsky discusses a language acquisition device (LAD) in which the output takes the form of a grammatical system of rules. He states that "An engineer faced with the problem of designing a device for meeting the given input-output conditions would naturally conclude that the basic properties of the output are a consequence of the design of the device. Nor is there any plausible alternative to this assumption, so far as I can see" [8]. In other words, if we want to know how and where the universal design features of language are specified, we need look no further than an individual's competence derived from primary linguistic data via the LAD. This position, which we have termed the principle of detachment, runs right through cognitive science and amounts to a general approach to studying cognitive processes. For example, in his classic work on vision, Marr makes a convincing case for examining visual processing as a competence understood entirely by considering a series of transformations of visual stimulus [28,29]. We will now consider two bodies of work that suggest that the principle of detachment is questionable.¹
¹ There are other arguments for questioning the principle of detachment, for example, those presented by Winograd & Flores [44], but we omit them for the sake of brevity.
Explanation Via Synthetic Construction. One of the aims of cognitive science, and in particular, artificial intelligence (AI), is to explain human, animal, and alien cognition by building working computational models. Those working in the field of AI often isolate a single competence, such as reasoning, planning, learning, or natural language processing. This competence is then investigated in accordance with the principle of detachment, more often than not, in conjunction with a simplified model of the environment (a micro-world). These simplifying assumptions, given the difficulty of the task, are quite understandable. So the traditional approach is centred around the belief that investigating a competence with respect to a simplified micro-world will yield results that, by and large, hold true when that agent is placed in the real world. General theories that underly intelligent action can therefore be proposed by treating the agent as a detached entity operating with respect to an environment. Crucially, this environment is presumed to contain the intrinsic properties found in the environment that “real” agents encounter. This is a very broad characterisation of cognitive science and AI. Nevertheless, many within cognitive science see this approach as misguided and divisive, for a number of reasons. For example, we could draw on the wealth of problems and lack of progress traditional AI is accused of [32]. Some within AI have drawn on this history of perceived failure to justify a new set of principles collectively termed Embodied Cognitive Science [32], and occasionally New AI [4]. Many of these principles can be traced back to Hubert Dreyfus’ critique of AI, 20 years earlier [12]. The stance proposed by advocates of embodied cognitive science is important because they refine Dreyfus’ stance, build on it, and crucially cite examples of successful engineering projects. This recasting of the problem proposes, among others, situatedness as a theoretical maxim [11]. Taking the principle of situatedness to its extreme, the exact nature of the environment is to be taken as primary and theoretically significant. For example, the environment may be partly constructed by the participation of other agents [5]. In other words, certain aspects of cognition can only be fully understood when viewed in the context of participation [44,4]. It is important to note that this “new orientation” is seen by many as opposing the branches of mainstream AI, or at least the branches of AI that claim to explain cognition. If, for a moment, we believe the advocates of embodied cognitive science, they are telling us that any explanation for a cognitive capacity must be tightly coupled with an understanding of the environment. What impact does this discussion have on our questions about language universals? First, it provides a source of insights into investigating cognition through building computational models. A theory faces a different set of constraints when implemented as a computational model. An explanation that is grounded by a synthetic artifact can act as a sanity check for theory. Second, this discussion admits the possibility that investigating cognition without assuming the principle of detachment can be fruitful. In the context of language and communication, the work of Luc Steels is a good example of this approach. Steels investigates the construction of perceptual distinctions and signal lexicons in visually grounded communicating
robots [40,41]. In this work signals and the meanings associated with signals emerge as a result of self-organisation.
The Evolutionary Explanation. Only humans have language. How did language evolve? The communication systems used by animals do not even approach the sophistication of human language, so the question must concern the evolution of humans over the past 5 million years, since the split with our last non-linguistic ancestor, Australopithecus [22]. Unfortunately, there is no fossil evidence offering concrete insights into the evolution of language in humans. We can, for example, analyse the evolution of the vocal tract, or examine skulls and trace a path through the skeletal evolution of hominids, but the kind of conclusions we can draw from such evidence can only go so far [27,43]. Over the past 15 years computational evolutionary linguistics has emerged as a source of alternative answers. This approach uses computational models to try to shed light on the very complex problem of the evolution of language in humans [18,26]. One source of complexity is the interaction between two substrates, each one operating on a different time-scale. More precisely, linguistic information is transmitted on two evolutionary substrates: the biological and the cultural. For example, you are born with some innate predisposition for language which evolved over millions of years. The linguistic forms you inherit from your culture have evolved over hundreds of years, and your linguistic competence emerges over tens of years. Much of the work on the evolution of language, particularly in the context of computational modelling, has analysed this interaction. By modelling linguistic agents as learners and producers of language, and then investigating how communication systems evolve in the presence of both biological and cultural transmission, computational evolutionary linguistics attempts to shed light on how language could evolve from non-linguistic communities. This approach draws on disciplines such as cognitive science, artificial life, complexity, and theoretical biology. Recent work in this field has focussed on how certain hallmarks of human language can arise in the absence of biological change. This observation must lead us to consider how far a biological explanation for language can take us. For example, the very possibility of trademark features of language not being fully explained in terms of an individual's (biologically determined) cognitive capacity raises important questions. We detail this work in the next section, but raise the issue here as it impacts on the current discussion. In explaining how and why language has its characteristic structure, the evolutionary approach is in line with the claims made by proponents of embodied cognitive science. A thorough explanation for language universals may lie outside the traditional vocabulary of explanation, in which case the principle of detachment will need to be breached.
2.2 Summary
This discussion has outlined the basis for asking two questions. First, what kind of explanatory vocabulary should be invoked when explaining universal features of language? Secondly, can situatedness shed light on this problem?
Building multi-agent computational models allows us to analyse how cognitive agents interact and, specifically, what role this interaction plays in explaining the behaviour we observe in nature. This approach serves an important purpose for cognitive science generally, which traditionally views the individual as the locus of study. For linguistics, being a subfield of cognitive science, a multi-agent approach to understanding cognition, one which takes situatedness as theoretically significant, is an untapped resource. We aim to fully investigate how relevant multi-agent systems are to the question of explaining universal features of language. This is a timely investigation. For example, on the validity of artificial intelligence Chomsky notes that "in principle simulation certainly can provide much insight" [10]. Perhaps more relevant is the remark made by another prominent linguist, Ray Jackendoff: "If some aspects of linguistic behaviour can be predicted from more general considerations of the dynamics of communication in a community, rather than from the linguistic capacities of individual speakers, then they should be." [21]. Taking these two observations together, we should at least consider the role of situatedness in explaining the universal features of language. The next section presents recent work exploring precisely this question.
3 Language Evolution and Iterated Learning
The Iterated Learning Model (ILM) is a general framework for modelling the cultural transmission of language [24,2], and is based on Hurford's conception of the expression/induction model [18,19]. The basis of an iterated learning model is a series of generations. Each generation consists of a population of agents which learn language from utterances produced by the previous generation. Each agent represents a language user, and begins life as an infant observing the language of adult agents in the previous generation. The agent learns from these observations and induces a knowledge of language. After doing so, the infant becomes an adult. Once an adult, an agent will be prompted to form utterances which infant agents, in the next generation, observe. This process, depicted in Figure 1, is repeated for some number of generations, typically in the thousands. In this article we will concentrate on models which have one agent in each generation. A simulation therefore comprises many agents, but the transfer of information is only ever between two agents. This simplification is important, as we first need to understand the kind of linguistic structure that can be explained in the absence of complex information transfer. An ILM is not restricted to this one-to-one transfer: we are currently embarking on research into the impact of population effects on language structure. In brief, the iterated learning model allows us to see how a language evolves over time, as it passes through a repeated cycle of induction and production. The agents themselves act as a conduit for language, with the bias inherent in the processes of learning and generalisation defining, in part, how language will evolve from one generation to the next. In the ILM a language is defined as a mapping from meanings to signals. Meanings are regarded as abstract structured entities, and modelled here as
Fig. 1. The agents in the ILM produce utterances. These utterances are used by the agents in the next generation to induce a knowledge of language. By repeating this process, the language evolves.
feature vectors. Signals differ from meanings in that they are of variable length. Signals are built by concatenating abstract symbols drawn from some alphabet. These idealisations are consistent with Pinker and Bloom's characterisation of language as the "transmission of propositional structures over a serial channel" [34]. One of the hallmarks of human language, which we will be considering in detail, is the property of compositionality [31]: the meaning of a signal is a function of the meaning of its parts, and how they are put together. Compositional languages are those exhibiting the property of compositionality. We can contrast these with holistic languages, where parts of the meaning do not correspond to parts of the signal — the only association that exists is one that relates the whole meaning to the whole signal. Before going into the details of the ILM, it is worth considering three examples of communication systems found in nature:
1. The alarm calls of Vervet monkeys provide us with the classic example of a largely innate holistic communication system [6].
2. Bird song has learned signals with elaborate structure, but the meaning the song conveys is believed to be holistic – a structured song refers to the meaning as a whole [17].
3. Honey bees do have a compositional communication system, but it is innate [42].
Significantly, the only communication system that is learned and exhibits compositionality is human language. Both compositional and holistic utterances occur in human language. For example, the idiom "kicked the bucket" is a holistic utterance which means 'died'. Contrast this utterance with "large green caterpillar", for which the meaning is a function of the meaning of its parts: "large", "green", and "caterpillar". A simple² example of a holistic language, using the formalisation of language in the ILM, might be the following set of meaning/signal pairs Lholistic:
Lholistic = {⟨{1, 2, 2}, sasf⟩, ⟨{1, 1, 1}, ac⟩, ⟨{2, 2, 2}, ccx⟩, ⟨{2, 1, 1}, q⟩, ⟨{1, 2, 1}, pols⟩, ⟨{1, 1, 2}, monkey⟩}
No relation exists between the signals and the meanings, other than the whole signal standing for the whole meaning. In contrast, an example of a compositional language is the set:
Lcompositional = {⟨{1, 2, 2}, adf⟩, ⟨{1, 1, 1}, ace⟩, ⟨{2, 2, 2}, bdf⟩, ⟨{2, 1, 1}, bce⟩, ⟨{1, 2, 1}, ade⟩, ⟨{1, 1, 2}, acf⟩}
Notice that each signal is built from symbols that map directly onto feature values. Therefore, this is a compositional language; the meaning associated with each signal is a function of the meaning of the parts of that signal. Now, at some point in evolutionary history, we presume that a transition from a holistic to a compositional communication system occurred [45]. This transition formed part of what has been termed the eighth major transition in evolution – from an animal communication system to a full-blown human language [30]. Using the ILM, we can try to shed light on this transition. In other words, how and why might a holistic language such as Lholistic spontaneously pass through a transition to a compositional language like Lcompositional?
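To make the contrast concrete, the following sketch (our own illustration; the dictionary encoding and the helper is_compositional are hypothetical, not part of the ILM formalism) represents the two example languages as mappings from feature tuples to signal strings and tests whether each signal decomposes into one symbol per feature value.

```python
# Two toy languages mapping meanings (feature tuples) to signals, following the
# example sets L_holistic and L_compositional above (our own illustration).
L_holistic = {
    (1, 2, 2): "sasf", (1, 1, 1): "ac",   (2, 2, 2): "ccx",
    (2, 1, 1): "q",    (1, 2, 1): "pols", (1, 1, 2): "monkey",
}
L_compositional = {
    (1, 2, 2): "adf", (1, 1, 1): "ace", (2, 2, 2): "bdf",
    (2, 1, 1): "bce", (1, 2, 1): "ade", (1, 1, 2): "acf",
}

def is_compositional(language):
    """True if every feature value maps to exactly one symbol at its position,
    i.e. each signal is a concatenation of per-feature symbols."""
    symbol = {}  # (feature_index, feature_value) -> signal symbol
    for meaning, signal in language.items():
        if len(signal) != len(meaning):      # holistic signals need not align
            return False
        for i, (value, char) in enumerate(zip(meaning, signal)):
            if symbol.setdefault((i, value), char) != char:
                return False                 # inconsistent feature-to-symbol map
    return True

print(is_compositional(L_holistic))       # False
print(is_compositional(L_compositional))  # True
```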
3.1 Technicalities of the ILM
Agents in the ILM learn a language on the basis of a set of observed meaning/signal pairs L′. This set L′ is some random subset of the language which could have been spoken in the previous generation, denoted as L. That is, L′ is the set of utterances of L that were produced. Humans are placed in precisely this position. First, we hear signals and then we somehow associate a meaning with each signal. Second, we suffer from the poverty of the stimulus [35] – we learn language in light of remarkably little evidence. For example, there is no way any
² The languages used in the simulations we discuss are usually larger than the examples presented here.
Fig. 2. The hypothesis of agent 1, h1, represents a mapping between meanings and signals, Lh1. On the basis of some subset of this language, L′h1, the agent in the next generation induces a new hypothesis h2. This process of utterance observation, hypothesis induction, and production is repeated generation after generation.
human language can ever be externalised as a set of utterances. Languages are just too large; in fact, they are ostensibly infinite. This restriction on the degree of linguistic stimulus available during the language learning process we term the transmission bottleneck. This process is illustrated in Figure 2. Once an agent observes the set of utterances L′, it forms a hypothesis, h, for this observed language using a learning mechanism. In our experiments we draw on a number of machine learning approaches to achieve this task. On the basis of a body of linguistic evidence, a hypothesis is induced, after which the agent is considered an adult, capable of forming utterances of its own. Precisely how and when hypotheses are induced will depend on the details of the ILM in question. For example, either batch learning or incremental learning can be used. The notion of the ILM is sufficiently general to accommodate a wide variety of
learning models and algorithms. By interrogating the hypothesis, signals can be produced for any given meaning. Sometimes the agent will be called upon to produce a signal for a meaning it has never observed in conjunction with a signal, and it therefore might not be able to postulate a signal by any principled means. In this situation some form of invention is required. Invention is a last resort, and introduces randomness into the language. However, if structure is present in the language, there is the possibility of generalisation. In such a situation, the hypothesis induced could lead to an ability to produce signals for all meanings, without recourse to invention, even though not all the meaning/signal pairs have been observed. With a transmission bottleneck in place, a new dynamic is introduced into the ILM. Because learners are learning a mapping by only observing a subset of that mapping, through the process of invention they might make "mistakes" when asked to convey parts of that mapping to the next generation. This means that the mapping will change from generation to generation. In other words, the language evolves. How the language evolves, and the possibility and nature of steady states, are the principal objects of study within the ILM. We now consider these two questions.
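The generational cycle just described can be summarised schematically. The sketch below is our own outline of an ILM with a single agent per generation, under stated assumptions: induce, produce and invent stand in for whatever learning, production and invention mechanisms a particular ILM uses, and a production that fails by any principled means is signalled by returning None. It is not the authors' implementation.

```python
import random

def iterated_learning(meaning_space, initial_language, induce, produce, invent,
                      bottleneck=0.4, generations=1000):
    """Schematic ILM with one agent per generation (our own sketch).

    induce(observations)   -> hypothesis induced from meaning/signal pairs
    produce(hypothesis, m) -> signal for meaning m, or None if no principled
                              signal can be postulated
    invent(m)              -> a new (random) signal for meaning m
    """
    language = dict(initial_language)            # meaning -> signal
    for _ in range(generations):
        # Transmission bottleneck: the learner observes only a random subset
        # of the previous generation's language.
        sample_size = max(1, min(int(bottleneck * len(meaning_space)),
                                 len(language)))
        observed_meanings = random.sample(list(language), sample_size)
        observations = [(m, language[m]) for m in observed_meanings]

        hypothesis = induce(observations)        # infant becomes an adult

        # The new adult is prompted to express every meaning; invention is a
        # last resort that injects randomness into the language.
        new_language = {}
        for m in meaning_space:
            signal = produce(hypothesis, m)
            new_language[m] = signal if signal is not None else invent(m)
        language = new_language
    return language
```

How far the language drifts from generation to generation then depends entirely on the supplied learning bias and on the severity of the bottleneck, which is the point of the experiments that follow.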
3.2 The Evolution of Compositional Structure
Recall that, from an initially holistic language, we are interested in the evolution of compositional language. Specifically, we would like to know which parameters lead to the evolution of compositional structure. The parameters we consider in the discussion that follows are: 1. The severity of the transmission bottleneck, b (0 < b ≤ 1), which represents the proportion of the language utterable by the previous generation that is actually observed by the learner. The poverty of the stimulus corresponds to the situation when b < 1.0. It is worth noting that natural languages are infinitely large as a result of recursive structure. But in this experiment, we only consider compositional structure: the languages will be finite and therefore language coverage can be measured. Importantly, an ILM is in no way restricted to a treatment of finite languages. We later refer to work in which recursive structure is modelled. 2. The structure of the meaning space. Meanings are feature vectors of length F . Each feature can take one of V values. The space from which meanings are drawn can be varied from unstructured (scalar) entities (F = 1) to highly structured entities with multiple dimensions. 3. The learning and production bias present in each agent. The learning bias defines a probability distribution over hypotheses, given some observed data. The production bias defines, given a hypothesis and a meaning, a probability distribution over signals. To illustrate how compositional language can evolve from holistic language we present the results of two experiments. The first experiment is based on a
mathematical model identifying steady states in the ILM [3,2], and the second considers the dynamics of an ILM in which neural networks are used as a model of learning [38]. We refer the reader to these articles if they require a more detailed discussion. Compositional Structure is an Attractor in Language Space. Using a mathematical model we show that, under certain conditions, compositional language structure is a steady state in the ILM. In these experiments the processes of learning and generalisation are modelled using the Minimum Description Length Principle [36] with respect to a hypothesis space consisting of finite state transducers. These transducers map meanings to signals, and as a result of compression, can permit generalisation so that utterances can be produced for meanings which have never been observed. Primarily we are interested in steady states. A steady state corresponds to a language which repeatedly survives the transmission bottleneck: It is stable within the ILM. We can define language stability as the degree to which the hypotheses induced by subsequent agents agree on the mapping between meanings and signals. Starting from random languages, which contain no structure, the stability of the system will depend on the presence of a bottleneck. Without any bottleneck in place, all languages, structureless or not, will be stable because production is always consistent with observation. This property is in-line with the MDL principle, which requires that chosen hypotheses are always consistent with the observed data. Therefore, if the agent observes the whole language, then there is never any doubt when called to express a meaning, as the signal associated with that meaning has been observed. This is not the case when a bottleneck is in place. Starting from a position of randomness the appropriate signal for some meanings will be undefined, as they have not been observed. In this situation, the system will be unstable, as generalisation will not be possible from randomness. A stable language is one that can be compressed, and therefore pass through the transmission bottleneck. Compression can only occur when structure is present in a language, so compression can be thought of as exploiting structure to yield a smaller description of the data. This is why holistic language cannot fit through the bottleneck – it has no structure. Ultimately, we are interested in the degree of stability advantage conferred by compositional language over holistic language. Such a measure will reflect the probability of the system staying in a stable (compositional) region in language space. More formally, we define the expressivity, E of a language L as the number of meanings that the hypothesis induced on the basis of L, which we term h, can express without recourse to invention. Given a compositional language Lc , and a holistic language Lh , we use a mathematical model to calculate the expected expressivity of the transducer induced for each of these language types [2]. We denote these measures of expressivity Ec and Eh , respectively. These expressivity values tell us how likely the transducer is to be able to express an arbitrary meaning, and therefore, how
[Figure 3 surface plots (a)–(d): relative stability S as a function of the number of features F and values V, for bottleneck sizes b = 0.9, 0.5, 0.2 and 0.1.]
Fig. 3. The bottleneck size has a strong impact on the relative stability of compositionality, S. In (a), b = 0.9 and little advantage is conferred by compositionality. In (b)-(d) the bottleneck is tightened to 0.5, 0.2, and 0.1, respectively. The tighter the bottleneck, the more stability advantage compositionality offers. For low bottleneck sizes, a sweet spot exists where highly structured meanings lead to increased stability.
stable that language will be in the context of the ILM. Finally, the value we are really interested in is that of relative stability, S:

S = Ec / (Ec + Eh)
This tells us how much more stable compositional language is than holistic language. In short, the model relates relative stability, S, to the parameters b (severity of the communication bottleneck), F , and V (the structure of the meaning space). Figure 3(a)-(d) illustrates how these three variables interact. Each surface represents, for a different bottleneck value, how the meaning space structure impacts on the relative stability, S, of compositional language over holistic language. We now analyse these results from two perspectives. Tight Bottleneck. The most striking result depicted in Figure 3 is that for low bottleneck values, where the linguistic stimulus is minimal, there is a high stability payoff for compositional language. For large bottleneck values (0.9), compositionality offers a negligible advantage. This makes sense, as we noted above, because without a bottleneck in place all language types are equally stable. But
why exactly is compositional language so advantageous when a tight bottleneck is in place? When faced with a holistic language we cannot really talk of learning, but rather memorisation. Without any structure in the data, the best a learner can do is memorise: generalisation is not an option. For this reason, the expressivity of an agent faced with a holistic language is equal to the number of distinct utterances observed. Note that when agents are prompted to produce utterances, the meanings are drawn at random from the meaning space. A meaning can therefore be expressed more than once. Expressivity is precisely the number of distinct utterances observed. When there is structure in the language, expressivity is no longer a function of the number of utterances observed, but rather some faster-growing function, say f , of the number of distinct feature values observed, as these are the structural entities that generalisation exploits. Whenever a meaning is observed in conjunction with a signal, F feature values are contained in the observation. In such a situation, the observed meaning can be expressed, but the observation also helps to provide information relevant to expressing all meanings that contain the F observed feature values. As a result, expressivity, as a function of observations, will no longer be linear but will increase far more rapidly. The mathematical model we have developed proposes a function f on the basis of the MDL principle. Recall the parallel between the transmission bottleneck and the situation known as the poverty of the stimulus: all humans are placed in the situation where they have to learn a highly expressive language with relatively little linguistic stimulus. These results suggest that for compositionality to take hold the poverty of the stimulus is a requirement. Traditionally, poverty of stimulus, introduced in Section 3.1, is seen as evidence for innate linguistic knowledge. Because a language learner is faced with an impoverished body of linguistic evidence, innate language specific knowledge is one way of explaining how language is learned so reliably [7,33,35]. The results presented here suggest an alternative viewpoint: stimulus poverty introduces an adaptive pressure for structured, learnable languages. Structured Meaning Spaces. Certain meaning spaces lead to a higher stability payoff for compositionality. Consider one extreme, where there is one dimension (F = 1). Here, only one feature value is observed when one meaning is observed. Compositionality is not an option in such a situation, as there is no structure in the meaning space. When we have a highly structured meaning space, the payoff in compositionality decreases. This is because feature values are likely to co-occur infrequently as the meaning space becomes vast. Somewhere in between these extremes sits a point of maximum stability payoff for compositionality. An Agent-Based Model. The results presented above tell us something fundamental about the relation between expressivity and learning. The model, stripped bare, relates language expressivity to two different learning models by considering the combinatorics of entity observation. We compare two extremes of language structure: fully structured compositional languages and structureless
holistic languages. In this respect, the model is lacking because human language exhibits a mixture of both. Some utterances we use are holistic, some are compositional [45]. We also skirt round the question of dynamics. The model is an analysis of Lyapounov stable states: places in language space that, if we start near, we stay near [16]. We now briefly discuss a second experiment that addresses both these issues. In this experiment, the dynamics of language evolution are modelled explicitly using an agent-based simulation, rather than an agent-based mathematical model. Agents in this experiment are associative neural networks. This model is an extension of a model of simple learned vocabulary [39]. Using an associative network in conjunction with learning rules used to define when activations are strengthened and weakened in light of observations, the mapping between meanings and signals is coded using a meaning layer, two intermediate layers, and a signal layer. Languages exhibiting all degrees of compositionality, holistic to compositional, and all gradations in between, are learnable by this network [38]. The first generation of the ILM starts with a network consisting of weighted connections, all of which are initialised to zero. The network is then called to express meanings drawn from an environment which we define as some subset of the meaning space. One dimension of variation over environments is dense to sparse. This means that the set of possible meanings to be communicated are drawn from a large proportion of the space (dense) or a small proportion of the space (sparse). The second dimension of variation concerns structured and unstructured environments. A structured environment is one where the average inter-meaning Hamming distance is low, so that meanings in the environment are clustered. Unstructured environments have a high inter-meaning hamming distance. Once again, the bottleneck parameter, the proportion of the environment used as learning data, is varied. First, let us consider the case where no bottleneck is present — a hypothesis is chosen on the basis of a complete exposure to the language of the previous generation. Figure 4(a) depicts, for 1000 independent ILM runs, the frequency of the resultant (stable) languages as a function of compositionality. Compositionality is measured as the degree of correlation between the distance between pairs of meanings and distance between the corresponding pairs of signals. We see that few compositional languages evolve. Contrast this behaviour with Figure 4(b), where a bottleneck of 0.4 is imposed. Compositional languages are now by far the most frequent end-states of the ILM. The presence of a bottleneck makes compositional languages adaptive in the ILM. We also note that structured environments lead reliably to compositional language. This experiment, when considered in more detail, illustrates the role of clustering in the meaning space, and the impact of different network learning mechanisms [39]. But for the purposes of this discussion, the key illustration is that the bottleneck plays an important role in the evolution of compositional languages. In short, these results validate those of the previous section.
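The compositionality measure used here can be sketched as follows. This is our own illustration under stated assumptions (Hamming distance between meanings, Levenshtein distance between the possibly variable-length signals, and Pearson correlation); it is not the authors' code.

```python
from itertools import combinations

def hamming(m1, m2):
    """Distance between two meanings (feature vectors of equal length)."""
    return sum(a != b for a, b in zip(m1, m2))

def edit_distance(s1, s2):
    """Levenshtein distance between two (possibly variable-length) signals."""
    prev = list(range(len(s2) + 1))
    for i, a in enumerate(s1, 1):
        curr = [i]
        for j, b in enumerate(s2, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = curr
    return prev[-1]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def compositionality(language):
    """Correlation between meaning distances and signal distances over all pairs."""
    pairs = list(combinations(language.items(), 2))
    m_dists = [hamming(m1, m2) for (m1, _), (m2, _) in pairs]
    s_dists = [edit_distance(s1, s2) for (_, s1), (_, s2) in pairs]
    return pearson(m_dists, s_dists)
```

Applied to the example languages given earlier, such a measure should be close to 1 for the compositional language, while the holistic language should show little or no correlation.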
[Figure 4 histograms: relative frequency of resulting languages against compositionality c; panel (a) shows sparse/dense and structured/unstructured environments with no bottleneck, panel (b) shows dense structured and unstructured environments with b = 0.4.]
Fig. 4. In (a) we see how the lack of a bottleneck results in little pressure for compositional languages. In (b), where a bottleneck of 0.4 is imposed, compositional languages reliably evolve, especially when the environment is structured.
3.3 Using the ILM to Explain Language Structure
The learning bias and hypothesis space of each agent is taken to be innately specified. Each generation of the ILM results in the transfer of examples of language use only. In the absence of a bottleneck, compositionality offers little advantage, but as soon as a bottleneck is imposed, compositional language becomes an attractor in language space. So even though agents have an innate ability to learn and produce compositional language, it is the dynamics of transmission that result in compositionality occurring in the ILM. We must reject the idea that an innate ability to carry out some particular behaviour necessarily implies its occurrence. We aim to strengthen this claim, and refine it. Previous work investigating the ILM has shown that linguistic features such as recursive syntax [25], and regular/irregular forms [24] can also be framed in this context. The idea that we can map innate properties such as, for example,
the learning and generalisation process, the coding of environmental factors, and the fidelity of utterance creation directly onto properties of evolved languages is not wholly justifiable. This approach should be seen as building on Kirby’s analysis of language universals [23] in which issues such as, for example, constraints on representation and processing are shown to bring about functional pressures that restrict language variation. Here, we also note that the relationship between innate bias3 and universal features of language is not transparent, but concentrate on the constraints introduced by cultural transmission. These constraints result in certain linguistic forms being adaptive; we can think of language evolving such that it maximises its chances of survival. For a linguistic feature to persist in culture, it must adapt to the constraints imposed by transmission pressures. Compositionality is one example of an adaptive feature of language. If we want to set about explaining the characteristic structure of language, then an understanding of the biological machinery forms only part of the explanation. The details of these results, such as meaning space structure and the configuration of the environment, are not important in the argument that follows. Nevertheless, factors relating to the increase in semantic complexity have been cited as necessary for the evolution of syntactic language [37]. We believe that the scope of the ILM as a means to explain and shed light on language evolution is wider than we have suggested so far. To summarise, by taking compositionality as an example, we argue that its existence in all the world’s languages is due to the fact that compositional systems are learnable, generalisable, and therefore are adaptive in the context of human cultural transmission. This explanation cannot be arrived at when we see the individual as the sole source of explanation. Viewing individuals engaged in a cultural activity allows us to form explanations like these.
4 Underlying Principles
We began by considering explanations for the hallmarks of language. So far we have investigated an agent’s role in the context of cultural transmission. In this section we aim to tie up the discussion by making explicit a set of underlying principles. We start by noting that any conclusions we draw will be contingent on an innateness hypothesis: Principle 2 (Innateness hypothesis.) Humans must have a biologically determined predisposition to learn and produce language. The degree to which this capacity is language specific is not known. Here we are stating the obvious – the ability to process language must have a biological basis. However, the degree to which this basis is specific to language 3
³ Innate bias, in the experiments presented here, refers to both the representation bias introduced in our adoption of certain hypothesis spaces, and the hypothesis selection policy used to select hypotheses in light of observed data.
is unclear. We have no definitive answer to the question of innately specified features of language [35]. Next, we must consider the innateness hypothesis with respect to two positions. First, assuming the principle of detachment, the innateness hypothesis must lead us to believe that there is a clear relation between patterns we observe in language and some biological correlate. If we extend the vocabulary of explanation by rejecting the principle of detachment, then the question of innateness is less clear cut. We can now talk of a biological basis for a feature of language, but with respect to a cultural dynamic. Here, a cultural process will mediate between a biological basis and the occurrence of that feature in language. This discussion centres around recasting the question of innateness, and leads us to accept that situatedness plays a role.
Principle 3 (Situatedness hypothesis.) A thorough explanation of language competence would not amount to a total explanation of language structure. A thorough explanation of language competence in conjunction with an explanation of the trajectory of language adaptation would amount to a total explanation of language structure.
[Figure 5 diagrams: (a) the biological basis for language delimits a set of possible communication systems Cpossible; (b) cultural transmission additionally picks out the adaptive subset Cadaptive.]
Fig. 5. In (a), which assumes the principle of detachment, we can only make a claim about possible communication systems. In (b), assuming the situatedness hypothesis, an explanation accounts for the resulting communication systems which are adaptive over cultural transmission.
The degree of correlation between a biological basis and the observed language universal is hard to quantify. However, Figure 5 illustrates the general point. A biological basis will admit the possibility of some set of communication
systems Cpossible. A detached understanding of language can tell us little about which members of Cpossible will be adaptive and therefore observed. The situatedness hypothesis changes the state of play by considering which communication systems are adaptive, Cadaptive, on a cultural substrate. Rejecting the situatedness hypothesis must lead us to consider the issue of representation. The only way a thorough knowledge of language universals can be arrived at, while at the same time accepting the principle of detachment, is if universal features are somehow "represented" explicitly. How else could we understand a universal feature of language by understanding a piece of biological machinery? An acceptance of the situatedness hypothesis allows us to explain a feature of language in terms of a biological trait realised as a bias which, in combination with the adaptive properties of this bias over repeated cultural transmission, leads to that feature being observed. However, if one accepts cultural transmission as playing a pivotal role in determining language structure, then one must also consider the impact of other factors affecting adaptive properties. But as a first cut, we need to understand how much can be explained without resorting to any functional properties of language:
Principle 4 (Language function hypothesis.) Language structure can be explained independently of language function.
A defence of this hypothesis is less clear cut. However, the models we have discussed make no claims about, nor explicitly model, any notion of language function. Agents simply observe the result of generalisation. The fact that compositional structure results without a model of language function suggests that this is a fruitful line of enquiry to pursue. The treatment of language in discussions of embodied cognitive science often assumes that language function is salient [44], but we must initially assume it is not. The kinds of cognitive processes that we consider include issues such as memory limitations, learning bias, and choice of hypothesis space.
4.1 The Role of Modelling
In the previous section we examined the basis for explaining language universals. The claims we made are partly informed by modelling. Is this methodology valid? Many issues relating to language processing are not modelled. For example, those involved in the study of language acquisition will note that our learners are highly implausible: the language acquisition process is an immensely complex and incremental activity [13]. It must be stressed that our models of learning and generalisation should be seen as abstracting the learning process. We are interested in the justifiable kind of generalisations that can be made from data, not a plausible route detailing how these generalisations are arrived at. The output of a cognitively plausible model of learning is generalisation decisions, just as it is in our models. Rather than modelling the language acquisition process, we are modelling the result (or output) of the language acquisition process. We make no claims about the state of learners during the act of learning. We also
have not addressed the role of population dynamics. The models presented here represent a special case of the ILM, one where there is a single agent in each generation. It has been shown that structured languages can evolve in populations containing multiple agents at each generation, given a fairly limited set of population dynamics [20]. Extending these models to include a more realistic treatment of populations and population turnover is a current research project.
5 Conclusions
Cognitive science has traditionally restricted the object of study by examining cognitive agents as detached individuals. For some aspects of cognition this emphasis might be justifiable. But this assumption has become less appealing, and many have taken to the idea that notions of situatedness, embeddedness, and embodiment should be regarded as theoretically significant and should play an active role in any investigation of cognition. Our aim is to consider this claim by building multi-agent models, where agents are learners and producers of language. Specifically, we aim to investigate how multi-agent models can shed light on the problem of explaining the characteristic structure of language. When explaining universal features of language, the traditional standpoint, which we characterised in Principle 1, assumes that cultural context is not a theoretically significant consideration. We attempt to shed light on the question of how and where the universal features of language are specified. The approach we take is in line with the intuitions of embodied cognitive science. By examining the role of the cultural transmission of language over many generations, we show that certain features of language are adaptive: significant evolution of language structure can occur on the cultural substrate. Taking the example of compositionality in language, we illustrate this point using two models. The first model identifies compositionality as a Lyapounov stable attractor in language space when a transmission bottleneck is in place. The second model offers additional evidence by demonstrating that compositionality evolves from holistic language. The upshot of these two experiments is that cultural transmission in populations of agents endowed with a general ability to learn and generalise can lead to the spontaneous evolution of compositional syntax. Related work has shown that recursive syntax and regular/irregular forms are also adaptive in the context of cultural transmission [24]. The implications of this work lead us to reconsider how features of language should be explained. More precisely, the relationship between any innate (but not necessarily language-specific) basis for a language feature, and the resulting feature, is opaque. We place the discussion in the context of three principles that need to be considered when explaining features of language. First, Principle 2 lays down an innateness hypothesis, which states that language must have a biological basis. What form this biological basis takes is very much an open question. Secondly, we propose Principle 3, a situatedness hypothesis which makes explicit the claim that understanding the biological machinery behind language alone
is not enough to explain universal features of language: cultural dynamics are also determiners of linguistic structure. This claim constitutes the core of the argument. Principle 4 identifies a hypothesis relating to the relationship between language function and language structure. The idea that language function, such as issues of communicability, has an impact on language universals is unclear. By rejecting Principle 1 and pursuing a line of enquiry guided by Principles 2–4 we have shown that techniques from multi-agent modelling can provide important insights into some fundamental questions in linguistics and cognitive science. The work presented here should be seen as the first steps towards a more thorough explanation of the evolution of linguistic structure. We believe that multi-agent models will become an increasingly important tool in the study of language.
References
1. J. Batali. The negotiation and acquisition of recursive grammars as a result of competition among exemplars. In E. Briscoe, editor, Linguistic Evolution through Language Acquisition: Formal and Computational Models, pages 111–172. Cambridge University Press, Cambridge, 2002.
2. H. Brighton. Compositional syntax from cultural transmission. Artificial Life, 8(1):25–54, 2002.
3. H. Brighton and S. Kirby. The survival of the smallest: stability conditions for the cultural evolution of compositional language. In Jozef Kelemen and Petr Sosík, editors, Advances in Artificial Life: Proceedings of the 6th European Conference on Artificial Life, pages 592–601. Springer-Verlag, 2001.
4. R. A. Brooks. Cambrian Intelligence. MIT Press, Cambridge, MA, 1999.
5. S. Bullock and P. M. Todd. Made to measure: Ecological rationality in structured environments. Minds and Machines, 9(4):497–541, 1999.
6. D. Cheney and R. Seyfarth. How Monkeys See the World: Inside the Mind of Another Species. University of Chicago Press, Chicago, IL, 1990.
7. N. Chomsky. Aspects of the Theory of Syntax. MIT Press, Cambridge, MA, 1965.
8. N. Chomsky. Recent contributions to the theory of innate ideas. Synthese, 17:2–11, 1967.
9. N. Chomsky. Rules and Representations. Basil Blackwell, London, 1980.
10. N. Chomsky. Language and Thought. Moyer Bell, 1993.
11. W. J. Clancey. Situated Cognition. Cambridge University Press, Cambridge, 1997.
12. H. L. Dreyfus. What computers still can't do. MIT Press, Cambridge, MA, 2nd edition, 1972.
13. J. L. Elman. Learning and development in neural networks: The importance of starting small. Cognition, 48:71–99, 1993.
14. J. L. Elman, E. A. Bates, M. H. Johnson, A. Karmiloff-Smith, D. Parisi, and K. Plunkett. Rethinking Innateness: A Connectionist Perspective on Development. MIT Press, Cambridge, MA, 1996.
15. H. Gardner. The mind's new science. Basic Books, New York, 1985.
16. P. Glendinning. Stability, instability, and chaos: An introduction to the theory of nonlinear differential equations. Cambridge University Press, Cambridge, 1994.
17. M. D. Hauser. The Evolution of Communication. MIT Press, Cambridge, MA, 1996.
18. J. R. Hurford. Biological evolution of the Saussurean sign as a component of the language acquisition device. Lingua, 77:187–222, 1989.
19. J. R. Hurford. Nativist and functional explanations in language acquisition. In I. M. Roca, editor, Logical issues in language acquisition. Foris Publications, 1990.
20. J. R. Hurford. Social transmission favours linguistic generalization. In C. Knight, M. Studdert-Kennedy, and J. R. Hurford, editors, The Evolutionary Emergence of Language: Social Function and the Origins of Linguistic Form. Cambridge University Press, Cambridge, 2000.
21. R. Jackendoff. Foundations of Language. Oxford University Press, Oxford, 2002.
22. S. Jones, R. Martin, and D. Pilbeam, editors. The Cambridge Encyclopedia of Human Evolution. Cambridge University Press, Cambridge, 1992.
23. S. Kirby. Function, selection and innateness: the emergence of language universals. Oxford University Press, Oxford, 1999.
24. S. Kirby. Spontaneous evolution of linguistic structure: an iterated learning model of the emergence of regularity and irregularity. IEEE Transactions on Evolutionary Computation, 5(2):102–110, 2001.
25. S. Kirby. Learning, bottlenecks and the evolution of recursive syntax. In E. Briscoe, editor, Linguistic Evolution through Language Acquisition: Formal and Computational Models. Cambridge University Press, Cambridge, 2002.
26. S. Kirby. Natural language from artificial life. Artificial Life, 8:185–215, 2002.
27. P. Lieberman. The biology and evolution of language. Harvard University Press, Cambridge, MA, 1984.
28. D. Marr. Artificial intelligence: A personal view. Artificial Intelligence, 9:37–48, 1977.
29. D. Marr. Vision. Freeman, 1982.
30. J. Maynard Smith and E. Szathmáry. The major transitions in evolution. Oxford University Press, 1995.
31. R. Montague. Formal philosophy: Selected papers of Richard Montague. Yale University Press, New Haven, 1974.
32. R. Pfeifer and C. Scheier. Understanding Intelligence. MIT Press, Cambridge, MA, 1999.
33. S. Pinker. The Language Instinct. Penguin, 1994.
34. S. Pinker and P. Bloom. Natural language and natural selection. Behavioral and Brain Sciences, 13:707–784, 1990.
35. G. K. Pullum and B. C. Scholz. Empirical assessment of stimulus poverty arguments. The Linguistic Review, 19(1–2), 2002.
36. J. J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, 1989.
37. P. T. Schoenemann. Syntax as an emergent property of the evolution of semantic complexity. Minds and Machines, 9, 1999.
38. K. Smith. Compositionality from culture: the role of environment structure and learning bias. Technical report, Department of Theoretical and Applied Linguistics, The University of Edinburgh, 2002.
39. K. Smith. The cultural evolution of communication in a population of neural networks. Connection Science, 14(1):65–84, 2002.
40. L. Steels. Constructing and sharing perceptual distinctions. In M. van Someren and G. Widmer, editors, Proceedings of the European Conference on Machine Learning, Berlin, 1997. Springer-Verlag.
41. L. Steels. The origins of syntax in visually grounded robotic agents. Artificial Intelligence, 103(1–2):133–156, 1998.
42. K. von Frisch. Decoding the language of the bee. Science, 185:663–668, 1974.
43. W. K. Wilkins and J. Wakefield. Brain evolution and neurolinguistic preconditions. Behavioral and Brain Sciences, 18:161–226, 1995.
44. Terry Winograd and Fernando Flores. Understanding Computers and Cognition. Addison-Wesley, 1986.
45. A. Wray. Protolanguage as a holistic system for social interaction. Language and Communication, 18:47–67, 1998.
Adapting Populations of Agents Philippe De Wilde1 , Maria Chli1 , L. Correia2 , R. Ribeiro2 , P. Mariano2 , V. Abramov3 , and J. Goossenaerts3 1
Intelligent and Interactive Systems Group, Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2BT, United Kingdom, [email protected], http://www.ee.ic.ac.uk/philippe 2 Universidade Nova de Lisboa, 2825-115 Monte Caparica, Portugal 3 Eindhoven University of Technology, Eindhoven, The Netherlands
Abstract. We control a population of interacting software agents. The agents have a strategy, and receive a payoff for executing that strategy. Unsuccessful agents become extinct. We investigate the repercussions of maintaining a diversity of agents. There is often no economic rationale for this. If maintaining diversity is to be successful, i.e. without lowering too much the payoff for the non-endangered strategies, it has to go on forever, because the non-endangered strategies still get a good payoff, so that they continue to thrive, and continue to endanger the endangered strategies. This is not sustainable if the number of endangered ones is of the same order as the number of non-endangered ones. We also discuss niches, islands. Finally, we combine learning as adaptation of individual agents with learning via selection in a population.
1
Populations of Software Agents
In this paper we study a population of software agents [9] that interact with each other. By drawing an analogy between the evolution of software agents and evolution in nature, we are able to use replicator dynamics [14] as a model. Replicator dynamics, first developed to understand the evolution of animal populations, has recently been used in evolutionary game theory to analyze the dynamical behaviour of agents playing a game. Agents playing a game are a good model of software agents when the latter have to make decisions. In replicator dynamics, the mechanism of reproduction is linked to the success or utility of the agents in the interaction with other agents. We think that this process also occurs among software agents. This paper adopts this premise, and then goes on to investigate whether it pays off for a population to retain some unsuccessful strategies as an “insurance policy” against changes in the environment. Each agent is uniquely determined by its code, just as a living organism is determined by its genetic code. For agents, there is no distinction between phenotype and genotype. Consider n different types of agents. At time t, there are pi (t) agents with code i in the population. Just as an agent is determined by i, a population is determined at time t by pi (t), i = 1, . . . , n. E. Alonso et al. (Eds.): Adaptive Agents and MAS, LNAI 2636, pp. 110–124, 2003. c Springer-Verlag Berlin Heidelberg 2003
The frequency of agent i in the population is
$$x_i(t) = \frac{p_i(t)}{\sum_{i=1}^{n} p_i(t)}. \qquad (1)$$
Abbreviate \sum_{i=1}^{n} p_i(t) = p, where p is the total population. Denote the state of the population of agents by x(t) = (x_1(t), ..., x_n(t)). Now make the following basic assumptions, using terminology widely adopted in evolutionary game theory [14].
Assumption 1 (Game). If agents of type i interact with a population in state x, all agents of type i together receive a payoff u(e^i, x).
Assumption 2 (Replicator Dynamics). The rate of change of the number of agents of type i is proportional to the number of agents of type i and the total payoff they receive:
$$\dot{p}_i(t) = p_i(t)\, u(e^i, x(t)). \qquad (2)$$
The proportionality constant in (2) can be absorbed in u. These assumptions are discussed in the rest of this section. In assumption 1, the code i of an agent is identified with a pure strategy ei in a game. The notation should not distract the reader, i could have been used instead of ei . Identification of a code with a strategy is familiar from evolutionary genetics [12]. During replication of individuals in a population, information is transmitted via DNA. It is straightforward to identify the code of an agent with DNA. Assumption 1, introducing a payoff to software agents, is part of the modelling of software agents as economic agents. Economic principles have been used before in distributed problem solving [8], but in [6] the author has made a start with the analysis of general software agents as economic agents. This paper is part of that project. The replicator dynamics (2) describe asexual reproduction. Agents do sometimes result out of the combination of code from “parent” agents, but such innovative combinations do not occur very often. On a timescale of one year, the replication of agents will be far more important than the reproduction via combination of parent programs. In addition to DNA exchange, our species also passes information between individuals via cultural inheritance. This tends to result in behaviour that is a close copy to the behaviour of the “cultural” parent. If agents are to represent humans in an evolving society, they will also exhibit cultural inheritance or social learning, which follows assumption 2 [7]. In biological systems, one can distinguish long term macroevolution [13], and shorter term microevolution [12]. Assumption 2 can be situated in the field of microevolution. On an even shorter timescale, psychologists observe reinforcement learning. Although the results of reinforcement learning are not passed on to
offspring (central dogma of neo-Darwinism), it is possible to cast this learning as replicator dynamics [2]. This adds to the credibility of assumption 2, because software agents will often use reinforcement learning together with replication of code [5]. Biological organisms as well as software agents live in an environment. This is actually the same environment, because software agents act for humans, who live in the biological environment. In the model (2), the change of the environment will be reflected in the change of the payoff function u(ei , x), which has to be written u(ei , x(t), t) to make the time dependence explicit. It is very important to be able to model this change in environment, because a successful strategy or agent type should be as robust as possible against changes in environment. Another type of evolution that software agents have in common with biological agents is mutation. Strategies should be evolutionary stable if they are to survive mutations [12]. However, mutations can positively contribute to the evolution of a system. Current models tend to concentrate on stochastically perturbing the choice of strategies used [7,3], rather than the random creation of new strategies. Much work still needs to be done in this area.
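To make assumption 2 concrete, the following minimal Java sketch (not code from the paper; the Euler discretisation, the class name and the two-type example are assumptions of this illustration) integrates equation (2) for the payoffs later used in figure 1 and prints the resulting frequencies.

    // Sketch (not from the paper): Euler integration of the replicator
    // dynamics of assumption 2, p_i' = p_i * u(e^i, x), for two agent types.
    public class ReplicatorSketch {
        // Example payoff u(e^i, x): type 1 benefits from the presence of type 2,
        // type 2 receives a constant payoff (the payoffs used later in figure 1).
        static double payoff(int i, double[] x) {
            return (i == 0) ? 5.0 * x[1] : 0.5;
        }

        public static void main(String[] args) {
            double[] p = {1.0, 1.0};          // absolute numbers p_i(t) of each type
            double dt = 0.01;                 // Euler step, an assumption of this sketch
            for (int step = 0; step < 2000; step++) {
                double total = p[0] + p[1];
                double[] x = {p[0] / total, p[1] / total};   // frequencies, equation (1)
                double[] dp = new double[2];
                for (int i = 0; i < 2; i++) {
                    dp[i] = p[i] * payoff(i, x);             // equation (2)
                }
                for (int i = 0; i < 2; i++) {
                    p[i] += dt * dp[i];
                }
            }
            double total = p[0] + p[1];
            System.out.printf("x1 = %.3f, x2 = %.3f%n", p[0] / total, p[1] / total);
        }
    }

Keeping the absolute numbers p_i rather than the frequencies makes the link between (2) and the frequency dynamics used later explicit: the frequencies are simply recomputed from the p_i at every step.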
2
The Burden of Maintaining Diversity
Agents are pieces of software that act on behalf of humans. Software has a lifetime, and so do agents, and humans in a population. The human lifetime is not necessarily the biological lifetime, it may be the time that a human operates in a certain environment. Unsuccessful strategies will die out. This is an essential part of the model (2). Recently there has been much interest in biodiversity [11] and sustainable development [10]. As in all population dynamics, one can ask the question whether it is worth artificially maintaining strategies (or agent types) with a low payoff. The research on biodiversity suggests that this is worthwhile indeed. An agent type is a number i ∈ K = {1, ..., n}. The set K_e of endangered agent types is defined by K_e = {i ∈ K | u(e^i, x) < a}: they have a payoff lower than a. The set K_e will change in time, but the threshold a is fixed. To indicate that the payoffs have been changed, we will use q instead of p for the population, and y instead of x for the frequencies. Assume now that a is the minimum payoff required to sustain a viable population. It is now possible to redistribute the payoff between the non-endangered strategies outside K_e and the endangered ones in K_e by the following update:
$$u(e^i, x) \leftarrow a, \qquad i \in K_e,$$
$$u(e^i, x) \leftarrow u(e^i, x) - \frac{\sum_{j \in K_e} [a - u(e^j, x)]}{q - |K_e|}, \qquad i \notin K_e. \qquad (3)$$
This transformation conserves the total payoff \sum_{i \in K} u(e^i, x).
Abbreviate
$$b = \frac{\sum_{j \in K_e} [a - u(e^j, x)]}{q - |K_e|}. \qquad (4)$$
To derive the differential equations in the state variables y_i, start from (1),
$$q(t)\, y_i(t) = q_i(t), \qquad (5)$$
and take the time derivative of the left and right hand sides to obtain
$$q\, \dot{y}_i = \dot{q}_i - \dot{q}\, y_i. \qquad (6)$$
Using (3), we obtain, for i ∈ K_e,
$$q\, \dot{y}_i = q_i a - \sum_{j \in K} \dot{q}_j\, y_i = q_i a - \sum_{j \in K_e} q_j a\, y_i - \sum_{j \notin K_e} q_j [u(e^j, y) - b]\, y_i, \qquad (7)$$
or
$$\dot{y}_i = y_i \Big( a - \sum_{j \in K_e} y_j a - \sum_{j \notin K_e} y_j [u(e^j, y) - b] \Big). \qquad (8)$$
Similarly, for i ∉ K_e, we find
$$\dot{y}_i = y_i \Big\{ u(e^i, y) - b - \sum_{j \in K_e} y_j a - \sum_{j \notin K_e} y_j [u(e^j, y) - b] \Big\}. \qquad (9)$$
We can now simplify this using
$$-\sum_{j \in K_e} y_j a + \sum_{j \notin K_e} y_j b = -\frac{|K_e|}{q} a + \frac{q - |K_e|}{q} \cdot \frac{\sum_{l \in K_e} [a - u(e^l, x)]}{q - |K_e|} = -\frac{1}{q} \sum_{l \in K_e} u(e^l, y). \qquad (10)$$
This quantity will be much smaller than both a and u(e^i, y), i ∉ K_e, the payoff of the non-endangered strategies, if u(e^i, y) ≪ a, i ∈ K_e, or if |K_e| ≪ q. These plausible assumptions mean, in words, that the conservation value, a, of the payoff is much larger than the payoff of the endangered strategies, or that there are only a small number of endangered strategies. We will give practical examples in the next section.
Hence we can write the population dynamics with conservation of endangered strategies as
$$\dot{y}_i = y_i \Big[ a - \sum_{j \notin K_e} y_j u(e^j, y) \Big], \qquad i \in K_e,$$
$$\dot{y}_i = y_i \Big[ u(e^i, y) - b - \sum_{j \notin K_e} y_j u(e^j, y) \Big], \qquad i \notin K_e. \qquad (11)$$
Compare this to the population dynamics without conservation [14],
$$\dot{x}_i = x_i \Big[ u(e^i, x) - \sum_{j \in K} x_j u(e^j, x) \Big], \qquad i \in K. \qquad (12)$$
The term subtracted from the payoff for strategy i is the population average payoff
$$u(x, x) = \sum_{j \in K} x_j u(e^j, x). \qquad (13)$$
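As an illustration only (this is not the authors' code; the method and variable names are assumptions), the redistribution (3)–(4) that underlies the conserved dynamics (11) can be written as a small Java helper:

    // Sketch: payoff redistribution of equations (3)-(4). Endangered types
    // (payoff below the threshold a) are raised to a, and the subsidy b is
    // subtracted from every non-endangered type. q is the total population.
    public class ConservationSketch {
        static double[] conservePayoffs(double[] u, double a, double q) {
            int n = u.length;
            boolean[] endangered = new boolean[n];
            double shortfall = 0.0;                 // sum over K_e of [a - u(e^j, x)]
            int sizeKe = 0;
            for (int i = 0; i < n; i++) {
                if (u[i] < a) {                     // K_e = { i in K | u(e^i, x) < a }
                    endangered[i] = true;
                    shortfall += a - u[i];
                    sizeKe++;
                }
            }
            double b = shortfall / (q - sizeKe);    // equation (4)
            double[] v = new double[n];
            for (int i = 0; i < n; i++) {
                v[i] = endangered[i] ? a : u[i] - b;   // equation (3)
            }
            return v;
        }

        public static void main(String[] args) {
            // Example with the threshold a = 0.8 used in figure 2.
            double[] u = {5.0, 0.5, 0.2};
            System.out.println(java.util.Arrays.toString(conservePayoffs(u, 0.8, 3.0)));
        }
    }

In the call in main, q is set equal to the number of types purely to keep the example tiny; in the model q is the total number of agents.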
These effects of artificially increasing the utility for endangered species are illustrated in figures 1 and 2.
Fig. 1. The evolution of the proportion of two population types, x1 and x2 , for payoffs u(e1 , x) = 5x2 and u(e2 , x) = 0.5. This is a situation where the first type depends on the second type to survive. Both types survive, but x2 is low and can become accidentally extinct.
Fig. 2. The evolution of the proportion of two population types, y1 and y2, as in figure 1, but the payoffs have now been adjusted according to (3), with Ke = {2}, and a = 0.8. This implies u(e1, y) = 5y2 − 0.3 and u(e2, y) = 0.8. The proportion of type 2 is now higher, and not in danger of extinction. The price to pay is that the proportion of type 1 is now lower.

3
Niches and Islands

Comparing equations (11) and (12) can teach us a lot about the cost and implications of conservation. In the natural world, endangered strategies tend to become extinct, unless a niche can be found for them. A niche is an area in state space where there is little or no competition. We will say that there is a niche in the system if the equations (13) are uncoupled. In that case, the payoff does not depend on the whole state x anymore, but on a projection of x onto a subspace of the state space. The existence of a niche prevents strategies from going extinct because it imposes a particular structure on the payoff function u. For a fixed u, there is no particular advantage or disadvantage in the existence of a niche: the replicator dynamics go their way, and that is all. However, the environment can change, and this will induce a change in u, the function modeling the payoff or success of strategies. Certain feedback loops in the dynamics can now become active. Assume a system with two strategies that each operate in their niche:
$$\dot{x}_1 = x_1 [u(e^1, x_1, z) - x_1 u(e^1, x_1, z) - x_2 u(e^2, x_2, z)],$$
$$\dot{x}_2 = x_2 [u(e^2, x_2, z) - x_1 u(e^1, x_1, z) - x_2 u(e^2, x_2, z)]. \qquad (14)$$
Strategy 1 does not have to compete with strategy 2 because u(e1 , x1 , z) is independent of x2 . Similarly, Strategy 2 does not have to compete with strategy 1 because u(e2 , x2 , z) is independent of x1 . The frequencies x1 , x2 of the strategies remain non-zero. The frequencies of the other strategies x3 , . . . , xn are grouped in the partial state vector z. If the function u changes, the dynamics of the frequencies of strategies 1 and 2 will now in general be
$$\dot{x}_1 = x_1 [u(e^1, x_1, x_2, z) - x_1 u(e^1, x_1, x_2, z) - x_2 u(e^2, x_1, x_2, z)],$$
$$\dot{x}_2 = x_2 [u(e^2, x_1, x_2, z) - x_1 u(e^1, x_1, x_2, z) - x_2 u(e^2, x_1, x_2, z)]. \qquad (15)$$
It is now possible that there is a positive feedback that causes x_1 and x_2 to increase over time. This positive feedback is not guaranteed, but if one of the strategies had become extinct, then the positive feedback could never have occurred at all when u changed. Remark that such a positive feedback was already possible in (14), because both equations are coupled via the partial state vector z. We are not concerned with this process here. More complex mechanisms of the benefits of altruism have been studied in [1]. So far the pedestrian justification of conservation: once a strategy is extinct, you cannot benefit from it anymore in the future, if the environment changes. In mathematical terms, once the state space is reduced, many trajectories are excluded, and some could benefit your strategy if the environment changes. Since Darwin's time, it is known that local optima in a population distribution x can exist on islands. And recently, we have seen how ending the isolation of islands can destroy the local optimum by the invasion of more successful species. Should we isolate information systems and software agents, so that new types can develop? In that case the replicator dynamics (12) will be altered again. For the evolution on an island, it appears that all species not present on the island are extinct. Call K_r the set of strategies represented on the island. Then the population dynamics is
$$\dot{x}_i = x_i [u(e^i, r) - u(r, r)], \qquad i \in K_r, \qquad (16)$$
where r is the state vector with zeroes for the strategies not in K_r. When the isolation is ended, these zeroes become non-zero, and we obtain the dynamics
$$\dot{x}_i = x_i [u(e^i, x) - u(x, x)], \qquad i \in K_r \qquad (17)$$
for the strategies on the island. This is illustrated in figure 3. The dynamics (16) and (17) are just the general formulations for the two-strategy niche dynamics described by equations (14) and (15).
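A small Java sketch, again an illustration rather than the paper's code, shows how (16) and (17) can be simulated: the replicator step is applied to the state projected onto the island set K_r, and the missing type is reintroduced at a small frequency when isolation ends. The payoff matrix, the Euler step and the invasion frequency are arbitrary assumptions.

    public class IslandSketch {
        // Example payoff u(e^i, s) = sum_j A[i][j] * s_j with an assumed matrix A;
        // nothing about A comes from the paper.
        static final double[][] A = {{0.0, 2.0, 0.0}, {1.0, 0.0, 0.0}, {0.5, 0.5, 0.0}};

        static double payoff(int i, double[] s) {
            double u = 0.0;
            for (int j = 0; j < s.length; j++) u += A[i][j] * s[j];
            return u;
        }

        // One Euler step of x_i' = x_i [u(e^i, s) - u(s, s)], where s is the state
        // with zeroes for the types not present (the vector r of equation (16)).
        static void step(double[] x, boolean[] present, double dt) {
            double[] s = x.clone();
            for (int i = 0; i < s.length; i++) if (!present[i]) s[i] = 0.0;
            double avg = 0.0;
            for (int j = 0; j < s.length; j++) avg += s[j] * payoff(j, s);   // u(s, s)
            for (int i = 0; i < x.length; i++) {
                if (present[i]) x[i] += dt * x[i] * (payoff(i, s) - avg);
            }
        }

        static void normalise(double[] x) {
            double sum = 0.0;
            for (double v : x) sum += v;
            for (int i = 0; i < x.length; i++) x[i] /= sum;
        }

        public static void main(String[] args) {
            double[] x = {0.5, 0.5, 0.0};                         // type 3 absent on the island
            boolean[] island = {true, true, false};
            for (int t = 0; t < 500; t++) step(x, island, 0.01);  // equation (16)
            x[2] = 0.01;                                          // isolation ends: type 3 invades
            normalise(x);
            boolean[] all = {true, true, true};
            for (int t = 0; t < 500; t++) step(x, all, 0.01);     // equation (17)
            System.out.println(java.util.Arrays.toString(x));
        }
    }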
Fig. 3. The evolution of the proportion of two population types, x1 and x2, with payoffs u(e1, x) = x1 x2 and u(e2, x) = 0.1. At t = 5, the populations become separated, and the positive feedback maintaining type 1 disappears.

4
Learning by Individual Agents

The dynamics of software agents differs in another aspect from that of biological systems. Learned behaviour can be passed on to offspring. Agents can be duplicated, retaining all that was learned, e.g. via a neural network [5]. The replicator dynamics have to take learning into account. If learning is successful, the payoff for states encountered earlier will increase. If the learning also has a generalization capacity, as happens for neural networks [4], then the payoff for states similar to those encountered earlier will also increase. The payoff now changes explicitly with time, and (12) becomes
$$\dot{x}_i = x_i [u(e^i, x, t) - u(x, x, t)], \qquad i \in K. \qquad (18)$$
If all the payoffs u were simply multiplied by the same number during learning, the dynamics (18) would be equivalent to (12) in that the orbits remain the same, but they are traversed at a different speed (faster for a constant larger than one). When the payoffs are multiplied by a time-varying factor,
$$\dot{x}_i = x_i\, \alpha(t) [u(e^i, x, t) - u(x, x, t)], \qquad i \in K, \qquad (19)$$
the factor α(t) can be absorbed in the time derivative, and the orbits are traversed with a speed varying in time. When the learning factor, now described as α_i(t), becomes dependent on the strategy i, however, the orbits are changed, and we cannot compare the population evolution with and without learning any more. A non-trivial learning algorithm for a population of 2 different types is illustrated in figure 4.
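For illustration (not the authors' simulation), the following Java sketch integrates the time-dependent dynamics (18) with the payoffs quoted in the caption of figure 4, where type 1 learns, u(e1, x, t) = t^2 x1 / 3, and type 2 does not, u(e2, x, t) = x2. A strategy-dependent learning factor α_i(t) as discussed above would simply multiply the two derivative lines.

    public class LearningSketch {
        public static void main(String[] args) {
            // Frequency dynamics (18) with an explicitly time-dependent payoff
            // for type 1 and a static payoff for type 2 (see figure 4).
            double[] x = {0.5, 0.5};
            double dt = 0.001;                       // Euler step, an assumption of this sketch
            for (int step = 0; step < 20000; step++) {
                double t = step * dt;
                double[] u = {t * t * x[0] / 3.0, x[1]};   // u(e^i, x, t)
                double avg = x[0] * u[0] + x[1] * u[1];    // u(x, x, t)
                double d0 = x[0] * (u[0] - avg);           // equation (18)
                double d1 = x[1] * (u[1] - avg);
                x[0] += dt * d0;
                x[1] += dt * d1;
            }
            System.out.printf("x1 = %.3f, x2 = %.3f%n", x[0], x[1]);
        }
    }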
5
A Population of Traders
We present here a simulation of a population of traders, under realistic assumptions. We investigate what happens when the payoff (price) is perturbed. 5.1
Scenario
There is an array of N different resources in this model. There are M traders each one with its capital, its discounting percentage and its available resources.
Fig. 4. The evolution of the proportion of two population types, x1 and x2. Initially type 1 loses out, x1 goes down. However, type 1 adapts its utility in time as u(e1, x) = t^2 x1 / 3, and this allows it to win over type 2, that uses no learning, u(e2, x) = x2.
All traders are endowed with the same amount of wealth; the amount of each resource given to each trader is randomly calculated. A discounting percentage is used when the trader recalculates its prices. Its function is explained in more detail in the paragraph "Pricing of the Resources" below. Finally, the scenario involves tasks that are generated by agents. A task requires some resources and produces some others, loosely modelling generic economic activity.

At each clock tick every trader in its turn issues a task and advertises it to the other traders. Each task carries a random combination of required and produced resources. Every trader gives an offer for the task (provided that it possesses the required resources). If the issuer cannot pay for any offer then the task is not executed. Otherwise, it selects the cheapest offer, and the task is executed. The required resources are subtracted from the task executor's set of resources, the produced resources are added to the issuer's set of resources, and the issuer pays to the executor an amount of money equal to the price for executing the task.

The prices each trader sets for each resource are different. After the execution of a task all the traders that gave offers for the task recalculate their prices. Only the prices of the resources required for the task are altered on recalculation. There are three ways in which recalculation occurs (a sketch collecting them in one method is given at the end of this subsection):

1. The trader whose offer was accepted increases the prices of the required resources of the task as follows:
   resourcePrices[i] += this.resourcePrices[i] * discountingFactor;
2. The traders whose offers were not accepted decrease the prices of the required resources of the task as follows:
   resourcePrices[i] -= (1 - selectedPrice/myPrice) * this.resourcePrices[i] * discountingFactor;
3. In case no offer was accepted, all traders that gave offers for the task decrease the prices of the required resources of the task as follows:
   resourcePrices[i] -= (1 - maxPricePayable/myPrice) * discountingFactor * this.resourcePrices[i];

When a trader is sufficiently rich, i.e. its wealth exceeds a certain threshold, it generates a new trader to which it gives half its wealth and half of its resources. The new trader inherits its generator's discounting factor. When a trader's wealth goes below zero it is destroyed.

One could say that the system described above is stable when the prices do not rise or fall unexpectedly, or when they do not fluctuate outside some set limits. Also, we would perceive the system as being stable when the traders' number does not increase or decrease too much. Similarly, having zero traders, or all prices set to zero, are situations where the system is stable. However, we are not interested in these trivial cases, and we would prefer to avoid them.
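The sketch announced above collects the three recalculation rules in one method. Only resourcePrices, discountingFactor, myPrice, selectedPrice, maxPricePayable and the three quoted update expressions come from the text; the class structure, the Outcome enum and the example values in main are assumptions of this reconstruction.

    public class TraderSketch {
        double[] resourcePrices;
        double discountingFactor;

        enum Outcome { WON, LOST, NO_OFFER_ACCEPTED }

        TraderSketch(double[] initialPrices, double discountingFactor) {
            this.resourcePrices = initialPrices.clone();
            this.discountingFactor = discountingFactor;
        }

        // Recalculate the prices of the resources required by the task,
        // applying the rule that matches the outcome of the bidding round.
        void recalculate(int[] requiredResources, Outcome outcome,
                         double myPrice, double selectedPrice, double maxPricePayable) {
            for (int i : requiredResources) {
                switch (outcome) {
                    case WON:   // rule 1: the winning trader raises its prices
                        resourcePrices[i] += this.resourcePrices[i] * discountingFactor;
                        break;
                    case LOST:  // rule 2: losing bidders lower their prices
                        resourcePrices[i] -= (1 - selectedPrice / myPrice)
                                * this.resourcePrices[i] * discountingFactor;
                        break;
                    case NO_OFFER_ACCEPTED: // rule 3: no affordable offer
                        resourcePrices[i] -= (1 - maxPricePayable / myPrice)
                                * discountingFactor * this.resourcePrices[i];
                        break;
                }
            }
        }

        public static void main(String[] args) {
            TraderSketch t = new TraderSketch(new double[] {100.0, 100.0}, 0.05);
            t.recalculate(new int[] {0}, Outcome.LOST, 120.0, 90.0, 0.0);
            System.out.println(t.resourcePrices[0]);   // price of resource 0 after losing a bid
        }
    }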
5.2 Experiments
Having in mind the criteria of stability listed above, we now devise metrics of stability for the model. It is of interest to measure the proportion of traders that execute tasks during a time tick. It is also useful to know the prices of the resources in the course of time.

The graphs shown below are for an experiment with 500 traders and 10 different types of resources. The simulation was left running for 10 000 time ticks. Each trader is endowed with EURO 100 000 and a random amount of each of the 10 resources. The amount of each resource it gets is of the order of 1000 (calculated randomly). The prices of resources are initially of the order of EURO 100 (calculated randomly). A trader can generate a new trader if its wealth exceeds EURO 150 000, and it is destroyed if its wealth goes below EURO 0.

We only show the first 1000 ticks of the simulation in figure 5, as the rest are more or less similar. We can observe that the number of traders stabilizes after some time. The number of tasks each trader executes is also more or less stable, fluctuating around 0.20 of a task per trader during the course of one time tick. The prices of the resources seem to fluctuate evenly close to EURO 100. The few spikes we observe in this graph are due to traders who have a relatively big discounting factor and increase their prices.

We now try to inject a shock into the system. At time tick 350, the prices of all the resources of each trader are multiplied arbitrarily by 1000. We then allow the simulation to run until time tick 10 000 and see what happens. Again the graphs shown are up to tick 1000. The following graph, figure 6, is a 'zoom in' on the region of time when the shock is injected. We see the prices of five resources rising momentarily at t = 350. Then they go into a transient recovery phase, slowly converging to a stable state with prices slightly higher than before.
Fig. 5. Simulation Statistics: a. Number of Traders, b. Tasks per Trader, c. Prices of Resources R0–R4
Fig. 6. Prices of Resources R0-R4 (before and after the ‘shock’)
In another experiment we carried out, figures 7 and 8, a different shock was injected into the system. This time, at time tick 350 the number of traders was decreased by 30%. The system was left to run and here we show what happened up to tick 2000 as the rest is more or less similar.
Fig. 7. Number of Traders (before and after the ‘shock’)
Fig. 8. Tasks per Trader (before and after the 'shock')

6
Conclusions

All living things contain a code. So do computer programs, languages, designs and artwork. The code consists of all the information that makes replication possible. In a competitive environment, programs are pitched against each other,
in a way similar to individuals in an ecosystem. The interaction brings a payoff u to the program, or language, or design. The population dynamics with conservation (11) is crucially dependent on the conservation subsidy a per strategy, and on b, which depends on q, the total population, and |Ke|, the number of endangered strategies. Conservation maintains a greater pool of strategies than the ecosystem without conservation (12). This makes it possible that the fitness of any single non-endangered strategy could increase when the environment changes adversely for that strategy, via the mutual-benefit feedback loop with an endangered strategy. The price to pay for this is an overall decrease of the payoff values of the non-endangered strategies.

In the animal and plant kingdoms, the number of endangered species seems much smaller than the number of non-endangered ones [11], although there is great uncertainty on the numerical data. In this situation, equations (11)–(13) seem to indicate that it is possible to conserve the endangered species, if the effort is spread over all other species. However, replicator dynamics are not such good models of sexual reproduction and mutation, so that it is difficult to reach conclusions.

In the case of languages, artificial and computer, and information systems, the number of endangered types is of the same order as the number of non-endangered ones. In this case, (11)–(15) show that a conservation effort will decrease the payoff of the non-endangered types so much, and affect their dynamics to such an extent, that they also could become extinct if the environment changes. If conservation is successful, i.e. without lowering too much the payoff for the non-endangered types, it has to go on forever, because the non-endangered types still get a good payoff, so that they continue to thrive, and continue to endanger the endangered types. This is not sustainable if the number of endangered ones is of the same order as the number of non-endangered ones. In other words, one should not try to control the pure Darwinian evolution in a population of competing agents by artificially maintaining a diversity of agents. If the
number of endangered species is much smaller than the others, they will have little influence on the dynamics of the system, and whether the others sustain them or not will make little difference again.

We have illustrated the evolution of a population of agents using trading agents. We have also shown how robust this population was against perturbations of the payoff. In short, we have proposed replicator dynamics as a model for the evolution of populations of software agents. We have shown what happens if the utility of some types is increased (conservation), if some types of agents do not interact with each other (niches and islands), and if some types of agents change their utility in time (individual learning). In each of these three cases the adaptation of the population is artificially modified. It is up to the systems analyst to decide which situation applies in a practical case. Our replicator dynamics then allow us to predict what will happen to the different types of agents.

Acknowledgements. Partly funded by the European Community, under the Future and Emerging Technologies arm of the IST Programme, FET-UIE scheme, Ecology and Evolution of Interacting Infohabitants, IST-1999-10304.
References
1. Gary S. Becker. Altruism, egoism, and genetic fitness: Economics and sociobiology. In The Economic Approach to Human Behavior, pages 282–294. University of Chicago Press, Chicago, 1976.
2. T. Borgers and R. Sarin. Learning through reinforcement and replicator dynamics. Journal of Economic Theory, 77:1–14, 1997.
3. Andrea Cavagna, Juan P. Garrahan, Irene Giardina, and David Sherrington. Thermal model for adaptive competition in a market. Physical Review Letters, 83:4429–4432, 1999.
4. Philippe De Wilde. Neural Network Models, second expanded edition. Springer Verlag, London, 1997.
5. Philippe De Wilde. How soft games can be played. In H.-J. Zimmermann, editor, EUFIT '99, 7th European Congress on Intelligent Techniques & Soft Computing, pages FSD–6–12698, Aachen, September 1999. Verlag Mainz.
6. Philippe De Wilde, Hyacinth S. Nwana, and Lyndon C. Lee. Stability, fairness and scalability of multi-agent systems. International Journal of Knowledge-Based Intelligent Engineering Systems, 3(2):84–91, 1999.
7. Drew Fudenberg and David K. Levine. The Theory of Learning in Games. MIT Press, Cambridge, Massachusetts, 1998.
8. B. A. Huberman, editor. The Ecology of Computation. North-Holland, Amsterdam, 1988.
9. Nicholas R. Jennings. On agent-based software engineering. Artificial Intelligence, 117:277–296, 2000.
10. John Pezzey. Economic analysis of sustainable growth and sustainable development. World Bank, Washington DC, 1989.
11. Andy Purvis and Andy Hector. Getting the measure of biodiversity. Nature, 405:212–219, 2000.
12. John Maynard Smith. Evolutionary Genetics. Oxford University Press, Oxford, 1998.
13. S. M. Stanley. Macroevolution. W. H. Freeman, San Francisco, 1979.
14. Jörgen W. Weibull. Evolutionary Game Theory. MIT Press, Cambridge, Massachusetts, 1995.
The Evolution of Communication Systems by Adaptive Agents Luc Steels VUB AI Lab – Brussels Sony Computer Science Laboratory – Paris [email protected]
Abstract. The paper surveys some of the mechanisms that have been demonstrated to be relevant for evolving communication systems in software simulations or robotic experiments. In each case, precursors or parallels with work in the study of artificial life and adaptive behaviour are discussed.
1
Introduction
Almost since the beginning of research in Artificial Life and the Simulation of Adaptive Behaviour, there have been efforts to apply biological principles and the methodology of building artificial systems to understand the origins and evolution of communication systems with the complexity of natural languages, not only by abstract software simulations but also by experiments on situated embodied robotic agents operating in real world environments. In almost all these experiments, language is viewed as a complex adaptive system which emerges in a bottom-up fashion from local one-on-one interactions between situated embodied agents, and evolves and complexifies based on principles like cultural selection, structural coupling, and self-organisation. Rather than looking only at natural languages as they exist today, research in ‘artificial language evolution’ tries to evolve artificial languages with natural language-like properties – and thus explores the space of possible languages the same way artificial life explores the space of possible life forms [22]. Moreover the languages are not considered to be static. Attempts are made to have them evolve in ways that are similar to human language evolution. This paper surveys some of this research which is relevant in several ways to the general questions posed by biologically-inspired agents research: 1. Communication is obviously a very important feature of higher animals, particularly humans. Indeed it has been argued that it is through the increasing power and needs of communication that cognition has been bootstrapped to human level intelligence in the first place. The study of communication and its complexification therefore fits within the general biological study of the ‘major transitions in evolution’ [27]. E. Alonso et al. (Eds.): Adaptive Agents and MAS, LNAI 2636, pp. 125–140, 2003. c Springer-Verlag Berlin Heidelberg 2003
2. Research into the origins and evolution of language introduces a whole new approach towards the problem of getting autonomous robots to communicate in natural language with humans or each other. Instead of the traditional top-down approach used in most AI work on natural language processing, in which lexicon and grammar remain static, communication is seen as adaptive behaviour. The robot progressively acquires more complex forms of languagelike communication – similar to a child which progresses from babbling to prelinguistic communication, then to simple forms of lexical language and finally full-blown grammar. Lexicon and grammar continue to change and develop throughout life. The emphasis on embodiment and situatedness naturally resonates strongly with work on behaviour-based robotics, particularly more recent work which tries to establish attention sharing, turn-taking and emotional communication [4]. 3. As I will try to show in this paper, research on artificial language evolution has benefited greatly from adopting principles discovered in the study of the origins of biological complexity. But benefits could also flow the other way. Language is a very good domain to study how communication may self-organise and complexify. This topic is still only weakly understood in biology and is relevant to questions such as how do information codings arise in the brain, how do different organs within a body develop the necessary communication to coordinate their activity, or how has the genetic code evolved towards such great complexity. Research on artificial language evolution is not only of interest for the study of autonomous adaptive agents but is receiving increased attention from other scientific disciplines interested in the evolution of language as well [14], [56]. The objective of this paper is to survey some of the work done so far with an eye on finding some of the principles that have proven to be relevant for simulating some form of language emergence. This survey is necessarily very brief and inevitably biased to my own research. I do not pretend that all researchers in the area subscribe to these principles. Moreover the complete problem is far from solved. Much remains to be discovered. Nevertheless, I believe it is relevant to occasionally perform this kind of synthesis to expose the gaps in our understanding and begin to exploit application opportunities. Complementary surveys and examples of research can be found in [7], [5], [15].
2
Language Games
Much of the success of research in artificial language evolution has come from framing the problem in terms of games. Before the advent of artificial life, game theory was already widely used in theoretical biology to study aspects of genetic evolution and animal cooperation [25]. Indeed, some of the early successes in Artificial Life have come from adopting this framework as a basis for studying complex dynamics and the evolution of complexity. For example, Lindgren [24], Kaneko [17] and others studied the iterated prisoner's dilemma game and showed various evolutionary phenomena such as spatio-temporal chaos, co-evolution of
strategies, etc. Research in artificial language evolution started to take off when iterated games were adopted as a way to study emergence and evolution in language. One of the earliest examples is [28]. The mapping to language works as follows. Typically there is a population of agents (which can be static or dynamic). Two players (a speaker and a hearer) are randomly drawn from the population. The players engage in an interaction which is either a complete language game, i.e. a communication involving the real world, or an aspect of a language game, for example, only the exchange of sounds [9] or the exchange of a string together with what the string means [32], or well-formed sentences generated by an evolving grammar [13]. The players have their own private cognitive structures, like lexicons or grammars, which they use to play the game, just as they have their own strategy and memory in the case of iterated prisoner's dilemma games. There is an outcome of the game, for example, successful communication or successful imitation, and the behaviour of the players changes over time in such a way that success increases. The language phenomena are a side effect of repeated games. Language conventions are not put in from the start, and there is no central agency that controls how agents are supposed to act. Shared communication conventions must emerge from the distributed activity of the agents. Thus an experiment in artificial language evolution always has the following ingredients:

– A definition of an interaction protocol for the agents.
– A definition of the architecture of an agent (what cognitive structures are available, what input/output processing is done, how learning and language invention proceed).
– An environment, possibly a real world environment if the agents are robotic.
– A set of measures which show that the language phenomena one is interested in indeed arise, for example, success in communication, growing size of the lexical repertoire, similar sound systems as in natural languages, grammatical structures, etc.
3
Genes versus Viruses
An important difference can be seen between a group of researchers that view language as the result of genetic evolution (e.g. [31]) and those which emphasise cultural evolution [42] (with various researchers taking an in-between position as well as in the case of Baldwinian evolution [5]). This mirrors disputes in linguistics between researchers like Chomsky and Pinker who defend nativist positions versus researchers like Lindblom, MacNeilage and Studdert-Kennedy [23] or Tomasello [51] who defend the social and cultural construction, learning, and evolution of language. All of these variations can be explored within the language game framework and so it is possible to compare the strength and weakness of each approach [45].
Genetic evolution is modeled by adopting the same framework as used in research on genetic algorithms [20]. The population is divided into different generations. A particular generation plays a set of games and individuals receive a score on how well they are doing in the game. The cumulative score determines the fitness of an agent. Then there is the creation of a new generation based on the previous one. The probability of having offspring is based on prior success in the game, and the offspring inherits the linguistic knowledge of the parent(s), with possible mutations or recombinations. The computer simulations of [28] follow this framework. It has been clearly demonstrated by many simulations that this leads to the emergence of a shared set of conventions in a population. Mutation and recombination operators are the only way new structure can arise in the language. Language coherence arises because the same 'language genes' eventually spread in the total population.

In the alternative view, language conventions spread in a way similar to the propagation of bacteria or viruses. Language evolution is viewed as driven by cultural (memetic) evolution rather than genetic evolution. In this case, there is no division into different generations, although it is still possible that there is a population flow, with agents entering and leaving. There is no fitness associated with the agents. Agents do not inherit anything from parents. The notion of offspring does not exist. Instead each agent adjusts his linguistic behaviour after each game in order to be better in the next game. Adjustment could mean: change the score in memory between a wordform and its meaning, invent a new word for a specific meaning, add a new sound to the phonetic repertoire, invent or adopt a new grammatical rule, etc. The computer simulations of [40], [32] or [9] and the robotic experiments in [49] or [47] all follow this particular framework. Again it has been demonstrated beyond doubt that cultural evolution also leads to the emergence of a shared set of conventions in a population. Language coherence now arises through self-organisation in the sense of complex systems theory [29] (as explained below).

There have been experiments which use a mixed form. For example, the simulations reported by Kirby [18] are structured as in the genetic models. The population is divided into different generations and agents get a score which results in a fitness measure. But the language is learned by each generation from the previous one, as opposed to genetically inherited. In this learning process more structure (specifically, more abstract rules) is introduced by the agents. So language does not evolve through mutation and recombination but in a cultural fashion. On the other hand, language coherence is still partly influenced by inheritance relations because success in the game influences whether an agent will have offspring or not.
4
Grounding
There have been important advances in robotics lately, largely due to adopting a behaviour-based approach [46], [34]. It has become more and more realistic to build robust autonomous robots which interact in real time with a dynamically changing world. Behaviour-based robots use an architecture that couples sensing almost directly to actuating, de-emphasising complex internal symbolic representations. Moreover they include motivational and emotional parameters in deciding which action path to pursue. Together with rapid advances in mechanical and electronic engineering, the behaviour-based approach has led to complex mass-produced pet robots such as Sony's AIBO and is leading to a new generation of humanoid robots [19]. All these advances are a tremendous opportunity for research on artificial language evolution because it becomes possible to implement language games on such robotic platforms and thus investigate fully grounded situated verbal communication between autonomous robots. Several researchers have been trying to do this [55], [1], [48], [36]. There are two key issues to be solved, known as the grounding issue and the bootstrapping issue.

The grounding issue concerns the problem of relating the conceptualisations underlying a language utterance to the external world through a sensori-motor apparatus. Agents must implement the full semiotic cycle. That means the speaker must perform the necessary pattern recognition and sensory processing on captured images and/or other types of sensory data, conceptualise the scene by categorising objects and events, verbalise this conceptualisation, and transmit it to the hearer. The hearer must decode the utterance, and confront the interpretation with his own conceptualisation of the sensory image. The grounding problem is an active field of research at the moment [12], [8] but there does not appear to be a simple straightforward solution, in the sense of a component that could be added to make a non-grounded agent grounded in external reality. Instead, grounding is a matter of setting up tight couplings between the behaviours of the agent and his environment on the one hand and the internal representations that are used on the other. It is the result of a total integrated process, in which adequate pattern recognition and image processing provides the ground work and adaptive categorisation algorithms (based on weighted decisions, nearest neighbor computation, discrimination trees, etc.) play key roles. In the case of language games, there is an additional complexity, namely that the grounded representations constructed by the lower cognitive levels must be in tune with the language systems that verbalise or interpret these representations. Because both grounded representations and language are evolving systems, we need a way to coordinate them without a central coordinator or prior knowledge. I will argue below that the principle of structural coupling is relevant for this.
5
Linguistic Bootstrapping
The second issue for evolving grounded communication on embodied robots is how verbal communication itself can be bootstrapped. This is related to the general problem of the origins of communication which has also been studied in adaptive behaviour research [30]. Of course it is possible to pre-program the agents, in other words pre-program the game, but that would put into the robots
the processes we try to understand and explain. Instead we want to understand the process by which language gets bootstrapped, and empirical research shows that this is not an individualistic process. There is an important role for a 'mediator' that scaffolds the complexity, provides pragmatic feedback, and motivates learning [51]. As in the case of grounding, there is not a single magical trick to explain linguistic bootstrapping but many competences need to be integrated. Careful observations by developmental psychologists, following in the footsteps of Piaget and Bruner, have shown that 'learning how to mean' is a slow process which takes roughly 8 months starting from 6 months of age, and is estimated to involve as many as 50,000 interactions. The presence of a mediator is absolutely crucial. Verbal communication (initially with single words which only approximately sound like standard words) implies that (1) the speaker has an effect on the hearer (communicative effect), (2) the hearer interprets the speaker's behaviour as communication (communicative inference), and (3) the speaker intends her behaviour to be communicative (intentional communication). The observed developmental sequence is roughly as follows (see [11]):

1. Communicative effect: Infant acts (cries, kicks) => Caregiver reacts to these behaviours.
2. Communicative inference: Infant develops goal-directed behaviours (e.g. reach for toy while making sound) => Caregiver infers the intention and responds with appropriate behaviour. Caregiver also typically reinforces the sounds and corrects.
3. Intentional communication: Infant realises the power of communication and starts to use it deliberately. Communication includes vocalisation, eye contact, as well as gestures.
4. Upping the ante: The caregiver starts to require more precise vocalisations that resemble words used in the language.

Notice that the role of a caregiver as interpreter of behaviour is crucial, otherwise the infant cannot learn that vocalisations can have certain effects, and climb up the hill of more conventional and more complex language use. So far there have been no convincing simulations of this developmental sequence although preliminary efforts have been going on in this direction [48]. It is obvious that there are many preconditions which are extremely difficult to realise on autonomous robots and which co-develop at the same time as language communication bootstraps. They include: localising and recognising other human beings, eye contact and gaze following, producing vocalisations (babbling), emotion recognition and production through sound, gesture tracking and interpretation, sharing attention with others to specific objects or actions (which implies segmentation, template matching and tracking), realising that actions can have causal effects, realising that to achieve an effect, the action needs to be performed that causes this effect, realising that a vocalisation is equivalent to such an action, adjusting a vocalisation so that it comes closer to a vocalisation heard by the caregiver, etc. Each of these competences has been the object of intense investigation lately by AI researchers, mostly in the context of humanoid
robotics research. The work of Breazeal [4] on emotional attention sharing and turn taking, Scassellati [37] on face identification and tracking, Oudeyer [33] on babbling and emotion expression, are some examples in this direction. Only when all these components can be integrated in a single system can we begin to simulate human-like linguistic bootstrapping.
6
Self-Organisation
We now return to the collective level. One of the key questions in understanding how a communication system can arise is how there can be coherence in the group, in other words how distributed agents without a central authority and without prior specification can nevertheless arrive at sufficiently shared language conventions to make communication possible. The genetic evolution hypothesis of language evolution 'solves' this problem by considering that successful language genes spread in the population, so after some time everybody shares a copy of the same most successful gene. However genetic evolution is extremely unlikely for most aspects of language (definitely for the lexicon, and even for grammar – there seems to be too much variation between languages to encode much if anything genetically [54]). However an alternative solution is available that could explain how coherence can arise in a cultural fashion, namely through self-organisation.

The concept of self-organisation (narrowly defined) has its roots in research in the fifties and sixties on certain types of chemical reactions such as the Belousov-Zhabotinsky reaction [29]. It then became generalised to many different types of systems, not only physical but also biological [6] and even economic. Since the beginning of Artificial Life and Adaptive Behaviour research, simulations of the self-organisation of ant paths, bird flocks, slime molds, pattern formation in morphogenesis, etc. have been common, with applications to collective robotics [10]. Self-organisation occurs when there is a system of distributed elements which all have a random behaviour in the equilibrium state. The system is then brought out of equilibrium, usually by the supply of energy in the case of physical systems. A positive feedback loop becomes active, reinforcing local fluctuations into coherent global behaviour. In the well-studied case of ant societies [10], an ant hits a food source in random exploration, and then returns to the nest depositing a pheromone. This attracts other ants, which reinforce the chemical trail, attracting even more ants, etc. (the positive feedback effect). Progressively the whole group self-organises to a single path. When food is exhausted, no more pheromone is deposited and the chemical evaporates, returning the system to a random exploration (i.e. equilibrium) stage. Self-organisation in this sense has now been studied extensively from the viewpoint of dynamical systems theory and a large body of mathematical models and techniques exists to describe it.

Around 1995, it became clear that this mechanism could also be applied to language evolution. It was first shown for lexicon formation (see [40], [32]) but then generalised to other aspects of language, including phonetics [9]. The application for the lexicon works as follows. Suppose speakers invent new words
for the meanings which they do not know how to express and listeners store the words used by other agents. In this case, agents will develop words for all meanings and adopt them from each other. However the lexicon will be very large. Many different words will be in use for the same meaning. But suppose now that a positive feedback is introduced between use and success: Agents keep a score for each word-meaning pair in their lexicon. When a game is successful the score of the word-meaning pair that was used increases, and that of competing word-meaning pairs is decreased (lateral inhibition). When a game fails, the score of the word-meaning pair is diminished. In interpreting or producing language, agents use the word-meaning pairs with the highest score. These dynamics indeed give self-organisation towards a shared lexicon (figure 1). So it suffices to program the adaptive behaviour of individual agents in such a way that a positive feedback loop arises between use and success and self-organisation sets in.
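To make these score dynamics concrete, here is a minimal, self-contained Java naming game for a single meaning. It is an illustration under simplifying assumptions (one meaning, fixed score increments, a small invented-word generator), not the simulation behind figure 1.

    import java.util.*;

    public class NamingGameSketch {
        static final Random RNG = new Random(42);

        static class Agent {
            Map<String, Double> lexicon = new HashMap<>();

            String bestWord() {   // produce/interpret with the highest-scoring word
                return lexicon.entrySet().stream()
                        .max(Map.Entry.comparingByValue())
                        .map(Map.Entry::getKey).orElse(null);
            }

            void reward(String word) {
                lexicon.merge(word, 0.1, Double::sum);
                // lateral inhibition: competing words for the same meaning lose score
                for (String w : lexicon.keySet()) {
                    if (!w.equals(word)) lexicon.put(w, Math.max(0.0, lexicon.get(w) - 0.1));
                }
            }

            void punish(String word) {
                lexicon.merge(word, -0.1, Double::sum);
            }
        }

        static String inventWord() {   // random CVCV form, an arbitrary choice
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 2; i++)
                sb.append("bcdfglmnprstvwz".charAt(RNG.nextInt(15)))
                  .append("aeiou".charAt(RNG.nextInt(5)));
            return sb.toString();
        }

        public static void main(String[] args) {
            List<Agent> population = new ArrayList<>();
            for (int i = 0; i < 50; i++) population.add(new Agent());

            for (int game = 0; game < 20000; game++) {
                Agent speaker = population.get(RNG.nextInt(population.size()));
                Agent hearer = population.get(RNG.nextInt(population.size()));
                if (speaker == hearer) continue;

                String word = speaker.bestWord();
                if (word == null) {                     // invent a new word for the meaning
                    word = inventWord();
                    speaker.lexicon.put(word, 0.5);
                }
                if (hearer.lexicon.containsKey(word)) { // success: reward and inhibit competitors
                    speaker.reward(word);
                    hearer.reward(word);
                } else {                                // failure: hearer adopts, score drops
                    speaker.punish(word);
                    hearer.lexicon.put(word, 0.5);
                }
            }
            Map<String, Integer> usage = new TreeMap<>();
            for (Agent a : population) {
                String w = a.bestWord();
                if (w != null) usage.merge(w, 1, Integer::sum);
            }
            System.out.println(usage);
        }
    }

After enough games the printout typically shows a single word as the best-scored form of almost every agent, which is the winner-take-all effect visible in figure 1.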
[Figure 1: y-axis, word usage rate (0 to 1); x-axis, number of language games (0 to 90,000); the legend lists the competing word forms.]
Fig. 1. This graph plots the usage rate of all possible words for the same meaning in 100,000 iterated language games played by a group of over 1000 agents. Initially many words are competing until one dominates due to a winner-take-all effect.
The adoption of self-organisation is a nice example where a principle from biology (in fact complexity science in general) could first be demonstrated in artificial life simulations and then transported into ‘artificial language evolution’.
7 Structural Coupling (Co-evolution)
Another key problem for artificial language evolution is how the different levels of language, which each have their own developmental trajectory, can become coordinated with each other. For example, how can the meanings underlying language become coordinated with the lexicon? There are profound differences between languages as far as their conceptualisations are concerned [50]. For example, the conceptualisation of the position of the car in “the car is behind the tree” is the opposite in most African languages compared to Western languages. The front of the tree is viewed as being in the same direction as the face of the speaker and hence the car is conceptualised as in front of the tree as opposed to behind the tree. Examples like this are not hard to find and they suggest that different human cultures invent their own ways to conceptualise reality and propagate them through language, implying a strong causal influence of language on concept formation (the Sapir-Whorf thesis) [3]. The same problem arises for the coordination between phonetics/phonology and the lexicon. The sound system of a language evolves independently, but this change creates effects on other language levels. For example, the loss of the case system in Old English is generally attributed to phonetic effects which made the case-markers at the end of words more difficult to perceive. Grammaticalisation processes commonly observed in natural language evolution [52] show that there is a strong interaction as well between lexicon and grammar. Typically certain lexical words become recruited for syntactic functions; they progressively lose meaning, become shorter, and may even disappear altogether so that the cycle of grammaticalisation starts again. A principle from biology has once again turned out to be helpful to understand how the co-evolution between the different subsystems involved in language may be achieved. In the early nineteen seventies, Maturana introduced the concept of structural coupling and developed it further with Varela [26]: given two adaptive systems operating independently but having a history of recurrent interactions in the same shared environment, a ‘structural congruence’ may develop under certain circumstances, so that they become coordinated without a central coordinator. It is important that each adaptive system acts as a perturbation of the other, and, because they are adaptive, the perturbation leads to a structural change. Structural coupling has come out of attempts to understand certain biological phenomena, such as the development of multi-cellular systems or the coordination between organs. It is a system-level concept which has found application in areas ranging from physics to economics or social science. The concept is related to so-called coupled maps [17], which are dynamical systems, for example systems of oscillators, where one subsystem acts as a context for the other. The relevance of structural coupling to artificial language evolution also became clear around 1995, particularly in the context of coordination between conceptualisation and lexicon formation [41], [16]. Both systems have to be adaptive: conceptualisation requires a mechanism that can generate new categories driven by the need for communication; for example, new distinctions may have to be introduced in order to refer to objects within a particular context. Lexicon
formation is also adaptive because new words need to be invented or are being learned from others. Each system perturbs the other. The lexicon may push the conceptualisation system to develop new categories, or categories that are also employed by other agents. The conceptualisation system occasionally comes up with categories that have not been lexicalised yet, so it perturbs the lexical system to make structural changes as well. Both systems have a history of interactions, not only in single agents but also in a group of agents. If the right structural coupling is set up, it can be shown that not only lexicons but also the conceptual repertoires underlying these lexicons can self-organise and become coordinated. Figure 2 from [45] shows an example of this. In this experiment, the agents play language games about coloured stimuli (corresponding to the Munsell samples widely used in the anthropological literature). Given a set of samples, the hearer has to identify one of them based on a colour name provided by the speaker. The colour name assumes a certain categorisation of reality (for example green and blue colours) which the agents have to develop at the same time as they are developing from scratch a lexicon for naming these categories. Categorisation fails if the agent does not have a category in its repertoire that distinguishes the colour of the chosen sample from the other colours. For example, if there is a blue, a green and a red sample, and the blue one is chosen, then it will be necessary to have a colour category for blue which distinguishes blue from green and from red. In the experiment reported in [45] there is a structural coupling between the lexicon formation and concept formation processes, leading to progressive coherence of the categorial repertoires. If there is no such coupling and agents individually develop categories to distinguish samples, individual repertoires adequate for colour categorisation still develop but they are no longer similar. Figure 2 displays the evolution over time of category variance with (top graph) and without (bottom graph) structural coupling. The ratio between the two demonstrates how categorical similarity is drastically increased when there is a structural coupling.
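As a rough illustration of such a coupling, the sketch below couples a one-dimensional ‘colour’ discrimination game to naming dynamics; it is an assumption-laden simplification and not the model of [45]. Categories (prototype values) are created when communication demands it, and successful games pull the prototypes of speaker and hearer towards each other, while corrective feedback (the speaker pointing at the topic) realigns them after failure.

```python
# Illustrative structural coupling between category formation and naming on a 1-D scale.
import random

class Agent:
    def __init__(self):
        self.prototypes = {}              # word -> prototype value in [0, 1]

    def discriminate(self, topic, others):
        """Name the topic; invent a word anchored on the topic if no known word singles it out."""
        best = None
        for word, p in self.prototypes.items():
            if all(abs(p - topic) < abs(p - o) for o in others):
                if best is None or abs(p - topic) < abs(self.prototypes[best] - topic):
                    best = word
        if best is None:                  # perturbation: a new category driven by communication
            best = "w%06d" % random.randrange(10**6)
            self.prototypes[best] = topic
        return best

    def interpret(self, word, samples):
        if word not in self.prototypes:
            return None
        p = self.prototypes[word]
        return min(samples, key=lambda s: abs(s - p))

agents = [Agent() for _ in range(10)]
successes = []
for game in range(30000):
    speaker, hearer = random.sample(agents, 2)
    samples = [random.random() for _ in range(4)]
    topic = samples[0]
    word = speaker.discriminate(topic, samples[1:])
    guess = hearer.interpret(word, samples)
    if guess == topic:
        successes.append(1)
        # success reinforces the coupling: both prototypes drift towards the topic
        speaker.prototypes[word] += 0.05 * (topic - speaker.prototypes[word])
        hearer.prototypes[word] += 0.05 * (topic - hearer.prototypes[word])
    else:
        successes.append(0)
        hearer.prototypes[word] = topic   # corrective feedback: the speaker points at the topic
print("success rate (last 1000 games):", sum(successes[-1000:]) / 1000.0)
```

With the reinforcement step on success, the word-to-prototype maps of the agents align; removing it leaves each agent with adequate but idiosyncratic categories, which is the contrast shown in Figure 2.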
8 Theory of Mind
The previous sections discussed mostly research in the domain of lexicon and concept formation. The problem of grammar has turned out to be much more difficult to crack and there is no consensus yet on how it should be done. In a series of intriguing simulations, Kirby and coworkers [18], [2] showed that in iterated games where agents from one generation learn grammars from the output of the previous generation, agents will choose a compositional as opposed to a non-compositional language because this overcomes the learning bottleneck, i.e. the problem that agents have to learn a language from limited data. In this case, learners (i.e. children) play a crucial role in shaping the future of a language. This approach has been confirmed by theoretical results of Nowak et al. [31]. But there is an alternative view, namely that grammar arises to optimise communication [44]. Speakers try to increase the chance of being understood
Fig. 2. The graph displays the variance between the emerging category sets used by a population of agents playing iterated language games, with (top) and without (bottom) a structural coupling between lexicon formation and category formation. The ratio between the two is displayed as well.
correctly by making additional aspects of meaning explicit and by minimising the processing that needs to be done by the hearer (and by themselves). Of course the grammatical rules that speakers introduce must still be learnable – otherwise they would not propagate in the population. Moreover, in adopting rules used by others, a listener may overgeneralise, or may overinterpret certain formal characteristics of an utterance as carriers of meaning when they were not intended as such. This would also introduce additional structure and regularity as soon as the learner uses these rules in his own language production. Nevertheless, from this alternative perspective the creative force in language evolution rests primarily with language producers. Recent experiments [44] have shown examples of how all this might work. The first important step is to view natural language as an inferential coding system [39], which means that the sender assumes that the receiver is embedded in the same context and is intelligent enough to infer commonly known relevant facts about the current situation. The message is therefore incomplete and cannot be interpreted without the context. This contrasts with Shannon-like pure coding systems where the message carries all the meaning that the sender wants to transmit. Inferential coding systems can transmit much more information with fewer means; however, there is a risk of misunderstanding and a risk that the hearer has to do more work than he is willing to do to interpret the message.
This is why grammatical elements (as well as additional lexical elements) get introduced. In the experiment reported in [44], the speaker simulates the understanding of his own utterance as part of language production and detects potential difficulties. The experiments focus on case grammar, which centers around case markers that help to express the relations of objects with respect to events (as in ‘He gave her the ball’ versus ‘She gave him the ball’). It is possible to communicate without explicating these event-object relations, and often they can be inferred from the present situation and context. But most languages have developed grammatical tools to express event-object relations to minimise the risk that roles get confused. For example, English uses word order, German and Latin use case affixes, Japanese uses particles, etc. In the experiment, agents detect where ambiguity or uncertainty arises and repair it by introducing additional (case) markers. The hearer assumes that unknown elements of the utterance are meaningful and are intended to help in interpretation. When the hearer can construct an interpretation, this helps to figure out the possible meaning of the unknown elements. The main mechanism to simulate these processes is to introduce a subsystem to infer how the listener will interpret a sentence in a particular context, which amounts to a kind of ‘theory of mind’ of the other. The growing complexity of robots and the rise of humanoid robots make this more feasible, because these robots are much more situated and therefore have more of the information available that is relevant to sustain grounded communication [38]. Moreover the speaker can use himself as a model to predict how the other agent will react.
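A toy sketch of this re-entrance idea is given below; the mini-language, the ‘-ag’ marker and the animacy-based inference are invented for the example and have nothing to do with the grammar formalism actually used in [44].

```python
# Toy sketch of speaker re-entrance: the speaker simulates the hearer before talking and
# adds a case-like marker only when role assignment would stay ambiguous in context.

def produce(agent, patient, verb, markers=()):
    subject = agent + "-ag" if "agent" in markers else agent
    return [subject, verb, patient]

def interpret(utterance, context):
    """Hearer model: return every role assignment compatible with the utterance."""
    words = [w[:-3] if w.endswith("-ag") else w for w in utterance]
    marked = [w[:-3] for w in utterance if w.endswith("-ag")]
    verb, nouns = words[1], [words[0], words[2]]
    readings = []
    for agent in nouns:
        patient = [n for n in nouns if n != agent][0]
        if marked and agent not in marked:
            continue                                  # the marker pins down the agent role
        if context.get("only_animate_agents") and agent not in context["animate"]:
            continue                                  # inference from shared context
        readings.append((agent, verb, patient))
    return readings

context = {"only_animate_agents": True, "animate": {"jill", "jack"}}

def speak(agent, patient, verb):
    plain = produce(agent, patient, verb)
    if len(interpret(plain, context)) > 1:            # re-entrance detects residual ambiguity
        return produce(agent, patient, verb, markers=("agent",))
    return plain

print(speak("jill", "ball", "push"))   # only 'jill' can act, so no marker is needed
print(speak("jill", "jack", "push"))   # both animate: ['jill-ag', 'push', 'jack']
```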
9 Further Evolution
Language does not just self-organise once; it keeps evolving, sometimes very rapidly [21] – which is one of the reasons why it is implausible that language evolution is entirely genetic. Even without population change, and throughout the lifetime of an individual, new words are introduced, meanings of words shift, grammatical rules change, the phonetics undergoes change, etc. When human populations with mixed languages are put together, language can change rapidly and creoles may form which recruit elements from the source languages but re-invent many grammatical phenomena such as the expression of tense, aspect, mood, case systems, reflexivity, determiners, etc. Evolution requires variation and selection, and these can easily be mapped onto language evolution. As soon as there is a distributed set of agents which each evolve their own communication system, variation is inevitable. An individual’s language behaviour is affected by past developmental history: what environments were encountered, with which other agents most interactions took place, what random choices were made. Additional variation may come from the inevitable stochasticity in language communication: errors in the transmission or reception of the spoken signal, errors in pragmatic feedback, processing errors in parsing and production. There are many selective forces at work, ranging from
physiology (particularly important for constraining the kinds of speech signals that can be produced and the kinds of sensori-motor data that are available for conceptualisation), the environment, the ecology (which distinctions are important), cognitive constraints (memory, processing speed, learning speed), the dominating conventions adopted by the group, and the specific communicative tasks that need to be realised. A language system is never going to be optimal with respect to all these constraints. For example, sometimes parts of words are no longer pronounced to make the utterance shorter, but this may lead to a loss of information (such as case marking), which then gives rise to grammatical instability that needs to be resolved by the re-introduction of markers or by a shift to another kind of system [53].
10 Conclusions
This paper has presented a number of principles that are being explored by a growing group of researchers working on artificial language evolution. This field attempts to set up systems of autonomous distributed agents that self-organise communication systems which have properties comparable to human natural languages. The agents are either software agents or fully embodied and situated robotic agents. The relevance to adaptive behaviour research is twofold. On the one hand, the study of language evolution provides insight into a number of processes generating complexity in communication systems. These processes appear similar to mechanisms generating complexity in other areas of biology; self-organisation, structural coupling, level formation and cultural selection are examples. On the other hand, the study of how complex communication has evolved is giving new ways to implement open-ended grounded communication with autonomous robots, and to simulate the epigenetic development of cognition. Discussion of general principles is risky but at the same time necessary, because it is only at this level that bridges between fields, particularly between biology and evolutionary linguistics, can be established. Moreover I want to emphasise again that much remains to be discovered. The principles reported here are far from complete and need to be explored in many more case studies.
Acknowledgement. I am indebted to members of the Artificial Intelligence Laboratory in Brussels who have been involved in several experiments reported in this document, financed by the Belgian Government GOA grant, in particular, Tony Belpaeme (financed by the Belgian FWO) who implemented the colour naming and categorisation experiments. I am also indebted to members of the ‘origins of language’ group at the Sony Computer Science Laboratory in Paris, particularly Frederic Kaplan who implemented the language games on the AIBO, and Pierre-Yves Oudeyer who has focused on experiments in evolving phonetic systems.
References
1. Billard, A. and K. Dautenhahn (2000) Experiments in social robotics: grounding and use of communication in autonomous agents. Adaptive Behaviour 7(3/4).
2. Brighton, H. (2002) Compositional Syntax from Cultural Transmission. Artificial Life 8(1).
3. Bowerman, M. and S. Levinson (2001) Language Acquisition and Conceptual Development. Cambridge University Press, Cambridge.
4. Breazeal, C. (1998) A Motivational System for Regulating Human-Robot Interaction. In: Proceedings of AAAI-98, Madison, WI.
5. Briscoe, E.J. (ed.) (2002) Linguistic Evolution Through Language Acquisition: Formal and Computational Models. Cambridge University Press, Cambridge.
6. Camazine, S., J.-L. Deneubourg, N. Franks, J. Sneyd, G. Theraulaz and E. Bonabeau (2001) Self-Organization in Biological Systems. Princeton University Press, Princeton.
7. Cangelosi, A. and D. Parisi (eds.) (2001) Simulating the Evolution of Language. Springer-Verlag, Berlin.
8. Cohen, P., et al. (2001) Proceedings of the AAAI Spring Symposium on Grounding. AAAI Press, Anaheim, CA.
9. De Boer, B. (1997) Self-Organisation in Vowel Systems through Imitation. In: Husbands, P. and I. Harvey (eds.) Proceedings of the Fourth European Conference on Artificial Life. Cambridge, MA.
10. Deneubourg, J.-L., et al. (1993) Self-organisation and life: from simple rules to global complexity. In: Proceedings of the Second European Conference on Artificial Life. ULB, Brussels.
11. Harding, C.G. (1983) Setting the Stage for Language Acquisition: Communication Development in the First Year. In: Golinkoff, R. (ed.) The Transition from Prelinguistic to Linguistic Communication. Lawrence Erlbaum Associates, Hillsdale, NJ. pp. 93–113.
12. Harnad, S. (1990) The Symbol Grounding Problem. Physica D 42: 335–346.
13. Hashimoto, T. and T. Ikegami (1996) Emergence of net-grammar in communicating agents. BioSystems 38: 1–14.
14. Hurford, J., C. Knight and M. Studdert-Kennedy (1998) Approaches to the Evolution of Language: Social and Cognitive Bases. Cambridge University Press, Cambridge. pp. 405–426.
15. Hurford, J. (2001) Expression/Induction Models of Language Evolution: Dimensions and Issues. In: Briscoe, T. (ed.) Linguistic Evolution through Language Acquisition: Formal and Computational Models. Cambridge University Press, Cambridge.
16. Ikegami, T. and M. Taiji (1999) Imitation and Cooperation in Coupled Dynamical Recognizers. In: Floreano, D., J. Nicoud and F. Mondada (eds.) Advances in Artificial Life (ECAL 99). Lecture Notes in Computer Science, Springer-Verlag, Berlin.
17. Kaneko, K. and J. Suzuki (1994) Evolution to the edge of chaos in imitation games. In: Artificial Life III. The MIT Press, Cambridge, MA. pp. 43–54.
18. Kirby, S. (1999) Function, Selection and Innateness: The Emergence of Language Universals. Oxford University Press, Oxford.
19. Kuroki, Y., T. Ishida, J. Yamaguchi, M. Fujita and T. Doi (2001) A Small Biped Entertainment Robot. In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots. Waseda University, Tokyo. pp. 181–186.
20. Koza, J. (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. The MIT Press, Cambridge, MA.
21. Labov, W. (1994) Principles of Linguistic Change. Volume 1: Internal Factors. Blackwell, Oxford.
22. Langton, C. (ed.) (1989) Artificial Life. Addison-Wesley, Reading, MA.
23. Lindblom, B., P. MacNeilage and M. Studdert-Kennedy (1984) Self-organizing processes and the explanation of language universals. In: Butterworth, B., et al. Explanations for Language Universals. Walter de Gruyter and Co. pp. 181–203.
24. Lindgren, K. and M. Nordahl (1994) Cooperation and Community Structure in Artificial Ecosystems. Artificial Life Journal 1(1).
25. Maynard Smith, J. (1982) Evolution and the Theory of Games. Cambridge University Press, Cambridge.
26. Maturana, H. and F. Varela (1998) The Tree of Knowledge (revised edition). Shambhala Press, Boston.
27. Maynard Smith, J. and E. Szathmáry (1995) The Major Transitions in Evolution. Oxford University Press, Oxford.
28. MacLennan, B. (1992) Synthetic Ethology: An Approach to the Study of Communication. In: Langton, C., et al. (eds.) Artificial Life II. Addison-Wesley, Redwood City, CA. pp. 603–631.
29. Nicolis, G. and I. Prigogine (1988) Exploring Complexity. Freeman and Co., New York.
30. Noble, J. (1999) Cooperation, conflict and the evolution of communication. Adaptive Behaviour 7(3/4): 349–370.
31. Nowak, M., J. Plotkin and J. Vincent (2000) The evolution of syntactic communication. Nature 404 (30/03/2000): 495–498.
32. Oliphant, M. (1997) Formal Approaches to Innate and Learned Communication: Laying the Foundation for Language. Ph.D. Thesis, University of California San Diego, Cognitive Science Department.
33. Oudeyer, P.-Y. (2001) Emotional Interactions with Humanoids Using Speech. In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots. Waseda University, Tokyo. pp. 17–24.
34. Pfeifer, R. and C. Scheier (2000) Understanding Intelligence. The MIT Press, Cambridge, MA.
35. Pinker, S. (1994) The Language Instinct: The New Science of Language and Mind. Penguin, Harmondsworth.
36. Roy, D. (2001) Learning Visually Grounded Words and Syntax of Natural Spoken Language. Evolution of Communication 4(1).
37. Scassellati, B. (1998) Eye Finding via Face Detection for a Foveated, Active Vision System. In: Proceedings of AAAI-98. AAAI Press.
38. Scassellati, B. (2002) Foundations for a Theory of Mind for a Humanoid Robot. Ph.D. Thesis, MIT.
39. Sperber, D. and D. Wilson (1986) Relevance: Communication and Cognition. Harvard University Press, Cambridge, MA.
40. Steels, L. (1996) Self-Organising Vocabularies. In: Langton, C. (ed.) Proceedings of Alife V. The MIT Press, Cambridge, MA.
41. Steels, L. (1997a) Constructing and Sharing Perceptual Distinctions. In: van Someren, M. and G. Widmer (eds.) Proceedings of the European Conference on Machine Learning. Springer-Verlag, Berlin. Steels, L. (1998) The origins of syntax in visually grounded agents. Artificial Intelligence 103: 1–24.
42. Steels, L. (2001) Language Games for Autonomous Robots. IEEE Intelligent Systems, September/October 2001, pp. 16–22.
43. Steels, L. (2001) Social Learning and Verbal Communication with Humanoid Robots. In: Proceedings of the IEEE-RAS International Conference on Humanoid Robots. Waseda University, Tokyo. pp. 335–342.
44. Steels, L. (2002) Computer simulations of the origins of case grammar. Fourth Evolution of Language Conference, Harvard, Cambridge, MA.
45. Steels, L. and T. Belpaeme (2003) Computer Simulations of Colour Categorisation and Colour Naming. Submitted to BBS.
46. Steels, L. and R. Brooks (eds.) (1994) The Artificial Life Route to Artificial Intelligence: Building Situated Embodied Agents. Lawrence Erlbaum, Hillsdale, NJ.
47. Steels, L., F. Kaplan, A. McIntyre and J. Van Looveren (2001) Crucial factors in the origins of word-meaning. In: Wray, A. (ed.) The Transition to Language. Oxford University Press, Oxford, 2002.
48. Steels, L. and F. Kaplan (2001) AIBO’s first words: The social learning of language and meaning. Evolution of Communication 4(1).
49. Steels, L. and P. Vogt (1997) Grounding Adaptive Language Games in Robotic Agents. In: Harvey, I., et al. (eds.) Proceedings of ECAL 97, Brighton, UK, July 1997. The MIT Press, Cambridge, MA.
50. Talmy, L. (2000) Toward a Cognitive Semantics: Concept Structuring Systems (Language, Speech, and Communication). The MIT Press, Cambridge, MA.
51. Tomasello, M. (1999) The Cultural Origins of Human Cognition. Harvard University Press.
52. Traugott, E. and B. Heine (1991) Approaches to Grammaticalization. Volumes I and II. John Benjamins Publishing Company, Amsterdam.
53. Van Kemenade, A. (1987) Syntactic Case and Morphological Case in the History of English. Foris Publications, Dordrecht.
54. Vogel, P.M. and B. Comrie (eds.) (2000) Approaches to the Typology of Word Classes (Empirical Approaches to Language Typology, 23). Cambridge University Press, Cambridge.
55. Vogt, P. (2001) Bootstrapping grounded symbols in minimal autonomous robots. Evolution of Communication 4(1).
56. Wray, A., et al. (eds.) (2002) The Transition to Language. Oxford University Press, Oxford.
An Agent Architecture to Design Self-Organizing Collectives: Principles and Application
Gauthier Picard and Marie-Pierre Gleizes
IRIT, Université Paul Sabatier, 118, Route de Narbonne, 31062 TOULOUSE Cedex, France
{picard, gleizes}@irit.fr
http://www.irit.fr/SMAC
Abstract. Designing teams which have a task to execute in a very dynamic environment is a complex problem. Determining the relevant organization of these teams by using group or role notions might be very difficult and even impossible for human analysts. Even when an organization can be found or approximated, it remains complicated to design entities, or agents in our case, that take into account, at the conception and design phases, all possible situations an agent could face. Emergent and self-organizing approaches to modelling adaptive multi-agent systems avoid these difficulties. In this paper, we propose a new approach to designing Adaptive Multi-Agent Systems with emergent functionality, which enables us to focus on the design of the agents that compose the system. In fact, self-organization of the system is led by the environmental feedback that each agent perceives. Interactions and organization evolve, providing the system with an adequate function, which fits its environment as well. Such functions have enough properties to be considered as emergent phenomena. First, we briefly present the Adaptive Multi-Agent Systems theory (AMAS) and our view of self-organization. In the second part, a multi-level architecture is proposed to model agents and to consider groups of agents as self-organizing teams. In the third part, we describe a sample robot group behaviour, the setting up of traffic in a constrained environment. Our architecture allows the emergence of a coherent collective behaviour: the dedication of corridors to specific directions. Finally, we show what is emergent through the analysis of results arising from measurements of collective phenomena.
1 Introduction
Nowadays, Multi-Agent Systems (MAS) tackle complex and non-linear problem solving that classic systems do not resolve efficiently. These problems, such as flood forecast [17], on-line brokerage [2] or equation system solving, have been the SMAC (Cooperative Multi-Agent Systems) team’s leitmotiv in the elaboration of new models based on self-organization. Robotics meets the same complexity and non-linearity. The difficulty of understanding or designing teams for spatial settling, dangerous-area minesweeping or resource transportation becomes too high to be tackled by single robots.
That is how Collective Robotics was born, inspired by social insect communities like bees or ants [11]. High-level problem resolution by low-level entity interactions and the appearance of new functionalities within groups are also two motivations of the Adaptive Multi-Agent Systems theory (AMAS). It takes part in a movement, named Emergence, which has motivated many researchers around the world. Many groups have been formed to study this ill-known concept, which appeared as early as Antiquity as [1] emphasize. Emergence was mentioned early in Computer Science, notably by [10], and is linked to Complexity Theory, which tends to demystify this concept. [8] proposes a set of properties for emergent systems such as radical novelty, coherence, macro and micro levels, and dynamical and ostensive phenomena. These characterizations lead to new methodologies for designing systems where the macro-level algorithm (the task of the system or the global function) does not appear at the micro-level (the agents’ level). In the Collective Robotics domain, [13] propose such a methodology, named CIRTA. Our work takes place at this crossroads between MAS, Robotics and Emergence. The MAS we are working on are self-organizing adaptive systems consisting of several autonomous agents situated in a common environment. Moreover, these agents participate in a coherent collective activity, a common task. The particularity is that we do not code the global function of the system within any agent. Thanks to the agents’ self-organization ability, the system can adapt itself and the global function can emerge. The function realized by the system evolves when the organization of the agents in the system changes. We would like to show the relevance of Adaptive Multi-Agent Systems applied to Collective Robotics. In the first place, we present the AMAS theory on which all developments are based. Then we describe our architecture. It is a generic model for our agents, guided by the wish to apply cooperative interactions and self-organization. After that, a sample application is presented, concerning the study of the traffic of numerous robots in a constrained environment composed of halls and narrow corridors, as in Vaughan et al. [18,19]. The task is resource transportation between two halls. The question of learning and memorization will be raised, and we distinguish several modules for different means of learning. Our system has been simulated and results have been obtained and analysed. Finally, we conclude on the scope of our study and the perspectives of our work.
2 A Brief Overview of the AMAS Theory
In the AMAS theory, we propose to consider the equivalence between the organization of the team and the global function obtained by the set of interactions between the low-level entities. To guarantee the emergent property of the global functionality, the team has to be self-organizing.
2.1 Motivations
Several applications require the development of software that is characterized by an incomplete specification phase, for the following reasons:
• the system has to evolve in a dynamical environment and it is impossible to totally specify all the situations the system may encounter;
• the system is open;
• the system is complex;
• there is no known algorithmic solution to resolve the problem;
• the internal organization of the system is a priori unknown.
The unexpected is inherent to the life of such systems. Self-organization, which corresponds to an autonomously decided change, becomes a way to overcome possible perturbations of the environment [14]. It is also an approach to implementing adaptive systems. In our systems, the organization is treated as a result and not as a characteristic of the system to be specified.
2.2 Definition and Characteristics
The AMAS theory is based on self-organization by cooperation. In this theory:
Definition 1. We call adaptive multi-agent system a multi-agent system which is able to change its behaviour to adjust itself to its dynamical environment, either to carry out the task it is intended to complete or to improve its functionality or performance.
An adaptive multi-agent system is characterized by the following points:
• the system is plunged in a dynamical environment;
• the system implements a function;
• the system is composed of interacting autonomous agents;
• each agent of the system implements a partial function;
• the organization of the system determines the result of the system.
Learning, from the system’s point of view, consists of transforming its current function to adapt itself to its environment, i.e. changing its internal organization. So, learning enables the system to have a relevant activity in the environment in which it is located: this is the definition of functional adequacy. The relevance of the activity can only be judged by an external observer who appreciates the interactions and who knows the function the system has to carry out in its environment. Therefore, the question is when and how the system can transform itself to reach functional adequacy. The AMAS theory [7] says:
Theorem 1. For all functionally adequate systems, there is at least one cooperative system which realizes an equivalent function in the same environment.
Theorem 1 is important because it guides the design of an adaptive multi-agent system. A first step lies in the identification of the agents and then in ensuring
that each agent is or tends to be in cooperative interaction with the other agents. This method ensures functional adequacy according to the theory. In the AMAS theory, an agent is generally supplied with skills, communication and interaction capacities (with other agents and/or the environment), beliefs and knowledge of the other agents in the environment, aptitudes which enable it to reason, and a cooperation-based social attitude. The behaviour of each agent is specified so that it tries to reach its objective(s) and to keep cooperative interactions with the other agents. Before any action, an agent locally examines whether it is in cooperative interaction or not; in fact, it detects whether it is in a non-cooperative situation. If the agent is in such a non-cooperative situation, it tries to escape from it and return to a cooperative one. Cooperation is the social attitude that locally guides each agent in selecting its behaviour: the agent judges locally whether it is cooperative. This is what we call a cooperation-based social attitude. Therefore, an agent has two essential roles: the first is to implement its partial function; the second is to act on the internal organization of the system. If an agent detects a non-cooperative situation, it has to return to a cooperative situation so that the system returns to a functionally adequate organization.
2.3 Non-cooperative Situations (NCS)
Agents which are designed using the AMAS theory [7] and the associated methodology ADELFE [4] have to respond to unexpected events. After identifying the agents, according to the AMAS theory, designers have to give an autonomous agent the means to decide how to change its interactions with the other agents. As we said previously, a change in the organization of the agents changes the function implemented by the whole system. The means to self-organize is local to the agent: it consists in the ability to detect and to remove (if the agent can) all non-cooperative interactions and to perform cooperative actions whenever possible. There are three categories of non-cooperative interactions:
• Misunderstanding: when a signal received from the environment cannot be understood without ambiguity;
• Incompetence: when information (an interpreted signal) cannot lead to any new logical consequence;
• Uselessness: when the concluded results are useless for the environment (and the other agents).
We can observe that an agent can locally judge the first two situations by using knowledge it has about itself. It can analyse the third situation after having perceived the environment. This is a generic manner of defining the engine of self-organization. We call these situations Non-Cooperative Situations or NCS. For each level of the system, a set of NCS must be determined, and this set must be as complete as possible. We instantiate it for the robots in section 4.2, Instantiation of the Model.
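A generic skeleton of how an agent might perform this local check each cycle is sketched below; the class and method names are invented and do not correspond to the API of any particular AMAS implementation.

```python
# Skeleton of local NCS detection in an AMAS-style agent (illustrative naming only).
from enum import Enum, auto

class NCS(Enum):
    MISUNDERSTANDING = auto()   # a received signal cannot be interpreted unambiguously
    INCOMPETENCE = auto()       # an interpreted signal leads to no new conclusion
    USELESSNESS = auto()        # the agent's conclusions are of no use to its environment

class CooperativeAgent:
    def detect_ncs(self, signal):
        interpretations = self.interpret(signal)
        if len(interpretations) != 1:
            return NCS.MISUNDERSTANDING
        if not self.consequences(interpretations[0]):
            return NCS.INCOMPETENCE
        if not self.is_useful_to_others(interpretations[0]):
            return NCS.USELESSNESS
        return None                         # cooperative: act nominally

    def step(self, signal):
        ncs = self.detect_ncs(signal)
        if ncs is None:
            self.act()                      # nominal behaviour (partial function)
        else:
            self.reorganise(ncs)            # act on the organization to restore cooperation

    # Hooks each concrete agent has to supply.
    def interpret(self, signal): raise NotImplementedError
    def consequences(self, interpretation): raise NotImplementedError
    def is_useful_to_others(self, interpretation): raise NotImplementedError
    def act(self): raise NotImplementedError
    def reorganise(self, ncs): raise NotImplementedError
```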
Fig. 1. A three-level architecture: the robot level, the state level and the activity level.
3 A Multi-level Architecture
In this section, we present our architecture for modelling a group of robots as an adaptive multi-agent system. First, we define the three levels that will be present in an adaptive multi-agent system: the robots, their inner states and their activity level. Then, we identify and describe the agents at each level and the possible non-cooperative situations.
3.1 The Different Levels of the MAS
As a primordial motivation for modelling systems easily, the decomposition of a system into different levels of abstraction is a prevalent characteristic of our work. This decomposition enables us to develop the levels separately and to observe the phenomena that correspond to each level. At the robot level, the globally modelled system is composed of several agents, the robots, which may or may not be physically homogeneous. Each robot is driven by its decision module, itself an adaptive multi-agent system, in which the agents are states. At this state level, the states must self-organize to give the robot a coherent global behaviour. This approach requires the definition and identification of each agent at each level of the system. It is important to identify these levels because of the intrinsically multi-level nature of emergent phenomena. Emergence is a bottom-up phenomenon that can propagate through the entire system and its levels. Actually, an emergent behaviour can appear from the organization of the state agents, at the state level, up to the robot level. Each robot is led by such a behaviour, so that an emergent global behaviour appears at the global level (or robot level). The global behaviour is guaranteed by the AMAS theory [7]. This theory shows that if the system and its environment are in a cooperative situation, the system realizes the
right function. Moreover, if the internal medium of the system, which consists of agents, is in a cooperative situation with the environment, then the system is cooperative. In our system, we find three levels, as in Fig. 1. At the highest level, the Robot Level, the adaptive multi-agent system is composed of autonomous robots which have to accomplish a collective task in a dynamical environment. For example, the task of a collective may be resource transportation [18] or a box-pushing task [12,16]. The agents should be equipped as the robots that they represent. They have sensors, a sonar for example, and actuators, such as wheels, to be able to interact with their environment and to communicate indirectly with the other robots. They can also have communication equipment, such as an infrared port, to communicate directly. At the mid-level there is the State Level. Each robot contains a multi-agent system in its Decision Module. This component, which links the robot level and the state level, is discussed in section 2.2.4 and section 5.2. This module has to determine the behaviour of the robot. The behaviour of a robot is directed by a sequence of inner states that are simple activities allowing the robot to accomplish its task. One way to implement adaptive robots is to build them from adaptive multi-agent components. Each state is represented as an agent in the Decision Module. These agents must determine the right time to activate themselves to give the robot a coherent behaviour. This multi-agent system can be viewed as an organization of states that are able to change their place in a state graph. This self-organization leads to the emergence of the robot’s behaviour. We can define the states as high-level primitives because they do not manipulate the actuators of the robot directly. For example, in the case of an ant having to bring resources to its nest, the states might be exploration, exploitation, back to the nest and rest. They would not be turn left, turn right, move forward, move backward or pick, which manipulate the actuators (arms for example) directly. Other decompositions and level definitions are easily imaginable. As we said in the previous paragraph, states are high-level activities that do not manipulate the robot’s actuators directly. So, what directly controls these actuators? The definition of another level is required to complete this top-down definition of our architecture. This level is named the Activity Level. Like robots, states are led in their local behaviour by a decision module that has to activate the right activity at the right time. A state needs different activities to be coherent. For example, an exploring robot has to know how to reach a resource when it detects one. More examples are given in section 4, Traffic Survey. In fact, activities may manipulate actuators directly. In this paper, we focus on the Robot Level and the State Level, even if the Activity Level raises several questions which will be briefly discussed.
3.2 High-Level Agents: The Robots
Robot agents act at the Robot Level. These agents are designed to control physical robots or to simulate them in an artificial life platform. They are composed of four distinct parts: sensors, a decision module, actuators and a non-cooperative situation detection module (or NCS detection module). The first three components correspond to a classical principle of several works in Collective Robotics and Artificial
Life. The last one, the NCS detection module, is our contribution to the agent architecture. Agents are led by a classical three-phase life cycle:
1. The perception phase, during which the robot updates the registers corresponding to its sensors and so updates its input vector, composed of Boolean values corresponding to the robot’s point of view of its environment;
2. The decision phase, during which the robot chooses an appropriate inner state as a function of its input vector;
3. The action phase, during which the robot updates the registers corresponding to its actuators (wheel speed, rotation angle...) according to the decision taken in the previous phase.
The NCS detection module participates in the decision phase. If the robot locally assumes it is in a cooperative situation, then it acts normally. Otherwise, i.e. if the robot locally infers it is in a non-cooperative situation, it acts to come back to a cooperative situation. As discussed in section 3.1, the robot can itself be composed of adaptive multi-agent systems, as in other applications, e.g. brokerage agents that model their ontologies as multi-agent systems composed of words [2]. Sensors represent the link between the robot and its environment. They allow the robot to construct a partial representation of the world in which it is situated. This is why such robots are suitable to be modelled as autonomous agents that obey the locality principle: an agent has a local perception of its environment. This principle is a “sine qua non” condition for being an agent [6]. Sensors update the robot’s input vector. The NCS detection module is coupled with the sensors and acts during the perception phase; it analyses the input vector. Based on its inputs, including what the robot knows about itself – its current state for example –, the module determines whether the robot is in a non-cooperative situation. In such a situation, the NCS detection module sends a message to the decision module to handle this particular situation. Actuators represent the link from the robot to its environment. Thanks to the actuators, the robot can move (legs or wheels) or pick up objects, for example, which modifies the environment. Actuators are driven by registers containing values representing their position, orientation, etc. The only representation the robot has of its actuators is this set of values. The decision module selects an inner state based on the input vector. In our architecture, the decision module uses another multi-agent system composed of state agents, operating at the State Level. The problem of choosing the right state at the appointed time appears in several works in Robotics [15]. For our part, we explore two ways of making the decision, a simple reactive one and an agent-based one in which states are agents of an adaptive multi-agent system:
• Reactive decision: the module associates a state with each possible input vector. This kind of decision is very efficient but not flexible. It requires a complete exploration of the input-vector space when the robot is designed, even if factorisations may be possible.
• Agent-based decision: during the decision phase, each state agent evaluates its wish to act at this time. A state agent has beliefs based on the input vector and
assigns a weight to each of them. The most relevant state will be activated, i.e. each state agent sends a value corresponding to its wish to be activated and a decision function chooses the best one (the agent with the highest value, for example). With such a decision module, robots are able to learn from the state agents’ self-organization, as explained in section 5.1, Learning from Self-Organization. They change their functions as the states’ organization changes. The issue of decision-making can also be raised at the state level: a state agent must activate the right activity at the proper time to be coherent. This is the motivation for decomposing the agents into multi-agent systems.
3.3 Mid-Level Agents: The States
The state agents appear in the decision module. A state represents a behaviour a robot can have. The role of the state agents is to activate themselves at the proper time to control the robot coherently. A state agent has two tasks to accomplish. Firstly, it has to select the right activity at the proper time. Secondly, it must send to the decision module a value that represents its inclination to be the active state. Actually, the global behaviour of a robot, i.e. the sequence of its actions, can be represented by a transition graph, as in the methodology CIRTA [13]. In this graph, two of the levels may be the state level and the activity level. The goal is to find the graph the robots have to follow. Keeping in mind that we work with the concept of emergence, this graph does not have to appear at the robot level: the robot has to construct it by learning.
3.4 Related Works
Our architecture is similar to the behaviour-based architectures of Brooks [5] or Matarić [15]. In fact, our states and activities correspond to their behaviours. In these architectures, each agent has a set of simple behaviours which enable it to accomplish its task. The choice between the different behaviours at a given time can be made by different procedures: arbitration, subsumption, etc. In our work, states are agents having to choose by themselves the right state at the proper time. The choice is not a centralized procedure but is distributed among the state agents.
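The sketch below puts the three-phase life cycle, the NCS detection module and the agent-based decision together in one place; all class and method names are invented for illustration and simplify the architecture described above.

```python
# Sketch of the robot life cycle with an agent-based decision module (illustrative names).

class StateAgent:
    def __init__(self, name, weights):
        self.name = name
        self.weights = weights                      # one belief weight per input-vector component

    def wish(self, inputs):
        # Boolean inputs are mapped to +1 / -1 and weighted (see Eq. (2) in Sect. 5.3)
        return sum(w * (1 if x else -1) for w, x in zip(self.weights, inputs))

class Robot:
    def __init__(self, sensors, actuators, states, ncs_detector, decision_module):
        self.sensors, self.actuators = sensors, actuators
        self.states, self.ncs_detector = states, ncs_detector
        self.decision_module = decision_module

    def life_cycle_step(self):
        inputs = self.sensors.read()                             # 1. perception phase
        ncs = self.ncs_detector.check(inputs)                    # coupled with the sensors
        if ncs is not None:
            self.decision_module.notify_ncs(ncs)                 # handled at the state level
        chosen = max(self.states, key=lambda s: s.wish(inputs))  # 2. decision phase
        self.actuators.apply(chosen.name)                        # 3. action phase
```

The sensors, actuators, NCS detector and decision module are passed in as collaborators here purely to keep the sketch self-contained; the original architecture does not prescribe this particular decomposition.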
4 Traffic Survey
To explain the presented architecture, we develop an application. This study addresses a common problem in Collective Robotics: spatial interference. The result of this application is the observation of a global emergent behaviour: we show that the stream of a team of agents through the corridors is a global emergent behaviour.
Fig. 2. An example of an environment for the emergent stream behaviour.
4.1 Presentation of the Problem
The resource transportation problem is a classical task in Collective Robotics inspired by insect communities [11,18]. The task of the robots is to transport resources from a claim area to a laying area. These areas are situated in different rooms separated by narrow corridors. A spatial interference problem appears, as in the survey of Vaughan et al. [19]. Once a robot has entered a corridor, what does it have to do if another robot comes in the opposite direction? In the mentioned survey, the authors use aggressive competition to resolve the problem. For our part, we use the NCS concept. The configuration of the environment is of course of great importance for the appearance of emergent phenomena. As Fig. 2 shows, each room (1 or 2) contains an area (3 or 4, respectively) on the side opposite the other room, to force robots to enter a room to carry out their activities. Corridors (5, 6 and 7) are narrow – wider than one robot but narrower than two – and their length can be parameterised. In fact, the length influences the appearance of a stream direction: either the corridor is shorter than the perception range of the robot, or it is longer. In the first case, a robot can see another robot engaged in a corridor before entering it. In the second case, a robot cannot know whether another robot is engaged and has to enter without certainty. The number of corridors may be important too. In this article, we only present a case with two long corridors.
4.2 Instantiation of the Model
Now we present how to use our architecture for the resource transportation task.
The Robots. To complete the task, a robot must have some sensors and actuators: short-ranged sensors to avoid obstacles, an object sensor to distinguish robots from resources and an area sensor to detect areas and corridors (each area or corridor can have its own colour); two wheels to move in any direction and a pick-up unit (such as a
clamp) to pick up resources. The sensors enable a robot to construct its input vector. We have determined a set of inputs such as seeResource, inClaim or inCorridorNumber. These inputs are Booleans or lists of Booleans. This set of inputs is called the input vector.
States and Activities. We have determined the following states and activities:
• The Claim State: the state a robot must be in when it has to take a resource. This state uses the following activities:
  - Seek Resource: the robot is exploring the rooms to find a resource;
  - Reach Resource: the robot is moving to a resource (it is able to avoid obstacles too);
  - Pick Resource: the robot is picking up a resource which is in close range;
• The Laying State: the state a robot must be in when it has to drop a resource. This state uses the following activities:
  - Seek Area: the robot is exploring the rooms to find the Laying Area;
  - Reach Area: the robot is moving to the Laying Area (it is able to avoid obstacles too);
  - Drop Resource: the robot is dropping a resource;
• The Travel State X: the state a robot must be in when it has to cross corridor X. This state uses the following activities:
  - Seek Corridor X: the robot is exploring the rooms to find corridor X;
  - Reach Corridor X: the robot is moving to corridor X (it is able to avoid obstacles too);
  - Cross Corridor X: the robot is crossing corridor X.
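One possible, purely illustrative encoding of this input vector and of the state/activity decomposition is sketched below; only the input names seeResource, inClaim and inCorridorNumber come from the text, the rest is assumed.

```python
# Hypothetical encoding of the input vector and of the state/activity grouping of Fig. 3.
from dataclasses import dataclass, field
from typing import List

@dataclass
class InputVector:
    seeResource: bool = False
    inClaim: bool = False
    inLaying: bool = False                      # assumed counterpart of inClaim
    inCorridorNumber: List[bool] = field(default_factory=lambda: [False, False])

    def as_booleans(self):
        return [self.seeResource, self.inClaim, self.inLaying, *self.inCorridorNumber]

# Each state groups the high-level activities it relies on (seek / reach / act).
STATES = {
    "Claim":    ["Seek Resource", "Reach Resource", "Pick Resource"],
    "Laying":   ["Seek Area", "Reach Area", "Drop Resource"],
    "Travel 1": ["Seek Corridor 1", "Reach Corridor 1", "Cross Corridor 1"],
    "Travel 2": ["Seek Corridor 2", "Reach Corridor 2", "Cross Corridor 2"],
}

print(InputVector(seeResource=True, inClaim=True).as_booleans())
```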
Fig. 3. The State Level and the Activity Level appear in the transition graph.
Each state uses a set of activities that can be summed up as seek, reach and act. The robots have to: seek a resource, find it, reach it, pick it up; then seek a corridor, find it, reach it, cross it; and then seek the laying area, find it, reach it and drop the carried resource. This behaviour corresponds to a transition graph whose conditions may be very complex because of the number of parameters in the input vector. A solution is to group activities into states, as in Fig. 3. Conditions are then factorised and easier to define.
NCS rules. We now define the NCS corresponding to the robot level. The NCS corresponding to the state level will not be explained because we only focus on the appearance of emergent behaviour at the robot level.
1. A robot A, which is not carrying a resource, is reaching a corridor and sees another robot B, which is carrying a resource: A is reaching a corridor which is frequented by robots carrying resources and may disturb them;
2. A robot A, which is not carrying a resource, is reaching a corridor and sees another robot B which is immobile and not carrying a resource: A is reaching a corridor which is blocked for carrying robots;
3. A robot A, which is carrying a resource, is reaching a corridor and sees another robot B, which is not carrying a resource: A is reaching a corridor which is frequented by robots which are not carrying resources and may disturb them;
4. A robot A, which is carrying a resource, is reaching a corridor and sees another robot B which is immobile and carrying a resource: A is reaching a corridor which is blocked for robots which are not carrying resources;
5. A robot A, which is not carrying a resource, is crossing a corridor and sees another robot B carrying a resource: A and B are crossing a corridor which is frequented by robots doing antinomic jobs;
6. A robot A, which is carrying a resource, is crossing a corridor and sees another robot B which is not carrying a resource: same as (5);
7. A robot A is crossing a corridor and sees an immobile robot: the corridor is blocked, which may be due to robots in NCS (5) or (6).
4.3 Preliminary Experiments
At this stage, the system is limited. We present in this section the first results, which motivated our wish to extend the system, notably with learning capacities. Experiments have been realized on the oRis platform developed at the ENIB [9]. We show different results from measurements on the team for different configurations of agents, and we conclude on the need to equip our agents with learning capabilities.
Robots without NCS Detection Module. The results of Fig. 4 show the corridor frequenting. The experimental configuration is two corridors and twenty robots that are unable to detect NCS. Two curves represent the number of robots crossing a
Fig. 4. Number of incoming robots in the corridors (first corridor at top, second corridor at bottom) as a function of simulation time steps, for robots without NCS detection module.
Fig. 5. Number of incoming robots in the corridors (first corridor at top, second corridor at bottom) as a function of simulation time steps, for robots with NCS detection module.
corridor. Each curve corresponds to a direction: from the claim room to the laying room, or from the laying room to the claim room. As an observation, the corridors are not dedicated to a direction: we do not observe the emergence of a traffic pattern as a global behaviour.
Robots with NCS Detection Module. The results of Fig. 5 have been obtained with the same configuration, except that the robots can detect and treat NCS. However, there is still no emergent global behaviour.
NCS are not Totally Exploited. Being able to detect and immediately resolve NCS seems not to be sufficient to observe emergent behaviour. These results underline the need to add learning capacities to the robots. In fact, conflicts are not due to the current state of a robot but to a previous one, because of the length of the corridors.
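As a concrete illustration of how the seven NCS rules of Sect. 4.2 could be detected from purely local perceptions, the sketch below encodes them as tests on a perception record; the Percept structure and the exact disambiguating conditions are assumptions, not the authors' implementation.

```python
# Hypothetical local detection of the corridor NCS rules listed in Sect. 4.2.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Percept:
    carrying: bool            # is this robot carrying a resource?
    reaching_corridor: bool   # moving towards a corridor entrance
    crossing_corridor: bool   # already inside a corridor
    other_seen: bool          # another robot is perceived
    other_carrying: bool      # ... and it carries a resource
    other_immobile: bool      # ... and it does not move

def detect_ncs(p: Percept) -> Optional[int]:
    if not p.other_seen:
        return None
    if p.reaching_corridor:
        if not p.carrying and p.other_carrying and not p.other_immobile:
            return 1          # entering a corridor used by carrying robots
        if not p.carrying and p.other_immobile and not p.other_carrying:
            return 2          # corridor blocked (rule 2)
        if p.carrying and not p.other_carrying and not p.other_immobile:
            return 3          # entering a corridor used by non-carrying robots
        if p.carrying and p.other_immobile and p.other_carrying:
            return 4          # corridor blocked (rule 4)
    if p.crossing_corridor:
        if p.other_immobile:
            return 7          # the corridor is blocked, possibly by robots in NCS 5 or 6
        if p.carrying != p.other_carrying:
            return 5 if not p.carrying else 6   # antinomic jobs meet inside the corridor
    return None
```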
5 Learning and Decision
In the previous section, we have shown the need to endow our robots with learning abilities. We now develop the concept of learning by self-organization, which is the way for our system to adapt itself to its environment.
5.1 Learning from Self-Organization
Learning for a system S consists in autonomously modifying its function fS to adapt itself to its environment. In this case, the environment is a given constraint. Each agent Ai of a system S achieves a partial function fAi of the global function fS. fS is the result of the combination of the partial functions fAi, noted by the operator ◦. The combination being determined by the current organization of the parts, we can deduce:

fS = fA1 ◦ fA2 ◦ … ◦ fAn     (1)
As generally fA1 ◦ fA2 ≠ fA2 ◦ fA1, by transforming the organization you change the combination of the partial functions and therefore you modify the global function fS. Therefore, this is a way to adapt the system to its environment. The theorem we put forward can be expressed as follows:
Theorem 2. Any system having a cooperative internal medium is functionally adequate.
Each agent has to be in cooperative interaction with the others so that the totality is in cooperative interaction. It means that each agent which locally detects non-cooperative situations must try to change the organization in order to reach a new cooperative state. It might be restrictive to use the principle of self-organization, that is to say the search for an optimum organization, as the only learning mechanism. However, structuring a system in levels of different granularity allows for learning not
only at the level of the organization but also at the level of skills. Indeed, the skill of an agent changes if its internal organization (the agents it is made of) is changed. So, the organization of a level is chained to that of the upper level. In fact, each agent could itself be a multi-agent system composed of cooperative sub-agents at a lower level.
5.2 Learning at the Robot Level
Learning at the Robot Level can be performed by an agent-based decision module. A change in the organization of the state agents corresponds to a change in the robot agent’s skill. When a robot detects an NCS, it has to learn in order to come back to a cooperative situation. The environmental feedback is implicit1, i.e. it is the robot itself that detects this feedback. A robot agent learns about its skill: it changes its graph of states in order to change its global behaviour. To change its graph, it must change the organization of its state agents by sending a message to its decision module when it detects a non-cooperative situation.
5.3 Learning at the State Level
At the State Level, in the decision module, each state agent evaluates its propensity to be activated. All these agents form an organization. States must learn when a wrong state is activated at an improper time; they detect such a situation when the robot agent sends an NCS message to its decision module. To evaluate its propensity to be activated, a state agent must calculate a wish value. To do so, the state agent multiplies a modified input vector (ei) with a weight vector (cj) corresponding to the state agent’s beliefs about each component of the vector:
[e1  e2  …  en−1  en] · [c1  c2  …  cn−1  cn]^T = value to return    (2)
The modified inputs ei are equal to 1 for an input vector value of 1, and to -1 for an input vector value of 0. The ei values express whether an input is important for the decision, and the cj values correspond to the importance assigned to each input for the decision. Learning is done by modifying the weights cj: when the decision module receives an NCS message from the robot, the weight decreases; when the state corresponding to the state agent is chosen, the weight increases. After computing their values, the state agents compare them and decide which state will be activated. This learning model is close to the reinforcement learning model, or to the model proposed by Matarić [15], but it differs in that there is no global evaluation or punishment/reward attribution; there is only a local cooperation criterion.
¹ As opposed to explicit feedback, where an external omniscient entity decides for each agent whether its action is good or not.
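A minimal sketch (in Python; illustrative only — the authors' robots run on the oRis platform, and the class names, learning rate and exact per-weight update rule below are our assumptions) makes this decision module concrete:

class StateAgent:
    def __init__(self, name, n_inputs, learning_rate=0.1):
        self.name = name
        self.weights = [0.0] * n_inputs      # c_j: importance assigned to each input
        self.learning_rate = learning_rate   # assumed parameter, not given in the paper
        self.last_inputs = [0] * n_inputs

    def wish(self, inputs):
        # Modified inputs e_i: an input of 1 stays 1, an input of 0 becomes -1 (eq. (2)).
        self.last_inputs = [1 if x == 1 else -1 for x in inputs]
        return sum(e * c for e, c in zip(self.last_inputs, self.weights))

    def reinforce(self, sign):
        # Move the weights toward (sign=+1) or away from (sign=-1) the last input pattern.
        # The paper only says the weight increases on selection and decreases on an NCS,
        # so this per-input rule is an assumption.
        self.weights = [c + sign * self.learning_rate * e
                        for e, c in zip(self.last_inputs, self.weights)]


class DecisionModule:
    def __init__(self, state_agents):
        self.state_agents = state_agents
        self.current = None

    def decide(self, inputs):
        # The state agents compare their wish values; the highest one is activated.
        self.current = max(self.state_agents, key=lambda s: s.wish(inputs))
        self.current.reinforce(+1)           # chosen state: reinforce its weights
        return self.current.name

    def on_ncs(self):
        # NCS message from the robot agent: weaken the weights of the active state.
        if self.current is not None:
            self.current.reinforce(-1)


module = DecisionModule([StateAgent(n, 4) for n in ("rest", "claim", "lay")])
print(module.decide([1, 0, 0, 1]))           # activates one of the three states
module.on_ncs()                              # the robot reported a non-cooperative situation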
6 Results In the previous sections, we have seen the need to supply our robots with learning abilities to reach our goal: the emergence of the stream as a global behaviour. We therefore developed an agent-based decision module. In this section, we present results showing how a team of robots equipped with such a module evolves by self-organization. As in Section 4.3 (Preliminary Experiments), the team is composed of 20 robots situated in a two-corridor environment. Finally, we draw conclusions on the appearance of coherent group behaviour. 6.1 Robots with Agent-Based Decision Module In Fig. 6, the lighter curve corresponds to robots that do not carry resources and the darker one to robots that do. The results show that the two "classes" of robots crossing the corridors (carrying and non-carrying robots) are well dissociated: carrying robots take over the second corridor. This phenomenon is due to the learning ability; it does not appear with a team of robots without the decision module. At the beginning of the simulation, robots encounter numerous NCSs because they collide with each other at the corridor entrances. Bit by bit, they change their transition graphs. The dissociation between corridors does not correspond to a cast formation as described by Balch [3], because it is a dynamic phenomenon in which robots constantly move from class to class: all the robots follow the same cycle of claim room, second corridor, laying room, first corridor, and back to the claim room. With only two corridors this outcome seems evident; it would be interesting to observe whether casts form in an environment with more than two corridors. The results also show that both "classes" of robots frequent each of the corridors. This is a consequence of the method used to count the robots: incoming robots are counted, not robots that completely cross the corridors, so robots that enter a corridor to avoid an obstacle are also counted. We can see that this dissociation occurs in a short time, at the beginning of the simulation (20,000 steps), which shows the efficiency of learning by self-organization for a wide solution space.
Fig. 6. Number of incoming robots in the corridors (first corridor at top, second corridor at bottom) as a function of simulation time step, for robots with the agent-based decision module.
6.2 Emergence of a Global Behaviour Robots self-organize to transport resources by crossing specific corridors, yet no algorithm was coded into the robots to do so. The micro-level specification leads to the appearance of a coherent emergent phenomenon: the stream, i.e., the dedication of corridors to a specific direction.
7 Conclusion and Perspectives In this paper, we developed an example of a Collective Robotics problem: resource transportation through corridors. This problem was tackled for the purpose of observing an emergent phenomenon at the team level. After presenting our architecture and studying the task, we discussed some results obtained by simulation on the agent-oriented platform oRis. Robots had to be equipped with learning abilities for the emergent stream to be observed; this conclusion motivated the development of an agent-based decision module. The expected result appeared: the stream behaviour emerged from the team. The developed system is incomplete, but it offers several perspectives for the future:
- Optimisation of NCS resolution, since NCS currently lead to short-lived blockages at the corridor entrances;
- Experimentation with a “Logical Adaptive Network”, in which the decision is made by an adaptive multi-agent system composed of “NAND” agents that emulate the transition graph;
- Study of other behaviours and of cast dynamics, such as “box-pushing” with a team or a hierarchy;
- Learning by cooperation, where the robots are able to communicate and share their experiences.
The results obtained with our present system are encouraging for two reasons. Firstly, the method used to specify and design the system is confirmed to be efficient for the study of complex systems with emergent functionality. Secondly, the AMAS theory seems to be relevant in Collective Robotics domains, even if our application is only a laboratory case. More generally, our work now focuses on a methodology, based on the AMAS theory and UML notations, for designing adaptive multi-agent systems with emergent functionality. This methodology, named ADELFE, involves several partners² in developing a toolkit for engineers and will be implemented in the OpenTool© application provided by the TNI company.
² RNTL Project Partners: ARTAL Technologies, IRIT, L3I and TNI.

References
[1] S. M. Ali and R. M. Zimmer. The question concerning emergence. In Computing Anticipatory Systems: CASYS – First International Conference, AIP Conference Proceedings 437, pp. 138–156, 1998.
[2] Athanassiou, Chirichescu, Camps, Gleizes, Glize, Lakoumentas, Léger, Moreno, Schlenker. Abrose: A Co-operative Multi-Agent Based Framework for Marketplace. In IATA, Stockholm, Sweden, August 1999.
[3] T. Balch. Social Entropy: a New Metric for Learning Multi-Robot Teams. In Proceedings, 10th International FLAIRS Conference (FLAIRS-97), 1997.
[4] C. Bernon, V. Camps, M.-P. Gleizes, P. Glize. La conception de systèmes multi-agents adaptatifs : contraintes et spécificités. In Atelier de Méthodologie et Environnements pour les Systèmes Multi-Agents (SMA 2001), Plate-forme AFIA, Grenoble, June 2001.
[5] R. Brooks. A robust layered control system for a mobile robot. IEEE Journal of Robotics and Automation, volume 2, pages 14–23, 1986.
[6] J. Ferber. Les systèmes multi-agents : vers une intelligence collective. InterEditions, Paris, 1995.
[7] M.-P. Gleizes, V. Camps, and P. Glize. A theory of emergent computation based on cooperative self-organization for adaptive artificial systems. In Fourth European Congress of System Science, Valencia, Spain, 1999.
[8] J. Goldstein. Emergence as a construct: history and issues. Emergence: A Journal of Complexity Issues in Organizations and Management, The New England Complex Systems Institute, 1(1):49–72, 1999.
[9] F. Harrouet. oRis : s'immerger par le langage pour le prototypage d'univers virtuels à base d'entités autonomes. PhD thesis, Université de Bretagne Occidentale, 2000.
[10] J. Holland. Emergence: From Chaos to Order. Oxford University Press, Oxford, 1998.
[11] C. R. Kube and H. Zhang. Collective robotics: from social insects to robots. Adaptive Behaviour, 2(2):189–218, 1994.
[12] C. R. Kube and H. Zhang. The use of perceptual cues in multi-robot box-pushing. In 1996 IEEE International Conference on Robotics and Automation, pages 2085–2090, 1996.
[13] O. Labbani-Igbida, J.-P. Müller, and A. Bourjault. Cirta: an emergentist methodology to design and evaluate collective behaviours in robots' colonies. In Proceedings of the 1st International Workshop on Collective Robotics (CWR-98), volume 1456 of LNAI, pp. 72–84, Berlin, 1998.
[14] MARCIA Group. Auto-organisation := évolution de structures ? In Journées du PRC-GDR Intelligence Artificielle : les systèmes multi-agents, Toulouse, 1996.
[15] M. J. Matarić. Interaction and Intelligent Behavior. PhD thesis, MIT, May 1994.
[16] M. J. Matarić, C. Nilsson, and K. Simsarian. Cooperative multi-robot box-pushing. In IEEE International Conference on Intelligent Robots and Systems, volume 3, pages 556–561, 1995.
[17] T. Sontheimer, P. Cornuau, J.-J. Vidal, P. Glize. Application d'un système adaptatif pour la prévision de crues dans le bassin de la Garonne – un modèle émergent. In Conférence SIRNAT'01, 2001.
[18] R. Vaughan, K. Støy, G. Sukhatme, and M. Matarić. Blazing a trail: insect-inspired resource transportation by a robotic team. In Proceedings, 5th International Symposium on Distributed Robotic Systems, October 2000.
[19] R. Vaughan, K. Støy, G. Sukhatme, and M. Matarić. Go ahead, make my day: robot conflict resolution by aggressive competition. In Proceedings, 6th International Conference on Simulation of Adaptive Behaviour, pp. 491–500, 2000.
Evolving Preferences among Emergent Groups of Agents Paul Marrow, Cefn Hoile, Fang Wang, and Erwin Bonsma BTexact Technologies Intelligent Systems Laboratory, Adastral Park, Ipswich IP5 3RE, UK {paul.marrow, cefn.hoile, fang.wang, erwin.bonsma}@bt.com
Abstract. Software agents can prove useful in representing the interests of human users of agent systems. When users have diverse interests, the question arises as to how agents representing their interests can be grouped so as to facilitate interaction between users with compatible interests. This paper describes experiments in the DIET (Decentralised Information Ecosystem Technologies) agent platform that use evolutionary computation to evolve preferences of agents in choosing environments so as to interact with other agents representing users with similar interests. These experiments suggest a useful way for agents to acquire preferences for formation of groups for information interaction between users, and may also indicate means for supporting load balancing in distributed systems.
1
Introduction
Software agents have proved useful for representing the interests of individual human users (e.g. [10]). With multi-agent systems there arises the possibility of managing processes on behalf of large populations of human users. Associated with this is the problem of ensuring that users with common interests get to interact appropriately. This is the issue of group formation, part of the more general problems of cooperation and collaboration in multi-agent systems. Group formation and the associated issues of cooperation and collaboration have proved relevant to much research in multi-agent systems (e.g. [9,11,21]). Similar problems exist in robotics (e.g. [15]). One particular focus of research has considered how evolutionary algorithms can be used to adapt agent behaviour, and achieve collaborative or cooperative solutions ([7,6,16]). The use of evolutionary algorithms seems particularly appropriate in this context since they depend upon the interaction of many individuals for their success [1]. In this paper we describe how an evolutionary algorithm can be used to adapt agent behaviour in the DIET (Decentralised Information Ecosystem Technologies) system ([8,14]), resulting in the emergence of groups of agents that share common interests. The DIET system implements large numbers of lightweight agents that can interact in a decentralised and extensible manner. The DIET system has been inspired by the interaction of organisms in natural ecosystems,
and, inspired by the role of evolution in such interactions, the mechanism we use for group formation is the evolution of preferences for different environments. In the software agent context, an environment refers to the software environment an agent inhabits. In this context we assume that each environment exists on a single machine. For mobile agents, multiple environments may be on different machines. In a DIET network different environments maintain connections with each other in a “peer to peer” fashion. It is well known that a degree of centralisation in peer to peer networks can improve the efficiency of functions such as indexing and retrieval [18]. However, existing strategies for centralisation often depend upon the existence of reliable, well known, central servers. Here we demonstrate the emergence of centralisation within a network of peers with no central servers. The dynamic approach described offers a compromise between the robustness and self-sufficiency of fully decentralised networks of transient peers with the efficiency of a centralised system. The dynamic formation of communities of agents could be very important for the proper exploitation of computational and informational resources in future networks. The most rapid and effective interactions between agents typically are those that take place locally, between agents occupying a single environment. Accordingly we use an evolutionary algorithm to evolve preferences that lead to agents with common interests sharing the same environment. Use of an evolutionary algorithm allows local interactions between agents to be taken advantage of in shaping the strategies used for despatch of agents to different environments over a sequence of iterative steps (evolutionary generations). Working with two populations of agents, User agents, representing user interests, and Scout agents, searching out preferred environments, we use the evolutionary algorithm to evolve the preferences of Scout agents for environments in a network of multiple environments in which agents can exist. We show that the evolutionary algorithm can increase the effectiveness of Scout agents in locating environments that are suitable for information transfer with other agents representing common interests. This can provide a basis for the automatic formation of groups of users sharing interests. We also consider how the results from the process of group formation indicate the robustness and flexibility of the DIET system. As well as explicit selection of agents through an evolutionary algorithm, we consider how characteristics of the DIET agent environment can stimulate a process of implicit evolution, that is evolution with respect to computational resource constraints, where computational efficiency is associated with survival. This could also be used to evolve agents that adopt computationally efficient behaviour.
2
The DIET Platform
The experiments presented here use the DIET system ([8,14]), a software platform that has been developed to enable the implementation of multi-agent systems consisting of very large numbers of lightweight agents, under decentralised
control, interacting in a manner inspired by natural ecosystems. The development effort within the DIET project [3] is focused on providing an extensible framework for the exploration of ecologically inspired software solutions in an open agent platform. 2.1
Aims and Inspiration
Inspiration for the DIET system has come from natural ecosystems, where many living organisms and abiotic elements interact to produce a variety of emergent phenomena [22]. These biological systems have inspired the Universal Information Ecosystems initiative of the European Union [4], which addresses the issue of managing and understanding the complexity of the emerging global information infrastructure by looking at local and bottom-up interactions between elements of this infrastructure, in the manner of interactions between living organisms. Such local and bottom-up approaches may be expected to provide more flexibility, adaptability and scalability in response to changing circumstances than more top-down or centralised approaches. The DIET project forms part of the Universal Information Ecosystems initiative and hence the system design attempts to take these ideas into account. 2.2
Architecture
The DIET system is designed around a three-layer architecture [14]: – Core layer : The functionality supported by the lowest layer is deliberately minimal, but designed to provide a robust and reliable service [8,14]. It is here that the services and functions common to all agents are provided. – ARC layer : Additional utilities are distributed along with the core layer, known as “Application Reusable Components” (ARC). These provide primitives that exploit the minimal functions of the core to support higher level activities common to many, but not all, applications. These include remote communication, agent reusable behaviours, multicasting and directory services. – Application layer : This layer comprises additional data structures and agent behaviours for application-specific objectives. The three-layer structure provides flexibility for implementing a variety of applications using the same core features of the agents and the software environment that they inhabit. It has been implemented in Java. 2.3
Core Layer
The core layer provides for Environments that are the basic units which DIET agents can inhabit. One or more Environment may be located within a DIET World, there being a single Java Virtual Machine for each World. The possibility
exists for multiple Worlds in conjunction with multiple Java Virtual Machines, allowing for indefinite scaling up of the DIET system. Each Environment provides minimal functionality to all agents, allowing for agent creation, agent destruction, local communication between agents, and initiation of migration between Environments. These basic functions have been designed so as to minimise the computational overhead required for their execution. The CPU time required for each function is not dependent upon the number of agents occupying the Environment, allowing efficient and rapid operation even in Environments inhabited by large numbers of agents. Operation of the DIET system is based upon these basic functions and the resulting local interactions between agents. Local communication is central to local interaction between DIET agents. Local communication in this context involves the passing of messages and objects between two agents in the same Environment. The agent that initiates the communication must identify the agent that it wishes to contact - this can be done using a binary “name tag” associated with the target agent that is randomly generated in its original Environment. In addition an agent has a “family tag” that indicates the group of agents to which it belongs. These are in consequence not typically unique, but may also be used for identification. Identification of agents by either of these methods is decentralised, being associated only with particular Environments, and thus scales well with larger systems. Once a target agent has been identified, a Connection is formed between the two agents, allowing messages and/or objects to be passed between the two agents. Each agent has a message buffer that provides a space into which messages can be received. More information about local communication is given by Hoile et al. [8]. Remote communication, that is, communication between Environments, is also possible. The core layer provides only agent migration at the Environment level. Key functions associated with remote communication are provided in the ARC layer. 2.4
ARC Layer
The ARC layer provides for various extensions that can support remote communication between Environments, as well as other functions. These include “Carrier Pigeon” agents that can migrate to another Environment and then deliver a message by local communication to the intended target agent. Alternatively, Mirror agents can be created in an Environment to support a communication channel to an agent in another Environment, via Carrier Pigeons that only the Mirror agent, and not the agent initiating the remote communication, interacts with. Remote communication via a Mirror agent looks like local communication to other agents in the Environment. Such means of remote communication allow for increased flexibility in interaction between agents distributed across multiple environments [8].
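The following toy sketch (Python; not the DIET ARC-layer API — Environment, CarrierPigeon and Mirror are simplified stand-ins with invented method names) illustrates how a Mirror makes remote communication look local while Carrier Pigeons perform the actual hop between Environments:

class Environment:
    def __init__(self, name):
        self.name = name
        self.inboxes = {}                         # agent name -> received messages

    def register(self, agent_name):
        self.inboxes[agent_name] = []

    def deliver_locally(self, agent_name, message):
        # Local communication: the message ends up in the target agent's buffer.
        self.inboxes[agent_name].append(message)

class CarrierPigeon:
    def __init__(self, target_env, target_agent, message):
        self.target_env, self.target_agent, self.message = target_env, target_agent, message

    def fly(self):
        # "Migration" reduced here to calling into the destination Environment.
        self.target_env.deliver_locally(self.target_agent, self.message)

class Mirror:
    # Local stand-in for a remote agent: sending to the Mirror looks like local
    # communication, but it forwards the message via a Carrier Pigeon.
    def __init__(self, remote_env, remote_agent):
        self.remote_env, self.remote_agent = remote_env, remote_agent

    def receive(self, message):
        CarrierPigeon(self.remote_env, self.remote_agent, message).fly()

env_a, env_b = Environment("A"), Environment("B")   # the sender lives in A, the target in B
env_b.register("scout-1")
Mirror(env_b, "scout-1").receive("hello from A")
print(env_b.inboxes["scout-1"])                      # ['hello from A']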
2.5
Applications
Based on the functionality provided by the core layer and the ARC layer, applications can be developed based on the DIET platform, with application-specific code situated in the third, application, layer. Examples of work in this area include [5,12]. Applications can also take advantage of visualisation and interactive control software that is being developed [13]. The basing of application development on this architecture centred on local interactions between agents makes the DIET system particularly appropriate for the study of phenomena emerging from local interactions. We now go on to do this in the context of emerging preferences for environments supporting co-location for information sharing.
3
Experiments
We seek to generate emergent phenomena among agents in getting them to evolve preferences for particular environments (that are DIET Environments). As such agents can represent the interests of human users, this may be a useful mechanism for automatically ensuring that users’ interests in terms of environmental preferences are best served. We consider a situation where human users of an information management system connect transiently to a peer-to-peer network in order to identify information resources that satisfy their requirements for information. We assume that each user has a “category of interest” that represents some topic that they are particularly interested in. Users that are interested in the same category of interest are assumed to be interested in the same set of information, but to only have access to a subset of that initially. We also assume that users are interested in finding other users with the same category of interest and sharing information with them. Each user creates a DIET Environment from which agents can be created to facilitate the user’s requirements. 3.1
World, Environments, and Links
The experiments take place in the context of a DIET World composed of multiple Environments as described above (Section 2.3). Each Environment is distinct from others in terms of its distinctive signature provided by a hashcode. The Environment’s hashcode is generated based on the Environment’s address in the DIET World. A 32 bit hashcode is used, because a hashcode of this form can easily be acquired from all Java objects. But this form of hashcode can be replaced by one of many other hashing schemes if required (see e.g. [17]). In our experiments Environments are connected in a peer network. This network is formed by choosing pairs of Environments at random, and then creating neighbourhood links between them. Such links are uni-directional, but link formation can be repeated to give the effect of a bi-directional link. This process is repeated until each Environment has on average four links. This level of connectivity is intended to approximate the connectivity of a fully decentralised
peer network. The existence of such links between Environments allows agents to explore the network. Although agents can migrate to any Environment, they need to know the address of the destination Environment. At each Environment, agents can get the address of one of the neighbouring Environments and subsequently migrate to it and thus explore the collection of Environments. Figure 1 illustrates what such an environment network might look like.
Fig. 1. An example DIET peer network
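A minimal sketch of this construction (Python; illustrative only — the data structures and the stopping criterion are our assumptions):

import random

def build_peer_network(n_envs, avg_links=4, seed=0):
    rng = random.Random(seed)
    links = {env: [] for env in range(n_envs)}
    # Repeatedly pick a random pair and add a uni-directional neighbourhood link;
    # adding the reverse pair as well would give the effect of a bi-directional link.
    while sum(len(neighbours) for neighbours in links.values()) < avg_links * n_envs:
        a, b = rng.sample(range(n_envs), 2)
        links[a].append(b)
    return links

network = build_peer_network(32)
print(sum(len(n) for n in network.values()) / len(network))   # average of 4 links per Environment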
3.2
Agents
The experiments depend upon two populations of agents: User agents and Scout agents. These agents are lightweight and use the minimal functions supplied by the DIET platform. User agents represent human users and deploy and evolve Scout agents, as well as collecting information from Scout agents. Scout agents explore multiple Environments and carry out the activities needed to form groups. Only one User agent is associated with a particular Environment. The User agent remains in that Environment throughout each experiment. Each experiment starts with a number of User agents distributed, one at each of the Environments. Each User agent creates a population of Scout agents, and evolves them independently of other populations of Scout agents. Having created these Scout agents, the User agent dispatches them into the peer network of Environments, where they may interact with other Scout agents and other User agents, before returning to interact with the User agent that created them. 3.3
Evolutionary Algorithm
Scout agents are bred by User agents. User agents seek to maximise Scout agents’ success in locating other Scout agents with common interests. A Scout agent’s
preference for Environments is determined by a bitstring genome provided at birth. A Steady State genetic algorithm is used [19], implemented using the Eos evolutionary and ecosystem platform [2]. Tournament selection, two-point crossover, uniform mutation and random replacement operators are used in the algorithm. Random replacement allows Scout agents to adapt their expectation of success under changing conditions of informational and environmental availability. When dispatching new Scout agents, the User agent uses tournament selection to choose parent genomes from its population, favouring genomes according to the success of the respective Scout agent in locating other satisfactory Scout agents. The behaviour of each Scout agent depends upon a satisfaction or preference function that indicates the degree of satisfaction that the Scout agent has with an Environment. This satisfaction function employs two bitstrings of length 32, drawn from a binary genome containing 64 bits. These two bitstrings are known as the XOR_mask and the AND_mask. To determine the degree of satisfaction for a given Environment, the Environment’s hashcode is XORed with the XOR_mask, and then ANDed with the AND_mask. The number of set bits (i.e. bits with the value “1”) then indicates the degree of satisfaction with the Environment. This preference function can then be evolved, in order to generate different orderings of preferences for Environments. Scout agents are initialised with a success of zero. New generations of Scout agents are generated by the recombination of the genomes of two “parent” Scout agents, resulting in two new Scout agent genomes. Two new Scout agents result, that are released into the User agent’s local Environment. From this point they carry out a three-phase life cycle (described below). If they complete this life cycle, and return successfully to the originating Environment, an existing member of the population of Scout agents based in that Environment is replaced at random by the genome of the returning Scout agent. In this way Scout agent preferences evolve over many generations in response to the conditions they encounter in different Environments. 3.4
Scout Agent Life Cycle
Having been created by User agents in their home Environment, Scout agents go through a life cycle that is divided into three phases: the Exploratory phase, the Sharing phase and the Reporting phase. In the Exploratory phase, a Scout agent visits eight Environments in a random walk starting from the Environment in which it originated. At each Environment it requests four addresses of neighbouring Environments, selecting one of these at random for the next hop. These numbers are fixed across all experiments in order to allow comparison across peer networks of different sizes. After collecting the thirty-two Environment addresses in this way, the Scout agent applies its evolved preference function in order to calculate a satisfaction value for each of the thirty-two potential host Environments encountered. It then selects
as a host the Environment with the address that gives it the highest satisfaction. Where several Environment addresses give the same satisfaction, the most recently visited is preferred. The Scout agent then enters the Sharing phase. During the Sharing phase the Scout agent migrates to its preferred host Environment, and spends a pre-determined period interacting with other Scout agents in that Environment – notifying them of its User agent's ID and category of interest, as well as noting the IDs and categories of interest represented by other Scout agents in that Environment. Then it moves to the Reporting phase. In the Reporting phase the Scout agent migrates back to its originating Environment and notifies the originating User agent of its genome and the number of successful encounters achieved. Scout agent success is measured according to the number of Scout agents encountered that were derived from different User agents (hence different Environments) but represented the same information category. So, a successful encounter in this context means an encounter with a Scout agent originating from another User agent that represents the same information category. The Scout agent then destroys itself, but its genome may live on in that it may be selected to contribute to the next generation of Scout agents, according to its success in locating Environments that contain other Scout agents representing User agents with common interests. The use of tournament selection means that some Scout agents with success lower than the current Scout agent population average may contribute to the next generation, but they are less likely to, and Scout agents with higher success are more likely to be represented in the next generation. Tournament selection also ensures responsiveness to changing conditions.
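The following sketch summarises Sections 3.3 and 3.4 in code (Python; illustrative only — the genome layout, the tournament size of two and the helper names are our assumptions, not the Eos/DIET implementation): the mask-based satisfaction function, tournament selection with random replacement, and the eight-hop exploratory walk with its most-recently-visited tie-break.

import random

def satisfaction(env_hashcode, genome):
    # 64-bit genome split into a 32-bit XOR_mask and a 32-bit AND_mask (assumed layout);
    # satisfaction is the number of set bits of (hashcode XOR XOR_mask) AND AND_mask.
    xor_mask, and_mask = genome >> 32, genome & 0xFFFFFFFF
    return bin((env_hashcode ^ xor_mask) & and_mask).count("1")

def tournament_select(population, rng, k=2):
    # population: list of (genome, success); higher success is favoured (k is assumed).
    return max(rng.sample(population, k), key=lambda entry: entry[1])[0]

def random_replace(population, genome, success, rng):
    # Steady-state update: a random member is replaced by the returning Scout's genome.
    population[rng.randrange(len(population))] = (genome, success)

def exploratory_phase(start_env, links, hashcodes, genome, rng, hops=8, per_hop=4):
    current, candidates = start_env, []
    for _ in range(hops):
        neighbours = [rng.choice(links[current]) for _ in range(per_hop)]
        candidates.extend(neighbours)            # 8 x 4 = 32 candidate addresses
        current = rng.choice(neighbours)         # next hop of the random walk
    # Highest satisfaction wins; iterating in reverse prefers the most recently visited on ties.
    return max(reversed(candidates), key=lambda env: satisfaction(hashcodes[env], genome))

rng = random.Random(1)
links = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
hashcodes = {0: 0xAAAAAAAA, 1: 0x55555555, 2: 0xFFFFFFFF}
genome = rng.getrandbits(64)
print(exploratory_phase(0, links, hashcodes, genome, rng))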
3.5 Consequences of Agent Behaviour
The repetition of this three-phase life cycle over multiple generations will lead to changes in the numbers of Scout agents found in each Environment at each iteration (corresponding to a generation of the evolutionary algorithm.) The longterm solution should be a network of Scout agents clustered to different densities in different Environments, with average Scout agent preference for Environments evolved to a level that most effectively supports choice of Environments in which agents representing the same category of interest can interact. Accordingly, Scout agent success in achieving such interactions should be maximised. Such a network of information sharing agents may support several distinct groups of agents, as represented by the shaded and unshaded agents shown in Figure 2. 3.6
Experimental Conditions
The algorithm described above provided the basis for a series of experiments. In each experimental run we were interested in the effectiveness of the evolutionary learning among agents in stimulating co-location of Scout agents in appropriate Environments. For the sake of logging results, all Environments were hosted in parallel on a single machine. (However, there is no reason why they should not be hosted on multiple machines in parallel in the future.) To compensate for this
lack of true parallelism, User agent search intervals, Scout agent waiting time, and overall run length were made proportionate to the number of User agents. A minute of CPU time was provided for the activity of each User agent. Each User agent began the simulation with a fixed category of interest, and a population of 100 Scout agents with random genomes (defining random preference functions). Initial experiments used the same category of interest for all User agents, but more than one category of interest can be used if required.

Fig. 2. An example configuration of information-sharing agents
4
Results
Figure 3 shows the progress of a single experiment, involving thirty-two User agents. The number of Scout agents in each Environment changes over time due to the migration of Scout agents between Environments, as well as being due to the evolution of Scout agent genomes. The evolutionary algorithms are executed in real time by the parallel operations of all the User agents. For this reason results are shown against CPU time. It is clear that one Environment in particular becomes the most popular, attracting the vast majority of Scout agents in the system. This distribution of Scout agents, with few agents in many Environments, and many in few, is the result of selection of Scout preferences for Environments based on interactions between Scout agents during the Sharing phase of their life cycle. This grouping of Scout agents could then be used to support more effective information exchange among the User agents in the system than was possible at the start of the experiment. It indicates how this evolutionary approach could be useful in facilitating information interactions between the human users who have established such User agents. Figure 4 shows how the phenomenon shown in Figure 3 occurs. Over time average Scout agent success increases, because the independent evolutionary algorithms converge to common Environmental preferences. This increases the
Fig. 3. Environment population over time – 32 User agents
Fig. 4. Average Scout agent success over time – 32 User agents
Scout agent population density in certain Environments and hence increases Scout success. If more User agents (and hence more Environments and more Scout agents) are involved, the system takes longer to evaluate where the appropriate Scout agents are, but still identifies them in the end. In Figure 5 we show the results of multiple runs of the algorithm designed to calculate the average (mean) value of
Scout agent success. This is calculated after each 1 minute of CPU time has been used per User agent. We are interested to see whether use of the evolutionary algorithm has an effect on the average success of Scout agents. Figure 5 shows that this occurs, in that average Scout agent success after one minute of CPU time is greater than the initial value (of zero). If the evolutionary algorithm is not used, so Scout agents have uniform preferences, average Scout agent success after one minute of CPU time, although non-zero, is constant irrespective of the number of User agents involved. If the evolutionary algorithm is used (represented by evolved preferences in the Figure), it is interesting that the average Scout agent success actually increases with the number of User agents, before declining at higher numbers of User agents. This suggests the benefit that the use of evolutionary techniques can offer among populations of agents in multi-agent systems, but also implies that very high numbers of User agents may make it more difficult for successful interactions between Scout agents to arise.
Fig. 5. Average Scout agent success after one minute CPU time per User agent (evolved preference vs. uniform preference)
The results in Figure 3 show that the evolution of Scout agent preferences for Environments can support convergence of many Scout agents to a single preferred Environment. When larger numbers of User agents are spread across more Environments, evolution of Scout agent preferences may result in several Environments supporting significant numbers of Scout agents in the long term (Figure 6). This does not indicate a failure to evolve to a sufficiently preferred solution: comparing the changes in average Scout success over time for this higher number of User agents with Figure 4 shows a similar change in
Fig. 6. Environment population over time – 128 User agents
Fig. 7. Average Scout agent success over time – 128 User agents
success results although the final average is different (see Figure 7). In fact one Environment in Figure 6 ends up with significantly more Scout agents than all the others after the algorithm is run for some time. But this does not eliminate the several Environments that maintain persistent populations of Scout agents
at somewhat lower levels. This is an inevitable consequence of the use of a random walk by Scout agents to locate Environments.
5
Discussion
The experiments in evolving group formation that we have implemented using the DIET platform suggest that evolving agent preferences may be a useful means to tackle information management problems. Starting from a random initial assembly of users, agents quickly co-locate according to the interests of their respective User agents. This facilitates more rapid and effective communications between User agents representing human users with common interests, and so shows the potential for application to more general peer-to-peer collaborative networks [18]. The experiments presented here are designed such that many User agents represent similar interests, but it would be possible to develop alternative scenarios where very many different interests were represented, and Scout agents spread out over many Environments. While the results given above show convergence of the majority of Scout agents in the system to a single preferred Environment, it is likely that Scout agents will encounter a variety of Environments during exploration. The coexistence of agents in multiple Environments may provide additional robustness, since the loss of specific machines hosting some Environments is unlikely to eliminate all members of a specific agent community in a sufficiently large network. In addition agents persisting in such a diminished system will have the capability to evolve their preferences so as to adapt to the remaining Environments. In fact, because users for the Scouts that converged on a specific Environment that has just disappeared, all have similar evolved preferences, their Scouts are likely to quickly converge on an Environment with a similar hashcode. The experiments described above implement agents in multiple Environments in parallel on a single machine. It would be worthwhile to investigate larger networks of Environments and User agents with diverse categories of interest. Accordingly, further experimentation is planned, using multiple computers connected in a Beowulf cluster [20]. This should help reduce artefacts arising from thread sharing, and also permit the construction of larger peer networks. The implementation of preference evolution on multiple machines may provide a means of using this algorithm for load balancing. This is because the use of multiple machines and hence system resources in parallel will provide the agents involved with the potential to evolve preferences and vary success at different rates in different machines. As a consequence, Scout agents will have the opportunity to switch between machines in order to improve their success rate in interacting with other Scout agents. While initial convergence of most Scout agents to a single Environment may result in a similar way to that in the experiments shown here, a consequence of this will be increased demands on one of the machines in the peer network. This will place constraints on the Environments hosted on that machine, restricting agent interactions. This may stimulate migration of Scout agents to other machines where system resources
are less heavily in demand. The consequence of this will be a contrasting pressure on Scout agents to disperse over multiple machines, a kind of implicit evolution driven by available system resources. This implicit evolution could be further used to develop groups of information sharing agents. The DIET platform provides the means to monitor the use of system resources by agents. Accordingly computational resource cost could be used to constrain the evolutionary algorithm so as to develop preferences appropriate to the machines (and/or resources) available at the time. In this way agents adopting computationally efficient behaviour can be evolved without explicit population management. Acknowledgements. This work was carried out as part of the DIET (Decentralised Information Ecosystems Technologies) project (IST-1999-10088), within the Universal Information Ecosystems initiative of the Information Society Technology Programme of the European Union. We thank the other participants in the DIET project, from the Departmento de Teoria de Se˜ nal y Communicaciones, Universidad Carlos III de Madrid, the Department of Electronic and Computer Engineering, Technical University of Crete, and the Intelligent and Simulation Systems Department, DFKI, for their comments and contributions. We also acknowledge the support of the Enterprise Venturing Programme of BTexact Technologies.
References 1. B¨ ack, T., Fogel, D., Michaelewicz, Z., eds.: Handbook of Evolutionary Computation. Institute of Physics (2000) 2. Bonsma, E., Shackleton, M., Shipman, R.: Eos: an evolutionary and ecosystem research platform. BT Technology Journal 18 (2002) 24–31 3. DIET project: web site. http://www.dfki.uni-kl.de/DIET (2001) 4. European Commission IST Future and Emerging Technologies: Universal information ecosystems proactive initiative. http://www.cordis.lu/ist/fethome.htm (1999) 5. Gallardo-Antol´ın, A., Navia-V´ azquez, A., Molina-Bulla, H., Rodr´ıguez-Gonz´ alez, A., Valverde-Albacete, F., Cid-Suerio, J., Figueiras-Vidal, A., Koutris, T., Xiruhaki, C., Koubarakis, M.: I-Gaia: an information processing layer for the DIET platform. In: Proc. 1st Int. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS2002). Volume 3., Bologna, Italy (2002) 1272–1279 6. Gordin, M., Sen, S., Puppala, N.: Evolving cooperative groups: preliminary results. In: Proc. of the AAAI-97 Workshop on Multi-Agent Learning. (1997) 7. Haynes, T., Sen, S., Schoenefeld, D., Wainwright, R.: Evolving a team. In: Proc. AAAI Fall Syposium on Genetic Programming, Cambridge, MA (1995) 8. Hoile, C., Wang, F., Bonsma, E., Marrow, P.: Core specification and experiments in DIET: A decentralised ecosystem-inspired mobile agent system. In: Proc. 1st Int. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS2002). Volume 2., Bologna, Italy (2002) 623–630 9. Jonker, C., Klusch, M., Treur, J.: Design of collaborative information agents. In: Proc. 4th Int. Workshop on Cooperative Information Agents. Number 1860 in LNAI, Berlin, Springer (2000)
10. Klusch, M.: Information agent technology for the internet: a survey. Journal of Data and Knowledge Engineering (2000) 11. Klusch, M., Sycara, K.: Brokering and matchmaking for coordination of agent societies: a survey. In Omicini, A., Zambonelli, F., Klusch, M., Tolksdorf, R., eds.: Coordination of Internet Agents: Models, Technologies and Applications, Berlin, Springer (2001) 12. Koubarakis, M., Tryfonopoulos, C., Raftopoulou, P., Koutris, T.: Data models and languages for agent-based textual information dissemination. In: Cooperative Information Agents 2002 (CIA-2002), Madrid (2002) 13. van Lengen, R., B¨ ahr, J.T.: Visualisation and debugging of decentralised information ecosystems. In: Proc. of Dagstuhl Seminar on Software Visualisation, Berlin, Springer (2001) 14. Marrow, P., Koubarakis, M., van Lengen, R., Valverde-Albacete, F., Bonsma, E., Cid-Suerio, J., Figueiras-Vidal, A., Gallardo-Antol´ın, A., Hoile, C., Koutris, T., Molina-Bulla, H., Navia-V´ azquez, A., Raftopoulou, P., Skarmeas, N., Tryfonopoulos, C., Wang, F., Xiruhaki, C.: Agents in decentralised information ecosystems: the DIET approach. In: Proc. of the AISB’01 Symposium on Information Agents for Electronic Commerce, York, UK (2001) 109–117 15. Matari´c, M.: Designing and understanding adaptive group behavior. Adaptive Behavior 4 (1995) 51–80 16. Moukas, A., Zacharia, G.: Evolving a multi-agent information filtering solution in amalthea. In: Proc. of Agents ’97. (1997) 17. National Institute of Standards and Technology: FIPS PUB 180-I. secure hash standard. http://www.itl.nist.gov/fipspubs/fip180-1.htm (2001) 18. Oram, A., ed.: Peer-to-peer: harnessing the power of disruptive technologies. O’Reilly Associates, Cambridge MA (2001) 19. Sarma, J., Jong, K.D.: Generation gap methods. In B¨ ack, T., Fogel, D., Michaelewicz, Z., eds.: Handbook of Evolutionary Computation, Bristol, Insititute of Physics (2000) 20. Sterling, T., Becker, D., Savarese, D., Durband, J., Ranawake, U., Packer, C.: Beowulf: a parallel workstation for scientific computation. In: Proc. 24th Int. Conf. on Parallel Processing. Volume 1. (1995) 11–14 21. Wang, F.: Self-organising communities formed by middle agents. In: Proc. 1st Int. Conf. on Autonomous Agents and Multi-Agent Systems (AAMAS2002). Volume 3., Bologna (2002) 1333–1339 22. Waring, R.: Ecosystems: fluxes of matter and energy. In Cherrett, J., ed.: Ecological Concepts, Oxford, Blackwell Scientific (1989)
Structuring Agents for Adaptation Sander van Splunter, Niek J.E. Wijngaards, and Frances M.T. Brazier Intelligent Interactive Distributed Systems Group, Faculty of Sciences, Vrije Universiteit Amsterdam, de Boelelaan 1081a, 1081HV, The Netherlands {sander,niek,frances}@cs.vu.nl http://www.iids.org/
Abstract. Agents need to be able to adapt to the dynamic nature of the environments in which they operate. Automated adaptation is an option that is only feasible if enough structure is provided. This paper describes a component-based structure within which dependencies between components are made explicit. An example of a simple web-page analysis agent is used to illustrate the structuring principles and elements.
1 Introduction Agents typically operate in dynamic environments. Agents come and go, objects and services appear and disappear, and cultures and conventions change. Whenever an environment of an agent changes to the extent that an agent is unable to cope with (parts of) the environment, an agent needs to adapt. Changes in the social environment of an agent, for example, may require new agent communication languages, or new protocols for auctions. In some cases an agent may be able to detect gaps in its abilities; but not be able to fill these gaps on its own (with, e.g., its own built-in learning and adaptation mechanisms). Adaptive agents are a current focus of research (e.g., see this book), but opinions on what ’adaptation’ constitutes differ. Sometimes reactive behaviour of an agent is dubbed ’adaptive behaviour’ [1] where an agent is, e.g., capable of abandoning a previous goal or plan and adopting a new goal or plan that fits its current situation better. In this paper, adaptation of an agent is used to refer to "structural" changes of an agent, including knowledge and facts available to an agent. External assistance may be needed to perform the necessary modifications, e.g. by an agent factory [2]. An adaptation process has a scope: a scope defines the extent to which parts of an agent are adapted. Research on agent adaptation can be categorised by distinguishing three specific scopes: adaptation of knowledge and facts; adaptation of the language with which an agent’s interface to the outside world is expressed (e.g., dependency on agent platform), and adaptation of an agent’s functionality. Research on adaptation of knowledge and facts of an agent is usually based on (machine) learning, e.g. [3]. Example applications include personification: an agent maintains and adapts a profile of its (human) clients, e.g. [4], [5] and [6], co-
ordination in multi-agent systems, e.g. [7] and [8], and situated learning for agents, e.g. [9]. Research on adaptation of the interface of an agent is usually concerned with adapting the agent’s interface to the (current) agent platform, e.g. see [10], [11]. Research on adapting an agent’s functionality is not commonly available. Agent creation tools are usually semi-automatic, providing a basis for developing an automated tool for agent adaptation, e.g. see AGENTBUILDER [12], D’AGENTS/AGENTTCL [13], ZEUS [14], and PARADE [15]. Computer assisted software engineering tools are usually not focussed on agents, and are less concerned with ’adaptivity’; see the discussion in Section 4 for a more detailed comparison. The approach taken in this paper focuses on automated adaptation of an agent’s functionality by means of an agent factory. An agent factory is an external service that adapts agents, on the basis of a well-structured description of the software agent. Our hypothesis is that structuring an agent makes it possible to reason about an agent’s functionality on the basis of its blueprint (that includes information about its configuration). This ability makes it possible to identify specific needs for change, defining the necessary input required to automatically reconfigure an agent. This approach is much akin to the knowledge-level approach to system design [16] in which the knowledge-level is distinguished from the symbol level. The agent factory presented in this paper relies on a component-based agent architecture described in Section 2. An example of the use of these component-based structures by (automated) adaptation of a simple web-page analysis agent is shown in Section 3. Section 4 discusses results of this approach.
2
Structure of Agents
The structure of an agent proposed in this paper is based on general software engineering, knowledge engineering and agent technology principles. Section 2.1 briefly discusses these principles. Section 2.2 describes the structuring principles used in this paper. The result is illustrated for a simple web analysis agent introduced in Section 2.3. 2.1
Structuring Principles
Intelligent agents are autonomous software programs that exhibit social, co-operative, and intelligent behaviour in distributed environments [17]. Modelling and implementing 'intelligent' agents are not only studied in Software Engineering, but also in Knowledge Engineering and Agent Research. Each of these research disciplines imposes its own structuring principles, often adopting principles of other research disciplines. In software engineering functional programming, e.g. [18], object-oriented programming, e.g. [19], and component-based programming, e.g. [20], [21] have a
concept of compositionality. The compositional structure in functional programming and component-based programming is based on processes and functions. The concept of compositionality in object oriented programming is that of objects that encapsulate data and methods. Each approach has its own merits, depending on characteristics of the domain for which a software system is designed. Re-use and maintenance are important aspects of all approaches (see e.g. [22]). All require a means to specify and retrieve appropriate software components [23]: by means of e.g. design patterns [24], annotation of software components [25], and annotation of web-services [26]. In knowledge engineering, common structuring principles have a process-oriented focus, in which the (problem solving / reasoning) processes are explicitly modelled and controlled, e.g. by approaches related to COMMONKADS [27] and DESIRE [28]. Methodologies including reasoning patterns and generic task models have been defined, facilitating re-use and maintenance. In intelligent agent research a wide variety of approaches are employed. Most common seem to be process (or task) oriented approaches, for which general agent models are defined, e.g. by INTERRAP [29], ZEUS [14], and DESIRE [28] An example of a common model is the BDI architecture, proposed by [30]. Each of the approaches described above employs a notion of compositionality and defines generic models or patterns. Reuse and maintenance are recognised as important endeavours, but as such, not often formalised nor automated. Current research on brokering for web-services focuses on annotations of services (roughly stated: software components). Annotations make architectural assumptions explicit, including assumptions about the nature of the components, the nature of the connectors, the global architectural structure, and the construction process [31]. 2.2
Agent Structuring Approach
For automated agent adaptation, an agent structuring approach is needed which facilitates reuse of existing components of an agent. This implies explication of not only the structure of reusable parts, but also the semantics, including assumptions and behaviour. The component-based approach proposed in this paper distinguishes components and data types (akin to data formats) [32] incorporating process-oriented and objectoriented approaches. Where process-oriented modelling approaches distinguish processes and information exchange explicitly, object-oriented approaches encapsulate data and methods in objects. In the approach proposed in this paper components are the ’active’ parts of an agent (akin to processes), and data types are the ’passive’ parts of an agent (akin to classes). This approach combines process-oriented and objectoriented approaches, building on knowledge-engineering and software-engineering research results. Components have an interface, describing their input and output data types, and slots within which component configuration is regulated. Data types represent types of data, and may have their own slots with which data type configuration is regulated. Slots define their interface and that which is expected from the component or data type
that is to be inserted. The addition of slots makes components not "black boxes" but "grey boxes"; components can thus be partial specifications. De Bruijn and van Vliet [33] even argue that, for reuse, a "black box" approach to components in component-based development is a dead end. The concept of slots helps to define the ‘static’ structure or architecture of an agent. Components and data types need to be matched to slots, determining, as a result, matches between, e.g., (replaceable) software components [34]. Co-ordination patterns and ontologies are distinguished to annotate configurations of components and data types. Annotations are essential for automation of the agent adaptation process. A co-ordination pattern is a temporal specification (e.g., see [28]) defining temporal relations and dependencies between processes when used within a specific task; it describes the flow of control and information for a group of components in the context of that task. An ontology describes concepts and the relations between them. Co-ordination patterns and ontologies may themselves be composed and are ultimately related to (annotations of) components and data types.
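As a rough illustration (a Python sketch under our own naming assumptions, not the agent factory's actual representation), a component can be modelled as a 'grey box' whose slots declare what they expect from the parts inserted into them:

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Slot:
    name: str
    expects: List[str]                        # annotations the inserted part must carry
    filler: Optional["Component"] = None

    def insert(self, component: "Component") -> None:
        # A slot accepts a part only if it satisfies the slot's expectations.
        missing = [p for p in self.expects if p not in component.annotations]
        if missing:
            raise ValueError(f"{component.name} lacks {missing}")
        self.filler = component

@dataclass
class Component:
    name: str
    inputs: List[str] = field(default_factory=list)     # input data types
    outputs: List[str] = field(default_factory=list)    # output data types
    annotations: List[str] = field(default_factory=list)
    slots: Dict[str, Slot] = field(default_factory=dict)

gam = Component("generic-agent-model",
                slots={"agent-specific-task-slot": Slot("agent-specific-task-slot", ["task"])})
web = Component("web-page-analysis", inputs=["URL", "search-term"],
                outputs=["relevance"], annotations=["task"])
gam.slots["agent-specific-task-slot"].insert(web)
print(gam.slots["agent-specific-task-slot"].filler.name)    # web-page-analysis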
2.3 An Example
To illustrate the role of structure in our approach a web-analyser agent is introduced; an agent that analyses websites for relevance, on demand. Given a URL of a website and a term, a web analyser agent determines the relevance of the website with regard to the given term. The agent uses simple analysis techniques: it counts the number of occurrences of the term on the pages at the given location. Three components of the agent are described to illustrate the agent's functionality and component configuration. The web-analyser agent's major structuring component is the generic-agent-model component [28]. The generic-agent-model component models an agent that can reason about its own processes, interact with and maintain information about other agents, interact with and maintain information about the external world, co-operate with other agents, and performs its own specific tasks. Figure 1 shows the compositional structure of the generic-agent-model component and its seven component slots. For each slot, the name of the component inserted into the slot is given. The further structure of the conceptual component web-page-analysis inserted in the agent-specific-task-slot of the conceptual generic-agent-model component is shown in Figure 2. The generic agent model can be used to structure both the conceptual and operational description of an agent. At operational level the components within the web-page-analysis component differ from the components in the conceptual description, as shown in Figure 3. The conceptual page-ranking-by-search-term component is implemented by the operational component configuration of the operational two-setenumerator and count-substring-in-string components.
Fig. 1. The generic-agent-model structure for the simple web analyser agent at conceptual level. Its seven component slots and the components inserted in them are:
– own-process-coordination-slot: beliefs-desires-intentions-commitments-handling
– cooperation-management-slot: cooperation-management-by-project-management
– agent-interaction-management-slot: default-agent-communication-management
– maintenance-of-agent-information-slot: default-agent-information-storage-and-retrieval
– world-interaction-management-slot: default-world-interaction-management
– maintenance-of-world-information-slot: default-world-information-storage-and-retrieval
– agent-specific-task-slot: web-page-analysis

Fig. 2. The structure of the web-page-analysis component at conceptual level:
– page-selection-slot: pages-to-be-analysed-determination
– page-analysis-slot: page-ranking-by-search-term
A rationale for this operational configuration is that a set of web pages needs to be ranked for one search term. The actual analysis process consists of counting the number of occurrences of the search term in a web page, i.e., counting the number of occurrences of a (small) string in another, larger, string (the web page).
Fig. 3. Structure of the web-page-analysis component at operational level:
– page-selection-slot: get-pages
– page-analysis-slot: two-set-enumerator, whose tuple-operation-slot contains count-substring-in-string
Co-ordination patterns are used to verify whether the configuration of components and data types exhibits the required behaviour, in this case receiving requests for web analysis, performing the requested web analysis, and returning results. A high-level co-ordination pattern for multiple job execution is applicable; a "job" is a "request for web analysis". In this specific case, a simple sequential job execution pattern suffices. This co-ordination pattern is shown in pseudo-code below: "tasks" are ordered in time, and need to be performed by the configuration proposed.
(1) collect jobs in job list
(2) select a job
(3) perform currently selected job
(4) remove currently selected job from job list
(5) go to (1)
The tasks shown in the co-ordination pattern may be directly mapped onto components, but this is not necessarily the case. Some of the tasks may involve a number of components. For example, the first task, collect jobs in job list, involves, from the perspective of the generic-agent-model component, components in two of its 'slots': agent-interaction-management for receiving web-analysis requests, and maintenance-of-agent-information for storing web-analysis requests. Another co-ordination pattern, collect items in existing item list, is needed to specify this task more precisely; its tasks, shown below, can be mapped directly onto the above-mentioned components.
(1a) obtain item
(1b) add obtained items to item list
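Read operationally, the sequential job execution pattern together with the sub-pattern (1a)-(1b) corresponds to a loop of the following form. This is a C++ sketch of our own, with illustrative stub functions standing in for the components actually involved:

#include <deque>
#include <iostream>
#include <string>

struct Job { std::string request; };          // a "job" is a request for web analysis

// Illustrative stubs standing in for other parts of the agent.
bool jobAvailable()           { static int pending = 3; return pending-- > 0; }
Job  obtainJob()              { return Job{"analyse http://example.org for 'agent'"}; }
void performJob(const Job& j) { std::cout << "performing: " << j.request << "\n"; }

int main() {
    std::deque<Job> jobList;
    for (int cycle = 0; cycle < 5; ++cycle) { // bounded here; open-ended in the pattern
        // (1) collect jobs in job list: (1a) obtain item, (1b) add it to the list
        while (jobAvailable())
            jobList.push_back(obtainJob());
        if (jobList.empty()) continue;
        // (2) select a job (here simply the oldest one)
        Job current = jobList.front();
        // (3) perform currently selected job
        performJob(current);
        // (4) remove it from the job list; (5) the loop then returns to (1)
        jobList.pop_front();
    }
}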
3
Adapting Structured Agents
This section describes how agents with a compositional structure as described in the previous section can be adapted. Section 3.1 introduces the adaptation process of the agent factory, and Section 3.2 describes the results of adapting a simple web-page analysis agent.
3.1 Adaptation in an Agent Factory
Agents are constructed from components and data types by an automated agent factory [2]. Adapting an agent entails adapting the configuration of its components and data types. Whether the need for servicing is detected by an agent itself, or by another agent (automated or human) is irrelevant in the context of this paper. The agent factory is based on three underlying assumptions: (1) agents have a compositional structure with reusable parts, (2) two levels of conceptualisation are used: conceptual and operational, (3) re-usable parts can be annotated and knowledge about annotations is available. An agent factory, capable of automatically building and adapting an agent, combines knowledge of its domain (i.e., adapting intelligent agents), its process (i.e., adaptation processes), and the combination of these (i.e., adapting intelligent agents). Needs for adaptation are qualified to express preference relations among needs, and refer to properties of an agent. Needs may change during the adaptation process, e.g. conflicting needs may be removed. An adaptation process starts with a blueprint of an agent, and results in a (modified) blueprint of the agent; a process similar to re-design processes. In re-design, an initial artefact is modified according to new, additional, requirements. An existing model of
180
S. van Splunter, N.J.E. Wijngaards, and F.M.T. Brazier
re-design [35] has been used to model the adaptation process. Models and systems for re-design make use of the structure of their artefacts; the same holds for the adaptation process. Strategic knowledge is required to 'guide' the adaptation process, both in deciding which goals to pursue and how to tackle a goal; goals take the form of adaptation foci. An adaptation focus consists of the following categories of elements of the agent:
– needs that are taken into account, e.g. needs that relate to a specific facet (task or function, data, behaviour) of part of the agent,
– levels of conceptualisation,
– components,
– data types,
– co-ordination patterns and their mappings,
– ontologies and their mappings,
– annotations.
The component-based adaptation approach presented in this paper is similar to design-as-configuration, e.g., as described in [8], which focuses on constructing a satisfactory configuration of elements on the basis of a given set of requirements (also named: constraints). Examples of design-as-configuration are described in COMMONKADS [27] and an analysis of elevator configuration systems [36].
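To give an impression of the kind of bookkeeping this involves, an adaptation focus could be represented roughly as in the following C++ sketch; the field names are ours, not those of the agent factory:

#include <string>
#include <vector>

enum class Level { Conceptual, Operational };

// A qualified need for adaptation, referring to a property of the agent.
struct Need {
    std::string property;   // e.g. "two qualities of service for page analysis"
    std::string facet;      // task or function, data, or behaviour
    int priority;           // expresses the preference relation among needs
};

// An adaptation focus delimits the part of the agent and of the design
// space currently considered when deciding which goal to pursue and how
// to tackle it.
struct AdaptationFocus {
    std::vector<Need> needs;                        // needs taken into account
    std::vector<Level> levels;                      // levels of conceptualisation
    std::vector<std::string> components;
    std::vector<std::string> dataTypes;
    std::vector<std::string> coordinationPatterns;  // and their mappings
    std::vector<std::string> ontologies;            // and their mappings
    std::vector<std::string> annotations;
};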
3.2 Adaptation Results
Assume in the example introduced in Section 2.3 that the owner of the web analyser agent has decided that she wants to be able to acquire a higher level of service for those sites for which she is known to be a preferred client (and the standard quality of service for those sites for which this is not the case). The (new) requirements for the web-analyser agent are that:
– the agent shall have two levels of quality of service for assessing relevance of web pages;
– the agent shall employ other analysis methods in addition to its analysis based on a single search term: analysis involving synonyms is a better quality of service than analysis involving a single search term;
– the agent shall maintain a list of those sites for which its client is a preferred client;
– the agent shall be informed about a site's preferred clients;
– a co-ordination pattern shall relate a client's request to a preferred quality of service.
The resulting blueprint is described in this section by focusing on the changes within the conceptual agent-specific-task component, the most constrained component of the agent. Other components and data types are not shown in this description. In one of the libraries of components and data types, an alternative web-page-analysis component is found which has a slot for query expansion, shown in Figure 4. The alternative quality of service for web-page analysis consists of expanding a search term into a set of synonyms with which web-pages are analysed. The slots of this component can be filled with components used in the original web-page-analysis
component. One new component needs to be found, to 'expand' a search term. A component that uses a synonym database qualifies, and is used.

Fig. 4. Component extended-web-page-analysis at conceptual level:
– page-selection-slot: pages-to-be-analysed-determination
– query-expansion-slot: synonym-determination
– page-analysis-slot: page-ranking-by-multiple-search-terms
This extended-web-page-analysis component is parameterised, i.e., the level of query expansion can be specified explicitly. This property makes it possible to provide both required qualities of service: one quality of service with extended query expansion, the other with no query expansion at all. The existing web-page-analysis component can be replaced by the extended-web-page-analysis component. An additional component is needed to determine the applicable quality of service, as shown in Figure 5.

Fig. 5. Component method-determination at conceptual level, with two slots: method-determination-slot and subtask-slot.
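The synonym-determination component filling the query-expansion-slot could, for example, behave along the lines of the following sketch; the synonym database and the interpretation of the expansion level are our own assumptions:

#include <map>
#include <string>
#include <vector>

// Expand a search term into at most `level` synonyms taken from a synonym
// database; level 0 corresponds to no query expansion at all, so the
// standard quality of service is obtained with the same component.
std::vector<std::string> synonymDetermination(
        const std::string& term,
        const std::multimap<std::string, std::string>& synonymDb,
        std::size_t level) {
    std::vector<std::string> expanded{term};
    auto range = synonymDb.equal_range(term);
    for (auto it = range.first; it != range.second && level > 0; ++it, --level)
        expanded.push_back(it->second);
    return expanded;
}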
The resulting component configuration within agent-specific-task is shown in Figure 6.

Fig. 6. The agent-specific-task-slot contains a conceptual component for selecting a quality of service of web-page analysis: the slot now contains method-determination, whose method-determination-slot contains quality-of-service-determination and whose subtask-slot contains extended-web-page-analysis (configured as in Figure 4).
The same high-level co-ordination pattern is used as described in Section 2.3; however, the third task, perform currently selected job, has been replaced by a different (sub)co-ordination pattern
which involves the choice of a specific quality of service. This co-ordination pattern is shown below:
(3a) prepare for current job
(3b) plan work for current job
(3c) perform planned work for current job
(3d) finish current job
The main change is the presence of the second sub-task, plan work for current job, which is related to the method-determination component. The changes in the operational configuration of components and data types for the resulting agent are comparable to those needed for the conceptual configuration of components and data types.
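The quality-of-service-determination component inserted in the method-determination-slot can be pictured as follows; this is a sketch under our own assumptions about how the list of preferred-client sites is represented:

#include <set>
#include <string>

// Map a client's request onto a level of query expansion: sites for which
// the client is known to be a preferred client obtain the higher quality
// of service (analysis with synonyms), all others the standard one.
std::size_t qualityOfServiceDetermination(
        const std::string& site,
        const std::set<std::string>& preferredClientSites) {
    const std::size_t extendedExpansion = 5;   // illustrative value
    const std::size_t noExpansion = 0;
    return preferredClientSites.count(site) ? extendedExpansion : noExpansion;
}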
4
Discussion
Agents can be adapted by services such as an agent factory. Automated adaptation of software agents is a configuration-based process requiring explicit knowledge of strategies to define and manipulate adaptation foci. Automated agent adaptation becomes feasible when the artefact is structured, as demonstrated in a number of prototypes. A compositional approach is taken to structure the agent: components and data types can be configured to form an agent, together with co-ordination patterns and ontologies which describe the agent's behaviour and semantics. A simple web-page analysis agent has been used to illustrate the agent structuring and the adaptation process needed to adapt an agent's functionality. An example of the use of an agent factory for the adaptation of the external interface of an agent (a less complex endeavour) is described in [11]. For agents that migrate in an open, heterogeneous environment, generative migration minimally entails adapting an agent's wrapper. It may, however, also involve rebuilding an agent with different operational components and data types (e.g., in a different code base). Four different scenarios for generative migration have been distinguished: homogeneous, cross-platform, agent-regeneration, and heterogeneous migration. Migration is categorised with respect to combinations of variation of (virtual) machines and agent platforms. The structuring of agents proposed in this paper is similar to earlier work from the 1980s, in which an automated software design assistant was developed [37]. To facilitate automated derivation of a structural design from a logical model of a system, a modular structure was assumed, with the explicit property that independent modules are clearly understood, together with explicit dependencies between modules. In their approach, a logical description of processes is modularised. This technique has proven useful, on the one hand, for grouping functionality and tasks into components and co-ordination patterns, and, on the other hand, for grouping needs for adaptation. The Programmer's Apprentice [38], from the same period, aims to provide intelligent assistance in all phases of the programming task: an interactive tool that may relieve a programmer of routine tasks. By using 'clichés', patterns of code, the system can 'understand' parts of code. This work is not based on components, but on
individual programming statements, which is a major difference from our work. A number of the processes involved in the Programmer's Apprentice are of relevance to the adaptation process. A related semi-automated approach, KIDS [39], derives programs from formal specifications. In this approach, users interactively transform a formal specification into an implementation; this is mainly used for algorithm design. The principles apply to our approach for, e.g., adapting an operational configuration of components and data types on the basis of an already adapted conceptual configuration of components and data types. The adaptation approach taken in this paper is similar to approaches such as IBROW [40]. IBROW supports semi-automatic configuration of intelligent problem solvers. Their building blocks are 'reusable components', which are not statically configured, but dynamically 'linked' together by modelling each building block as a CORBA object. The CORBA object provides a wrapper for the actual implementation of a reusable component. The Unified Problem-solving Method development Language (UPML) [41] has been proposed for the conceptual modelling of the building blocks. Our approach differs in a number of aspects, including the absence of commitments to specific conceptual or operational languages and frameworks, the use of two levels of conceptualisation, and a completely automated (re-)configuration, i.e. (re-)design, process. The design of an agent within the agent factory is based on configuration of components and data types. Components and data types may include cases and partial (agent) designs (cf. generic models / design patterns). This approach is related to design patterns (e.g., [24], [42], [43]) and libraries of software with specific functionality (e.g., problem-solving models [27] or generic task models [28]). The adaptation process uses strategic knowledge to explore the design space of possible configurations with the aim of satisfying the needs for adaptation. Alternative approaches may expand all configurations of (some) components and data types when insufficient knowledge on their (non-)functional properties is available [44]. Module Interconnection Languages [45] explicitly structure interfaces and relations between components, providing a basis for matching descriptions of components and slots [34]. Approaches for the semantic web and for the annotation of web services play an important role in representing and reasoning about annotations [26]. The QUASAR project focuses on (semi-)automation of the generation and provision of implementations of software architectures. A reference architecture is derived based on functional requirements, after which modifications are made for non-functional requirements. Their approach is based on top-down software architecture decomposition, where choices are based on Feature-Solution graphs, which link requirements to design solutions [46], [47]. Future research focuses on augmenting our prototype and analysing annotations in the context of semantic web research.
Acknowledgements. The authors wish to thank the graduate students Hidde Boonstra, David Mobach, Oscar Scholten and Mark van Assem for their explorative work on the application of an agent factory to an information-retrieving agent. The authors are also grateful for the support provided by Stichting NLnet, http://www.nlnet.nl/.
References

1. Rus, D., Gray, R.S., Kotz, D.: Autonomous and Adaptive Agents that Gather Information. In: Proceedings of the AAAI'96 International Workshop on Intelligent Adaptive Agents (1996) 107–116
2. Brazier, F.M.T., Wijngaards, N.J.E.: Automated Servicing of Agents. AISB Journal 1 (1), Special Issue on Agent Technology (2001) 5–20
3. Kudenko, D., Kazakov, D., Alonso, E.: Machine Learning for Multi-Agent Systems. In: Plekhanova, V. (ed.): Intelligent Agents Software Engineering. Idea Group Publishing (2002)
4. Bui, H.H., Kieronska, D., Venkatesh, S.: Learning Other Agents' Preferences in Multiagent Negotiation. In: Proceedings of the National Conference on Artificial Intelligence (AAAI-96) (1996) 114–119
5. Soltysiak, S., Crabtree, B.: Knowing me, knowing you: Practical issues in the personalisation of agent technology. In: Proceedings of the Third International Conference on the Practical Applications of Intelligent Agents and Multi-Agent Technology (PAAM98), London (1998) 467–484
6. Wells, N., Wolfers, J.: Finance with a personalized touch. Communications of the ACM, Special Issue on Personalization 43:8 (2000) 31–34
7. Schaerf, A., Shoham, Y., Tennenholtz, M.: Adaptive Load Balancing: A Study in Multi-Agent Learning. Journal of Artificial Intelligence Research 2 (1995) 475–500
8. Stefik, M.: Introduction to Knowledge Systems. Morgan Kaufmann Publishers, San Francisco, California (1995)
9. Reffat, R.M., Gero, J.S.: Computational Situated Learning in Design. In: Gero, J.S. (ed.): Artificial Intelligence in Design '00. Kluwer Academic Publishers, Dordrecht (2000) 589–610
10. Brandt, R., Hörtnagl, C., Reiser, H.: Dynamically Adaptable Mobile Agents for Scaleable Software and Service Management. Journal of Communications and Networks 3:4 (2001) 307–316
11. Brazier, F.M.T., Overeinder, B.J., van Steen, M., Wijngaards, N.J.E.: Agent Factory: Generative Migration of Mobile Agents in Heterogeneous Environments. In: Proceedings of the 2002 ACM Symposium on Applied Computing (SAC 2002) (2002) 101–106
12. Reticular: AgentBuilder: An Integrated Toolkit for Constructing Intelligent Software Agents. Reticular Systems Inc., white paper edition. http://www.agentbuilder.com (1999)
13. Gray, R.S., Kotz, D., Cybenko, G., Rus, D.: Agent Tcl. In: Cockayne, W., Zypa, M. (eds.): Itinerant Agents: Explanations and Examples with CD-ROM. Manning Publishing (1997) 58–95
14. Nwana, H.S., Ndumu, D.T., Lee, L.C.: ZEUS: An Advanced Tool-Kit for Engineering Distributed Multi-Agent Systems. Applied AI 13:1/2 (1998) 129–185
15. Bergenti, F., Poggi, A.: A Development Toolkit to Realize Autonomous and Inter-Operable Agents. In: Proceedings of the Fifth International Conference on Autonomous Agents (Agents 2001), Montreal (2001) 632–639
16. Newell, A.: The Knowledge Level. Artificial Intelligence 18:1 (1982) 87–127
17. Jennings, N.R., Wooldridge, M.J.: Applications of Intelligent Agents. In: Jennings, N.R., Wooldridge, M.J. (eds.): Agent Technology: Foundations, Applications, and Markets. Springer-Verlag, Heidelberg, Germany (1998) 3–28
18. Kernighan, B.W., Ritchie, D.M.: The C Programming Language. 2nd edn. Prentice Hall Software Series (1988)
19. Booch, G.: Object Oriented Design with Applications. Benjamin/Cummings Publishing Company, Redwood City (1991)
20. Hopkins, J.: Component primer. Communications of the ACM 43:10 (2000) 27–30
21. Sparling, M.: Lessons learned through six years of component-based development. Communications of the ACM 43:10 (2000) 47–53
22. Biggerstaff, T., Perlis, A. (eds.): Software Reusability: Concepts and Models. Vol. 1. ACM Press, New York (1997)
23. Moormann Zaremski, A., Wing, J.M.: Specification Matching of Software Components. ACM Transactions on Software Engineering and Methodology (TOSEM) 6:4 (1997) 333–369
24. Gamma, E., Helm, R., Johnson, R., Vlissides, J.: Design Patterns: Elements of Reusable Object-Oriented Software. Addison Wesley Longman, Reading, Massachusetts (1994)
25. Shaw, M., DeLine, R., Klein, D.V., Ross, T.L., Young, D.M., Zelesnik, G.: Abstractions for Software Architecture and Tools to Support Them. Software Engineering 21:4 (1995) 314–335
26. Ankolekar, A., Burstein, M., Hobbs, J.R., Lassila, O., McDermott, D., Martin, D., McIlraith, S.A., Narayanan, S., Paolucci, M., Payne, T., Sycara, K.: DAML-S: Web Service Description for the Semantic Web. In: Proceedings of the First International Semantic Web Conference (ISWC 02) (2002)
27. Schreiber, G., Akkermans, H., Anjewierden, A., de Hoog, R., Shadbolt, N., van de Velde, W., Wielinga, B.: Knowledge Engineering and Management: the CommonKADS Methodology. MIT Press (2000)
28. Brazier, F.M.T., Jonker, C.M., Treur, J.: Principles of Component-Based Design of Intelligent Agents. Data and Knowledge Engineering 41 (2002) 1–28
29. Müller, J.P., Pischel, M.: The Agent Architecture InteRRaP: Concept and Application. Technical Report RR-93-26, DFKI, Saarbrücken (1993)
30. Rao, A.S., Georgeff, M.P.: Modeling rational agents within a BDI architecture. In: Fikes, R., Sandewall, E. (eds.): Proceedings of the Second Conference on Knowledge Representation and Reasoning. Morgan Kaufmann (1991) 473–484
31. Garlan, D., Allen, R., Ockerbloom, J.: Architectural Mismatch, or, Why it's hard to build systems out of existing parts. In: Proceedings of the 17th International Conference on Software Engineering, Seattle, Washington (1995) 179–185
32. van Vliet, H.: Software Engineering: Principles and Practice. 1st edn. John Wiley & Sons (1993)
33. de Bruin, H., van Vliet, H.: The Future of Component-Based Development is Generation, not Retrieval. In: Crnkovic, I., Larsson, S., Stafford, J. (eds.): Proceedings ECBS'02 Workshop on CBSE – Composing Systems from Components, Lund, Sweden, April 8–11 (2002)
34. Moormann Zaremski, A., Wing, J.M.: Specification Matching of Software Components. ACM Transactions on Software Engineering and Methodology 6:4 (1997) 333–369
35. Brazier, F.M.T., Wijngaards, N.J.E.: Automated (Re-)Design of Software Agents. In: Gero, J.S. (ed.): Proceedings of the Artificial Intelligence in Design Conference 2002. Kluwer Academic Publishers (2002) 503–520
36. Schreiber, A.Th., Birmingham, W.P. (eds.): Special Issue on Sisyphus-VT. International Journal of Human-Computer Studies (IJHCS) 44:3/4 (1996) 275–280
37. Karimi, J., Konsynski, B.R.: An Automated Software Design Assistant. IEEE Transactions on Software Engineering 14:2 (1988) 194–210
38. Rich, C., Waters, R.C.: The Programmer's Apprentice: A Research Overview. IEEE Computer 21:11 (1988) 10–25
39. Smith, D.R.: KIDS: A Semi-automatic Program Development System. IEEE Transactions on Software Engineering 16:9 (1990) 1024–1043
40. Motta, E., Fensel, D., Gaspari, M., Benjamins, V.: Specifications of Knowledge Component Reuse. In: Proceedings of the 11th International Conference on Software Engineering and Knowledge Engineering (SEKE-99), Kaiserslautern, Germany (1999) 17–19
41. Fensel, D., Motta, E., Benjamins, V., Crubezy, M., Decker, S., Gaspari, M., Groenboom, R., Grosso, W., van Harmelen, F., Musen, M., Plaza, E., Schreiber, A., Studer, R., Wielinga, B.: The Unified Problem-Solving Method Development Language UPML. Knowledge and Information Systems 5:1, to appear (2002)
42. Peña-Mora, F., Vadhavkar, S.: Design Rationale and Design Patterns in Reusable Software Design. In: Gero, J., Sudweeks, F. (eds.): Artificial Intelligence in Design (AID'96). Kluwer Academic Publishers, Dordrecht, The Netherlands (1996) 251–268
43. Riel, A.: Object-Oriented Design Heuristics. Addison Wesley Publishing Company, Reading, Massachusetts (1996)
44. Kloukinas, C., Issarny, V.: Automating the Composition of Middleware Configurations. In: Proceedings of the 15th IEEE International Conference on Automated Software Engineering (2000) 241–244
45. Prieto-Diaz, R., Neighbors, J.M.: Module Interconnection Languages. Journal of Systems and Software 4 (1986) 307–334
46. de Bruin, H., van Vliet, H.: Quality-Driven Software Architecture Composition. Journal of Systems and Software, to appear (2002)
47. de Bruin, H., van Vliet, H.: Top-Down Composition of Software Architectures. In: Proceedings of the 9th Annual IEEE International Conference on the Engineering of Computer-Based Systems (ECBS), IEEE, April 8–11 (2000) 147–156
Stochastic Simulation of Inherited Kinship-Driven Altruism Heather Turner and Dimitar Kazakov Department of Computer Science, University of York, Heslington, York YO10 5DD, UK, [email protected], http://www-users.cs.york.ac.uk/~kazakov/
Abstract. The aim of this research is to assess the rôle of a hypothetical inherited feature (gene) promoting altruism between relatives as a factor for survival in the context of a multi-agent system simulating natural selection. Classical Darwinism and Neo-Darwinism are compared, and the principles of the latter are implemented in the system. The experiments study the factors that influence the successful propagation of altruistic behaviour in the population. The results show that the natural phenomenon of kinship-driven altruism has been successfully replicated in a multi-agent system, which implements a model of natural selection different from the one commonly used in genetic algorithms and multiagent systems, and closer to nature.
1
Introduction
The aim of this research is to assess the rôle of a hypothetical inherited feature (gene) promoting altruism between relatives as a factor for survival. The two main goals are, firstly, to replicate the phenomenon of altruism, which has been observed in nature, and show that the proposed mechanism leads to altruistic individuals being selected by evolution. Secondly, the research aims to provide an implementation of a Multi-Agent System (MAS) employing a model of natural selection, which is different from the one commonly used in Computer Science [1], and, hopefully, closer to the one existing in nature. Altruism can be defined as selfless behaviour, action that will provide benefit to another at no gain to the actor himself, and possibly even to his detriment. In kinship-driven altruism, this behaviour is directed between individuals who are related. [2] introduces an analytical model, in which altruistic behaviour towards relatives is favoured by evolution, provided that the amount of help that an individual bestows on relatives of a given distance is appropriately measured. Both MASs [3] and Genetic Algorithms (GAs) [1] can be used effectively to simulate the interaction of a population that evolves over a period of time. A MAS allows study of the interactions at the level of the individual, while a GA is a better tool for generalisation over an entire population. In a GA, no distinction is made between individuals with the same genotype (i.e., inherited features), whereas in a MAS these are represented by different phenotypes, or sets
of observable characteristics resulting from the interaction of each genotype with the environment [4]. The use of MAS with large populations is limited by the requirement for extra resources to represent individual phenotypes. In a GA, the individual is anonymous, so there is no capacity to "zoom-in" on its behaviour, but in contrast, there is the possibility of scaling up to consider a much larger population, which may be statistically more relevant. The GA uses a fitness function to estimate how well each individual will fare in the future and uses this to influence the likelihood that they survive to subsequent generations. A MAS uses information about the current position of an individual in the environment, and taking into account its internal state, considered to be the cumulative result of its actions and experiences in the past, determines its actions. In a GA, the population size is fixed, and during each system cycle, individuals may be selected to mate (and be replaced by their descendants) or they pass to the next generation. The anonymity of each individual is suited to the probabilistic selection afforded to this algorithm, and the resulting possibility that clones of an individual be produced in future generations. Without this anonymity, in a system that 'tracks' the behaviour of individuals through the generations, complications could arise on cloning. Attachment of energy values becomes difficult if the probabilistic freedom is to be maintained without producing a system that can produce and destroy energy at will. In a MAS, the population size is not explicitly constrained, and the internal state of an individual determines its lifespan. A system cycle will not generally represent an entire generation, as individuals may survive for many cycles. Table 1 summarises the main differences between the GA and MAS models of natural selection.

Table 1. MAS vs. GA simulation of natural selection

Feature                        | MAS                                                            | GA
Representation of individuals  | genotype + phenotype                                           | genotype only
Survival of individuals        | deterministic, based on the lifetime interaction with the environment | probabilistic, based on genotype's fitness
Population size                | unlimited                                                      | fixed
Environment resources          | limited capacity                                               | use bounded by maximum population size
Preservation of energy         | enforced                                                       | not considered
We combine features of each approach to produce a more scalable, personality-driven system without a modelled spatial dimension. The probabilistic nature of all events and the high level of abstraction typical for the GA are preserved. However, the description of each individual consists of a set of
inherited features (genome) along with a—very abstract—description of the actual organism (phenotype). The internal state of each individual is changed by the interaction with a very simple, abstract, environment, in which both the selection of an individual’s action and its outcome are modelled as probabilistic functions. This permits us to avoid the use of an explicit fitness function, and instead describe the survival of an individual directly as a probabilistic function of its internal state (e.g., current energy levels). Our system is designed to simulate a population in which some of the individuals are carriers of a gene forcing them to share their energy with the individuals they meet in proportion to the degree of kinship (i.e., number of shared genes). The exact sharing policy is subjected to selection and studied. Food consumption results in increased individual energy level. Falling below a certain energy level means death. An encounter of two individuals of the same species could result in the creation of offspring if their energy levels are sufficient. The initial energy level of the offspring is subtracted from that of the parents. This research uses the hybrid platform described above to study from a different angle an extended version of some of the experiments with kinship-driven altruism performed by Barton [5].
2
Altruism and Darwinian Theory
The possible evolution of a selfless gene is an interesting area of study, as it does not necessarily seem intuitive that an individual should value the survival of another to the extent of causing detriment to itself (perhaps by decreasing its own chance of mating or survival) in order to help the other. This would be in contrast to the classic Darwinian theory of natural selection, according to which selfish individuals would always take the upper hand, and eliminate altruists, as the behaviour of the latter would by definition hinder their reproductive success. There is evidence, however, as Hamilton [6] illustrates, that many species in nature exhibit this altruistic trait. Neo-Darwinian theory [7] attempts to provide an explanation with the idea of 'inclusive fitness', and the hypothesis that natural selection works not at the level of an individual, but on each individual gene. Many individuals can carry copies of the same gene, and if these individuals could identify one another, it would be possible for them to aid in the process of natural selection over that gene by attempting to secure reproductive success and the passing of this gene to the next generation. The evidence provided by Hamilton suggests that nature has evolved to recognise that it is likely for close relatives to have similar genetic makeup. In Hamilton's model, the degree of kinship is quantified, and it can then be used to determine how much help an individual can bestow on a relative, at detriment to itself, and yet still be likely to benefit the inclusive fitness, the 'fitness' of the gene. Barton [5] used a MAS to model a population of individuals who behaved altruistically competing in an environment with a population of the same size that was not altruistic. His MAS used GA principles by associating genes with each individual in an attempt to find optimum solutions for variables used in
his simulations. In some of his experiments, it was the sharing population that prevailed, in others, the non-sharing population over-ran the environment. He quotes ‘Gause’s Competitive Exclusion Principle’, stating ‘no two species can coexist if they occupy the same niche’, and hypothesises that given the limitations of his simulated system, his competing populations are likely to ‘end up having the same, or very similar, niches’. In the MAS he uses, there are agents to represent food and the individuals of each population. The environment is represented on a grid with varying terrain that could restrict movement, or provide water as sustenance to fulfil ‘thirst,’ one of the ‘drives’ that describe the internal state of an agent in a given cycle. Each agent uses the values of its drives, its immediate surroundings and some deterministic rules to make life choices in each cycle.
3
Design
The system we have implemented to investigate altruistic behaviour combines features used in a MAS and those used in a GA. Rather than providing co-ordinates for the position of each individual in the system, we model encounters with food (energy) and other individuals probabilistically, reflecting the likelihood that these would occur in a given cycle. We do not constrain the population size, thus permitting easier comparisons with Barton's work [5]. We stem the growth of our population by increasing the probability of random death as the individual ages. The individuals in our implementation retain a portion of genetic material, encoding their behaviour and sharing policy, and thus allowing evolution of optimum policies. Each individual stores as its phenotype the value of its sex drive, its hunger (or energy level), its age and its probability of survival. These values are updated in each system cycle. Figure 1 outlines the proposed environmental interaction module of our system, where the individual boxes have the following functions:
Fig. 1. Simulation outline (boxes 1–15 are described below).
1. Make a payment of energy to the environment (energy expended to survive generation).
2. If all energy is used up, the individual dies.
3. Individual has 'died' and is therefore removed from the population.
4. Random death occurs with some probability for each individual (this probability increases exponentially with age).
5. Increase sex drive, and thus priority of reproduction.
6. Genetic material encodes a function to determine behaviour based on the values of the drives. This function produces "gambles" dictating how much, if any, of the available energy to expend in search of a mate or food.
7. The gamble for mating is 'paid' to the system.
8. Pairs selected at random from the mating pool are deemed to have 'met' with some probability. Each must satisfy certain energy requirements, and the pair must not be related as parent and child. The probability that they mate is set in proportion to their mating gambles and determines whether or not they actually produce offspring. On mating, new individuals are created from clones of the genetic material, and by resetting non-genetic parameters. Each parent contributes energy for sharing equally amongst the offspring. The clones undergo crossover, producing two children to be included in the population for the next cycle.
9. The sex drive of the individuals who mated successfully is reset.
10. The gamble for hunting/foraging (or food gamble) is 'paid' to the system.
11. A probability distribution based on the gamble determines how much energy an individual receives. For a gamble of zero, the probability that an individual receives any energy should be very low.
12. Energy level is increased by the amount of food found.
13. Pairs are further selected from the population, and with some probability are deemed to meet.
14. If the better fed of the pair is an altruist, they decide to share as per his genetically encoded sharing policy.
15. The energy of each individual is then adjusted as appropriate.
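Taken together, a single system cycle can be summarised by the C++ sketch below. It is heavily simplified: mating, the probabilistic pairing of individuals and the genetic operators are only indicated in comments, and all names and constants are ours rather than those of the actual implementation.

#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

std::mt19937 rng{42};

// Phenotype (internal state) of one individual; the genome additionally
// encodes the gambling and sharing policies (see the sketches below).
struct Individual {
    double energy   = 500;   // 'hunger' / current energy level
    double sexDrive = 0;
    int    age      = 0;
    bool   altruist = true;  // the hypothetical inherited altruism gene
};

// Random death occurs with a probability that grows exponentially with age
// (the constants are illustrative).
bool randomDeath(const Individual& ind) {
    double p = std::min(1.0, 0.001 * std::exp(0.05 * ind.age));
    return std::bernoulli_distribution(p)(rng);
}

void systemCycle(std::vector<Individual>& population, double paymentForLife) {
    for (auto& ind : population) {
        ind.energy -= paymentForLife;   // (1) payment for life
        ind.age += 1;
        ind.sexDrive += 1;              // (5) increase sex drive
    }
    // (2)-(4) remove individuals with no energy left or struck by random death
    population.erase(
        std::remove_if(population.begin(), population.end(),
                       [](const Individual& i) {
                           return i.energy <= 0 || randomDeath(i);
                       }),
        population.end());
    // (6)-(9)  compute and pay gambles, pair individuals at random and let
    //          them mate with a probability proportional to their mating
    //          gambles (omitted here)
    // (10)-(12) pay the food gamble and receive a stochastic food payoff
    //           (see the gambling sketch further below)
    // (13)-(15) pair individuals again; if the better-fed one is an altruist
    //           it shares energy according to its genetically encoded policy
}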
Gambling policies. The searches for a mate and food are modelled as stochastic processes, in which an individual spends (or "gambles") a certain amount of its energy, and receives a payoff from the environment (finds a mate or a given amount of food) with a certain probability. The functions described in Table 2 and displayed in Figure 2a are used to compute the food and mating gambles. The mating gamble is used as described above. The actual amount of food received from the environment is determined from the food gamble in the following way. Firstly, the sigmoid function from Equation 1 is used to compute the average number of units of food/energy µ that the effort represented by the food gamble will produce (see Figure 2b).

µ = max_food_payoff / (1 + e^(−0.025·(gamble − 200)))    (1)
Fig. 2. (a) Computing the food and mating gambles from the available energy. (b) Mapping food gambles to average food obtained. (c) Distribution of the amount of food received for a given average payoff µ.

Table 2. Computing gambles from the energy available

if Energy ≤ A:
    Food Gamble := 0
    Mating Gamble := 0
if A < Energy ≤ B:
    Food Gamble := tg β · (Energy − A)
    Mating Gamble := 0
if Energy > B:
    Food Gamble := tg β · (B − A) + tg(β − γ) · (Energy − B) = tg β · (Energy − A) − tg γ · (Energy − B)
    Mating Gamble := [tg β − tg(β − γ)] · (Energy − B) = tg γ · (Energy − B)
The actual amount of food is then generated at random according to a Gaussian distribution G(µ, σ) (Figure 2c), where the ratio σ/µ is kept constant for all µ to ensure that only a very small, fixed proportion of the payoffs are negative; these, when generated, were reset to zero. The parameters of the gambling function, that is, A, B, tg β and tg γ, are encoded in the genes of the individuals and are, therefore, subject to natural selection. The above discussion shows that in this simulation spatial phenomena (food discovery, encounter with another individual) are represented as random processes with a certain probability. It is worth noting that physical distance between individuals is ignored, and the encounter of each pair is equally probable. Similarly, the probability of finding food does not depend on the past actions of the agent, as would be the case if its co-ordinates were taken into account.
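A sketch of how the gambling function of Table 2 and the stochastic food payoff of Equation 1 can be implemented is given below; the constant σ/µ ratio and the other numeric values are illustrative assumptions, not the values used in the experiments.

#include <algorithm>
#include <cmath>
#include <random>

struct Gambles { double food = 0, mating = 0; };

// Gambling function of Table 2; A, B, tg beta and tg gamma are encoded in
// the genes of the individual and are therefore subject to selection.
Gambles computeGambles(double energy, double A, double B,
                       double tanBeta, double tanGamma) {
    Gambles g;
    if (energy > A)
        g.food = tanBeta * (energy - A);
    if (energy > B) {
        g.mating = tanGamma * (energy - B);
        g.food  -= tanGamma * (energy - B);
    }
    return g;
}

// Stochastic food payoff: Equation (1) gives the mean payoff, the actual
// amount is Gaussian with sigma/mu kept constant; negative draws are reset
// to zero.
double foodPayoff(double foodGamble, double maxFoodPayoff, std::mt19937& rng) {
    double mu = maxFoodPayoff / (1.0 + std::exp(-0.025 * (foodGamble - 200.0)));
    double sigma = 0.3 * mu;                     // sigma/mu = const (assumed 0.3)
    std::normal_distribution<double> payoff(mu, sigma);
    return std::max(0.0, payoff(rng));
}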
4
Experiments and Evaluation
The tool specified in the previous section was implemented in C++, and used to study the influence of several factors on the evolution of altruistic behaviour. In all cases, the evaluation assesses whether the hypothetical altruistic gene is selected by evolution, and studies the circumstances in which this happens.

Degree of kinship. Individuals may (1) have a complete knowledge of their genealogy (Royalty model), (2) estimate the degree of kinship according to the presence of some inherited visible indicators (Prediction), or (3) not have this information available (Unknown). The Royalty kinship recognition policy assumes that an individual knows its relatives and the degree to which they are related. Each individual keeps a record of its relatives up to two levels up and down the genealogical tree (see Figure 3). Instead of recording the actual relationship, relatives are grouped in two sets, according to whether on average they share 50% or 25% of their genes with the individual in question. The first group consists of parents and children, the second of grandparents, grandchildren, and siblings. Treating siblings in this way can be explained by the fact that individuals change partners in every generation, and, therefore, the vast majority of siblings are actually half-sibs, which is the case displayed in Figure 3. One peculiarity of our implementation is that when two individuals mate, they produce exactly two children, the chromosomes of which are produced from the parents' by crossover. This means that if one child inherits a copy of a gene from one parent, the other child will not have that gene, unless the other parent carried it. In any case, the likelihood of two individuals mating together on more than one occasion is negligible in larger populations, and the case of full-sibs is therefore discounted for simplicity in this implementation. Individuals who do not appear in either of the above groups of relatives are treated as being no relation at all. The Prediction kinship recognition policy assumes that all genes but one (coincidentally, the one identifying altruistic individuals) are visible in the phenotype. A simple linear metric is then used to measure the similarity between the visible parts of the genotypes of the two individuals.

Type of sharing function. Three social models are considered. Communism equalises the energy levels of two individuals with the same genome (see Figure 4). Progressive Taxation with a non-taxable allowance is a simple linear function with a threshold: y = α(x − θ) for x > θ; y = 0 otherwise. Poll Tax defines an altruistic act between two individuals as an exchange of a fixed amount of energy pt set in the genes of the donor, which does not depend on the energy level of either individual. The above descriptions correspond to the case of sharing between two individuals with the same set of genes. In all other cases, the actual amount given is reduced in proportion to the difference between the two individuals' genomes, as derived from the perceived degree of kinship.

All combinations of the above two factors have been studied by running each of the nine possible experiments three times (see Table 3).
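The three sharing functions can be sketched as follows in C++; the exact way in which the donation is scaled by the perceived degree of kinship is our own reading of the description above, and the names are illustrative.

#include <algorithm>

enum class SharingPolicy { Communism, ProgressiveTaxation, PollTax };

// Energy donated by the better-fed individual (energy x) to the individual
// it meets, before scaling by kinship; alpha, theta and pt are genes of the
// donor, following the three social models described above.
double nominalDonation(SharingPolicy policy, double x, double otherEnergy,
                       double alpha, double theta, double pt) {
    switch (policy) {
        case SharingPolicy::Communism:
            return (x - otherEnergy) / 2.0;              // equalise energy levels
        case SharingPolicy::ProgressiveTaxation:
            return x > theta ? alpha * (x - theta) : 0;  // y = alpha(x - theta)
        case SharingPolicy::PollTax:
            return pt;                                   // fixed amount
    }
    return 0;
}

// The amount actually given is reduced in proportion to the perceived degree
// of kinship (1.0 for an identical genome, 0.5 for parents/children, 0.25 for
// grandparents, grandchildren and siblings, 0 for non-relatives).
double actualDonation(SharingPolicy policy, double x, double otherEnergy,
                      double alpha, double theta, double pt, double kinship) {
    return kinship * std::max(0.0, nominalDonation(policy, x, otherEnergy,
                                                   alpha, theta, pt));
}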
Fig. 3. Average expected percentage of shared genes between relatives (50% for parents and children; 25% for grandparents, grandchildren and siblings).

Fig. 4. Sharing between identical twins: Communism.

Fig. 5. Evolution of population size.
All parameters of the sharing functions (α, θ), resp. pt, were initially set at random, and left to evolve. When employing the Unknown model of kinship, a rather optimistic assumption was made, under which the donor treated the aid receiver as a parent or child. The graphs in Table 3 are self-explanatory. In brief, the use of either perfect knowledge of the degree of kinship or a sharing function based on progressive taxation ensures that a substantial level of altruism is selected and maintained in the population. The population size remains the same in all cases, and is given by the amount of food supplied. A representative example of the way in which the population size evolved is shown in Figure 5 for the case of Royalty with Progressive Taxation.

Table 3. Percentage of altruistic individuals in the population (1=100%). (Columns, from left to right: Royalty, Prediction and Unknown models of kinship recognition. Rows, top to bottom: Communism, Progressive Taxation and Poll Tax sharing functions.)
Degree of altruism and availability of resources. In the experiments, all individuals carry a gene defining them as either selfish or altruistic. Simply counting the individuals carrying either gene is a good measure of the altruism in the population only in a communist society. In the other two cases, individuals which are nominally altruistic can have their sharing parameters set in a way which reduces the effects of altruism to an arbitrarily low level, e.g., α or pt → 0, θ → ∞. In these cases, the ratio of what is given to what is actually owned by the individual, integrated over the whole energy range, is considered a more appropriate measure. The idea in the case of progressive taxation is shown in Figure 6, where a nominally altruistic individual is assigned a degree of altruism given by the ratio of the filled triangle to the square made of the ranges of energy owned and exchanged.
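Under one possible reading of Figure 6 (our assumption: the 'energy given' axis runs over the same range [0, E_max] as the 'energy available' axis), this measure for a nominally altruistic individual with parameters α and θ becomes

a = α (E_max − θ)² / (2 E_max²),

i.e. the area of the filled triangle divided by that of the square; it approaches zero as α → 0 or θ → E_max, in line with the remark above.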
Changes in the level of resources available in the system will by definition have an effect on the carrying capacity (maximum population size) of the environment, and could be expected to cause variations in the system dynamics, and possibly the ability of the environment to support altruism. We ran several experiments to see how the degree of altruism in the system varies for the different sharing policies (note that this level does not change, and is considered equal to 100% for the Communist sharing policy, so the graphs are omitted) with different amounts of energy (resources) available. The graphs in Table 4 indicate that altruism tends to converge faster to a single stable level when more energy is provided by the environment.
Fig. 6. Measure of altruism (energy given as a function of energy available, for the progressive taxation parameters α and Θ).
Fig. 7. Percentage of altruists in the population with respect to initial levels (1=100%)
Table 4. Percentage of altruism (1=100%) evolving in the population as the sharing strategy and level of energy (resources) available are varied.
Initial ratio between altruistic and selfish individuals. To study the influence that the initial proportion of altruistic to selfish individuals has on the levels of altruism selected by evolution, the Royalty with Progressive Taxation experiment was run with several initial values for this ratio. The results in Figure 7 show that the system reaches a dynamic equilibrium which, in the cases shown, does not depend on the initial ratio.

Mutation. We conducted some experiments in which the rate of mutation in the system was varied. Although it was maintained at relatively low levels, variation was seen in the speed of convergence to a stable level of altruism and in the eventual level reached. The mutation rates were set at 0, 0.0005, 0.001, 0.0015, 0.002 and 0.0025, with the other variables fixed as follows: sharing function = Progressive Taxation, kinship-recognition policy = Royalty and Energy = 2.5M (see Table 5). At the lowest rates of mutation, there appears to be a greater variation in the evolved levels of altruism between runs of the experiment, making it difficult to draw conclusions about the rate of convergence. As the mutation rate increases, a more definite level of altruism is evolved, and the experimental populations converge faster to this level. It is unlikely that this trend would continue if the mutation rate increased much further, since, at some point, the high level of mutation is likely to override the effects of natural selection. (For the third chart, where the level of mutation is 0.001, note that it is just an extension of chart eight in Table 4: Progressive Taxation with 2.5M energy units, the same experimental setup, but run for three times as long.)
5
Discussion
Both goals of this research, as stated in Section 1, are successfully met. The proposed algorithm has been implemented, and altruism has, indeed, been shown to be selected and maintained by evolution in a number of cases. No direct comparison with Barton's work could be made, as his detailed results were not available in a suitable form. However, a few major points can be made. Firstly, it has been confirmed that the policy of Progressive Taxation produces more altruists than Communism. An additional policy (Poll Tax) was studied in this research, which also introduced the new dimension of 'knowledge of the degree of kinship' into the experimental setup. Unlike Barton's, these experiments produced populations of virtually the same size. Barton treats altruists and non-altruists as two different species, which in turn results in one species completely taking over the other. In our results, there are several cases in which a balance between altruists and selfish individuals is maintained. Altruism is a demonstration of the mechanisms on which natural selection is based. Note that this work does not aim to imply the existence of such a gene in reality, and indeed nothing said above would change if one assumed that altruistic behaviour is inherited not through a gene but through upbringing. There is interest in the use of natural selection in artificial societies. This research should bring the implementation of natural selection in artificial societies
Table 5. Effect of varying mutation rate on the percentage of altruism in the population (1=100%).
a step closer to the original mechanism that is copied. The authors' expectation is that natural selection incorporating altruism would be suitable in cases when the task is to produce an optimal population of agents rather than a single best individual, in situations when the knowledge about the performance of the population is incomplete and local. The software described here may also represent a useful tool for the simulation of natural societies and give an interesting insight into their inner workings, although this would be up to experts in the relevant fields to judge. The two main characteristics of the model of altruism discussed here, namely 'inherited' and 'kinship-driven', also mark the limits of its reach.
Firstly, the model does not allow changes in the altruistic behaviour of an individual within its lifespan. In fact, natural selection and individual learning are not perceived here as mutually exclusive. It is expected that, in many cases, a combination of the two could be a successful strategy, where natural selection provides the starting point for the individual behaviour, which is then modified according to the agent's personal experience. The actual technique employed at this second stage could be, for instance, based on game theory, with natural selection providing a suitable initial strategy. If individual behaviour is to be modified by a machine learning technique, natural selection could also provide it with a suitable bias. Research in this direction should be helped by the York MAS, currently under development, which supports natural selection among agents, as well as logic-based programming of behaviour and individual learning [8]. The second limitation of the model of altruism discussed here is that it does not cover the case in which agents can opt in and out of a society promoting altruism among its members at will. Since the names of many such societies draw analogies with kinship, e.g. 'fraternities' or 'sororities', in order to evoke the corresponding spirit of altruism (or 'brotherhood') in their members, the authors believe that the findings described in this paper would not be without relevance in this case either. In comparison with logic-based approaches, this research makes one simple initial assumption, and attempts to see if altruism can be worked out from first principles. The actual behaviour of agents can be deterministic (and described in logic) or stochastic; that should not be of principal importance. On the other hand, no further background knowledge is assumed here: the agent's rules of behaviour are left to evolve, and are not set in advance. In the future, comparisons with Hamilton's analytical model and with the evolutionary game-theory point of view would also be worth exploring.
6
Future Work
It would be interesting to extend the platform developed to implement different mating policies, so that pairs of individuals could be selected from a single mating pool or from separate mating pools into which individuals have previously been grouped according to their internal state: rich meet (mostly) rich, poor meet poor, individuals with high sexual drive are grouped together, etc. In addition to the impact of resource availability and rates of mutation, studied in this paper, another environmental parameter, the probability of meeting another individual, should be taken into account, and used to test the effectiveness of altruistic vs. selfish policies in various, and changing, environments. An important and, potentially, non-trivial issue is the analysis of the content of the individuals' sets of genes and their evolution in time. In the case when the propagation of all genes is subject to simultaneous selection, one would have to study data sets which are multidimensional (one dimension per locus plus an extra dimension representing time), and hence difficult to visualise. One could expect that there would be a correlation between the genes selected in each
locus, and that certain combinations might show a trend of dominating the population, which would then form clusters around those points. Methods and tools for multivariate data visualisation with a minimal loss of information, such as those described by Schröder and Noy [9], would be considered for the above task.

Acknowledgements. The second author wishes to express his gratitude to his wife María Elena and daughter Maia for being such a wonderful source of inspiration.
References

1. Goldberg, D.E.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley (1989)
2. Hamilton, W.D.: The genetical evolution of social behaviour I. Journal of Theoretical Biology 7 (1964) 1–16
3. Wooldridge, M., Jennings, N.: Intelligent agents: theory and practice. Knowledge Engineering Review 2 (1995)
4. Thompson, D. (ed.): The Oxford Compact English Dictionary. Oxford University Press (1996)
5. Barton, J.: Kinship-driven altruism in multi-agent systems. Project report for a degree in Computer Science, University of York. Project supervisor: Dimitar Kazakov (2001)
6. Hamilton, W.D.: The genetical evolution of social behaviour II. Journal of Theoretical Biology 7 (1964) 17–52
7. Watson, T.: Kin selection and cooperating agents. Technical report, Dept. of Computer Science, De Montfort University, Leicester (1995)
8. Kazakov, D., Kudenko, D.: Machine Learning and Inductive Logic Programming for Multi-Agent Systems. In: Multi-Agent Systems and Applications. LNAI 2086, Springer (2001) 246–270
9. Schröder, M., Noy, P.: Multi-agent visualisation based on multivariate data. In: Working Notes of the Fourth UK Workshop on Multi-Agent Systems (UKMAS-01) (2001)
10. Turner, H.: Stochastic simulation of inherited kinship driven altruism. Project report for a degree in Computer Science, University of York. Project supervisor: Dimitar Kazakov (2002)
Learning in Multiagent Systems: An Introduction from a Game-Theoretic Perspective José M. Vidal University of South Carolina, Computer Science and Engineering, Columbia, SC 29208 [email protected]
Abstract. We introduce the topic of learning in multiagent systems. We first provide a quick introduction to the field of game theory, focusing on the equilibrium concepts of iterated dominance, and Nash equilibrium. We show some of the most relevant findings in the theory of learning in games, including theorems on fictitious play, replicator dynamics, and evolutionary stable strategies. The CLRI theory and n-level learning agents are introduced as attempts to apply some of these findings to the problem of engineering multiagent systems with learning agents. Finally, we summarize some of the remaining challenges in the field of learning in multiagent systems.
1
Introduction
The engineering of multiagent systems composed of learning agents brings together techniques from machine learning, game theory, utility theory, and complex systems. A designer must choose carefully which machine-learning algorithm to use since otherwise the system's behavior will be unpredictable and often undesirable. Fortunately, we can use the tools from these areas in an effort to predict the expected system behaviors. In this article we introduce these techniques and explain how they are used in the engineering of learning multiagent systems. The goal of machine learning research is the development of algorithms that increase the ability of an agent to match a set of inputs to their corresponding outputs [7]. That is, we assume the existence of a large, sometimes infinite, set of examples E. Each example e ∈ E is a pair e = {a, b} where a ∈ A represents the input the agent receives and b ∈ B is the output the agent should produce when receiving this input. The agent must find a function f which maps A → B for as many examples of A as possible. In a controlled test the set E is usually first divided into a training set which is used for training the agent, and a testing set which is used for testing the performance of the agent. In some scenarios it is impossible to first train the agent and then test it. In these cases the training and testing examples are interleaved. The agent's performance is assessed in an ongoing manner. When a learning agent is placed in a multiagent scenario these fundamental assumptions of machine learning are violated. The agent is no longer learning
to extrapolate from the examples it has seen of the fixed set E; instead its target concept keeps changing, leading to a moving target function problem [10]. In general, however, the target concept does not change randomly; it changes based on the learning dynamics of the other agents in the system. Since these agents also learn using machine learning algorithms we are left with some hope that we might someday be able to understand the complex dynamics of these types of systems. Learning agents are most often selfish utility maximizers. These agents often face each other in encounters where the simultaneous actions of a set of agents lead to different utility payoffs for all the participants. For example, in a market-based setting a set of agents might submit their bids to a first-price sealed-bid auction. The outcome of this auction will result in a utility gain or loss for all the agents. In a robotic setting two agents on a collision course towards each other have to decide whether to stay the course or to swerve. Their combined actions directly determine the utilities the agents receive. We are solely concerned with learning agents that maximize their own utility. We believe that systems where agents share partial results or otherwise help each other can be considered extensions of traditional machine learning research.
2
Game Theory
Game theory provides us with the mathematical tools to understand the possible strategies that utility-maximizing agents might use when making a choice. It is mostly concerned with modeling the decision process of rational humans, a fact that should be kept in mind as we consider its applicability to multiagent systems. The simplest type of game considered in game theory is the single-shot simultaneous-move game. In this game all agents must take one action. All actions are effectively simultaneous. Each agent receives a utility that is a function of the combined set of actions. In an extended-form game the players take turns and receive a payoff at the end of a series of actions. A single-shot game is a good model for the types of situations often faced by agents in a multiagent system where the encounters mostly require coordination. The extended-form games are best suited to modeling more complex scenarios where each successive move places the agents in a different state. Many scenarios that at first appear to require an extended-form game can actually be described by a series of single-shot games. In fact, that is the approach taken by many multiagent systems researchers. In the one-shot simultaneous-move game we say that each agent i chooses a strategy si ∈ Si, where Si is the set of all strategies for agent i. These strategies represent the actions the agent can take. When we say that i chooses strategy si we mean that it chooses to take action si. The set of all strategies chosen by all the agents is the strategy profile for that game and it is denoted by s ∈ S ≡ S1 × · · · × SI, where I is the number of agents. Once all the agents make their choices and form the strategy profile s
        A     B
  A    1,2   3,4
  B    3,2   2,1
Fig. 1. Sample two-player game matrix. Agent 1 chooses from the rows and agent 2 chooses from the columns.
then each agent i receives a utility which is given by the function ui(s). Notice that a player's utility depends on the choices made by all the agents. Two-player games involve only two players, i and j. They are often represented using a game matrix such as the one shown in Figure 1. In that matrix we see that if agent 1 (the one who chooses from the rows) chooses action A and agent 2 chooses action B then agent 1 will receive a utility of 3 while agent 2 receives a utility of 4. Using our notation for strategies we would say that if the strategy profile is (s1, s2) then the payoff vector is (u1(s1, s2), u2(s1, s2)). It is possible that a player will choose randomly between its action choices, using different prior probabilities for each choice. These types of strategies are called mixed strategies and they are a probability distribution over an agent's actions. We say that a mixed strategy for agent i is σi ∈ Σi ≡ P(Si), where P(Si) is the set of all probability distributions over the set of pure strategies Si. Although a real agent cannot take a "mixed action", mixed strategies are useful abstractions since they allow us to model agents who might use some randomization subroutine to choose their action.
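As an illustration, the game of Figure 1 and the expected utility of a mixed strategy profile can be written down directly; the particular mixed strategies used below are arbitrary examples, not taken from the text.

```python
# Payoff functions u1 and u2 for the game of Figure 1, keyed by the strategy
# profile (s1, s2): agent 1 picks the row, agent 2 picks the column.
u1 = {('A', 'A'): 1, ('A', 'B'): 3, ('B', 'A'): 3, ('B', 'B'): 2}
u2 = {('A', 'A'): 2, ('A', 'B'): 4, ('B', 'A'): 2, ('B', 'B'): 1}

def expected_utility(u, sigma1, sigma2):
    """Expected value of payoff function u when the agents play the mixed
    strategies sigma1 and sigma2 (distributions over their pure strategies)."""
    return sum(p1 * p2 * u[(s1, s2)]
               for s1, p1 in sigma1.items()
               for s2, p2 in sigma2.items())

# Pure profile (A, B): agent 1 receives 3 and agent 2 receives 4, as in the matrix.
print(u1[('A', 'B')], u2[('A', 'B')])

# Arbitrary example of mixed strategies.
sigma1 = {'A': 0.25, 'B': 0.75}
sigma2 = {'A': 0.50, 'B': 0.50}
print(expected_utility(u1, sigma1, sigma2), expected_utility(u2, sigma1, sigma2))
```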
3
Solution Concepts
Much of the work in game theory has concentrated on the definition of plausible solution concepts. A solution concept tries to define the set of actions that a set of rational agents will choose when faced with a game. The most common assumptions are that the agents are rational, have common knowledge1 of the payoffs in the game matrix, and that they are intelligent enough to re-create the thought process that the mathematician went through to come up with the solution concept. As such, most solution concepts are geared towards an understanding of how smart, well-informed people would act. They are not necessarily meant to explain the behavior of machine-learning agents. Still, the fact that they provide the "best" solution makes them a useful tool.
1
Common knowledge about p means that everybody knows p, everybody knows that everybody knows p, and so on to infinity.
        A     B
  A    8,2   9,4
  B    1,2   3,1
Fig. 2. A game where agent 1's action B is dominated by A.
3.1
Iterated Dominance
The iterated dominance approach is to successively eliminate from consideration those actions that are worse than some other action, no matter what the other player does. For example, in Figure 2 we see a game where agent 1's action B is dominated by A. That is, no matter what agent 2 does, agent 1 should choose action A. Then, if agent 1 chooses action A, agent 2 should choose action B. Therefore, the solution strategy profile for this game is (A, B). Formally, we say that a strategy σi is strictly dominated for agent i if there is some other strategy σ̃i ∈ Σi for which ui(σ̃i, σ−i) > ui(σi, σ−i) for all σ−i, where σ−i is a set of strategies for all agents except i. Notice that the inequality sign is a greater-than. If we change that sign to a greater-than-or-equal then we have the definition for a weakly dominated strategy. There is no reason for a rational agent to choose a strictly dominated strategy. That is, there is no reason for an agent to choose σi when there exists a σ̃i which will give it a better utility no matter what the other agents do. Similarly, there is no reason for the agent to choose a weakly dominated strategy. Of course, this reasoning relies on the assumption that the agent can indeed determine the existence of a σ̃i. This assumption can be hard to justify in cases where the better strategy is a mixed strategy, so that the agent has an infinite number of possible strategies to verify, or in cases where the number of actions and agents is too large to handle. The iterated dominance algorithm consists of calculating all the strategies that are dominated for all the players, eliminating those strategies from consideration, and repeating the process until no more strategies are dominated. At that point it might be the case that only one strategy profile is left available. In this case that profile is the one all agents should play. However, in many cases the algorithm still leaves us with a sizable game matrix with a large number of possible strategy profiles. The algorithm then serves only to reduce the size of the problem.
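A minimal sketch of this algorithm for two-player games, restricted to pure strategies (domination by mixed strategies, mentioned above, is not checked); the payoff dictionaries below encode the game of Figure 2.

```python
def iterated_dominance(u1, u2, rows, cols):
    """Iterated elimination of strictly dominated pure strategies in a two-player
    game. u1 and u2 map (row_action, col_action) to the row and column player's
    payoffs respectively."""
    rows, cols = list(rows), list(cols)
    changed = True
    while changed:
        changed = False
        # Row player: row r is strictly dominated by r2 if u1[r2, c] > u1[r, c] for all c.
        for r in list(rows):
            if any(all(u1[(r2, c)] > u1[(r, c)] for c in cols) for r2 in rows if r2 != r):
                rows.remove(r)
                changed = True
        # Column player: column c is strictly dominated by c2 if u2[r, c2] > u2[r, c] for all r.
        for c in list(cols):
            if any(all(u2[(r, c2)] > u2[(r, c)] for r in rows) for c2 in cols if c2 != c):
                cols.remove(c)
                changed = True
    return rows, cols

# The game of Figure 2.
u1 = {('A', 'A'): 8, ('A', 'B'): 9, ('B', 'A'): 1, ('B', 'B'): 3}
u2 = {('A', 'A'): 2, ('A', 'B'): 4, ('B', 'A'): 2, ('B', 'B'): 1}
print(iterated_dominance(u1, u2, ['A', 'B'], ['A', 'B']))   # -> (['A'], ['B'])
```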
3.2 Nash Equilibrium
The Nash equilibrium solution concept is popular because it provides a solution where other solution concepts fail. The Nash equilibrium strategy profile is defined as σ̂ such that for all agents i it is true that there is no strategy better than σ̂i given that all the other agents take the actions prescribed by σ̂−i. Formally, we say that σ̂ is a Nash equilibrium strategy profile if for all i it is true that σ̂i ∈ BRi(σ̂−i), where BRi(s−i) is the best response for i to s−i. That is, given
that everyone else plays the strategy given by the Nash equilibrium, the best strategy for any agent is the one given by the Nash equilibrium. A strict Nash equilibrium states that σ̂i is strictly (i.e., greater than) better than any other alternative. It has been shown that every game has at least one Nash equilibrium, as long as mixed strategies are allowed. The Nash equilibrium has the advantage of being stable under single-agent deviations. That is, if the system is in a Nash equilibrium then no agent, working by itself, will be tempted to take a different action. However, it is possible for two or more agents to conspire together and find a set of actions which are better for them. This means that the Nash equilibrium is not stable if we allow the formation of coalitions. Another problem we face when using the Nash equilibrium is the fact that a game can have multiple Nash equilibria. In these cases we do not know which one will be chosen, if any. The Nash equilibrium could also be a mixed strategy for some agent while in the real world the agent has only discrete actions available. In both of these cases the Nash equilibrium is not sufficient to identify a unique strategy profile that rational agents are expected to play. As such, further studies of the dynamics of the system must be carried out in order to refine the Nash equilibrium solution. The theory of learning in games—a branch of game theory—has studied how simple learning mechanisms lead to equilibrium strategies.
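For pure strategies the definition can be checked by direct enumeration, as in the sketch below; mixed equilibria, on which the existence result relies, are not searched for here. The game is again the one of Figure 2.

```python
def pure_nash_equilibria(u1, u2, rows, cols):
    """All pure-strategy Nash equilibria of a two-player game: profiles (r, c)
    where r is a best response to c and c is a best response to r."""
    equilibria = []
    for r in rows:
        for c in cols:
            row_best = all(u1[(r, c)] >= u1[(r2, c)] for r2 in rows)
            col_best = all(u2[(r, c)] >= u2[(r, c2)] for c2 in cols)
            if row_best and col_best:
                equilibria.append((r, c))
    return equilibria

# The game of Figure 2 has the single pure-strategy equilibrium (A, B).
u1 = {('A', 'A'): 8, ('A', 'B'): 9, ('B', 'A'): 1, ('B', 'B'): 3}
u2 = {('A', 'A'): 2, ('A', 'B'): 4, ('B', 'A'): 2, ('B', 'B'): 1}
print(pure_nash_equilibria(u1, u2, ['A', 'B'], ['A', 'B']))   # [('A', 'B')]
```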
4
Learning in Games
The theory of learning in games studies the equilibrium concepts dictated by various simple learning mechanisms. That is, while the Nash equilibrium is based on the assumption of perfectly rational players, in learning in games the assumption is that the agents use some kind of algorithm. The theory determines the equilibrium strategy that will be arrived at by the various learning mechanisms and maps these equilibria to the standard solution concepts, if possible. Many learning mechanisms have been studied. The most common of them are explained in the next few sub-sections.
4.1 Fictitious Play
A widely studied model of learning in games is the process of fictitious play. In it agents assume that their opponents are playing a fixed strategy. The agents use their past experiences to build a model of the opponent’s strategy and use this model to choose their own action. Mathematicians have studied these types of games in order to determine when and whether the system converges to a stable strategy. Fictitious play uses a simple form of learning where an agent remembers everything the other agents have done and uses this information to build a probability distribution for the other agents’ expected strategy. Formally, for the two agent (i and j) case we say that i maintains a weight function ki : Sj → R+ .
The weight function changes over time as the agent learns. The weight function at time t is represented by k_i^t, which keeps a count of how many times each strategy has been played. When at time t − 1 opponent j plays strategy s_j^{t−1}, then i updates its weight function with

$$k_i^t(s_j) = k_i^{t-1}(s_j) + \begin{cases} 1 & \text{if } s_j^{t-1} = s_j, \\ 0 & \text{if } s_j^{t-1} \neq s_j. \end{cases} \qquad (1)$$

Using this weight function, agent i can now assign a probability to j playing any of its s_j ∈ S_j strategies with

$$\Pr_i^t[s_j] = \frac{k_i^t(s_j)}{\sum_{\tilde{s}_j \in S_j} k_i^t(\tilde{s}_j)}. \qquad (2)$$
Player i then determines the strategy that will give it the highest expected utility given that j will play each of its sj ∈ Sj with probability Pr_i^t[sj]. That is, i determines its best response to a probability distribution over j's possible strategies. This amounts to i assuming that j's strategy at each time is taken from some fixed but unknown probability distribution. Several interesting results have been derived by researchers in this area. These results assume that all players are using fictitious play. In [3] it was shown that the following two propositions hold.

Proposition 1. If s is a strict Nash equilibrium and it is played at time t then it will be played at all times greater than t.

Intuitively we can see that if the fictitious play algorithm leads all players to play the same Nash equilibrium then, afterward, they will increase the probability that all others are playing the equilibrium. Since, by definition, the best response of a player when everyone else is playing a strict Nash equilibrium is to play the same equilibrium, all players will play the same strategy at the next time. The same holds true for every time after that.

Proposition 2. If fictitious play converges to a pure strategy then that strategy must be a Nash equilibrium.

We can show this by contradiction. If fictitious play converges to a strategy that is not a Nash equilibrium then this means that the best response for at least one of the players is not the same as the convergent strategy. Therefore, that player will take that action at the next time, taking the system away from the strategy profile it was supposed to have converged to. An obvious problem with the solutions provided by fictitious play can be seen in the existence of infinite cycles of behaviors. An example is illustrated by the game matrix in Figure 3. If the players start with initial weights of k_1^0(A) = 1, k_1^0(B) = 1.5, k_2^0(A) = 1, and k_2^0(B) = 1.5 they will both believe that the other will play B and will, therefore, play A. The weights will then be updated to k_1^1(A) = 2, k_1^1(B) = 1.5, k_2^1(A) = 2, and k_2^1(B) = 1.5. Next time, both agents
        A     B
  A    0,0   1,1
  B    1,1   0,0
Fig. 3. A game matrix with an infinite cycle.
will believe that the other will play A, so both will play B. The agents will engage in an endless cycle where they alternately play (A, A) and (B, B). The agents end up receiving the worst possible payoff. This example illustrates the type of problems we encounter when adding learning to multiagent systems. While we would hope that the machine learning algorithm we use will be able to discern this simple pattern and exploit it, most learning algorithms can easily fall into cycles that are not much more complicated than this one. One common strategy for avoiding this problem is the use of randomness. Agents will sometimes take a random action in an effort to exit possible loops and to explore the search space. It is interesting to note that, as in the example from Figure 3, the loops the agents fall into often reflect one of the mixed strategy Nash equilibria for the game. That is, (.5, .5) is a Nash equilibrium for this game. Unfortunately, if the agents are synchronized, as in this case, the implementation of a mixed strategy could lead to a lower payoff. Games with more than two players require that we decide whether the agent should learn individual models of each of the other agents independently or a joint probability distribution over their combined strategies. Individual models assume that each agent operates independently while the joint distributions capture the possibility that the other agents' strategies are correlated. Unfortunately, for any interesting system the set of all possible strategy profiles is too large to explore—it grows exponentially with the number of agents. Therefore, most learning systems assume that all agents operate independently so they need to maintain only one model per agent.
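To make the update rule concrete, here is a minimal sketch of two fictitious-play agents on the game of Figure 3, using the initial weights from the example above; it reproduces the (A, A)/(B, B) cycle.

```python
# Fictitious play on the game of Figure 3. The game is symmetric: both agents
# get 1 on the profiles (A, B) and (B, A), and 0 otherwise.
payoff = {('A', 'A'): 0, ('A', 'B'): 1, ('B', 'A'): 1, ('B', 'B'): 0}
actions = ['A', 'B']

def best_response(weights):
    """Best response against the empirical distribution implied by the weights (eq. 2)."""
    total = sum(weights.values())
    probs = {a: w / total for a, w in weights.items()}
    return max(actions, key=lambda mine: sum(probs[other] * payoff[(mine, other)]
                                             for other in actions))

# Initial weights from the text: k_1^0(A) = k_2^0(A) = 1 and k_1^0(B) = k_2^0(B) = 1.5.
k1 = {'A': 1.0, 'B': 1.5}   # agent 1's counts of agent 2's past actions
k2 = {'A': 1.0, 'B': 1.5}   # agent 2's counts of agent 1's past actions

for t in range(6):
    a1, a2 = best_response(k1), best_response(k2)
    print(f"t={t}: agent 1 plays {a1}, agent 2 plays {a2}")
    k1[a2] += 1   # eq. (1): agent 1 records agent 2's action
    k2[a1] += 1   # eq. (1): agent 2 records agent 1's action

# The printed play alternates between (A, A) and (B, B), the cycle described above.
```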
4.2
Replicator Dynamics
Another widely studied model is replicator dynamics. This model assumes that the percentage of agents playing a particular strategy will grow in proportion to how well that strategy performs in the population. A homogeneous population of agents is assumed. The agents are randomly paired in order to play a symmetric game, that is, a game where both agents have the same set of possible strategies and receive the same payoffs for the same actions. The replicator dynamics model is meant to capture situations where agents reproduce in proportion to how well they are doing.
Formally, we let φ^t(s) be the number of agents using strategy s at time t. We can then define

$$\theta^t(s) = \frac{\phi^t(s)}{\sum_{s' \in S} \phi^t(s')} \qquad (3)$$

to be the fraction of agents playing s at time t. The expected utility for an agent playing strategy s at time t is defined as

$$u^t(s) \equiv \sum_{s' \in S} \theta^t(s')\, u(s, s'), \qquad (4)$$

where u(s, s') is the utility that an agent playing s receives against an agent playing s'. Notice that this expected utility assumes that the agents face each other in pairs and choose their opponents randomly. In the replicator dynamics the reproduction rate for each agent is proportional to how well it did on the previous step, that is,

$$\phi^{t+1}(s) = \phi^t(s)\bigl(1 + u^t(s)\bigr). \qquad (5)$$
Notice that the number of agents playing a particular strategy will continue to increase as long as the expected utility for that strategy is greater than zero. Only strategies whose expected utility is negative will decrease in population. It is also true that under these dynamics the size of a population will constantly fluctuate. However, when studying replicator dynamics we ignore the absolute size of the population and focus on the fraction of the population playing a particular strategy, i.e., θ^t(s), as time goes on. We are also interested in determining if the system's dynamics will converge to some strategy and, if so, which one. In order to study these systems using the standard solution concepts we view the fraction of agents playing each strategy as a mixed strategy for the game. Since the game is symmetric we can use that strategy as the strategy for both players, so it becomes a strategy profile. We say that the system is in a Nash equilibrium if the fraction of players playing each strategy is the same as the probability that the strategy will be played in a Nash equilibrium. In the case of a pure strategy Nash equilibrium this means that all players are playing the same strategy. An examination of these systems quickly leads to the conclusion that every Nash equilibrium is a steady state for the replicator dynamics. In the Nash equilibrium all the strategies have the same average payoff since the fraction of other players playing each strategy matches the Nash equilibrium. This fact can be easily proven by contradiction. If an agent had a pure strategy that would return a higher utility than any other strategy then this strategy would be a best response to the Nash equilibrium. If this strategy were different from the Nash equilibrium then we would have a best response to the equilibrium which is not the equilibrium, so the system could not be at a Nash equilibrium. It has also been shown [4] that a stable steady state of the replicator dynamics is a Nash equilibrium. A stable steady state is one that, after suffering from
a small perturbation, is pushed back to the same steady state by the system's dynamics. These states are necessarily Nash equilibria because if they were not then there would exist some particular small perturbation which would take the system away from the steady state. This correspondence was further refined by Bomze [1], who showed that an asymptotically stable steady state corresponds to a Nash equilibrium that is trembling-hand perfect and isolated. That is, the stable steady states are a refinement on Nash equilibria—only a few Nash equilibria can qualify. On the other hand, it is also possible that a replicator dynamics system will never converge. In fact, there are many examples of simple games with no asymptotically stable steady states. While replicator dynamics reflect some of the most troublesome aspects of learning in multiagent systems, some differences are evident. These differences are mainly due to the replication assumption. Agents are not usually expected to replicate; instead they acquire the strategies of others. For example, in a real multiagent system all the agents might choose to play the strategy that performed best in the last round instead of choosing their next strategy in proportion to how well it did last time. As such, we cannot directly apply the results from replicator dynamics to multiagent systems. However, the convergence of the systems' dynamics to a Nash equilibrium does illustrate the importance of this solution concept as an attractor of learning agents' dynamics.
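The following sketch iterates equations (3)–(5) for a small symmetric game; the payoffs and the initial population counts are arbitrary illustrative choices.

```python
# Replicator dynamics for a symmetric two-strategy game (an arbitrary example
# with the payoffs of Figure 3): u[(s, s2)] is the payoff to an agent playing s
# against an agent playing s2.
u = {('A', 'A'): 0.0, ('A', 'B'): 1.0, ('B', 'A'): 1.0, ('B', 'B'): 0.0}
strategies = ['A', 'B']
phi = {'A': 90.0, 'B': 10.0}      # phi^t(s): number of agents playing s

for t in range(20):
    total = sum(phi.values())
    theta = {s: phi[s] / total for s in strategies}                 # eq. (3)
    eu = {s: sum(theta[s2] * u[(s, s2)] for s2 in strategies)       # eq. (4)
          for s in strategies}
    phi = {s: phi[s] * (1 + eu[s]) for s in strategies}             # eq. (5)

total = sum(phi.values())
print({s: round(phi[s] / total, 3) for s in strategies})
# The population fractions approach the mixed equilibrium (0.5, 0.5) of this game.
```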
4.3 Evolutionary Stable Strategies
An Evolutionary Stable Strategy (ESS) is an equilibrium concept applied to dynamic systems such as the replicator dynamics system of the previous section. An ESS is an equilibrium strategy that can overcome the presence of a small number of invaders. That is, if the equilibrium strategy profile is ω and a small fraction ε of invaders starts playing some other strategy ω′, then ESS requires that the existing population get a higher payoff against the new mixture εω′ + (1 − ε)ω than the invaders do. It has been shown [9] that an ESS is an asymptotically stable steady state of the replicator dynamics. However, the converse need not be true—a stable state in the replicator dynamics does not need to be an ESS. This means that ESS is a further refinement of the solution concept provided by the replicator dynamics. ESS can be used when we need a very stable equilibrium concept.
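For pure candidate strategies the condition can be checked with the standard two-part formulation, which is equivalent to the invasion argument above for sufficiently small ε; the Hawk-Dove-style payoffs below are an arbitrary example.

```python
def is_ess(u, candidate, strategies):
    """Standard ESS test: omega is an ESS if, for every mutant m != omega, either
    u(omega, omega) > u(m, omega), or
    u(omega, omega) == u(m, omega) and u(omega, m) > u(m, m)."""
    w = candidate
    for m in strategies:
        if m == w:
            continue
        if u[(w, w)] > u[(m, w)]:
            continue
        if u[(w, w)] == u[(m, w)] and u[(w, m)] > u[(m, m)]:
            continue
        return False
    return True

# Hawk-Dove-style example (arbitrary payoffs): neither pure strategy is an ESS here.
u = {('H', 'H'): -1, ('H', 'D'): 2, ('D', 'H'): 0, ('D', 'D'): 1}
print(is_ess(u, 'H', ['H', 'D']), is_ess(u, 'D', ['H', 'D']))
```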
5
Learning Agents
The theory of learning in games provides the designer of multiagent systems with many useful tools for determining the possible equilibrium points of a system. Unfortunately, most multiagent systems with learning agents do not converge to an equilibrium. Designers use learning agents because they do not know, at design time, the specific circumstances that the agents will face at run time. If a designer knew the best strategy, that is, the Nash equilibrium strategy, for his agent then he would simply implement this strategy and avoid the complexities
of implementing a learning algorithm. Therefore, the only times we will see a multiagent system with learning agents are when the designer cannot predict that an equilibrium solution will emerge. The two main reasons for this inability to predict the equilibrium solution of a system are the existence of unpredictable environmental changes that affect the agents' payoffs and the fact that in many systems an agent only has access to its own set of payoffs—it does not know the payoffs of other agents. These two reasons make it impossible for a designer to predict which equilibria, if any, the system would converge to. However, the agents in the system are still playing a game for which an equilibrium exists, even if the designer cannot predict it at design-time. However, since the actual payoffs keep changing, it is often the case that the agents are constantly changing their strategy in order to accommodate the new payoffs. Learning agents in a multiagent system are faced with a moving target function problem [10]. That is, as the agents change their behavior in an effort to maximize their utility their payoffs for those actions change, changing the expected utility of their behavior. The system will likely have non-stationary dynamics—always changing in order to match the new goal. While game theory tells us where the equilibrium points are, given that the payoffs stay fixed, multiagent systems often never get to those points. A system designer needs to know how changes in the design of the system and learning algorithms will affect the time to convergence. This type of information can be determined by using CLRI theory.
5.1 CLRI Theory
The CLRI theory [12] provides a formal method for analyzing a system composed of learning agents and determining how an agent’s learning is expected to affect the learning of other agents in the system. It assumes a system where each agent has a decision function that governs its behavior as well as a target function that describes the agent’s best possible behavior. The target function is unknown to the agent. The goal of the agent’s learning is to have its decision function be an exact duplicate of its target function. Of course, the target function keeps changing as a result of other agents’ learning. Formally, CLRI theory assumes that there are N agents in the system. The world has a set of discrete states w ∈ W which are presented to the agent with a probability dictated by the probability distribution D(W ). Each agent i ∈ N has a set of possible actions Ai where |Ai | ≥ 2. Time is discrete and indexed by a variable t. At each time t all agents are presented with a new w ∈ D(W ), take a simultaneous action, and receive some payoff. The scenario is similar to the one assumed by fictitious play except for the addition of w. Each agent i’s behavior is defined by a decision function δit (w) : W → A. When i learns at time t that it is in state w it will take action δit (w). At any time there is an optimal function for i given by its target function ∆ti (w). Agent i’s learning algorithm will try to reduce the discrepancy between δi and ∆i by using the payoffs it receives for each action as clues since it does not have direct access to ∆i . The probability that an agent will take a wrong action is given
by its error e(δ_i^t) = Pr[δ_i^t(w) ≠ Δ_i^t(w) | w ∈ D(W)]. As other agents learn and change their decision function, i's target function will also change, leading to the moving target function problem, as depicted in Figure 5.1. An agent's error is based on a fixed probability distribution over world states and a boolean matching between the decision and target functions. These constraints are often too restrictive to properly model many multiagent systems. However, even if the system being modeled does not completely obey these two constraints, the use of the CLRI theory in these cases still gives the designer valuable insight into how changes in the design will affect the dynamics of the system. This practice is akin to the use of Q-learning in non-Markovian games—while Q-learning is only guaranteed to converge if the environment is Markovian, it can still perform well in other domains. The CLRI theory allows a designer to understand the expected dynamics of the system, regardless of what learning algorithm is used, by modeling the system using four parameters: Change rate, Learning rate, Retention rate, and Impact (CLRI). A designer can determine values for these parameters and then use the CLRI difference equation to determine the expected behavior of the system. The change rate (c) is the probability that an agent will change at least one of its incorrect mappings in δ^t(w) for the new δ^{t+1}(w). It captures the rate at which the agent's learning algorithm tries to change its erroneous mappings. The learning rate (l) is the probability that the agent changes an incorrect mapping to the correct one, that is, the probability that δ^{t+1}(w) = Δ^t(w). By definition, the learning rate must be less than or equal to the change rate, i.e., l ≤ c. The retention rate (r) represents the probability that the agent will retain its correct mapping, that is, the probability that δ^{t+1}(w) = δ^t(w) given that δ^t(w) = Δ^t(w). CLRI defines a volatility term (v) to be the probability that the target function Δ changes from time t to t + 1, that is, the probability that Δ^t(w) ≠ Δ^{t+1}(w). As one would expect, volatility captures the amount of change that the agent must deal with. It can also be viewed as the speed of the target function in the moving target function problem, with the learning and retention rates representing the speed of the decision function. Since the volatility is a dynamic property of the system (usually it can only be calculated by running the system), CLRI provides an impact (I_ij) measure. I_ij represents the impact that i's learning has on j's target function. Specifically, it is the probability that Δ_j^t(w) will change given that δ_i^{t+1}(w) ≠ δ_i^t(w). Someone trying to build a multiagent system with learning agents would determine the appropriate values for c, l, r, and either v or I and then use

$$E[e(\delta_i^{t+1})] = 1 - r_i + v_i \frac{|A_i| r_i - 1}{|A_i| - 1} + e(\delta_i^t)\left( r_i - l_i + v_i \frac{|A_i|(l_i - r_i) + l_i - c_i}{|A_i| - 1} \right) \qquad (6)$$
in order to determine the successive expected errors for a typical agent i. This equation relies on a definition of volatility in terms of impact given by

$$\forall_{w \in W} \quad v_i^t = \Pr[\Delta_i^{t+1}(w) \neq \Delta_i^t(w)] = 1 - \prod_{j \in N_{-i}} \bigl(1 - I_{ji} \Pr[\delta_j^{t+1}(w) \neq \delta_j^t(w)]\bigr), \qquad (7)$$
which makes the simplifying assumption that changes in agents' decision functions will not cancel each other out when calculating their impact on other agents. The difference equation (6) cannot, under most circumstances, be collapsed into a function of t, so it must still be iterated over. On the other hand, a careful study of the function and the reasoning behind the choice of the CLRI parameters leads to an intuitive understanding of how changes in these parameters will be reflected in the function and, therefore, the system. A knowledgeable designer can simply use this added understanding to determine the expected behavior of his system under various assumptions. An example of this approach is shown in [2]. For example, it is easy to see that an agent's learning rate and the system's volatility together help to determine how fast, if ever, the agent will reach its target function. A large learning rate means that an agent will change its decision function to almost match the target function. Meanwhile, a low volatility means that the target function will not move much, so it will be easy for the agent to match it. Of course, this type of simple analysis ignores the common situation where the agent's high learning rate is coupled with a high impact on other agents' target functions, making their volatility much higher. These agents might then have to increase their learning rate and thereby increase the original agent's volatility. Equation (6) is most helpful in this type of feedback situation.
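As a concrete illustration of this kind of analysis, the sketch below iterates difference equation (6) for a single agent; the parameter values are hypothetical and chosen only for the example.

```python
def clri_expected_error(e_t, c, l, r, v, num_actions):
    """One step of the CLRI difference equation (6): expected error at time t+1
    given error e_t at time t, change rate c, learning rate l, retention rate r,
    volatility v, and |A_i| = num_actions."""
    A = num_actions
    return (1 - r + v * (A * r - 1) / (A - 1)
            + e_t * (r - l + v * (A * (l - r) + l - c) / (A - 1)))

# Hypothetical parameters: a fast learner (l <= c) in a moderately volatile system.
e, c, l, r, v, A = 0.9, 0.8, 0.6, 0.95, 0.1, 5
for t in range(10):
    e = clri_expected_error(e, c, l, r, v, A)
    print(f"t={t+1}  E[e] = {e:.3f}")   # the expected error settles towards a fixed point
```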
5.2 N-Level Agents
Another issue that arises when building learning agents is the choice of a modeling level. A designer must decide whether his agent will learn to correlate actions with rewards, or will try to learn to predict the expected actions of others and use these predictions along with knowledge of the problem domain to determine its actions, or will try to learn how other agents build models of other agents, etc. These choices are usually referred to as n-level modeling agents—an idea first presented in the recursive modeling method [5] [6]. A 0-level agent is one that does not recognize the existence of other agents in the world. It learns which action to take in each possible state of the world because it receives a reward after its actions. The state is usually defined as a static snapshot of the observable aspects of the agent’s environment. A 1-level agent recognizes that there are other agents in the world whose actions affect its payoff. It also has some knowledge that tells it the utility it will receive given any set of joint actions. This knowledge usually takes the form of a game matrix that only has utility values for the agent. The 1-level agent observes the other agents’ actions and builds probabilistic models of the other agents. It then
uses these models to predict their action probability distribution and uses these distributions to determine its best possible action. A 2-level agent believes that all other agents are 1-level agents. It, therefore, builds models of their models of other agents based on the actions it thinks they have seen others take. In essence, the 2-level agent applies the 1-level algorithm to all other agents in an effort to predict their action probability distribution and uses these distributions to determine its best possible actions. A 3-level agent believes that all other agents are 2-level, and so on. Using these guidelines we can determine that fictitious play (Section 4.1) uses 1-level agents while the replicator dynamics (Section 4.2) uses 0-level agents. These categorizations help us to determine the relative computational costs of each approach and the machine-learning algorithms that are best suited for that learning problem. 0-level is usually the easiest to implement since it only requires the learning of one function and no additional knowledge. 1-level learning requires us to build a model of every agent and can only be implemented if the agent has the knowledge that tells it which action to take given the set of actions that others have taken. This knowledge must be integrated into the agents. However, recent studies in layered learning [8] have shown how some knowledge could be learned in a "training" situation and then fixed into the agent so that other knowledge that uses the first one can be learned, either at runtime or in another training situation. In general, a change in the level that an agent operates on implies a change in the learning problem and the knowledge built into the agent. Studies with n-level agents have shown [11] that an n-level agent will always perform better in a society full of (n-1)-level agents, and that the computational costs of increasing a level grow exponentially. Meanwhile, the utility gains to the agent grow smaller as the agents in the system increase their level, within an economic scenario. The reason is that an n-level agent is able to exploit the non-equilibrium dynamics of a system composed of (n-1)-level agents. However, as the agents increase their level the system reaches equilibrium faster so the advantages of strategic thinking are reduced—it is best to play the equilibrium strategy and not worry about what others might do. On the other hand, if all agents stopped learning then it would be very easy for a new learning agent to take advantage of them. As such, the research concludes that some of the agents should do some learning some of the time in order to preserve the robustness of the system, even if this learning does not have any direct results.
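The following sketch illustrates the n-level idea for two agents; the payoff dictionaries, histories, and helper names are hypothetical, and each payoff dictionary is assumed to be keyed by (that player's own action, the other player's action).

```python
from collections import Counter

def empirical_model(history):
    """1-level modeling: a probability distribution built from the opponent's observed actions."""
    counts = Counter(history)
    total = sum(counts.values())
    return {a: n / total for a, n in counts.items()}

def best_response(u, opponent_dist, actions):
    """Action maximizing expected utility against a predicted opponent distribution."""
    return max(actions, key=lambda a: sum(p * u[(a, b)] for b, p in opponent_dist.items()))

def n_level_action(n, u_self, u_other, own_history, their_history, actions):
    """Choose an action as an n-level modeler: treat the opponent as an (n-1)-level
    agent, predict its action, and best-respond. A 1-level agent best-responds to
    the opponent's empirical action frequencies."""
    if n == 1:
        predicted_dist = empirical_model(their_history)
    else:
        # Run the (n-1)-level rule from the opponent's point of view.
        predicted = n_level_action(n - 1, u_other, u_self,
                                   their_history, own_history, actions)
        predicted_dist = {predicted: 1.0}
    return best_response(u_self, predicted_dist, actions)

# Example on the game of Figure 1, with hypothetical observed histories.
u1 = {('A', 'A'): 1, ('A', 'B'): 3, ('B', 'A'): 3, ('B', 'B'): 2}
u2 = {('A', 'A'): 2, ('A', 'B'): 2, ('B', 'A'): 4, ('B', 'B'): 1}
print(n_level_action(1, u1, u2, ['A', 'B', 'A', 'A'], ['B', 'B', 'B', 'A'], ['A', 'B']))
print(n_level_action(2, u1, u2, ['A', 'B', 'A', 'A'], ['B', 'B', 'B', 'A'], ['A', 'B']))
```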
6
Conclusion
We have seen how game theory and the theory of learning in games provide us with various equilibrium solution concepts and often tell us when some of them will be reached by simple learning models. On the other hand, we have argued that the reason learning is used in a multiagent system is often because there is no known equilibrium or the equilibrium point keeps changing due to outside forces. We have also shown how the CLRI theory and n-level agents are attempts
to characterize and predict, to a limited degree, the dynamics of a system given some basic learning parameters. We conclude that the problems faced by the designer of a learning multiagent system cannot be solved solely with the tools of game theory. Game theory tells us about possible equilibrium points. However, learning agents are rarely at equilibrium, either because they are not sophisticated enough, because they lack information, or by design. There is a need to explore non-equilibrium systems and to develop more predictive theories which, like CLRI, can tell us how changing either the parameters of the agents' learning algorithms or the rules of the game will affect the expected emergent behavior.
References

1. Bomze, I.: Noncooperative two-person games in biology: A classification. International Journal of Game Theory 15 (1986) 31–37
2. Brooks, C.H., Durfee, E.H.: Congregation formation in multiagent systems. Journal of Autonomous Agents and Multi-agent Systems (2002) to appear
3. Fudenberg, D., Kreps, D.: Lectures on learning and equilibrium in strategic-form games. Technical report, CORE Lecture Series (1990)
4. Fudenberg, D., Levine, D.K.: The Theory of Learning in Games. MIT Press (1998)
5. Gmytrasiewicz, P.J., Durfee, E.H.: A rigorous, operational formalization of recursive modeling. In: Proceedings of the First International Conference on Multi-Agent Systems. (1995) 125–132
6. Gmytrasiewicz, P.J., Durfee, E.H.: Rational communication in multi-agent systems. Autonomous Agents and Multi-Agent Systems Journal 4 (2001) 233–272
7. Mitchell, T.M.: Machine Learning. McGraw Hill (1997)
8. Stone, P.: Layered Learning in Multiagent Systems. MIT Press (2000)
9. Taylor, P., Jonker, L.: Evolutionary stable strategies and game dynamics. Mathematical Biosciences 16 (1978) 76–83
10. Vidal, J.M., Durfee, E.H.: The moving target function problem in multi-agent learning. In: Proceedings of the Third International Conference on Multi-Agent Systems. (1998)
11. Vidal, J.M., Durfee, E.H.: Learning nested models in an information economy. Journal of Experimental and Theoretical Artificial Intelligence 10 (1998) 291–308
12. Vidal, J.M., Durfee, E.H.: Predicting the expected behavior of agents that learn about agents: the CLRI framework. Autonomous Agents and Multi-Agent Systems (2002)
The Implications of Philosophical Foundations for Knowledge Representation and Learning in Agents
N. Lacey¹ and M.H. Lee²
¹ University of Arkansas, Fayetteville, AR 72701, USA [email protected]
² University of Wales, Aberystwyth, Ceredigion, Wales, UK [email protected]
Abstract. The purpose of this research is to show the relevance of philosophical theories to agent knowledge base (AKB) design, implementation, and behaviour. We will describe how artificial agent designers face important problems that philosophers have been working on for centuries. We will then show that it is possible to design different agents to be explicitly based on different philosophical approaches, and that doing so increases the range of agent behaviour exhibited by the system. We therefore argue that alternative, sometimes counter-intuitive, conceptions of the relationship between an agent and its environment may offer a useful starting point when considering the design of an agent knowledge base.
1
Introduction
A situated agent is one which exists within an environment [9]. Except for a small number of trivial cases, a situated agent will have to represent facts about its environment within some kind of internal model. A learning agent will have to take new information on board, while maintaining the consistency of its knowledge base. This means that its designers will face problems which we can divide into the following categories:

Accuracy. Some of the data it receives may be inaccurate. It will need some mechanism to decide the truth or falsehood of its beliefs in the light of this data.

Completeness. The agent may not receive all the data it needs from the external world. This means it will have to derive new information on the basis of the information it does have. It will also have to non-monotonically revise its knowledge base in the light of new information.

Inconsistency. Consider the following three pieces of information:
– All swans are white.
– The creature in front of you is a swan.
– It is black.
Taken together, this information is inconsistent. In order to retain consistency, an agent would have to withdraw at least one of them from its knowledge base. The question is, which one? How does an agent determine which beliefs to revise, and what are the implications for the remainder of the knowledge base of this revision? We can view all these problems as having a common cause, namely, that of representing an infinite external world within a finite world model [11]. When seen in this light, it becomes clear that the precise nature of the relationship between an agent and its environment will be a key factor when deciding how to address these issues. Philosophers have for millennia been considering the nature of the relationship between ourselves and the external world. The three areas of philosophy which are of interest here are:

Epistemology concerns the theory of knowledge. Epistemology addresses questions such as "What is truth?" and "Can we be sure that we know anything?"

Philosophy of Language is concerned with different approaches to meaning.

Metaphysics concerns the nature of the relationship between an agent and its surroundings.

The approach adopted within one of these areas will have implications concerning the approaches which may consistently be adopted in the others. Indeed, one of the interesting features of philosophy is its ability to resolve tensions and inconsistencies in belief systems that may otherwise have gone unnoticed. For example, consider the following statements concerning agent knowledge bases:

S1 The external world exists independently of any agent's beliefs concerning it.
S2 The accuracy of an agent's knowledge base is defined in terms of correspondence with the external world.
S3 Therefore, in order to maximise the accuracy of the system, the agent's knowledge base must be defined in terms of the external world.
S4 It is impossible for any finite agent to obtain objective data concerning the "actual" state of ultimate reality.

Taken individually, it is hard to disagree with any of these statements. However, whether or not there is a direct contradiction contained within these statements, there is at least some tension between S1, S2, S3, and S4. The research described in this paper concerns the application of the techniques provided by philosophers in the three areas of philosophy described above to the problems of AKBs. Some of the similarities between the two disciplines are easy to spot. For example, the epistemological approaches to justification represented by foundationalism and coherence are already represented in the coherence and foundations approaches to belief revision, described in [5]. In other cases, however, the relevance of the relationship between the two disciplines is not obvious from a superficial investigation of the theories concerned. In these cases, it is necessary to conduct a detailed theoretical analysis of the
fundamental concepts involved. The techniques used to approach this task, and the conclusions that we were able to draw from this research, are described in this paper.
2
From Theory to Algorithm
One of the principal aims of this research was to show that differences in the philosophical foundations of an agent could be reflected in differences in the agent's design and behaviour. In order to accomplish this, we establish two opposing philosophical positions, and go on to show how learning agents based on these two positions can be designed and implemented. It is important to note that we are not suggesting that either of these positions represents a position that is more philosophically "correct" than its rivals and hence should be adopted by AKB designers. At this stage, our only purpose is to show that philosophical theories at this level do affect agent design. This section examines the transition of our two positions from the philosophical to the functional and algorithmic levels.
2.1 The Philosophical Level
The two positions used in this research were based on extreme versions of holism and atomism. These concepts refer to the basic units that make up a language. Tennant [16] describes the holist as holding that the basic unit of linguistic communication is the entire theory of the speaker. Any statement made by an agent only has meaning within the entire set of beliefs currently held by the speaker. Quine describes a radically holistic perspective in [14], where he writes that the unit of empirical significance is the whole of science. Atomism is, as would be expected, the opposite of holism. The atomist believes that individual words are, as Tennant describes, “primitively endowed with meaning” [16]. Words acquire this meaning independently of any other beliefs held by the agent. As far as our agent is concerned, these two contrasting theories lead to the following two positions: PA is a position based on extreme atomism. This means that every piece of data is assigned its own meaning, independently of the beliefs already stored within the agent’s knowledge base. PH is based on extreme holism. According to this view, the meaning of a piece of data is determined in relation to all the existing beliefs of the agent at a given time. It is interesting to examine how these two contrasting approaches to meaning affect the approach to justification that an agent designer may take. Justification is the process whereby a belief is supported by beliefs or by other entities. As Sosa describes in [15], there are two major approaches to justification, namely the foundationalist approach, and the coherence approach.
The foundationalist approach to justification holds that inferentially justified beliefs are themselves justified by non-inferentially justified beliefs, or basic beliefs [4]. According to the foundationalist, our beliefs are organised in a structure that is comparable to a pyramid [15], whereby we have a set of basic beliefs on which our non-basic beliefs are based. Haack identifies the two essential characteristics of foundationalism as follows [7]:
F1. There are two forms of justification: inferential and non-inferential.
F2. Basic beliefs are never justified, even in part, by appeal to non-basic beliefs.
The major alternative to the foundationalist approach to justification comes in the form of the coherence theory of justification. Dancy describes the coherence theory of justification as holding that "a belief is justified to the extent to which the belief-set of which it is a member is coherent." [15] compares the structure of beliefs suggested by the coherence theory of justification to that of a raft, whereby our beliefs mutually support each other, and no single belief is supported by anything other than other beliefs. Thus, the coherence theory of justification does not contain the same asymmetries as foundationalism, as all beliefs mutually justify each other, so the coherence theory of justification rejects both F1 and F2. As well as justification, the major epistemological concept that is of interest here is that of truth. We will briefly present two contrasting theories of truth: the correspondence and the coherence theories. Bradley [3] defines the correspondence theory of truth as follows:
φ is true if and only if the actual state of the world is as φ asserts it to be.
According to this theory, the proposition "Grass is green" is true if, and only if, grass is actually green in the real world.1 The major problem for correspondence comes in the form of fallibilism. If, as Dancy argues, we can never be sure that our beliefs are entirely free from error, we can never be sure that our beliefs actually represent the real external reality. The coherence theory of truth, as described in [4], defines truth as follows:
φ is true if and only if it is a member of a coherent set.
Thus, the proposition "Grass is green" is true because my belief that grass is green coheres with my view of the world, rather than because it describes an actual state of affairs in the real world. The coherence theory of truth is not susceptible to problems of fallibilism, as the concept of correspondence with external reality does not underpin the coherence-based concept of truth. The coherence theory of truth has been criticised under the plurality objection, in that it is possible for multiple agents to have conflicting beliefs at the same time, and yet all have "true" beliefs. Whether or not we accept that the plurality objection is effective depends on the metaphysical framework on which we build our philosophical system. Although harder to translate into functional approaches, the distinction between PA and PH can also be roughly translated to the metaphysical level.
1
There are in fact many subtle variations concerning the issue of correspondence, as described in [10] and [6].
At this level, the distinction we are interested in is that between realism and anti-realism. Aune [1] describes the difference between realist and anti-realist theories in terms of the difference between theories which hold reality to be fundamentally different from our perceptions of it, and those that do not; those that hold this distinction are realist, and those that do not are anti-realist. van Inwagen defines metaphysical realism as the view that there is such a thing as objective truth [19]. He identifies two components that are necessary for an agent to believe in realism:
R1. Our beliefs and assertions are either true or false. This is known as the principle of bivalence [10].
R2. The external world exists and has features which are independent of our beliefs. Thus, the external world is mind-independent.
For example, consider the following fact F: There are an odd number of stars in the universe. According to the realist, F is objectively either true or false, and the fact that we do not yet know whether it is true or false does not detract from its truth or falsehood. For the anti-realist, however, F is neither true nor false but meaningless, at least until such time as it can be verified. The vital distinction between these two approaches, then, is that realism holds that the world exists for agents to discover, while anti-realism holds that it is the very interaction of an agent with its environment which bestows meaning on the environment.2 Realism and correspondence are an important part of what van Inwagen refers to as the Common Western Metaphysic. This is the view of metaphysics that is an implicit part of our common sense view of the world. The important point here is that everyone has metaphysical views, whether or not they are aware of them. Most people in the Western world will readily accept both R1 and R2, and, as such, would be classed as realists. Thus, position PH is based on holism and a coherence-based theory of justification within an anti-realist metaphysical framework. PA has been defined in such a way as to be as diverse as possible from PH. Thus, PA is based on extreme atomism. Some of the concepts incorporated into PA, such as foundationalism and correspondence, can be associated with realism. However, it is not being suggested that PA represents a coherent realist metaphysical position, far less that PA represents the realist position. Indeed, by placing more emphasis on sensory perceptions than on the external objects being perceived, this position could be held to represent an anti-realist approach to the external world. The problem here is that a philosophical position consists of many axes which are not entirely independent. This means that one is not necessarily able to simply create the position one requires out of thin air, as careful consideration has to be given to the full implications of the position we choose to adopt.
2
The philosophical literature concerning this topic contains many subtleties which we cannot do justice to here. For a more detailed analysis of this area, see [20], [21], [2], and [17].
[Figure elements: Senses; Basic Beliefs; Inferentially Justified Beliefs]
Fig. 1. The Foundationalist Approach to Justification based on the Acceptance of F1 and F2
2.2
The Functional Level
The next step was to identify the functionality that we would expect to see exhibited in agents based on each of these approaches. Three agents were designed which incorporated different aspects of the functionality we would expect to see in systems based on PA and PH. Agents SA and WA. We would expect the agent based on strong atomism to process every piece of data on its own merits, making as little reference as possible to its existing beliefs. In such an agent, input data would be the major factor which determined its behaviour, as shown in Figure 1. At this stage it was noticed that the positions represented by PA and PH were a little extreme, so as well as agents strongly based on these two positions, SA and SH, a third agent WA was also developed, which was weakly based on PA. The extreme atomism of agent SA was represented using an agent which performed no processing of input data. Instead, new beliefs were inferred by forward chaining data received from the agent’s sensory inputs. While this approach was in keeping with the extreme atomism of PA, however, it is clear to see that agent SA would be extremely vulnerable to noisy data. For this reason, agent WA was designed to represent a compromise between the extreme positions of SA and SH. As such, agent WA incorporated some simple sensor averaging techniques which proved very effective when operating in noisy environments. In effect, agent WA did not react immediately to sensor data, but waited until a sufficient body of evidence for a particular state of affairs had been received by its sensors. With the exception of these averaging techniques, agent WA was identical to agent SA. The main advantage of this approach is that it leads to behaviour which is more flexible than that exhibited by agent SA, in that agent WA is able to
[Figure elements: Sensor Data; Corrected sensor data; Beliefs concerning the accuracy of sensor data; Beliefs based on sensor data; Core Beliefs]
Fig. 2. The Approach to Justification Based on PH
function more effectively in noisy environments. Also, even though some theoretical purity has been sacrificed in exchange for operational flexibility, agent WA can still claim to be based on PA. This is for two reasons. Firstly, all the information used by agent WA is derived from its sensors. Secondly, the process by which the agent decides which suggestions to act on is atomistic, in that it is based on individual beliefs rather than the coherence of the entire knowledge base. However, there are also disadvantages caused by this approach. On a theoretical level, agent WA is less firmly based on PA than agent SA. This is because, by increasing the complexity of the operations used to derive the agent's internal model from its sensor information, agent WA is moving towards the PH concept of defining the external world in terms of internal beliefs, rather than vice-versa. Finally, the sensor averaging used by agent WA means that the agent will not respond immediately to environmental changes. While this is useful in noisy environments, it may cause problems in noise-free environments. Agent SH. An agent based on strong holism would place far more emphasis on the coherence of its knowledge base than on making use of every individual piece of new data. We would therefore expect an agent based on PH to be less sensitive to noisy input data than an agent based on PA, but to be significantly more computationally expensive. Figure 2 shows the approach to justification that we would expect an agent based on PH to exhibit. Agent SH was significantly more complex than the other two agents. While agents SA and WA were designed to organise their knowledge bases to be as
closely related to sensory inputs as possible, the organisation of the knowledge base of agent SH was based on the concept of integrity.3 Theoretically, agent SH is free to make any and all revisions to its knowledge base which maximise the integrity of the knowledge base. It is important to note that from the radically holistic approach of PH, any comparisons between prospective changes to the knowledge base must be made from the point of view of the current knowledge base. Unlike SA and WA, SH is not allowed to use a pseudo-objective standpoint when evaluating its knowledge base. The method used to calculate the integrity of agent SH’s knowledge base was based on the following concepts:
Consistency. A knowledge base is consistent if it does not contain any logical contradictions. Thus, a knowledge base K is consistent with respect to a belief φ if ¬((φ ∈ K) ∧ (¬φ ∈ K)). Consistency is one of the most important features of coherence, as noted by [4], [5], and [18], and as such should form one of the principal ingredients of the measure of the integrity of an ontology.

Mutual Belief Support. The concept of mutual belief support requires that rather than merely being consistent, beliefs in a knowledge base actively support each other.

Explanatory Power. The explanatory power of a knowledge base relates to the amount of data it can explain. A good knowledge base should be able to explain past events as well as predict future ones.

Epistemological Conservatism. This is also known as the principle of minimal change, and is described in [8] and [5] as an important part of coherence. The method used by agent SH to calculate the integrity of its knowledge base reflects the fact that a new knowledge base will compare favourably with the present knowledge base if it is similar to the present knowledge base.

A system of constraints was used to represent core beliefs within the agent's ontology. When these constraints were violated, recovery measures were put in place which allowed the agent to process new input data while maximising the integrity of its ontology. Unlike the other two agents, agent SH was thus able to ignore input data if this course of action cohered with its existing beliefs.
3
The concept of integrity as it will be used here is heavily based on coherence. However, as the term “coherence” is used to denote a theory of epistemological truth, an approach to justification, and an approach to belief revision, each of which has subtly but importantly different definitions, it was decided to use a different term altogether when comparing knowledge bases within the PH framework.
2.3 The Algorithmic Level
Once the functionality required for each agent had been identified, it was then necessary to devise algorithms which allow this functionality to be represented within an artificial learning agent.4

Agents SA and WA. Agents SA and WA take information from their sensory inputs and place it directly in the parts of the model to which it relates, namely, the basic beliefs. This procedure leaves no room for the pre-processing of sensor data, as is allowed by the PH approach. These agents derive their knowledge base by forward chaining their basic beliefs (BK) using the relations contained in the rest of their knowledge base (RK). This approach implements both F1 and F2, as the beliefs in BK are never inferentially justified, but instead are justified by virtue of the fact that they come from the agent’s senses. This approach also reflects the foundationalist concept that all derived beliefs are ultimately derived from basic beliefs. During the forward chaining process, an inconsistency arises if a relation attempts to assign a fact which is already assigned a different value. The abilities of agents SA and WA to respond to inconsistencies are severely limited by their atomism. This means that both agents are based on the concept that individual beliefs may be inconsistent, with this inconsistency due to inaccuracy, which in turn is due to a lack of correspondence with the external world. Furthermore, any method used to address this inconsistency must be derived from, and evaluated with reference to, individual beliefs rather than the entire knowledge base.

Agent SH. Agents SA and WA are firmly based on empirical data. This means that the process of deriving their knowledge base requires no more algorithmic guidance than that provided by the forward chaining algorithm. Agent SH, however, is based upon the concepts of “integrity” and “coherence”. While these may be well defined at the philosophical and functional levels, providing rigorous algorithmic definitions of these concepts is a non-trivial problem. Thus, the knowledge-base derivation process used by agent SH requires more algorithmic guidance than that used by agents SA and WA. The method used to provide this guidance is “explanation-driven backward chaining”. There are two fundamental components to this approach: Backward Chaining, which allows the same information to be supplied from different sources within K, and Explanation-level Beliefs, which guide the backward chaining process.

The relations which are used to assign beliefs are ordered by centrality before any backward chaining attempts are made. This means that the algorithm is carrying out a depth-first search of the relation search space. Depending on the
4
Unfortunately we are not able to describe all the algorithms used in the design of these agents in the space available here. For a full description of this work see [11].
structure of K, this approach may take longer than if a breadth-first search were used. This is because high-centrality relations are backward chained as far as possible before any low-centrality relations are used. However, the advantage of using depth-first search is that this respects the PH concept of a hierarchy of beliefs. Important beliefs, which will be the most central, will receive priority over less important ones.

Agent SH uses the concept of meta-level beliefs which are concerned with the explanation of why certain events occurred. These meta beliefs can themselves be backward-chained, and this process guides the overall knowledge base derivation process. For example, just as back_chain(distance_up,tx=4) might produce an answer of

    distance_up = 4.33

meaning that the distance above the robot at t4 was 4.33 units, so back_chain(sensor_explanation,tx=4) might produce an answer of

    sensor_explanation =
        sensor_1 : sensor_ok
        sensor_2 : noise = -23.493
        sensor_3 : sensor_ok
        sensor_4 : noise = 18.872

meaning that at time t4 sensors 1 and 3 were working correctly, while sensors 2 and 4 were affected by noise.

By using a full explanation belief whose value is determined by first determining the value of several lower-level explanation objects, each of which relates to a different part of the domain, it is possible to guide the order in which the backward chaining algorithm attempts to find the values of the different objects within the knowledge base. Indeed, in order for the agent to arrive at a value for full explanation at a particular time, it will be necessary for it to have attempted to find a value for every relevant object within the knowledge base. Figure 3 shows how the full explanation object is related to various other objects in the example described in Section 3.

Once the derivation process is finished, the beliefs in K are subjected to various constraint checks. If the original sensor data is in any way inconsistent with the agent’s existing beliefs, K will fail some of these constraint checks. The source of the inconsistency will be traced to several alternative possibilities, and the possibility which maximises the integrity of K will be chosen. If the chosen explanation of the inconsistency is that some or all of the sensors were affected by noise, then the affected sensory data will be adjusted accordingly, and K will be re-derived.
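To make the derivation process more concrete, the sketch below gives a highly simplified rendering of explanation-driven backward chaining in Python; the actual agents were written in SICStus Prolog, and the relation format, the belief cache and the particular explanation objects are our own illustrative assumptions. Only the ordering of relations by centrality and the use of a top-level explanation object follow the description above.

    # Simplified illustration only; assumes the relation graph is acyclic.

    def back_chain(goal, t, relations, facts):
        """Try to derive a value for `goal` at time t. relations is a list of
        tuples (centrality, head, body_goals, fn); the most central relations
        are tried first, giving a depth-first search of the relation space."""
        if (goal, t) in facts:                          # already believed
            return facts[(goal, t)]
        for centrality, head, body, fn in sorted(relations, key=lambda r: r[0], reverse=True):
            if head != goal:
                continue
            args = [back_chain(g, t, relations, facts) for g in body]
            if all(a is not None for a in args):
                value = fn(*args)
                facts[(goal, t)] = value                # store the derived belief
                return value
        return None                                     # no value could be derived

    def full_explanation(t, relations, facts):
        """A top-level explanation belief: deriving it forces the agent to attempt
        a value for every relevant lower-level explanation object."""
        return {obj: back_chain(obj, t, relations, facts)
                for obj in ("sensor_explanation", "map_explanation", "obstacle_explanation")}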
Fig. 3. An Example of how the Top-Level Explanation Object can be Determined from Lower Level Explanation Objects (lower-level objects shown in the figure include the Internal Explanation, the Sensor Explanation for Sensor_1 ... Sensor_n, the External Explanation, the Map Explanation (Edit Map) and the Obstacle Explanation (Unknown Obstacle))
Constraints are placed on various beliefs throughout the agent’s knowledge base. For example, the following constraint limits the amount by which the sensors can be adjusted to the maximum amount of noise that the sensors may be affected by, where [leq, x, y] means that x must be less than or equal to y:

    [leq, sensor_adjustment, max_sensor_noise]

A violation of this constraint would mean that the sensor adjustment was too high, or that the sensors were prone to more noise than was previously believed, and hence that the value of max_sensor_noise was too low.

The use of constraints has theoretical and algorithmic advantages. From an algorithmic point of view, the use of constraints allows potential inconsistencies to be detected without necessarily having to derive the entire knowledge base. From a theoretical point of view, the use of constraints as a method of consistency checking reflects the holism of PH. After all, there is no inherent inconsistency in the following beliefs:

    sensor_adjustment = 35
    max_sensor_noise = 15

Rather, it is the constraint which the agent places on these beliefs which makes them inconsistent. The operation of checking for inconsistencies at the constraint level, rather than at the object level, reinforces the PH concept that it is the relations between beliefs which determine their consistency and accuracy, as opposed to the beliefs themselves.
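A minimal sketch of how such constraints could be checked against the current beliefs is shown below; the [leq, x, y] notation follows the paper, while the belief store and the checking function are our own illustration.

    # Sketch only; the belief store and checking code are illustrative assumptions.

    def violated(constraint, beliefs):
        """Return True if a constraint such as ['leq', 'sensor_adjustment',
        'max_sensor_noise'] is violated by the currently believed values."""
        op, x, y = constraint
        if x not in beliefs or y not in beliefs:
            return False                       # cannot be evaluated yet
        if op == "leq":
            return not (beliefs[x] <= beliefs[y])
        raise ValueError("unknown constraint operator: " + op)

    beliefs = {"sensor_adjustment": 35, "max_sensor_noise": 15}
    print(violated(["leq", "sensor_adjustment", "max_sensor_noise"], beliefs))   # True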
As constraints inherit the centrality values of the beliefs which they affect, agent SH will possess constraints of varying importance. This is important, as a criticism that is often levelled at coherence-based systems and theories is that they cannot account for the differences in the importance of our beliefs. By placing factual, relational, constraint, and meta-beliefs within an ordered hierarchy based on centrality, agent SH can avoid this problem. If an ontology based on PH were to be visualised as a sphere, the central beliefs would be the meta-beliefs while the outermost beliefs would be state-dependent assertions. This structure is similar to, and indeed is based on, the belief structure proposed by Quine in [14].

We can summarise the important points concerning the theoretical aspects of the design of agent SH as follows:
– Agent SH distinguishes between raw sensor data and corrected sensor data.
– Whether or not the agent decides to correct its sensor data, sensor data is an inferentially justified belief. As such, agent SH is based on a rejection of F2. Indeed, as all beliefs are inferentially justified, agent SH also rejects F1.
– Agent SH possesses the ability to modify the sensor data it receives based on its existing beliefs so as to maximise the integrity of its knowledge base.
– A process of knowledge base derivation based on explanation-based backward chaining allows agent SH to process a hierarchical belief structure that would not normally be possible within a coherence-based system.
3 Implementation and Experiments
Once algorithms designed to represent the functionality that would be exhibited by agents based on PA and PH had been defined, it was then possible to implement agents which incorporated these algorithms. A system was implemented in SICStus Prolog which allowed the performance of each of these agents to be compared.5 As shown in Figure 4, the domain which was used to compare the performance of agents SA, WA, and SH was that of guiding a simulated robot around a simulated environment. The robot is equipped with sensors which provide the agent controlling it with information concerning its distance from surfaces in the up, left, down, and right directions. The experiment occurs over a series of discrete timeslices. The agents receive sensor data from the robot, process this data, and then send movement commands back to the robot. This process continues until the agent controlling the robot believes that the robot has reached its goal position. The agents had no knowledge of the actual position of the robot. This means that the actual sensor information must be derived by a process which is external to the agents. The part of the system responsible for this is called the 5
The results presented here in fact represent the second series of experiments undertaken during this research. The experimental domain used during the first set of experiments is described in [12].
Fig. 4. The Environment Used to test Agent Performance (showing the robot’s start position, its sensors, an obstacle and the goal)
“Robot Simulator”. Its task was to receive commands from the agents, move the simulated robot accordingly, and send sensor data back to the agents according to the state of the environment. The agents received sensor readings from the robot, and had to learn the layout of the environment on the basis of these sensor readings. By adding noise to the sensor readings received by the agents, as well as to the motor commands received by the robot, it was possible to create a complex learning task which allowed the relative performance of the agents to be compared. The performance of the agents was examined in terms of the following performance measures:
– Location accuracy (Lxy)
– Surface accuracy (Sr, Su)
– The number of seconds required to complete the experiment.

The accuracy of the agent’s model of the position of the robot Lxy is given as a percentage. A score of 100% indicates that the agent’s beliefs concerning the location of the robot are completely accurate. Lxy is defined as the mean of Ttxy for all timeslices, where Ttxy represents the agent’s positional accuracy at a given timeslice t, and is calculated as shown in equation 1.

    Ttxy = 100 / (1 + (Dxt)^2 + (Dyt)^2)    (1)
where Dxt is the difference between the robot’s actual x position and the x position the agent believes it to be in at time t, and Dyt is the difference between the robot’s actual y position and the y position the agent believes it to be in at time t.
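As a worked illustration, equation (1), as reconstructed above from the garbled original, can be computed directly (the exact grouping of terms in the printed formula is partly conjectural):

    # Positional accuracy for a single timeslice, following equation (1) as
    # reconstructed above; Lxy is the mean over all timeslices.

    def positional_accuracy(dx, dy):
        return 100.0 / (1.0 + (dx ** 2 + dy ** 2))

    def location_accuracy(errors):              # errors: list of (Dx, Dy) per timeslice
        return sum(positional_accuracy(dx, dy) for dx, dy in errors) / len(errors)

    print(positional_accuracy(0, 0))            # 100.0: the agent's belief is exact
    print(round(positional_accuracy(3, 4), 2))  # 3.85: accuracy falls quickly with error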
Fig. 5. The Measures of Surface Accuracy (the model surface, with endpoints A and B, is shown above the actual surface, with endpoints L and R; AL marks the part of the model surface that does not correspond to an actual surface, and BR the part of the actual surface that is not represented in the model)
The accuracy of the agent’s model of the obstacles in the environment is given by the following two values:

Sr: concerns the cases where the agent believes a surface exists but where no surface exists in the real map. A value of 100% means there are no false reports of surfaces.

Su: concerns the cases in which the agent fails to identify a surface. A value of 100% means that there are no unreported surfaces.

Figure 5 illustrates what is represented by Sr and Su. The actual percentages are derived as shown in equations 2 and 3.

    Sr = 100 × LR / (LR + AL)    (2)

    Su = 100 × LR / (LR + BR)    (3)
where AL is the distance between A and L. This is the length of the model surface which does not correspond to an actual surface. BR is the distance between B and R. This is the length of the actual surface which is not represented by a model surface. LR is the length of the actual surface.
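For illustration, the two surface-accuracy measures can be computed directly from these lengths (a Python sketch with invented values, following equations (2) and (3) as reconstructed above):

    def surface_accuracy(lr, al, br):
        """lr: length of the actual surface; al: over-reported model surface;
        br: unreported actual surface."""
        sr = lr * 100.0 / (lr + al)    # 100% -> no false reports of surfaces
        su = lr * 100.0 / (lr + br)    # 100% -> no unreported surfaces
        return sr, su

    print(surface_accuracy(lr=10.0, al=0.0, br=0.0))   # (100.0, 100.0)
    print(surface_accuracy(lr=10.0, al=2.5, br=5.0))   # (80.0, 66.66...)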
3.1 Experiments
This section describes the experiments that have been carried out. The experiments were designed to provide a comparison of the performance of agents SA, WA, and SH when operating under a variety of environmental conditions, as summarised in Table 1. The results from each experiment describe the performance of each agent. For each experiment, results are given in terms of the measures of agent performance described above. All statistics given represent the mean results from at least three runs.
Table 1. The Conditions used for the Experiments

Experiment   Environmental Conditions
1            No noise, one obstacle
2            One obstacle and sensor noise
3            One obstacle, sensor noise and motor noise
Experiment 1 The purpose of this experiment was to examine the performance of the agents when acting in an environment which contained unfamiliar elements. This was achieved by placing an obstacle in the robot’s environment, between the start and goal positions. The agents had no prior knowledge of this obstacle. The robot could not pass through the obstacle. Therefore, the agents had to use the robot’s sensor readings to infer its location and plan a path around it. Figure 6 shows the map used for these experiments. The obstacle is represented by the shaded rectangle.
Fig. 6. The map used for Experiment 1
The results obtained in experiment 1 are summarised in Table 2. As shown in this table, while all the agents were successful in guiding the robot to the goal, the execution time of agent SH was significantly higher than those of the other two agents. Figure 7 shows the path used by agent SA to guide the robot to the goal position in experiment 1. The thick lines along the edges of the obstacle represent the position of surfaces in the environment according to agent SA. In this experiment, agents WA and SH produced a path and world model identical to that produced by agent SA.
Table 2. Results from Experiment 1

Performance Measure      SA      WA      SH
Lxy (%)              100.00  100.00  100.00
Sr (%)               100.00  100.00  100.00
Su (%)               100.00  100.00  100.00
Seconds               86.02   92.66  516.56
Fig. 7. The Path used by Agent SA in Experiment 1
Experiment 2 The purpose of this experiment was to compare the performances of the agents when dealing with imperfect input data. Random amounts of noise were added to the sensor readings supplied to all the agents. As well as this, an obstacle about which the agents had no prior knowledge was positioned between the robot’s start and goal positions, as in experiment 1.

Table 3. Results from Experiment 2

Performance Measure      SA      WA      SH
Lxy (%)              100.00  100.00  100.00
Sr (%)                12.59   58.08  100.00
Su (%)                 6.89   46.00  100.00
Seconds              159.24  102.90  514.24
The results described in Table 3 show that the map produced by agent SH was significantly more accurate than those produced by the other two agents, but that agent SH also required significantly more time to complete the task.
When we compare the final models produced by the agents, the effects of the differing designs are clear. The model produced by agent SA, shown in Figure 8, shows that, as would be expected, an agent based on atomism is vulnerable to noisy input data. This agent has produced a world model which contains many inaccuracies as far as the location of surfaces in the environment is concerned.
Fig. 8. The Final Model of Agent SA from Experiment 2
The model produced by agent WA, shown in Figure 9, contains inaccuracies, but is more accurate than that produced by agent SA. This shows that even simple sensor averaging can have a significant effect on the ability of a system to cope with noisy data.
Fig. 9. The Final Model of Agent WA from Experiment 2
Experiment 3 This experiment involved a similar environment to that used in experiment 2, except that in experiment 3 the movement commands sent by the agents to
the robot were also subject to noise. This meant that, as the robot would not be following the precise movements ordered by the agents, the robot’s actual position might differ from the position the agent believed it to be in. The results obtained in experiment 3 are summarised in Table 4. Figures 10, 11, and 12 show the final maps of the agents after sample runs of experiment 3. The dotted trail represents the agent’s beliefs concerning the path taken by the robot, while the thick trail represents the path the robot actually took. Similarly, the solid black circle represents the actual position of the robot, while the hollow outlined circle represents the position where the agent believed the robot to be.
Table 4. Results from Experiment 3

Performance Measure      SA      WA       SH
Lxy (%)                7.61    8.57    12.77
Sr (%)                 3.86    7.11     9.46
Su (%)                 2.88    4.61    17.25
Seconds              225.77  103.18  6351.75
Fig. 10. The Final Model of Agent SA from Experiment 3
Figure 10 shows that the difference between the robot’s actual position and the position in which agent SA believed the robot to be caused the agent to form an erroneous model of the external world. Despite the sensor noise, one can see how the positions of surfaces along the left-hand edge of the environment follow a similar pattern to the actual path of the robot. This also explains why agent SA consistently placed surfaces too far to the right, meaning that they ended up inside the obstacle’s actual position.
Indeed, in this experiment, agent SA was not able to detect the actual position of the obstacle. This is why the path agent SA believed the robot took passes through the actual position of the obstacle.
Fig. 11. The Final Model of Agent WA from Experiment 3
Figure 11 shows that agent WA was better equipped to operate in a noisy environment than agent SA. However, like agent SA, agent WA did not actually succeed in bringing the robot to its goal position.
Fig. 12. The Final Model of Agent SH from Experiment 3
Figure 12 shows that agent SH handled the noisy environment reasonably well, as the agent adjusted its beliefs concerning the robot’s position on several occasions. While this meant that agent SH was actually successful in guiding the robot to its goal, the cost in terms of execution time was high.
3.2 Results from Agent Performance
This section will draw some conclusions concerning the relative performance of the three agents based on the results presented above.

Agent SA. The principal advantage of agent SA is its simplicity compared to agents WA and SH. This results in fast execution times, although agent WA was faster than agent SA in very noisy environments.

Agent WA. Agent WA was specifically designed for noisy environments. Algorithmically, as well as conceptually, the difference between agents WA and SA is very simple, yet is surprisingly effective. A visual comparison of the final maps reached by these two agents in experiments 2 and 3 shows the effectiveness of not having to translate every change in the environment directly into a change in an agent’s world model. The ability to modify the amount of support needed by suggested alterations to its map in order for them to become implemented allows agent WA to arrive at world models which are much closer to the real world than those arrived at by agent SA. From a purely theoretical point of view, the importance of the performance of agent WA lies in the fact that it represents a compromise between the extreme atomistic and holistic positions. The fact that this compromise resulted in performance advantages for the agent may have implications for the relationship between AI and philosophy. This point is discussed in [11].

Agent SH. Agent SH performed well in terms of accuracy in the experiments described above, as it achieved the highest accuracy scores in every experiment. However, the price for this increased accuracy was increased computational expense. As well as taking longer than agents SA and WA in every experiment, the execution times of agent SH were also shown to be considerably more sensitive to increased complexity than those of the other agents. This means that agent SH might not be suitable for use in an environment which requires consistent execution times.
4 Further Work
Perhaps the most important way in which the work presented in this paper could be extended would be to consider a wider range of philosophical positions. In particular, the implementation of agents based on less extreme positions would be beneficial. As mentioned above, however, defining a theory to implement is a non-trivial task. In this paper we have dealt with the following three philosophical axes:
Philosophy of Language, where the extremes were atomism and holism;
Epistemological Justification, where the extremes were foundationalism and coherence;
Metaphysical Realism, where the extremes were realism and anti-realism.

Now, even using the extremes of each of these three relatively well defined axes, it is by no means a simple task to identify the points on each of these axes that are occupied by a coherent position. This is because the interactions between these axes mean that we cannot simply pick a point on each axis and implement the resulting position. Furthermore, a position that may be allowed by one philosopher may be rejected by another, making the task of choosing a coherent theory even more difficult. Using the lessons learned from this research, however, it is our belief that it would indeed be possible to implement agents based on less extreme philosophical positions, and that this exercise would yield interesting results concerning the design decisions that AKB designers are justified in making given a particular set of philosophical assumptions.
5 Conclusions
This work has shown that the philosophical foundations of an artificial agent do affect its design, which in turn affects its behaviour and the methods it is able to use to learn and process new data. We are now in a position to consider the following question: Should the world model of a situated agent be defined in terms of its environment, or in terms of its existing beliefs? The common sense view, as held by the Common Western Metaphysic, is that the environment exists independently of the agent. This means that, in order to maximise accuracy, the agent’s world model should be defined in terms of its environment. However, this leads to difficulties, as in all but the most trivial cases the agent’s environment will be significantly more complex than its model. AKB designers therefore use AI techniques which allow the agent to function using inaccurate, incomplete and uncertain information. The question then arises as to what extent such systems are actually based on the agent’s external world. Even the simple sensor averaging used by agent WA necessitated the overriding of sensor data depending on the values of existing beliefs. Clearly, any method which allows the agent to operate using imperfect data must also allow sensor information to be altered or ignored. We argue, therefore, that even though many designers of agent-based systems would think of themselves as realists, the systems that they design are not defined purely in terms of the external world, and as such contain varying elements of anti-realism. Furthermore, we note that philosophical theories exist in which the world is not defined objectively, but rather is given meaning by the interaction of an agent with its environment. Finally, we suggest that such philosophical theories might
provide alternative theoretical starting points for the designers of agent-based systems to complement those provided by the Common Western Metaphysic. McCarthy and Hayes [13] write that undertaking to construct an intelligent computer program entails presupposing, amongst other things, that our common-sense view of the world is approximately correct. The research described in this paper has shown that, although our common sense view of the world may indeed be correct, it is by no means the only view on which artificial agents can be based. This in turn shows that AKB designers may be able to benefit by exploring alternative conceptions of the relationship between an agent and its environment. Acknowledgements. This work was carried out while the first author was supported by the Department of Computer Science, University of Wales, Aberystwyth, Wales, UK.
References
1. Bruce Aune. Metaphysics: The Elements. Basil Blackwell, Oxford, 1986.
2. Simon Blackburn. Spreading the Word: Groundings in the Philosophy of Language. Clarendon Press, Oxford, 1984.
3. Raymond Bradley and Norman Swartz. Possible worlds: an introduction to logic and its philosophy. Basil Blackwell, Oxford, UK, 1979.
4. Jonathan Dancy. Introduction to Contemporary Epistemology. Basil Blackwell, Oxford, 1985.
5. Peter Gärdenfors. Belief revision: An introduction. In Peter Gärdenfors, editor, Belief Revision, volume 29 of Cambridge Tracts in Theoretical Computer Science, pages 1–28. Cambridge University Press, Cambridge, 1992.
6. Susan Haack. Philosophy of Logics. Cambridge University Press, Cambridge, 1978.
7. Susan Haack. Evidence and Inquiry: Towards Reconstruction in Epistemology. Blackwell, Oxford, UK, 1997.
8. Gilbert Harman. Change in View: Principles of Reasoning. MIT Press, Cambridge, Mass, 1986.
9. Nicholas R. Jennings, Katia Sycara, and Michael Wooldridge. A roadmap of agent research and development. International Journal of Autonomous Agents and Multi-Agent Systems, 1(1):7–38, 1998.
10. Richard L. Kirkham. Theories of Truth: A Critical Introduction. MIT Press, Cambridge, Mass, 1995.
11. Nicholas Lacey. Investigating the Relevance and Application of Epistemological and Metaphysical Theories to Agent Knowledge Bases. PhD thesis, University of Wales, Aberystwyth, 2000.
12. Nicholas Lacey, Keiichi Nakata, and Mark Lee. Investigating the effects of explicit epistemology on a distributed learning system. In Gerhard Weiß, editor, Distributed Artificial Intelligence Meets Machine Learning, number 1221 in Lecture Notes in Artificial Intelligence, pages 276–292. Springer-Verlag, 1997.
13. John McCarthy and Pat Hayes. Some philosophical problems from the standpoint of artificial intelligence. In Matthew L. Ginsberg, editor, Readings in Nonmonotonic Reasoning, pages 26–45. Morgan Kaufmann, Los Altos, California, 1987.
14. Willard Van Orman Quine. Two dogmas of empiricism. In From a Logical Point of View: 9 logico-philosophical essays, chapter 2, pages 20–46. Harvard University Press, Cambridge, Mass, 1980.
15. Ernest Sosa. The raft and the pyramid: Coherence versus foundations in the theory of knowledge. In Paul K. Moser and Arnold vander Nat, editors, Human Knowledge: Classical and Contemporary Approaches, pages 341–356. Oxford University Press, Oxford, UK, 1995.
16. Neil Tennant. Anti-realism and logic – Truth as eternal. Clarendon Press, Oxford, UK, 1987.
17. Neil Tennant. The Taming of the True. Clarendon Press, Oxford, UK, 1997.
18. Paul Thagard and Karsten Verbeurgt. Coherence as constraint satisfaction. Cognitive Science, 22(1):1–24, 1998.
19. Peter van Inwagen. Metaphysics. Dimensions of Philosophy Series. Oxford University Press, Oxford, 1993.
20. Crispin Wright. Realism, antirealism, irrealism, quasi-realism. In Peter A. French, Theodore E. Uehling Jr., and Howard K. Wettstein, editors, Midwest Studies in Philosophy, XII, pages 25–49. University of Minnesota Press, Minneapolis, 1988.
21. Crispin Wright. Realism, Meaning & Truth. Blackwell, Oxford, UK, 1993.
Using Cognition and Learning to Improve Agents’ Reactions*

Pedro Rafael Graça and Graça Gaspar

Department of Computer Science, Faculty of Sciences of the University of Lisbon,
Bloco C5 – Piso 1 – Campo Grande, 1700 Lisboa, Portugal
{prafael,gg}@di.fc.ul.pt
Abstract. This paper proposes an agent-architecture to deal with real-time problems where it is important both to react to constant changes in the state of the environment and to recognize the generic tendencies in the sequence of those changes. Reactivity must satisfy the need for immediate answers; cognition will enable the perception of medium and long time variations, allowing decisions that lead to an improved reactivity. Agents are able to evolve through an instance-based learning mechanism fed by the cognition process that allows them to improve their performance as they accumulate experience. Progressively, they learn to relate their ways of reacting (reaction strategies) with the general state of the environment. Using a simulation workbench that sets a distributed communication problem, different tests are made in an effort to validate our proposal and put it in perspective as a solution for other problems.
1 Introduction Reaction, cognition and the ability to learn are among the most fundamental aspects of human behaviour. Daily, we react to a non-deterministic and constantly changing world, often facing unknown situations that nevertheless need immediate answer (for example, crossing an unknown street for the first time); we constantly rely on our cognitive ability to classify the surrounding environment (for example, choosing the best place to cross the street); we use our experience to select actions for specific situations (for example, quickly crossing the street when the sign turns red). Generally, cognition and the ability to learn lead to the accumulation of experience, allowing better decisions that improve the selection of actions. This is the central idea of the agent-architecture proposed in this paper: the agents have a reaction module that allows them to answer in real-time, a cognition module that successively captures and classifies images of the environment, and a learning module that accumulates experience that progressively allows a better selection of actions.
*
This work was supported by the LabMAC unit of FCT.
1.1 Environment A real-time group communication simulated environment (reproduced by a simulation workbench) supported the development and study of the proposed architecture. Prone to non-deterministic and continuous changes (traffic fluctuations), such an environment emphasizes the need for immediate reactions. At the same time, the cyclic occurrence of similar environment states (for example, periods with low traffic level) and the repetition of changing patterns (for example, brisk increases of traffic level) point to the utility of a cognitive system that enables a form of learning, allowing the use of accumulated experience. In this distributed communication environment, where the traffic level is variable and messages can be lost, each agent is responsible for sending and eventually resending (when losses occur) a set of messages. Each agent’s goal is to optimise the timeout interval for resending lost messages, in such a way that the sending task is completed as soon as possible and the communication system is not overloaded by unnecessary resending. The agent chooses from a set of tuning strategies that over time it learns to relate to the state of the communication system, progressively improving its performance. Using the simulation workbench that reproduced the group communication environment, we evaluated: the utility of the multi-agent system architecture and the importance of the individual features of agents, the utility of using a set of different strategies, and the significance of the learning mechanism. The resulting conclusions helped us to point out the most significant aspects of the generic model adopted, and to put it in perspective as a solution for other problems. 1.2 Related Work In the context of concrete applications of multi-agent systems in traditional telecommunication problems, our main goal is to put in perspective the relationships between the generic problems observed in a distributed application and the generic answers that a multi-agent system is able to offer, in an abstraction level more concerned with properties than with detail. Even though no studies similar to ours were found (involving reaction, cognition and machine learning in a telecommunication problem), many other investigations in the field of multi-agent systems and telecommunications address problems in real-time environments that share many of the essential characteristics. A wide diversity of studies address problems such as routing, traffic congestion, scalability, fault location, and cooperative work, to mention only a few. Work on this area can be found in [1] and [3]. In [5] and [4] layered agent-architectures to address the problem of controlling and balancing reactivity and deliberation in dynamic environments requiring real-time responsiveness are proposed. These perspectives show some similarities to our work, but they do not incorporate a machine learning mechanism. In [9] the relationship between learning, planning and reacting is discussed, and an extension to a single-agent architectural framework to improve multi-agent coordination is proposed. The learning mechanism is used in order to determine the best way of alternating between reaction-based and plan-based coordination. In this particular, our study is significantly different: our learning mechanism concerns how to react in environments where reaction is a necessity rather than an option.
Work concerning a learning approach in some regards close to ours can be found in [7]. They propose a system that dynamically configures societies of agents, using cognition and/or communication as the basis for learning specific-situation coordination strategies.
2 A Group Communication Problem The communication problem used in this investigation was idealized following two main requirements: the preservation of the generic properties of a real-time distributed application; the avoidance of complex situations that could make the interpretation of results a more difficult task. To meet the first requirement, we selected a group communication problem, a typical and intuitive real-time distributed situation, offering a high degree of versatility concerning the desired complexity level involved. To meet the second requirement, we used a simulation workbench that reproduced the selected problem, simplifying accessory aspects and preserving all the essential properties that ensure the accuracy and expressiveness of the results. As a good example of the benefits introduced by the simplifications that took place, it is considered that, although normal messages can be lost in the communication process, acknowledgments cannot. Since from the message sender point of view both of these losses are equivalent and undistinguishable, the occurrence of acknowledgment losses would increase complexity without enriching the study or its results. 2.1 Description The problem in question was conceived in order to serve a generic purpose, but the description of a real and specific situation will help to clarify its outlines. Imagine a team of stockbrokers, each of them working on a different stock market. Suppose that, in order to coordinate the team’s global action, there is a synchronization rule that establishes that each individual can only perform an action after being informed of every other team member’s intention. Consider that it is important to perform as many operations as possible and that the communication between stockbrokers takes place in a telecommunication system where the traffic level is variable and messages can be lost. This scenario corresponds to the distributed communication problem adopted in this investigation. Each agent is associated to a communication node, having the responsibility of sending and resending (when losses occur) a message to each other node. When a message arrives to its destination, an acknowledgment is sent back to the message sender. In each communication season, the users (each associated to a node) exchange messages with each other. One season ends when the last message (the one that takes more time to successfully arrive) reaches its destination. The traffic level on the communication network varies over time, influencing the reliability: an increase of traffic level causes a decrease of reliability, increasing the occurrence of message losses; a decrease of traffic level has the opposite effect. The better the agents are able to adapt to the sequence of changes, the more accurate be-
comes the chosen instant for resending lost messages. Increased accuracy improves the communication times, causing the duration of seasons to decrease. It is important to notice that the communication problem described takes place at application level. In environments where the sequence of events is very fast (imagine a millisecond time scale) the ability to react very quickly is often more important than the ability to choose a good reaction. The time needed to make a good choice can actually be more expensive than a fast, even if worse, decision. Because of this, the agent-architecture proposed in this paper is better suited for problems where the time scale of the sequence of events justifies efforts such as cognition or learning. This does not mean that quick answers to the environment are not possible: deliberation (recognising, learning and deciding) can easily become a background task, only showing its influence on the quality of reactions when there is enough time to identify environmental states and use previously acquired experience. In the worst case (millisecond time scale) this influence will tend to be null and agents will react without deliberating. The better the deliberation process can accompany the sequence of events, the greater will this influence be.
3 Agent–Architecture Considering the communication problem adopted, the agents’ task is to tune the timeout interval for resending lost messages, so that the duration of the communication seasons and the occurrence of unnecessary resending are both minimised. In their tuning task, agents must deal with information at different time scale perspectives: they must immediately react to constant variations in the state of the environment and also be able to recognize tendencies in the sequence of those variations so that a learning mechanism can be used to take advantage of accumulated experience. To react to constant variations, each agent uses one of several tuning strategies at its disposal. To evaluate the quality of a tuning strategy in a communication context (for example, during a low traffic period) the period during which that strategy is followed cannot be too short; to allow the search for better strategies, this period should not last too long. These opposing desires led to the introduction of the satisfaction level concept, a varying parameter that regulates the probability of a strategy change decision (the lower the agent’s satisfaction level is, the more likely it will decide to adopt a new strategy). As it will be briefly explained below, this satisfaction level depends on two additional aspects: the detection of changes in the state of the environment (communication conditions); the self-evaluation of the agents’ performance. To recognize non-immediate tendencies in the sequence of environment state changes, the agent uses its cognition system. The information collected in each communication season is gathered in a memorization structure. This structure is periodically analysed in order to abstract from the details of basic states, fusing sequences of those basic states into generic states classified in communication classes and detecting important variations in the state of the communication system (for example, a transition from a traffic increase tendency to a traffic decrease tendency). The result of this analysis influences the satisfaction level; for example, a change of the communication
class, or the detection of an important variation, can cause the satisfaction level to decrease, motivating the agent to choose a new tuning strategy (possibly fitter to the new conditions). Progressively, agents learn to relate the tuning strategies and the communication classes. The establishment of this relationship depends on two classification processes: the classification of the agents’ performance and the classification of the generic state of the environment (the communication classes processed by the cognitive system). A scoring method was developed in order to classify the agents’ performance. As the duration of the timeout interval depends on the tuning strategy in use, the qualification of an agent’s performance in a sequence of seasons is a measure of the fitness of the strategy used to the communication class observed during those seasons. The diversity of states of the environment emphasizes the utility of studying a multi-agent system where different individuals may have different characteristics. While an optimism level regulates the belief in message losses (the more pessimistic the agent is, the sooner it tends to conclude that a message was lost), a dynamism level regulates the resistance to stimulation (the less dynamic the agent is, the less it reacts to changes, the longer it keeps using the same strategy). Each different agent has a specific behaviour and interprets the surrounding environment in a different way. In this section, after introducing some terminology (subsection 3.1), the details of the proposed agent-architecture are presented in the following order: the tuning strategies (subsection 3.2); the scoring method (subsection 3.3); the cognitive system (subsection 3.4); the learning system (subsection 3.5). Finally, a diagram (subsection 3.6) and an illustration of an operating agent (subsection 3.7) give a global perspective of the architecture. A more detailed description of this agent-architecture can be found in [2]. 3.1 Terminology The period of time needed to successfully send a message (including eventual resending) and to receive its acknowledgement is called total communication time. When a message successfully reaches its destination at the first try, the communication time is equal to the total communication time; otherwise, it is the time elapsed between the last (and successful) resending and the reception of the acknowledgement. The ending instant of the timeout interval is called resending instant. It is considered that the ideal resending instant of a message (the instant that optimises the delay) is equal to the communication time of that message. The difference between the resending instant and the ideal resending instant is called distance to the optimum instant. A high increase or decrease of traffic level immediately followed by, respectively, a high decrease or increase is called a jump. A high increase or decrease of traffic level immediately followed by stabilization is called a step.
3.2 Tuning Strategies To follow the fluctuations of the communication system, each agent constantly (every communication season) adjusts the resending instant. It is possible to imagine several ways of making this adjustment: following the previous communication time, following the average of the latest communication times, accompanying a tendency observed in the succession of communication times, etc. A tuning strategy is a way of adjusting the resending instant. It is a simple function whose arguments include the communication times observed on previous seasons and the optimism level, and whose image is the resending instant to be used on the following season.
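As an illustration of this interface, two of the simplest adjustment rules mentioned above could be written as follows; this is a Python sketch with an interface of our own choosing, since the paper does not fix one, and the optimism level is applied on top of the value such a function returns, as discussed later in the text.

    def reactive(history):
        """Use the communication time observed in season t as the resending
        instant for season t+1."""
        return history[-1]

    def average(history):
        """Use the average of all previous communication times."""
        return sum(history) / len(history)

    history = [150, 160, 170, 120]
    print(reactive(history), average(history))    # 120 vs. 150.0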
Fig. 1. Average tuning strategy
A set of ten tuning strategies is available to the agents, including for example: a reactive strategy (according to this strategy, the communication time observed in season t is used as the resending instant in season t+1), an average strategy (as shown in Figure 1, each resending instant is defined according to the average of all previous communication times), a semi-reactive average strategy (the last communication time is weighted by one tenth in the average calculation), an almost reactive average strategy (as shown in Figure 2, the last communication time is weighted by one third in the average calculation), and a reactive ignoring jumps strategy (which works like the reactive strategy but keeps the same resending instant when jumps occur). A TCP strategy, reproducing the real procedure adopted by the TCP/IP protocol (see [6] for details), was also included in this set. According to this strategy (Figure 3), the more unstable the communication conditions are, the bigger is the safety margin used (higher resending instants).
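The paper does not reproduce the exact TCP computation it borrows from [6]. The sketch below shows the classic retransmission-timeout estimator usually attributed to Jacobson (a smoothed mean plus a multiple of the smoothed deviation), which has the property described above: the more unstable the communication times, the larger the safety margin. The constants are the conventional ones, not necessarily those used in the authors' implementation.

    # Sketch of a TCP-like strategy: constants and state handling are assumptions.

    class TcpLikeStrategy:
        ALPHA, BETA, K = 1 / 8, 1 / 4, 4          # conventional TCP constants

        def __init__(self):
            self.srtt = None                       # smoothed communication time
            self.rttvar = None                     # smoothed deviation

        def resending_instant(self, communication_time):
            if self.srtt is None:                  # first measurement
                self.srtt = communication_time
                self.rttvar = communication_time / 2
            else:
                self.rttvar = (1 - self.BETA) * self.rttvar \
                              + self.BETA * abs(self.srtt - communication_time)
                self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * communication_time
            # Unstable conditions inflate rttvar, and with it the safety margin.
            return self.srtt + self.K * self.rttvar

    strategy = TcpLikeStrategy()
    for ct in (150, 155, 210, 160):
        print(round(strategy.resending_instant(ct), 1))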
Fig. 2. Almost reactive average tuning strategy
Fig. 3. TCP tuning strategy
It is expected that the diversity of strategies helps to match the diversity of environmental states: different tuning strategies will be differently fitted to the different communication classes. For example, when traffic fluctuations are unpredictable, a reactive strategy will probably produce better results than an average-based strategy; the opposite will probably occur if the traffic level tendency is to vary within a sufficiently well determined interval. The goal of the learning system is precisely to select for each communication class the strategies that produce better results. As mentioned before, the optimism level regulates an agent’s belief in message losses: the more pessimistic the agent is, the sooner it tends to conclude that a message was lost, the sooner it tends to resend it. Agents with different optimism levels use the same tuning strategy differently: based on the same unmodified resending instant given by the strategy, a pessimistic agent shortens the delay and an optimistic agent widens it (Figure 4 shows this effect).
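One plausible way to realise this effect, given purely as an illustration since the paper does not state the actual scaling rule, is to scale the unmodified resending instant by a factor derived from the optimism level:

    # Illustrative assumption: optimism in [-1, 1], negative = pessimistic.

    def apply_optimism(unmodified_instant, optimism, margin=0.2):
        """Pessimistic agents shorten the delay before resending; optimistic
        agents widen it. The +/-20% margin is an invented parameter."""
        return unmodified_instant * (1.0 + margin * optimism)

    print(apply_optimism(160, -1.0))   # 128.0 -> very pessimistic: resend sooner
    print(apply_optimism(160,  0.0))   # 160.0 -> neutral
    print(apply_optimism(160,  1.0))   # 192.0 -> very optimistic: wait longer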
Fig. 4. Effect of the optimism level on strategies
This differentiation (along with the dynamism level) opens the possibility of establishing relations between certain types of agents (for example, highly pessimistic) and certain communication conditions (for example, constant traffic level). 3.3 Performance Evaluation The evaluation of the agents’ performance has three main goals: to measure the significance of different individual features: it may be possible to determine a measure of the fitness of certain agents’ characteristics to the communication conditions, if a specific type of agent (for example, very optimistic) tends to receive higher or lower scores under those conditions; to allow the agents to adjust their procedure: poor performance causes a decrease of the satisfaction level, and eventually leads to a change of strategy; to support the classification of training examples: this will be detailed further ahead, in the learning system subsection. At the end of each season, each agent has information that allows it to qualify its performance. The score obtained helps each individual to answer the following question: how accurate were the resending instants chosen? As it will be detailed in the learning system subsection, the successive scores will help to answer another question: how fitted is the chosen strategy to the communication class currently observed? The main performance evaluation score is obtained considering the sum of distances to the optimum instants, adding a penalty for each message unnecessarily resent (the lower the score, the better the performance). The penalty for unnecessary resend equals the distance to the optimum instant: the larger the distance the bigger the penalty. To clarify this system, consider the following example: agent A sends 3 messages during a communication season, to agents B (mAB), C (mAC) and D (mAD); the resending instant for mAB is 150 time units; the resending instant for mAC is 180 time units;
the resending instant for mAD is 140 time units; the acknowledgement from mAB arrives after 146 time units; the acknowledgement from mAC arrives after 183 time units (an unnecessary resend occurred); the acknowledgement from mAD arrives after 270 time units (this means the first message was not acknowledged, and the second one’s acknowledgement arrived after 130 time units). Given this situation, agent A would get: 4 points for mAB (150-146); 6 points for mAC (183-180=3, penalty for unnecessary resend: 3*2=6); 10 points for mAD (140-130). Agent A’s final score is 20 points. For example, if the resending instant for mAB were 155 time units (further away from the optimum instant by 5 time units), the score would have been 25 points (worse than 20).

Two additional auxiliary scoring mechanisms are also used in order to evaluate the agents’ performance. These auxiliary scoring mechanisms were not considered in the classification of training examples, but they also influence the satisfaction level. Each of these mechanisms ranks the agents on each season (from first to last place), according to each of the following criteria: the necessary time to complete the sending task (the best one is the first to receive all acknowledgements); the quality of the chosen resending instants (same criterion as the main method described above). At the end of each season, each agent is scored according to its rank (the first of n agents gets 1 point, the second 2 points, and so on until the last that gets n points). The information about every agent’s performance is gathered and processed at the end of each season in order to establish the rank. On a simulation workbench this is a trivial task because all the information is locally available. In a real situation, a way of gathering the information and broadcasting the results would have to be studied. The purpose of these auxiliary mechanisms is to allow the agents to compare each other’s performance. When the communication conditions are unstable, the resending instants are more difficult to set and, although the agent’s performance may be good (considering the complexity involved), the main score (determined in an absolute way) will tend to be worse. In these cases, the auxiliary scores (determined in a relative way) can help each agent to correctly evaluate the quality of its performance.

3.4 Cognitive System

The information memorized after each season is continuously analysed and processed in order to provide the agent with an image of the world. The memorization structure, more than just an information repository, is a fundamental part of the cognitive system; among other information (arguments for the tuning strategies), it stores:
a short memory block: contains the average communication times of each of ten consecutive seasons and the average performance score during those ten seasons (state of the environment in the last few seasons);
a global memory block: the result of a process of synthesis of past short memory blocks (state of the environment during a wider period). Every ten seasons, a short memory block is processed in order to obtain information that is then added to the global memory block. This synthesised information includes: a set of parameters that characterize the traffic oscillation (for example, how many increases of traffic level were observed during the ten seasons), the communication class observed, and the average performance score. A communication class is a generic classification of the state of the environment. Such a classification is determined from the parameters that characterize the traffic conditions (obtained from each short memory block), and has three dimensions: the traffic level (high, medium, low, variable), the traffic variation tendency (increase, decrease, constant, alternating, undetermined) and the sudden variations occurrence (jumps, steps, jumps and steps, none). The detection of variations in the communication system is based on the following principle: the greater the difference between the global communication class (the communication class of the global memory block) and the communication classes of the last short memory blocks, the more likely it is that a significant variation is occurring. A metric of three-dimensional distance between communication classes was developed in order to apply this idea (considering, for example, that the difference between a high and a medium traffic level is smaller than the difference between a high and a low traffic level). The distance between two communication classes is obtained by adding the distances between each dimension’s members. The detection of variations causes the satisfaction level to progressively decrease, motivating the agent to choose a new tuning strategy more adequate to the new communication class. When a variation is first detected, the decrease of satisfaction is generally small; in this way, if the variation is merely transitory its impact will be minimal. However, if the variation is progressively confirmed, the decrease in the satisfaction level is continuously accentuated: the more obvious and significant the variation is, the larger becomes the satisfaction level decrease. 3.5 Learning System The agents must be able to adapt to the conditions of the communication system, selecting the best strategies for each communication class. This requirement appeals for a learning mechanism that builds up and uses accumulated experience. When its performance is not satisfactory (bad score), an agent must learn that the strategy used in the current communication class is probably not adequate. If the performance is good, the agent must learn the opposite. The learning mechanism is based on the following cyclic process associated to internal state transitions: Finalization of the previous state (a new training example is stored); Prevision of a communication class and selection of a tuning strategy for the next state; Beginning of the new state. When the satisfaction level is low (bad performance and/or variation detected), an agent may decide to finalize its current state. The dynamism level associated to each agent makes it possible for two agents to act differently when facing the same condi-
Using Cognition and Learning to Improve Agents’ Reactions
249
tions: a more dynamic individual is more likely to feel dissatisfied and will consequently change its state more often then a more conservative individual. An agent’s state (from the learning mechanism perspective) consists of a sequence of communication seasons, characterized by a global communication class, a strategy in use, and a performance score. When an agent decides to end the current state, this characterization is used to build a new training example, a triple . The training examples are stored into a two-dimensional table (the experience table) that contains the average score for each pair , recalculated every time a correspondent training example is added. The more training examples (concerning a particular communication class) an agent stores, the higher is its experience level (in that class). This form of case based learning has some specific characteristics: the learning cases are processed instead of being stored (only necessary information is preserved); it admits an initially void base of cases; it is dynamic, in the sense that new learning cases cause previous information to be progressively adjusted (recalculation of the average score). This learning mechanism has therefore some resemblance to a simple way of reinforcement learning. Our initial idea was indeed to use reinforcement learning, but a more careful study revealed incompatibilities between this learning mechanism and the addressed problem: on the state transition process, there is no necessary relationship between the action an agent selects (new tuning strategy), and the following state (that includes the communication class, which is generally not influenced by the agent’s choice); moreover, a state is only identified at the moment of its conclusion. These facts oppose the most basic principles of reinforcement learning. Before initiating a new state, the agent must predict a communication class and choose a new strategy. The prediction of a communication class for the next state is based on the following complementary ideas: if the state transition was mainly motivated by bad performance, the communication class remains the same; if it was motivated by the detection of variations, then those variations are analysed in order to predict a new communication class. This analysis considers the latest three short memory blocks, using their data to determine a new communication class (assuming that the lately detected variations characterize the new state of the communication environment). If transition patterns were to be found in the sequence of states, this prediction process could be enhanced by the use of another learning mechanism that were able to progressively determine which transitions were more likely to occur. To select a new strategy an agent may consult the experience table (selecting the best strategy according to the predicted communication class), choose randomly (when it has no previous experience or when it decides to explore new alternatives), or consult a shared blackboard (described ahead). The random selection of strategies is the preferred option when the agent has a low experience level, being progressively abandoned when the experience level increases (even when experience is very high, a small probabilistic margin allows random selections). Random selection could be replaced by another specific distribution (such as Boltzmann distribution). The shared blackboard is used as a simple communication method for sharing knowledge between agents. 
Every ten seasons, the most successful agents (those who receive better scores) use it to register some useful information (strategy in use and communication class detected), allowing others to use it in their benefit. When the agents do not use this communication mechanism, learning is an individual effort; if it is used in exclusivity (as the only way to select a new strategy), learning does not occur. When it is optional, it enables a simple way of multi-agent learning: by sharing
250
P.R. Graça and G. Gaspar
their knowledge, the agents allow others to benefit from their effort, eventually leading each other to better solutions earlier than it would happen otherwise. An agent whose only method of strategy selection is consulting the blackboard is considered opportunistic: he develops no learning effort and always relies on other agents’ work. When a new state begins, the agent’s memory (short and global blocks) is initialised. If, in a short period of time (first thirty seasons), the predicted communication class proves to be a wrong prevision (because the analysis was wrong or because the conditions changed), the agent may choose to interrupt the new state to correct it. In this case, regarding the interrupted state, no training example is considered. 3.6 Agent–Architecture Diagram The diagram shown in Figure 5 summarizes the agent-architecture presented in this paper, dividing it in three functional modules. The reaction module includes a message management unit that is connected to the communication system and is responsible for sending and resending messages according to the strategy in use (modified by the optimism level). The strategy is changed when the strategy management unit (in the learning module) produces such a decision. The data processor unit included in the cognition module is responsible for analysing the information that is constantly memorized, evaluating the environment (communication classes) and the performance (score). Conditioned by the dynamism level, the result of this analysis influences the satisfaction level. Whenever a state transition occurs, the data processor sends necessary information to the learning module so that a new training example can be created and stored. The strategy management unit included in the learning module is responsible for the state transition decision (conditioned by the satisfaction level) and its related operations. It stores the training examples in the experience data unit. 3.7 Learning with Agent Adam The following description illustrates a sequence of operations of a specific agent during a learning state. This example can help to clarify the concepts presented along this section, showing how they are applied in a practical situation. Agent Adam, a very optimistic and very dynamic individual, was using an average-based strategy to tune the resending instants. Since he is very optimistic, the resending instants are delayed accordingly (a less optimistic agent following the same strategy would resend earlier than Adam). The strategy was producing good results (he was receiving high scores) in a communication environment characterized by a low and constant traffic level, with no sudden variations; this communication class was being observed in the last 50 seasons (that far, the duration of the current state). Adam’s satisfaction level was high and a change of strategy was not in consideration.
Using Cognition and Learning to Improve Agents’ Reactions
251
Fig. 5. Agent-architecture diagram
After ten more seasons had passed, the latest short memory block was analysed and some jumps were detected. Initially, the detection of this new condition caused only a small decrease of the satisfaction level; but when it was confirmed by two additional short memory blocks and reinforced by low scoring (the adopted strategy was now producing bad results), it motivated a quick decrease of the satisfaction level that led to a strategy change decision. If Adam were less dynamic, the decrease of the satisfaction level would have been slower and such decision would probably take a longer time to occur. Following the strategy change decision, a new training example was stored, describing the good results of the previous strategy under the previous communication class (low and constant traffic level, with no sudden variations). If the same conditions were met in the future, this information could then be helpful for the selection of a tuning strategy. Considering the latest short memory blocks, a new communication class was predicted: low and alternating traffic level, with jumps. Since Adam had no memory of
252
P.R. Graça and G. Gaspar
operating under such conditions, he could not rely on previous experience to select a new tuning strategy. So, putting aside a random selection alternative, he decided to consult the blackboard. Understanding that Amy, a very successful agent that had been receiving the highest scores, had detected the same communication class, Adam selected the same tuning strategy that she was using, and then begun a new state. Even if Amy’s strategy were a bad choice for Adam, he would have in the future (when the same communication class were detected and a random selection of strategy decided) the opportunity for testing other strategies (explore other possibilities) and find out which one would serve him better in this situation. Moreover, he would (given enough opportunities) eventually select untested strategies even if a good strategy were already found (this would allow him to escape local minimums in local search). However, if Amy’s strategy were a good choice for Adam, it would allow him not only to perform better but also to accelerate his learning effort (a good reference point in terms of tuning strategy would allow him to quickly discard worst alternatives).
4 Tests and Results Using a simulation workbench for the group communication problem described, a significant number of tests were made. In this section we describe the most relevant of those tests and discuss their results. To support these tests, several traffic variation functions were created. Some reflect traffic variation patterns as they are found in real communication situations; others set interesting situations that help the evaluation of the different aspects of the agent-architecture. Each simulation is composed by a sequence of one thousand communication seasons. Each test is composed by a sequence of five hundred simulations. The average of the agents’ added distances to the optimum instants (from hereon referred simply as distance) is the value chosen to express the results.
4.1 Tuning Strategies and Cognitive System The initial tests were focused on the tuning strategies. The original set of strategies was tested separately (no cognition or learning) under different traffic conditions. These tests led to the introduction of additional strategies (to cover important specific situations) and to an important (and expected) conclusion: different communication classes have different more adequate strategies. The tests made to the cognitive system allowed its progressive refinement. In its final version, the system was able to classify the communication conditions in a very satisfactory manner. The possibility of capturing the essential aspects of a complex real-time environment in a classification system opened the door to the introduction of the learning mechanism.
Using Cognition and Learning to Improve Agents’ Reactions
253
4.2 Learning Mechanism The learning mechanism produced a clear and important result: the progressive decrease of the distance. The more complex the traffic variation function is (in other words, the greater the number of communication classes needed to capture the evolution of traffic conditions), the slower is this decrease. Learning Mechanism (simple traffic variation function)
150 140 Distance
Learning
130 Best strategy (no learning)
120 110
477
449
421
393
365
337
309
281
253
225
197
169
141
85
113
57
1
29
100 Experience (simulations)
Fig. 6. Learning in a simple traffic variation context
In simple situations, where a single tuning strategy is highly adequate to the traffic function, the learning curve tends to approach the results that would be obtained if only that strategy was used1 (Figure 6). In more complex situations, when the diversity of the traffic variation function appeals to the alternate use of two or more strategies, the advantage of the learning mechanism becomes evident (Figure 7). Learning Mechanism (complex traffic variation function)
275
Learning
Distance
255
Best strategy (no learning)
235 215 195
487
460
433
406
379
352
325
298
271
244
217
190
163
136
109
82
55
1
28
175 Experience (simulations)
Fig. 7. Learning in a complex traffic variation context
1
When we mention that only a single strategy is used, we mean that the same tuning procedure is kept throughout the simulations. In these cases, the learning mechanism remains inactive.
254
P.R. Graça and G. Gaspar
Figure 7 emphasizes an important result: the alternated use of different strategies can match the diversity of communication classes, producing, in some situations, better results than those produced by any single strategy available. To confirm this result, special tests were made. Three different sets of available tuning strategies were considered in these tests: set A included the best strategy (the one producing better global results if used in exclusivity on the chosen context) and four other strategies (chosen randomly); set B included the five remaining strategies; a full set always included all ten. In each test, each of these sets was used separately and the respective results were compared. These tests showed clearly that, in some situations, the diversity of strategies is more important then the global fitness of any particular strategy. The full set often produced the best results (especially on complex traffic variation contexts) and, in some cases (as the one in Figure 8), set A produced the worst results (even though it contained the best strategy). Strategies and Learning
290
Set A
Distance
270
Set B Full set
250 230 210 190
481
451
421
391
361
331
301
271
241
211
181
151
121
91
61
31
1
170 Experience (simulations)
Fig. 8. Comparing the performance of different sets of strategies on the same traffic variation context
From these results emerges the idea that, in certain traffic variation contexts, there is no single strategy that can be used to match the performance of alternating a set of different strategies. 4.3 Optimism and Dynamism Levels Various tests showed that the optimism level could clearly influence the agents’ performance. When the traffic level has a low variation or while it continuously decreases, pessimistic agents usually have a better performance; when the traffic level has a high variation or while it continuously increases, an optimistic posture tends to be better. The next two figures show the results of testing five different sets of agents grouped by their optimism levels (all having neutral dynamism levels). Figure 9 refers to testing on a traffic variation context where the traffic level predominantly increases: periods of traffic increase are intercalated with abrupt and instantaneous traffic level decreases, producing a typical sawtoothed pattern. Since optimistic agents
Using Cognition and Learning to Improve Agents’ Reactions
255
tend to delay their resending instants, they have better chances to avoid unnecessary resending under such conditions and achieve a better performance. A pessimistic posture, according to which delays are believed to result from message losses, tends to anticipate the resend. Such a posture is generally penalized when the delay is a consequence of a traffic level increase. Because of this, in these cases, it pays off to wait a little longer before resending (optimistic posture).
Optimism level and Learning Highly pessimistic Pessimistic Neutral Optimistic Highly optimistic
430 410 Distance
390 370 350 330 310 290
494
465
436
407
378
349
320
291
262
233
204
175
146
88
117
59
30
1
270 Experience (simulations)
Fig. 9. The influence of the optimism level (sawtooth traffic variation pattern)
When a traffic variation context in which there is no predominant variation is considered, high optimism or pessimism postures are usually not adjusted (Figure 10).
Optimism level and Learning Highly pessimistic Pessimistic Neutral Optimistic Highly optimistic
95
Distance
85 75 65 55
239
225
211
197
183
169
155
141
127
113
99
85
71
57
43
29
15
1
45 Experience (simulations)
Fig. 10. The influence of the optimism level (traffic context with no predominant variation)
256
P.R. Graça and G. Gaspar
As it is evidenced by Figure 10, the relation between the optimism level and performance can be complex: although an optimistic posture produces the best results, a highly optimistic posture is worse than a highly pessimistic posture. The effect of the dynamism level on the agents’ performance was not made clear by the tests. It was observed that extremely dynamic agents had more difficulty in learning (they eventually achieved the same distance of others but took more time to do so). 4.4 Communication between Agents As described before, the agents may use a simple communication method (blackboard) as an alternative way of selecting a new strategy. To test and evaluate the impact of introducing this alternative, we considered two different sets of agents: agents that only used their individual experience table (no communication), and agents that alternated between their experience table and the blackboard. Results showed that the alternation between consulting the experience table and consulting the blackboard could improve the agents' performance (Figure 11). Comunication between Agents
220 210
Distance
200
Experience table only
190
Experience table and communication
180 170 160 150
494
465
436
407
378
349
320
291
262
233
204
175
146
88
117
59
1
30
140 Experience (simulations)
Fig. 11. Learning with and without communication
Encouraged by this observation, new tests were made in order to compare the performance of a set of agents that learn without using the blackboard with the performance of an opportunistic agent that only uses the blackboard (only uses the experience of others and does not learn by itself). These tests showed that, regardless of the traffic function in use, the opportunistic agent’s performance was never worse then the learning agents’ performance. Moreover, when complex traffic variation functions were used, the opportunistic agent clearly beat the agents that were able to learn (Figure 12). It is important to notice that, in this case, the opportunistic agent achieves a clearly better performance since the first simulation, and reaches its best performance level within the first ten simulations. This is a clear sign that the learning task could be optimised, that the interactions between agents (namely the knowledge sharing) can improve global performance (even at early stages), and that having agents with different tasks or roles can benefit the collective effort and enhance the results.
Using Cognition and Learning to Improve Agents’ Reactions
257
Communicating or Learning 220 210
Learning Agents (no communication)
Distance
200 190
Opportunistic Agent (no learning)
180 170 160 150 140 494
465
436
407
378
349
320
291
262
233
204
175
146
88
117
59
30
1
130 Experience (simulations)
Fig. 12. Performance of an opportunistic agent
5 Conclusions The good results achieved with the proposed agent-architecture in a simulated group communication problem showed its adequacy to a real-time environment. Reacting through the use of different tuning strategies, classifying the environment through a cognitive system, and progressively determining the fitness of those strategies to specific communication classes, the agents were able to significantly improve their performance. Furthermore, in some situations, the alternation of strategies allowed by the learning mechanism achieved results that were clearly superior to those obtainable using a single strategy, leaving the idea that having an expressive variety of options can be a good way of addressing the complexity and dynamism of the environment. The optimism and dynamism levels added versatility and adaptability to the agents. The optimism level revealed a special relevance, significantly influencing the agents’ performance in various situations. The accurate characterization of these situations could motivate the online variation of this level, allowing increased adaptation to the environment. The use of the blackboard as a knowledge sharing method improved overall performance. Furthermore, especially under complex traffic variation functions, opportunistic non-learning agents had better performance than learning non-communicating agents.
6 Final Discussion The success achieved by opportunistic agents indicates that it would be interesting and potentially useful to study the use of a mixed agent society in which the existence of different roles could lead to an improvement of collective performance. More than that, the members of this society could be exchanged according to their performance (from time to time, the worst active agents would be replaced) or even according to the collective experience (for example, under unknown conditions agents with the
258
P.R. Graça and G. Gaspar
ability of learning would be preferred, but under well known conditions the opportunistic agents would become predominant). The study of ways of coordination and interaction between agents to optimise the learning task is a promising field of development of the proposed architecture towards further improvement of global performance. The expressive results obtained with a simple communication mechanism suggest that an additional effort towards basic coordination could easily introduce a distributed learning perspective into the proposed model. This, along with the introduction of specific agent roles, could allow the reduction of the collective learning cost. The generic properties of a multi-agent system successfully matched the generic problems found in a typical telecommunication problem, reinforcing the idea that the affinities between Distributed Systems and Distributed Artificial Intelligence justify further research. Globally, more than showing the utility of the proposed agent-architecture to the problem in question, the encouraging results indicate that the generic model is a potentially adequate solution for similar problems, namely for those where a real-time environment constantly demands immediate reactions and continuously appeals for cognition. To help to put in perspective the generic aspects of the architecture, consider the following real-time communication problem. Imagine a multimedia conference where it is important that the participants keep visual contact with each other. During the conference, the video image frames are continuously transmitted on a communication system prone to traffic fluctuations. This problem is also concerned with the imperfection of the telecommunication system in a group communication situation. In this case it becomes important to set an adequate frame transmission rate so that the video image’s quality is as good as possible (it is expected and possibly unavoidable that on high traffic situations this quality decreases, being advisable to decrease the frame transmission rate so that congestion does not worsen). To apply the proposed agent-architecture to this problem, a set of transmission strategies (for example, frame repetition strategies, frame skipping strategies, fixed transmission rate, etc.) and a method of performance evaluation (based on the quality of the video image) would have to be defined. Other than that, the essential aspects of the architecture would be easily applicable. On a first glance, and as an example of a problem belonging to a different area of study (not centred on the communication process), our architecture seems to match the generic properties of the Robocup environment. Robocup sets a constantly changing environment that requires real-time responsiveness and, at the same time, strongly appeals for cognition. The alternative ways of reacting (running towards the ball, stopping, shooting at goal, passing, etc.) could be converted into strategies, and a learning mechanism could progressively determine their fitness to specific states of the environment. To determine the real extent of this simplistic and superficial analogy, further investigation is obviously required. If reaction, cognition and the ability to learn are among the most fundamental aspects of human behaviour, they may well emerge as fundamental aspects of artificial agents that dwell on artificial worlds that become more and more similar to our own.
Using Cognition and Learning to Improve Agents’ Reactions
259
References 1. Albayrak , S. (ed.): Intelligent Agents for Telecommunication Applications. Lecture Notes in Artificial Intelligence, Vol. 1699. Springer-Verlag, Berlin Heidelberg (1999) 2. Graça, P. R.: Performance of Evolutionary Agents in a Context of Group Communication. M. Sc. thesis, Department of Computer Science of the University of Lisbon (in Portuguese) (2000) 3. Hayzelden, A. L. G., Bigham, J. (eds.): Software Agents for Future Communication Systems. Springer-Verlag, Berlin Heidelberg (1999) 4. Malec, J.: On Augmenting Reactivity with Deliberation in a Controlled Manner. In: Proceedings of the Workshop on Balancing Reactivity and Social Deliberation in Multi–Agent Systems, Fourteenth European Conference on Artificial Intelligence, Berlin (2000) 89–100 5. Mavromichalis, V. K. and Vouros, G.: ICAGENT: Balancing Between Reactivity and Deliberation. In: Proceedings of the Workshop on Balancing Reactivity and Social Deliberation in Multi–Agent Systems, Fourteenth European Conference on Artificial Intelligence, Berlin (2000) 101–112 6. Peterson, L. L. and Davie, B. S.: Computer Networks: a Systems Approach. Morgan Kaufmann Publishers (1997) 7. Prasad, M. V. N. and Lesser, V. R.: Learning Situation-Specific Coordination in Cooperative Multi-Agent Systems. Autonomous Agents and Multi-Agent Systems, Vol. 2. (1999) 2:173–207 8. Sutton, R. S. and Barto, A. G.: Reinforcement Learning: an Introduction. The MIT Press, Cambridge (1998) 9. Weib, G.: An Architectural Framework for Integrated Multiagent Planning, Reacting, and Learning. In: Proceedings of the Seventh International Workshop on Agent Theories, Architectures, and Languages, Boston (2000)
TTree: Tree-Based State Generalization with Temporally Abstract Actions William T.B. Uther and Manuela M. Veloso Computer Science Department, Carnegie Mellon University, Pittsburgh, PA 15213 USA {uther, veloso}@cs.cmu.edu
Abstract. In this chapter we describe the Trajectory Tree, or TTree, algorithm. TTree uses a small set of supplied policies to help solve a Semi-Markov Decision Problem (SMDP). The algorithm uses a learned tree based discretization of the state space as an abstract state description and both user supplied and auto-generated policies as temporally abstract actions. It uses a generative model of the world to sample the transition function for the abstract SMDP defined by those state and temporal abstractions, and then finds a policy for that abstract SMDP. This policy for the abstract SMDP can then be mapped back to a policy for the base SMDP, solving the supplied problem. In this chapter we present the TTree algorithm and give empirical comparisons to other SMDP algorithms showing its effectiveness.
1
Introduction
Both Markov Decision Processes (MDPs) and Semi-Markov Decision Processes (SMDPs), presented in [1], are important formalisms for agent control. They are used for describing the state dynamics and reward structure in stochastic domains and can be processed to find a policy; a function from the world state to the action that should be performed in that state. In particular, it is useful to have the policy that maximizes the sum of rewards over time. Unfortunately, the number of states that need to be considered when finding a policy is exponential in the number of dimensions that describe the state space. This exponential state explosion is a well known difficulty when finding policies for large (S)MDPs. A number of techniques have been used to help overcome exponential state explosion and solve large (S)MDPs. These techniques can be broken into two main classes. State abstraction refers to the technique of grouping many states together and treating them as one abstract state, e.g. [2,3,4]. Temporal abstraction refers to techniques that group sequences of actions together and treat
This research was sponsored by the United States Air Force under Agreement Nos. F30602-00-2-0549 and F30602-98-2-0135. The content of this chapter does not necessarily reflect the position of the funding agencies and no official endorsement should be inferred.
E. Alonso et al. (Eds.): Adaptive Agents and MAS, LNAI 2636, pp. 260–290, 2003. c Springer-Verlag Berlin Heidelberg 2003
TTree: Tree-Based State Generalization
261
them as one abstract action, e.g. [5,6,7,8,9]. Using a function approximator for the value function, e.g. [10], can, in theory, subsume both state and temporal abstraction, but the authors are unaware of any of these techniques that, in practice, achieve significant temporal abstraction. In this chapter we introduce the Trajectory Tree, or TTree, algorithm with two advantages over previous algorithms. It can both learn an abstract state representation and use temporal abstraction to improve problem solving speed. It also uses a new format for defining temporal abstractions that relaxes a major requirement of previous formats – it does not require a termination criterion as part of the abstract action. Starting with a set of user supplied abstract actions, TTree first generates some additional abstract actions from the base level actions of the domain. TTree then alternates between learning a tree based discretization of the state space and learning a policy for an abstract SMDP using the tree as an abstract state representation. In this chapter we give a description of the behavior of the algorithm. Moreover we present empirical results showing TTree is an effective anytime algorithm.
2
TTree
The goal of the TTree algorithm is to take an SMDP and a small collection of supplied policies, and discover which supplied policy should be used in each state to solve the SMDP. We wish to do this in a way that is more efficient than finding the optimal policy directly. The TTree algorithm is an extension of the Continuous U Tree algorithm [3]. In addition to adding the ability to use temporal abstraction, we also improve the Continuous U Tree algorithm by removing some approximations in the semantics of the algorithm. TTree uses policies as temporally abstract actions. They are solutions to subtasks that we expect the agent to encounter. We refer to these supplied policies as abstract actions to distinguish them from the solution – the policy we are trying to find. This definition of “abstract actions” is different from previous definitions. Other definitions of abstract actions in reinforcement learning, e.g. [5,6], have termination criteria that our definition does not. Definitions of abstract actions in planning, e.g. [11], where an abstract action is a normal action with some pre-conditions removed, are even further removed from our definition. This ‘planning’ definition of abstract actions is closer to the concept of state abstraction than temporal abstraction. Each of the supplied abstract actions is defined over the same set of baselevel actions as the SMDP being solved. As a result, using the abstract actions gives us no more representational power than representing the policy through some other means, e.g. a table. Additionally, we ensure that there is at least one abstract action that uses each base-level action in each state, so that we have no less representational power than representing the policy through some other means.
262
W.T.B. Uther and M.M. Veloso
We noticed that a policy over the abstract actions has identical representational power to a normal policy over the states of an SMDP. However, if we have a policy mapping abstract states to abstract actions, then we have increased the representation power over a policy mapping abstract states to normal actions. This increase in power allows our abstract states to be larger while still representing the same policy.
3
Definitions
An SMDP is defined as a tuple S, A, P, R. S is the set of world states. We will use s to refer to particular states, e.g. {s, s } ∈ S. We also assume that the states embed into an n-dimensional space: S ≡ S 1 × S 2 × S 3 × · · · × S n . In this chapter we assume that each dimension, S i , is discrete. A is the set of actions. We will use a to refer to particular actions, e.g. {a0 , a1 } ∈ A. Defined for each state action pair, Ps,a (s , t) : S × A × S × → [0, 1] is a joint probability distribution over both next-states and time taken. It is this distribution over the time taken for a transition that separates an SMDP from an MDP. R(s, a) : S × A → defines the expected reward for performing an action in a state.1 The agent interacts with the world as follows. The agent knows the current state: the world is Markovian and fully observable. It then performs an action. That action takes a length of time to move the agent to a new state, the time and resulting state determined by P . The agent gets reward for the transition determined by R. As the world is fully observable, the agent can detect the new world state and act again, etc. Our goal is to learn a policy, π : S → A, that maps from states to actions. In particular we want the policy, π ∗ , that maximizes a sum of rewards. To keep this sum of rewards bounded, we will introduce a multiplicative discount factor, ∞ γ ∈ (0, 1). The goal is to find a policy that maximizes i=0 γ τi ri where τi is the time that the agent starts its ith action, and ri is the reward our agent receives for its ith action. Note that Times-Roman it will be useful to refer to a stochastic policy. This is a function from states to probability distributions over the actions. We can then define the following standard functions: ∞ Q(s, a) = R(s, a) + Ps,a (s , t)γ t V (s ) dt (1) s ∈S
t=0
V (s) = Q (s, π(s)) π ∗ (s) = argmax Q∗ (s, a) a∈A
(2) (3)
We now introduce a related function, the T π function. This function is defined over a set of states S ⊂ S. It measures the discounted sum of reward for following the given action until the agent leaves S , then following the policy π. 1
R can also depend upon both next state and time for the transition, but as these in turn depend only upon the state and action, they fall out of the expectation.
TTree: Tree-Based State Generalization
TSπ (s, a) = R(s, a) +
t=0
s ∈S
+
∞
Ps,a (s , t)γ t TSπ (s , a) dt
s ∈(S−S )
∞
t=0
263
(4)
Ps,a (s , t)γ t V π (s ) dt
We assume that instead of sampling P and R directly from the world, our agent instead samples from a generative model of the world, e.g. [12]. This is a function, G : S × A → S × × , that takes a state and an action and returns a next state, a time and a reward for the transition. Our algorithm uses G to sample trajectories through the state space starting from randomly selected states.
4
The TTree Algorithm
TTree works by building an abstract SMDP that is smaller than the original, or base, SMDP. The solution to this abstract SMDP is an approximation to the solution to the base SMDP. The abstract SMDP is formed as follows: The states in the abstract SMDP, the abstract states, are formed by partitioning the states in the base SMDP; each abstract state corresponds to the set of base level states in one element of the partition. Each base level state falls into exactly one abstract state. Each action in the abstract SMDP, an abstract action, corresponds to a policy, or stochastic policy, in the base SMDP. The abstract transition and reward functions are found by sampling trajectories from the base SMDP. We introduce some notation to help explain the algorithm. We use a bar over a symbol to distinguish the abstract SMDP from the base SMDP, e.g. s¯ vs. s, or A¯ vs. A. This allows us a shorthand notation: when we have a base state, s, we use s¯ to refer specifically to the abstract state containing s. Also, when we have an abstract action a ¯ we use πa¯ to refer to the base policy corresponding to a ¯ and hence πa¯ (s) is the corresponding base action at state s. Additionally, we Times-Roman overload s¯ to refer to the set of base states that it corresponds to, e.g. s ∈ s¯. Finally, it is useful, particularly in the proofs, to define functions that describe the base states within an abstract state, s¯, but only refer to abstract states outside of s¯. We mark these functions with a tilde. For example, we can define a function related to TS (s, a) in equation 4 above, T˜s¯(s, a). T˜s¯(s, a) = R(s, a) + s ∈¯ s
+
∞
t=0
s ∈s¯ ,s¯ =s¯
Ps,a (s , t)γ t T˜s¯(s , a) dt
∞
t=0
Ps,a (s , t)γ t V¯ (s¯ ) dt
(5)
264
W.T.B. Uther and M.M. Veloso
Note that the T˜s¯ function is labelled with a tilde, and hence within the abstract state s¯ we refer to base level states, outside of s¯ we refer to the abstract value function over abstract states. We describe the TTree algorithm from a number of different viewpoints. First ¯ A, ¯ P¯ , R. ¯ Then we we describe how TTree builds up the abstract SMDP, S, follow through the algorithm in detail, and finally we give a high level overview of the algorithm comparing it with previous algorithms. 4.1
Defining the Abstract SMDP
TTree uses a tree to partition the base level state space into abstract states. Each node in the tree corresponds to a region of the state space with the root node corresponding to the entire space. As our current implementation assumes state dimensions are discrete, internal nodes divide their region of state space along one dimension with one child for each discrete value along that dimension. It is a small extension to handle continuous and ordered discrete attributes in the same manner that Continuous U Tree [3] does. Leaf nodes correspond to abstract states; all the base level states that fall into that region of space are part of the abstract state. TTree uses a set of abstract actions for the abstract SMDP. Each abstract action corresponds to a base level policy. There are two ways in which these abstract actions can be obtained; they can be supplied by the user, or they can be generated by TTree. In particular, TTree generates one abstract action for each base level action, and one additional ‘random’ abstract action. The ‘random’ abstract action is a base level stochastic policy that performs a random base level action in each base level state. The other generated abstract actions are degenerate base level policies: they perform the same base level action in every base level state: ∀s; πa¯1 (s) = a1 , πa¯2 (s) = a2 , . . . , πa¯k (s) = ak . These generated abstract actions are all that is required by the proof of correctness. Any abstract actions supplied by the user are hints to speed up the algorithm and are not required for correctness. Informally, the abstract transition and reward functions are the expected result of starting in a random base state in the current abstract state and following a trajectory through the base states until we reach a new abstract state. To for˜ s¯(s, a malize this we define two functions. R ¯) is the expected discounted reward of starting in state s and following a trajectory through the base states using πa¯ until a new abstract state is reached. If no new abstract state is ever reached, ˜ is the expected discounted reward of the infinite trajectory. P˜s,¯a (s¯ , t) is then R ˜ s¯(s, a the expected probability, over the same set of trajectories as R ¯), of reaching the abstract state s¯ in time t. If s¯ is s¯ then we change the definition; when t = ∞, P˜s,¯a (s¯ , t) is the probability that the trajectory never leaves state s¯, and P˜s,¯a (s¯ , t) is 0 otherwise. We note that assigning a probability mass to t = ∞ is a mathematically suspect thing to do as it assigns a probability mass, rather than a density, to a single ‘point’ and, furthermore, that ‘point’ is ∞. We justify the use of P˜s,¯a (¯ s, ∞) as
TTree: Tree-Based State Generalization
265
a notational convenience for “the probability we never leave the current state” as follows. We note that each time P is referenced with s¯ = s¯, it is then multiplied by γ t , and hence for t = ∞ the product is zero. This is the correct value for an infinitely discounted reward. In the algorithm, as opposed to the proof, t = ∞ is approximated by t ∈ (MAXTIME, ∞). MAXTIME is a constant in the algorithm, chosen so that γ MAXTIME multiplied by the largest reward in the SMDP is approximately zero. The exponential discounting involved means that MAXTIME is usually not very large. ˜ are expressed in the following equations: The definitions of P˜ and R ∞ ˜ s¯(s , a ˜ ¯) = R(s, πa¯ (s)) + Ps,πa¯ (s) (s , t)γ t R ¯) dt (6) Rs¯(s, a s ∈¯ s t=0
P˜s,¯a (s¯ , t) =
Ps,πa¯ (s) (s , t) ¯ s ∈s t + Ps,πa¯ (s) (s , t )P˜s ,¯a (s¯ , t − t ) dt
s ∈¯ s 0 1− s¯ =s¯
t =0
∞
t=0
P˜s,a (s¯ , t) dt
: s¯ = s¯ (7) : s¯ = s¯, t = ∞ : s¯ = s¯, t = ∞
˜ is recursively defined as the expected reward of the first step plus Here R the expected reward of the rest of the trajectory. P˜ also has a recursive formula. The first summation is the probability of moving from s to s¯ in one transition. The second summation is the probability of transitioning from s to another state s ∈ s¯ in one transition, and then continuing from s on to s¯ in a trajectory using the remaining time. Note, the recursion in the definition of P˜ is going to be bounded as we disallow zero time cycles in the SMDP. ¯ We can now define the abstract transition and reward functions, P¯ and R, ˜ as the expected values over all base states in the current abstract state of P and ˜ R:
E P˜s,¯a (s¯ , t) ¯ s, a ˜ s¯(s, a ¯) R(¯ ¯) = E R s∈¯ s
P¯s¯,¯a (s¯ , t) =
s∈¯ s
(8) (9)
¯ are the expected transition and reward functions if we In English, P¯ and R start in a random base level state within the current abstract state and follow the supplied abstract action until we reach a new abstract state. 4.2
An Overview of the TTree Algorithm
¯ are not calculated directly from the above formulae. In the algorithm P¯ and R Rather, they are sampled by following trajectories through the base level state space as follows. A set of base level states is sampled from each abstract state. From each of these start states, for each abstract action, the algorithm uses the
266
W.T.B. Uther and M.M. Veloso
generative model to sample a series of trajectories through the base level states that make up the abstract state. In detail for one trajectory: let the abstract state we are considering be the state s¯. The algorithm first samples a set of base level start states, {s0 , s1 , . . . , sk } ∈ s¯. It then gathers the set of base level policies for the abstract actions, {πa¯1 , πa¯2 , . . . , πa¯l }. For each start state, si , and policy, πa¯j , in turn, the agent samples a series of base level states from the generative model forming a trajectory through the low level state space. As the trajectory progresses, the algorithm tracks the sum of discounted reward for the trajectory, and the total time taken by the trajectory. The algorithm does not keep track of the intermediate base level states. These trajectories have a number of termination criteria. The most important is that the trajectory stops if it reaches a new abstract state. The trajectory also stops if the system detects a deterministic self-transition in the base level state, if an absorbing state is reached, or if the trajectory exceeds a predefined length of time, MAXTIME. The result for each trajectory is a tuple, sstart , a¯j , sstop , t, r, of the start base level state, abstract action, end base level state, total time and total discounted reward. We turn the trajectory into a sample transition in the abstract SMDP, i.e. a tuple ¯ sstart , a¯j , s¯stop , t, r. The sample transitions are combined to estimate the ¯ abstract transition and reward functions, P¯ and R. The algorithm now has a complete abstract SMDP. It can solve it using traditional techniques, e.g. [13], to find a policy for the abstract SMDP: a function from abstract states to the abstract action that should be performed in that abstract state. However, the abstract actions are base level policies, and the abstract states are sets of base level states, so we also have a function from base level states to base level actions; we have a policy for the base SMDP – an approximate solution to the suppled problem. Having found this policy, TTree then looks to improve the accuracy of its approximation by increasing the resolution of the state abstraction. It does this by dividing abstract states – growing the tree. In order to grow the tree, we need to know which leaves should be divided and where they should be divided. A leaf should be divided when the utility of performing an abstract action is not constant across the leaf, or if the best action changes across a leaf. We can use the trajectories sampled earlier to get point estimates of the T function defined in equation 4, itself an approximation of the utility of performing an abstract action in a given state. First, we assume that the abstract value function, V¯ , is an approximation of the base value function, V . Making this substitution gives us the T˜ function defined in equation 5. The sampled trajectories with the current abstract value function allow us to estimate T˜. We refer to these estimates as Tˆ. For a single trajectory si , a¯j , sstop , r, t we can find s¯stop and then get the estimate2 : 2
It has been suggested that it might be possible to use a single trajectory to gain Tˆ estimates at many locations. We are wary of this suggestion as those estimates would be highly correlated; samples taken from the generative model near the end of a trajectory would affect the calculation of many point estimates.
TTree: Tree-Based State Generalization
267
Table 1. Constants in the TTree algorithm Constant Definition Na The number of trajectory start points sampled from the entire space each iteration Nl The minimum number of trajectory start points sampled in each leaf Nt The number of trajectories sampled per start point, abstract action pair MAXTIME The number of time steps before a trajectory value is assumed to have converged. Usually chosen to keep γ MAXTIME r/(1 − γ t ) < , where r and t are the largest reward and smallest time step, and is an acceptable error
Tˆs¯(si , a¯j ) = r + γ t V¯ (¯ sstop )
(10)
From these Tˆ(s, a ¯) estimates we obtain three different values used to divide the abstract state. Firstly, we divide the abstract state if maxa¯ Tˆ(s, a ¯) varies across the abstract state. Secondly, we divide the abstract state if the best action, argmaxa¯ Tˆ(s, a ¯), varies across the abstract state. Finally, we divide the abstract state if Tˆ(s, a ¯) varies across the state for any abstract action. It is interesting to note that while the last of these criteria contains a superset of the information in the first two, and leads to a higher resolution discretization of the state space once all splitting is done, it leads to the splits being introduced in a different order. If used as the sole splitting criterion, Tˆ(s, a ¯) is not as effective as maxa¯ Tˆ(s, a ¯) for intermediate trees. Once a division has been introduced, all trajectories sampled within the leaf that was divided are discarded, a new set of trajectories is sampled in each of the new leaves, and the algorithm iterates. 4.3
The TTree Algorithm in Detail
The TTree algorithm is shown in Procedure 1. The various constants referred to are defined in Table 1. The core of the TTree algorithm is the trajectory. As described above, these are paths through the base-level states within a single abstract state. They are used in two different ways in the algorithm; to discover the abstract transition function and to gather data about where to grow the tree and increase the resolution of the state abstraction. We first discuss how trajectories are sampled, then discuss how they are used. Trajectories are sampled in sets, each set starting at a single base level state. The function to sample one of these sets of trajectories is shown in Procedure 2. The set of trajectories contains Nt trajectories for each abstract action. Once sampled, each trajectory is recorded as a tuple of start state, abstract action, resulting state, time taken and total discounted reward, sstart , a ¯, sstop , ttotal , rtotal , with sstart being the same for each tuple in the set. The tuples in the trajectory set are stored along with sstart as a sample point, and added to the leaf containing sstart .
268
W.T.B. Uther and M.M. Veloso
¯ G, γ) Procedure 1 Procedure TTree(S, A, 1: tree ← a new tree with a single leaf corresponding to S 2: loop 3: Sa ← {s1 , . . . , sNa } sampled from S 4: for all s ∈ Sa do ¯ G, γ) {see Procedure 2} 5: SampleTrajectories(s, tree, A, 6: end for ¯ G, γ) {see Procedure 3} 7: UpdateAbstractSMDP(tree, A, ¯ γ) {see Procedure 4} 8: GrowTTree(tree, A, 9: end loop
The individual trajectories are sampled with the randomness being controlled [12,14]. Initially the algorithm stores a set of Nt random numbers that are used as seeds to reset the random number generator. Before the j th trajectory is sampled, the random number generator used in both the generative model and any stochastic abstract actions is reset to the j th random seed. This removes some of the randomness in the comparison of the different abstract actions within this set of trajectories. There are four stopping criteria for a sampled trajectory. Reaching another abstract state and reaching an absorbing state are stopping criteria that have already been discussed. Stopping when MAXTIME time steps have passed is an approximation. It allows us to get approximate values for trajectories that never leave the current state. Because future values decay exponentially, MAXTIME does not have to be very large to accurately approximate the trajectory value [12]. The final stopping criterion, stopping when a deterministic self-transition occurs, is an optimization, but it is not always possible to detect deterministic self-transitions. The algorithm works without this, but samples longer trajectories waiting for MAXTIME to expire, and hence is less efficient. The TTree algorithm samples trajectory sets in two places. In the main procedure, TTree randomly samples start points from the entire base level state space and then samples trajectory sets from these start points. This serves to increase the number of trajectories sampled by the algorithm over time regardless of resolution. Procedure 3 also samples trajectories to ensure that there sampled trajectories in every abstract state to build the abstract transition function. As well as using trajectories to find the abstract transition function, TTree also uses them to generate data to grow the tree. Here trajectories are used to generate three values. The first is an estimate of the T function, Tˆ, the second is an estimate of the optimal abstract action, π ˆ (s) = argmaxa¯ Tˆ(s, a ¯), and the ˆ third is the value of that action, maxa¯ T (s, a ¯). As noted above, trajectories are sampled in sets. The entire set is used by TTree to estimate the Tˆ values and hence reduce the variance of the estimates. As noted above (equation 10 – reprinted here), for a single trajectory, stored as the tuple sstart , a¯j , sstop , r, t, we can find s¯stop and can calculate Tˆ: ¯) = r + γ t V¯ (¯ sstop ) Tˆs¯start (sstart , a
(11)
TTree: Tree-Based State Generalization
269
¯ G, γ) Procedure 2 Procedure SampleTrajectories(sstart , tree, A, 1: Initialize new trajectory sample point, p, at sstart {p will store Nt trajectories for ¯ actions} each of the |A| 2: Let {seed1 , seed2 , . . . , seedNt } be a collection of random seeds 3: l ← LeafContaining(tree, sstart ) 4: for all abstract actions a ¯ ∈ A¯ do 5: let πa¯ be the base policy associated with a ¯ 6: for j = 1 to Nt do 7: Reset the random number generator to seedj 8: s ← sstart 9: ttotal ← 0, rtotal ← 0 10: repeat 11: s, t, r ← G(s, πa¯ (s)) 12: ttotal ← ttotal + t 13: rtotal ← rtotal + γ ttotal r 14: until s ∈ l, or ttotal > MAXTIME, or s , ∗, ∗ = G(s, πa¯ (s)) is deterministic and s = s , or s is an absorbing state 15: if the trajectory stopped because of a deterministic self transition then 16: rtotal ← rtotal + γ (ttotal +t) r/(1 − γ t ) 17: ttotal ← ∞ 18: else if the trajectory stopped because the final state was absorbing then 19: ttotal ← ∞ 20: end if 21: sstop ← s 22: Add sstart , a ¯, sstop , ttotal , rtotal to the trajectory list in p 23: end for 24: end for 25: Add p to l
¯ G, γ) Procedure 3 Procedure UpdateAbstractSMDP(tree, A, 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15:
for all leaves l with fewer than Nl sample points do Sa ← {s1 , . . . , sNa } sampled from l for all s ∈ Sa do ¯ G, γ) {see Procedure 2} SampleTrajectories(s, tree, A, end for end for P ← ∅ {Reset abstract transition count} for all leaves l and associated points p do for all trajectories, sstart , a ¯, sstop , ttotal , rtotal , in p do lstop ← LeafContaining(tree, sstop ) P ← P ∪ {l, a ¯, lstop , ttotal , rtotal } end for end for Transform P into transition probabilities Solve the abstract SMDP
270
W.T.B. Uther and M.M. Veloso
For a set of trajectories all starting at the same base level state with the same abstract action we find a better estimate: Nt 1 ˆ ¯) = ri + γ ti V¯ (¯ sstop i ) Ts¯start (sstart , a Nt i=0
(12)
This is the estimated expected discounted reward for following the abstract action a ¯ starting at the base level state sstart , until a new abstract state is reached, and then following the policy defined by the abstract SMDP. If there is a statistically significant change in the Tˆ value across a state for any action then we should divide the state in two. Additionally, we can find which abstract action has the highest3 Tˆ estimate, π ˆ , and the value of that estimate, Vˆ : ¯) Vˆ (sstart ) = max Tˆ(sstart , a
(13)
π ˆ (sstart ) = argmax Tˆ(sstart , a ¯)
(14)
a ¯
a ¯
If the Vˆ or π ˆ value changes across an abstract state, then we should divide that abstract state in two. Note that it is impossible for π ˆ (s) or Vˆ (s) to change without Tˆ(s, a ¯) changing and so these extra criteria do not cause us to introduce any extra splits. However, they do change the order in which splits are introduced. Splits that would allow a change in policy, or a change in value function, are preferred over those that just improve our estimate of the Q function. The division that maximizes the statistical difference between the two sides is chosen. Our implementation of TTree uses a Minimum Description Length test that is fully described in [8] to decide when to divide a leaf. As well as knowing how to grow a tree, we also need to decide if we should grow the tree. This is decided by a stopping criterion. Procedure 4 does not introduce a split if the stopping criterion is fulfilled, but neither does it halt the algorithm. TTree keeps looping gathering more data. In the experimental results we use a Minimum Description Length stopping criterion. We have found that the algorithm tends to get very good results long before the stopping criterion is met, and we did not usually run the algorithm for that long. The outer loop in Procedure 1 is an infinite loop, although it is possible to modify the algorithm so that it stops when the stopping criterion is fulfilled. We have been using the algorithm as an anytime algorithm. 4.4
Discussion of TTree
Now that we have described the technical details of the algorithm, we look at the motivation and effects of these details. TTree was developed to fix some of the limitations of previous algorithms such as Continuous U Tree [3]. In particular we wanted to reduce the splitting from the edges of abstract states and we wanted to 3
Ties are broken in favor of the abstract action selected in the current abstract state.
TTree: Tree-Based State Generalization
271
¯ γ) Procedure 4 Procedure GrowTTree(tree, A, 1: 2: 3: 4: 5: 6: 7: 8: 9: 10: 11: 12: 13: 14: 15: 16: 17: 18: 19: 20: 21: 22:
DT ← ∅ {Reset split data set. DT is a set of states with associated Tˆ estimates.} Dπ ← ∅ DV ← ∅ for all leaves l and associated points p do {a point contains a set of trajectories starting in the same state} ¯ Tˆ(sstart , .) ← ∅ {Tˆ(sstart , .) is a new array of size |A|} ¯ for all trajectories in p, sstart , a ¯, sstop , t, r do {Nt trajectories for each of |A| actions} lstop ← LeafContaining(tree, sstop ) Tˆ(sstart , a ¯) ← Tˆ(sstart , a ¯) + (r + γ t V (lstop ))/Nt end for DT ← DT ∪ {sstart , Tˆ} {add Tˆ estimates to data set} Vˆ ← maxa¯ Tˆ(sstart , a ¯) π ˆ ← argmaxa¯ Tˆ(sstart , a ¯) DV ← DV ∪ {s, Vˆ } {add best value to data set} Dπ ← Dπ ∪ {s, π ˆ } {add best action to data set} end for for all new splits in the tree do EvaluateSplit(DV ∪ Dπ ∪ DT ) {Use the splitting criterion to evaluate this split } end for if ShouldSplit(DV ∪ Dπ ∪ DT ) then {Evaluate the best split using the stopping criterion} Introduce best split into tree Throw out all sample points, p, in the leaf that was split end if
In particular, we wanted to reduce the splitting from the edges of abstract states, to allow the measurement of the usefulness of abstract actions, and, finally, to improve the match between the way the abstract policy is used and the way the abstract SMDP is modelled, so as to increase the quality of the policy when the tree is not fully grown. Introducing trajectories instead of transitions solves these problems. The T̂ values, unlike the q values in Continuous U Tree, vary all across an abstract state, solving the edge-slicing issue. The use of trajectories allows us to measure the effectiveness of abstract actions along a whole trajectory rather than only for a single step. Finally, the use of trajectories allows us to build a more accurate abstract transition function.

Edge slicing was an issue in Continuous U Tree, where all abstract self-transitions with the same reward had the same q values, regardless of the dynamics of the self-transition. This means that often only the transitions out of an abstract state have different q values, and hence that the algorithm tends to slice from the edges of abstract states into the middle. TTree does not suffer from this problem, as the trajectories include a measure of how much time the agent spends following the trajectory before leaving the abstract state. If the state dynamics change across a state, then that is apparent in the T̂ values.
Trajectories allow us to select abstract actions for a state because they provide a way to differentiate abstract actions from base level actions. In one step there is no way to differentiate an abstract action from a base level action. Over multiple steps, this becomes possible. Finally, trajectories allow a more accurate transition function because they more accurately model the execution of the abstract policy. When the abstract SMDP is solved, an abstract action is selected for each abstract state. During execution that action is executed repeatedly until the agent leaves the abstract state. This repeated execution until the abstract state is exited is modelled by a trajectory. This is different from how Continuous U Tree forms its abstract MDP, where each step is modelled individually. TTree only applies the Markov assumption at the start of a trajectory, whereas Continuous U Tree applies it at each step. When the tree is not fully grown, and the Markov assumption is inaccurate, fewer applications of the assumption lead to a more accurate model.

However, the use of trajectories also brings its own issues. If the same action is always selected until a new abstract state is reached, then we have lost the ability to change direction halfway across an abstract state. Our first answer to this is to sample trajectories from random starting points throughout the state, as described above. This allows us to measure the effect of changing direction in a state by starting a new trajectory in that state. To achieve this we require a generative model of the world. With this sampling, if the optimal policy changes halfway across a state, then the T̂ values should change. But we only get T̂ values where we start trajectories. It is not immediately obvious that we can find the optimal policy in this constrained model. In fact, with a fixed-size tree we usually cannot find the optimal policy, and hence we need to grow the tree. With a large enough tree the abstract states and base level states are equivalent, so we know that expanding the tree can lead to optimality. However, it is still not obvious that the T̂ values contain the information we need to decide if we should keep expanding the tree; i.e., it is not obvious that there are no local maxima, with all the T̂ values equal within all leaves, but with a non-optimal policy. We prove that no such local maxima exist in Section 5 below.

The fact that we split first on V̂ = max_ā T̂(·, ā) and π̂ = argmax_ā T̂(·, ā) values before looking at all the T̂ values deserves some explanation. If you split on T̂ values then you split based on the data for non-optimal abstract actions. While this is required for the proof in Section 5 (see the example in Section 5.2), it also tends to cause problems empirically [8]. Our solution is to only split on non-optimal actions when no splits would otherwise be introduced.

Finally, we make some comments about the random abstract action. The random abstract action has T̂ values that are a smoothed version of the reward function. If there is only a single point reward there can be a problem finding an initial split. The point reward may not be sampled often enough to find a statistically significant difference between it and surrounding states. The random abstract action improves the chance of finding the point reward and introducing
the initial split. In some of the empirical results we generalize this to the notion of an abstract action for exploration.
5 Proof of Convergence
Some previous state abstraction algorithms [2,3] have generated data in a manner similar to TTree, but using single transitions rather than trajectories. In that case, the data can be interpreted as a sample from a stochastic form of the Q-function (TTree exhibits this behavior as a special case when MAXTIME = 0). When trajectories are introduced, the sample values no longer have this interpretation and it is no longer clear that splitting on the sample values leads to an abstract SMDP with any formal relation to the original SMDP. Other state abstraction algorithms, e.g. [4], generate data in a manner similar to TTree but are known not to converge to optimality in all cases.

In this section, we analyze the trajectory values. We introduce a theorem that shows that splitting such that the T̂ values are equal for all actions across a leaf leads to the optimal policy for the abstract SMDP, π̄*, also being an optimal policy for the original SMDP. The complete proofs are available in [8]. We also give a counter-example for a simplified version of TTree showing that having constant trajectory values for only the highest valued action is not enough to achieve optimality.
5.1 Assumptions
In order to separate the effectiveness of the splitting and stopping criteria from the convergence of the SMDP solving, we assume optimal splitting and stopping criteria, and that the sample sizes, Nl and Nt, are sufficient. That is, the splitting and stopping criteria introduce a split in a leaf if, and only if, there exist two regions, one on each side of the split, and the distribution of the value being tested is different in those regions. Of course, real world splitting criteria are not optimal, even with infinite sample sizes. For example, most splitting criteria have trouble introducing splits if the data follows an XOR or checkerboard pattern. Our assumption is still useful as it allows us to verify the correctness of the SMDP solving part of the algorithm independently of the splitting and stopping criteria.

This proof only refers to base level actions. We assume that the only abstract actions are the automatically generated degenerate abstract actions, and hence ∀ā, ∀s, π_ā(s) = a, and we do not have to distinguish between a and ā. Adding extra abstract actions does not affect the proof, and so we ignore them for convenience of notation.

Theorem 1. If the T̂ samples are statistically constant across all states for all actions, then an optimal abstract policy is an optimal base level policy. Formally,

    \forall \bar{a} \in \bar{A},\ \forall \bar{s} \in \bar{S},\ \forall s_1 \in \bar{s},\ \forall s_2 \in \bar{s}:\quad \tilde{T}(s_1, \bar{a}) = \tilde{T}(s_2, \bar{a}) \ \Rightarrow\ \bar{\pi}^{*}(s_1) = \pi^{*}(s_1)                (15)
We first review the definition of T̃ introduced in equation 5:

    \tilde{T}_{\bar{s}}(s, a) = R(s, a)
      + \sum_{s' \in \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \tilde{T}_{\bar{s}}(s', a)\, dt
      + \sum_{s' \in \bar{s}',\ \bar{s}' \neq \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \bar{V}(\bar{s}')\, dt                (16)
This function describes the expected value of the T̂ samples used in the algorithm, assuming a large sample size. It is also closely related to the T function defined in equation 4; the two are identical except for the value used when the region defined by S or s̄ is exited. The T function used the value of a base level value function, V, whereas the T̃ function uses the value of the abstract level value function, V̄.

We also define functions Ṽ*_s̄(s) and Q̃*_s̄(s, a) to be similar to the normal V* and Q* functions within the set of states corresponding to s̄, but once the agent leaves s̄ it gets a one-time reward equal to the value of the abstract state it enters, V̄:

    \tilde{Q}^{*}_{\bar{s}}(s, a) = R(s, a)
      + \sum_{s' \in \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \tilde{V}^{*}_{\bar{s}}(s')\, dt
      + \sum_{s' \in \bar{s}',\ \bar{s}' \neq \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \bar{V}(\bar{s}')\, dt                (17)

    \tilde{V}^{*}_{\bar{s}}(s) = \max_{a} \tilde{Q}^{*}_{\bar{s}}(s, a)                (18)
Intuitively these functions give the value of acting optimally within s̄, assuming that the values of the base level states outside s̄ are fixed. We now have a spectrum of functions. At one end of the spectrum is the base Q* function, from which it is possible to extract the set of optimal policies for the original SMDP. Next in line is the Q̃* function, which is optimal within an abstract state given the values of the abstract states around it. Then we have the T̃ function, which can have different values across an abstract state, but assumes a constant action until a new abstract state is reached. Finally we have the abstract Q̄* function, which does not vary across the abstract state and gives us the optimal policy for the abstract SMDP.

The outline of the proof of optimality when splitting is complete is as follows. First, we introduce in Lemma 1 that T̃ really is the same as our estimates, T̂, for large enough sample sizes. We then show that, when splitting has stopped, the maximum over the actions of each of the functions in the spectrum mentioned in the previous paragraph is equal and is reached by the same set of actions. We also show that Q̄* ≤ Q*. This implies that an optimal policy in the abstract SMDP is also an optimal policy in the base SMDP. The proofs of Lemmas 1 and 3 are available in [8].
Lemma 1. The T̂ samples are an unbiased estimate of T̃. Formally,

    E_{\langle s, \bar{a}, s', r, t \rangle\ \mathrm{starting\ at}\ s}\big[\hat{T}_{\bar{s}}(s, \bar{a})\big] = \tilde{T}_{\bar{s}}(s, a)                (19)

Lemma 2. ∀s ∈ s̄, ∀a: Q̃*_s̄(s, a) ≥ T̃_s̄(s, a).

This is true by inspection. Equations 5 and 17 are reprinted here for reference:

    \tilde{T}_{\bar{s}}(s, a) = R(s, a)
      + \sum_{s' \in \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \tilde{T}_{\bar{s}}(s', a)\, dt
      + \sum_{s' \in \bar{s}',\ \bar{s}' \neq \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \bar{V}(\bar{s}')\, dt                (20)

    \tilde{Q}^{*}_{\bar{s}}(s, a) = R(s, a)
      + \sum_{s' \in \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \tilde{V}^{*}_{\bar{s}}(s')\, dt
      + \sum_{s' \in \bar{s}',\ \bar{s}' \neq \bar{s}} \int_{t=0}^{\infty} P_{s,a}(s', t)\, \gamma^{t}\, \bar{V}(\bar{s}')\, dt                (21)

Substituting Ṽ*_s̄(s) = max_a Q̃*_s̄(s, a) into equation 21 makes the two functions differ only in that Q̃* has a max where T̃ does not. Hence Q̃* ≥ T̃. q.e.d.
Lemma 3. If T̃_s̄ is constant across s̄ for all actions, then max_a T̃_s̄(·, a) = max_a Q̃*_s̄(·, a) and argmax_a T̃_s̄(·, a) = argmax_a Q̃*_s̄(·, a).

Lemma 4. If T̃_s̄ is constant across the abstract state s̄ for all actions then Q̄(s̄, a) = T̃_s̄(s, a) for all actions.

During the proof of Lemma 1 we show that

    \tilde{T}_{\bar{s}}(s, a) = \tilde{R}_{\bar{s}}(s, a) + \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} \tilde{P}_{s,a}(\bar{s}', t)\, \gamma^{t}\, \bar{V}(\bar{s}')\, dt

Now,

    \bar{Q}(\bar{s}, a) = \bar{R}(\bar{s}, a) + \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} \bar{P}_{\bar{s},a}(\bar{s}', t)\, \gamma^{t}\, \bar{V}^{\pi}(\bar{s}')\, dt                (22)
Substituting equations 8 and 9,

    \bar{Q}(\bar{s}, a) = E_{s \in \bar{s}}\big[\tilde{R}_{\bar{s}}(s, a)\big] + \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} E_{s \in \bar{s}}\big[\tilde{P}_{s,a}(\bar{s}', t)\big]\, \gamma^{t}\, \bar{V}^{\pi}(\bar{s}')\, dt                (23)

    = \frac{1}{|\bar{s}|} \sum_{s \in \bar{s}} \tilde{R}_{\bar{s}}(s, a) + \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} \frac{1}{|\bar{s}|} \sum_{s \in \bar{s}} \tilde{P}_{s,a}(\bar{s}', t)\, \gamma^{t}\, \bar{V}^{\pi}(\bar{s}')\, dt                (24)

    = \frac{1}{|\bar{s}|} \sum_{s \in \bar{s}} \tilde{R}_{\bar{s}}(s, a) + \frac{1}{|\bar{s}|} \sum_{s \in \bar{s}} \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} \tilde{P}_{s,a}(\bar{s}', t)\, \gamma^{t}\, \bar{V}^{\pi}(\bar{s}')\, dt                (25)

    = \frac{1}{|\bar{s}|} \sum_{s \in \bar{s}} \Big[ \tilde{R}_{\bar{s}}(s, a) + \int_{t=0}^{\infty} \sum_{\bar{s}' \in \bar{S}} \tilde{P}_{s,a}(\bar{s}', t)\, \gamma^{t}\, \bar{V}^{\pi}(\bar{s}')\, dt \Big]                (26)

    = E_{s \in \bar{s}}\big[\tilde{T}_{\bar{s}}(s, a)\big]                (27)
Given that T̃_s̄(s, a) is constant across s ∈ s̄, we have, for all s' ∈ s̄, E_{s∈s̄}[T̃_s̄(s, a)] = T̃_s̄(s', a). q.e.d.

Lemma 5. If T̃_s̄ is constant across the abstract state s̄ for all actions, and V̄(s̄') = V*(s') for all base states s' in all abstract states s̄', s̄' ≠ s̄, then V̄(s̄) = V*(s) in s̄.

Substituting V̄(s̄') = V*(s') for the other states into equation 17, we see that Q̃* = Q* for the current state, and so Ṽ* = V* for the current state. Also, argmax_a Q̃*(s, a) = argmax_a Q*(s, a), and so the policies implicit in these functions are also equal. Moreover, because T̃_s̄ is constant across the current abstract state, we know, by Lemma 4, that Q̄(s̄, a) = T̃_s̄(s, a). For the same reason we also know, by Lemma 3, that max_a T̃_s̄(s, a) = Ṽ*_s̄(s). Therefore

    \bar{Q}(\bar{s}, a) = \tilde{T}_{\bar{s}}(s, a)                (28)
    \therefore\ \bar{V}(\bar{s}) = \max_{a} \tilde{T}_{\bar{s}}(s, a)                (29)
    = \tilde{V}^{*}_{\bar{s}}(s)                (30)
    = V^{*}(s)                (31)

q.e.d.
Table 2. T, Q and V for the sample MDP

    Function      s1       s2
    Q(s, a1)      9        9
    T(s, a1)      9        9
    Q(s, a2)      108.1    -900
    T(s, a2)      -710     -900
    V(s)          108.1    9
Lemma 6. If T̃_s̄ is constant across each abstract state for each action, then setting V* = V̄ is a consistent solution to the Bellman equations of the base level SMDP.

This is most easily seen by contradiction. Assume we have a tabular representation of the base level value function. We will initialize this table with the values from V̄. We will further assume that T̃_s̄ is constant across each abstract state for each action, but that our table is not optimal, and show that this leads to a contradiction. As in Lemma 5, because T̃_s̄ is constant across the current abstract state, we know, by Lemma 4, that Q̄(s̄, a) = T̃_s̄(s, a). For the same reason we also know, by Lemma 3, that max_a T̃_s̄(s, a) = Ṽ*_s̄(s). This means that our table contains Ṽ*_s̄ for each abstract state. Hence, there is no single base level state that can have its value increased by a single Bellman update. Hence the table must be optimal. This optimal value function is achieved with the same actions in both the base and abstract SMDPs. Hence any optimal policy in one is an optimal policy in the other. q.e.d.
5.2 Splitting on Non-optimal Actions
We did not show above that the T̃ and Q* functions are equal for non-optimal actions. One might propose a simpler algorithm that only divides a state when T̃ is not uniform for the action with the highest value, rather than checking for uniformity across all the actions. Here is a counter-example showing this simplified algorithm does not converge.

Consider an MDP with three states, s1, s2 and s3. s3 is an absorbing state with zero value. States s1 and s2 are both part of a single abstract state; s3 is in a separate abstract state. There are two deterministic actions. a1 takes us from either state into s3 with a reward of 10. a2 takes us from s1 to s2 with a reward of 100, and from s2 to s3 with a reward of −1000. Table 2 shows the T̃ and Q* values for each state when γ = 0.9. Note that even though the T(s, a1) values are constant and higher than the T(s, a2) values, the optimal policy does not choose action a1 in both states.
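The counter-example is small enough to check mechanically. The Python sketch below uses the common convention that the immediate reward is undiscounted and the successor's value is discounted by γ, so the absolute numbers differ slightly from Table 2, but the qualitative point is unchanged: the trajectory values for a1 are uniform and dominant, yet the optimal base-level policy chooses a2 in s1.

    def counterexample_values(gamma=0.9):
        """Q* and trajectory values T for the three-state MDP of Section 5.2,
        computed under the stated (assumed) reward-timing convention."""
        Q = {}
        Q[('s2', 'a1')] = 10.0                        # s2 -> s3, reward 10, s3 worth 0
        Q[('s2', 'a2')] = -1000.0                     # s2 -> s3, reward -1000
        V_s2 = max(Q[('s2', 'a1')], Q[('s2', 'a2')])
        Q[('s1', 'a1')] = 10.0                        # s1 -> s3, reward 10
        Q[('s1', 'a2')] = 100.0 + gamma * V_s2        # s1 -> s2, reward 100
        # Trajectory values: keep taking the same action until the abstract state
        # {s1, s2} is left; the abstract successor (s3's abstract state) is worth 0.
        T = {}
        T[('s1', 'a1')] = 10.0
        T[('s2', 'a1')] = 10.0
        T[('s2', 'a2')] = -1000.0
        T[('s1', 'a2')] = 100.0 + gamma * T[('s2', 'a2')]
        return Q, T

    Q, T = counterexample_values()
    # T(., a1) is uniform across {s1, s2} and is the highest-valued action there, so
    # the simplified criterion would never split, yet the optimal action in s1 is a2.
    assert T[('s1', 'a1')] == T[('s2', 'a1')]
    assert Q[('s1', 'a2')] > Q[('s1', 'a1')]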
6 Empirical Results
We evaluated TTree in a number of domains. For each domain the experimental setup was similar. We compared mainly against the Prioritized Sweeping algorithm [13]. The reason for this is that, in the domains tested, Continuous U Tree was ineffective as the domains do not have much scope for normal state abstraction. It is important to note that Prioritized Sweeping is a certainty equivalence algorithm. This means that it builds an internal model of the state space from its experience in the world, and then solves that model to find its policy. The model is built without any state or temporal abstraction and so tends to be large, but, aside from the lack of abstraction, it makes very efficient use of the transitions sampled from the environment. The experimental procedure was as follows. There were 15 learning trials. During each trial, each algorithm was tested in a series of epochs. At the start of their trials, Prioritized Sweeping had its value function initialized optimistically at 500, and TTree was reset to a single leaf. At each time step Prioritized Sweeping performed 5 value function updates. At the start of each epoch the world was set to a random state. The algorithm being tested was then given control of the agent. The epoch ended after 1000 steps were taken, or if an absorbing state was reached. At that point the algorithm was informed that the epoch was over. TTree then used its generative model to sample trajectories, introduce one split, sample more trajectories to build the abstract transition function, and update its abstract value function and find a policy. Prioritized Sweeping used its certainty equivalence model to update its value function and find a policy. Having updated its policy, the algorithm being tested was then started at 20 randomly selected start points and the discounted reward summed for 1000 steps from each of those start points. This was used to estimate the expected discounted reward for each agent’s current policy. These trajectories were not used for learning by either algorithm. An entry was then recorded in the log with the number of milliseconds spent by the agent so far this trial (not including the 20 test trajectories), the total number of samples taken by the agent so far this trial (both in the world and from the generative model), the size of the agent’s model, and the expected discounted reward measured at the end of the epoch. For Prioritized Sweeping, the size of the model was the number of visited state/action pairs divided by the number of actions. For TTree the size of the model was the number of leaves in the tree. The trial lasted until each agent had sampled a fixed number of transitions (which varied by domain). The data was graphed as follows. We have two plots in each domain. The first has the number of transitions sampled from the world on the x-axis and the expected reward on the y-axis. The second has time taken by the algorithm on the x-axis and expected reward on the y-axis. Some domains have a third graph showing the number of transitions sampled on the x-axis and the size of the model on the y-axis. For each of the 15 trials there was a log file with an entry recorded at the end of each epoch. However, the number of samples taken in an epoch varies, making it impossible to simply average the 15 trials. Our solution was to connect each
consecutive sample point within each trial to form a piecewise-linear curve for that trial. We then selected an evenly spaced set of sample points, and took the mean and standard deviation of the 15 piecewise-linear curves at each sample point. We stopped sampling when any of the log files was finished (when sampling with time on the x-axis, the log files are different lengths).
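A minimal sketch of this averaging procedure, assuming NumPy and our own function names, is:

    import numpy as np

    def average_learning_curves(trials, n_points=100):
        """Average learning-curve logs whose sample positions differ across trials.

        trials : list of (x, y) pairs of 1-D arrays, one pair per trial, where x is
                 the cumulative number of samples (or time) at each epoch and y is
                 the expected discounted reward recorded at that epoch.
        Returns evenly spaced positions together with the mean and standard
        deviation of the piecewise-linear interpolations of all trials there.
        """
        x_min = max(x[0] for x, _ in trials)    # start where every log has data
        x_max = min(x[-1] for x, _ in trials)   # stop when the shortest log ends
        grid = np.linspace(x_min, x_max, n_points)
        # np.interp connects consecutive log entries, i.e. a piecewise-linear curve.
        curves = np.array([np.interp(grid, x, y) for x, y in trials])
        return grid, curves.mean(axis=0), curves.std(axis=0)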
6.1 Towers of Hanoi
The Towers of Hanoi domain is well known in the classical planning literature for the hierarchical structure of the solution; temporal abstraction should work well. This domain consists of 3 pegs, on which sit N disks. Each disk is of a different size and they stack such that smaller disks always sit above larger disks. There are six actions which move the top disk on one peg to the top of one of the other pegs. An illegal action, trying to move a larger disk on top of a smaller disk, results in no change in the world. The object is to move all the disks to a specified peg; a reward of 100 is received in this state. All base level actions take one time step, with γ = 0.99. The decomposed representation we used has a boolean variable for each disk/peg pair. These variables are true if the disk is on the peg.

The Towers of Hanoi domain had size N = 8. We used a discount factor γ = 0.99. TTree was given policies for the three N = 7 problems; the complete set of abstract actions is shown in Table 3. The TTree constants were Na = 20, Nl = 20, Nt = 1 and MAXSTEPS = 400. Prioritized Sweeping used Boltzmann exploration with carefully tuned parameters (γ was also tuned to help Prioritized Sweeping). The tuning of the parameters for Prioritized Sweeping took significantly longer than for TTree.

Figure 1 shows a comparison of Prioritized Sweeping and TTree. In Figure 1b the TTree data finishes significantly earlier than the Prioritized Sweeping data; TTree takes significantly less time per sample. Continuous U Tree results are not shown as that algorithm was unable to solve the problem. The problem has 24 state dimensions and Continuous U Tree was unable to find an initial split.

We also tested Continuous U Tree and TTree on smaller Towers of Hanoi problems without additional macros. TTree with only the generated abstract actions was able to solve more problems than Continuous U Tree. We attribute this to the fact that the Towers of Hanoi domain is particularly bad for U Tree style state abstraction. In U Tree the same action is always chosen in a leaf. However, it is never legal to perform the same action twice in a row in Towers of Hanoi. TTree is able to solve these problems because the automatically generated random abstract action allows it to gather more useful data than Continuous U Tree. In addition, the transition function of the abstract SMDP formed by TTree is closer to what the agent actually sees in the real world than the transition function of the abstract SMDP formed by Continuous U Tree. TTree samples the transition function assuming it might take a number of steps to leave the abstract state. Continuous U Tree assumes that it leaves the abstract state in one step. This makes TTree a better anytime algorithm.
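To illustrate the decomposed representation (one boolean per disk/peg pair, giving the 24 state dimensions mentioned above for N = 8 and three pegs), here is a small hypothetical encoding in Python; the convention that a smaller index denotes a smaller disk is our assumption.

    def hanoi_features(disk_on_peg, n_pegs=3):
        """One boolean per disk/peg pair, true iff that disk is on that peg."""
        return {(d, p): (peg == p)
                for d, peg in enumerate(disk_on_peg) for p in range(n_pegs)}

    def move_is_legal(disk_on_peg, src, dst):
        """A move takes the top (smallest) disk on peg src to peg dst; it is illegal
        if a smaller disk already sits on dst (illegal moves leave the state unchanged)."""
        src_disks = [d for d, peg in enumerate(disk_on_peg) if peg == src]
        if not src_disks:
            return False
        moving = min(src_disks)                          # smallest disk on src
        dst_disks = [d for d, peg in enumerate(disk_on_peg) if peg == dst]
        return all(moving < d for d in dst_disks)        # may only sit on larger disks

    # Example: all eight disks on peg 0 gives 24 boolean features.
    features = hanoi_features([0] * 8)
    assert len(features) == 24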
Table 3. Actions in the Towers of Hanoi domain

    Base Level Actions
    Action   Move Disc From Peg   To Peg
    a0       P0                   P1
    a1       P0                   P2
    a2       P1                   P2
    a3       P1                   P0
    a4       P2                   P1
    a5       P2                   P0

    A Set of Abstract Actions
    Action   Effect
    Generated abstract actions
    ā0       Perform action a0 in all states
    ā1       Perform action a1 in all states
    ā2       Perform action a2 in all states
    ā3       Perform action a3 in all states
    ā4       Perform action a4 in all states
    ā5       Perform action a5 in all states
    ār       Choose uniformly from {a0, ..., a5} in all states
    Supplied abstract actions
    ā7P0     If the 7 disc stack is on P0 then choose uniformly from {a0, ..., a5}, otherwise follow the policy that moves the 7 disc stack to P0.
    ā7P1     If the 7 disc stack is on P1 then choose uniformly from {a0, ..., a5}, otherwise follow the policy that moves the 7 disc stack to P1.
    ā7P2     If the 7 disc stack is on P2 then choose uniformly from {a0, ..., a5}, otherwise follow the policy that moves the 7 disc stack to P2.
Fig. 1. Results from the Towers of Hanoi domain. (a) A plot of Expected Reward vs. Number of Sample transitions taken from the world. (b) Data from the same log plotted against time instead of the number of samples
6.2 The Rooms Domains
This domain simulates a two legged robot walking through a maze. The two legs are designated left and right. With a few restrictions, each of these legs can be raised and lowered one unit, and the raised foot can be moved one unit in each of the four compass directions: north, south, east and west. The legs are restricted in movement so that they are not both in the air at the same time. They are
also restricted to not be diagonally separated, e.g., the right leg can be either east or north of the left leg, but it cannot be both east and north of the left leg.

More formally, we represent the position of the robot using the two dimensional coordinates of the right foot, x, y. We then represent the pose of the robot's legs by storing the three dimensional position of the left foot relative to the right foot, ∆x, ∆y, ∆z. We represent East on the +x axis, North on the +y axis and up on the +z axis. The formal restrictions on movement are that ∆x and ∆y cannot both be non-zero at the same time and that each of ∆x, ∆y and ∆z are in the set {−1, 0, 1}. A subset of the state space is shown diagrammatically in Figures 2 and 3. These figures do not show the entire global state space and also ignore the existence of walls.

The robot walks through a grid with a simple maze imposed on it. The mazes have the effect of blocking some of the available actions: any action that would result in the robot having its feet on either side of a maze wall fails. Any action that would result in an illegal leg configuration fails and gives the robot a reward of −1. Upon reaching the grey square in the maze the robot receives a reward of 100.

In our current implementation of TTree we do not handle ordered discrete attributes such as the global maze coordinates, x and y. In these cases we transform each of the ordered discrete attributes into a set of binary attributes. There is one binary attribute for each ordered discrete attribute/value pair describing if the attribute is less than the value. For example, we replace the x attribute with a series of binary attributes of the form: {x < 1, x < 2, ..., x < 9}. The y attribute is transformed similarly.

In addition to the mazes above, we use the 'gridworlds' shown in Figure 4 for experiments. It should be remembered that the agent has to walk through these grids. Unless stated otherwise in the experiments we have a reward of 100 in the bottom right square of the gridworld.

When solving the smaller of the two worlds, shown in Figure 4 (a), TTree was given abstract actions that walk in the four cardinal directions: north, south, east and west. These are the same actions described in the introduction; see Table 4. The various constants were γ = 0.99, Na = 40, Nl = 40, Nt = 2 and MAXSTEPS = 150. Additionally, the random abstract action was not useful in this domain, so it was removed. The other generated abstract actions, one for each base level action, remained. The results for the small rooms domain are shown in Figure 5.

When solving the larger world, shown in Figure 4 (b), we gave the agent three additional abstract actions above what was used when solving the smaller world. The first of these was a 'stagger' abstract action, shown in Table 5. This abstract action is related to both the random abstract action and the walking actions: it takes full steps, but each step is in a random direction. This improves the exploration of the domain. The other two abstract actions move the agent through all the rooms. One moves the agent clockwise through the world and the other counter-clockwise. The policy for the clockwise abstract action is shown in Figure 6. The counter-clockwise abstract action is similar, but follows a path in the other direction around the central walls.
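The attribute transformation and the leg-configuration restrictions can be sketched as follows; the feature names and grid bounds are illustrative assumptions rather than the authors' code.

    def rooms_state_features(x, y, dx, dy, dz, width=10, height=10):
        """Expand the ordered global coordinates into the binary "less-than"
        attributes described above (e.g. x becomes {x<1, x<2, ..., x<9}), while
        the relative foot offsets are kept as ordinary attributes."""
        features = {f"x<{t}": x < t for t in range(1, width)}
        features.update({f"y<{t}": y < t for t in range(1, height)})
        features.update({"dx": dx, "dy": dy, "dz": dz})
        return features

    def pose_is_legal(dx, dy, dz):
        """Each offset must lie in {-1, 0, 1}, and the feet may not be diagonally
        separated, i.e. dx and dy may not both be non-zero."""
        in_range = all(v in (-1, 0, 1) for v in (dx, dy, dz))
        return in_range and not (dx != 0 and dy != 0)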
Fig. 2. The local transition diagram for the walking robot domain without walls. This shows the positions of the feet relative to each other. Solid arrows represent transitions possible without a change in global location. Dashed arrows represent transitions possible with a change in global location. The different states are shown in two different coordinate systems. The top coordinate system shows the positions of each foot relative to the ground at the global position of the robot. The bottom coordinate system shows the position of the left foot relative to the right foot
The results for this larger domain are shown in Figure 7. The various constants were γ = 0.99, Na = 40, Nl = 40, Nt = 1 and MAXSTEPS = 250. Additionally the coefficient on the policy code length in the MDL coding was modified to be 10 instead of 20.
Fig. 3. A subset of the global transition diagram for the walking robot domain. Each of the sets of solid lines is a copy of the local transition diagram shown in Figure 2. As in that figure, solid arrows represent transitions that do not change global location and dashed arrows represent transitions that do change global location
Fig. 4. (a) A set of four 10 × 10 rooms for our robot to walk through; (b) A set of sixteen 10 × 10 rooms for our robot to walk through
Table 4. The policy for walking north when starting with both feet together. (a) shows the policy in tree form, (b) shows the policy in diagram form. Note: only the ∆z-∆y plane of the policy is shown, as that is all that is required when starting to walk with your feet together.

    (a)
    if ∆z = 0 then {both feet on the ground}
      if ∆y > 0 then {left foot north of right foot}
        raise the right foot
      else
        raise the left foot
      end if
    else if ∆z = 1 then {the left foot is in the air}
      if ∆y > 0 then {left foot north of right foot}
        lower the left foot
      else
        move the raised foot north one unit
      end if
    else {the right foot is in the air}
      if ∆y < 0 then {right foot north of left foot}
        lower the right foot
      else
        move the raised foot north one unit
      end if
    end if
Fig. 5. Results from the walking robot domain with the four room world. (a) A plot of expected reward vs. number of transitions sampled. (b) Data from the same log plotted against time instead of the number of samples
Table 5. The 'stagger' policy for taking full steps in random directions

    if ∆z < 0 then {the right foot is in the air}
      if ∆x < 0 then {left foot west of right foot}
        move the raised foot one unit west
      else if ∆x = 0 then {right foot is same distance east/west as left foot}
        if ∆y < 0 then {left foot south of right foot}
          move the raised foot one unit south
        else if ∆y = 0 then {left foot is same distance north/south as right foot}
          lower the right foot
        else {left foot north of right foot}
          move the raised foot one unit north
        end if
      else {left foot east of right foot}
        move the raised foot one unit east
      end if
    else if ∆z = 0 then {both feet are on the ground}
      if ∆x = 0 and ∆y = 0 then {the feet are together}
        raise the left foot
      else
        raise the right foot
      end if
    else {the left foot is in the air}
      if ∆x = 0 and ∆y = 0 then {the left foot is directly above the right foot}
        move the raised foot north, south, east or west with equal probability
      else
        lower the left foot
      end if
    end if
Fig. 6. The clockwise tour abstract action. This is a policy over the rooms shown in Figure 4 (b)
Fig. 7. Results from the walking robot domain with the sixteen room world. (a) A plot of Expected Reward vs. Number of Sample transitions taken from the world. (b) Data from the same log plotted against time instead of the number of samples
6.3 Discussion
There are a number of points to note about the TTree algorithm. Firstly, it generally takes TTree significantly more data than Prioritized Sweeping to converge, although TTree performs well long before convergence. This is unsurprising. Prioritized Sweeping is remembering all it sees, whereas TTree is throwing out all trajectories in a leaf when that leaf is split. For example, all data gathered before the first split is discarded after the first split. However, TTree is significantly faster than Prioritized Sweeping in real time in large domains (see Figures 1b, 5b and 7b). It performs significantly less processing on each data point as it is gathered, and this speeds up the algorithm. It also generalizes across large regions of the state space. Figure 8 shows the sizes of the data structures stored by the two algorithms. Note that the y-axis is logarithmic. TTree does not do so well in small domains like the taxi domain [6]. Given this generalization, it is important to note why we did not compare to other state abstraction algorithms. The reason is that other state abstraction algorithms do not have a temporal abstraction component and so cannot generalize across those large regions. For example, Continuous U Tree performs very poorly on these problems.

The next point we would like to make is that the abstract actions help TTree avoid negative rewards even when it has not found the positive reward yet. In the walking robot domain, the agent is given a small negative reward for attempting to move its legs in an illegal manner. TTree notices that all the trajectories using the generated abstract actions receive these negative rewards, but that the supplied abstract actions do not. It chooses to use the supplied abstract actions and hence avoid these negative rewards. This is evident in Figure 5 where TTree's expected reward is never below zero.

The large walking domain shows a capability of TTree that we have not emphasized yet.
Fig. 8. Plots of the number of states seen by Prioritized Sweeping and the number of abstract states in the TTree model vs. number of samples gathered from the world. The domains tested were (a) the Towers of Hanoi domain, and (b) the walking robot domain with the sixteen room world. Note that the y-axis is logarithmic
TTree was designed with abstract actions like the walking actions in mind, where the algorithm has to choose the regions in which to use each abstract action, and it uses the whole abstract action. However, TTree can also choose to use only part of an abstract action. In the large walking domain, we supplied two additional abstract actions which walk in a large loop through all the rooms. One of these abstract actions is shown in Figure 6. The other is similar, but loops through the rooms in the other direction. To see how TTree uses these 'loop' abstract actions, Table 6 shows a small part of a tree seen while running experiments in the large walking domain. In the particular experiment that created this tree, there was a small, −0.1, penalty for walking into walls. This induces TTree to use the abstract actions to walk around walls, at the expense of more complexity breaking out of the loop to reach the goal. The policy represented by this tree is interesting as it shows that the algorithm is using part of each of the abstract actions rather than the whole of either abstract action. The abstract actions are only used in those regions where they are useful, even if that is only part of the abstract action. This tree fragment also shows that TTree has introduced some non-optimal splits. If the values 78 and 68 were replaced by 79 and 70 respectively then the final tree would be smaller (the value 79 comes from the need to separate the last column to separate the reward; the value 70 lines up with the edge of the end rooms). As TTree chooses its splits based on sampling, it sometimes makes less than optimal splits early in tree growth. The introduction of splits causes TTree to increase its sample density in the region just divided. This allows TTree to introduce further splits to achieve the desired division of the state space.

The note above about adding a small penalty for running into walls in order to induce TTree to use the supplied abstract actions deserves further comment. The Taxi domain [15] has a penalty of −10 for misusing the pick up and put down actions. It has a reward of 20 for successfully delivering the passenger.
Table 6. Part of the policy tree during the learning of a solution for the large rooms domain in Figure 4 (b)

    if x < 78 then
      if x < 68 then
        if y < 10 then
          perform the loop counter-clockwise abstract action
        else
          perform the loop clockwise abstract action
        end if
      else
        {Rest of tree removed for space}
      end if
    else
      {Rest of tree removed for space}
    end if
We found TTree had some difficulty with this setup. The macros we supplied chose randomly between the pick up and put down actions when the taxi was at the appropriate taxi stand. While this gives a net positive reward for the final move (with an expected reward of 10), it gives a negative expected reward when going to pick up the passenger. This makes the abstract action a bad choice on average. Raising the final reward makes the utility of the abstract actions positive and helps solve the problem.

When running our preliminary experiments in the larger walking domain, we noticed that sometimes TTree was unable to find the reward. This did not happen in the other domains we tested. In the other domains there were either abstract actions that moved the agent directly to the reward, or the random abstract action was discovering the reward. In the walking domain the random abstract action is largely ineffective. The walking motion is too complex for the random action to effectively explore the space. The abstract actions that walk in each of the four compass directions will only discover the reward if they are directly in line with that reward without an intervening wall. Unless the number of sample points is made very large, this is unlikely.

Our solution was to supply extra abstract actions whose goal was not to be used in the final policy, but rather to explore the space. In contrast to the description of McGovern [16], where macros are used to move the agent through bottlenecks and hence move the agent to another tightly connected component of the state space, we use these exploration abstract actions to make sure we have fully explored the current connected component. We use these 'exploration' abstract actions to explore within a room rather than to move between rooms. An example of this type of exploratory abstract action is the 'stagger' abstract action shown in Table 5. We also implemented another abstract action that walked the agent through a looping search pattern in each room. This search pattern covered every space in the room, and was replicated for each room. The stagger policy turned out to be enough to find the reward in the large walking
domain and it was significantly less domain specific than the full search, so it was used to generate the results above.
7 Conclusion
We have introduced the TTree algorithm for finding policies for Semi-Markov Decision Problems. This algorithm uses both state and temporal abstraction to help solve the supplied SMDP. Unlike previous temporal abstraction algorithms, TTree does not require termination criteria on its abstract actions. This allows it to piece together solutions to previous problems to solve new problems. We have supplied both a proof of correctness and empirical evidence of the effectiveness of the TTree algorithm.
References

1. Puterman, M.L.: Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley Series in Probability and Mathematical Statistics, Applied Probability and Statistics Section. John Wiley & Sons, New York (1994)
2. Chapman, D., Kaelbling, L.P.: Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In: Proceedings of the Twelfth International Joint Conference on Artificial Intelligence (IJCAI-91), Sydney, Australia (1991) 726-731
3. Uther, W.T.B., Veloso, M.M.: Tree based discretization for continuous state space reinforcement learning. In: Proceedings of the Fifteenth National Conference on Artificial Intelligence (AAAI-98), Madison, WI (1998) 769-774
4. Munos, R., Moore, A.W.: Variable resolution discretization for high-accuracy solutions of optimal control problems. In: Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence (IJCAI-99) (1999)
5. Sutton, R.S., Precup, D., Singh, S.: Intra-option learning about temporally abstract actions. In: Machine Learning: Proceedings of the Fifteenth International Conference (ICML98), Madison, WI, Morgan Kaufmann (1998) 556-564
6. Dietterich, T.G.: The MAXQ method for hierarchical reinforcement learning. In: Machine Learning: Proceedings of the Fifteenth International Conference (ICML98), Madison, WI, Morgan Kaufmann (1998) 118-126
7. Parr, R.S., Russell, S.: Reinforcement learning with hierarchies of machines. In: Neural and Information Processing Systems (NIPS-98), Volume 10, MIT Press (1998)
8. Uther, W.T.B.: Tree Based Hierarchical Reinforcement Learning. PhD thesis, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA (2002)
9. Hengst, B.: Discovering hierarchy in reinforcement learning with HEXQ. In: International Conference on Machine Learning (ICML02) (2002)
10. Baird, L.C.: Residual algorithms: Reinforcement learning with function approximation. In Prieditis, A., Russell, S., eds.: Machine Learning: Proceedings of the Twelfth International Conference (ICML95), San Mateo, Morgan Kaufmann (1995) 30-37
11. Knoblock, C.A.: Automatically Generating Abstractions for Problem Solving. PhD thesis, Computer Science Department, Carnegie Mellon University, Pittsburgh, PA (1991)
12. Ng, A.Y., Jordan, M.: PEGASUS: A policy search method for large MDPs and POMDPs. In: Uncertainty in Artificial Intelligence, Proceedings of the Sixteenth Conference (2000)
13. Moore, A.W., Atkeson, C.G.: Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning 13 (1993)
14. Strens, M., Moore, A.: Direct policy search using paired statistical tests. In: International Conference on Machine Learning (ICML 2001) (2001)
15. Dietterich, T.G.: Hierarchical reinforcement learning with the MAXQ value function decomposition. Journal of Artificial Intelligence Research 13 (2000) 227-303
16. McGovern, A.: Autonomous Discovery of Temporal Abstractions from Interaction with an Environment. PhD thesis, Department of Computer Science, University of Massachusetts, Amherst, Massachusetts (2002)
Using Landscape Theory to Measure Learning Difficulty for Adaptive Agents

Christopher H. Brooks¹ and Edmund H. Durfee²

¹ Computer Science Department, University of San Francisco, 2130 Fulton St., San Francisco, CA 94118, [email protected]
² EECS Department, University of Michigan, 1101 Beal Ave., Ann Arbor, MI 48109-2110, [email protected]
Abstract. In many real-world settings, particularly economic settings, an adaptive agent is interested in maximizing its cumulative reward. This may require a choice between different problems to learn, where the agent must trade optimal reward against learning difficulty. A landscape is one way of representing a learning problem, where highly rugged landscapes represent difficult problems. However, ruggedness is not directly measurable. Instead, a proxy is needed. We compare the usefulness of three different metrics for estimating ruggedness on learning problems in an information economy domain. We empirically evaluate the ability of each metric to predict ruggedness and use these metrics to explain past results showing that problems that yield equal reward when completely learned yield different profits to an adaptive learning agent.
1 Introduction
In many problems, such as learning in an economic context, an adaptive agent that is attempting to learn how to act in a complex environment is interested in maximizing its cumulative payoff; that is, optimizing its performance over time. In such a case, the agent must make a tradeoff between the long-term value of information gained through learning and the short-term cost incurred in gathering information about the world. This tradeoff is typically referred to in the machine learning literature as the exploration-exploitation tradeoff [9]. If an agent can estimate the amount of learning needed to produce an improvement in performance, it can then decide whether to learn or, more generally, what it should learn. However, making this estimate requires that an adaptive agent know something about the relative difficulty of the problems it can choose to learn. In this paper, we demonstrate how metrics from landscape theory can be applied to a particular agent learning problem, namely that of an agent learning
the prices of information goods. A landscape is a way of representing the relative quality of solutions that lie near each other within some topology. We begin by describing our past work on the problem and appeal to a pictorial description to explain these results. Following this, we provide some background on landscapes and metrics for assessing their ruggedness, or difficulty. We then empirically evaluate two metrics, distribution of optima and autocorrelation, and show how these metrics can explain our previous results. We conclude by summarizing and discussing opportunities for future work.
2 Summarizing Price Schedule Learning Performance
In our previous work [1], we studied the problem of an adaptive agent selling information goods to an unknown consumer population. This agent acted as a monopolist and was interested in maximizing its cumulative profit. We assumed that the learning algorithm (amoeba [8], a direct search method) was a fixed feature of the agent. The adaptive agent's decision problem involved selecting a particular price schedule to learn, where this schedule served as an approximate model of consumer preferences. These schedules are summarized in Table 1.

Table 1. This table presents the parameters of six pricing schedules, ordered in terms of increasing complexity. More complex schedules allow a producer to capture a greater fraction of potential consumer surplus by fitting demand more precisely, but require longer to learn, since they have more parameters.

    Pricing Schedule    Parameters    Description
    Pure Bundling       b             Consumers pay a fixed price b for access to all N articles.
    Linear Pricing      p             Consumers pay a fixed price p for each article purchased.
    Two-part Tariff     f, p          Consumers pay a subscription fee f, along with a fixed price p for each article.
    Mixed Bundling      b, p          Consumers have a choice between a per-article price p and a bundle price b.
    Block Pricing       p1, p2, m     Consumers pay a price p1 for the first m articles (m < N), and a price p2 for remaining articles.
    Nonlinear Pricing   p1, ..., pN   Consumers pay a different price pi for each article i.
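As a rough illustration of what these schedules mean for an individual consumer, the sketch below computes the total payment for buying k articles under each schedule, following the descriptions in Table 1. The handling of k = 0 and the assumption that a mixed-bundling consumer takes the cheaper of the two options are ours.

    def schedule_cost(schedule, params, k):
        """Total payment by a consumer who buys k articles under one of the six
        pricing schedules of Table 1 (a sketch, not the authors' model)."""
        if k == 0:
            return 0.0
        if schedule == "pure_bundling":
            return params["b"]
        if schedule == "linear":
            return params["p"] * k
        if schedule == "two_part_tariff":
            return params["f"] + params["p"] * k
        if schedule == "mixed_bundling":
            return min(params["p"] * k, params["b"])     # cheaper of per-article and bundle
        if schedule == "block":
            first = min(k, params["m"])
            return params["p1"] * first + params["p2"] * (k - first)
        if schedule == "nonlinear":
            return sum(params["prices"][:k])             # a distinct price per article
        raise ValueError(f"unknown schedule: {schedule}")

    # Example: a two-part tariff with a 5.0 subscription fee and 0.5 per article.
    assert schedule_cost("two_part_tariff", {"f": 5.0, "p": 0.5}, k=10) == 10.0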
We found that simple schedules were learned more easily, but yielded lower profit per period once learned. More complex schedules took longer to learn, but yielded higher profits per period after learning. We ran experiments comparing the performance of six different pricing schedules (a sample is shown in Figure 1) and found that moderately complex two-parameter schedules tended to perform
best in the short to medium-run, where learning is most important. In addition, the relative performance of the different schedules changes as the total number of articles (N ) that the producer could offer was varied.
Fig. 1. Learning curves for six price schedules when a monopolist offers N=10 (above) and N=100 (below) articles. The schedules are: Linear pricing, where each article has the same price p; pure bundling, where a consumer pays b for access to all articles, two-part tariff, where a consumer pays a subscription fee f , plus a per-article price p for each article purchased, mixed bundling, where a consumer can choose between the per-article price p and the bundle price b, block pricing, where the consumer pays p1 for each of the first i articles purchased and p2 for each remaining article, and nonlinear pricing, where the consumer pays a different price pi for each article purchased. The x axis is number of iterations (log scale) and the y axis is average cumulative profit per article, per customer.
An agent that had these curves, which we call learning profiles, could then apply decision theory to determine which schedule to select at every iteration. The details of this are discussed in our previous work [1]; essentially, an agent must compare the expected cumulative profits gained from each schedule and select the one that yields the highest profits.

There are two problems with this approach. The experiments above do not explain why one schedule outperforms another, or why the relative performance of the schedules changes as the number of articles is varied. For example, two-part tariff and mixed bundling yield the same profits under perfect information, yet learning producers accrued higher profits per article with two-part tariff than with mixed bundling when N = 100. This leads us to ask both why these schedules have different learning profiles and why mixed bundling's performance depends upon N, whereas two-part tariff's performance seems not to. One way of explaining this is through an appeal to pictorial representations, such as Figure 3, where we see that two-part tariff has a single hill, whereas mixed bundling has a large plateau. As N increases, the size of this plateau grows, and so a large 'flat area' in the landscape is introduced, thereby thwarting an adaptive agent that employs a hill-climbing method.
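As a small illustration of that selection step, the sketch below picks the schedule whose learning profile yields the highest expected cumulative profit over a given horizon; the names and data layout are our assumptions, and the full decision-theoretic treatment is in [1].

    import numpy as np

    def select_schedule(learning_profiles, horizon):
        """learning_profiles maps schedule name to a 1-D array of expected profit
        per iteration as a function of learning iteration.  Returns the schedule
        with the highest expected cumulative profit over `horizon` iterations."""
        cumulative = {name: float(np.sum(profile[:horizon]))
                      for name, profile in learning_profiles.items()}
        return max(cumulative, key=cumulative.get)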
Fig. 2. Linear pricing landscapes for a small (10) and large (100) N and C.
Another complication is the number of consumers C in the population. For small values of C, performance on most schedules is lower (per consumer) than for large values of C. The conjectural argument is that large values of C tend to "smooth out" the landscape by producing a more uniform distribution of consumer preferences. Figure 2 shows an example of this for the one-dimensional linear pricing problem as N and C are varied.
Fig. 3. Two-part tariff (left) and mixed bundling (right) landscapes. Even though they have the same number of parameters and the same optimal profit, their landscapes are very different.
This type of pictorial argument is helpful as a visualization aid, but it is not particularly rigorous, and cannot be easily applied to functions with more than two inputs. A more precise measure of why these problems are different is needed.

Using the learning profile to determine a problem's complexity is also a problem in the many cases where an adaptive agent does not have this complete learning profile. Instead, it might have some sample problems, and need to use these to compare learning problems directly. In order to estimate the difficulty of a learning problem when the learning profile is either uninformative or not available, we draw on results from landscape theory.

Much of the recent work on landscape theory has taken place within the context of the theoretical study of genetic algorithms (GAs). In this case, the problem is to construct landscapes of varying difficulty to serve as benchmarks for a particular GA. Our problem is different; we assume that the landscape (the learning problem, or the mapping from inputs of the price schedule to profits) is determined by an external process (such as the composition of the consumer population) and our job is to characterize how hard it is. Rather than generating a landscape with particular features and claiming that these features make it difficult, we want to characterize the features of existing landscapes and identify sets of features that make adaptation difficult. In the following section, we provide a context for this work, identifying some key results concerning landscape theory.
3 A Review of Landscape Theory
The concept of a landscape is a familiar one in evolutionary biology, optimization, and artificial intelligence. Figure 3 appeals pictorially to the concept. A landscape is visualized as a surface with one or more peaks, where each point on the landscape corresponds to a solution. Optimizing a function is cast as locating the highest peak. This idea is simple, yet extremely powerful. It allows a wide range of seemingly dissimilar problems to be cast in a common framework. In
particular, the selection of a set of price schedule parameters that maximizes profit is equivalent to finding the global peak of a profit landscape.

The primary distinction that is made is between those landscapes that are smooth and those that are rugged. A smooth landscape is one that is easy to ascend; the optima can be located without much effort, and there are typically few local optima. A rugged landscape is one that contains discontinuities, many local optima, and other features that make it difficult for a local search algorithm (that is, one that is not able to see the entire landscape at once) to find the global optimum.

The notion of a rugged landscape has received a great deal of attention in the complex systems community. Kauffman [7] was one of the first researchers to describe a landscape's mathematical properties with respect to a search algorithm. (The concept was originally proposed by Sewall Wright [11] as a model for explaining natural selection.) Hordjik [3] and Jones [4], among others, tighten up Kauffman's concepts and apply more mathematical rigor.

A landscape consists of two components: an objective function F, and a neighborhood relation R that indicates the elements of the domain of F that are adjacent to each other. F is the function that an agent is interested in optimizing. The input of F is indicated by the vector x, where x can contain numeric (either real-valued or integer) elements or symbolic elements. Since this is an optimization problem, F maps into the reals; the goal of the problem is to find an x that maximizes F. In our price-setting problems, x is the parameters of the price schedule, and F(x) is the resultant profit.

R is a neighborhood relation that, for any x, returns those elements that are adjacent to x. This provides a topology over F, and allows us to describe it as a surface. The choice of neighborhood relation may be exogenously determined by the input variables, or it may be endogenously determined by the user, depending upon the domain. If one is optimizing a price schedule and the inputs of x are the parameters of that schedule, it is natural to define R(x) to be the schedule one gets when one parameter in the schedule is increased or decreased by a set amount. In other problems, such as the traveling salesman problem, the problem may be encoded in a number of different ways, leading to different R relations and, subsequently, different landscapes that may be easier or harder to search. We treat the neighborhood relation as exogenously given, since we are searching over pricing parameters defined on either the reals or the integers, which have natural neighborhood relations.

Jones [4] presents a slightly different formulation of the R relation which depends upon the algorithm being used to traverse the landscape. Essentially, Jones's R is the successor relationship generated by a search algorithm; for a given state, R gives all the states that can be reached in one step for a particular algorithm. This formulation works well for Jones's purposes, which involve developing a theory for genetic algorithms, but it makes it difficult to compare two different landscapes and ask whether one is intrinsically easier or harder. The distinction is that Jones couples the neighborhood relationship explicitly
to the particulars of the search algorithm being used, whereas we assume that there are landscapes which have a natural neighborhood relationship. For example, prices occur on the real line, making a neighborhood relationship based on adjacency on the real line a natural choice. Since some price schedules induce profit landscapes that appear to be intrinsically easier than others to optimize, we would like our definition to capture this.

There have also been a variety of metrics proposed for comparing problem difficulty for genetic algorithms, in addition to the metrics described below. These metrics include fitness distance correlation [5] and epistasis variance and correlation [6]. Fitness distance correlation is similar to the autocorrelation metric we describe below. It examines how closely correlated neighboring points in a landscape are. Epistasis is a biological term that refers to the amount of 'interplay' between two input variables. If there is no epistasis, then each input variable can be optimized independently, whereas a large amount of epistasis (as is found in most NP-hard problems) means that the optimal choice for one of the inputs to F depends upon the choices for the other inputs to F. Epistasis is most useful and easily measured when evolutionary algorithms are being employed for optimization.
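One standard way to quantify how closely correlated neighboring points are is the autocorrelation of objective values along a random walk over the landscape. The sketch below is a generic estimator of that quantity under our own interface assumptions, not the specific metric used later in the paper.

    import random
    import numpy as np

    def random_walk_autocorrelation(f, neighbors, start, walk_length=1000,
                                    max_lag=20, rng=random):
        """Estimate the autocorrelation of f along a random walk.

        f         : objective function (e.g. profit as a function of schedule parameters)
        neighbors : function returning the list of neighbors of a point under R
        start     : starting point of the walk
        Returns autocorrelations at lags 1..max_lag; values near 1 suggest a smooth
        landscape, values near 0 a rugged one.
        """
        x = start
        values = [f(x)]
        for _ in range(walk_length):
            x = rng.choice(neighbors(x))
            values.append(f(x))
        v = np.asarray(values, dtype=float)
        v = v - v.mean()
        denom = float(np.dot(v, v))
        if denom == 0.0:                      # perfectly flat walk
            return [1.0] * max_lag
        return [float(np.dot(v[:-lag], v[lag:]) / denom) for lag in range(1, max_lag + 1)]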
4 Applying Landscape Theory to Price Schedule Learning
By estimating a landscape's ruggedness, an adaptive agent can construct an estimate of how long it will take to find an optimum and of the learning cost associated with doing so. However, ruggedness cannot be measured directly; it must be inferred from other landscape characteristics. When one uses a generative model such as the NK model to build landscapes, it is fine to construct the model so that a parameter such as K tunes those features of the landscape that make it difficult to optimize, and then to call K the amount of 'ruggedness'. When landscapes are provided exogenously, however, this is not an option. Instead, an agent must look at the measurable characteristics of a landscape and use them to estimate ruggedness. In this section, we consider three possible observable landscape characteristics and study their efficacy as estimators of ruggedness, using amoeba as a measure of actual ruggedness. These measures are also applied to the two-part tariff and mixed bundling landscapes and used to explain quantitatively the result we argued for pictorially in Figure 3.
4.1 Number of Optima
One metric that is sometimes discussed [6] for determining the difficulty of finding the global optimum of a particular landscape is the number of optima it contains. Intuitively, if a landscape contains a single optimum, it should be easy to locate. The exception to this is a landscape that contains a single narrow peak and a large plateau. (See Figure 4 for an example of this.) Similarly, if
a landscape contains a large number of optima, it will be harder, particularly when using a hill-climbing algorithm, to find the global optimum.
Fig. 4. A landscape with a single peak, but large plateaus, which present a challenge for hill-climbing algorithms.
If we accept that the number of optima is an indicator of ruggedness, and therefore of learning difficulty, we must then ask how to determine the number of optima a landscape contains. In general, it is not possible to search a landscape exhaustively, at least not one with more than a couple of dimensions. In addition, the profit landscapes we examine in this domain have continuous dimensions, and they typically contain ridges and discontinuities. This makes the use of a standard hill-climbing algorithm to find optima a rather arbitrary exercise: because the input dimensions are continuous and the landscape contains ridges, the number of optima found will depend upon the granularity of the hill-climbing algorithm (how large a step it takes) rather than on any intrinsic feature of the landscape. Given this problem, it is more useful to look at the distribution of optima rather than their number.
4.2 Distribution of Optima
In addition to practical problems in calculating the number of optima on profit landscapes with continuous inputs, there are deeper problems in using the number of optima as an estimator of ruggedness. A great deal of information is lost if optima are simply counted. A landscape with a large number of optima that are all clustered together at the top of a hill would seem to be qualitatively different (and less rugged) than one in which the optima are evenly spaced throughout the landscape. Again, thinking in terms of basins of attraction may make this easier to understand. In general, we would conjecture that the more the distribution of basin sizes tends toward uniform, the more rugged the landscape is. In
Figure 3, we can see that the optima for two-part tariff are clustered on a hill, whereas mixed bundling contains a large plateau. This is a possible explanation for two-part tariff's being more easily learned and yielding higher cumulative profit. In this section, we validate that argument by estimating the distribution of optima for two-part tariff and mixed bundling landscapes. We find the distribution of optima in a landscape by finding the distribution of basin sizes. To do this, we use the following technique, inspired by a similar approach used by Garnier and Kallel [2]. First, note that for any point in a landscape, we can find the optimum of the basin it resides in by using a steepest-ascent hill-climbing algorithm. We choose a random set of p starting points and use steepest ascent to locate the corresponding optima. The distribution is stored in a sequence β: each element βi corresponds to an optimum i, and its value is the number of the p starting points that lie in the basin of optimum i. If we sum all of the elements of β, we have a value for the 'size' of the landscape. (If we normalize this size, then the elements of β are percentages.) By fitting a sorted β to a distribution, we can estimate the clusteredness of the landscape's optima. For simplicity, we fit β to an exponential distribution e^{λx}, where the magnitude of λ indicates the clusteredness of the optima. The exponential distribution is convenient because it has only one free parameter, λ, which governs the clusteredness of the optima. When λ = 0, optima are uniformly distributed; as |λ| increases, the distribution becomes more clustered. Using an exponential distribution also means that we can fit log(β) to a line, where the slope of the line is λ. Of course, we have no a priori reason (other than observation) to assume that an exponential distribution is the correct distribution. Future work will consider more complicated distributions, particularly ones such as the Beta distribution (not to be confused with our list β of basin sizes) that have heavy tails.

To determine the distribution of optima for two-part tariff and mixed bundling landscapes, we performed the following experiment. For each schedule, we generated 10 landscapes using N = 10 and 10 landscapes using N = 100. Consumers were generated identically to those in the experiments summarized in Figure 1. For each landscape we chose p = 1000 points and ran a steepest-ascent hill-climbing algorithm to determine β. The value of p was determined using a chi-square test, as in [2]: a p was selected, a distribution generated, and the expected and actual distributions compared; p was then increased and the comparison repeated. When increases in p no longer produced significant gains in confidence, p was taken to be 'large enough'. Figure 5 compares the log distributions of β (averaged over 10 landscapes) for two-part tariff and mixed bundling for N = 10 and N = 100. Previously, we argued that two-part tariff performed well because its optima were clustered on a hill, whereas mixed bundling contained a large flat region that served as a set of optima. Figure 5 supports this argument; λ is an order of magnitude larger for two-part tariff than for mixed bundling.
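A sketch of this basin-counting procedure is given below, under our own assumptions (a stand-in discrete landscape, numpy's least-squares line fit, and no chi-square check on p); the neighborhood and climbing helpers are repeated so the snippet runs on its own.

```python
import random
from collections import Counter
import numpy as np

def neighbors(x):
    for i in range(len(x)):
        for d in (-1, 1):
            y = list(x)
            y[i] += d
            yield tuple(y)

def climb(F, x):
    """Steepest-ascent hill climbing to the optimum of the basin containing x."""
    while True:
        best = max(neighbors(x), key=F)
        if F(best) <= F(x):
            return x
        x = best

def basin_sizes(F, dim, p=1000, low=-10, high=10, seed=0):
    """beta: for each optimum found, the number of the p random starting
    points whose hill climb ends at that optimum."""
    rng = random.Random(seed)
    beta = Counter()
    for _ in range(p):
        start = tuple(rng.randint(low, high) for _ in range(dim))
        beta[climb(F, start)] += 1
    return beta

def clusteredness(beta):
    """Fit log(basin sizes), sorted from largest to smallest, to a line;
    the slope is the lambda of the assumed exponential distribution."""
    sizes = np.array(sorted(beta.values(), reverse=True), dtype=float)
    slope, _intercept = np.polyfit(np.arange(len(sizes)), np.log(sizes), 1)
    return slope

# Stand-in rugged surface, not one of the paper's profit landscapes.
F = lambda x: -sum((xi - 3) ** 2 for xi in x) + 5 * sum(xi % 4 == 0 for xi in x)
beta = basin_sizes(F, dim=2, p=1000)
print("optima found:", len(beta), " lambda estimate:", round(clusteredness(beta), 3))
```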
Fig. 5. Distribution of basin sizes for two-part tariff and mixed bundling. The x axis ranks basins sorted by size and the y axis (log scale) is the size of a particular basin. Each line is the fitted distribution of basin sizes. The left figure is for N = 10 and the right is for N = 100.
We can also see that the optima for two-part tariff become less clustered as we move from N = 10 to N = 100. A closer examination of Figure 3 shows that the two-part tariff optima are located along a ridge; as N increases, this ridge grows, since larger values for the fee will yield positive profit. This spreads out the optima and reduces the magnitude of λ. The distribution of optima for mixed bundling does not change as significantly as we move from N = 10 to N = 100. Recall that mixed bundling offers consumers a choice between per-article and bundle pricing. This creates a large plateau in the landscape where the per-article price is too high, and so all consumers buy the bundle. A small change in per-article price gives an adaptive agent no change in profit. As we increase N , this plateau takes up a larger portion of the landscape, but the optima on this plateau (really just the flat portion of the plateau) retain their relative basin sizes. This points out a weakness in using this normalized approach: we are measuring the fraction of a landscape occupied by each basin, rather than an absolute measure of its size, which increases with N . Measuring the absolute size of each landscape is a difficult thing; it clearly affects the learning process, but it is hard to do without making arbitrary assumptions.
4.3 Autocorrelation Coefficient
Estimating the distribution of optima is a useful technique for explaining why two-part tariff outperforms mixed bundling, but there are other questions about landscape ruggedness that have been raised in this article, such as the role of N (the number of articles) and C (the number of consumers) in affecting ruggedness. In this section, we explain these differences in ruggedness using autocorrelation.
Hordijk [3] describes the use of autocorrelation as a method of measuring the ruggedness of a landscape. To construct this measurement, one conducts a random walk over a landscape, retaining all the (x, F(x)) pairs. This series of pairs is then treated not as a sequence of steps but as a series of state transitions generated by a Markov process. We can then apply a technique from time series analysis known as autocorrelation to obtain an estimate of ruggedness. What we wish to know is how well the last n points allow us to predict the value of the (n+1)th point. More importantly, we wish to know the largest t for which the (n+t)th point can be predicted from the nth point. The larger t is, the less rugged the landscape, since a learner will have a great deal of information that it can use to predict the next point. A small t indicates a rapidly changing landscape in which past observations have little predictive value, which is what is meant by ruggedness.

To be more precise, we begin by recalling that the covariance between two series of points X and Y, $\mathrm{Cov}(X,Y) = E[XY] - \mu_X \mu_Y$, is an indicator of how closely related the series are. If the covariance is normalized by the product of the standard deviations of X and Y, we obtain the correlation between X and Y, denoted $\rho(X,Y) = \mathrm{Cov}(X,Y) / (\sigma_X \sigma_Y)$. Autocorrelation is a closely related concept, except that instead of comparing two series, we compare a series to itself, shifted by a time lag, as a way of measuring the change in the process over time. The autocorrelation of points i steps apart in a series y is defined as $\rho_i = (E[y_t\, y_{t+i}] - E[y_t]\, E[y_{t+i}]) / \mathrm{Var}(y_t)$. Autocorrelation allows us to determine the correlation length of a landscape [10], which we will use as an indicator of ruggedness. The correlation length is the largest i for which there is a significantly nonzero autocorrelation.

We compare the autocorrelation of two-part tariff and mixed bundling landscapes as N and C are varied. In addition, we consider two different sorts of paths: one collected through a steepest-ascent algorithm, which indicates ruggedness during optimization, and one collected through a random walk over the landscape, which serves as an overall characterization of ruggedness. This will help us to understand whether particular values of N and C play a part in the learnability of two-part tariff and mixed bundling. The experiment works as follows: we generate a random profit landscape (using the distribution of consumers that generated the learning curves in Figure 1 and varying N and C between 10 and 100). We then choose 1000 random points on the landscape and run a steepest-ascent hill-climbing algorithm from each point until an optimum is reached. We compute the autocorrelation over each such path for all window sizes from 1 to 40 and average the results to get a mean autocorrelation (for each window size) during optimization for this landscape. This is then averaged across 10 landscapes, giving an average autocorrelation during optimization for each schedule. Next, for each landscape, we conduct a random walk of length 1000 and measure the autocorrelation over this walk with window sizes from 1 to 40. These random walk autocorrelations are then averaged across 10 landscapes to yield a random walk autocorrelation for each schedule.
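The sketch below (again our own code, not the authors') computes these autocorrelations along a random walk and reports a crude correlation length, taken here as the largest lag whose autocorrelation exceeds an arbitrary threshold; a proper significance test, as the definition requires, would replace that threshold.

```python
import random
import numpy as np

def neighbors(x):
    for i in range(len(x)):
        for d in (-1, 1):
            y = list(x)
            y[i] += d
            yield tuple(y)

def random_walk_values(F, x0, length=1000, seed=0):
    """Record F(x) along a random walk that moves to a uniformly chosen neighbor."""
    rng = random.Random(seed)
    x, values = tuple(x0), []
    for _ in range(length):
        values.append(F(x))
        x = rng.choice(list(neighbors(x)))
    return np.array(values, dtype=float)

def autocorrelation(values, lag):
    """rho_i = (E[y_t * y_{t+i}] - E[y_t] * E[y_{t+i}]) / Var(y_t)."""
    y0, y1 = values[:-lag], values[lag:]
    return (np.mean(y0 * y1) - y0.mean() * y1.mean()) / values.var()

def correlation_length(values, max_lag=40, threshold=0.2):
    """Largest lag up to max_lag whose autocorrelation exceeds the threshold
    (a crude stand-in for 'significantly nonzero')."""
    length = 0
    for lag in range(1, max_lag + 1):
        if autocorrelation(values, lag) > threshold:
            length = lag
    return length

F = lambda x: -sum((xi - 3) ** 2 for xi in x)   # smooth stand-in landscape
values = random_walk_values(F, (0, 0))
print("rho_1, rho_5, rho_10:", [round(autocorrelation(values, i), 2) for i in (1, 5, 10)])
print("correlation length:", correlation_length(values))
```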
Fig. 6. Autocorrelation as a function of window size for two-part tariff and mixed bundling (N=10, C=10). The left figure uses a path generated by steepest ascent, and the right uses a path generated by a random walk.
Fig. 7. Autocorrelation as a function of window size for two-part tariff and mixed bundling (N=10, C=100). The left figure uses a path generated by steepest ascent, and the right uses a path generated by a random walk.
Figures 6, 7, 8, and 9 compare autocorrelation over both random walks and optimization paths for N = {10, 100} and C = {10, 100}. From these figures, we can draw several conclusions. First, the significant window size is much smaller when optimizing on either landscape than when performing a random walk; this should not be surprising, since the whole point of optimizing is to change one's state, hopefully in a useful direction. It is interesting that both landscapes produce very similar autocorrelations when optimizing, indicating that the difference in learning difficulty is probably not due to a difference in the ability to reach optima effectively. Instead, our previous conclusion that the distribution of optima is more uniform for mixed bundling (meaning also that it is more difficult to move between optima) gains credence. Second, we note that, for random walks, mixed bundling shows little change as N and C are varied, while two-part tariff improves when either N or C is increased. We expected increasing either variable to improve autocorrelation,
Fig. 8. Autocorrelation as a function of window size for two-part tariff and mixed bundling (N=100, C=10). The left figure uses a path generated by steepest ascent, and the right uses a path generated by a random walk.
Fig. 9. Autocorrelation as a function of window size for two-part tariff and mixed bundling (N=100, C=100). The left figure uses a path generated by steepest ascent, and the right uses a path generated by a random walk.
since increasing C appears to reduce the size of discontinuities in the landscape. It is unclear why mixed bundling does not show the same improvement; however, we can conclude that it is less sensitive to changes in these parameters. We can also see that autocorrelation is significantly higher for mixed bundling than for two-part tariff, although only for random walks; it is very similar when optimizing. We conjecture that this is due to the large plateaus seen in the mixed bundling landscape: a random walk along a plateau will have an autocorrelation of 1. Finally, we note that the effective window size actually decreases for two-part tariff when both N and C are 100, compared with the case where one variable is 10 and the other 100. This may be due to a 'stretching' of the landscape as N is increased; the effective range of the fee parameter of two-part tariff grows with N. In summary, these experiments help us to understand quantitatively what we were previously able to explain only through a reliance on pictures and an
appeal to metaphors. There clearly seems to be a correlation between optima distribution and learning difficulty with regard to two-part tariff and mixed bundling. There is also some evidence that increasing C and N reduces landscape ruggedness, although not necessarily in a way that affects learning performance for an algorithm such as amoeba.
5 Conclusions and Future Work
In this article, we have described the problem of learning in an environment where cumulative reward is the measure of performance and stressed the need for an adaptive agent to consider what it chooses to learn as a way of optimizing its total reward. We have argued that landscapes are a useful representation for an agent's learning problem and applied the analysis of landscapes to a particular learning problem, that of learning price schedules in an information economy. We showed that two metrics, distribution of optima and autocorrelation, can be calculated and used as estimates of ruggedness, and further used these metrics to explain results in our previous work. By using these metrics to estimate the difficulty of different landscapes, an adaptive agent can make a more informed decision as to which learning problem it will choose to solve. There are many possible directions for future research. One particular avenue is the extension of this analysis to learning in nonstationary environments, that is, environments in which the landscape an agent is adapting to changes over time. In this case, we would like to characterize how this change affects the difficulty of the agent's learning problem. In particular, we are interested in problems where one agent's learning affects the learning problem of another agent, and in providing tools by which agents can minimize their impact on each other's learning. Measuring this impact is a necessary step toward solving that problem. Acknowledgments. This work was supported in part by NSF grants IIS-9872057 and IIS-0112669.
References
1. Christopher H. Brooks, Robert S. Gazzale, Rajarshi Das, Jeffrey O. Kephart, Jeffrey K. MacKie-Mason, and Edmund H. Durfee. Model selection in an information economy: Choosing what to learn. Computational Intelligence, 2002. To appear.
2. Josselin Garnier and Leila Kallel. How to detect all attraction basins of a function. In Theoretical Aspects of Evolutionary Computation, pages 343–365. Springer-Verlag, 2000.
3. Wim Hordijk. A measure of landscapes. Evolutionary Computation, 4(4):336–366, 1996.
4. Terry Jones. Evolutionary Algorithms, Fitness Landscapes, and Search. PhD thesis, University of New Mexico, May 1995.
5. Terry Jones and Stephanie Forrest. Fitness distance correlation as a measure of problem difficulty for genetic algorithms. In Larry Eshelman, editor, Proceedings of the Sixth International Conference on Genetic Algorithms, pages 184–192, San Francisco, 1995. Morgan Kaufmann.
6. L. Kallel, B. Naudts, and C.R. Reeves. Properties of fitness functions and search landscapes. In L. Kallel, B. Naudts, and A. Rogers, editors, Theoretical Aspects of Evolutionary Computation, pages 175–206. Springer-Verlag, 2000.
7. Stuart Kauffman. Origins of Order: Self-organization and Selection in Evolution. Oxford University Press, New York, 1993.
8. William H. Press et al. Numerical Recipes. Cambridge University Press, 1992.
9. Sebastian Thrun. The role of exploration in learning control. In Handbook of Intelligent Control: Neural, Fuzzy and Adaptive Systems. Van Nostrand Reinhold, Florence, Kentucky, 1992.
10. E. Weinberger. Correlated and uncorrelated fitness landscapes and how to tell the difference. Biological Cybernetics, 63:325–336, 1990.
11. Sewall Wright. The roles of mutation, inbreeding, crossbreeding and selection in evolution. In Proceedings of the 6th Congress on Genetics, page 356, 1932.
Relational Reinforcement Learning for Agents in Worlds with Objects
Sašo Džeroski
Department of Intelligent Systems, Jožef Stefan Institute, Jamova 39, 1000 Ljubljana, Slovenia, [email protected]
Abstract. In reinforcement learning, an agent tries to learn a policy, i.e., how to select an action in a given state of the environment, so that it maximizes the total amount of reward it receives when interacting with the environment. We argue that a relational representation of states is natural and useful when the environment is complex and involves many inter-related objects. Relational reinforcement learning works on such relational representations and can be used to approach problems that are currently out of reach for classical reinforcement learning approaches. This chapter introduces relational reinforcement learning and gives an overview of techniques, applications and recent developments in this area.
1 Introduction
In reinforcement learning (for an excellent introduction see the book by Sutton and Barto [13]), an agent tries to learn a policy, i.e., how to select an action in a given state of the environment, so that it maximizes the total amount of reward it receives when interacting with the environment. In cases where the environment is complex and involves many inter-related objects, a relational representation of states is natural. This typically yields a very high number of possible states and state/action pairs, which makes most of the existing tabular reinforcement learning algorithms inapplicable. Even the existing reinforcement learning approaches that are based on generalization, such as that of Bertsekas and Tsitsiklis [1], typically use a propositional representation and cannot deal directly with relationally represented states. We introduce relational reinforcement learning, which uses relational learning algorithms as generalization engines within reinforcement learning. We start with an overview of reinforcement learning ideas relevant to relational reinforcement learning. We then introduce several complex worlds with objects, for which a relational representation of states is natural. An overview of different relational reinforcement learning algorithms developed over the last five years is presented next and illustrated on an example from the blocks world. Finally, some experimental results are presented before concluding with a brief discussion.
2 Reinforcement Learning
This section gives an overview of reinforcement learning ideas relevant to relational reinforcement learning. For an extensive treatise on reinforcement learning, we refer the reader to Sutton and Barto [13]. We first state the task of reinforcement learning, then briefly describe the Q-learning approach to reinforcement learning. In its basic variant, Q-learning is tabular: this is unsuitable for problems with large state spaces, where generalization is needed. We next discuss generalization in reinforcement learning and in particular generalization based on decision trees. Finally, we discuss the possibility of integrating learning by exploration (as is typically the case in reinforcement learning) and learning with guidance (by a human operator or some other reasonable policy, i.e., a policy that yields sufficiently dense rewards).
2.1 Task Definition
The typical reinforcement learning task using discounted rewards can be formulated as follows.
Given
– a set of possible states S,
– a set of possible actions A,
– an unknown transition function δ : S × A → S, and
– an unknown real-valued reward function r : S × A → R,
find a policy π∗ : S → A that maximizes
$V^{\pi}(s_t) = \sum_{i=0}^{\infty} \gamma^{i}\, r_{t+i}$
for all st where 0 ≤ γ < 1. At each point in time, the reinforcement learning agent can be in one of the states st of S and selects an action at = π(st ) ∈ A to execute according to its policy π. Executing an action at in a state st will put the agent in a new state st+1 = δ(st , at ). The agent also receives a reward rt = r(st , at ). The function V π (s) denotes the value (expected return; discounted cumulative reward) of state s under policy π. The agent does not necessarily know what effect its actions will have, i.e., what state it will end up in after executing an action. This means that the function δ is unknown to the agent. In fact, it may even be stochastic: executing the same action in the same state on different occasions may yield different successor states. We also assume that the agent does not know the reward function r. The task of learning is then to find an optimal policy, i.e., a policy that will maximize the discounted sum of the rewards. We will assume episodic learning, where a sequence of actions ends in a terminal state.
2.2 Tabular Q-Learning
Here we summarize Q-learning, one of the most common approaches to reinforcement learning, which assigns values to state-action pairs and thus implicitly represents policies. The optimal policy π∗ will always select the action that maximizes the sum of the immediate reward and the value of the immediate successor state, i.e.,

$\pi^*(s) = \arg\max_a \bigl( r(s,a) + \gamma V^{\pi^*}(\delta(s,a)) \bigr)$

The Q-function for policy π is defined as follows:

$Q^{\pi}(s,a) = r(s,a) + \gamma V^{\pi}(\delta(s,a))$

Knowing Q∗, the Q-function for the optimal policy, allows us to rewrite the definition of π∗ as follows:

$\pi^*(s) = \arg\max_a Q^*(s,a)$

An approximation to the Q∗-function, Q, in the form of a look-up table, is learned by the following algorithm.

Table 1. The Q-learning algorithm.
    Initialize Q(s, a) arbitrarily
    repeat (for each episode)
        Initialize s0; t ← 0
        repeat (for each step of episode)
            Choose at for st using the policy derived from Q
            Take action at, observe rt, st+1
            Q(st, at) ← rt + γ max_a Q(st+1, a)
            t ← t + 1
        until st is terminal
    until no more episodes
The agent learns through continuous interaction with the environment, during which it exploits what it has learned so far, but also explores. In practice, this means that the current approximation Q is used to select an action most of the time; however, in a small fraction of cases an action is selected randomly from the available choices, so that unseen state/action pairs can be explored. For smoother learning, an update of the form Q(st, at) ← Q(st, at) + α[rt + γ max_a Q(st+1, a) − Q(st, at)] would be used. This is a special case of temporal-difference learning, a family to which algorithms such as SARSA [12] also belong. In SARSA, instead of considering all possible actions a in state st+1 and taking the maximum Q(st+1, a), only the
Relational Reinforcement Learning for Agents in Worlds with Objects
309
action at+1 actually taken in state st+1 during the current episode is considered. The update rule is thus Q(st, at) ← Q(st, at) + α[rt + γ Q(st+1, at+1) − Q(st, at)]. For the algorithm in Table 1, the learned action-value function Q directly approximates Q∗, regardless of the policy being followed.
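As a point of reference, here is a minimal tabular sketch of both update rules on a toy chain environment of our own invention (nothing below is the chapter's code): the Q-learning backup from Table 1 with an epsilon-greedy policy, and the SARSA variant that bootstraps on the action actually taken in the next state.

```python
import random
from collections import defaultdict

# Toy episodic chain: states 0..N, actions 0 (left) and 1 (right),
# reward 1 on reaching state N, which ends the episode.
N = 5
ACTIONS = (0, 1)

def step(s, a):
    s2 = max(0, s - 1) if a == 0 else s + 1
    return s2, (1.0 if s2 == N else 0.0), s2 == N   # next state, reward, terminal?

def epsilon_greedy(Q, s, eps=0.1):
    if random.random() < eps:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def q_learning(episodes=500, alpha=0.5, gamma=0.9):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = epsilon_greedy(Q, s)
            s2, r, done = step(s, a)
            target = r + gamma * max(Q[(s2, b)] for b in ACTIONS)   # off-policy backup (Table 1)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s2
    return Q

def sarsa(episodes=500, alpha=0.5, gamma=0.9):
    Q = defaultdict(float)
    for _ in range(episodes):
        s, done = 0, False
        a = epsilon_greedy(Q, s)
        while not done:
            s2, r, done = step(s, a)
            a2 = epsilon_greedy(Q, s2)
            target = r + gamma * Q[(s2, a2)]                        # on-policy backup (SARSA)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s2, a2
    return Q

random.seed(0)
Q = q_learning()
Qs = sarsa()
print("Q-learning greedy action in state 0:", max(ACTIONS, key=lambda a: Q[(0, a)]))   # expect 1
print("SARSA greedy action in state 0:", max(ACTIONS, key=lambda a: Qs[(0, a)]))       # expect 1
```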
2.3 Generalization / G-Trees
Using a tabular representation for the learned approximation to the Q-function or V-functions is only feasible for tasks with small numbers of states and actions. This is due to both issues of space (large table) and time (needed to fill the table accurately). The way out is to generalize over states and actions, so that approximations can be produced also for states (and possibly actions) that the agent has never seen before. Most approaches to generalization in reinforcement learning use neural networks for function approximation [1]. States are represented by feature vectors. Updates to state-values or state-action values are treated as training examples for supervised learning. Nearest-neighbor methods have also been used, especially in the context of continuous states and actions [11].

Table 2. The G-algorithm.
    Create an empty leaf
    while data available do
        Sort data down to leaves and update statistics in leaves
        if a split is needed in a leaf then grow two empty leaves
The G-algorithm [3] is a decision tree learning algorithm designed for generalization in reinforcement learning. An extension of this algorithm has been used in relational reinforcement learning: we thus briefly summarize its main features here. The G-algorithm updates its theory incrementally as examples are added. An important feature is that examples can be discarded after they are processed. This avoids using a huge amount of memory to store examples. At a high level, the G-algorithm (Table 2) stores the current decision tree: for each leaf node, statistics are kept for all tests that could be used to split that leaf further. Each time an example is inserted, it is sorted down the decision tree according to the tests in the internal nodes; in the leaves, the statistics of the tests are updated.
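A minimal sketch of this bookkeeping follows; it assumes propositional (boolean) tests and uses a simple mean-difference trigger in place of the statistical split test of the original G-algorithm, so it should be read as an illustration of the data flow, not as a faithful reimplementation. Examples are discarded once the leaf statistics are updated, as described above.

```python
class Leaf:
    """One leaf of the incrementally grown tree: it keeps, for every candidate
    boolean test, running counts and sums of the Q targets seen so far, so the
    examples themselves never need to be stored."""

    def __init__(self, tests):
        self.tests = tests
        # per test: [count, sum_of_q] for examples failing / passing the test
        self.stats = {t: [[0, 0.0], [0, 0.0]] for t in tests}
        self.children = None          # (test, fail_leaf, pass_leaf) after a split

    def update(self, example, q, min_count=30, min_gap=0.1):
        for t in self.tests:
            bucket = self.stats[t][1 if t(example) else 0]
            bucket[0] += 1
            bucket[1] += q
        # Split on the test whose outcomes differ most in mean Q, once both
        # branches have enough data (a stand-in for the original statistical test).
        for t, (fail, ok) in self.stats.items():
            if fail[0] >= min_count and ok[0] >= min_count:
                if abs(fail[1] / fail[0] - ok[1] / ok[0]) >= min_gap:
                    rest = [u for u in self.tests if u is not t]
                    self.children = (t, Leaf(rest), Leaf(rest))
                    return

def insert(root, example, q):
    """Sort an example down to a leaf and update that leaf's statistics."""
    node = root
    while node.children is not None:
        test, fail_leaf, pass_leaf = node.children
        node = pass_leaf if test(example) else fail_leaf
    node.update(example, q)

# Hypothetical propositional features of a state/action pair.
tests = [lambda ex: ex["goal_reached"], lambda ex: ex["clear_a"]]
root = Leaf(tests)
insert(root, {"goal_reached": False, "clear_a": True}, q=0.9)
```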
2.4 Exploration and Guidance
Besides the problems with tabular Q-learning, large state/action spaces entail another type of problem for reinforcement learning. Namely, in a large state/action space, rewards may be so sparse that with random exploration (as is typical at the start of a reinforcement learning run) they will only be discovered
(Blocks d on a on b; c on the floor.)
clear(d). clear(c). on(d,a). on(a,b). on(b,floor). on(c,floor). move(d,c).
Fig. 1. Example state and action in the blocks-world.
extremely slowly. This problem has only recently been addressed for the case of continuous state/action spaces. Smart and Kaelbling [11] integrate exploration in the style of reinforcement learning with human-provided guidance. Traces of human-operator performance are provided to a robot learning to navigate as a supplement to its reinforcement learning capabilities. Using nearest-neighbor methods together with precautions to avoid overgeneralization, they show that using the extra guidance helps improve the performance of reinforcement learning.
3 Some Worlds with Objects
In this section, we introduce three domains where using a relational representation of states is natural. Each of the domains involves objects and relations between them. The number of possible states in all three domains is very large. The three domains are: the blocks world, the Digger computer game, and the Tetris computer game.
3.1 The Blocks World
In the blocks world, blocks can be on the floor or stacked on each other. Each state can be described by a set (list) of facts, e.g., s = {clear(c), clear(d), on(d, a), on(a, b), on(b, floor), on(c, floor)} represents the state in Figure 1. The available actions are then move(X, Y), where X ≠ Y, X is a block and Y is a block or the floor. The number of states in the blocks world grows rapidly with the number of blocks. With 10 blocks, there are close to 59 million possible states. We study three different goals in the blocks world: stacking all blocks, unstacking all blocks (i.e., putting all blocks on the floor), and putting a specific block on top of another specific block. In a blocks world with 10 blocks, there are 3.5 million states that satisfy the stacking goal, 1.5 million states that satisfy a specific on(A, B) goal (where A and B are bound to specific blocks), and only one state that satisfies the unstacking goal. A reward of 1 is given if a goal state is reached in the optimal number of steps; otherwise the episode ends with a reward of 0.
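A small sketch (our own code, not the chapter's) of this representation: a state is a frozenset of ground facts, the legal move(X, Y) actions are enumerated from the clear/1 facts, and the three goals are simple tests on the facts.

```python
def clear_blocks(state):
    return {fact[1] for fact in state if fact[0] == "clear"}

def legal_moves(state):
    """move(X, Y): X is a clear block, Y is the floor or another clear block, X != Y."""
    movable = clear_blocks(state)
    targets = clear_blocks(state) | {"floor"}
    return [("move", x, y) for x in movable for y in targets if x != y]

def apply_move(state, action):
    """Return the successor state after move(X, Y)."""
    _, x, y = action
    below = next(f[2] for f in state if f[0] == "on" and f[1] == x)
    new = set(state)
    new.discard(("on", x, below))
    new.add(("on", x, y))
    if below != "floor":
        new.add(("clear", below))
    if y != "floor":
        new.discard(("clear", y))
    return frozenset(new)

def unstacked(state):              # unstacking goal: every block is on the floor
    return all(f[2] == "floor" for f in state if f[0] == "on")

def stacked(state):                # stacking goal: a single stack, so exactly one clear block
    return len(clear_blocks(state)) == 1

def on_goal(state, a, b):          # on(a, b) goal
    return ("on", a, b) in state

s = frozenset({("clear", "d"), ("clear", "c"), ("on", "d", "a"),
               ("on", "a", "b"), ("on", "b", "floor"), ("on", "c", "floor")})
print(legal_moves(s))              # the four legal moves d->c, d->floor, c->d, c->floor (in some order)
s2 = apply_move(apply_move(s, ("move", "d", "floor")), ("move", "a", "floor"))
print(unstacked(s2))               # True
```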
3.2 The Digger Game
Digger (http://www.digger.org) is a computer game created in 1983 by Windmill Software. It is one of the few old computer games that still hold a fair amount of popularity. In this game, the player controls a digging machine or “Digger” in an environment that contains emeralds, bags of gold, two kinds of monsters (nobbins and hobbins) and tunnels. The object of the game is to collect as many emeralds and as much gold as possible while avoiding or shooting monsters.
Fig. 2. A snapshot of the DIGGER game.
In our tests we removed the hobbins and the bags of gold from the game. Hobbins are more dangerous than nobbins for human players, because they can dig their own tunnels and reach Digger faster, as well as increase the mobility of the nobbins. However, they are less interesting for learning purposes, because they reduce the implicit penalty for digging new tunnels (and thereby increasing the mobility of the monsters) when trying to reach certain rewards. The bags of gold we removed to reduce the complexity of the game. A state representation consists of the following components:
– the coordinates of digger, e.g., digPos(6,9);
– information on digger of the form digInf(digger_dead, time_to_reload, level_done, pts_scored, steps_taken), e.g., digInf(false,63,false,0,17);
– information on tunnels as seen by digger (range of view in each direction, e.g., tunnel(4,0,2,0)); information on the tunnel is relative to the digger; there is only one digger, so there is no need for a digger index argument;
– the list of emeralds (e.g., [em(14,9), em(14,8), em(14,5), . . .]);
– the list of monsters (e.g., [mon(10,1,down), mon(10,9,down), . . .]); and
– information on the fireball fired by digger (coordinates and travelling direction, e.g., fb(7,9,right)).
The actions are of the form moveOne(X) and shoot(Y), where X and Y are in [up, down, left, right].
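One possible Python mirror of this relational state representation (our own naming, directly following the predicates listed above) is sketched below; it is only meant to show the pieces side by side.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Optional

@dataclass
class DiggerState:
    dig_pos: Tuple[int, int]                       # digPos(X, Y)
    dig_inf: Tuple[bool, int, bool, int, int]      # digInf(dead, reload, level_done, points, steps)
    tunnel: Tuple[int, int, int, int]              # tunnel view range in each direction
    emeralds: List[Tuple[int, int]] = field(default_factory=list)        # em(X, Y)
    monsters: List[Tuple[int, int, str]] = field(default_factory=list)   # mon(X, Y, Dir)
    fireball: Optional[Tuple[int, int, str]] = None                      # fb(X, Y, Dir)

# Actions: moveOne(Dir) or shoot(Dir) with Dir in {"up", "down", "left", "right"}.
state = DiggerState(
    dig_pos=(6, 9),
    dig_inf=(False, 63, False, 0, 17),
    tunnel=(4, 0, 2, 0),
    emeralds=[(14, 9), (14, 8), (14, 5)],
    monsters=[(10, 1, "down"), (10, 9, "down")],
    fireball=(7, 9, "right"),
)
action = ("moveOne", "right")
```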
Fig. 3. A snapshot of the TETRIS game.
3.3 The Tetris Game
Tetris (invented by Alexey Pazhitnov; the game is owned by The Tetris Company and Blue Planet Software) is a widespread puzzle video game played on a two-dimensional grid. Differently shaped blocks fall from the top of the game field and fill up the grid. The object of the game is to score points while keeping the blocks from piling up to the top of the game field. To do this, one can move the dropping blocks right and left or rotate them as they fall. When one horizontal row is completely filled, that line disappears and the player scores points. When the blocks pile up to the top of the game field, the game ends. In the tests presented, we only looked at the strategic part of the game, i.e., given the shapes of the dropping block and the next block, one has to decide on the optimal orientation and location of the block in the game field. (Using the low-level actions turn, move left and move right to reach such a subgoal is rather trivial and can easily be learned by (relational) reinforcement learning.) We represent the full state of the Tetris game, including the type of the next dropping block.
4 Relational Reinforcement Learning
Relational reinforcement learning (RRL) addresses much the same task as reinforcement learning in general. What is typical of RRL is the use of a relational (first-order) representation to represent states, actions and (learned) policies. Relational learning methods, originating from the field of inductive logic programming [10], are used as generalization engines.
4.1 Task Definition
While the task definition for reinforcement learning (as specified earlier in this chapter) applies to RRL, a few details are worth noting. States and actions are
represented relationally, and background knowledge and declarative bias need to be specified for the relational generalization engines. All possible states are not listed explicitly as input to the RRL algorithm (as they might be for ordinary reinforcement learning); rather, a relational language for specifying states is defined (in the blocks world, this language comprises the predicates on(A, B) and clear(C)). Actions are also specified in a relational language (move(A, B) in the blocks world), and not all actions are applicable in all states; in fact, the number of possible actions may vary considerably across different states. Background knowledge generally valid about the domain (the states in S) can be specified in RRL. This includes predicates that derive new facts about a given state. In the blocks world, for example, a predicate above(A, B) may define that a block A is above another block B. Declarative bias for learning relational representations of policies can also be given. In the blocks world, e.g., we do not allow policies to refer to the exact identity of blocks (A = a, B = b, etc.). The background knowledge and declarative bias taken together specify the language in which policies are represented.
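As an illustration of such a derived predicate, the sketch below (hypothetical Python standing in for the Prolog-style clause one would actually write) computes above(A, B) as the transitive closure of the on/2 facts in a state.

```python
def on_pairs(state):
    """Extract the on/2 facts from a state given as a set of fact tuples."""
    return {(fact[1], fact[2]) for fact in state if fact[0] == "on"}

def above(state, a, b):
    """above(A, B): A is on B, or A is on some C that is above B (transitive closure of on/2)."""
    ons = on_pairs(state)
    frontier = {x for (x, y) in ons if y == b}      # everything directly on b
    seen = set()
    while frontier:
        if a in frontier:
            return True
        seen |= frontier
        frontier = {x for (x, y) in ons if y in frontier} - seen
    return False

s = {("on", "d", "a"), ("on", "a", "b"), ("on", "b", "floor"),
     ("on", "c", "floor"), ("clear", "d"), ("clear", "c")}
print(above(s, "d", "b"), above(s, "c", "b"))   # True False
```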
4.2 The RRL Algorithm
The RRL algorithm (Table 3) is obtained by combining the classical Q-learning algorithm (Table 1) and a relational regression tree algorithm (TILDE [2]). Instead of an explicit lookup table for the Q-function, an implicit representation of this function is learned in the form of a logical regression tree, called a Q-tree. After a Q-tree is learned, a classification tree is learned that classifies actions as optimal or non-optimal. This tree, called a P-tree, is usually much more succinct than the Q-tree, since it does not need to distinguish among different levels of non-optimality. The RRL algorithm is given in Table 3. In its initial implementation [7], RRL keeps a table of state/action pairs with their current Q-values. This table is used to create a generalization in the form of a relational regression tree (Q-tree) by applying TILDE. The Q-tree is then used as the policy by which the agent selects actions. The reason the table is kept is the nonincrementality of TILDE. In complex worlds, where states can have a variable number of objects, an exact Q-tree representation of the optimal policy can be very large and can also depend on the number of objects in the state. For example, in the blocks world, a state can have a varying number of blocks: the number of possible values for the Q-function (and the complexity of the Q-tree) would depend on this number. Choosing the optimal action, however, can sometimes be very simple: in the unstacking task, we simply have to pick up a block that is on top of another block and put it on the floor. This was our motivation for learning a P-tree by generating examples from the Q-tree.
Table 3. The RRL algorithm for relational reinforcement learning.
    Initialize Q̂_0 to assign 0 to all (s, a) pairs
    Initialize P̂_0 to assign 1 to all (s, a) pairs
    Initialize Examples to the empty set
    e := 0
    while true do
        generate an episode that consists of states s0 to si and actions a0 to ai−1,
            through the use of a standard Q-learning algorithm, using the current hypothesis for Q̂_e
        for j = i−1 to 0 do    [generate examples for learning the Q-tree]
            generate example x = (sj, aj, q̂j), where q̂j := rj + γ max_a Q̂_e(sj+1, a)
            if an example (sj, aj, q̂old) exists in Examples, replace it with x, else add x to Examples
        update Q̂_e by applying TILDE to Examples, i.e., Q̂_{e+1} = TILDE(Examples)
        for j = i−1 to 0 do    [generate examples for learning the P-tree]
            for all actions ak possible in state sj do
                if the state-action pair (sj, ak) is optimal according to Q̂_{e+1}
                then generate example (sj, ak, c) where c = 1
                else generate example (sj, ak, c) where c = 0
        update P̂_e: apply TILDE to the examples (sj, ak, c) to produce P̂_{e+1}
        e := e + 1
(The depicted episode: move(c,floor) with r = 0, Q = 0.81; then move(b,c) with r = 0, Q = 0.9; then move(a,b) with r = 1, Q = 1, which reaches the goal state; from that terminal state, move(a,floor) has r = 0, Q = 0.)
Fig. 4. A blocks-world episode for relational Q-learning.
4.3 An Example
To illustrate how the RRL algorithm works, we use an example from the blocks world. The task here is to stack block a on block b, i.e., to achieve on(a, b). An example episode is shown in Figure 4. As for the tabular version of Q-learning, updates of the Q-function are generated for all state/action pairs encountered during the episode. These are also listed in the figure.
Table 4. Examples for TILDE generated from the blocks-world Q-learning episode in Figure 4.

    Example 1         Example 2         Example 3         Example 4
    qvalue(0.81).     qvalue(0.9).      qvalue(1.0).      qvalue(0.0).
    move(c,floor).    move(b,c).        move(a,b).        move(a,floor).
    goal(on(a,b)).    goal(on(a,b)).    goal(on(a,b)).    goal(on(a,b)).
    clear(c).         clear(b).         clear(a).         clear(a).
    on(c,b).          clear(c).         clear(b).         on(a,b).
    on(b,a).          on(b,a).          on(b,c).          on(b,c).
    on(a,floor).      on(a,floor).      on(a,floor).      on(c,floor).
    on(c,floor).      on(c,floor).      on(c,floor).

root : goal on(A,B), action move(D,E)
on(A,B) ?
+--yes: [0]
+--no:  clear(A) ?
        +--yes: [1]
        +--no:  clear(E) ?
                +--yes: [0.9]
                +--no:  [0.81]
Fig. 5. A relational regression tree (Q-tree) generated by TILDE from the examples in Table 4.
The examples generated for TILDE from this episode are given in Table 4. A reward is only obtained when moving a onto b (r = 1, Q = 1): the Q-value is propagated backwards to also reward the actions preceding and leading to move(a, b). Note that a Q-value of zero is assigned to any state/action pair where the state is terminal (the last state in the episode), as no further reward can be expected. From the examples in Table 4, the Q-tree in Figure 5 is learned. The root of the tree (goal on(A,B), action move(D,E)) introduces the state-action pair evaluated, while the rest of the tree performs the evaluation, i.e., calculates the Q-value. The tree correctly predicts zero Q-value if the goal is already achieved and a Q-value of one for any action, given that block A is clear. This is obviously overly optimistic, but does capture the fact that A needs to be clear in order to stack it onto B. Note that the goal on(A, B) explicitly appears in the Q-tree. If we use the Q-trees to generate examples for learning the optimality of actions, we obtain the P-tree in Figure 6. Note that the P-tree represents a policy much closer to the optimal one. If we want to achieve on(A, B), it is optimal to move a block that is above A. Also, the action move(A, B) is optimal whenever it is possible to take it.
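The Q-values in this example follow from the backward pass over the episode with γ = 0.9; the sketch below (our own code) reproduces that bookkeeping, yielding 0.81, 0.9 and 1 for the three actions taken and 0 for any action from the terminal state. In the algorithm itself the backup uses the current Q-tree hypothesis, q̂_j = r_j + γ max_a Q̂_e(s_{j+1}, a); with an ideal Q̂ the two coincide on this episode.

```python
GAMMA = 0.9

# The Figure 4 episode: action taken and immediate reward (state descriptions omitted).
episode = [("move(c,floor)", 0.0), ("move(b,c)", 0.0), ("move(a,b)", 1.0)]

def backward_qvalues(episode, gamma=GAMMA):
    """Assign each action the discounted return of the rest of the episode.
    (In the RRL algorithm itself the backup bootstraps on the current Q-tree;
    with an ideal Q-function both give the same numbers here.)"""
    examples, future = [], 0.0
    for action, reward in reversed(episode):
        future = reward + gamma * future
        examples.append((action, round(future, 4)))
    return list(reversed(examples))

print(backward_qvalues(episode))
# [('move(c,floor)', 0.81), ('move(b,c)', 0.9), ('move(a,b)', 1.0)]
# The terminal state contributes examples with Q = 0, e.g. ('move(a,floor)', 0.0).
```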
4.4 Incremental RRL/TG Trees
The RRL algorithm as described in the previous section has a number of problems: it needs to keep track of an ever-increasing number of examples, needs to replace old Q-values with new ones if a state-action pair is encountered again, and builds trees from scratch after each episode. The G-tree algorithm (also mentioned earlier) does not have these problems, but it only works for propositional representations. Driessens et al. [6] upgrade G-tree to work for relational representations, yielding the TG-tree algorithm. At the top level, the TG-tree algorithm is the same as the G-tree algorithm; it differs in that TG can use relational tests to split on, the same type of tests that TILDE can use. Using TG instead of TILDE within RRL yields the RRL-TG algorithm.

Table 5. The G-RRL algorithm: the RRL-TG algorithm with integrated guidance (k example traces).
    Initialise Q̂_0 to assign 0 to all (state, action) pairs
    for i = 0 to k do
        transform trace_i into (state, action, qvalue) triplets
        process the generated triplets with the TG algorithm, transforming Q̂_i into Q̂_{i+1}
    run normal RRL-TG starting with Q̂_k as the initial Q-function hypothesis
4.5 Integrating Experimentation and Guidance in RRL
Since RRL typically deals with huge state spaces, sparse rewards are indeed a serious problem. To alleviate this problem, Driessens and Džeroski [5] follow the example of Smart and Kaelbling [11] and integrate experimentation and guidance in RRL. In G-RRL (guided RRL), traces of human behavior or traces generated by following some reasonable policy (that generates sufficiently dense rewards) are provided at the beginning and are followed by ordinary RRL. Note that a reasonable policy could also be a previously learned policy that we want to improve upon. The G-RRL algorithm is given in Table 5.
root : goal on(A,B), action move(D,E)
above(D,A) ?
+--yes: optimal
+--no:  action move(A,B) ?
        +--yes: optimal
        +--no:  nonoptimal
Fig. 6. A P-tree for the three blocks world generated from the episode in Figure 4.
5 Experiments
Here we summarize the results of experiments with RRL. RRL was extensively evaluated experimentally on the blocks world by Džeroski et al. [8]. We first summarize these results. We then proceed with an overview of the most recent experiments with RRL, which involve the use of guidance in addition to pure reinforcement learning [5], i.e., the use of the G-RRL algorithm. These experiments involve the three domains described earlier in this chapter: the blocks world, the Digger game and the Tetris game.
5.1 Blocks World Experiments with RRL
We have conducted experiments [8] in the blocks world with 3, 4, and 5 blocks, considering the tasks of stacking, unstacking and on(a, b) mentioned earlier. These experiments consider both settings with a fixed number of blocks (either 3, 4 or 5) and with a varying number of blocks (first learn with 3 blocks, use this to bootstrap learning with 4 blocks, and similarly learn with 5 blocks afterwards). In addition to the state and action information, the RRL algorithm was supplied with the number of blocks, the number of stacks and the following background predicates: equal/2, above/2, height/2 and difference/3 (an ordinary subtraction of two numerical values). The experiments show that RRL is effective for different goals: it was successfully used for stacking and unstacking, and after some representational engineering also for on(a, b). Policies learned for on(a, b) can be used for solving on(A, B) for any A and B. RRL can learn optimal policies for state spaces with a fixed number of blocks (both with Q-trees and P-trees), but this becomes more difficult as the number of blocks increases. An explanation for this is that the sparse-rewards problem becomes more and more severe as the number of possible states grows rapidly with the number of blocks. Even when learning from experience with a fixed number of blocks, RRL can learn policies that are optimal for state spaces with a varying number of blocks. Q-functions optimal for state spaces with a fixed number of blocks are not optimal for state spaces with a varying number of blocks, but we can learn optimal P-functions from the Q-functions. These P-functions are often optimal for state spaces with a varying number of blocks as well. RRL can also learn from experience in which the number of blocks is varied. Starting with a small number of blocks and gradually increasing it allows for a bootstrapping process, where optimal policies are learned faster. If the Q-tree learned does not work, then the P-tree will not work either. But once a Q-tree is learned that does the job right (even for states with a fixed number of blocks), one is better off using the P-tree learned from it: the latter usually generalizes nicely to larger numbers of blocks than seen during training.
5.2 Experiments with G-RRL
The experiments with G-RRL involve the three domains described earlier: the 10-blocks world, the Digger game and the Tetris game. Only Q-trees were built.
The Blocks World. In the blocks world, the three tasks mentioned earlier (stacking, unstacking and on(a, b)) were addressed. Traces of the respective optimal policies were provided at the beginning of learning, followed by an application of the RRL-TG algorithm.
Fig. 7. The learning curves of RRL and G-RRL for the stacking task.
Fig. 8. The learning curves of RRL and G-RRL for the on(a,b) task.
In summary, a moderate number of optimal traces helps the learning process converge faster and/or to a higher level of performance (average reward). The learning curves for the stacking and on(a, b) problems are given in Figures 7 and 8. G-RRL is supplied with 5, 20, and 100 optimal traces. Providing guidance clearly helps in the on(a, b) case, but less improvement is achieved when more traces are provided. For stacking, better performance is achieved when providing 5 or 20 traces. Providing 100 traces, however, actually causes worse performance as compared to the original RRL algorithm. The experiment takes longer to converge and during the presentation of the 100 traces to G-RRL no learning takes place.
The problem is that we supply the system with optimal actions only and it overgeneralizes, failing to distinguish between optimal and nonoptimal actions. The Digger Game. In the Digger Game, in addition to the state and action representation mentioned earlier, predicates such as emerald/2, nearestEmerald/2, monster/2, visibleMonster/2, distanceTo/2, getDirection/2, lineOfFire/1, etc., were provided as background knowledge for the construction of the Q-tree. Since it is hard to write an optimal policy, we used a policy generated in earlier work [4] by RRL (which already performed quite well). Figure 9 shows the average reward obtained by the learned strategies over 640 digger test-games divided over the 8 different Digger levels. It shows that G-RRL is indeed able to improve on the policy learned by RRL. Although the speed of convergence is not improved, G-RRL reaches a significantly higher level of overall performance.
Fig. 9. Learning curves for RRL and G-RRL for the Digger game.
The Tetris Game. For the Tetris game, RRL could use the following predicates (among others): blockwidth/2, blockheight/2, rowSmaller/2, topBlock/2, holeDepth/2, holeCovered/1, fits/2, increasesHeight/2, fillsRow/2 and fillsDouble/2. As with the Digger game, it is very hard (if not impossible) to generate an optimal or even "reasonable" strategy for the Tetris game. This time, we opted to supply G-RRL with traces of non-optimal playing behavior from a human player. The results for learning Tetris with RRL and G-RRL are below our expectations. We believe that this is because the future reward in Tetris is very hard to predict, especially for a regression technique such as TG, which needs to discretize these rewards. However, even with these disappointing results, the added guidance at the beginning of the learning experiment still has an effect on the overall performance. Figure 10 shows the learning curves for RRL and G-RRL supplied with 5 or 20 manually generated traces. The data points
Fig. 10. Learning curves for RRL and G-RRL for the Tetris game.
are the average number of deleted lines per game, calculated over 500 played test games.
6 Discussion
Relational reinforcement learning (RRL) is a powerful learning approach that allows us to address problems that have been out of reach of other reinforcement learning approaches. The relational representation of states, actions, and policies allows for the representation of objects and relations among them. Background knowledge that is generally valid in the domain at hand can also be provided to the generalization engine(s) used within RRL and adds further power to the approach. We expect RRL to be helpful to agents that are situated in complex environments which include many objects (and possibly other agents) and where the relations among objects, between the agent and objects, and among agents are of interest. The power of the representation formalism used would allow for different levels of awareness of other agents, i.e., social awareness [9]. Knowledge about the existence and behavior of other agents can be either provided as background knowledge or learned. There are many open issues and much work remains to be done on RRL. One of the sorest points at the moment is the generalization engine: it turns out that G-trees and TG-trees try to represent all policies followed by the agent during its lifetime and can thus be both large and ineffective. Developing better incremental and relational generalization engines is thus a priority. Finding better ways to integrate exploration and guidance also holds much promise for RRL. Finally, we are seeking to apply RRL to difficult, interesting and practically relevant problems.
Bibliographic Notes
This chapter summarizes research on relational reinforcement learning that has previously been published elsewhere. Relational reinforcement learning (RRL)
was introduced by Džeroski, De Raedt and Blockeel [7] and further extended and experimentally evaluated on the blocks world by Džeroski, De Raedt, and Driessens [8]. Driessens, Ramon, and Blockeel [6] replaced the non-incremental generalization engine in RRL with the TG-tree algorithm, a relational version of the G-algorithm, yielding the RRL-TG algorithm. Driessens and Blockeel [4] applied RRL to the Digger game problem. Driessens and Džeroski [5] extended RRL-TG to take into account guidance from existing reasonable policies, either human generated or learned. They applied G-RRL to the Digger and Tetris games. Acknowledgements. The author would like to thank Hendrik Blockeel, Kurt Driessens and Luc De Raedt for the exciting and productive cooperation on the topic of relational reinforcement learning. Special thanks to Kurt Driessens for some of the figures and results included in this chapter.
References
1. Bertsekas, D.P., & Tsitsiklis, J.N. (1996). Neuro-Dynamic Programming. Belmont, MA: Athena Scientific.
2. Blockeel, H., De Raedt, L., & Ramon, J. (1998). Top-down induction of clustering trees. In Proc. 15th International Conference on Machine Learning, pages 55–63. San Francisco: Morgan Kaufmann.
3. Chapman, D., & Kaelbling, L. P. (1991). Input generalization in delayed reinforcement learning: An algorithm and performance comparisons. In Proc. 12th International Joint Conference on Artificial Intelligence, pages 726–731. San Mateo, CA: Morgan Kaufmann.
4. Driessens, K., & Blockeel, H. (2001). Learning Digger using hierarchical reinforcement learning for concurrent goals. In Proc. 5th European Workshop on Reinforcement Learning, pages 11–12. Utrecht, The Netherlands: CKI Utrecht University.
5. Driessens, K., & Džeroski, S. (2002). Integrating experimentation and guidance in relational reinforcement learning. In Proc. 19th International Conference on Machine Learning, pages 115–122. San Francisco, CA: Morgan Kaufmann.
6. Driessens, K., Ramon, J., & Blockeel, H. (2001). Speeding up relational reinforcement learning through the use of an incremental first order decision tree algorithm. In Proc. 12th European Conference on Machine Learning, pages 97–108. Berlin: Springer.
7. Džeroski, S., De Raedt, L., & Blockeel, H. (1998). Relational reinforcement learning. In Proc. 15th International Conference on Machine Learning, pages 136–143. San Francisco, CA: Morgan Kaufmann.
8. Džeroski, S., De Raedt, L., & Driessens, K. (2001). Relational reinforcement learning. Machine Learning, 43, 7–52.
9. Kazakov, D., & Kudenko, D. (2001). Machine learning and inductive logic programming for multi-agent systems. In Luck, M., Marik, V., Stepankova, O., and Trappl, R., editors, Multi-Agent Systems and Applications, pages 246–270. Berlin: Springer.
10. Lavrač, N., & Džeroski, S. (1994). Inductive Logic Programming: Techniques and Applications. New York: Ellis Horwood. Freely available at http://www-ai.ijs.si/SasoDzeroski/ILPBook/
11. Smart, W. D., & Kaelbling, L. P. (2000). Practical reinforcement learning in continuous spaces. In Proc. 17th International Conference on Machine Learning, pages 903–910. San Francisco, CA: Morgan Kaufmann.
12. Sutton, R. S. (1996). Generalization in reinforcement learning: Successful examples using sparse coarse coding. In Proc. 8th Conference on Advances in Neural Information Processing Systems, pages 1038–1044. Cambridge, MA: MIT Press.
13. Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. Cambridge, MA: MIT Press.
Author Index
Abramov, V. 110
Andras, Peter 49
Bonsma, Erwin 159
Brazier, Frances M.T. 174
Brighton, Henry 88
Brooks, Christopher H. 291
Chli, Maria 110
Correia, L. 110
Durfee, Edmund H. 291
Džeroski, Sašo 306
Gaspar, Graça 239
Gleizes, Marie-Pierre 141
Goossenaerts, J. 110
Graça, Pedro Rafael 239
Hoile, Cefn 159
Kapetanakis, Spiros 18
Kazakov, Dimitar 187
Kirby, Simon 88
Kudenko, Daniel 18
Lacey, N. 216
Lazarus, John 49
Lee, M.H. 216
Mariano, P. 110
Marrow, Paul 159
Nunes, Luís 33
Oliveira, Eugénio 33
Ontañón, Santiago 1
Picard, Gauthier 141
Plaza, Enric 1
Ribeiro, R. 110
Roberts, Gilbert 49
Rovatsos, Michael 66
Smith, Kenny 88
Splunter, Sander van 174
Steels, Luc 125
Strens, Malcolm J.A. 18
Turner, Heather 187
Uther, William T.B. 260
Veloso, Manuela M. 260
Vidal, José M. 202
Wang, Fang 159
Weiß, Gerhard 66
Wijngaards, Niek J.E. 174
Wilde, Philippe De 110
Wolf, Marco 66