Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
5248
Christian Freksa Nora S. Newcombe Peter Gärdenfors Stefan Wölfl (Eds.)
Spatial Cognition VI Learning, Reasoning, and Talking about Space International Conference Spatial Cognition 2008 Freiburg, Germany, September 15-19, 2008 Proceedings
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Christian Freksa SFB/TR 8 Spatial Cognition Universität Bremen, Bremen, Germany E-mail:
[email protected] Nora S. Newcombe James H. Glackin Distinguished Faculty Fellow Temple University, Philadelphia, PA, USA E-mail:
[email protected] Peter Gärdenfors Lund University Cognitive Science Lund, Sweden E-mail:
[email protected] Stefan Wölfl Department of Computer Science University of Freiburg, Freiburg, Germany E-mail: woelfl@informatik.uni-freiburg.de
Library of Congress Control Number: 2008934601
CR Subject Classification (1998): H.2.8, I.2.10, H.3.1, K.4.2, B.5.1
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-540-87600-6 Springer Berlin Heidelberg New York
ISBN-13 978-3-540-87600-7 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2008 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 12519798 06/3180 543210
Preface
This is the sixth volume in a series of books dedicated to basic research in spatial cognition. Spatial cognition research investigates relations between the physical spatial world, on the one hand, and the mental world of humans, animals, and artificial agents, on the other hand. Cognitive agents – natural or artificial – make use of spatial and temporal information about their environment and about their relation to the environment to move around, to behave intelligently, and to make adaptive decisions in the pursuit of their goals. More specifically, cognitive agents process various kinds of spatial knowledge for learning, reasoning, and talking about space.

From a cognitive point of view, a central question is how our brains represent and process spatial information. When designing spatial representation systems, usability will be increased if the external and internal forms of representation are aligned as much as possible. A particularly interesting feature is that many of the internal representations of the meanings of words seem to have a spatial structure. This also holds when we are not talking about space as such. The spatiality of natural semantics will impose further requirements on the design of information systems. An elementary example is that “more” of something is often imagined as “higher” on a vertical dimension: consequently, a graphical information system that associates “more” with “down” will easily be misunderstood. Another example concerns similarity relations: features that are judged to be similar in meaning are best represented as spatially close in a graphical information system.

In addition to the question of how this information is represented and used – which was the focus of the previous Spatial Cognition volumes – an important question is whether spatial abilities are innate (“hard-wired”) or whether these abilities can be learned and trained. The hypothesis that spatial cognition is malleable, and hence that spatial learning can be fostered by effective technology and education, is based on recent evidence from multiple sources. Developmental research now indicates that cognitive growth is not simply the unfolding of a maturational program but instead involves considerable learning; new neuroscience research indicates substantial developmental plasticity; and cognitive and educational research has shown significant effects of experience on spatial skill. Because an informed citizen in the 21st century must be fluent at processing spatial abstractions including graphs, diagrams, and other visualizations, research that reveals how to increase the level of spatial functioning in the population is vital. In addition, such research could lead to the reduction of gender and socioeconomic status differences in spatial functioning and thus have an important impact on social equity. We need to understand spatial learning and to use this knowledge to develop programs and technologies that will support the
capability of all children and adolescents to develop the skills required to compete in an increasingly complex world. To answer these questions, we need to understand structures and mechanisms of abstraction and we must develop and test models that instantiate our insights into the cognitive mechanisms studied.

Today, spatial cognition is an established research area that investigates a multitude of phenomena in a variety of domains on many different levels of abstraction involving a palette of disciplines with their specific methodologies. One of today’s challenges is to connect and relate these different research areas. In pursuit of this goal, the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition (Bremen and Freiburg) and the Spatial Intelligence and Learning Center (Philadelphia and Chicago) co-organized Spatial Cognition 2008 in the series of the biennial international Spatial Cognition conferences. This conference brought together researchers from both centers and from other spatial cognition research labs from all over the world.

This proceedings volume contains 27 papers that were selected for oral presentation at the conference in a thorough peer-review process to which 54 papers had been submitted; each paper was reviewed and commented on by at least three Program Committee members. Many high-quality contributions could not be accepted due to space limitations in the single-track conference program. The Program Chairs invited three prominent scientists to deliver keynote lectures at the Spatial Cognition 2008 conference: Heinrich H. Bülthoff spoke on “Virtual Reality as a Valuable Research Tool for Investigating Different Aspects of Spatial Cognition”, Laura Carlson’s talk was about “On the ‘Whats’ and ‘Hows’ of ‘Where’: The Role of Salience in Spatial Descriptions”, and Dedre Gentner addressed the topic “Learning about Space”. Abstracts of the keynote presentations are also printed in this volume.

Spatial Cognition 2008 took place at Schloss Reinach near Freiburg (Germany) in September 2008. Besides the papers for oral presentation, more than 30 poster contributions were selected for presenting work in progress. The conference program also featured various tutorials, workshops, and a doctoral colloquium to promote an exchange of research experience among young scientists and knowledge transfer at an early stage of project development. Immediately before the conference, a workshop sponsored by the US National Science Foundation (NSF) was organized by the SILC consortium in cooperation with the SFB/TR 8 at the University of Freiburg. This workshop included lab visits at the Freiburg site of the SFB/TR 8.

Many people contributed to the success of the Spatial Cognition 2008 conference. First of all, we thank the authors for preparing excellent contributions. This volume presents contributions by 61 authors on a large spectrum of interdisciplinary work on descriptions of space, on spatial mental models and maps, on spatio-temporal representation and reasoning, on route directions, wayfinding in natural and virtual environments, and spatial behavior, and on robot mapping and piloting. Our special thanks go to the members of the Program Committee for carefully reviewing and commenting on these contributions. Thorough reviews by peers are one of the most important sources of feedback to the authors
that connects them to still-unknown territory and that helps them to improve their work and to secure a high-quality scientific publication. We thank Thomas F. Shipley for organizing, and Kenneth D. Forbus, Alexander Klippel, Marco Ragni, and Niels Krabisch for offering tutorials. For organizing workshops we owe thanks to Kenny Coventry and Jan M. Wiener as well as Alexander Klippel, Stephen Hirtle, Marco Ragni, Holger Schultheis, Thomas Barkowsky, Ronan O’Ceallaigh, and Wolfgang Stürzl. Further thanks go to Christoph Hölscher for organizing the poster session, and Sven Bertel and Marco Ragni, who were responsible for organizing the doctoral colloquium and for allocating travel grants to PhD students.

We thank the members of our support staff, namely, Ingrid Schulz, Dagmar Sonntag, Roswitha Hilden, Susanne Bourjaillat, and Ulrich Jakob for professionally arranging many details. Special thanks go to Thomas Barkowsky, Eva Räthe, Lutz Frommberger, and Matthias Westphal for the close cooperation on both sites of the SFB/TR 8. We thank Wolfgang Bay and the SICK AG for the generous sponsorship of this conference and the continuous support of scientific activities in and around Freiburg. We thank Daniel Schober and the ESRI Geoinformatik GmbH for sponsoring the travel grants to PhD students participating in the doctoral colloquium. We thank the Deutsche Forschungsgemeinschaft and the National Science Foundation and their program directors Bettina Zirpel, Gerit Sonntag, and Soo-Siang Lim for continued support of our research and for encouraging and enhancing our international research cooperation.

For the review process and for the preparation of the conference proceedings we used the EasyChair conference management system, which we found convenient to use. Finally, we thank Alfred Hofmann and his staff at Springer for their continuing support of our book series as well as for sponsoring the Spatial Cognition 2008 Best Paper Award.

September 2008
Christian Freksa
Nora Newcombe
Peter Gärdenfors
Stefan Wölfl
Conference Organization
Program Chairs
Christian Freksa
Nora S. Newcombe
Peter Gärdenfors
Local Organization
Stefan Wölfl
Tutorial Chair
Thomas F. Shipley

Poster Session Chair
Christoph Hölscher

Workshop Chairs
Kenny Coventry
Jan M. Wiener

Doctoral Colloquium Chairs
Sven Bertel
Marco Ragni
Program Committee Pragya Agarwal Marios Avraamides Christian Balkenius Thomas Barkowsky John Bateman Brandon Bennett Michela Bertolotto Stefano Borgo Melissa Bowerman Angela Brunstein Wolfram Burgard Lily Chao Christophe Claramunt Eliseo Clementini Anthony Cohn Leila De Floriani
Maureen Donnelly Matt Duckham Russell Epstein Ron Ferguson Ken Forbus Antony Galton Susan Goldin-Meadow Gabriela Goldschmidt Klaus Gramann Christopher Habel Mary Hegarty Stephen Hirtle Christoph Hölscher Petra Jansen Gabriele Janzen Alexander Klippel
Markus Knauff Stefan Kopp Maria Kozhevnikov Bernd Krieg-Brückner Antonio Krüger Benjamin Kuipers Yohei Kurata Gerhard Lakemeyer Longin Jan Latecki Hanspeter Mallot Mark May Timothy P. McNamara Tobias Meilinger Daniel R. Montello Stefan Münzer Lynn Nadel Bernhard Nebel Marta Olivetti Belardinelli Dimitris Papadias Eric Pederson Ian Pratt-Hartmann
Additional Reviewers Daniel Beck Kirsten Bergmann Roberta Ferrario Alexander Ferrein Stefan Schiffer
Martin Raubal Terry Regier Kai-Florian Richter M. Andrea Rodríguez Ute Schmid Amy Shelton Thomas F. Shipley Jeanne Sholl Barry Smith Kathleen Stewart Hornsby Holly Taylor Barbara Tversky Florian Twaroch David Uttal Constanze Vorwerg Stefan Wölfl Thomas Wolbers Diedrich Wolter Nico Van de Weghe Wai Yeap
Related Book Publications
1. Winter, S., Duckham, M., Kulik, L., Kuipers, B. (eds.): COSIT 2007. LNCS, vol. 4736. Springer, Heidelberg (2007)
2. Fonseca, F., Rodríguez, M.A., Levashkin, S. (eds.): GeoS 2007. LNCS, vol. 4853. Springer, Heidelberg (2007)
3. Barkowsky, T., Knauff, M., Ligozat, G., Montello, D.R. (eds.): Spatial Cognition 2007. LNCS (LNAI), vol. 4387. Springer, Heidelberg (2007)
4. Barker-Plummer, D., Cox, R., Swoboda, N. (eds.): Diagrams 2006. LNCS (LNAI), vol. 4045. Springer, Heidelberg (2006)
5. Raubal, M., Miller, H.J., Frank, A.U., Goodchild, M.F. (eds.): GIScience 2006. LNCS, vol. 4197. Springer, Heidelberg (2006)
6. Cohn, A.G., Mark, D.M. (eds.): COSIT 2005. LNCS, vol. 3693. Springer, Heidelberg (2005)
7. Rodríguez, M.A., Cruz, I., Levashkin, S., Egenhofer, M.J. (eds.): GeoS 2005. LNCS, vol. 3799. Springer, Heidelberg (2005)
8. Meng, L., Zipf, A., Reichenbacher, T. (eds.): Map-based mobile services – Theories, methods and implementations. Springer, Berlin (2005)
9. Freksa, C., Knauff, M., Krieg-Brückner, B., Nebel, B., Barkowsky, T. (eds.): Spatial Cognition IV. LNCS (LNAI), vol. 3343. Springer, Heidelberg (2005)
10. Blackwell, A.F., Marriott, K., Shimojima, A. (eds.): Diagrams 2004. LNCS (LNAI), vol. 2980. Springer, Heidelberg (2004)
11. Egenhofer, M.J., Freksa, C., Miller, H.J. (eds.): GIScience 2004. LNCS, vol. 3234. Springer, Heidelberg (2004)
12. Gero, J.S., Tversky, B., Knight, T. (eds.): Visual and spatial reasoning in design III. Key Centre of Design Computing and Cognition, University of Sydney (2004)
13. Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.): Spatial Cognition III. LNCS (LNAI), vol. 2685. Springer, Heidelberg (2003)
14. Kuhn, W., Worboys, M.F., Timpf, S. (eds.): COSIT 2003. LNCS, vol. 2825. Springer, Heidelberg (2003)
15. Hegarty, M., Meyer, B., Narayanan, N.H. (eds.): Diagrams 2002. LNCS (LNAI), vol. 2317. Springer, Heidelberg (2002)
16. Egenhofer, M.J., Mark, D.M. (eds.): GIScience 2002. LNCS, vol. 2478. Springer, Heidelberg (2002)
17. Barkowsky, T.: Mental Representation and Processing of Geographic Knowledge. LNCS (LNAI), vol. 2541. Springer, Heidelberg (2002)
18. Renz, J.: Qualitative Spatial Reasoning with Topological Information. LNCS (LNAI), vol. 2293. Springer, Heidelberg (2002)
19. Coventry, K., Olivier, P. (eds.): Spatial language: Cognitive and computational perspectives. Kluwer, Dordrecht (2002)
20. Montello, D.R. (ed.): COSIT 2001. LNCS, vol. 2205. Springer, Heidelberg (2001)
21. Gero, J.S., Tversky, B., Purcell, T. (eds.): Visual and spatial reasoning in design II. Key Centre of Design Computing and Cognition, University of Sydney (2001)
22. Habel, C., Brauer, W., Freksa, C., Wender, K.F. (eds.): Spatial Cognition 2000. LNCS (LNAI), vol. 1849. Springer, Heidelberg (2000)
23. Habel, C., von Stutterheim, C. (eds.): Räumliche Konzepte und sprachliche Strukturen. Niemeyer, Tübingen (2000)
24. Freksa, C., Mark, D.M. (eds.): COSIT 1999. LNCS, vol. 1661. Springer, Heidelberg (1999)
25. Gero, J.S., Tversky, B. (eds.): Visual and spatial reasoning in design. Key Centre of Design Computing and Cognition, University of Sydney (1999)
26. Habel, C., Werner, S. (eds.): Special issue on spatial reference systems. Spatial Cognition and Computation, vol. 1(4) (1999)
27. Freksa, C., Habel, C., Wender, K.F. (eds.): Spatial Cognition 1998. LNCS (LNAI), vol. 1404. Springer, Heidelberg (1998)
28. Hirtle, S.C., Frank, A.U. (eds.): COSIT 1997. LNCS, vol. 1329. Springer, Heidelberg (1997)
29. Kuhn, W., Frank, A.U. (eds.): COSIT 1995. LNCS, vol. 988. Springer, Heidelberg (1995)
Table of Contents
Invited Talks

Virtual Reality as a Valuable Research Tool for Investigating Different Aspects of Spatial Cognition (Abstract) . . . . . 1
Heinrich H. Bülthoff, Jennifer L. Campos, and Tobias Meilinger

On the “Whats” and “Hows” of “Where”: The Role of Salience in Spatial Descriptions (Abstract) . . . . . 4
Laura A. Carlson

Learning about Space (Abstract) . . . . . 7
Dedre Gentner

Spatial Orientation

Does Body Orientation Matter When Reasoning about Depicted or Described Scenes? . . . . . 8
Marios N. Avraamides and Stephanie Pantelidou

Spatial Memory and Spatial Orientation . . . . . 22
Jonathan W. Kelly and Timothy P. McNamara

Spatial Navigation

Map-Based Spatial Navigation: A Cortical Column Model for Action Planning . . . . . 39
Louis-Emmanuel Martinet, Jean-Baptiste Passot, Benjamin Fouque, Jean-Arcady Meyer, and Angelo Arleo

Efficient Wayfinding in Hierarchically Regionalized Spatial Environments . . . . . 56
Thomas Reineking, Christian Kohlhagen, and Christoph Zetzsche

Analyzing Interactions between Navigation Strategies Using a Computational Model of Action Selection . . . . . 71
Laurent Dollé, Mehdi Khamassi, Benoît Girard, Agnès Guillot, and Ricardo Chavarriaga

A Minimalistic Model of Visually Guided Obstacle Avoidance and Path Selection Behavior . . . . . 87
Lorenz Gerstmayr, Hanspeter A. Mallot, and Jan M. Wiener

Spatial Learning

Route Learning Strategies in a Virtual Cluttered Environment . . . . . 104
Rebecca Hurlebaus, Kai Basten, Hanspeter A. Mallot, and Jan M. Wiener

Learning with Virtual Verbal Displays: Effects of Interface Fidelity on Cognitive Map Development . . . . . 121
Nicholas A. Giudice and Jerome D. Tietz

Cognitive Surveying: A Framework for Mobile Data Collection, Analysis, and Visualization of Spatial Knowledge and Navigation Practices . . . . . 138
Drew Dara-Abrams

Maps and Modalities

What Do Focus Maps Focus On? . . . . . 154
Kai-Florian Richter, Denise Peters, Gregory Kuhnmünch, and Falko Schmid

Locating Oneself on a Map in Relation to Person Qualities and Map Characteristics . . . . . 171
Lynn S. Liben, Lauren J. Myers, and Kim A. Kastens

Conflicting Cues from Vision and Touch Can Impair Spatial Task Performance: Speculations on the Role of Spatial Ability in Reconciling Frames of Reference . . . . . 188
Madeleine Keehner

Spatial Communication

Epistemic Actions in Science Education . . . . . 202
Kim A. Kastens, Lynn S. Liben, and Shruti Agrawal

An Influence Model for Reference Object Selection in Spatially Locative Phrases . . . . . 216
Michael Barclay and Antony Galton

Spatial Language

Tiered Models of Spatial Language Interpretation . . . . . 233
Robert J. Ross

Perspective Use and Perspective Shift in Spatial Dialogue . . . . . 250
Juliana Goschler, Elena Andonova, and Robert J. Ross

Natural Language Meets Spatial Calculi . . . . . 266
Joana Hois and Oliver Kutz

Automatic Classification of Containment and Support Spatial Relations in English and Dutch . . . . . 283
Kate Lockwood, Andrew Lovett, and Ken Forbus

Similarity and Abstraction

Integral vs. Separable Attributes in Spatial Similarity Assessments . . . . . 295
Konstantinos A. Nedas and Max J. Egenhofer

Spatial Abstraction: Aspectualization, Coarsening, and Conceptual Classification . . . . . 311
Lutz Frommberger and Diedrich Wolter

Concepts and Reference Frames

Representing Concepts in Time . . . . . 328
Martin Raubal

The Network of Reference Frames Theory: A Synthesis of Graphs and Cognitive Maps . . . . . 344
Tobias Meilinger

Spatially Constrained Grammars for Mobile Intention Recognition . . . . . 361
Peter Kiefer

Modeling Cross-Cultural Performance on the Visual Oddity Task . . . . . 378
Andrew Lovett, Kate Lockwood, and Kenneth Forbus

Spatial Modeling and Spatial Reasoning

Modelling Scenes Using the Activity within Them . . . . . 394
Hannah M. Dee, Roberto Fraile, David C. Hogg, and Anthony G. Cohn

Pareto-Optimality of Cognitively Preferred Polygonal Hulls for Dot Patterns . . . . . 409
Antony Galton

Qualitative Reasoning about Convex Relations . . . . . 426
Dominik Lücke, Till Mossakowski, and Diedrich Wolter

Author Index . . . . . 441
Virtual Reality as a Valuable Research Tool for Investigating Different Aspects of Spatial Cognition (Abstract)

Heinrich H. Bülthoff, Jennifer L. Campos, and Tobias Meilinger

Max-Planck-Institute for Biological Cybernetics, Spemannstr. 38, 72076 Tübingen, Germany
{heinrich.buelthoff,jenny.campos,tobias.meilinger}@tuebingen.mpg.de
The interdisciplinary research field of spatial cognition has benefited greatly from the use of advanced Virtual Reality (VR) technologies. Such tools have provided the ability to explicitly control specific experimental conditions, manipulate variables not possible in the real world, and provide a convincing, multimodal experience. Here we will first describe several of the VR facilities at the Max Planck Institute (MPI) for Biological Cybernetics that have been developed to optimize scientific investigations related to multi-modal self-motion perception and spatial cognition. Subsequently, we will present some recent empirical work contributing to these research areas.

While in the past, low-quality visual simulations of space were the most prominent types of VR (i.e., simple desktop displays), more advanced visualization systems are becoming increasingly desirable. At the MPI we have utilized a variety of visualization tools ranging from immersive head-mounted displays (HMD), to large field-of-view, curved projection systems, to a high-resolution tiled display. There is also an increasing need for high-quality, adaptable, large-scale, simulated environments. At the MPI we have created a virtual replica of downtown Tübingen throughout which observers can navigate. In collaboration with ETH Zurich, who have developed “CityEngine”, a virtual city builder, we are now able to rapidly create virtual renditions of existing cities or customized environmental layouts.

In order to naturally interact within such virtual environments (VEs), it is also increasingly important to be able to physically move within these spaces. Under most natural conditions involving self-motion, body-based information is inherently present. Therefore, the recent developments of several sophisticated self-motion interfaces have allowed us to present and evaluate natural, multi-sensory navigational experiences in unprecedented ways. For instance, within a large (12 m × 12 m), free-walking space, a high-precision optical tracking system (paired with an HMD) updates observers’ positions within a VE as they naturally navigate by walking or are passively transported (i.e., via a robotic wheelchair). Further, the MPI Motion Simulator is a 6-degree-of-freedom anthropomorphic robotic arm that can translate and rotate an observer
in any number of ways (both open- and closed-loop). Finally, a new, state-of-the-art omni-directional treadmill now offers observers the opportunity to experience unrestricted, limitless walking throughout large-scale VEs.

When moving through space, both dynamic visual information (i.e., optic flow) and body-based information (i.e., proprioceptive/efference copy and vestibular) jointly specify the magnitude of a distance travelled. Relatively little is currently known about how these cues are integrated when simultaneously present. In a series of experiments, we investigated participants’ ability to estimate travelled distances under a variety of sensory/motor conditions. Visual information presented via an HMD was combined with body-based cues that were provided either by walking in a fully-tracked, free-walking space, by walking on a large linear treadmill, or by being passively transported in a robotic wheelchair. Visually-specified distances were either congruent or incongruent with distances specified by body-based cues. Responses reflect a combined effect of both visual and body-based information, with an overall higher weighting of body-based cues during walking and a relatively equal weighting of inertial and visual cues during passive movement.

The characteristics of self-motion perception have also been investigated using a novel continuous pointing method. This task simply requires participants to view a target and point continuously towards the target as they move past it along a straight, forward trajectory. By using arm angle, we are able to measure perceived location and, hence, perceived self-velocity during the entire trajectory. We have compared the natural characteristics of continuous pointing during sighted walking with those during reduced sensory/motor cue conditions, including blind walking, passive transport, and imagined walking. The specific characteristics of self-motion perception during passive transport have also been further evaluated through the use of a robotic wheelchair and the MPI Motion Simulator.

Additional research programs have focused on understanding particular aspects of spatial memory when navigating through visually rich, complex environments. In one study that investigated route memory, participants navigated through virtual Tübingen while it was projected onto a 220° field-of-view, curved screen display. Participants learned two routes while they were simultaneously required to perform a visual, spatial, or verbal secondary task. In the subsequent wayfinding phase the participants were asked to locate and “virtually travel” along the two routes again (via joystick manipulation). During this wayfinding phase a number of dependent measures were recorded. The results indicate that encoding wayfinding knowledge interfered with the verbal and spatial secondary tasks. These interferences were even stronger than the interference of wayfinding knowledge with the visual secondary task. These findings are consistent with a dual-coding approach to wayfinding knowledge.

This dual-coding approach was further examined in our fully-tracked, free-walking space. In this case, participants walked a route through a virtual environment and again were required to remember the route. For 50% of the intersections they encountered, they were asked to associate the intersection with an arbitrary name they heard via headphones (e.g., “Goethe place”).
For the other 50% of the intersections, they were asked to remember the intersection by its local environmental features and not to associate
it with a name. In a subsequent route memory test participants were “beamed” to an intersection and had to indicate in which direction they had originally traveled the route. Participants performed better at intersections without a name than they did for intersections associated with an arbitrary name. When the experiment was repeated with meaningful names that accurately represented the environmental features (e.g., “Hay place”), this pattern reversed (i.e., naming a place no longer led to worse performance). These results indicate that the benefits of language do not come for free.
References

1. Berger, D.R., Terzibas, C., Beykirch, K., Bülthoff, H.H.: The role of visual cues and whole-body rotations in helicopter hovering control. In: Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit (AIAA 2007), Reston, VA, USA. American Institute of Aeronautics and Astronautics (2007)
2. Bülthoff, H.H., van Veen, H.A.H.C.: Vision and action in virtual environments: Modern psychophysics in spatial cognition research. In: Jenkin, M., Harris, M.L. (eds.) Vision and Attention, pp. 233–252. Springer, Heidelberg (2000)
3. Campos, J.L., Butler, J.S., Mohler, B.J., Bülthoff, H.H.: The contributions of visual flow and locomotor cues to walked distance estimation in a virtual environment. In: Proceedings of the 4th Symposium on Applied Perception in Graphics and Visualization, p. 146. ACM Press, New York (2007)
4. Meilinger, T., Knauff, M., Bülthoff, H.H.: Working memory in wayfinding – a dual task experiment in a virtual city. Cognitive Science 32, 755–770 (2008)
5. Mohler, B.J., Campos, J.L., Weyel, M., Bülthoff, H.H.: Gait parameters while walking in a head-mounted display virtual environment and the real world. In: Proceedings of Eurographics 2007, Eurographics Association, pp. 85–88 (2007)
6. Teufel, H.J., Nusseck, H.-G., Beykirch, K.A., Butler, J.S., Kerger, M., Bülthoff, H.H.: MPI motion simulator: Development and analysis of a novel motion simulator. In: Proceedings of the AIAA Modeling and Simulation Technologies Conference and Exhibit (AIAA 2007), Reston, VA, USA. American Institute of Aeronautics and Astronautics (2007)
On the “Whats” and “Hows” of “Where”: The Role of Salience in Spatial Descriptions (Abstract) Laura A. Carlson Department of Psychology, University of Notre Dame, USA
According to Clark [1] language is a joint activity between speaker and listener, undertaken to accomplish a shared goal. In the case of spatial descriptions, one such goal is for a speaker to assist a listener in finding a sought-for object. For example, imagine misplacing your keys on a cluttered desktop, and asking your friend if s/he knows where they are. In response, there are a variety of spatial descriptions that your friend can select that vary in complexity, ranging from a simple deictic expression such as “there” (and typically accompanied by a pointing gesture), to a much more complicated description such as “it’s on the desk, under the shelf, to the left of the book and in front of the phone.” Between these two extremes are descriptions of the form “The keys are by the book”, consisting of three parts: the located object that is being sought (i.e., the keys); the reference object from which the location of the located object is specified (i.e., the book); and the spatial term that conveys the spatial relation between these two objects (i.e., by). For inquiries of this type (“where are my keys?”), the located object is pre-specified, but the speaker needs to select an appropriate spatial term and an appropriate reference object. My research focuses on the representations and processes by which a speaker selects these spatial terms and reference objects, and the representations and processes by which a listener comprehends these ensuing descriptions.

The “Whats”

With respect to selection, one important issue is understanding why particular terms and particular reference objects are chosen. For a given real-world scene, there are many possible objects that stand in many possible relations with respect to a given located object. On what basis might a speaker make his/her selection? Several researchers argue that reference objects are selected on the basis of properties that make them salient relative to other objects [2,3,4]. Given the purpose of the description as specifying the location of the sought-for object, it would make sense that the reference object be easy to find among the other objects in the display. However, there are many different properties that could define salience, including spatial features, perceptual properties, and conceptual properties.

With respect to spatial features, certain spatial relations are preferred over others. For example, objects that stand in front/back relations to a given located object are preferred to objects that stand in left/right relations [5]. This is consistent
with well-known differences in the ease of processing different terms [6,7]. In addition, distance may play an important role, with objects that are closer to the located object preferred to those that are more distant [8]. Thus, all else being equal, a reference object may be selected because it is closest to the located object and/or stands in a preferred relation with respect to the located object.

With respect to perceptual features, Talmy [4] identified size and movability as key dimensions, with larger and immovable objects preferred as reference objects. In addition, there may be a preference to select more geometrically complex objects as reference objects. Blocher and Stopp [9] argued for color, shape and size as critical salient dimensions. Finally, de Vega et al. [2] observed preferences for reference objects that are inanimate, more solid, and whole rather than parts of objects.

Finally, with respect to conceptual features, reference objects are considered “given” objects, less recently mentioned in the discourse [4]. In addition, there may be a bias to select reference objects that are functionally related to the located object [10,11].

In this talk I will present research from my lab in which we systematically manipulate spatial, conceptual and perceptual features, and ask which dimensions are influential in reference object selection, and how priorities are assigned across the spatial, perceptual and conceptual dimensions. Both production and comprehension measures will be discussed. This work will provide a better sense of how salience is being defined with respect to selecting a reference object for a spatial description.

The “Hows”

Implicit in the argument that the salience of an object is computed across these dimensions is the idea that such computation requires that multiple objects are evaluated and compared among each other along these dimensions. That is, to say an object stands out relative to other objects (for example, a red object among black objects) requires that the color of all objects (black and red) be computed and compared, and that on the basis of this comparison, the unique object (in this case, red) stands out (among black). Put another way, an object can only stand out relative to a contrast set [12]. Research in my lab has examined how properties of various objects are evaluated and compared during production and comprehension, and in particular, the point in processing at which properties of multiple objects exert their influence. For example, we have shown that the presence, placement and properties of surrounding objects have a significant impact during comprehension and production [13,11]. I will discuss these findings in detail, and will present electrophysiological data that illustrate, within the time course of processing, the point at which these features have an impact.

The Main Points

The main points of the talk will be an identification of the features and dimensions that are relevant for selecting a reference object, and an examination of how
and when these features and dimensions have an impact on processing spatial descriptions. Implications for other tasks and other types of spatial descriptions will be discussed.
References

1. Clark, H.H.: Using language. Cambridge University Press, Cambridge (1996)
2. de Vega, M., Rodrigo, M.J., Ato, M., Dehn, D.M., Barquero, B.: How nouns and prepositions fit together: An exploration of the semantics of locative sentences. Discourse Processes 34, 117–143 (2002)
3. Miller, G.A., Johnson-Laird, P.N.: Language and perception. Harvard University Press, Cambridge (1976)
4. Talmy, L.: How language structures space. In: Pick, H.L., Acredolo, L.P. (eds.) Spatial orientation: Theory, research, and application, pp. 225–282. Plenum, New York (1983)
5. Craton, L.G., Elicker, J., Plumert, J.M., Pick Jr., H.L.: Children’s use of frames of reference in communication of spatial location. Child Development 61, 1528–1543 (1990)
6. Clark, H.H.: Space, time, semantics, and the child. In: Moore, T.E. (ed.) Cognitive development and the acquisition of language. Academic Press, New York (1973)
7. Fillmore, C.J.: Santa Cruz lectures on deixis. Indiana University Linguistics Club, Bloomington (1971)
8. Hund, A.M., Plumert, J.M.: What counts as by? Young children’s use of relative distance to judge nearbyness. Developmental Psychology 43, 121–133 (2007)
9. Blocher, A., Stopp, E.: Time-dependent generation of minimal sets of spatial descriptions. In: Olivier, P., Gapp, K.P. (eds.) Representation and processing of spatial relations, pp. 57–72. Erlbaum, Mahwah (1998)
10. Carlson-Radvansky, L.A., Tang, Z.: Functional influences on orienting a reference frame. Memory & Cognition 28, 812–820 (2000)
11. Carlson, L.A., Hill, P.L.: Processing the presence, placement and properties of a distractor in spatial language tasks. Memory & Cognition 36, 240–255 (2008)
12. Olson, D.: Language and thought: Aspects of a cognitive theory of semantics. Psychological Review 77, 143–184 (1970)
13. Carlson, L.A., Logan, G.D.: Using spatial terms to select an object. Memory & Cognition 29, 883–892 (2001)
Learning about Space (Abstract) Dedre Gentner Department of Psychology, Northwestern University, USA
Spatial cognition is important in human learning, both in itself and as a major substrate of learning in other domains. Although some aspects of spatial cognition may be innate, it is clear that many important spatial concepts must be learned from experience. For example, Dutch and German use three spatial prepositions—op, aan, and om in Dutch—to describe containment and support relations, whereas English requires just one preposition—on—to span this range. How do children learn these different ways of partitioning the world of spatial relations? More generally, how do people come to understand powerful spatial abstractions like parallel, convergent, proportionate, and continuous? I suggest that two powerful contributors to spatial learning are analogical mapping— structural alignment and abstraction—and language, especially relational language, which both invites and consolidates the insights that arise from analogical processes. I will present evidence that (1) analogical processes are instrumental in learning new spatial relational concepts; and, further, that (2) spatial relational language fosters analogical processing. I suggest that mutual bootstrapping between structure-mapping processes and relational language is a major contributor to spatial learning in humans.
Does Body Orientation Matter When Reasoning about Depicted or Described Scenes?* Marios N. Avraamides and Stephanie Pantelidou Department of Psychology, University of Cyprus P.O. Box 20537, 1678 Nicosia, Cyprus
[email protected],
[email protected]
Abstract. Two experiments were conducted to assess whether the orientation of the body at the time of test affects the efficiency with which people reason about spatial relations that are encoded in memory through symbolic media. Experiment 1 used depicted spatial layouts while Experiment 2 used described environments. In contrast to previous studies with directly-experienced spatial layouts, the present experiments revealed no sensorimotor influences on performance. Differences in reasoning about immediate and non-immediate environments are thus discussed. Furthermore, the same patterns of findings (i.e., normal alignment effects) were observed in the two experiments supporting the idea of functional equivalence of spatial representations derived from different modalities. Keywords: body orientation, sensorimotor interference, perspective-taking, spatial reasoning.
1 Introduction

While moving around in the environment people are able to keep track of how egocentric spatial relations (i.e., self-to-object directions and distances) change as a result of their movement [1-4]. To try an example, choose one object from your immediate surroundings (e.g., a chair), and point to it. Then, close your eyes and take a few steps forward and/or rotate yourself by some angle. As soon as you finish moving, but before opening your eyes, point to the object again. It is very likely that you pointed very accurately and without taking any time to contemplate where the object might be as a result of your movement. This task, which humans can carry out with such remarkable efficiency and speed, entails rather complex mathematical computations (a minimal worked sketch of such a computation is given below). It requires that the egocentric location of an object is initially encoded and then continuously updated while moving in the environment. The mechanism that allows people to update egocentric relations and stay oriented within their immediate surroundings is commonly known as spatial updating. Several studies have suggested that spatial updating takes place automatically with physical movement because such movement provides the input that is necessary for
* The presented experiments were conducted as part of an undergraduate thesis by Stephanie Pantelidou.
updating [2, 4]. In the case of non-visual locomotion this input consists of kinesthetic cues, vestibular feedback, and copies of efferent commands. The importance of physical movement is corroborated by empirical findings showing that participants point to a location equally fast and accurately from an initial standpoint and a novel standpoint they adopt by means of physical movement (as in the example above). In contrast, when the novel standpoint is adopted by merely imagining the movement, participants are faster and more accurate to respond from their initial than their novel (imagined) standpoint [5]. This is particularly the case when an imagined rotation is needed to adopt the novel standpoint.

The traditional account for spatial updating [4, 6] posits that spatial relations are encoded and updated on the basis of an egocentric reference frame (i.e., a reference frame that is centered on one's body). Because egocentric relations are continuously updated when moving, reasoning from one's physical perspective is privileged as it can be carried out on the basis of relations that are directly represented in memory. Instead, reasoning from imagined perspectives is deliberate and effortful as it entails performing “off-line” mental transformations to compute the correct response. Recently, May proposed the sensorimotor interference account which places the exact locus of difficulty for responding from imagined perspectives at the presence of conflicts between automatically-activated sensorimotor codes that specify locations relative to the physical perspective and cognitive codes that define locations relative to the imagined perspective [7, 8]. Based on this account, while responding from an actual physical perspective is facilitated by compatible sensorimotor codes, in order to respond from an imagined perspective, the incompatible sensorimotor codes must be inhibited while an alternative response is computed. The presence of conflicts reduces accuracy and increases reaction time when reasoning from imagined perspectives. In a series of elegant experiments, May provided support for the facilitatory and interfering effects of sensorimotor codes [7].

Recently, Kelly, Avraamides, and Loomis [9] dissociated the influence of sensorimotor interference in spatial reasoning from effects caused by the organizational structure of spatial memory (see also [10]). In one condition of the study participants initially examined a spatial layout of 9 objects from a fixed standpoint and perspective. Then, they were asked to rotate 90° to their left or right to adopt a novel perspective. From this perspective participants carried out a series of localization trials that involved pointing to object locations from various imagined perspectives. This paradigm allowed dissociating the orientation of the testing perspective from that of the perspective adopted during learning. This dissociation is deemed necessary in light of evidence from several studies showing that spatial memories are stored with a preferred direction that is very often determined by the learning perspective [11]. Results revealed that responding from imagined perspectives that coincided with either the learning or the testing perspective was more efficient compared to responding from other perspectives.
A similar result was obtained in the earlier study of Mou, McNamara, Valiquette, and Rump [10], which suggested that independent effects attributed to the orientation of the body of the observer at test and to the preferred storage orientation of spatial memory can be obtained in spatial cognition experiments. Kelly et al. have termed the former effect the sensorimotor alignment effect and the latter the memory-encoding alignment effect.
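To make the “complex mathematical computations” invoked in the opening example more concrete, the sketch below shows one conventional two-dimensional formalization of an egocentric update after a forward step and a body rotation. It is offered purely as an illustration under our own assumptions; it is not a model taken from the studies cited above, and the function and variable names are hypothetical.

```python
import math

def update_egocentric(obj_xy, step_forward, turn_left_deg):
    """Egocentric (body-centred) position of an object after the observer
    walks step_forward units straight ahead and then turns turn_left_deg
    degrees to the left.  obj_xy = (x, y): x is units to the observer's
    right, y is units straight ahead, before the movement."""
    x, y = obj_xy
    y -= step_forward                     # walking forward shifts objects backward in body space
    theta = math.radians(turn_left_deg)   # turning left rotates the scene to the observer's right
    x_new = x * math.cos(theta) + y * math.sin(theta)
    y_new = -x * math.sin(theta) + y * math.cos(theta)
    return x_new, y_new

# A chair 1 unit to the right and 2 units ahead; walk 1 unit forward, then turn 90° left:
# the chair ends up roughly 1 unit to the right and 1 unit behind the observer.
print(update_egocentric((1.0, 2.0), step_forward=1.0, turn_left_deg=90.0))  # approx. (1.0, -1.0)
```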
In order to investigate the boundary conditions of sensorimotor facilitation/interference, Kelly et al. included an experimental condition in which participants performed testing trials after having moved to an adjacent room. Results from this condition revealed that when participants reasoned about relations that were not immediately present, no sensorimotor interference/facilitation was exerted on performance. Only a memory-encoding alignment effect was obtained in this condition.

The study by Kelly et al. provided evidence that the orientation of one's body when reasoning about space influences performance only when immediate spatial relations are retrieved. Presumably this occurs because egocentric relations are maintained in a transient sensorimotor representation that functions to encode and automatically update egocentric directions and distances to objects in one's immediate surroundings [12, 13]. When reasoning about remote environments such a representation is of little, if any, use. In this case, a more enduring, perhaps allocentric, representation would more suitably provide the information needed to compute the spatial relations as needed (see [14] for a comprehensive review of theories of memory that provide for multiple encoding systems). If this is true, then the same pattern of findings (i.e., presence of a memory-encoding alignment effect but no sensorimotor alignment effect) should be expected when people reason about spatial relations included in any remote environment regardless of how it is encoded.

Although we very frequently in our daily lives reason about environments that we have previously experienced directly, in many cases we process spatial relations that have been committed to memory through symbolic media such as pictures, movies, language, etc. (e.g., planning a route after having studied a map). While numerous studies have been carried out to examine how people reason about depicted or described environments, most studies have either focused on examining effects caused by the misalignment between medium and actual space [15] or have confounded the orientations of the learning and testing perspectives [16]. As a result, it is not yet known whether the orientation of the observer's body mediates spatial reasoning for environments encoded through symbolic media.

The goal of the present study is to assess whether the orientation of the body influences performance when reasoning about spatial relations contained in a depicted (Experiment 1) or a described (Experiment 2) remote layout. We expect that the use of remote environments will give rise to a pattern of findings similar to those obtained in conditions in which participants are tested after being removed from the learning environment. If such a result is obtained, it would further highlight the fundamental difference between “online” reasoning about immediate environments and “off-line” reasoning about remote environments.

A secondary goal of the study is to compare spatial reasoning for depicted and linguistic spatial scenes in order to assess the functional equivalence of spatial layouts that are derived from different modalities. This is a question that has accumulated increased theoretical interest in recent years, presumably because it bears important implications for modern tools and applications that rely on sensory substitution, as in the case of navigational systems for the blind. Most previous studies tested functional equivalence using environments that were immediate to participants [17-19].
Although some indirect evidence suggests that learning an environment from a map or text engages the same parieto-frontal network in the brain [20, 21], it is important to test whether the same behavioral effects are found when reasoning about spatial relations derived from different modalities. By comparing the findings of Experiments
1 and 2 in the present study, we will be able to assess the degree of functional equivalence between scenes that are learned through pictures and language. Based on evidence from previous studies that examined the organization of spatial memories derived from maps and linguistic descriptions [22, 23], we expect that similar patterns of findings will be found in the two experiments.

For the present experiments we adopted the paradigm used by Waller, Montello, Richardson, and Hegarty [24] and previously by Presson and Hazelrigg [15]. In these studies participants first learned various 4-point paths and then made judgments of relative direction by adopting imagined perspectives within the paths. Trials could be classified as aligned (i.e., the orientation of the imagined perspective matched the physical perspective of the participant) or as contra-aligned (i.e., the imagined perspective deviated 180° from the physical perspective of the participant). The typical result when participants carry out the task without moving from the learning standpoint/perspective (Stay condition in [24]) is that performance is more efficient in aligned than contra-aligned trials. This finding is commonly referred to as an alignment effect.

Additional interesting conditions were included in the study by Waller et al. In Experiment 2, a Rotate condition was included. In this condition, participants performed the task after having physically rotated 180°. The rationale was that if the alignment effect is caused primarily by the learning orientation then a similar alignment effect to that of the Stay condition would be obtained. However, if the alignment effect is caused by the influence of the orientation of the body at the time of test, a reverse-alignment effect should be expected. Results, however, revealed no alignment effect (see also [25]). Two additional conditions, namely the Rotate-Update and the Rotate-Ignore, provided important results. In the Rotate-Update condition participants were instructed to physically rotate 180° in place and imagine that the spatial layout was behind them (i.e., they updated their position relative to the learned layout). In the Rotate-Ignore condition participants also rotated by 180° but were asked to imagine that the learned layout had rotated along with them. Results revealed a normal alignment effect in the Rotate-Ignore condition but a reverse-alignment effect in the Rotate-Update condition. Overall, these findings suggest that the orientation of the body is important when reasoning about immediate environments.

In the present experiments we adopted the rationale of Waller et al. to examine the presence of normal vs. reverse alignment effects in Stay and Rotate conditions. However, in contrast to Waller et al., the paths that we used were not directly experienced by participants. Instead, they were presented on a computer monitor as either pictures (Experiment 1) or text route descriptions (Experiment 2). If the orientation of the body of the participant at the time of test influences performance, a normal alignment effect should be found in Stay conditions and a reverse alignment effect should be obtained in Rotate conditions. However, if the learning perspective dominates performance then a normal alignment effect should be expected in both Stay and Rotate conditions. Finally, a third possibility is that both the learning and physical perspectives influence performance, as shown by Kelly et al. for immediate environments.
In that case, if the two effects are of equal magnitude then no alignment effect should be expected in Rotate conditions as the two effects would cancel each other out. However, without making any assumptions about the magnitude of the two effects, we should at least expect a reduced alignment effect in Rotate conditions, if indeed both learning and physical perspectives influence reasoning.
2 Experiment 1

In Experiment 1 participants encoded paths that were depicted on a computer screen and then carried out judgments of relative direction (JRDs). A Stay condition and a Rotate condition (in which neither update nor ignore instructions were given) were included. Based on previous findings documenting that the orientation of one's body does not typically influence spatial reasoning about non-immediate environments, we predict that a normal alignment effect would be present in both the Stay and Rotate conditions. We also expect that overall performance will be equal in the Stay and Rotate conditions.

2.1 Method

Participants. Twenty-two students from an introductory psychology course at the University of Cyprus participated in the experiment in exchange for course credit. Twelve were assigned to the Stay condition and 10 to the Rotate condition.

Design. A 2 (observer position: Stay vs. Rotate) x 3 (imagined perspective: aligned 0°, misaligned 90°, contra-aligned 180°) mixed factorial design was used. Observer position was manipulated between subjects while imagined perspective varied within subjects.

Materials and Apparatus. Two 19" LCD monitors attached to a computer running the Vizard software (from WorldViz, Santa Barbara, CA) were used to display stimuli. The monitors were placed facing each other and participants sat on a swivel chair placed in between the two monitors. Four paths were created as models with Source SDK (from Valve Corporation). Oblique screenshots of these models constituted the spatial layouts that participants learned. Each path comprised 4 segments of equal length that connected 5 numbered location points (Figure 1). Pointing responses were made using a joystick, with the angle of deflection and latency of pointing being recorded by the computer on each trial.

2.2 Procedure

Prior to the beginning of the experiment participants were shown example paths on paper and were instructed on how to perform JRDs. JRDs involve responding to statements of the form “imagine being at x, facing y. Point to z”, where x, y, and z are objects/landmarks from the studied layout. Before the experimental trials, participants performed various practice JRD trials using campus landmarks as targets, responding both with their arms and with the joystick. Then, participants were seated in front of one of the monitors and were asked to study the first path. They were instructed to visualize themselves moving on the path.
Fig. 1. Example of a path used in Experiment 1
The initial direction of imagined movement was to the left for two paths and to the right for the other two (e.g., Figure 1). This was done to avoid confounding the initial movement direction with either the orientation of the body or the orientation opposite to it. Participants were given unlimited time to memorize the path and then proceeded to perform the experimental trials. Each trial instructed them to imagine adopting a perspective within the memorized path (e.g., "Imagine standing at 1 facing 2") and to point from it with the joystick toward a different position in the path (e.g., "Point to 3"). Participants in the Stay condition performed the trials on the same monitor on which they had previously viewed the path. Those in the Rotate condition were asked to rotate 180° and perform the pointing trials on the other monitor. Participants were instructed to respond as fast as possible but without sacrificing accuracy. Sixteen trials for each path were included, yielding a total of 64 trials per participant. Four imagined perspectives (i.e., aligned 0°, misaligned 90° left, misaligned 90° right, and contra-aligned 180°) were equally represented in the 64 trials. Furthermore, correct pointing responses, which could be 45°, 90°, and 135° to the left and right of the forward joystick position, were equally distributed across the four imagined perspectives. The order of trials within each path was randomized. Also, the order in which the four paths were presented to participants varied randomly.
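To illustrate how the correct response for a JRD trial follows from the geometry of a path, the sketch below computes the signed egocentric pointing angle from 2-D coordinates. The coordinates, the function name, and the sign convention are our own illustrative assumptions (the actual path layouts appear only in Figure 1), not part of the experimental software.

import math

# Illustrative coordinates (in meters) for the five numbered locations of one
# path, following the turn structure of the example route description used in
# Experiment 2; these values are assumed, not taken from the paper.
LOCATIONS = {1: (0, 0), 2: (-10, 0), 3: (-10, -10), 4: (-20, -10), 5: (-20, 0)}

def jrd_angle(stand, face, target, locations=LOCATIONS):
    """Signed egocentric angle (deg) for "Imagine standing at `stand`, facing
    `face`. Point to `target`". Positive = to the right, negative = to the left."""
    sx, sy = locations[stand]
    fx, fy = locations[face]
    tx, ty = locations[target]
    facing = math.atan2(fy - sy, fx - sx)      # heading toward the faced location
    to_target = math.atan2(ty - sy, tx - sx)   # direction from standing point to target
    diff = math.degrees(facing - to_target)    # positive when the target lies to the right
    return (diff + 180.0) % 360.0 - 180.0      # wrap into (-180, 180]

# Example trial: "Imagine standing at 1, facing 2. Point to 3."
print(jrd_angle(1, 2, 3))   # -45.0 for these coordinates, i.e., 45 deg to the left

With layouts constructed this way, every trial yields one of the 45°, 90°, or 135° left/right responses described above.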
2.3 Results
Separate analyses for pointing accuracy and for latency of correct responses were carried out. In order to classify responses as correct or incorrect, joystick deflection angles were quantized as follows. Responses between 22.5° and 67.5° from the forward position of the joystick were classified as 45° responses to the left or right, depending on the side of deflection. Similarly, responses that fell between 67.5° and 112.5° were considered 90° responses to the left or right. Finally, responses between 112.5° and 157.5° were marked as 135° responses. Initial analyses of accuracy and latency involving all four imagined perspectives revealed no differences between the 90° left and the 90° right perspectives in either the Stay or the Rotate condition. Therefore, data for these two perspectives were averaged to form a misaligned 90° condition. A 2 (observer position) x 3 (imagined perspective) mixed-model Analysis of Variance (ANOVA) was conducted for both accuracy and latency data.
Accuracy
The analysis revealed that overall accuracy was somewhat higher in the Stay (79.9%) than in the Rotate (73.9%) condition. However, this difference did not reach statistical significance, F(1,20)=.92, p=.35, η2=.04. A significant main effect of imagined perspective was obtained, F(2,40)=8.44, p<.001, η2=.30. As seen in Table 1, accuracy was highest for the aligned 0° perspective (84.4%), intermediate for the misaligned 90° perspective (76.2%), and lowest for the contra-aligned 180° perspective (70.2%). Within-subject contrasts verified that all pair-wise differences were significant, ps<.05. Importantly, this pattern was obtained in both the Stay and Rotate conditions, as evidenced by the absence of a significant interaction, F(2,40)=.40, p=.68, η2=.02.
Table 1. Accuracy (%) in Experiment 1 as a function of observer position and imagined perspective. Values in parentheses indicate standard deviations.
          Aligned 0°       Misaligned 90°    Contra-Aligned 180°
Stay      86.27 (18.40)    78.57 (17.55)     75.00 (23.23)
Rotate    82.45 (13.11)    73.75 (13.28)     65.42 (15.63)
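The classification of joystick deflections described at the beginning of the Results can be expressed as a simple binning rule. The sketch below is our illustration only: the boundary assignments and the handling of deflections outside the 22.5°-157.5° range are assumptions, since the text does not specify them.

def classify_deflection(deflection_deg):
    """Bin a joystick deflection (degrees from the forward position, signed so
    that negative = left and positive = right) into the response categories
    used in the analyses: 45, 90, or 135 degrees to the left or right."""
    side = "left" if deflection_deg < 0 else "right"
    a = abs(deflection_deg)
    if 22.5 <= a < 67.5:
        return (45, side)
    if 67.5 <= a < 112.5:
        return (90, side)
    if 112.5 <= a <= 157.5:
        return (135, side)
    return None   # outside the ranges given in the text; treatment not specified

print(classify_deflection(-98.0))   # (90, 'left'): scored as a 90-degree-left response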
Latency
The analysis of latencies yielded findings similar to the accuracy data. No difference was obtained between the Stay (11.63 s) and the Rotate (11.45 s) conditions, F(1,20)=.03, p=.87, η2=.001. However, a significant main effect was obtained for imagined perspective, F(2,40)=19.96, p<.001, η2=.50. As seen in Figure 2, pointing was fastest in the aligned 0° condition (9.80 s), intermediate in the misaligned 90° condition (11.47 s), and slowest in the contra-aligned 180° condition (13.35 s). All pair-wise comparisons were significant, ps<.01.
Fig. 2. Latency for pointing responses as a function of observer position and imagined perspective in Experiment 1. Error bars represent standard errors.
Finally, the interaction between observer position and imagined perspective was not significant, F(2,40)=.72, p=.50, η2=.04.
2.4 Discussion
Results from Experiment 1 clearly documented the presence of a normal alignment effect in both the Stay and Rotate conditions. This effect was present in both accuracy and latency. These findings contrast with those of Waller et al. [24], who found no alignment effect in the Rotate condition and a reverse-alignment effect in the Rotate-Update condition. The critical difference between the two studies is, in our opinion, that our depicted scenes referred to non-immediate environments, whereas the layouts in Waller et al.'s study were immediate to participants. We will return to this issue in the General Discussion.
3 Experiment 2
Experiment 2 was identical to Experiment 1 except that route descriptions were presented instead of pictures of the paths. Previous studies with route descriptions have documented a strong influence of the orientation of the first travel segment of the path on spatial performance [26]; this suggests that the way the path is represented in memory determines the ease of spatial reasoning. Based on these findings we expected that no influence of body orientation would be evidenced in our experiment. As in Experiment 1, we predicted a normal alignment effect in both the Stay and Rotate conditions.
3.1 Method
Participants
Twenty-two students, none of whom had participated in Experiment 1, took part in the experiment in exchange for course credit. Half were randomly assigned to the Stay condition and the other half to the Rotate condition.
Design
As in Experiment 1, the design was a 2 (observer position: Stay vs. Rotate) x 3 (imagined perspective: aligned 0°, misaligned 90°, contra-aligned 180°) mixed factorial, with observer position as a between-subjects factor and imagined perspective as a within-subjects factor.
Materials and Apparatus
In contrast to Experiment 1, the paths were learned through text descriptions presented on the screen. These descriptions were presented in Greek, the native language of all participants. Prior to the experiment participants were shown a picture like the one in Figure 1, but containing no path. They were told that this was the environment in which they should imagine themselves standing. The text descriptions described the same paths as in Experiment 1. An English translation of an example description reads as follows:
Imagine standing at the beginning of a path. The position that you are standing at is position 1. Without moving from this position, you turn yourself to the left. Then, you walk straight for 10 meters and you reach position 2. As soon as you get there you turn towards the left again and you walk another 10 meters to reach position 3. At this position, you turn to your right and walk another 10 meters to position 4. Finally, you turn again to your right and walk another 10 meters towards position 5, which is the endpoint of the path.
3.2 Procedure
The procedure was identical to that of Experiment 1. Prior to reading the descriptions participants were instructed to visualize themselves moving along the described path and to imagine turning 90° whenever a turn was described. As in Experiment 1, the initial movement direction was to the left for two paths and to the right for the other two. Participants in the Rotate condition carried out a physical 180° turn prior to beginning the test trials.
3.3 Results
As in Experiment 1, no differences were obtained between the 90° left and the 90° right imagined perspectives in either accuracy or latency. Therefore, data were averaged across these two perspectives to form a misaligned 90° perspective condition.
Separate 2 x 3 mixed-model ANOVAs were then conducted for accuracy and latency.
Accuracy
The ANOVA on accuracy data revealed that overall performance was equivalent between the Stay (68.7%) and the Rotate (70.3%) conditions, F(1,20)=.40, p=.84, η2=.002. A significant main effect of imagined perspective was obtained, F(2,40)=17.60, p<.001, η2=.47. As seen in Table 2, accuracy was highest for the aligned 0° perspective (77.1%), intermediate for the misaligned 90° perspective (69.8%), and lowest for the contra-aligned 180° perspective (61.7%). Within-subject contrasts verified that all pair-wise differences were significant, ps<.05. These differences among perspectives were present in both the Stay and Rotate conditions, as suggested by the lack of a significant interaction, F(2,40)=.22, p=.81, η2=.01.
Table 2. Accuracy (%) in Experiment 2 as a function of observer position and imagined perspective. Values in parentheses indicate standard deviations.
          Aligned 0°       Misaligned 90°    Contra-Aligned 180°
Stay      76.96 (20.59)    69.24 (20.84)     59.94 (23.67)
Rotate    77.15 (16.89)    70.38 (18.85)     63.45 (17.37)
Fig. 3. Latency for pointing responses as a function of observer position and imagined perspective in Experiment 2. Error bars represent standard errors.
Latency
The analysis revealed no difference in performance between the Stay (12.39 s) and the Rotate (11.79 s) conditions, F(1,20)=.12, p=.74, η2=.006. A significant main effect was present for imagined perspective, F(2,40)=24.22, p<.001, η2=.55. As seen in Figure 3, pointing was fastest in the aligned 0° condition (10.51 s), intermediate in the misaligned 90° condition (11.82 s), and slowest in the contra-aligned 180° condition (13.94 s). All pair-wise comparisons were significant, ps<.05. Finally, the interaction between observer position and imagined perspective was not significant, F(2,40)=.41, p=.67, η2=.02.
3.4 Discussion and Cross-Experiment Analyses
Results from Experiment 2 closely replicated those of Experiment 1. Specifically, a normal alignment effect was evidenced in both the Stay and Rotate conditions. This effect was present in both accuracy and latency data. Furthermore, performance did not seem to be influenced by rotation, as indicated by the equal overall performance in the Stay and Rotate conditions. The presence of a similar pattern of findings with depicted and described scenes is compatible with recent accounts of functional equivalence of representations derived from various modalities.
To further assess functional equivalence we conducted a cross-experiment analysis using the data from Experiments 1 and 2. Separate 3 x 2 ANOVAs using imagined perspective as a within-subjects factor and experiment (visual vs. verbal) as a between-subjects factor were carried out for accuracy and latency data. Accuracy was higher in the visual task of Experiment 1 (77.2%) than in the verbal task of Experiment 2 (69.4%). However, this difference fell short of significance, F(1,42)=2.37, p=.13, η2=.05. The interaction between experiment and imagined perspective was also non-significant, F(2,84)=.18, p=.84, η2=.004. The only significant effect was the main effect of imagined perspective, F(2,84)=24.12, p<.001, η2=.37. Similarly, the only significant effect in the latency analysis was the main effect of perspective, F(2,84)=45.21, p<.001, η2=.52. In support of the functional equivalence hypothesis, neither the main effect of experiment nor the interaction between experiment and imagined perspective was significant, F(1,42)=.32, p=.58, η2=.01 and F(2,84)=.14, p=.87, η2=.003, respectively.
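The mixed-design analyses reported in this section follow a standard between-within ANOVA scheme. As a rough sketch of how such an analysis could be run, the example below uses the pingouin package with assumed column and file names; it is not the software actually used for the analyses reported here, which is not specified.

import pandas as pd
import pingouin as pg

# Hypothetical long-format data: one row per participant and imagined
# perspective, with accuracy averaged per cell. Assumed columns: subject,
# group (Stay/Rotate, or visual/verbal for the cross-experiment analysis),
# perspective (0, 90, 180), and accuracy.
df = pd.read_csv("jrd_means_long.csv")

# Mixed ANOVA with imagined perspective as the within-subjects factor and
# group as the between-subjects factor.
aov = pg.mixed_anova(data=df, dv="accuracy", within="perspective",
                     between="group", subject="subject")
print(aov[["Source", "DF1", "DF2", "F", "p-unc", "np2"]])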
4 General Discussion
The experiments presented here provide evidence for a lack of sensorimotor influence on reasoning about spatial relations contained in depicted or described environments. The current findings deviate from those obtained in experiments with real visual scenes, in which the influence of body orientation was substantial [9, 10]. While our findings suggest that reasoning through symbolic media might not always be equivalent to reasoning about actual environments, in our opinion the critical variable is not whether the environments are experienced directly through our senses or indirectly through symbolic media, but rather whether the spatial relations they contain are immediate or not (see [9]). We believe that reasoning about remote locations is free of sensorimotor facilitation/interference. Because symbolic media are
typically used to encode non-immediate spatial relations, while immediate relations are encoded through direct experience, the difference in findings occurs. Compatible with this explanation are the findings of Kelly et al. [9], which showed that no sensorimotor influence occurs when participants are removed from the spatial layout they had previously encoded by means of visual perception.
The current findings are compatible with theories of spatial memory and action that posit separate systems for encoding egocentric and allocentric relations [8, 27, 28]. In these theories, egocentric relations are maintained in a transient sensorimotor memory system and are updated as one moves within the environment. On the other hand, allocentric relations (i.e., inter-object directions and distances) are maintained in an enduring memory system. As Mou et al. [10] suggested, memories in the enduring system are stored with a preferred orientation, which can be chosen on the basis of a variety of factors including the viewing perspective, instructions, the internal structure of the layout, etc. In their critical evaluation of spatial memory theories, Avraamides and Kelly [14] argued that when reasoning about immediate spatial relations, both the transient sensorimotor and the enduring systems are relevant to the task. When a participant is asked to point to a location from her actual perspective, performance is facilitated by the fact that the self-to-object vector signifying the correct response is directly represented in the sensorimotor system and is automatically activated, as suggested by May [7]. However, in order to point from an imagined perspective, the participant must suppress this self-to-object vector and compute a response using the inter-object relations from the enduring system. As Waller and Hodgson [28] have recently suggested, computations from the enduring system are cognitively effortful. Reasoning from imagined perspectives is thus expected to take longer and to be prone to sensorimotor interference. Avraamides and Kelly also argued that when reasoning about non-immediate spatial relations only the enduring system is relevant to the task. This is the case because the transient egocentric system functions to encode the current surroundings and not the layout one reasons about. As a result, performance is neither facilitated nor interfered with by the physical orientation of the participant.
The tasks we used in these experiments seem to fall under the second type of reasoning described by Avraamides and Kelly. We used pictures and descriptions that referred to spatial layouts that were understood as remote to participants. We also instructed participants to visualize themselves within the environment that was shown or described. If the environments were indeed understood to be remote, no egocentric relations should have been formed between the actual self and the locations contained in the layouts. Indeed, we believe that the task was executed solely on the basis of an enduring allocentric system, and we therefore attribute the alignment effect that was found in all our conditions to the way the paths were represented in memory. In the case of Experiment 1, we believe that paths were organized in memory on the basis of viewing experience (i.e., as a snapshot taken from a vantage point that coincided with the physical observation point of the participant). In the case of Experiment 2, paths were maintained in memory with respect to the initial imagined facing direction.
Although participants were given no instructions about which initial facing direction to imagine, adopting one aligned with their actual facing direction seems less taxing on cognitive resources. Indeed, a number of
previous studies have suggested that people have difficulty maintaining misaligned imagined perspectives [26].
It should be pointed out that while we claim that no egocentric relations between the self and the elements of the path were formed, we acknowledge that the transient egocentric systems of participants would have been used to encode and update egocentric relations to objects in the laboratory, including the two computer monitors used to present the stimuli. Moreover, spatial relations between each path location and an imagined representation of the self within the path could have been formed. However, such relations could more easily be classified as allocentric rather than egocentric, if the self in the imagined path is regarded as just another location in the layout.
A secondary goal of our study was to assess the degree of functional equivalence between spatial representations created from depicted and described scenes. An important result is that the same pattern of findings (i.e., a normal alignment effect) was observed in the two experiments. While performance was somewhat more accurate for depicted than for described scenes, our cross-experiment analysis revealed that the difference was not significant. The difference in mean accuracy is not surprising given findings from previous studies showing that it takes longer to reach the same level of learning when encoding spatial layouts through language than through vision [17-19]. In the current study we used no learning criterion. Instead, participants were given unlimited time to study the layouts in the two experiments. The accuracy and fidelity of their spatial representations were, however, not assessed prior to testing. It is possible, then, that the overall performance difference between described and depicted scenes was caused by differences in encoding. Previous studies suggest that functional equivalence for representations acquired from different modalities is achieved after equating conditions in terms of encoding differences [3, 17]. A future direction for research would thus be to examine functional equivalence for representations of remote environments after taking into account the differences that may exist across modalities in terms of encoding.
Acknowledgments. We are grateful to all the students who participated in the study.
References
1. Amorim, M.A., et al.: Updating an object's orientation and location during nonvisual navigation: a comparison between two processing modes. Percept. Psychophys. 59(3), 404–418 (1997)
2. Farrell, M.J., Thomson, J.A.: On-line updating of spatial information during locomotion without vision. J. Mot. Behav. 31(1), 39–53 (1999)
3. Loomis, J.M., et al.: Spatial updating of locations specified by 3-D sound and spatial language. J. Exp. Psychol. Learn. Mem. Cogn. 28(2), 335–345 (2002)
4. Rieser, J.J.: Access to knowledge of spatial structure at novel points of observation. J. Exp. Psychol. Learn. Mem. Cogn. 15(6), 1157–1165 (1989)
5. Presson, C.C., Montello, D.R.: Updating after rotational and translational body movements: coordinate structure of perspective space. Perception 23(12), 1447–1455 (1994)
6. Wang, R.F., Spelke, E.S.: Updating egocentric representations in human navigation. Cognition 77(3), 215–250 (2000)
7. May, M.: Imaginal perspective switches in remembered environments: transformation versus interference accounts. Cognit. Psychol. 48(2), 163–206 (2004)
8. Mou, W., et al.: Roles of egocentric and allocentric spatial representations in locomotion and reorientation. J. Exp. Psychol. Learn. Mem. Cogn. 32(6), 1274–1290 (2006)
9. Kelly, J.W., Avraamides, M.N., Loomis, J.M.: Sensorimotor alignment effects in the learning environment and in novel environments. J. Exp. Psychol. Learn. Mem. Cogn. 33(6), 1092–1107 (2007)
10. Mou, W., et al.: Allocentric and egocentric updating of spatial memories. J. Exp. Psychol. Learn. Mem. Cogn. 30(1), 142–157 (2004)
11. Mou, W., McNamara, T.P.: Intrinsic frames of reference in spatial memory. J. Exp. Psychol. Learn. Mem. Cogn. 28(1), 162–170 (2002)
12. Wang, R.F.: Between reality and imagination: when is spatial updating automatic? Percept. Psychophys. 66(1), 68–76 (2004)
13. Wang, R.F., Brockmole, J.R.: Human navigation in nested environments. J. Exp. Psychol. Learn. Mem. Cogn. 29(3), 398–404 (2003)
14. Avraamides, M.N., Kelly, J.W.: Multiple systems of spatial memory and action. Cogn. Process. (2007)
15. Presson, C.C., Hazelrigg, M.D.: Building spatial representations through primary and secondary learning. J. Exp. Psychol. Learn. Mem. Cogn. 10(4), 716–722 (1984)
16. Avraamides, M.N.: Spatial updating of environments described in texts. Cognit. Psychol. 47(4), 402–431 (2003)
17. Avraamides, M.N., et al.: Functional equivalence of spatial representations derived from vision and language: evidence from allocentric judgments. J. Exp. Psychol. Learn. Mem. Cogn. 30(4), 804–814 (2004)
18. Klatzky, R.L., et al.: Encoding, learning, and spatial updating of multiple object locations specified by 3-D sound, spatial language, and vision. Exp. Brain Res. 149(1), 48–61 (2003)
19. Klatzky, R.L., et al.: Learning directions of objects specified by vision, spatial audition, or auditory spatial language. Learn. Mem. 9(6), 364–367 (2002)
20. Mellet, E., et al.: Neural basis of mental scanning of a topographic representation built from a text. Cereb. Cortex 12(12), 1322–1330 (2002)
21. Mellet, E., et al.: Neural correlates of topographic mental exploration: the impact of route versus survey perspective learning. Neuroimage 12(5), 588–600 (2000)
22. Taylor, H.A., Tversky, B.: Descriptions and depictions of environments. Mem. Cognit. 20(5), 483–496 (1992)
23. Denis, M., Zimmer, H.D.: Analog properties of cognitive maps constructed from verbal descriptions. Psychological Research 54(4), 286–298 (1992)
24. Waller, D., et al.: Orientation specificity and spatial updating of memories for layouts. J. Exp. Psychol. Learn. Mem. Cogn. 28(6), 1051–1063 (2002)
25. Harrison, A.M.: Reversal of the alignment effect: influence of visualization and spatial set size. In: Proceedings of the Annual Cognitive Science Meeting (2007)
26. Wildbur, D.J., Wilson, P.N.: Influences on the first-perspective alignment effect from text route descriptions. Q. J. Exp. Psychol. 61(5), 763–783 (2007)
27. Easton, R.D., Sholl, M.J.: Object-array structure, frames of reference, and retrieval of spatial knowledge. J. Exp. Psychol. Learn. Mem. Cogn. 21(2), 483–500 (1995)
28. Waller, D., Hodgson, E.: Transient and enduring spatial representations under disorientation and self-rotation. J. Exp. Psychol. Learn. Mem. Cogn. 32(4), 867–882 (2006)
Spatial Memory and Spatial Orientation
Jonathan W. Kelly and Timothy P. McNamara
Department of Psychology, Vanderbilt University
111 21st Ave. South, Nashville, TN 37203
[email protected]
Abstract. Navigating through a remembered space depends critically on the ability to stay oriented with respect to the remembered environment and to reorient after becoming lost. This chapter describes the roles of long-term spatial memory, sensorimotor spatial memory, and path integration in determining spatial orientation. Experiments presented here highlight the reference direction structure of long-term spatial memory and suggest that self-position and orientation during locomotion are updated with respect to those reference directions. These results indicate that a complete account of spatial orientation requires a more thorough understanding of the interaction between long-term spatial memory, sensorimotor spatial memory, and path integration. Keywords: Navigation; Path integration; Reorientation; Spatial cognition; Spatial memory; Spatial updating.
1 Introduction
Navigation through a familiar environment can be considered a two-part task, where the successful navigator must first orient him or herself with respect to the known environment and then determine the correct travel direction in order to arrive at the goal location. Several accounts of spatial memory and spatial orientation have been reported in recent years to explain human navigation abilities (Avraamides & Kelly, 2008; Kelly, Avraamides & Loomis, 2007; Mou, McNamara, Valiquette & Rump, 2004; Rump & McNamara, 2007; Sholl, 2001; Waller & Hodgson, 2006; Wang & Spelke, 2000). Inspired in part by perceptual theories positing separate representations for perception and action (Bridgeman, Lewis, Heit & Nagle, 1979; Milner & Goodale, 1995; Schneider, 1969), many of these theories of spatial memory agree that a complete account of human navigation and spatial orientation requires multiple spatial representations.
The first such spatial representation is a long-term representation, in which locations are represented in an enduring manner. This long-term representation allows the navigator to plan future travels, recognize previously experienced environments, and identify remembered locations, even when those locations are obscured from view. The preponderance of evidence from spatial memory experiments indicates that these long-term representations are orientation dependent, with privileged access to particular orientations (see McNamara, 2003, for a review). Section 2 (below) reviews the evidence for orientation dependence, and also details recent experiments aimed at understanding the relevant cues that determine which orientations
receive privileged access, particularly in naturalistic environments that contain many potential cues. The second spatial representation consistently implicated in models of spatial memory is a working memory representation, referred to here as a sensorimotor representation, in which locations are represented only transiently. The sensorimotor representation is thought to be used when performing body-defined actions, such as negotiating obstacles and moving toward intermediate goal locations like landmarks, which can function as beacons. Because these behaviors typically rely on egocentrically organized actions, it makes sense that this sensorimotor representation should also be egocentrically organized, in order to maintain an isomorphic mapping between representation and response. The evidence reviewed in Section 3 supports this conjecture, indicating that the sensorimotor representation is organized in an egocentric framework. Although most models of spatial memory agree that the sensorimotor representation is transient, the exact nature of its transience is not well understood. While some experiments indicate that the sensorimotor representation fades with time (e.g., Mou et al., 2004), other evidence shows that the sensorimotor representation depends primarily on environmental cues that are only transiently available during locomotion (Kelly et al., 2007). Evidence supporting these two claims is presented in Section 3. In order to stay oriented with respect to a known environment, the navigator must be able to identify salient features of his or her surrounding environment and match those features with the same features in long-term spatial memory. This point becomes particularly evident when attempting to reorient after becoming lost. For example, a disoriented student might have an accurate long-term representation of the campus, along with a vivid sensorimotor representation of his or her surrounding environment. But unless the student can identify common features shared by both representations, and bring those representations into alignment based on their common features, he or she will remain disoriented. Neither the sensorimotor nor the long-term representation alone contains sufficient information to re-establish location and orientation within the remembered space. Instead, the disoriented navigator must be able to align the long-term representation, which contains information about how to continue toward one’s navigational goal, with the sensorimotor representation of the immediately surrounding space, similar to how visitors to unfamiliar environments will often align a physical map with the visible surrounds during navigation. In Section 4, we review previous work on cues to reorientation, and frame these results in the context of this matching process between long-term and sensorimotor representations. We also present new data from two experiments exploring the differences and similarities in spatial cue use during reorientation and maintenance of orientation, two tasks integral to successful navigation. The results suggest that spatial orientation is established with respect to the same reference directions that are used to organize long-term spatial memories.
2 Long-Term Spatial Memory
An every-day task like remembering the location of one's car in a stadium parking lot draws on the long-term spatial memory of the remembered environment. Because
locations are inherently relative, objects contained in this long-term spatial memory must be specified in the context of a spatial reference system. For example, a football fan might remember the location of his or her car in the stadium parking lot with respect to the rows and columns of cars, or possibly with respect to the car's location relative to the main stadium entrance. In either case, the car's location must be represented relative to some reference frame, which is likely to be centered on the environment.
Much of the experimental work on the organization of long-term spatial memories has focused on the cues that influence the selection of one spatial reference system over the infinite number of candidate reference systems. In these experiments, participants learn the locations of objects on a table, within a room, or throughout a city, and are later asked to retrieve inter-object spatial relationships from the remembered layout. A variety of spatial memory retrieval tasks have been employed, including map drawing, picture recognition, and perspective taking. These retrieval tasks are commonly performed after participants have been removed from the learning environment, to ensure that spatial memories are being retrieved from the long-term representation and not from the sensorimotor representation. Here we focus primarily on results from perspective taking tasks, where participants point to locations from imagined perspectives within the remembered environment. A consistent finding from these experiments is that long-term spatial memories are typically represented with respect to a small number of reference directions, centered on the environment and selected during learning (see McNamara, 2003, for a review). During spatial memory retrieval, inter-object spatial relationships aligned with those reference directions are readily accessible because they are directly represented in the spatial memory. In contrast, misaligned spatial relationships must be inferred from other represented relationships, and this inference process is cognitively effortful (e.g., Klatzky, 1998). The pattern of response latencies and pointing errors across a sample of imagined perspectives is interpreted as an indicator of the reference directions used to organize the spatial memory, and a large body of work has focused on understanding the cues that influence the selection of one reference direction over another during acquisition of the memory.
Because we use our bodies to sense environmental information and also to act on the environment, the body's position during learning seems likely to have a large influence on selecting a reference direction. Consistent with this thinking, early evidence indicated that perspectives aligned with experienced views are facilitated relative to non-experienced views. This facilitation fell off as a function of angular distance from the experienced views (Diwadkar & McNamara, 1997; Roskos-Ewoldsen, McNamara, Shelton & Carr, 1999; Shelton & McNamara, 1997), and this pattern of facilitation holds true for up to three learned perspectives. These findings resonate with similar findings from object recognition (Bülthoff & Edelman, 1992), but are complicated by two other sets of findings. First, Kelly et al. (2007; see also Avraamides & Kelly, 2005) had participants learn a layout of eight objects within an octagonal room using immersive virtual reality.
Participants freely turned and explored the virtual environment during learning, but the initially experienced perspective was held constant. After learning this layout, imagined perspectives aligned with the initially experienced perspective were facilitated, and this pattern persisted even after extensive experience with other perspectives misaligned with that initial view.
The authors concluded that participants established a reference direction upon first experiencing the environment, and that this organization was not updated even after learning from many other views. Second, Shelton and McNamara (2001; also see Hintzman, O’Dell & Arndt, 1981) found that the environmental shape has a profound impact on selecting reference directions. In one of their experiments, participants learned a layout of objects on the floor of a rectangular room. Learning occurred from two perspectives, one parallel with the long axis of the room and one misaligned with the room axis. Perspective taking performance was best when imagining the aligned learning perspective and performance on the misaligned learning perspective was no better than on non-experienced perspectives. The authors concluded that reference directions are selected based on a combination of egocentric experience and environmental structure, and that the rectangular room served as a cue to selecting a reference direction consistent with that structure. This finding is supported by other work showing facilitated retrieval of inter-object relationships aligned with salient environmental features like city streets, large buildings, and lakes (McNamara, Rump & Werner, 2003; Montello, 1991; Werner & Schmidt, 1999). Other work has shown that selection of reference directions is influenced not only by features external to the learned layout, but also by the structure of the learned layout itself. For example, the reference directions used to remember the locations of cars in a stadium parking lot might be influenced by the row and column structure of the very cars that are being learned. Mou and McNamara (2002) demonstrated the influence of this intrinsic structure by having participants study a rectilinear object array. The experimenter pointed out the spatial regularity of the layout, which contained rows and columns oblique to the viewing perspective during learning. Subsequent perspective taking performance was best for perspectives aligned with the intrinsic axes defined by the rows and columns of objects, even though those perspectives were never directly experienced during learning. Furthermore, this influence of the intrinsic object structure is not dependent on experimenter instructions like those provided in Mou and McNamara’s experiments. Instead, an axis of bilateral symmetry within the object array can induce the same organization with respect to an intrinsic frame of reference, defined by the symmetry axis (Mou, Zhao & McNamara, 2007). To summarize the findings reviewed so far, the reference directions used to organize long-term spatial memories are known to be influenced by egocentric experience, extrinsic environmental structures like room walls (extrinsic to the learned layout), and intrinsic structures like rows and columns of objects or symmetry axes (intrinsic to the learned layout). While these cues have each proven influential in cases where only one or two cues are available, real world environments typically contain a whole host of cues, including numerous extrinsic and intrinsic cues like sidewalks, tree lines, waterfronts, and mountain ranges. A recent set of experiments reported by Kelly & McNamara (2008) sought to determine whether one particular cue type is dominant in a more representative scene, where egocentric experience, extrinsic structure, and intrinsic structure all provided potential cues to selecting a reference direction. 
In the first of two experiments using immersive virtual reality, participants learned a layout of seven virtual objects from two perspectives. The objects were arranged in rows and columns which were oblique to the walls of a surrounding square room (termed the incongruent environment, since intrinsic and extrinsic environmental structures
Fig. 1. Stimuli and results from Kelly and McNamara (2008). Plan views of the incongruent (top) and congruent (bottom) environments appear as insets within each panel. In the plan views, open circles represent object locations, solid lines represent room walls, and arrows represent viewing locations during learning. Pointing error is plotted as a function of imagined perspective, separately for the two viewing orders (0° then 135° or 135° then 0°). After learning the incongruent environment (top), where intrinsic and extrinsic structures were incongruent with one another, performance was best on the initially experienced view. After learning the congruent environment (bottom), where intrinsic and extrinsic structures were congruent with one another, performance was best for perspectives aligned with the redundant environmental structures, regardless of viewing order.
were incongruent with one another; see Figure 1, top panel). One of the learned perspectives (0°) was aligned with the intrinsic object structure, and the other (135°) was aligned with the extrinsic room structure. Learning occurred from both views, and viewing order was manipulated. If the intrinsic structure was more salient than extrinsic structure, then participants should have selected a reference direction from the 0°
view (aligned with the rows and columns of the layout). However, if extrinsic structure was more salient than intrinsic structure, then participants should have selected a reference direction from the 135° view (aligned with the walls of the room). Finally, if the competing intrinsic and extrinsic structures negated one another’s influence, then participants should have selected a reference direction from the initially experienced view, regardless of its alignment with a particular environmental structure. In fact, spatial memories of the incongruent environment (top panel of Figure 1) were based on the initially experienced view, and the pattern of facilitation is well predicted by the viewing order. Neither the intrinsic structure of the objects nor the extrinsic structure of the room was more salient when the two were placed in competition. In the second experiment reported by Kelly and McNamara (2008), the intrinsic and extrinsic structures were placed in alignment with one another (termed the congruent environment; see inset in Figure 1, bottom panel), and learning occurred from two perspectives, one aligned and one misaligned with the congruent environmental structures. Spatial memories of the congruent environment (bottom panel of Figure 1) were organized around the redundantly defined environmental axes. Performance was best for perspectives aligned with the congruent intrinsic and extrinsic structures, and was no better on the misaligned experienced view than on other misaligned views that were never experienced. The results of these two experiments fit well with those reported by Shelton and McNamara (2001), where multiple misaligned extrinsic structures (a rectangular room and a square mat on the floor) resulted in egocentric selection of reference directions, but aligned extrinsic structures resulted in environment-based selection. Taken together, these findings indicate that intrinsic and extrinsic structures are equally salient, and can serve to reinforce or negate the influences of one another as cues to the selection of reference directions. Every-day environments typically contain multiple intrinsic and extrinsic structures like roads, waterfronts, and tree lines, and these structures often define incongruent sets of environmental axes. As such, it is possible that reference directions are most commonly selected on the basis of egocentric experience. Experiments on long-term spatial memory have regularly provided evidence that long-term representations are orientation-dependent, allowing for privileged access to spatial relations aligned with a reference direction centered on the environment. However, the evidence reviewed thus far is based primarily on imagined perspective taking performance, and experiments using scene recognition indicate that there may be more than one long-term representation. Valiquette and McNamara (2007; also see Shelton & McNamara, 2004) had participants learn a layout of objects from two perspectives, one aligned and one misaligned with the extrinsic structure of the environment (redundantly defined by the room walls and a square mat on the floor). As in other experiments (e.g., Kelly & McNamara, 2008; Shelton & McNamara, 2001), perspective taking performance was better when imagining the aligned learning perspective than when imagining the misaligned learning perspective, which was no better than when imagining other misaligned perspectives that were never experienced. 
In contrast, scene recognition performance was good on both the aligned and misaligned learning perspectives, and fell off as a function of angular distance from the learned perspectives. So while imagined perspective taking performance indicated that the misaligned learning view was not represented in long-term memory, scene recognition performance indicated that the misaligned view was represented. The
authors interpreted this as evidence for two long-term representations, one used for locating self-position (active during the scene recognition test) and the other for locating goal locations after establishing self-position (active during the perspective taking task). Importantly, both representations were found to be orientation-dependent, but the reference directions used to organize the two types of representations were different. The influence of these reference directions on navigation is still unclear. One possibility is that spatial relationships are more accessible when the navigator is aligned with a reference direction in long-term memory. As a result, a navigator’s ability to locate and move toward a goal location might be affected by his or her orientation within the remembered space. Additionally, experiments presented in Section 4 suggest that spatial updating occurs with respect to the same reference directions used to organize spatial memories.
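As a minimal illustration of the orientation dependence reviewed in this section (our sketch, not an analysis from any of the cited studies), the difficulty of an imagined perspective can be summarized by its angular distance to the nearest reference direction presumed to organize the long-term memory.

def angular_distance(a, b):
    """Smallest absolute difference between two headings, in degrees."""
    d = abs(a - b) % 360.0
    return min(d, 360.0 - d)

def distance_to_nearest_reference(imagined_heading, reference_directions):
    """Angular distance (deg) from an imagined perspective to the closest of
    the reference directions presumed to organize the spatial memory."""
    return min(angular_distance(imagined_heading, r) for r in reference_directions)

# Example: a memory organized around a single 0-degree reference direction, as
# inferred for the incongruent environment of Kelly and McNamara (2008).
for heading in (0, 45, 90, 135, 180):
    print(heading, distance_to_nearest_reference(heading, [0.0]))
# Larger distances are expected to go with larger pointing errors and latencies.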
3 Sensorimotor Spatial Memory
Whereas long-term representations are suitable for reasoning about inter-object relationships from learned environments, they are, by themselves, insufficient for coordinating actions within the remembered environment. In order to act on our environments, we require a body-centered representation of space, rather than the environment-centered representations characteristic of long-term spatial memories. Indeed, current theories of spatial memory (e.g., Avraamides & Kelly, 2008; Kelly et al., 2007; Mou et al., 2004; Rump & McNamara, 2007; Sholl, 2001) typically include something analogous to a sensorimotor spatial memory system, which represents egocentric locations of objects in the environment and can be used to negotiate obstacles, intercept moving objects, and steer a straight course toward a goal. This sensorimotor representation provides privileged access to objects in front of the body, evidenced by the finding that retrieval of unseen object locations is facilitated for locations in front of the body, compared to behind (Sholl, 1987). This same pattern also occurs when imagining perspectives within a remote environment stored in long-term memory (Hintzman et al., 1981; Shelton & McNamara, 1997; Werner & Schmidt, 1999), where pointing from an imagined perspective is facilitated for objects in front of the imagined position, relative to objects behind, and suggests that the sensorimotor representation might also be used to access spatial relationships from non-occupied environments (Sholl, 2001). This privileged access to objects in front is consistent with other front-facing aspects of human sensory and locomotor abilities, and highlights the importance of locations and events in front of the body.
Unlike the environment-centered reference frames characteristic of long-term spatial memories, egocentric locations within the sensorimotor representation must be updated during movement through the environment. Because this updating process is cognitively demanding, there is a limit to the number of objects that can be updated successfully (Wang et al., 2006; but see Hodgson & Waller, 2006). Furthermore, self-motion cues are critical to successful updating, and a large body of work has studied the effectiveness of various self-motion cues. While updating the location of one or two objects can be done fairly effectively during physical movements, imagined movements, which lack the corresponding self-motion cues, are comparatively quite difficult (Rieser, 1989). In a seminal study on imagined movements, Rieser asked
blindfolded participants to point to remembered object locations after physical rotations or after imagined rotations. Pointing was equally good before and after physical rotations, indicating the efficiency of updating in the presence of self-motion cues. However, performance degraded as a function of rotation angle after imagined rotation. According to Presson and Montello (1994; also see Presson, 1987), pointing judgments from imagined perspectives misaligned with the body are difficult because of a reference frame conflict between two competing representations of the surrounding environment. The remembered environment in relation to one's physical location and orientation is held in a primary representation (i.e., the sensorimotor representation), and the same environment relative to one's imagined location and orientation is held in a secondary representation. Imagined movement away from one's physical location and orientation creates conflict between these two representations, referred to here as sensorimotor interference (May, 2004). This conflict occurs when the primary and secondary representations both represent the same environment, and therefore sensorimotor interference only affects perspective-taking performance when imagining perspectives within the occupied environment, but not when imagining perspectives from a remote environment.
Much of the research on the sensorimotor representation has been conducted independently from research on long-term spatial memory (reviewed above in Section 2). However, recent experiments indicate that a complete understanding of the sensorimotor representation must also take into account the organization of long-term spatial memory. Experiments by Mou et al. (2004) indicate that the interference associated with imagining a perspective misaligned with the body depends on whether that imagined perspective is aligned with a reference direction in long-term memory. They found that the sensorimotor interference associated with imagining a perspective misaligned with the body was larger when the imagined perspective was also misaligned with a reference direction in long-term memory, compared to perspectives aligned with a reference direction. However, a thorough exploration of this interaction between sensorimotor interference and reference frames in long-term spatial memory is still lacking.
As proposed by Mou et al. (2004), the sensorimotor representation is transient, and decays at retention intervals of less than 10 seconds in the absence of perceptual support. However, experiments by Kelly et al. (2007) challenge this notion based on the finding that sensorimotor interference can occur after long delays involving extensive observer movements. In one experiment using immersive virtual reality, participants learned a circular layout of objects within a room. Although participants were allowed unrestricted viewing of the virtual environment, the initially experienced view was held constant across participants. The objects were removed after learning, and subsequent spatial memory retrieval occurred over two blocks of testing.
In each block, participants imagined perspectives within the learned set of objects, and those perspectives could be 1) aligned with the initially experienced view (termed the “original” perspective), 2) aligned with the participant’s actual body orientation during retrieval (termed the “sensorimotor aligned” perspective), or 3) misaligned with both the initially experienced view and the participant’s body orientation (termed the “misaligned” perspective). Prior to starting the first block of trials, participants walked three meters into a neighboring virtual room. Perspective taking performance when standing in this neighboring room was best when imagining the initially experienced perspective
(compare performance on the original perspective with performance on the misaligned perspective in Figure 2), but there was no advantage for the perspective aligned, compared to misaligned with the body (compare performance on the sensorimotor aligned perspective with performance on the misaligned perspective). Based on results from this first test block, the authors concluded that participants’ sensorimotor representations of the learned objects were purged upon leaving the learning room and replaced with new sensorimotor representations of the currently occupied environment (i.e., the room adjacent to the learning room). As such, there was no sensorimotor interference when imagining the learned layout while standing in the neighboring room. Using Presson and Montello’s (1994) framework, participants’ primary and secondary spatial representations contained spatial information from separate environments, and therefore no sensorimotor interference occurred.
Fig. 2. Results of Kelly, Avraamides and Loomis (2007). Response latency is plotted as a function of test block and imagined perspective. After learning a layout of objects, participants walked into a neighboring room and performed Block 1 of the perspective-taking task. Results indicate that performance was best for the originally experienced learning perspective, but was unaffected by the disparity between the orientation of the imagined perspective and the orientation of the participants' bodies during testing. After completing Block 1, participants returned to the empty learning room and performed Block 2. Results indicate that performance was facilitated on the originally experienced perspective, and also on the perspective aligned with the body during testing.
For the second block of trials, participants returned to the empty learning room (the learned objects had been removed after learning), and performed the exact same perspective-taking task as before. Performance was again facilitated when participants imagined the initially experienced perspective, but also when they imagined their actual perspective, compared to performance on the misaligned perspective (see Figure 2). Despite the fact that participants did not view the learned objects upon
returning to the learning room, their sensorimotor representations of the objects were reactivated, causing sensorimotor interference when imagining perspectives misaligned with the body. This indicates that walking back into the empty learning room was sufficient to reinstantiate the sensorimotor representation of the learned objects, even though several minutes had passed since they were last seen. Renewal of the sensorimotor representation must have drawn on the long-term representation, because the objects themselves were not experienced upon returning to the empty learning room. In sum, Kelly et al.’s experiment suggests that the sensorimotor representation is less sensitive to elapsed time than previously thought, and instead is dependent on perceived self-location. The sensorimotor representation appears to be context dependent, and moving from one room to another changes the context and therefore also changes the contents of the sensorimotor representation.
4 Spatial Orientation
Staying oriented during movement through a remembered space and reorienting after becoming lost are critical spatial abilities. With maps and GPS systems, getting lost on one's drive home might not present a life-or-death situation, but the same was not true for our ancestors, whose navigation abilities were necessary for survival. According to Gallistel (1980; 1990), spatial orientation is achieved, in part, by relating properties of the perceived environment (i.e., the sensorimotor representation) with those same properties in the remembered environment (i.e., the long-term representation), and is also informed by perceived self-position as estimated by integrating self-motion cues during locomotion, a process known as path integration. The importance of information from path integration becomes particularly clear when navigating within an ambiguous environment, such as an empty rectangular room in which two orientations provide the exact same perspective of the room (e.g., Hermer & Spelke, 1994). In this case, one's true orientation can only be known by using path integration to distinguish between the two potentially correct matches between sensorimotor and long-term representations.
From time to time, the matching between perceived and remembered environments can produce grossly incorrect estimates of self-position. Jonsson (2002; also see Gallistel, 1980) describes several such experiences. In one case, he describes arriving in Cologne by train. Because his origin of travel was west of Cologne, he assumed that the train was facing eastward upon its arrival at Cologne Central Station. The train had, in fact, traveled past Cologne and turned around to enter the station from the east, and was therefore facing westward upon its arrival. Jonsson's initial explorations of the city quickly revealed his orienting error, and he describes the disorienting experience of rotating his mental representation of the city 180° into alignment with the visible scene. Experiences such as these are typically preceded by some activity that disrupts the path integration system (like riding in a subway, or falling asleep on a train), which would have normally prevented such an enormous error.
4.1 Environmental Cues to Spatial Orientation
Much of the experimental work on the topic of human spatial orientation has focused on the cues used to reorient after explicit disorientation. In particular, those studies
distinguish between two types of environmental cues to spatial orientation: 1) geometric cues, such as the shape of the room as defined by its extended surfaces, and 2) featural cues, such as colors, textures, and other salient features that cannot be described in purely geometric terms (see Cheng & Newcombe, 2005, for an overview of the findings in this area). The majority of these experiments employ a task originally developed by Cheng (1986) to study spatial orientation in rats. Hermer and Spelke (1994, 1996) adapted Cheng’s task to study reorientation in humans. In the basic experimental paradigm, participants learn to locate one corner within a rectangular room, consisting of two long walls and two short walls. Participants are later blindfolded and disoriented, and are then asked to identify which corner is the learned corner. When all four room walls are uniformly colored (Hermer & Spelke, 1996), participants split their responses evenly between the correct corner and the diagonally opposite corner, both of which share the same ratio of left and right wall lengths and the same corner angle. Rarely do participants choose one of the geometrically incorrect corners, a testament to their sensitivity to environmental geometry and their ability to reorient using geometric cues. When a featural cue is added by painting one of the four walls a unique color (Hermer & Spelke, 1996), participants are able to consistently identify the correct corner and no longer choose the diagonally opposite corner, indicating the influence of featural cues on reorientation. Recent experiments in our lab have focused on room rotational symmetry as the underlying geometric cue in determining reorientation performance. Rotational symmetry is defined as the number of possible orientations of the environment that result in the exact same perspective. For example, any perspective within a rectangular
room (without featural cues) can be exactly reproduced by rotating the room 180°. Because there are two orientations that produce the same perspective, the rectangular room is two-fold rotationally symmetric. A square room is four-fold rotationally symmetric, and so on. In our experiment, we tested reorientation performance within environments of 1-fold (trapezoidal), 2-fold (rectangular), 4-fold (square) and ∞-fold (circular) rotational symmetry. Participants memorized one of twelve possible target locations within the room, and then attempted to re-locate the target position after explicit disorientation. Reorientation performance (see Figure 3) was inversely proportional to room rotational symmetry across the range of rotational symmetries tested. This can be considered an effect of geometric ambiguity, with the greater ambiguity of the square room compared to the trapezoidal room leading to comparatively poorer reorientation performance in the square room. The same analysis can be applied to featural cues, which have traditionally been operationalized as unambiguous indicators of self-location (e.g., Hermer & Spelke, 1996), but need not be unambiguous.

Fig. 3. Reorientation performance in four rooms, varying in their rotational symmetry. Participants learned to identify one of twelve possible object locations, and then attempted to locate the learned location after disorientation.

4.2 Path Integration

Even in the absence of environmental cues, humans can maintain a sense of spatial orientation through path integration. Path integration is the process of updating perceived self-location and orientation using internal motion cues such as vestibular and proprioceptive cues, and external motion cues such as optic flow, and integrating those motion signals over time to estimate self-location and orientation (for a review, see Loomis, Klatzky, Golledge & Philbeck, 1999). The path integration process is noisy, and errors accrue with increased walking and turning. In an experiment by Klatzky et al. (1990), blindfolded participants were led along an outbound path consisting of one to three path segments, and each segment was separated by a turn. After reaching the end of the path, participants were first asked to turn and face the path origin and then to walk to the location of the path origin. Turning errors and walked-distance errors increased with the number of path segments, demonstrating that path integration is subject to noise. Errors that accumulate during path integration cannot be corrected for without perceptual access to environmental features, such as landmarks or geometry.

4.3 Spatial Orientation Using Path Integration and Environmental Cues

Only occasionally are we faced with a pure reorientation task or a pure path integration task. More commonly, environmental cues and path integration are both available as we travel through a remembered space. In a recent experiment, we investigated the role of environmental geometry in spatial orientation when path integration was also available. Participants performed a spatial updating task, where they learned a location within a room and attempted to keep track of that location while walking along an outbound path. At the end of the path they were asked to point to the remembered location. The path was defined by the experimenter and varied in length from two to six path segments, and participants actively guided themselves along this path. The task was performed in environments of 1-fold (trapezoidal), 2-fold (rectangular), 4-fold (square) and ∞-fold (circular) rotational symmetry. If rotational symmetry affects spatial updating performance like
it affected reorientation performance (see Section 4.1, above), then performance should degrade as room shape becomes progressively more ambiguous. The effect of room rotational symmetry was expected to be particularly noticeable at long path lengths, when self-position estimates through path integration become especially error-prone (Klatzky et al., 1990; Rieser & Rider, 1991), and people are likely to become lost and require reorientation. Contrary to these predictions, spatial updating performance was quite good, and was unaffected by increasing path length in all three angled environments (square, rectangular and trapezoidal; see Figure 4). This is in stark contrast to performance in the circular room, where errors increased with increasing path length. Participants were certainly using path integration to stay oriented when performing the task. Otherwise, performance would have been completely predicted by room rotational symmetry (like the reorientation experiment discussed above in Section 4.1). Participants were also certainly using room shape cues, when available. Otherwise, pointing errors in all environments would have increased with increasing path length, as they did in the circular room. To explain these results, we draw on previous work showing that long-term spatial memories are represented with respect to a small number of reference directions (see Section 2). Of particular relevance, Mou et al. (2007) showed that reference directions often correspond to an axis of environmental symmetry. Based on this finding, we believe that participants in the spatial updating task represented each environment (including the room itself and the to-be-remembered locations within the room) with respect to a reference direction, coincident with an environmental symmetry axis. Perceived self-position was updated with respect to this reference direction (see Cheng & Gallistel, 2005, for a similar interpretation of experiments on reorientation by rats). In the circular room, any error in estimating self-position relative to the reference direction directly resulted in pointing error, because the environment itself offered no information to help participants constrain their estimates of the orientation of the reference direction. However, geometric cues in the three angled environments at least partially defined the reference direction, which we believe corresponded to an environmental symmetry axis. For example, the square environment defined the symmetry axis within +/- 45°. If errors in perceived heading ever exceeded this +/- 45° threshold, then participants would have mistaken a neighboring symmetry axis for the selected reference direction. The rectangular and trapezoidal environments were even more forgiving, as the environmental geometries defined those symmetry axes within +/- 90° and +/- 180°, respectively. Furthermore, participants in the angled environments could use the environmental geometry to reduce heading errors during locomotion, thereby preventing those errors from exceeding the threshold allowed by a given rotational symmetry. The experiments described in this section demonstrate how ambiguous environmental cues and noisy self-motion cues can be combined to allow for successful spatial orientation. During normal navigation, we typically have information from multiple sources, all of which may be imperfect indicators of self-position.
By combining those information sources, we can stay oriented with respect to the remembered environment, a crucial step toward successful navigation.
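To make this combination concrete, the short simulation below (our illustration, not the authors' experimental model; the noise level and the snap-to-axis correction rule are assumptions) accumulates heading error over successive path segments, as in pure path integration, and lets room geometry reset the error whenever it stays within the tolerance implied by the room's rotational symmetry (±45° for a square, ±90° for a rectangle, ±180° for a trapezoid); in a circular room no such correction is available, so error grows with path length.

import math
import random

def heading_error_after_path(n_segments, symmetry_fold, turn_noise_deg=10.0, seed=None):
    """Accumulate heading error over an outbound path; after each segment, let the
    room geometry reset the error if it is still within the tolerance implied by
    the room's rotational symmetry (illustrative noise level and correction rule)."""
    rng = random.Random(seed)
    # Heading error tolerated before a neighbouring symmetry axis is mistaken for
    # the reference direction: 180 deg / fold; a circular room offers no cue.
    tolerance = 180.0 / symmetry_fold if math.isfinite(symmetry_fold) else None
    error, lost = 0.0, False
    for _ in range(n_segments):
        error += rng.gauss(0.0, turn_noise_deg)       # noisy path integration
        if tolerance is not None and not lost:
            if abs(error) <= tolerance:
                error = 0.0                           # geometry corrects the estimate
            else:
                lost = True                           # wrong axis adopted; no further correction
    return abs(error)

if __name__ == "__main__":
    rooms = {"circular": math.inf, "square": 4, "rectangular": 2, "trapezoidal": 1}
    for name, fold in rooms.items():
        errors = [heading_error_after_path(6, fold, seed=i) for i in range(1000)]
        print(f"{name:12s} mean |heading error| after 6 segments: {sum(errors) / len(errors):5.1f} deg")

Averaged over many simulated walks, only the circular room shows error that keeps growing with the number of segments, qualitatively mirroring the pattern in Figure 4.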
Fig. 4. Pointing error in a spatial updating task as a function of walked path length, plotted separately for the four different surrounding environments. Pointing errors increased with increased walking distance in the round room. In comparison, performance was unaffected by path length in the square, rectangular, and trapezoidal rooms.
5 Summary and Conclusions

Although sensorimotor and long-term spatial memories have traditionally been researched separately, the current overview indicates that a complete description of navigation will depend on a better understanding of how these spatial representations are coordinated to achieve an accurate sense of spatial orientation. This chapter has reviewed the evidence that long-term spatial memories are orientation-dependent, and that the selection of reference directions depends on egocentric experiences within the environment as well as environmentally defined structures, such as intrinsic and extrinsic axes. Environmental symmetry axes are particularly salient cues shown to influence reference frame selection (Mou et al., 2007). Furthermore, the sensorimotor representation can access this long-term representation under certain circumstances. In the experiment by Kelly et al. (2007), the sensorimotor representation of objects from a previously experienced environment could be reified even though participants never actually viewed the represented objects again. The environmental context allowed participants to retrieve object locations from long-term memory and rebuild their sensorimotor representations of those retrieved objects. Building up the sensorimotor representation through retrieval of information stored in long-term memory is necessary when navigating toward unseen goal locations. Furthermore, the sensorimotor representation is likely to be partially responsible for generating and adding to the long-term representation. By keeping track of one's movements through a new environment, new objects contained in the sensorimotor representation (i.e., novel objects
in the visual field) can be added to the long-term, environment-centered spatial memory. However, the nature of these interactions between long-term and sensorimotor spatial memories remains poorly understood, and warrants further research. The experiments on spatial orientation presented in Section 4 represent a step toward understanding this interaction between sensorimotor and long-term representations. Participants in those experiments are believed to have monitored self-position and orientation relative to the reference direction used to structure the long-term memory of the environment, and the selected reference direction most likely corresponded to an axis of environmental symmetry. Path integration helped participants keep track of the selected reference direction and avoid confusion with neighboring symmetry axes. This conclusion underscores the importance of the reference directions used in long-term memory, not just for retrieving inter-object relationships, but also for staying oriented within remembered spaces and updating those spaces during self-motion. A more complete understanding of spatial orientation should be informed by further studies of the interaction between long-term spatial memory, sensorimotor spatial memory, and path integration.
References 1. Avraamides, M.N., Kelly, J.W.: Imagined perspective-changing within and across novel environments. In: Freksa, C., Nebel, B., Knauff, M., Krieg-Brückner, B. (eds.) Spatial Cognition IV. LNCS (LNAI), pp. 245–258. Springer, Berlin (2005) 2. Avraamides, M.N., Kelly, J.W.: Multiple systems of spatial memory and action. Cogntive Processing 9, 93–106 (2008) 3. Bridgeman, B., Lewis, S., Heit, G., Nagle, M.: Relation between cognitive and motororiented systems of visual position perception. Journal of Experimental Psychology: Human Perception and Performance 5, 692–700 (1979) 4. Bülthoff, H.H., Edelman, S.: Psychophysical support for a two-dimensional view interpolation theory of object recognition. Proceedings of the National Academy of Sciences 89(1), 60–64 (1992) 5. Cheng, K.: A purely geometric module in the rat’s spatial representation. Cognition 23, 149–178 (1986) 6. Cheng, K., Gallistel, C.R.: Shape parameters explain data from spatial transformations: Comment on Pearce et al (2004) and Tommasi and Polli (2004). Journal of Experimental Psychology: Animal Behavior Processes 31(2), 254–259 (2005) 7. Cheng, K., Newcombe, N.S.: Is there a geometric module for spatial orientation? Squaring theory and evidence. Psychonomic Bulletin & Review 12(1), 1–23 (2005) 8. Diwadkar, V.A., McNamara, T.P.: Viewpoint dependence in scene recognition. Psychological Science 8(4), 302–307 (1997) 9. Gallistel, C.R.: The Organization of Action: A New Synthesis. Erlbaum, Hillsdale (1980) 10. Gallistel, C.R.: The Organization of Learning. MIT Press, Cambridge (1990) 11. Hermer, L., Spelke, E.S.: A geometric process for spatial reorientation in young children. Nature 370, 57–59 (1994) 12. Hermer, L., Spelke, E.S.: Modularity and development: The case of spatial reorientation. Cognition 61(3), 195–232 (1996) 13. Hintzman, D.L., O’Dell, C.S., Arndt, D.R.: Orientation in cognitive maps. Cognitive Psychology 13, 149–206 (1981)
14. Hodgson, E., Waller, D.: Lack of set size effects in spatial updating: Evidence for offline updating. Journal of Experimental Psychology: Learning, Memory, & Cognition 32, 854– 866 (2006) 15. Jonsson, E.: Inner Navigation: Why we Get Lost in the World and How we Find Our Way. Scribner, New York (2002) 16. Kelly, J.W., Avraamides, M.N., Loomis, J.M.: Sensorimotor alignment effects in the learning environment and in novel environments. Journal of Experimental Psychology: Learning, Memory & Cognition 33(6), 1092–1107 (2007) 17. Kelly, J.W., McNamara, T.P.: Spatial memories of virtual environments: How egocentric experience, intrinsic structure, and extrinsic structure interact. Psychonomic Bulletin & Review 15(2), 322–327 (2008) 18. Klatzky, R.L.: Allocentric and egocentric spatial representations: Definitions, distinctions, and interconnections. In: Freksa, C., Habel, C., Wender, K.F. (eds.) Spatial Cognition, pp. 1–17. Springer, Berlin (1998) 19. Klatzky, R.L., Loomis, J.M., Golledge, R.G., Cicinelli, J.G., Doherty, S., Pellegrino, J.W.: Acquisition of route and survey knowledge in the absence of vision. Journal of Motor Behavior 22(1), 19–43 (1990) 20. Loomis, J.M., Klatzky, R.L., Golledge, R.G., Philbeck, J.W.: Human navigation by path integration. In: Golledge, R.G. (ed.) Wayfinding: Cognitive mapping and other spatial processes, pp. 125–151. Johns Hopkins, Baltimore (1999) 21. May, M.: Imaginal perspective switches in remembered environments: Transformation versus interference accounts. Cognitive Psychology 48, 163–206 (2004) 22. McNamara, T.P.: How are the locations of objects in the environment represented in memory? In: Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.) Spatial cognition III. LNCS (LNAI), pp. 174–191. Springer, Berlin (2003) 23. McNamara, T.P., Rump, B., Werner, S.: Egocentric and geocentric frames of reference in memory of large-scale space. Psychonomic Bulletin & Review 10(3), 589–595 (2003) 24. Milner, A.D., Goodale, M.A.: The visual brain in action. Oxford University Press, Oxford (1995) 25. Montello, D.R.: Spatial orientation and the angularity of urban routes: A field study. Environment and Behavior 23(1), 47–69 (1991) 26. Mou, W., McNamara, T.P.: Intrinsic frames of reference in spatial memory. Journal of Experimental Psychology: Learning, Memory, and Cognition 28(1), 162–170 (2002) 27. Mou, W., McNamara, T.P., Valiquette, C.M., Rump, B.: Allocentric and egocentric updating of spatial memories. Journal of Experimental Psychology: Learning, Memory, and Cognition 30(1), 142–157 (2004) 28. Mou, W., Zhao, M., McNamara, T.P.: Layout geometry in the selection of intrinsic frames of reference from multiple viewpoints. Journal of Experimental Psychology: Learning, Memory, and Cognition 33, 145–154 (2007) 29. Presson, C.C.: The development of spatial cognition: Secondary uses of spatial information. In: Eisenberg, N. (ed.) Contemporary Topics in Developmental Psychology, pp. 87– 112. Wiley, New York (1987) 30. Presson, C.C., Montello, D.R.: Updating after rotational and translational body movements: Coordinate structure of perspective space. Perception 23, 1447–1455 (1994) 31. Rieser, J.J.: Access to knowledge of spatial structure at novel points of observation. Journal of Experimental Psychology: Learning, Memory, and Cognition 15(6), 1157–1165 (1989) 32. Rieser, J.J., Rider, E.A.: Young children’s spatial orientation with respect to multiple targets when walking without vision. Developmental Psychology 27(1), 97–107 (1991)
33. Roskos-Ewoldsen, B., McNamara, T.P., Shelton, A.L., Carr, W.: Mental representations of large and small spatial layouts are orientation dependent. Journal of Experimental Psychology: Learning, Memory, and Cognition 24(1), 215–226 (1999) 34. Rump, B., McNamara, T.P.: Updating in models of spatial memory. In: Barkowsky, T., Knauff, M., Montello, D.R. (eds.) Spatial cognition V. LNCS (LNAI), pp. 249–269. Springer, Berlin (2007) 35. Schneider, G.E.: Two visual systems. Science 163, 895–902 (1969) 36. Shelton, A.L., McNamara, T.P.: Multiple views of spatial memory. Psychonomic Bulletin & Review 4(1), 102–106 (1997) 37. Shelton, A.L., McNamara, T.P.: Systems of spatial reference in human memory. Cognitive Psychology 43(4), 274–310 (2001) 38. Shelton, A.L., McNamara, T.P.: Orientation and perspective dependence in route and survey learning. Journal of Experimental Psychology: Learning, Memory, and Cognition 30, 158–170 (2004) 39. Sholl, M.J.: Cognitive maps as orienting schemata. Journal of Experimental Psychology: Learning, Memory, and Cognition 13(4), 615–628 (1987) 40. Sholl, M.J.: The role of a self-reference system in spatial navigation. In: Montello, D. (ed.) Spatial information theory: Foundations of geographic information science, pp. 217–232. Springer, Berlin (2001) 41. Valiquette, C., McNamara, T.P.: Different mental representations for place recognition and goal localization. Psychonomic Bulletin & Review 14(4), 676–680 (2007) 42. Waller, D., Hodgson, E.: Transient and enduring spatial representations under disorientation and self-rotation. Journal of Experimental Psychology: Learning, Memory, & Cognition 32, 867–882 (2006) 43. Wang, R.F., Crowell, J.A., Simons, D.J., Irwin, D.E., Kramer, A.F., Ambinder, M.S., Thomas, L.E., Gosney, J.L., Levinthal, B.R., Hsieh, B.B.: Spatial updating relies on an egocentric representation of space: Effects of the number of objects. Psychonomic Bulletin & Review 13, 281–286 (2006) 44. Wang, R.F., Spelke, E.S.: Updating egocentric representations in human navigation. Cognition 77, 215–250 (2000) 45. Werner, S., Schmidt, K.: Environmental reference systems for large-scale spaces. Spatial Cognition and Computation 1(4), 447–473 (1999)
Map-Based Spatial Navigation: A Cortical Column Model for Action Planning

Louis-Emmanuel Martinet (1,2,3), Jean-Baptiste Passot (2,3), Benjamin Fouque (1,2,3), Jean-Arcady Meyer (1), and Angelo Arleo (2,3)

(1) UPMC Univ Paris 6, FRE2507, ISIR, F-75016 Paris, France
(2) UPMC Univ Paris 6, UMR 7102, F-75005 Paris, France
(3) CNRS, UMR 7102, F-75005 Paris, France
[email protected]
Abstract. We modelled the cortical columnar organisation to design a neuromimetic architecture for topological spatial learning and action planning. Here, we first introduce the biological constraints and the hypotheses upon which our model was based. Then, we describe the learning architecture, and we provide a series of numerical simulation results. The system was validated on a classical spatial learning task, the Tolman & Honzik’s detour protocol, which enabled us to assess the ability of the model to build topological representations suitable for spatial planning, and to use them to perform flexible goal-directed behaviour (e.g., to predict the outcome of alternative trajectories avoiding dynamically blocked pathways). We show that the model reproduced the navigation performance of rodents in terms of goal-directed path selection. In addition, we present a series of statistical and information theoretic analyses to study the neural coding properties of the learnt space representations. Keywords: spatial navigation, topological map, trajectory planning, cortical column, hippocampal formation.
1 Introduction
Spatial cognition calls upon the ability to learn neural representations of the spatio-temporal properties of the environment, and to employ them to achieve goal-oriented navigation. Similar to other high-level functions, spatial cognition involves parallel information processing mediated by a network of brain structures that interact to promote effective spatial behaviour [1,2]. An extensive body of experimental work has investigated the neural bases of spatial cognition, and a significant amount of evidence points towards a prominent role of the hippocampal formation (see [1] for recent reviews). This limbic region has been thought to mediate spatial learning functions ever since location-selective neurones — namely hippocampal place cells [3], and entorhinal grid cells [4] — and orientation-selective neurones — namely head-direction cells [5] — were found by means of electrophysiological recordings from freely moving rats. Hippocampal place cells, grid cells, and head-direction cells are likely to subserve spatial representations in allocentric (i.e., world centred) coordinates, thus
providing cognitive maps [3] to support spatial behaviour. Yet, to perform flexible navigation (i.e., to plan detours and/or shortcuts) two other components are necessary: goal representation, and target-dependent action sequence planning [6]. The role of the hippocampal formation in these two mechanisms remains unclear. On the one hand, the hippocampus has been proposed to encode topological-like representations suitable for action sequence learning [6]. This hypothesis mainly relies on the recurrent dynamics generated by the CA3 collaterals of the hippocampus [7]. On the other hand, the hippocampal space code is likely to be highly redundant and distributed [8], which does not seem adequate for learning compact topological representations of high-dimensional spatial contexts. Also, the experimental evidence for high-level spatial representations mediated by a network of neocortical areas (e.g., the posterior parietal cortex [9], and the prefrontal cortex [10]) suggests the existence of an extra-hippocampal action planning system shared among multiple brain regions [11]. This hypothesis postulates a distributed spatial cognition system in which (i) the hippocampus would take part to the action planning process by conveying redundant (and robust) spatial representations to higher associative areas, (ii) a cortical network would elaborate more abstract and compact representations of the spatial context (accounting for motivation-dependent memories, action cost/risk constraints, and temporal sequences of goal-directed behavioural responses). Among the cortical areas involved in map building and action planning, the prefrontal cortex (PFC) may play a central role, as suggested by anatomical PFC lesion studies showing impaired navigation planning in rats [12]. Also, the anatomo-functional properties of the PFC seem appropriate to encode abstract contextual memories not merely based on spatial correlates. The PFC receives direct projections from sub-cortical structures (e.g., the hippocampus [13], the amygdala [14], and the ventral tegmental area [15]), and indirect connections from the basal ganglia through the basal ganglia - thalamocortical loops [16]. These projections provide the PFC with a multidimensional context, including emotional and motivational inputs [17], reward-dependent modulation [18], and action-related signals [16]. The PFC seems then well suited to (i) process manifold spatial information [19], (ii) encode the motivational values associated to spatio-temporal events [6], and (iii) perform supra-modal decisions [20]. Also, the PFC may be involved in integrating events in the temporal domain at multiple time scales [21]. The PFC recurrent dynamics regulated by the modulatory action of dopaminergic afferents [22] may permit to maintain patterns of activity over long time scales. Finally, the PFC is likely to be critical to detecting cross-temporal contingencies, which is relevant to the temporal organisation of behavioural responses, and to the encoding of retrospective and prospective memories [21]. 1.1
Cortical Columnar Organisation: A Computational Principle?
The existence of cortical columns was first reported by Mountcastle [23], who observed chains of cortical neurones reacting to the same external stimuli simultaneously. Cortical columns can be divided in six main layers including: layer I, mostly containing axons and dendrites; layer IV, receiving sensory inputs from
sub-cortical structures (mainly the thalamus); and layer VI, sending outputs to sub-cortical brain areas (e.g., to the striatum and the thalamus). Layers II-III and V-VI constitute the so called supragranular and infragranular layers, respectively. The anatomo-functional properties of cortical columns have been widely investigated [24]. Neuroanatomical findings have indicated that columns can be divided into several minicolumns, each of which is composed of a population of interconnected neurones [25]. Thus, a column can be seen as an ensemble of interrelated minicolumns receiving inputs from cortical areas and other structures. It processes these afferent signals and projects the responses both within and outside the cortical network. This twofold columnar organisation has been suggested to subserve efficient computation and information processing [24]. 1.2
Related Work
This paper presents a neuromimetic model of action planning inspired by the columnar organisation of the mammalian neocortex. Planning is defined here as the ability, given a state space S and an action space A, to "mentally" explore the S × A space to infer an appropriate sequence of actions leading to a goal state sg ∈ S. This definition calls upon the capability of (i) predicting the consequences of actions, i.e. the most likely state s' ∈ S to be reached when an action a ∈ A is executed from a state s ∈ S, and (ii) evaluating the effectiveness of the selected plan on-line. The model generates a topological representation of the environment, and it employs an activation-diffusion mechanism [26] to plan goal-directed trajectories. The activation-diffusion process is based on the propagation of a reward-dependent activity signal from the goal state sg through the entire topological network. This propagation process enables the system to generate action sequences (i.e., trajectories) from the current state s towards sg. Topological map learning and path planning have been extensively studied in biomimetic robotics (see [27] for a review). Here we focus on model architectures that take inspiration from the anatomical organisation of the cortex, and implement an activation-diffusion planning principle. Burnod [28] proposed one of the first models of the cortical column architecture, called "cortical automaton". He also described a "call tree" process that can be seen as a neuromimetic implementation of the activation-diffusion principle. Several action selection models were inspired by Burnod's hypothesis. Some of these works employed the cortical automaton concept explicitly [29,30,31]. Others used either connectionist architectures [32,33,34] or Markov decision processes [35]. Yet, none of these works took into account the multilevel coding property offered by the possibility to refine the cortical organisation by adding a sublevel to the column, i.e. the minicolumn. The topological representation presented here exploits this idea by associating the columnar level to a compact representation of the environment, and by employing the minicolumn level to characterise the agent behaviour. In order to validate the model, we have implemented it on a simulated robotic platform, and tested it on the classical Tolman & Honzik's navigation task [36]. This protocol allowed us to assess the ability of the system to learn topological
representations, and to exploit them to perform flexible goal-directed behaviour (e.g., planning detours).
2 Methods

2.1 Single Neurone Model
The elementary computational units of the model are artificial firing-rate neurones i, whose mean discharge r_i ∈ [0, 1] is given by

r_i(t) = f(V_i(t)) · (1 ± η)   (1)

where V_i(t) is the membrane potential at time t, f is the transfer function, and η is a random noise uniformly drawn from [0, 0.01]. V_i varies according to

τ_i · dV_i(t)/dt = −V_i(t) + I_i(t)   (2)

where τ_i = 10 ms is the membrane time constant, and I_i(t) is the synaptic drive generated by all the inputs. Eq. 2 is integrated by using a time step Δt = 1 ms. Both the synaptic drive I_i(t) and the transfer function f are characteristic of the different types of model units, and they will be defined thereafter.
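As an illustration of how such a unit can be simulated, the sketch below integrates Eq. 2 with a forward Euler step and reads the rate out through Eq. 1 using an identity transfer function; the constant input drive, the clamping of the rate to [0, 1], and the use of (1 + η) rather than (1 ± η) are illustrative assumptions, not part of the original specification.

import random

TAU = 10.0   # membrane time constant (ms), as stated above
DT = 1.0     # integration time step (ms), as stated above

def step(V, I, rng, transfer=lambda x: x):
    """One forward Euler step of Eq. 2, followed by the rate read-out of Eq. 1."""
    V = V + (DT / TAU) * (-V + I)                       # tau * dV/dt = -V + I
    eta = rng.uniform(0.0, 0.01)                        # uniform noise term
    r = min(max(transfer(V) * (1.0 + eta), 0.0), 1.0)   # keep the rate in [0, 1]
    return V, r

if __name__ == "__main__":
    rng = random.Random(0)
    V, r = 0.0, 0.0
    for _ in range(50):                                 # 50 ms of a constant drive I = 0.8
        V, r = step(V, 0.8, rng)
    print(f"V = {V:.3f}, r = {r:.3f}")                  # V relaxes towards the drive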
2.2 Encoding Space and Actions: Minicolumn and Column Model
The main inputs to the cortical model are the location- and orientation-selective activities of hippocampal place and head-direction cells, respectively [3,5]. The hippocampal place field representation is built incrementally as the simulated animal (i.e., the animat) explores the environment, and it provides the system with a continuous, distributed and redundant state representation S [37,38]. A major objective of the cortical model was to build a compact state-action representation S × A suitable for topological map learning and action planning. In the model, the basic component of the columnar organisation is the minicolumn (vertical grey regions in Fig. 1). An unsupervised learning scheme (Sec. 2.3) makes the activity of each minicolumn selective to a specific state-action pair (s, a) ∈ S × A. Notice that a given action a ∈ A represents the allocentric motion direction of the animat when it performs the transition between two locations s, s' ∈ S. According to the learning algorithm, all the minicolumns selective for the same spatial location s ∈ S are grouped to form a higher-level computational unit, i.e. the column (see c and c' in Fig. 1A). This architecture is inspired by biological data showing that minicolumns inside a column have similar selectivity properties [39]. Thus, columns consist of a set of minicolumns that are incrementally recruited to encode all the state-action pairs (s, a_1···N) ∈ S × A experienced by the animat at a location s. During planning (Sec. 2.4), all the minicolumns of a column compete with each other to locally infer the most appropriate goal-directed action.
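A minimal data-structure sketch of this organisation is given below (our reading of the architecture, not the authors' implementation): a column stands for one place s, and each of its minicolumns stands for one allocentric motion direction a experienced from that place; all names and fields are ours.

from dataclasses import dataclass, field
from typing import Dict

@dataclass
class Minicolumn:
    action: float            # allocentric motion direction (deg) this minicolumn encodes
    sl_rate: float = 0.0     # supragranular-layer activity (carries the goal signal)
    il_rate: float = 0.0     # infragranular-layer activity (carries the path signal)

@dataclass
class Column:
    place_id: int                                        # the location s this column is selective for
    minicolumns: Dict[float, Minicolumn] = field(default_factory=dict)

    def recruit(self, action: float) -> Minicolumn:
        """Add a minicolumn for a newly experienced transition direction, if absent."""
        return self.minicolumns.setdefault(action, Minicolumn(action))

# Usage: a column for place 0 whose minicolumns encode transitions towards 0 deg and 90 deg.
column = Column(place_id=0)
column.recruit(0.0)
column.recruit(90.0)
print(sorted(column.minicolumns))   # [0.0, 90.0]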
Fig. 1. The cortical model and the implementation of the activation-diffusion process. (A) Columns (c and c') consist of sets of minicolumns (vertical grey regions), each of which contains a supragranular (SL) and an infragranular (IL) layer unit. (B) Top: back-propagation of the motivational signal through the network of SL neurones. Bottom: forward-propagation of the goal-directed action signal through the IL neurones.
Every minicolumn of the model consists of two computational units, representing supragranular layer (SL) and infragranular layer (IL) neurones (Fig. 1A). The discharge of SL and IL units simulates the mean firing activity of a population of cortical neurones in layers II-III, and V-VI, respectively. Each minicolumn receives three different sets of afferent projections (Fig. 1A): (i) Hippocampal inputs conveying allocentric space coding signals converge onto IL neurones; these connections are plastic, and their synaptic efficacy is determined by the weight distribution w^h (all the synaptic weights of the model are within the maximum range of [0, 1]). (ii) Collateral afferents from adjacent cortical columns converge onto SL and IL neurones via the projections w^u and w^l, respectively. These lateral connections are learnt incrementally (Sec. 2.3), and play a prominent role in both encoding the environment topology and implementing the activation-diffusion planning mechanism. (iii) SL neurones receive projections w^m conveying motivation-dependent signals. As shown in Sec. 2.4, this input is employed to relate the activity of a minicolumn to goal locations.

SL neurones discharge as a function of the motivational signals mediated by the w^u and w^m projections. The synaptic drive I_i(t) depolarising an SL neurone i that belongs to a column c is given by

I_i(t) = max_{i' ∈ c' ≠ c} ( w^u_{ii'} · r_{i'}(t) ) + w^m_i · r^m   (3)

where i' indexes other SL neurones of the cortical network; w^m_i and r^m are the weight and the intensity of the motivational signal, respectively. In the current version of the model the motivational input is generated algorithmically, i.e. w^m_i = 1 if column c is associated to the goal location, w^m_i = 0 otherwise, and the motivational signal r^m = 1. The membrane potential of unit i is then computed according to Eq. 2, and its firing rate r_i(t) is obtained by means of an identity transfer function f.

Within each minicolumn, SL neurones project onto IL units by means of non-plastic projections w^c (Fig. 1A). Thus, IL neurones are driven by hippocampal place (HP) cells h (via the projections w^h), IL neurones belonging to adjacent columns (via the collaterals w^l), and SL units i (via w^c). The synaptic drive of an IL neurone j ∈ c is

I_j(t) = max( max_{h ∈ HP} ( w^h_{jh} · r_h(t) ), max_{j' ∈ c' ≠ c} ( w^l_{jj'} · r_{j'}(t) ) ) + w^c_{ji} · r_i(t)   (4)

where j' indicates other IL neurones of the network; w^c_{ji} = 1 if the SL neurone i and the IL neurone j belong to the same minicolumn, w^c_{ji} = 0 otherwise. The membrane potential V_j(t) is computed by Eq. 2, and a sigmoidal transfer function f is employed to calculate r_j(t). The parameters of the transfer function change online to adapt the electroresponsiveness properties of IL neurones j to the strength of their inputs [40].
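The two synaptic drives can be written compactly as in the sketch below, which follows our reconstruction of Eqs. 3-4 (a max over afferents of each type, with the motivational or SL input added on top); the weight dictionaries and toy values are assumptions made purely for illustration.

def sl_drive(w_u, sl_rates, w_m, r_m):
    """Eq. 3 (as reconstructed): drive of an SL unit from other columns' SL units
    plus the motivational input."""
    lateral = max((w * sl_rates[i] for i, w in w_u.items()), default=0.0)
    return lateral + w_m * r_m

def il_drive(w_h, hp_rates, w_l, il_rates, w_c, r_sl):
    """Eq. 4 (as reconstructed): drive of an IL unit from hippocampal place cells
    or neighbouring IL units, combined with the SL unit of its own minicolumn."""
    hippocampal = max((w * hp_rates[h] for h, w in w_h.items()), default=0.0)
    lateral = max((w * il_rates[j] for j, w in w_l.items()), default=0.0)
    return max(hippocampal, lateral) + w_c * r_sl

# Toy usage with made-up afferents and weights.
print(sl_drive({"i1": 0.9}, {"i1": 0.5}, w_m=1.0, r_m=1.0))           # 0.45 + 1.0 = 1.45
print(il_drive({"h1": 0.8}, {"h1": 0.7},
               {"j1": 0.9}, {"j1": 0.2}, w_c=1.0, r_sl=0.6))          # max(0.56, 0.18) + 0.6 = 1.16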
2.3 Unsupervised Growing Network Scheme for Topological Map Learning
The topological representation is built incrementally as the animat explores the environment. At each location visited by the agent at time t the cortical network is updated if-and-only-if the infragranular layers of all existing minicolumns remain silent, i.e. Σ_j H(r_j(t) − ρ) = 0, where j indexes all the IL neurones, H is the Heaviside function (i.e., H(x) = 1 if x ≥ 0, H(x) = 0 otherwise), and ρ = 0.1 (see [38] for a similar algorithmic implementation of novelty detection in the hippocampal activity space). If at time t the novelty condition holds, a new group of minicolumns (i.e., a new column c) is recruited to become selective to the new place. Then, all the simultaneously active place cells h ∈ HP are connected to the new IL units j ∈ c. Weights w^h_{jh} are initialised according to

w^h_{jh} = H(r_h − ρ) · r_h   (5)

For t' > t, the synaptic strength of these connections is changed by unsupervised Hebbian learning combined with a winner-take-all scheme. Let c be the column selective for the position visited by the animat at time t', i.e. let all the j ∈ c be the most active IL units of the network at time t'. Then

Δw^h_{jh} = α · r_h · (r_j − w^h_{jh})   (6)

with α = 0.005. Whenever a state transition occurs, the collateral projections w^l and w^u are updated to relate the minicolumn activity to the state-action space S × A. For instance, let columns c and c' denote the animat position
before and after a state transition, respectively (Fig. 1A). A minicolumn θ ∈ c becomes selective for the locomotion orientation taken by the animat to perform the transition. A new set of projections w^l_{jj'} is then established from the IL unit j ∈ θ of column c to all the IL units j' of the column c'. In addition, at the supragranular level, a new set of connections w^u_{i'i} is learnt to connect all the SL units of column c', i.e. i' ∈ c', to the SL unit i of the minicolumn θ ∈ c. The strengths of the lateral projections are initialised as

w^l_{jj'} = w^u_{i'i} = β_LTP   ∀ i', j' ∈ c'   (7)

with β_LTP = 0.9. Finally, in order to adapt the topological representation online, a synaptic potentiation-depression mechanism can modify the lateral projections w^l and w^u. For example, if a new obstacle prevents the animat from achieving a previously learnt transition from column c to c' (i.e., if the activation of the IL unit j ∈ θ ∈ c is not followed in time by the activation of all IL units j' ∈ c'), then a depression of the w^l_{jj'} synaptic efficacy occurs:

Δw^l_{jj'} = −β_LTD · w^l_{jj'}   ∀ j' ∈ c'   (8)

where β_LTD = 0.5. The projections w^u_{i'i} are updated in a similar manner. A compensatory potentiation mechanism reinforces both w^l and w^u connections whenever a previously experienced transition is performed successfully:

Δw^l_{jj'} = β_LTP − w^l_{jj'}   ∀ j' ∈ c'   (9)

The projections w^u_{i'i} are updated similarly. Notice that w^l, w^u ∈ [0, β_LTP].
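The learning rules of this section can be condensed into a few functions; the sketch below (illustrative, not the authors' code) covers novelty-driven column recruitment with Eq. 5, the Hebbian/winner-take-all update of Eq. 6, and the initialisation, depression, and potentiation of lateral links in Eqs. 7-9.

RHO, ALPHA, BETA_LTP, BETA_LTD = 0.1, 0.005, 0.9, 0.5

def heaviside(x):
    return 1.0 if x >= 0 else 0.0

def maybe_recruit_column(columns, il_rates, place_rates):
    """Recruit a new column when every existing IL unit is silent (novelty condition),
    initialising its hippocampal weights with Eq. 5."""
    if sum(heaviside(r - RHO) for r in il_rates) == 0:
        w_h = {h: heaviside(r - RHO) * r for h, r in place_rates.items()}
        columns.append({"w_h": w_h, "w_l": {}})
    return columns

def hebbian_update(w_h, place_rates, il_rate):
    """Eq. 6: Hebbian / winner-take-all update of the winning column's hippocampal weights."""
    for h, r_h in place_rates.items():
        w_h[h] = w_h.get(h, 0.0) + ALPHA * r_h * (il_rate - w_h.get(h, 0.0))

def update_lateral(w_l, target, success):
    """Eqs. 7-9: create, potentiate, or depress the lateral link towards column `target`."""
    if target not in w_l:
        w_l[target] = BETA_LTP                   # Eq. 7: initialisation of a new transition
    elif success:
        w_l[target] += BETA_LTP - w_l[target]    # Eq. 9: compensatory potentiation
    else:
        w_l[target] -= BETA_LTD * w_l[target]    # Eq. 8: depression after a blocked transition

# Toy usage: one novel place, one Hebbian step, one successful transition.
columns = maybe_recruit_column([], il_rates=[0.02, 0.05], place_rates={"h1": 0.8, "h2": 0.05})
hebbian_update(columns[0]["w_h"], {"h1": 0.8, "h2": 0.05}, il_rate=1.0)
update_lateral(columns[0]["w_l"], target=1, success=True)
print(columns[0]["w_l"])   # {1: 0.9}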
2.4 Action Planning
The model presented here aims at developing a high-level controller determining the spatial behaviour based on action planning. Yet, a low-level reactive module subserves the obstacle-avoidance behaviour. Whenever the proximity sensors detect an obstacle, the reactive module takes control and prevents collisions. Also, the simulated animal behaves in order to either follow planned pathways (i.e., exploitation) or improve the topological map (i.e., exploration). This exploitation-exploration tradeoff is governed by an ε-greedy selection mechanism, with ε ∈ [0, 1] decreasing exponentially over time [38].

Fig. 1B shows an example of the activation-diffusion process mediated by the columnar network. During trajectory planning, the SL neurones of the column corresponding to the goal location s_g are activated via a motivational signal r^m (Eq. 3). Then, the SL activity is back-propagated through the network by means of the lateral projections w^u (Fig. 1B, top). During planning, the responsiveness of IL neurones (Eq. 4) is decreased to detect coincident inputs. In particular, the occurrence of the SL input r_i is a necessary condition for an IL neurone j to fire. In the presence of the SL input r_i, either the hippocampal signal r_h or the inter-column signal r_{j'} is sufficient to activate the IL unit j. When the back-propagated goal signal reaches the minicolumns selective for the current position s, this coincidence event occurs, which triggers the forward propagation of a goal-directed path signal through the projections w^l (Fig. 1B, bottom). Goal-directed trajectories are generated by reading out the successive activations of IL neurones. Action selection calls upon a competition between the minicolumns encoding the (s, a_1···N) ∈ S × A pairs, where s is the current location, and a_1···N are the transitions from s to adjacent positions s'. For the sake of robustness, competition occurs over a 10-timestep cycle. Notice that each SL synaptic relay attenuates the goal signal by a factor w^u_{ii'} (Eq. 3). Thus, the smaller the number of synaptic relays, the stronger the goal signal received by the SL neurone corresponding to the current location s. As a consequence, because the model column receptive fields are distributed rather uniformly over the environment, the intensity of the goal signal at a given location s is correlated to the distance between s and the target position s_g.
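Stripped of its neural substrate, the activation-diffusion process amounts to spreading a decaying goal signal over the learnt graph of columns and greedily following its gradient; the sketch below implements that abstraction (the graph, the single attenuation factor, and the read-out rule are our simplifications of the mechanism described above, not the neural implementation).

from collections import deque

def diffuse_goal_signal(neighbours, goal, attenuation=0.9):
    """Back-propagate a goal signal over the column graph; every relay multiplies
    the signal by `attenuation` (standing in for the w_u weights)."""
    signal = {goal: 1.0}
    frontier = deque([goal])
    while frontier:
        node = frontier.popleft()
        for nxt in neighbours[node]:
            value = signal[node] * attenuation
            if value > signal.get(nxt, 0.0):
                signal[nxt] = value
                frontier.append(nxt)
    return signal

def plan_path(neighbours, start, goal, attenuation=0.9):
    """Greedy read-out of the forward path signal: from `start`, keep stepping to
    the neighbour carrying the strongest goal signal."""
    signal = diffuse_goal_signal(neighbours, goal, attenuation)
    path, node = [start], start
    while node != goal:
        node = max(neighbours[node], key=lambda n: signal.get(n, 0.0))
        path.append(node)
    return path

# Toy topological map: a corridor 0-1-2-3 plus a learnt shortcut between 0 and 3.
graph = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
print(plan_path(graph, start=1, goal=3))   # [1, 0, 3] (two steps; [1, 2, 3] is equally short)

Because each relay attenuates the signal, the read-out naturally prefers the route with the fewest column-to-column transitions, which is what produces the shortest-path behaviour reported in Section 3.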
2.5 Behavioural Task and Simulated Agent

In order to validate our navigation planning system, we chose the classical experimental task proposed by Tolman & Honzik [36]. The main objective of this behavioural protocol was to demonstrate that rodents undergoing a navigation test were able to show some "insights", e.g. to predict the outcome of alternative trajectories leading to a goal location in the presence of blocked pathways. The original Tolman & Honzik's maze is shown in Fig. 2A. It consisted of three narrow alleys of different lengths (Paths 1, 2, and 3) guiding the animals from a starting position (bottom) to a feeder location (top).

Fig. 2. (A) Tolman & Honzik's maze (adapted from [36]). The gate near the second intersection prevented rats from going from right to left. (B) The simulated maze and robot. The dimensions of the simulated maze were taken so as to maintain the proportions of the Tolman & Honzik's setup. Bottom-left inset: the real e-puck mobile robot has a diameter of 70 mm and is 55 mm tall.
Fig. 2B shows a simulated version of the Tolman & Honzik's apparatus, and the simulated robot (the model was implemented by means of the Webots robotics simulation software). We emulated the experimental protocol designed by Tolman & Honzik to assess the animats' navigation performance. The overall protocol consisted of a training period followed by a probe test (restated in code form at the end of this section). Both training and probe trials were stopped when the animat had found the goal.

Training period: it lasted 14 days with 12 trials per day. The animats could explore the maze and learn their navigation policy.
– During Day 1, a series of 3 forced runs was carried out, in which additional doors were used to force the animats to go successively through P1, P2, and P3. Then, during the remaining 9 runs, all additional doors were removed, and the subjects could explore the maze freely. At the end of the first training day, a preference for P1 was expected to be already developed [36].
– From Day 2 to 14, a block was introduced at place A (Fig. 2B) to require a choice between P2 and P3. In fact, additional doors were used to close the entrances to P2 and P3 to force the animats to go first to the Block A. Then, doors were removed, and the subjects were forced to decide between P2 and P3 on their way back to the first intersection. Each day, there were 10 "Block at A" runs that were mixed with 2 non-successive free runs to maintain the preference for P1.

Probe test period: it lasted 1 day (Day 15), and it involved 7 runs with a block at position B to interrupt the common section (Fig. 2B). The animats were forced to decide between P2 and P3 when returning to the first intersection point.

For these experiments, Tolman & Honzik used 10 rats with no previous training. In our simulations, we used a population of 100 animats, and we assessed the statistical significance of the results by means of an ANOVA analysis (the significance threshold was set at 10^−2, i.e. p < 0.01 was considered significant).
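For reference, the schedule can also be written down as data; the sketch below merely restates the protocol above (field names and structure are ours, not part of the original simulation code).

# The simulated Tolman & Honzik schedule restated as data (field names are ours).
PROTOCOL = [
    {"days": [1],          "runs_per_day": 12, "blocks": [],    "note": "3 forced runs (P1, P2, P3), then 9 free runs"},
    {"days": range(2, 15), "runs_per_day": 12, "blocks": ["A"], "note": "10 'Block at A' runs plus 2 non-successive free runs"},
    {"days": [15],         "runs_per_day": 7,  "blocks": ["B"], "note": "probe test with a block at B"},
]

training_runs = sum(len(p["days"]) * p["runs_per_day"] for p in PROTOCOL[:2])
print(training_runs)   # 168 training runs in total, 156 of them after day 1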
2.6 Theoretical Analysis
A series of analyses was done to characterise the neural activities subserving the behavioural responses of the system. We recall that one of the aims of the cortical column model was to build a spatial code less redundant than the hippocampal place (HP) field representation. Yet, it is relevant to show that the spatial properties (e.g., spatial information content) of the neural responses were preserved in the cortical network. The set of stimuli S consisted of the places visited by the animat. For the analyses, the continuous two-dimensional input space was discretized, with each location s ∈ S defined as a 5 x 5 cm square region of the environment. The size of the receptive field of a neurone j was taken as 2·σ_S(j), with σ_S(j) denoting the standard deviation around the mean of the response tuning curve. A spatial density measure was used to assess the level of redundancy of a neural spatial code, i.e. the average number of units necessary to encode a place:

D = ⟨ Σ_{j∈J} H(r_j(s) − σ_J(s)) ⟩_{s∈S}   (10)
with r_j(s) being the response of the neurone j when the animat is visiting the location s ∈ S, and σ_J(s) representing the standard deviation of the population activity distribution for a given stimulus s.

Another measure was used to analyse the neural responses, the kurtosis function. This measure is defined as the normalised fourth central moment of a probability distribution, and estimates its degree of peakedness. If applied to a neural response distribution, the kurtosis can be used to measure its degree of sparseness across both population and time [41]. We employed an average population kurtosis measure k̄_1 = ⟨k_1(s)⟩_{s∈S} to estimate how many neurones j of a population J were, on average, responding to a given stimulus s simultaneously. The kurtosis k_1(s) was taken as:

k_1(s) = ⟨ [(r_j(s) − r̄_J(s)) / σ_J(s)]^4 ⟩_{j∈J}   (11)

with r̄_J(s) = ⟨r_j(s)⟩_{j∈J}. Similarly, an average lifetime kurtosis k̄_2 = ⟨k_2(j)⟩_{j∈J} was employed to assess how rarely a neurone j responded across time. The k_2(j) function was given by:

k_2(j) = ⟨ [(r_j(s) − r̄_j) / σ_j]^4 ⟩_{s∈S}   (12)

with r̄_j = ⟨r_j(s)⟩_{s∈S}, and σ_j being the standard deviation of the cell activity r_j.

Finally, we used an information theoretic analysis [42] to characterise the neural codes of our cortical and hippocampal populations. The mutual information MI(S; R) between neural responses R and spatial locations S was computed:

MI(S; R) = Σ_{s∈S} Σ_{r∈R} P(r, s) log_2 [ P(r, s) / (P(r) P(s)) ]   (13)

where r ∈ R indicated firing rates, P(r, s) the joint probability of having the animat visiting a region s ∈ S while recording a response r, P(s) the a priori probability computed as the ratio between time spent at place s and the total time, and P(r) = Σ_{s∈S} P(r, s) the probability of observing a neural response r. The continuous output space of a neurone, i.e. R = [0, 1], was discretized via a binning procedure (bin-width equal to 0.1). The MI(S; R) measure allowed us to quantify the spatial information content of a neural code, i.e. how much could be learnt about the animat's position s by observing the neural responses r.
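These measures operate directly on a unit-by-location response matrix; the sketch below implements Eqs. 10-13 as reconstructed above (array shapes, the rate binning, and the toy data are illustrative assumptions).

import numpy as np

def spatial_density(R):
    """Eq. 10: average number of units whose response exceeds the population
    standard deviation at each place; R has shape (n_units, n_places)."""
    sigma_J = R.std(axis=0)
    return (R >= sigma_J).sum(axis=0).mean()

def population_kurtosis(R):
    """Eq. 11 averaged over places (k1-bar): fourth moment of the z-scored
    population response at each place."""
    z = (R - R.mean(axis=0)) / R.std(axis=0)
    return (z ** 4).mean(axis=0).mean()

def lifetime_kurtosis(R):
    """Eq. 12 averaged over units (k2-bar): the same statistic computed across places."""
    z = (R - R.mean(axis=1, keepdims=True)) / R.std(axis=1, keepdims=True)
    return (z ** 4).mean(axis=1).mean()

def mutual_information(rates, places, n_bins=10):
    """Eq. 13 for a single unit: MI between the binned firing rate and the visited place."""
    r_bins = np.clip((np.asarray(rates) * n_bins).astype(int), 0, n_bins - 1)
    joint = np.zeros((n_bins, max(places) + 1))
    for r, s in zip(r_bins, places):
        joint[r, s] += 1
    joint /= joint.sum()
    p_r = joint.sum(axis=1, keepdims=True)
    p_s = joint.sum(axis=0, keepdims=True)
    nonzero = joint > 0
    return float((joint[nonzero] * np.log2(joint[nonzero] / (p_r @ p_s)[nonzero])).sum())

# Toy usage: 5 units responding over 20 visited locations.
rng = np.random.default_rng(0)
R = rng.random((5, 20))
print(spatial_density(R), population_kurtosis(R), lifetime_kurtosis(R))
print(mutual_information(R[0], places=list(range(20))))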
3 Results

3.1 Spatial Behaviour
Day 1. During the first 12 training trials, the animats learnt the topology of the maze and planned their navigation trajectory in the absence of both block A and B. Similar to Tolman & Honzik’s findings, our results show that the model learnt to select the shortest goal-directed pathway P1 significantly more frequently than the alternative trajectories P2, P3 (ANOVA, F2,297 = 168.249,
p < 0.0001). The quantitative and qualitative analyses reported in Fig. 3A describe the path selection performance averaged over 100 animats.

Fig. 3. Behavioural results. Top row: mean number of transits through P1, P2, and P3 (averaged over 100 animats). Bottom row: occupancy grid maps. (A) During the first 12 training trials (day 1) the simulated animals developed a significant preference for P1 (no significant difference was observed between P2 and P3). (B) During the following 156 training trials (days 2-14, in the presence of block A, Fig. 2B) P2 was selected significantly more frequently than P3. (C) During the last 7 trials (day 15, test phase), the block A was removed whereas the block B was introduced. The animats exhibited a significant preference for P3 compared to P2.

Days 2-14. During this training phase (consisting of 156 trials), a block was introduced at location A, which forced the animats to update their topological maps dynamically, and to plan a detour to the goal. The results reported by Tolman & Honzik provided strong evidence for a preference for the shortest detour path P2. Consistently, in our simulations (Fig. 3B) we observed a significantly larger number of transits through P2 compared to P3 (ANOVA, F1,198 = 383.068, p < 0.0001), P1 being ignored in this analysis (similar to Tolman & Honzik's analysis) because blocked.

Day 15. Seven probe trials were performed during the 15th day of the simulated protocol, by removing the block A and adding a new block at location B. This manipulation aimed at testing the "insight" working hypothesis: after a first run through the shortest path P1 and after having encountered the unexpected block B, will animats try P2 (wrong behaviour) or will they go directly through P3 (correct behaviour)? According to Tolman & Honzik's results, rats behaved as predicted by the insight hypothesis, i.e. they tended to select the longer but
effective P3. The authors concluded that rats were able to inhibit the previously learnt policy (i.e., the "habit behaviour" consisting of selecting P2 after a failure of P1 during the 156 previous training trials). Our probe test simulation results are shown in Fig. 3C. Similar to rats, the animats exhibited a significant preference for P3 compared to P2 (ANOVA, F1,198 = 130.15, p < 0.0001).

Finally, in order to further assess the mean performance of the system during the probe trials, we compared the action selection policy of learning animats with that of randomly behaving (theoretical) animats. Fig. 4A provides the results of this comparison by showing the error distribution over the population of learning agents (black histogram) and randomly behaving agents (grey curve). The number of errors per individual is displayed in the boxplot of Fig. 4B. These findings indicate a significantly better performance of learning animats compared to random agents (ANOVA, F1,196 = 7.4432, p < 0.01).

Fig. 4. Comparison between a learning and a randomly behaving agent. (A) Error distribution of learning (black histogram) versus random (grey line) animats. (B) Mean number of errors made by the model and by a randomly behaving agent.
3.2 Analysis of Neural Activities
Fig. 5A contrasts the mean spatial density (Eq. 10) of the HP receptive fields with that of cortical column receptive fields. It is shown that, compared to the upstream hippocampal space code, the cortical column model reduced the redundancy of the learnt spatial code significantly (ANOVA, F1,316 = 739.2, p < 0.0001). Fig. 5B shows the probability distribution representing the number of active column units (solid curve) and active HP cells (dashed line) per spatial location s ∈ S. As shown by the inset boxplots, the distribution kurtosis was significantly higher for column units than for HP cells (ANOVA, F1,198 = 6057, p < 0.0001). To further investigate this property, we assessed the average population kurtosis k¯1 (Eq. 11) of both columnar and HP cell activities (Fig. 5C). Again, the columnar population activity exhibited a significantly higher kurtosis
than the HP cell activity (ANOVA, F1,3128 = 14901, p < 0.0001). These results suggest that, in the model, the cortical column network was able to provide a sparser state-space population coding than HP units.

Fig. 5. (A) Spatial density of the receptive fields of HP cells and cortical column units. (B) Probability distribution of the number of active column units (solid line) and active HP cells (dashed line) per spatial location s ∈ S. Inset boxplots: kurtosis measures for the two distributions. (C) Population kurtosis of columnar and hippocampal assemblies.

In a second series of analyses, we focused on the activity of single cells, and we compared the average lifetime kurtosis k̄_2 (Eq. 12) of cortical and HP units. As reported in Fig. 6A, we found that the kurtosis across time did not differ significantly between cortical and HP units (ANOVA, F1,2356 = 2.2699, p < 0.13). This result suggests that, on average, single cortical and HP units tended to respond to a comparable number of stimuli (i.e., spatial locations) over their lifetimes. Along the same line, we recorded the receptive fields of the two types of units. Figs. 6B,C display some samples of place fields of cortical and HP cells, respectively. As expected, we found a statistical anticorrelation between the lifetime kurtosis and the size of the receptive fields. The example of Fig. 6D shows that, for a randomly chosen animat performing the whole experimental protocol (15 days), the size of hippocampal place fields was highly anticorrelated to the HP cells' lifetime kurtosis (correlation coefficient = −0.94). These results add to those depicted in Fig. 5 in that the increase of sparseness at the level of the cortical population (compared to HP cells) was not merely due to an enlargement of the receptive fields (or, equivalently, to a decrease of the lifetime stimulus-dependent activity).

Despite their less redundant code, were cortical columns able to provide a representation comparable to that of HP cells in terms of spatial information content? The results of our information theoretic analysis (Eq. 13) suggest that this was indeed the case. Fig. 6E shows that, for a randomly chosen animat,
the average amount of spatial information conveyed by cortical units was not significantly lower than that of HP cells (ANOVA, F1,140 = 0.8034, p < 0.3716).

Fig. 6. (A) Lifetime kurtosis for column and HP units. (B, C) Samples of receptive fields of three column units and four HP cells. (D) Correlation between the lifetime kurtosis and the size of receptive fields. (E) Mutual information MI(S; R) between the set of spatial locations S and the activity R for both cortical and HP units.
4 Discussion
We presented a navigation model that builds a topological map of the environment incrementally, and uses it to plan a course of actions leading to a goal location. The model was employed to solve the classical Tolman & Honzik’s task [36]. As aforementioned, other models have been proposed to solve goal-directed navigation tasks. They are mainly based on the properties of hippocampal (e.g., [43]), and prefrontal cortex (e.g., [31]) neural assemblies. However, most of these models do not perform action planning as defined in this paper (Sec. 1). Samsonovich and Ascoli [43] implement a local path finding mechanism to select the most suitable orientation leading to the goal. Similarly, Hasselmo’s model [31] does not plan a sequence of actions from the current location to the goal but it rather infers the first local action to be taken, based upon a back-propagated
goal signal. Yet, these two models rely on discretized state spaces (with predefined grid units coding for places), whereas our model uses a place field population providing a continuous representation of the environment [38]. Also, our model learns topological maps coding for the state-action space simultaneously. In the model by Samsonovich and Ascoli [43] no topological information is represented, but only a distance measure between each visited place and a set of potential goals. Likewise, in Hasselmo's model states and actions are not jointly represented, which generates a route-based rather than a map-based navigation system [1].

We adopted a three-fold working hypothesis according to which (i) the hippocampus would play a prominent role in encoding spatial information; (ii) higher-level cortical areas, particularly the PFC, would mediate multidimensional contextual representations (e.g., coding for motivation-dependent memories and action cost/risk constraints) grounded on the hippocampal spatial code; (iii) neocortical representations would facilitate the temporal linking of multiple contexts, and the sequential organisation (e.g., planning) of behavioural responses. The preliminary version of the model presented here enabled us to focus on some basic computational properties, such as the ability of the columnar organisation to learn a compact topological representation, and the efficiency of the activation-diffusion planning mechanism. Further efforts will be devoted to integrating multiple sources of information. For example, the animat should be able to learn maps that encode reward (subjective) values and action-cost constraints. Also, these maps should be suitable to represent multiple spatio-temporal scales to overcome the intrinsic limitation of the activation-diffusion mechanism in large-scale environments. Additionally, these multiscale maps should allow the model to infer high-level shortcuts to bypass low-level environmental constraints.

The neurocomputational approach presented here aims at generating cross-disciplinary insights that may help to systematically explore potential connections between findings on the neuronal level (e.g., single-cell discharge patterns) and observations on the behavioural level (e.g., spatial navigation). Mathematical representations make it possible to describe both the space and time components characterising the couplings between neurobiological processes. Models can help to scale up from single-cell properties to the dynamics of neural populations, and generate novel hypotheses about their interactions to produce complex behaviour.

Acknowledgments. Granted by the EC Project ICEA (Integrating Cognition, Emotion and Autonomy), IST-027819-IP.
References 1. Arleo, A., Rondi-Reig, L.: Multimodal sensory integration and concurrent navigation strategies for spatial cognition in real and artificial organisms. J. Integr. Neurosci. 6(3), 327–366 (2007) 2. Dollé, L., Khamassi, M., Girard, B., Guillot, A., Chavarriaga, R.: Analyzing interactions between navigation strategies using a computational model of action selection. In: Freksa, C., et al. (eds.) SC 2008. LNCS (LNAI), vol. 5248, pp. 71–86. Springer, Heidelberg (2008)
3. O’Keefe, J., Nadel, L.: The Hippocampus as a Cognitive Map. Oxford University Press, Oxford (1978) 4. Hafting, T., Fyhn, M., Molden, S., Moser, M.B., Moser, E.I.: Microstructure of a spatial map in the entorhinal cortex. Nature 436(7052), 801–806 (2005) 5. Wiener, S.I., Taube, J.S.: Head Direction Cells and the Neural Mechansims of Spatial Orientation. MIT Press, Cambridge (2005) 6. Poucet, B., Lenck-Santini, P.P., Hok, V., Save, E., Banquet, J.P., Gaussier, P., Muller, R.U.: Spatial navigation and hippocampal place cell firing: the problem of goal encoding. Rev. Neurosci. 15(2), 89–107 (2004) 7. Amaral, D.G., Witter, M.P.: The three-dimensional organization of the hippocampal formation: a review of anatomical data. Neurosci. 31(3), 571–591 (1989) 8. Wilson, M.A., McNaughton, B.L.: Dynamics of the hippocampal ensemble code for space. Science 261, 1055–1058 (1993) 9. Nitz, D.A.: Tracking route progression in the posterior parietal cortex. Neuron. 49(5), 747–756 (2006) 10. Hok, V., Save, E., Lenck-Santini, P.P., Poucet, B.: Coding for spatial goals in the prelimbic/infralimbic area of the rat frontal cortex. Proc. Natl. Acad. Sci. USA. 102(12), 4602–4607 (2005) 11. Knierim, J.J.: Neural representations of location outside the hippocampus. Learn. Mem. 13(4), 405–415 (2006) 12. Granon, S., Poucet, B.: Medial prefrontal lesions in the rat and spatial navigation: evidence for impaired planning. Behav. Neurosci. 109(3), 474–484 (1995) 13. Jay, T.M., Witter, M.P.: Distribution of hippocampal ca1 and subicular efferents in the prefrontal cortex of the rat studied by means of anterograde transport of phaseolus vulgaris-leucoagglutinin. J. Comp. Neurol. 313(4), 574–586 (1991) 14. Kita, H., Kitai, S.T.: Amygdaloid projections to the frontal cortex and the striatum in the rat. J. Comp. Neurol. 298(1), 40–49 (1990) 15. Thierry, A.M., Blanc, G., Sobel, A., Stinus, L., Golwinski, J.: Dopaminergic terminals in the rat cortex. Science 182(4111), 499–501 (1973) 16. Uylings, H.B.M., Groenewegen, H.J., Kolb, B.: Do rats have a prefrontal cortex? Behav. Brain. Res. 146(1-2), 3–17 (2003) 17. Aggleton, J.: The amygdala: neurobiological aspects of emotion, memory, and mental dysfunction. Wiley-Liss, New York (1992) 18. Schultz, W.: Predictive reward signal of dopamine neurons. J. Neurophysiol. 80(1), 1–27 (1998) 19. Jung, M.W., Qin, Y., McNaughton, B.L., Barnes, C.A.: Firing characteristics of deep layer neurons in prefrontal cortex in rats performing spatial working memory tasks. Cereb. Cortex 8(5), 437–450 (1998) 20. Otani, S.: Prefrontal cortex function, quasi-physiological stimuli, and synaptic plasticity. J. Physiol. Paris 97(4-6), 423–430 (2003) 21. Fuster, J.M.: The prefrontal cortex–an update: time is of the essence. Neuron. 30(2), 319–333 (2001) 22. Cohen, J.D., Braver, T.S., Brown, J.W.: Computational perspectives on dopamine function in prefrontal cortex. Curr. Opin. Neurobiol. 12(2), 223–229 (2002) 23. Mountcastle, V.B.: Modality and topographic properties of single neurons of cat’s somatic sensory cortex. J. Neurophysiol. 20(4), 408–434 (1957) 24. Mountcastle, V.B.: The columnar organization of the neocortex. Brain 120, 701– 722 (1997) 25. Buxhoeveden, D.P., Casanova, M.F.: The minicolumn hypothesis in neuroscience. Brain 125(5), 935–951 (2002)
26. Hampson, S.: Connectionist problem solving. In: The Handbook of Brain Theory and Neural Networks, pp. 756–760. The MIT Press, Cambridge (1998) 27. Meyer, J.A., Filliat, D.: Map-based navigation in mobile robots - ii. a review of map-learning and path-planing strategies. J. Cogn. Syst. Res. 4(4), 283–317 (2003) 28. Burnod, Y.: An adaptative neural network: the cerebral cortex. Masson (1989) 29. Bieszczad, A.: Neurosolver: a step toward a neuromorphic general problem solver. Proc. World. Congr. Comput. Intell. WCCI94 3, 1313–1318 (1994) 30. Frezza-Buet, H., Alexandre, F.: Modeling prefrontal functions for robot navigation. IEEE Int. Jt. Conf. Neural. Netw. 1, 252–257 (1999) 31. Hasselmo, M.E.: A model of prefrontal cortical mechanisms for goal-directed behavior. J. Cogn. Neurosci. 17(7), 1115–1129 (2005) 32. Schmajuk, N.A., Thieme, A.D.: Purposive behavior and cognitive mapping: a neural network model. Biol. Cybern. 67(2), 165–174 (1992) 33. Dehaene, S., Changeux, J.P.: A hierarchical neuronal network for planning behavior. Proc. Natl. Acad. Sci. USA. 94(24), 13293–13298 (1997) 34. Banquet, J.P., Gaussier, P., Quoy, M., Revel, A., Burnod, Y.: A hierarchy of associations in hippocampo-cortical systems: cognitive maps and navigation strategies. Neural Comput. 17, 1339–1384 (2005) 35. Fleuret, F., Brunet, E.: Dea: an architecture for goal planning and classification. Neural Comput 12(9), 1987–2008 (2000) 36. Tolman, E.C., Honzik, C.H.: ”Insight” in rats. Univ. Calif. Publ. Psychol. 4(14), 215–232 (1930) 37. Arleo, A., Gerstner, W.: Spatial orientation in navigating agents: modeling headdirection cells. Neurocomput. 38(40), 1059–1065 (2001) 38. Arleo, A., Smeraldi, F., Gerstner, W.: Cognitive navigation based on nonuniform gabor space sampling, unsupervised growing networks, and reinforcement learning. IEEE Trans. Neural. Netw. 15(3), 639–651 (2004) 39. Rao, S.G., Williams, G.V., Goldman-Rakic, P.S.: Isodirectional tuning of adjacent interneurons and pyramidal cells during working memory: evidence for microcolumnar organization in pfc. J. Neurophysiol. 81(4), 1903–1916 (1999) 40. Triesch, J.: Synergies between intrinsic and synaptic plasticity mechanisms. Neural Comput. 19(4), 885–909 (2007) 41. Willmore, B., Tolhurst, D.J.: Characterizing the sparseness of neural codes. Netw. Comput. Neural Syst. 12(3), 255–270 (2001) 42. Bialek, W., Rieke, F., de Ruyter van Steveninck, R., Warland, D.: Reading a neural code. Science 252(5014), 1854–1857 (1991) 43. Samsonovich, A., Ascoli, G.: A simple neural network model of the hippocampus suggesting its pathfinding role in episodic memory retrieval. Learn. Mem. 12, 193– 208 (2005)
Efficient Wayfinding in Hierarchically Regionalized Spatial Environments Thomas Reineking, Christian Kohlhagen, and Christoph Zetzsche Cognitive Neuroinformatics University of Bremen 28359 Bremen, Germany {trking,ckohlhag,zetzsche}@informatik.uni-bremen.de
Abstract. Humans utilize region-based hierarchical representations in the context of navigation. We propose a computational model for representing region hierarchies and define criteria for automatically generating them. We devise a cognitively plausible online wayfinding algorithm exploiting the hierarchical decomposition given by regions. The algorithm allows an agent to derive plans with decreasing detail level along paths, enabling the agent to obtain the next action in logarithmic time and complete solutions in almost linear time. The resulting paths are reasonable approximations of optimal shortest paths. Keywords: Navigation, hierarchical spatial representation, regions, region hierarchy, wayfinding.
1
Introduction
Agents situated in spatial environments must be capable of autonomous navigation using previously learned representations. There exists a wide variety of approaches for representing environments, ranging from metrical maps [1] to topological graphs [2]. In the context of large-scale wayfinding, topological models seem cognitively more plausible because they are robust with regard to global consistency and because they permit abstracting from unnecessary details, enabling higher-level planning [3]. Topological graph-based representations of space can be divided into those utilizing single “flat” graphs and those that employ hierarchies of graphs for different layers of granularity. Single-graph schemes are limited in case of large domains since action selection by an agent may take unacceptably long due to huge search spaces. Hierarchical approaches, on the other hand, decompose these spaces and are therefore significantly more efficient, but solutions are not always guaranteed to be optimal. One possibility for a hierarchical representation is to assume graph-subgraph structures in which higher levels form subsets of lower, more detailed levels. This approach has been particularly popular in geographical information systems (GIS), where this technique can be used to eliminate unwanted details. Based on this idea a domain-specific path planning algorithm for street maps was proposed in [4]. The authors approximated street maps by connection grids and
Fig. 1. Small indoor environment with superimposed region hierarchy. Local connectivity graphs are used for efficient near-optimal path planning (not shown).
constructed each hierarchical layer as a subset of the next lower layer. Another way of defining topological hierarchies is having nodes at higher levels represent sets of nodes at lower levels. A hierarchical extension of the A∗ algorithm (HPA∗) can be found in [5]. In this approach a topological graph is abstracted from an occupancy grid map, which facilitates restricting the search space. In [6] the D∗ algorithm for robot path planning is modified in order to support hierarchies consisting of different node classes. Unlike the other approaches, the hierarchical D∗ algorithm is guaranteed to generate optimal paths; however, it requires the offline computation and storage of partial paths. Furthermore, it depends on the availability of exact metrical information. In addition to wayfinding, hierarchical representations have been successfully applied to related spatial reasoning problems, such as the traveling salesman problem [7,8] or the automated generation of route directions [9]. In this paper we introduce a cognitively motivated hierarchical representation which is based on regions. We provide a formal data structure for this representation, and we develop an efficient wayfinding algorithm that exploits its specific properties. Many natural and man-made environments exhibit an intrinsic regionalization that can be directly exploited for building hierarchical representations, and there is strong experimental evidence that humans actually make use of region-based hierarchical representations in the context of navigation [10,11,12,13,14]. Research on cognitive maps supports this idea and has identified the hierarchical nature of spatial representations as one of their crucial properties [15,16].
Our work originated from a project on a cognitive agent which explores and navigates through an indoor environment by means of a hierarchical region-based representation [17]. The architecture was based on a biologically inspired approach for recognizing regions by sensorimotor features [18]. Figure 1 shows a small example of a typical environment with a superimposed hierarchy. In this representation, regions were formed by means of visual scene analysis [19] and intrinsic connectivity structure. However, this solution depends on specific properties of the environment. In the current paper we hence use a quadtree approach for building hierarchies, to enable comparability with other approaches and to ease complexity analysis. In our model smaller regions are grouped to form new regions at the next hierarchy level, which yields a tree-like structure. This hierarchical representation is augmented by local connectivity graphs for each subtree. Explicit metrical information is completely disregarded; instead, the regions themselves constitute a qualitative metric at different levels of granularity. By combining the hierarchically organized connectivity graphs with this qualitative metric we were able to develop an efficient online wayfinding algorithm that produces near-optimal paths while drastically reducing the search space. Abstract paths are determined at the top of subtrees and recursively broken down to smaller regions. By only relaxing the first element of an abstract path, the next action can be obtained in logarithmic time depending on the number of regions, while generating complete paths can be done with almost linear complexity. The idea of region-based wayfinding has been introduced as a heuristic in [13], but without an explicit computational model. The aforementioned hierarchical wayfinding algorithms do not consider the topological notion of regions as an integral part of the underlying representation, nor can they be easily adapted to regionalized environments in general. The paper is structured into two parts. The first explains the hierarchical region representation, its properties, and how it can be constructed. The second introduces the wayfinding algorithm utilizing this data structure. We analyze its algorithmic complexity and compare it to existing path planning approaches. We conclude with a short discussion of the advantages of the proposed model and give hints towards future extensions.
2
Region Hierarchy
In this section we describe the region hierarchy as a formal data structure and demonstrate how valid hierarchies can be constructed. We argue that most real-world environments are inherently regionalized in that they are composed of areas that form natural units. These units form larger regions at higher levels of granularity, resulting in a hierarchy of regions. Most approaches in the context of path planning rely on metric information for estimating distances. As mentioned in the introduction, the proposed representation enables an agent to obtain qualitative distances based on regions, thus making quantitative information an optional addition. A qualitative metric is given if
regions of one hierarchy level are of similar size and approximately convex. The length of a path can then be assessed by the number of crossed regions at a given level. This is especially useful since it allows an agent to estimate distances with near-arbitrary precision depending on the considered hierarchy level. In order to derive a computational model of the representation it is necessary to make assumptions about how regions are modeled. First, we assume that each region is fully contained by exactly one region at the next coarser level of granularity, that there is a single region containing all other ones, and that the set of all regions therefore constitutes a tree. Humans may use a fuzzier representation with fluid region boundaries and less strict containment relations [20], but a tree of regions seems a reasonable approximation. Second, we demand the descendants of a region to be reachable from each other without leaving the parent region. This asserts the existence of a path within the parent region, which can be used as a clustering criterion for constructing valid hierarchies. Unlike a flat connectivity graph, a hierarchy allows the representation of connectivity at multiple layers of detail. We propose imposing region connectivity on the tree structure by storing connections in the smallest subtree only, thus decomposing the global connectivity graph into local subgraphs. This limits the problem space of a wayfinding task to the corresponding subtree, excluding large parts of the environment at each recursion step, which is in fact the underlying idea for the wayfinding algorithm described in the next section. 2.1
Formal Definition
Here we state the proposed data structure formally using first-order logic. In the following formulas all variables are understood to be regions. We introduce basic properties of the hierarchy, explain how region connectivity is represented and at the end we give a key requirement for regions in the navigation context. First we provide a relation which expresses the containment of a region c by another region p and which can be thought of as defining a subset of the encompassing region. This relation is transitive with respect to a third region g: ∀c, p, g : in(c, p) ∧ in(p, g) ⇒ in(c, g).
(1)
The in predicate is reflexive for arbitrary regions r but to simplify notation we also define an irreflexive version in ∗ for which the following expression is not true: ∀r : in(r, r).
(2)
The tree’s root W contains all other regions: ∀r : in(r, W).
(3)
A direct descendant c of a region p is given by a region r that is located in (or equal to) c: ∀c, p, r : child(c, p, r) ⇒ in(r, c) ∧ in∗ (c, p) ∧ (¬∃d : in∗ (d, p) ∧ in∗ (c, d)).
(4)
In order to determine the smallest subtree in which two nodes r1 , r2 are located, it is necessary to determine their first common ancestor (FCA) f , i.e., the subtree’s root node: ∀f, r1 , r2 : fca(f, r1 , r2 ) ⇔ in(r1 , f ) ∧ in(r2 , f ) ∧ (¬∃p : in∗ (p, f ) ∧ in(r1 , p) ∧ in(r2 , p)).
(5)
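As an illustration, this definition suggests computing the FCA by walking up and comparing the two ancestor chains. A minimal sketch (Python, with the hierarchy reduced to hypothetical parent pointers; the names and the toy example are our own, not part of the formal model):

def ancestors(r, parent):
    # chain [r, ..., W] from a region up to the root (parent[W] is None)
    chain = [r]
    while parent[chain[-1]] is not None:
        chain.append(parent[chain[-1]])
    return chain

def fca(r1, r2, parent):
    # first common ancestor of r1 and r2, as in definition (5)
    on_path = set(ancestors(r1, parent))
    for node in ancestors(r2, parent):
        if node in on_path:
            return node      # deepest region containing both r1 and r2

# toy hierarchy: W contains C1, R3, C2; C1 contains R1, R2; C2 contains R4, R5
parent = {"W": None, "C1": "W", "R3": "W", "C2": "W",
          "R1": "C1", "R2": "C1", "R4": "C2", "R5": "C2"}
print(fca("R1", "R5", parent))   # -> W
print(fca("R1", "R2", parent))   # -> C1

Because the ancestor chains have logarithmic length in a properly shaped hierarchy, this comparison is cheap; the complexity analysis in Sect. 3.2 builds on exactly this observation.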
The hierarchy is composed of two kinds of nodes: atomic regions and complex regions. Atomic regions correspond to places in the environment that are not further divided, while complex regions comprise atomic regions or other complex regions, and therefore represent the area covered by the non-empty set of all their descendants. Regions do not intersect other than by containment, and the set of atomic regions exhaustively covers the environment. (In case of the indoor domain an atomic region could be a room, whereas a complex region might represent a hallway along with all its neighboring rooms.) An atomic region a therefore contains no other region r: ∀a, r : atomic(a) ⇔ ¬in∗(r, a).
(6)
The connectivity of atomic regions is given by the environment. In our model an atomic connection is simply a tuple of atomic regions a1 , a2 : ∀a1 , a2 : con(a1 , a2 ) ⇒ atomic(a1 ) ∧ atomic(a2 ).
(7)
Further information like the specific action necessary for reaching the next region could be represented as well but for the sake of simplicity we stick to region tuples as connections. Note that the connection predicate is non-transitive and irreflexive, thus disallowing a region to be connected with itself. The global connectivity graph is hierarchically decomposed by storing atomic connections in the root of the smallest subtree given by two regions a1 , a2 . Therefore each region f carries a set of all atomic connections between its descendants provided that this node is their FCA: ∀f, a1 , a2 : consa (f, a1 , a2 ) ⇔ fca(f, a1 , a2 ) ∧ con(a1 , a2 ).
(8)
This connection set is later used for obtaining crossings between regions at the atomic level. Alongside atomic connections the hierarchy also needs to represent the connectivity of complex regions. For this purpose each region has a second set containing complex connections. A complex connection is a tuple of two regions c1 , c2 sharing the same parent f . A complex connection exists if a region a1 contained by (or equal to) c1 is atomically connected to a region a2 contained by (or equal to) c2 : ∀f, c1 , c2 : consc (f, c1 , c2 ) ⇔ ∃a1 , a2 : consa (f, a1 , a2 ) ∧ child(c1 , f, a1 ) ∧ child(c2 , f, a2 ).
(9)
A complex connection therefore exists if and only if the set of atomic connections contains a corresponding entry. The set of complex connections defines a connected graph by interpreting region tuples as edges. This graph enables searching for paths between complex regions, whereas the atomic connection set yields the actual (atomic) crossings between the complex regions. The existence of a path between two arbitrary regions s, d is conditioned on a third region r that completely encompasses the path. Hence for all nodes x along the path the in predicate must be fulfilled with respect to r: ∀s, d, r : path(s, d, r) ⇔ (∃x : con(s, x) ∧ in(x, r) ∧ path(x, d, r)) ∨ con(s, d) ∨ s = d.
(10)
Finally we state the connectivity criterion that a valid hierarchy must satisfy. We require that all regions r1 , r2 located in a common parent region p must be reachable from each other without leaving p: ∀r1 , r2 , p : in(r1 , p) ∧ in(r2 , p) ⇔ path(r1 , r2 , p).
(11)
This enables the hierarchical decomposition of wayfinding tasks because a hierarchy adhering to this principle reduces the search space to the subtree given by a source and a destination region. 2.2
Clustering
The problem of imposing a hierarchy onto an environment is essentially a matter of clustering regions hierarchically. Humans seem to be able to do this effortlessly and there is evidence that the acquisition of region knowledge happens very early during the exploration of an environment [14]. Some suggestions on principles for the automated hierarchical clustering of spatial environments can be found in [21]. However, automatically generating hierarchies similar to the ones constructed by humans for arbitrary spatial configurations is an unsolved problem. We briefly describe a domain-specific clustering approach for indoor environments. For the purpose of auditability and comparability of our wayfinding algorithm’s performance we first state the more generic albeit artificial quadtree as a possibility for generating hierarchies. While humans seem to use various criteria for grouping regions, we focus on the connectivity aspect, since it is essential for navigation. We require a proper hierarchy to fulfill four properties. The first two are similarity of region size at each hierarchy level and convexity as mentioned above. The third is given by (11) and asserts the existence of a path within a bounding region. The fourth property concerns the hierarchy’s shape. The tree should be nearly balanced and its depth must be logarithmically dependent on the number of atomic regions. This excludes “flat” hierarchies as well as “deformed” ones with arbitrary depths. Note that the third requirement is necessary for correctness of the wayfinding algorithm described in the next section while the hierarchy’s shape merely affects
the algorithm’s computation time. Size and convexity of regions determine the accuracy of qualitative distance estimates. Generating proper clusters becomes significantly easier if one makes assumptions about the connectivity structure of an environment. In the spatial domain it is popular to approximate place connectivity, i.e., connections between atomic regions, by assuming grid-like connections where each region is connected to its four neighboring regions. In this case a simple quadtree can be applied in which a set of four adjacent regions at level k corresponds to one region at level k + 1. The resulting hierarchy would indeed satisfy the connectivity property defined by (11), and with a constant branching factor of b = 4 its depth would be logarithmically bounded. Similar region size and convexity are ensured by the uniformity of grid cells. However, the applicability of the quadtree approach is limited in case of environments with less regular and more restricted connectivity, since this could easily violate the connectedness of regions. This connectivity restriction is especially predominant in indoor environments. A first approach towards modeling human grouping mechanisms for indoor environments has been proposed in [17]. It applied a classical metrical clustering algorithm to local quantitative representations and combined this with domain-specific connectivity-based heuristics for the topological level. Indoor environments are especially suited for the hierarchical clustering of regions because they offer natural region boundaries in the form of walls and because they are characterized by local connectivity agglomerations. The latter can be exploited by defining regions based on places of high connectivity, e.g., a hallway could form a region together with its connected rooms. This not only ensures connectedness of regions, it also leads to intuitively shaped hierarchies with convex regions like the one shown in Figure 1. Similar region sizes at each hierarchy level are a direct result of the high degree of symmetry found in many indoor environments.
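Before turning to the wayfinding algorithm, a compact sketch may help to make the data structure concrete (Python; the class and helper names are our own and merely illustrate the definitions of Sect. 2.1, they are not prescribed by the paper). Every atomic connection is stored at the FCA of its two endpoints, and the corresponding complex connection between the FCA’s direct children is derived at the same time, as required by definitions (8) and (9):

class Region:
    # Node of the region hierarchy (Sect. 2.1); no children => atomic region, def. (6)
    def __init__(self, name, parent=None):
        self.name, self.parent, self.children = name, parent, []
        self.cons_a = set()    # atomic connections stored at this node, def. (8)
        self.cons_c = set()    # complex connections between direct children, def. (9)
        if parent is not None:
            parent.children.append(self)

    def ancestors(self):
        node, chain = self, []
        while node is not None:
            chain.append(node)
            node = node.parent
        return chain           # [self, ..., root W]; includes self, as in (2)

def fca(r1, r2):
    # first common ancestor of r1 and r2, def. (5)
    on_path = set(r1.ancestors())
    for node in r2.ancestors():
        if node in on_path:
            return node

def child_of(ancestor, region):
    # direct child of 'ancestor' that contains (or equals) 'region', cf. def. (4)
    for node in region.ancestors():
        if node.parent is ancestor:
            return node
    return region              # only reached when region is the ancestor itself

def add_atomic_connection(a1, a2):
    # store con(a1, a2) at the FCA and derive the complex connection, defs. (8) and (9)
    f = fca(a1, a2)
    f.cons_a.add((a1, a2))
    f.cons_c.add((child_of(f, a1), child_of(f, a2)))

For a grid-like environment, the quadtree clustering described above then amounts to creating one parent Region for every 2 × 2 block of cells at each level and calling add_atomic_connection for every pair of four-neighbouring cells; criterion (11) is satisfied by construction.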
3
Wayfinding
Given a hierarchy that satisfies the properties described above, it is possible to devise algorithms that utilize this data structure. In this section we propose a wayfinding algorithm that is capable of planning paths between atomic source regions and arbitrary (atomic or complex) destination regions. By decomposing tasks hierarchically and by estimating path lengths qualitatively based on regions, the algorithm aims to be cognitively plausible while efficiently producing near-optimal solutions. The first part explains the algorithm in depth using pseudocode. The second part analyzes its time and space complexity. Finally, we compare its properties to different path planning algorithms, both hierarchical and non-hierarchical ones. 3.1
Algorithm
The basic idea of the algorithm is to limit the search space to the region given by the minimum subtree containing the source and destination region at each
recursion step. This is possible because (11) guarantees the connectedness of such a region. Within this region an “abstract” path is constructed using a shortest path algorithm. The search space is the local connectivity graph composed of the direct descendants of the subtree’s root and their connections. Start and goal of the abstract path are given by the regions in which the source and destination region are located at the corresponding level. For each connection of the abstract path a corresponding one from the set of atomic connections is selected and used as a new destination for the next recursion. This process is repeated until the atomic level is reached and a complete path has been constructed. Alternatively only the first element from each abstract path is relaxed, which yields only the first atomic region crossing while keeping the other crossings abstract. Figure 2 illustrates how the region hierarchy is used to find a path from one atomic region to another by means of complex and atomic connections.

 1  find_way(s, d, h)
 2    if in(s, d, h) then
 3      return []                                     // empty path
 4
 5    fca  := fca(s, d, h)
 6    cs_a =  con_a(fca, h)                           // atomic connections
 7    cs_c =  con_c(fca, h)                           // complex connections
 8    p_c  := Dijkstra(cs_c, child(s, fca), child(d, fca))
 9
10    cur := s                                        // current (atomic) region
11    p_a := []                                       // (atomic) path
12    for each c_c from p_c
13      c_a := select(cur, c_c, cs_a)
14      p_a := p_a + find_way(cur, source(c_a), h)
15      p_a := p_a + c_a
16      cur = destination(c_a)
17
18    return p_a + find_way(cur, d, h)
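For illustration, the listing can be rendered in runnable form roughly as follows (a sketch only, not the authors’ implementation; it reuses the hypothetical Region class, fca and child_of helpers from the sketch at the end of Sect. 2, treats connections as undirected tuples, and uses a breadth-first search as the unit-cost special case of Dijkstra’s algorithm). The step-by-step walkthrough below refers to the numbered lines of the pseudocode above.

from collections import deque

def is_in(r, d):
    # in(r, d): region r is contained in (or equal to) region d
    return d in r.ancestors()

def local_dijkstra(cons_c, start, goal):
    # shortest path over the local complex-connection graph of one subtree root;
    # with unit region costs (the qualitative metric) this reduces to a BFS
    adj = {}
    for a, b in cons_c:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)      # treat connections as undirected
    prev, queue = {start: None}, deque([start])
    while queue:
        node = queue.popleft()
        if node is goal:
            edges, n = [], node
            while prev[n] is not None:
                edges.append((prev[n], n))
                n = prev[n]
            return list(reversed(edges))      # abstract path as complex connections
        for nxt in adj.get(node, []):
            if nxt not in prev:
                prev[nxt] = node
                queue.append(nxt)
    return []                                 # cannot occur for hierarchies satisfying (11)

def select(c_c, cons_a):
    # first-match selection of an atomic connection realizing the complex connection c_c
    for a1, a2 in cons_a:
        if is_in(a1, c_c[0]) and is_in(a2, c_c[1]):
            return a1, a2
        if is_in(a2, c_c[0]) and is_in(a1, c_c[1]):
            return a2, a1

def find_way(s, d):
    if is_in(s, d):
        return []                                         # lines 2-3
    f = fca(s, d)                                         # line 5
    p_c = local_dijkstra(f.cons_c,
                         child_of(f, s), child_of(f, d))  # lines 6-8
    cur, p_a = s, []                                      # lines 10-11
    for c_c in p_c:                                       # line 12
        c_a = select(c_c, f.cons_a)                       # line 13
        p_a += find_way(cur, c_a[0]) + [c_a]              # lines 14-15
        cur = c_a[1]                                      # line 16
    return p_a + find_way(cur, d)                         # line 18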
The algorithm has three input parameters: s denotes the atomic source region, d the atomic or complex destination region, and h is a properly clustered region hierarchy. First the trivial case of s being located in (or equal to) d is covered, which results in an empty path (lines 2-3). If this is not the case, the FCA (defined by (5)) of s and d is determined (line 5). Given this region, the set of complex connections cs_c between the direct descendants of fca and the set of corresponding atomic connections cs_a (given by (9) and (8)) are obtained (lines 6-7). The former is used to construct an abstract path p_c from s to d composed of the FCA’s direct descendants by applying Dijkstra’s shortest path algorithm. Next the current region cur is initialized with the source region s (line 10) and the algorithm iterates through all complex connections in the abstract path (line 12). From the set of atomic connections cs_a an element corresponding to the current complex connection is randomly selected (line 13). Corresponding means that the source region of the atomic connection must be located in (or
Fig. 2. For finding a path from region R1 to region R5 the FCA of both nodes is first determined. This node contains a connectivity graph composed of complex connections between direct descendants (C1-R3, R3-C2), which is used for finding the abstract path from C1 to C2 (C1→R3→C2). The actual region crossings obtained from the set of corresponding atomic connections (R2-R3, R3-R4) form new destinations for the subsequent recursion steps (first R2, then R4).
equal to) the complex connection’s source region and the same must be true for the destination region. This atomic connection c_a can be thought of as a waypoint that the agent has to pass in order to reach the next region, even if the complete path has not been determined yet. This intermediate goal then becomes the new destination for the recursion step; the new source is given by the current region cur (line 14). The result is a partial atomic path which is concatenated to the path variable p_a along with the atomic connection c_a (lines 14-15). Afterwards the current region is updated accordingly (line 16). Finally, the path from the current region to the actual destination d is obtained recursively and the combined path is returned (line 18). As mentioned, the algorithm can be operated in two modes. In the form in which it is stated above, a complete path containing all atomic steps is constructed. If the agent is only interested in generating the next action towards the goal, the iteration over all complex connections can be omitted. Instead only the first connection at each hierarchy level is recursively broken down to the atomic level while the remaining paths are kept abstract. This guarantees a decrease of the hierarchy level by at least one in each recursion because the next intermediate destination given by c_a is always located in the same complex region (one level below the current FCA) as the current region. 3.2
Complexity
Here we investigate the time and space complexity of our algorithm. Space complexity is equal to that of a flat connectivity graph apart from a slight overhead.
The number of atomic regions n and atomic connections is exactly the same. There is an additional number of n/(b−1) complex regions, assuming the tree is balanced and that it has a constant branching factor b. The number of complex connections depends on the specific connectivity of an environment but it is obviously lower than the number of atomic connections, since complex connections are only formed for atomic connections defining a region crossing at the complex level. The overall space complexity is thus of the same class as that of a flat connectivity graph. If the algorithm is operated such that it only determines the next action, it exhibits a time complexity of O(log n). The construction of an abstract path at a given level lies in O(1), if the branching factor has an upper bound independent of n (e.g., b = 4 for quadtrees). The selection of an atomic connection is of constant time complexity as well because, given a complex connection, it is possible to immediately obtain a corresponding atomic connection, for instance by using a lookup table. By expanding only the first element of each abstract path, the current FCA basically moves from the initial FCA down to the source region, which leads to log_b n visited nodes in the worst case. This is because the tree’s depth is assumed to be logarithmically dependent on n and because each recursion step leads to a decrease of the current hierarchy level by at least one. Even though the current subtree has to be redetermined for each new destination, the complexity stays the same. The overall complexity of determining the FCAs for all recursion steps is O(log n) as well since there are only logarithmically many candidates. Implementation-wise the FCA search could be accomplished by using a simple loop apart from the recursion. By combining the number of visited nodes with the complexity for determining the current subtree additively, the complexity for obtaining the next action is O(log n). For planning complete paths the worst-case complexity is O(n log(log n)). In the unlikely case of having to visit every single atomic region each complex region becomes the FCA once. In case of a properly shaped hierarchy the number of complex regions is given by n/(b−1) and therefore lies in O(n). Determining the FCA of two arbitrary regions lies in O(log(log n)) because it involves comparing two hierarchy paths of logarithmic length. The comparison itself can be performed by a binary search in O(log(log n)). Again, the construction of an abstract path and the retrieval of an atomic connection take a constant amount of time, which yields an overall complexity of O(n log(log n)) for generating complete (atomic) paths. 3.3
Comparison
We compare the properties of our algorithm to the HPA∗ [5] path planning algorithm and the hierarchical D∗ algorithm [6] as well as Dijkstra’s classical shortest path algorithm [22]. We show that our approach outperforms all of these in terms of time complexity while at the same time yielding reasonable approximations of optimal solutions. When implemented with an efficient priority queue, the time complexity of Dijkstra’s algorithm is O(|E| + n log n), with |E| denoting the number of edges
in a given connectivity graph [23]. In case of sparse graphs, i.e., graphs for which n log n is not dominated by the number of edges, the complexity reduces to O(n log n). This sparseness is generally satisfied in the spatial domain, since connectivity is limited to the immediate neighborhood, which is certainly true for regionalized environments. For practical purposes HPA∗ and hierarchical D∗ are both significantly faster than Dijkstra’s algorithm; however, their worst-case complexity poses no improvement over that of Dijkstra’s algorithm. In contrast, our algorithm exhibits a complexity of only O(n log(log n)) for planning complete paths. Furthermore, the number of expanded regions is n/b instead of the n expanded by Dijkstra’s algorithm, because atomic regions behave trivially as FCAs. The complexity of O(log n) for determining the next action cannot be compared, since Dijkstra’s algorithm has to generate all shortest paths before being able to state the first step. Despite being efficient, an algorithm obviously also needs to produce useful paths. Like Dijkstra’s algorithm, the hierarchical D∗ algorithm generates optimal paths; however, it does so at the expense of having to calculate partial solutions offline, leading to increased storage requirements. HPA∗, on the other hand, yields near-optimal results if an additional path smoothing is applied, while also using precomputed partial solutions. Since both approaches make use of the A∗ search algorithm, they both require the availability of metrical information in order to obtain an admissible distance heuristic [24], which makes them unsuited for cases where such knowledge cannot be provided. For the purpose of analyzing our algorithm in terms of path lengths we set up a simulation in which we tested the proposed algorithm against optimal solutions obtained via Dijkstra’s shortest path algorithm. Since we did not consider metric knowledge, the length of a path was measured by the number of visited atomic regions. The environment was a square grid with 64 atomic regions in which each cell was connected to its four neighbors. We chose a grid-like connectivity since it works well as a general approximation for many environments and since it allowed us to avoid using domain-specific clustering criteria. A simple quadtree was therefore applied to obtain a region hierarchy. In 1000 iterations we randomly selected two cells as source and destination regions for the wayfinding task and compared the path lengths obtained via our hierarchical algorithm to those of Dijkstra’s algorithm. On average the produced paths contained 20.5% more atomic regions. Although these results are not sufficient for an in-depth analysis, the example demonstrates that the resulting paths are not arbitrarily longer than shortest paths. For domains with more restricted connectivity such as indoor environments we observed better performance, typically equal to optimal solutions. The main source of error resulted from the selection of atomic connections between two regions, because the regions themselves do not offer any information that would permit the derivation of a useful selection heuristic. The discussion points to some work that could improve this behavior. The error is considerably reduced if the hierarchy’s branching factor is increased. In fact there is a direct
Fig. 3. A grid-like environment with source S and destination D and two paths, one optimal (dashed), the other obtained by the hierarchical algorithm (continuous). The different gray levels of hierarchy nodes indicate the different search levels: white nodes are completely disregarded, gray nodes are part of a local search and black nodes are part of a search and recursively expanded. The gray cells at the bottom visualize the atomic search space.
trade-off between efficiency and average path length, because higher branching factors lead to larger local search spaces for which optimal solutions are obtained. Besides time complexity and path optimality, it is worth taking a look at the search space of the proposed algorithm. Unlike Dijkstra’s algorithm, which blindly expands nodes in all directions, our algorithm limits the set of possible paths for each recursion by excluding solutions at the parent level. Figure 3 shows the environment used during the simulation along with two exemplary paths between two regions S and D, one optimal, the other one constructed hierarchically. On the atomic level only 7/16 of the regions are considered by our approach while Dijkstra’s algorithm visits each region. This ratio decreases further with more atomic regions and tends to zero as the number of regions tends to infinity.
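The evaluation loop described above could be scripted roughly as follows (illustrative only; build_quadtree_grid and flat_shortest_path are hypothetical helpers standing in for the quadtree clustering of Sect. 2.2 and an optimal flat-graph reference planner, find_way is the sketch from Sect. 3.1, and path length is approximated by the number of region crossings):

import random

def average_overhead(cells, trials=1000, seed=0):
    # cells: list of atomic grid Regions; compares hierarchical vs. optimal path lengths
    random.seed(seed)
    ratios = []
    for _ in range(trials):
        s, d = random.sample(cells, 2)
        hier = find_way(s, d)               # hierarchical path (list of atomic crossings)
        flat = flat_shortest_path(s, d)     # optimal reference crossings on the flat graph
        ratios.append(len(hier) / max(len(flat), 1))
    return sum(ratios) / len(ratios) - 1.0  # relative overhead (the text reports ~20.5%)

# cells = build_quadtree_grid(8)            # 8 x 8 grid -> 64 atomic regions
# print(average_overhead(cells))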
4
Discussion
Humans make use of hierarchical representations of regionalized environments, as has been shown in several experiments [10,11,12,7]. In this paper we proposed a
computational model of such a hierarchical representation and we demonstrated how it can be used by an agent to solve wayfinding tasks. Region connectivity is represented at different levels of granularity in the hierarchy, and this can be exploited for a simplification of the search by decomposing the global search space into locally bounded subspaces. This results in efficient path planning and in the ability to generate the next action in logarithmic time. We described the proposed hierarchy formally and provided several criteria based on which valid hierarchies can be formed. An essential criterion for the clustering of regions seems to be connectivity, in particular if one is interested in basic navigation issues, but connectivity is probably only one criterion among a larger set of criteria used by humans. We considered a connectivity-based clustering approach with domain-specific rules. However, this algorithm is still ad hoc and restricted to office-like domains. Currently there exist no general spatial clustering algorithms that produce satisfying hierarchies for arbitrary environments, and more work concerning this problem, and on how it is solved by humans, is obviously needed. As a first approximation and as a basis for comparison with other approaches we hence resorted to using a quadtree decomposition, which can be regarded as a prototypical hierarchical representation that satisfies the required formal criteria. The proposed wayfinding algorithm that operates on the region hierarchy is cognitively plausible in that it derives paths by searching at different levels of detail, which humans seem to do as well. By only obtaining the first action and leaving the remaining path abstract, an agent can save on computation and memory resources. Furthermore, planning further ahead than necessary also bears the risk of rendering solutions obsolete, since actions lying far ahead are less likely to actually occur. This is especially important for environments that exhibit dynamic effects, e.g., closing doors, in which case offline algorithms are forced to perform complete replanning. Wiener and Mallot introduced the terms ’coarse-to-fine’ and ’fine-to-coarse’ to refer to the commitment of a wayfinding algorithm regarding the level of plan detail [13]. The mode of generating complete paths can be called ’coarse-to-fine’ because it decomposes complex tasks into smaller subtasks down to the atomic level. When only obtaining the next action, the agent restricts itself to planning the waypoints that are necessary for moving to the next atomic region, which can be thought of as a least-commitment strategy. This complies with the ’fine-to-coarse’ scheme, since a detailed plan is produced only for the immediate surroundings. The plan resolution decreases monotonically along the path, which we believe is an essential property of human navigation. Even though the resulting solutions are not guaranteed to be optimal, they are not significantly worse. When planning complete paths the algorithm exhibits a time complexity of only O(n log(log n)) and is thus more efficient than Dijkstra’s shortest path algorithm as well as hierarchical approaches like HPA∗ and hierarchical D∗. Unlike most algorithms that do not entirely rely on best-shot heuristics, our approach can be used to obtain intermediate actions without planning complete paths or risking moving towards dead ends. Such a retrieval of
the next action is done in O(log n). Aside from time complexity, the hierarchical organization leads to drastically reduced search spaces for most domains. There are two main sources of error that can cause suboptimality of the computed paths. One is the estimated path length based on regions at higher levels. This estimate becomes less accurate the more region size varies and the less convex the regions are. However, this problem affects human judgment of distance as well, as has been shown in [10]. The other type of error results from the way atomic connections between regions are selected. In the current implementation this is done by a first-match mechanism, which is problematic for environments with many region crossings, as in case of the mentioned grid maps. This problem could be reduced if suitable selection heuristics were available. One possibility is the use of relative directions, for which hierarchical region-based approaches already exist [25]. In addition to efficient wayfinding, the suggested hierarchical region-based representation could provide an agent with other useful skills. For example, it enables an abstraction from unnecessary details, and it can also overcome the problem of establishing a common reference frame. Obviously all this requires that real environments are actually regionalized in an inherently hierarchical fashion. We believe that this is indeed the case for most real-world scenarios. Finally, we expect hierarchical representations to facilitate the communication about spatial environments between artificial agents and humans [9,26] and spatial problem solving in general.
Acknowledgements This work was supported by the DFG (SFB/TR 8 Spatial Cognition, project A5-[ActionSpace]).
References 1. Thrun, S.: Robotic mapping: A survey (2002) 2. Werner, S., Krieg-Brückner, B., Herrmann, T.: Modelling navigational knowledge by route graphs. In: Habel, C., Brauer, W., Freksa, C., Wender, K.F. (eds.) Spatial Cognition II 2000. LNCS (LNAI), vol. 1849, pp. 295–317. Springer, Heidelberg (2000) 3. Kuipers, B.: The spatial semantic hierarchy. Technical Report AI99-281 (29, 1999) 4. Car, A., Frank, A.: General principles of hierarchical reasoning - the case of wayfinding. In: SDH 1994, Sixth Int. Symposium on Spatial Data Handling, Edinburgh, Scotland (September 1994) 5. Botea, A., Müller, M., Schaeffer, J.: Near optimal hierarchical path-finding. Journal of Game Development 1(1), 7–28 (2004) 6. Cagigas, D.: Hierarchical D* algorithm with materialization of costs for robot path planning. Robotics and Autonomous Systems 52(2-3), 190–208 (2005) 7. Graham, S., Joshi, A., Pizlo, Z.: The traveling salesman problem: a hierarchical model. Memory & Cognition 28(7), 1191–1204 (2000) 8. Pizlo, Z., Stefanov, E., Saalweachter, J., Li, Z., Haxhimusa, Y., Kropatsch, W.: Traveling salesman problem: a foveating pyramid model. Journal of Problem Solving 1, 83–101 (2006)
9. Tomko, M., Winter, S.: Recursive construction of granular route directions. Journal of Spatial Science 51(1), 101–115 (2006) 10. Stevens, A., Coupe, P.: Distortions in judged spatial relations. Cognitive Psychology 10, 526–550 (1978) 11. Hirtle, S.C., Jonides, J.: Evidence of hierarchies in cognitive maps. Memory and Cognition 13(3), 208–217 (1985) 12. McNamara, T.P.: Mental representations of spatial relations. Cognitive Psychology 18, 87–121 (1986) 13. Wiener, J., Mallot, H.: ’Fine-to-coarse’ route planning and navigation in regionalized environments. Spatial Cognition and Computation 3(4), 331–358 (2003) 14. Wiener, J., Schnee, A., Mallot, H.: Navigation strategies in regionalized environments. Technical Report 121 (January 2004) 15. Voicu, H.: Hierarchical cognitive maps. Neural Networks 16(5-6), 569–576 (2003) 16. Thomas, R., Donikian, S.: A model of hierarchical cognitive map and human memory designed for reactive and planned navigation. In: 4th International Space Syntax Symposium, Londres (June 2003) 17. Gadzicki, K., Gerkensmeyer, T., H¨ unecke, H., J¨ ager, J., Reineking, T., Schult, N., Zhong, Y., et al.: Project MazeXplorer. Technical report, University of Bremen (2007) 18. Schill, K., Zetzsche, C., Wolter, J.: Hybrid architecture for the sensorimotor representation of spatial configurations. Cognitive Processing 7, 90–92 (2006) 19. Schill, K., Umkehrer, E., Beinlich, S., Krieger, G., Zetzsche, C.: Scene analysis with saccadic eye movements: Top-down and bottom-up modeling. Journal of Electronic Imaging 10(1), 152–160 (2001) 20. Montello, D.R., Goodchild, M.F., Gottsegen, J., Fohl, P.: Where’s downtown?: Behavioral methods for determining referents of vague spatial queries. Spatial Cognition & Computation 3(2-3), 185–204 (2003) 21. Han, J., Kamber, M., Tung, A.K.H.: Spatial clustering methods in data mining. In: Geographic Data Mining and Knowledge Discovery, pp. 188–217. Taylor & Francis, Inc., Bristol (2001) 22. Dijkstra, E.W.: A note on two problems in connexion with graphs. Numerische Mathematik 1(1), 269–271 (1959) 23. Barbehenn, M.: A note on the complexity of dijkstra’s algorithm for graphs with weighted vertices. IEEE Trans. Comput. 47(2), 263 (1998) 24. Junghanns, A.: Pushing the limits: new developments in single-agent search. PhD thesis, University of Alberta (1999) 25. Papadias, D., Egenhofer, M.J., Sharma, J.: Hierarchical reasoning about direction relations. In: GIS 1996: Proceedings of the 4th ACM international workshop on Advances in geographic information systems, pp. 105–112. ACM, New York (1996) 26. Maaß, W.: From vision to multimodal communication: Incremental route descriptions. Artificial Intelligence Review 8(2), 159–174 (1994)
Analyzing Interactions between Navigation Strategies Using a Computational Model of Action Selection
Laurent Dollé1,∗, Mehdi Khamassi1,2, Benoît Girard2, Agnès Guillot1, and Ricardo Chavarriaga3
1 ISIR, FRE2507, Université Pierre et Marie Curie - Paris 6, Paris, F-75016, France
2 LPPA, UMR7152 CNRS, Collège de France, Paris, F-75005, France
3 IDIAP Research Institute, Martigny, CH-1920, Switzerland
[email protected]
Abstract. For animals as well as for humans, the hypothesis of multiple memory systems involved in different navigation strategies is supported by several biological experiments. However, due to technical limitations, it remains difficult for experimentalists to elucidate how these neural systems interact. We present how a computational model of selection between navigation strategies can be used to analyse phenomena that cannot be directly observed in biological experiments. We reproduce an experiment where the rat’s behaviour is assumed to be ruled by two different navigation strategies (a cue-guided and a map-based one). Using a modelling approach, we can explain the experimental results in terms of interactions between these systems, either competing or cooperating at specific moments of the experiment. Modelling such systems can help biological investigations to explain and predict the animal behaviour. Keywords: Navigation strategies, Action selection, Modelling, Robotics.
1
Introduction
In natural environments, animals encounter situations where they have to simultaneously learn various means for reaching interesting locations and to select dynamically the best to use. Many neurobiological studies in both rodents and humans have investigated how this selection is performed using experimental paradigms in which several navigation strategies may be learned in parallel. Some strategies may be based on a spatial representation (i.e., inferring a goal-directed action as a function of its location, called map-based strategies), whereas other strategies can be based on direct sensory-motor associations without requiring a spatial representation (i.e., map-free) [1,2,3,4]. A number of experimental results lead to the hypothesis that these strategies are learned by separate memory systems, with the dorsolateral striatum involved in the acquisition of the map-free strategies and the hippocampus mediating the map-based strategy [5,6].
Corresponding author.
However, it is not yet clear whether these learning systems are independent or whether they interact for action control in a competitive or in a cooperative manner. The competition implies that inactivation of one system enhances the learning of the remaining functional system, while the cooperation states that learning in one system would compensate the limitations of the other one [7,8,9,10,11]. The present work aims at investigating such interactions using a computational model of spatial navigation based on the selection between the map-based and map-free strategies [12]. Besides a qualitative reproduction of the experimental results obtained in animals, the modelling approach allows us to further characterize the competitive or cooperative nature of interactions between the two strategies. Following our previous modelling efforts [12], we study the interaction between the navigation strategies in the experimental paradigm proposed by Pearce et al. (1998) [13]. In this paradigm, which is a modification of the Morris Hidden Water Maze task [14], two groups of rats (“Control” group of intact animals and “Hippocampal” group of animals with damaged hippocampus) had to reach a hidden platform indicated by a landmark located at a fixed distance and orientation from the platform. After four trials, the platform and its associated landmark were moved to another location and a new session started. The authors observed that both groups of animals were able to learn the location of the hidden platform, but at the start of each new session the hippocampal animals were significantly faster in finding the platform than controls. Moreover, only the control rats were able to decrease their escape latencies within a session. From these results, authors conclude that rats could simultaneously learn two navigation strategies. On the one hand, a map-based strategy encodes a spatial representation of the environment based on visual extra-maze landmarks and self-movement information. On the other hand, a map-free strategy (called by the authors “heading vector strategy”) encodes the goal location based on its proximity and direction with respect to the intra-maze cue [15]. Based on these conclusions, the decrease in the escape latency within sessions could be explained by the learning of a spatial representation by intact animals. Furthermore, such learning also suggests that when the platform is displaced at the start of a new session, intact rats would swim to the previous (wrong) location of the platform based on the learned map, whereas hippocampal animals would swim directly to the correct location. For the modelling purposes, the results of this experiment can be summarized as follows: (i) both groups of rats could decrease their escape latencies across sessions, but only the control rats improved their performance within sessions; (ii) the improvement in the performance within each session, observed in the control group, could be attributed to the use of a map-based strategy by these rats; and (iii) higher performance of hippocampal rats relative to the controls at the start of each session could be due to the use of the map-free strategy (the only strategy that could be used by the lesioned animals). In other words, the process of choosing the best strategy (i.e. the competition) performed by
the control, but not the hippocampal, animals, decreased the performance of controls relative to that of lesioned animals. We have shown previously that the computational model used in the present study is able to reproduce the behaviour of rats in the experiment of Pearce et al. [12]. In the present paper, we extend these results by performing a further analysis of the interactions between both learning systems at different stages of the experiment, taking into account the three points formulated above. In the following section, we describe the model, the simulated environment and the experimental protocol. Then we present the results and the analyses. Finally, we discuss the results in terms of interactions between systems.
2 Methods and Simulation
2.1 Navigation Model
The neural network computational model is based on the hypothesis of different, parallel learning systems exclusively involved in each strategy and interacting for behaviour control (Fig. 1). It is composed of two experts, learning separately a map-based strategy and a map-free one (the experts are denoted MBe and MFe, respectively), both following reinforcement learning rules to acquire their policy, i.e., the way the expert chooses an action given the current state in order to maximize the reward. The model provides a mechanism that selects, at each timestep, which strategy should drive the behaviour of the simulated robot, given its reliability in finding the goal. This section briefly describes both navigational experts, their learning process, as well as the selection mechanism and the learning mechanism underlying this selection (for a more detailed description see [12]). The map-free strategy is encoded by the MFe, which receives visual signals from sensory cells (SI), consisting of a vector of 36 gray-value inputs (one input for every ten degrees), transducing a 360-degree horizontal 1-D gray-scale image. To simulate the heading vector strategy proposed by Pearce et al., the landmark is viewed with an allocentric reference: for example, when the landmark is located to the North with regard to the robot, it will appear in the same area of the camera, whatever the orientation of the robot. The map-based strategy is encoded by the MBe, which receives information from a spatial representation encoded in a regular grid of 1600 place cells (PC) with Gaussian receptive fields of width σPC [16] (values of all model parameters are given in Table 1). Strategy Learning. Both experts learn the association between their inputs and the actions leading the robot to the platform, using a direct mapping between inputs (either SI or PC) and directions of movement (i.e., actions). Movements are encoded by a population of 36 action cells (AC). The policy is learned by both experts by means of a neural implementation of the Q-learning algorithm [17].
[Fig. 1 diagram: sensory inputs and place cells project to the two experts MFe and MBe (each with action cells AC); a gating network computes the gating values gMFe and gMBe, and a selection stage combines (ΦMFe, AMFe) and (ΦMBe, AMBe) to output the selected direction Φ.]
Fig. 1. The computational model of strategy selection [12]. The gating network receives the inputs of both experts, and their reward prediction error, in order to compute their reliability according to their performance (i.e., gating values gk ). Gating values are then used with the Action value Ak in order to compute the probability of each expert to be selected. The direction Φ proposed by the winning expert is then performed. See text for further explanations.
Fig. 2. (a) A simplified view of ad hoc place cells. Each circle represents a place cell and is located at the cell’s preferred position (i.e., the place where the cells are most active). Cell activity is color coded from white (inactive cells) to black (highly active cells) (b) The environment used in our simulation (open circles: platform locations, stars: landmarks).
In this algorithm, the value of every state-action pair is learned by updating the synaptic weight w_ij linking input cell i to action cell j:

Δw_ij = η h_k δ e_ij ,   (1)

where η is the learning rate and δ the reward prediction error. The scaling factor h_k ensures that the learning module updates its weights according to its reliability (for all the following equations, k is either the MBe or the MFe). Its
computation is detailed further below. The eligibility trace e allows the expert to reinforce the state-action couples previously chosen during the trajectory:

e_ij(t + 1) = r_j^pre r_i + λ e_ij(t),   (2)
where r_j^pre is the activity of the pre-synaptic cell j, λ a decay factor, and r_i the activity of the action cell i in the generalization phase. Generalization in the action space is achieved by reinforcing every action, weighted by a Gaussian of standard deviation σ_AC centered on the chosen action. Each expert suggests a direction of movement Φ_k:

Φ_k = arctan( Σ_i a_i^k sin(φ_i) / Σ_i a_i^k cos(φ_i) ),   (3)

where a_i^k is the action value of the discrete direction of movement φ_i. The corresponding action value A_k is computed by linear interpolation of the two nearest discrete actions [17].

Action Selection. In order to select the direction Φ of the next robot movement, the model uses a gating scheme such that the probability of an expert being selected depends not only on the Q-values of its actions (A_k), but also on a gating value g_k. Gating values are updated in order to quantify the expert's reliability given the current inputs. The gating mechanism takes the shape of a network linking the inputs (place cells and sensory inputs) to the gating values g_k, computed as a weighted sum:

g_k = z_k^PC r^PC + z_k^SI r^SI ,   (4)

where z_k^PC is the synaptic weight linking the PC, with activation r^PC, to the gate k, and likewise for z_k^SI. Weights are updated in order to approach h_k = g_k c_k / Σ_i g_i c_i, where c_k = exp(−ρ δ_k²) (ρ > 0), according to the following rule:

Δz_kj^PC,SI = ξ (h_k − g_k) r_j^PC,SI .   (5)

The next action is then chosen according to a probability of selection P:

P(Φ = Φ_k) = g_k A_k / Σ_i g_i A_i .   (6)
If both experts have the same gating value (i.e., reliability), the expert with the highest action value will be chosen. In contrast, if both experts have the same action value, the most reliable expert, i.e., the one with the highest gating value, will be chosen.
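To make the learning and selection scheme concrete, the following Python sketch implements equations (1)-(6) for the two experts. It is a simplified illustration under our own naming conventions, not the authors' implementation; in particular the handling of the input vectors and reward prediction errors is schematic.

import numpy as np

rng = np.random.default_rng(0)

def gating_values(z, r):
    """Eq. (4): gating value of each expert as a weighted sum of its inputs."""
    return z @ r                                     # z: (2, n_in), r: (n_in,)

def select_expert(g, A):
    """Eq. (6): expert k is chosen with probability proportional to g_k * A_k."""
    p = g * A + 1e-12                                # small constant avoids 0/0
    return rng.choice(len(g), p=p / p.sum())

def update_gating(z, r, delta, xi=0.01, rho=1.0):
    """Eq. (5): move the gating weights towards the reliability target h_k."""
    g = gating_values(z, r)
    c = np.exp(-rho * delta ** 2)                    # c_k = exp(-rho * delta_k^2)
    h = g * c / np.sum(g * c)                        # target reliability h_k
    z = z + xi * np.outer(h - g, r)
    return z, h

def update_expert(w, e, h_k, delta_k, r_pre, r_action, eta=0.015, lam=0.76):
    """Eqs. (1)-(2): eligibility-trace Q-learning update of one expert."""
    e = lam * e + np.outer(r_action, r_pre)          # eligibility trace
    w = w + eta * h_k * delta_k * e                  # weight change scaled by h_k
    return w, e

In a full simulation, these updates would be applied at every timestep, after computing each expert's proposed direction and reward prediction error.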
2.2 Simulated Environment and Protocol
In our simulation, the environment is a square of size equivalent to 200×200 cm, while the simulated robot’s diameter is 15 cm (Fig. 2b). The landmark, represented by a star of diameter 10 cm, is always situated at a constant distance
Table 1. Parameters of the model

Parameter   Value    Description
N_PC        1600     Number of place cells
σ_PC        10 cm    Standard deviation of the PC activity profile
N_AC        36       Number of action cells
σ_AC        22.5°    Standard deviation of the enforced activity profile
η           0.015    Learning rate of both experts
λ           0.76     Decay factor of both experts
ξ           0.01     Learning rate of the gating network
ρ           1.0      Decreasing rate in c_k
of 30 cm to the North of the platform, whose diameter is 20 cm. These dimensions have been chosen in order to keep a ratio of distances similar to that of Pearce et al.'s experimental setting (the platform size has been scaled up, as the original size of 10 cm was too small and did not allow the experts to learn the task). The number of possible locations of the platform has been reduced from eight to four, in order to compensate for the new size of the platform. As in [13], at the beginning of each trial, the simulated robot is placed at a random position at least 120 cm from the platform. The robot moves at 10 cm per timestep, meaning that it requires at least 12 timesteps to reach the platform. If it is not able to reach the platform within 150 timesteps, it is automatically guided to it, as were the actual rats. A positive reward (R = 1) is provided when the platform is reached. We performed three sets of 50 experiments. In the first set, both experts (MBe and MFe) are functional (Control group); in the second set, only the MFe is activated (Hippocampal group). In the third set, only the MBe is activated. This "Striatal group" emulates a striatal lesion not included in the original experiment.
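A schematic sketch of the simulated protocol is given below. The `arena` and `model` objects are hypothetical interfaces standing in for the components described above, and the number of sessions is only suggested by Fig. 3, so this is an illustration rather than the original experimental code.

import numpy as np

rng = np.random.default_rng(0)
N_SESSIONS, N_TRIALS, MAX_STEPS = 12, 4, 150   # sessions as suggested by Fig. 3

def random_start(platform, min_dist=120.0, arena_size=200.0):
    """Draw a release position at least 120 cm away from the platform."""
    while True:
        pos = rng.uniform(0.0, arena_size, size=2)
        if np.linalg.norm(pos - platform) >= min_dist:
            return pos

def run_experiment(model, arena):
    latencies = np.zeros((N_SESSIONS, N_TRIALS))
    for s in range(N_SESSIONS):
        platform = arena.relocate_platform_and_landmark()   # new location each session
        for t in range(N_TRIALS):
            pos = random_start(platform)
            for step in range(MAX_STEPS):
                pos, reached = model.step(pos, platform)     # one 10-cm move per timestep
                if reached:
                    model.reward(1.0)                        # R = 1 at the platform
                    break
            else:
                pos = platform                               # guided to it, as the rats were
            latencies[s, t] = step + 1
    return latencies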
2.3 Data Analysis
Performances of the different groups were statistically assessed by comparing their mean escape latencies (Wilcoxon signed-rank test for matched pairs). Moreover, following Pearce's analysis, we assess learning differences within a session by comparing the performance on the first and fourth trials using the same test. Concerning the differences between the two groups (i.e., between the first trials of the Control and Hippocampal groups, and between their fourth trials), we use a Mann-Whitney test for unmatched samples. To assess strategy changes over the whole experiment, we compare the selection rates at the first and fourth trials of early (first three) and late (last three) sessions. The selection rate of each expert is recorded on two squares of 0.4 m², centered on the current and on the previous platform positions, and is computed as the number of times the robot chooses one strategy divided by the total number of times it enters each of these regions.
In order to estimate strategy changes within a trial, the selected strategy at each timestep is recorded. Since trajectories have different lengths, they are first normalized into 10 bins, and we then compute the selection rate in each of these bins. The navigational maps of both experts (i.e., the preferred orientation at each location of the environment) are also provided in order to illustrate changes in the experts' learning across trials or sessions. Finally, we evaluate the influence that the robot's behaviour, when controlled by one expert, has on the learning of the other expert. The average heading error across sessions is computed for the three groups. This heading error corresponds to the difference between the direction of movement proposed by the expert and the "ideal" direction pointing to the current platform location (the heading error is zero when the robot points towards the platform; an error of one means that the robot moves in the opposite direction). This error is computed in the neighbourhood of the current platform (on a square of 0.4 m²) in order to take values from places that are sufficiently explored by the robot. The influence between experts can be assessed by measuring whether the heading error of one strategy decreases as a result of the execution of the other strategy.
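The two trial-level measures described above can be sketched as follows; this is our own illustration (the binning and error definitions follow the text, but variable names and array layouts are assumptions).

import numpy as np

def selection_rate_per_bin(selected, n_bins=10):
    """Fraction of timesteps on which the MBe was selected, per normalized bin.

    selected : 1-D array of 0/1 flags (1 = MBe chosen) along one trajectory.
    """
    bins = np.array_split(np.asarray(selected, dtype=float), n_bins)
    return np.array([b.mean() for b in bins])

def heading_error(proposed_dir, pos, platform):
    """Angular difference (directions in radians) between the proposed movement
    direction and the direction to the platform, scaled to [0, 1]
    (0 = towards the platform, 1 = opposite direction)."""
    ideal = np.arctan2(platform[1] - pos[1], platform[0] - pos[0])
    diff = np.angle(np.exp(1j * (proposed_dir - ideal)))   # wrap to (-pi, pi]
    return np.abs(diff) / np.pi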
3 Results

3.1 Learning across and within Sessions
Our model qualitatively reproduces the results obtained in animals (Fig. 3a). As shown in Fig. 3b, both the Control and Hippocampal groups are able to learn the task, i.e., their escape latencies decrease with training. Moreover, the performance of the Control group improves within each session, as there is a significant decrease of the escape latency between the first and fourth trials (p<0.001). Finally, as was the case with the rats, the escape latencies of the Hippocampal group in the first trial are smaller than those of the Control group (p<0.001). Concerning the Striatal group, Fig. 3c shows a significant improvement within sessions for this group, but no learning is achieved across sessions, suggesting a key role of the MFe in the performance improvement across sessions observed in the Control group.
3.2 Role of Interactions between the MFe and the MBe in the Control Group
First trials: Increase of MFe Selection Across Sessions and Competition Between the MFe and the MBe Within Trials. In the first trial of every session, the platform is relocated, so that the strategy learned by the MBe in the previous session is no longer relevant. Accordingly, the selection of the MFe expert near the current platform location increases from the early to the late sessions (p<0.05), strongly suggesting a role of the MFe in the latency decrease across sessions that occurs in the Control group (Fig. 4a). Fig. 4a also shows that the MBe is often selected near the previous platform location, suggesting the existence of a competition between both experts. MBe
Fig. 3. Mean escape latencies measured during the first and the fourth trial of each session. (a) Results of the original experiment with rats, reproduced from [13]. (b) Hippocampal group (MFe only) versus Control group (provided with both a MFe and a MBe). (c) Striatal group (MBe only) versus Control group. See text for explanations.
Fig. 4. First trials: (a) Selection rates of MBe (empty boxes) and MFe (full boxes) near the current and the previous platform in early (top) and late sessions (bottom) (b) Selection rates of MBe and current goal occupation within trial in early (top) and late (bottom) sessions
preference does not change within a trajectory and the MBe is on average selected less often than the MFe (Fig. 4b). The trajectories (Fig. 5a and 5b) confirm the existence of a competition: the MBe tends to lead the robot to the previous location of the platform, as shown in the navigational maps of this expert (Fig. 5c and 5d), whereas the MFe has
Fig. 5. First trials: (a) Trajectory of the robot for the 3rd session (b) Trajectory of the robot for the 9th session. (c) Navigational map of the MBe for the 3rd session (d) Navigational map of the MBe for the 9th session (e) Navigational map of the MFe for the 3rd session (f) Navigational map of the MFe for the 9th session.
learned to orient the robot towards the appropriate direction, i.e., to the South of the landmark (Fig. 5e and 5f). This result is consistent with the explanation provided by Pearce and colleagues and shows that the competition between the MBe and the MFe is mainly responsible for the poor performance of the Control group in the first trials.

Fourth trials: Cooperation Between the MFe and the MBe Within Trials. At the end of a session, the platform location has remained stable for four trials, allowing the MBe to learn its location. According to Pearce's hypothesis, the rats' behaviour then depends mainly on the map-based strategy (involving the hippocampus), which has learned the platform location for this session. However, the simulation results show that the Striatal group (controlled by the MBe only) is outperformed by both the Hippocampal and the Control groups, despite a high improvement within sessions (cf. Fig. 3c). This suggests that the performance of the Control group on the fourth trials cannot be explained exclusively by the MBe expert. Indeed, although this expert leads the agent towards the current goal position, it also leads it to the previous goal location, as illustrated by its selection rate at both sites (Fig. 6a). In addition, selection rates within a trajectory show a strategy change from the MFe, which is preferred at the beginning of a trial, towards a preference for the MBe at the end of the trajectory (Fig. 6b).
Fig. 6. Fourth trials: (a) Selection rates of MBe (empty boxes) and MFe (full boxes) near the current and the previous platform in early (top) and late (bottom) sessions. (b) Selection rates of MBe and current goal occupation within trial in early (top) and late (bottom) sessions.
This sequence is visible in typical trajectories (Fig. 7a and 7b). The navigational maps of each expert reveal that the MFe orients the robot towards the South of the landmark (Fig. 7e and 7f), whereas the MBe leads it to the precise location of the platform, but only when the robot is in its vicinity (Fig. 7c and 7d). This suggests that the experts cooperate, both adequately contributing to the resolution of the task depending on their reliability at a specific point of the journey. Our findings, pointing out a cooperative interaction at the end of each session, extend Pearce's hypothesis of MBe dominance in behaviour control.
3.3 Interactions between MFe and MBe
In the simulations of both the Hippocampal and Striatal groups, the inactivation of one expert only prevented it from controlling the robot's behaviour, not from learning. We can thus analyze how the interactions influence the learning of each strategy. First, looking at the accuracy of both experts in the neighbourhood of the current platform (Fig. 8), we observe that when the robot's behaviour is driven by the MBe (i.e., Striatal group), the performance of the MFe decreases (Fig. 8c). Second, we observe that the MBe performs better in the Control group (Fig. 8a) than in the Striatal and Hippocampal groups (Fig. 8b and c), presumably because of the influence of the efficient learning of the MFe (i.e., cooperative interactions). The navigational maps of the MFe are similar (i.e., pointing to the South of the landmark) for the Control, Striatal and Hippocampal groups, despite the differences in performance observed above (Fig. 9c, d and 7f). In contrast, those of the MBe differ: in the Striatal group (Fig. 9a), the MBe is less attracted by the previous platform location than in the Control group (Fig. 7d), whereas it is attracted by the four possible locations in the Hippocampal group (Fig. 9b). The
Fig. 7. Fourth trials: (a) Trajectory of the robot for the 3rd session (b) Trajectory of the robot for the 11th session. (c) Navigational map of the MBe for the 3rd session (d) Navigational map of the MBe for the 11th session (e) Navigational map of the MFe for the 3rd session (f) Navigational map of the MFe for the 11th session.
Fig. 8. Average heading error near the current platform for the three groups. Zero means the expert is pointing to the platform, one means a difference of π. (a) Results in the Control group (MBe and MFe activated) (b) Hippocampal group (MFe only) (c) Striatal group (MBe only).
MBe is able to reach every possible platform location, but only when the robot is in its vicinity. This suggests that a cooperation between the MFe (leading the robot to the neighbourhood of the current platform) and the MBe (finding the precise location once the robot is there) would perform well and enhance the performance of the robot. Therefore, this particular configuration of the MBe is impaired in the
Fig. 9. (a) Navigational map of the MBe in the Striatal group in the last session (fourth trial): there are no centers of attraction at platform locations other than the current and the previous ones. (b) Navigational map of the MBe in the Hippocampal group in the last session (fourth trial): the MBe has learned the ways to go to the four possible locations of the platform. (c) Navigational map of the MFe in the Striatal group in the last session (fourth trial): it has learned the same kind of map as in the Hippocampal and the Control groups. (d) Navigational map of the MFe in the Hippocampal group in the last session (fourth trial): the learned policy is very close to the one in the Striatal group.
case where the MBe has to perform the trajectory alone, but enhanced in the case of a cooperation with the MFe. We observe that the behaviour of the robot when controlled by the MFe strongly influences the learning of the MBe. In contrast, the MBe-based behaviour has less influence on the improvement of the MFe strategy. Remarkably, the activation of both experts (i.e., Control group) does not impair the simultaneous learning of both strategies and allows the MBe to achieve better performance than when this expert is the only one available.
4 Discussion

4.1 Competition and Cooperation
We have been able to reproduce the behaviour of rats in an experiment designed to study interactions between different spatial learning systems. Our simulation results are consistent with the original hypothesis of a competitive interaction between map-based (MB) and map-free (MF) strategies at the start of a session, when the location of the hidden cue-marked platform suddenly changes [13]. In addition, our model suggests a cooperative interaction during the learning of the current location within a session. In these trials, the MF strategy is preferred at the beginning of the journey, when the local cue gives information about the general direction to follow; as the robot gets closer to the goal, the MB strategy provides more accurate information about the actual platform location and is chosen more often.
Other experimental studies have reported strategy changes during a journey depending on the available information and the animal's previous experience [18,19]. Hamilton et al. [19] reported a change from a map-based to a taxon strategy when rats were looking for a visible, stable platform. In contrast to Pearce et al.'s setting, there were no intra-maze cues, and the authors report that rats first used distal landmarks to find a general direction and then approached the platform using a taxon strategy. Both our results and those of Hamilton et al. follow the same rationale, i.e., rats first choose a general direction of movement and then choose the strategy that allows them to accurately locate the platform. In that study, the rats' head scanning was analyzed in order to estimate the strategy changes. The same approach could be applied to the animal trajectories in Pearce's paradigm in order to identify whether the strategy change predicted by our model is confirmed by the rats' behaviour.
4.2 Synergistic Interactions and Dependence of an Expert on Another One
Changes in the heading error, assessed by the evolution of this error in the different experimental groups, suggest synergistic interactions between the two experts. The MFe orients the robot towards the landmark, and the MBe helps the robot to find the platform in the vicinity of the landmark. If we define an expert as dependent on another when it cannot achieve the task alone, we conclude that the MBe is dependent on the MFe, as the MBe alone does not learn the task across sessions. It should be noted that the opposite relationship (i.e., the MFe depending on the MBe) has been reported under different experimental conditions (see [11] for a review).
4.3 Further Work
Despite qualitatively reproducing most of the results reported by Pearce et al. [13], our model differs from the animal results in that a performance improvement was observed within sessions in the Hippocampal group. This difference seems to be mainly due to the learning process of the MFe in cases where, in the previous session, the robot could reach the platform only by following the landmark (for example, if the platform is at the North, as illustrated in Fig. 10). This impairment can also explain the absence of convergence of both groups in the last session. In contrast to Pearce's results, no significant difference is found between the fourth trials of the Control and Hippocampal groups. We attribute this to the stochastic selection process, i.e., the probabilities associated with a strategy (see Section 2.1), which is sometimes sub-optimal. More generally, our results might be improved by the use of a dynamically updated hippocampal map, as well as by the use of explicit extra-maze cues on which, according to the authors, both strategies were anchored. In our simulation, these cues were only represented by an absolute reference for the MFe and an ad hoc cognitive map for the MBe. Finally, models of the map-based strategy other than place-response associations can be
Fig. 10. (a) Trajectory at the fourth trial of the 7th session: as the simulated robot mainly approached this platform from the South, directions to the North were reinforced, even at the North of the platform. (b) Trajectory at the first trial of the 8th session: starting from the North, the robot then needs a longer trial to readjust its direction towards the current platform. (c) Navigational map of the MFe at the fourth trial of the 7th session: directions to the North were reinforced, even at the North of the platform. (d) Navigational map of the MFe at the first trial of the 8th session.
taken into account. The place-response strategy currently used in the model associates locations with actions that lead to a single goal location. Therefore, when the platform is relocated, the strategy has to be relearned. An alternative map-based strategy could be proposed in which the relations between different locations are learned irrespective of the goal location (e.g., a topographical map of the environment). Planning strategies can then be used to find the new goal location without relearning [3]. The use of computational models of planning (e.g., [20,21]) as the map-based strategy in our model could yield further insights into the use of spatial information in these types of tasks.
5 Conclusion
What stands out from our results is that our model allowed us to analyze the selection changes between the two learning systems, while providing information that is not directly accessible in experiments with animals (e.g., strategy selection rate, expert reliability). This information can be used to elaborate predictions and propose new experiments towards the two-fold goal of further improving
our models and expanding our knowledge of animal behaviour. The model also showed that opposite interactions can occur within a single experiment and depend mainly on contextual contingencies and practice, as has been suggested by recent work (e.g., [22,23]). The coexistence of several spatial learning systems allows animals to dynamically select the navigation strategy that is most appropriate to achieve their behavioural goals. Furthermore, the interaction among these systems may improve performance, either by speeding up learning through the collaboration of different strategies or through competitive processes that prevent sub-optimal strategies from being applied. Moreover, a better understanding of these interactions in animals by means of the modelling approach described in this paper also contributes to the improvement of autonomous robot navigation systems. Indeed, several bio-inspired studies have begun exploring the robotic use of multiple navigation strategies [12,24,25,26]; the topic is, however, far from being fully explored.
Acknowledgments. This research was supported by the EC Integrated Project ICEA (Integrating Cognition, Emotion and Autonomy). The authors wish to thank Angelo Arleo, Karim Benchenane, Jean-Arcady Meyer and Denis Sheynikhovich for useful discussions.
References

1. Trullier, O., Wiener, S.I., Berthoz, A., Meyer, J.A.: Biologically-based artificial navigation systems: review and prospects. Progress in Neurobiology 83(3), 271–285 (1997)
2. Filliat, D., Meyer, J.A.: Map-based navigation in mobile robots - i. a review of localisation strategies. Journal of Cognitive Systems Research 4(4), 243–282 (2003)
3. Meyer, J.A., Filliat, D.: Map-based navigation in mobile robots - ii. a review of map-learning and path-planning strategies. Journal of Cognitive Systems Research 4(4), 283–317 (2003)
4. Arleo, A., Rondi-Reig, L.: Multimodal sensory integration and concurrent navigation strategies for spatial cognition in real and artificial organisms. Journal of Integrative Neuroscience 6, 327–366 (2007)
5. Packard, M., McGaugh, J.: Double dissociation of fornix and caudate nucleus lesions on acquisition of two water maze tasks: Further evidence for multiple memory systems. Behavioral Neuroscience 106(3), 439–446 (1992)
6. White, N., McDonald, R.: Multiple parallel memory systems in the brain of the rat. Neurobiology of Learning and Memory 77, 125–184 (2002)
7. Kim, J., Baxter, M.: Multiple brain-memory systems: The whole does not equal the sum of its parts. Trends in Neurosciences 24(6), 324–330 (2001)
8. Poldrack, R., Packard, M.: Competition among multiple memory systems: Converging evidence from animal and human brain studies. Neuropsychologia 41(3), 245–251 (2003)
9. McIntyre, C., Marriott, L., Gold, P.: Patterns of brain acetylcholine release predict individual differences in preferred learning strategies in rats. Neurobiology of Learning and Memory 79(2), 177–183 (2003)
10. McDonald, R., Devan, B., Hong, N.: Multiple memory systems: The power of interactions. Neurobiology of Learning and Memory 82(3), 333–346 (2004)
11. Hartley, T., Burgess, N.: Complementary memory systems: Competition, cooperation and compensation. Trends in Neurosciences 28(4), 169–170 (2005)
12. Chavarriaga, R., Strösslin, T., Sheynikhovich, D., Gerstner, W.: A computational model of parallel navigation systems in rodents. Neuroinformatics 3(3), 223–242 (2005)
13. Pearce, J., Roberts, A., Good, M.: Hippocampal lesions disrupt navigation based on cognitive maps but not heading vectors. Nature 396(6706), 75–77 (1998)
14. Morris, R.: Spatial localisation does not require the presence of local cues. Learning and Motivation 12, 239–260 (1981)
15. Doeller, C.F., King, J.A., Burgess, N.: Parallel striatal and hippocampal systems for landmarks and boundaries in spatial memory. Proceedings of the National Academy of Sciences of the United States of America 105(15), 5915–5920 (2008)
16. Arleo, A., Gerstner, W.: Spatial cognition and neuro-mimetic navigation: A model of hippocampal place cell activity. Biological Cybernetics 83(3), 287–299 (2000)
17. Strösslin, T., Sheynikhovich, D., Chavarriaga, R., Gerstner, W.: Robust self-localisation and navigation based on hippocampal place cells. Neural Networks 18(9), 1125–1140 (2005)
18. Devan, B., White, N.: Parallel information processing in the dorsal striatum: Relation to hippocampal function. Neural Computation 19(7), 2789–2798 (1999)
19. Hamilton, D., Rosenfelt, C., Whishaw, I.: Sequential control of navigation by locale and taxon cues in the morris water task. Behavioural Brain Research 154(2), 385–397 (2004)
20. Martinet, L.E., Passot, J.B., Fouque, B., Meyer, J.A., Arleo, A.: Map-based spatial navigation: A cortical column model for action planning. In: Spatial Cognition (in press, 2008)
21. Filliat, D., Meyer, J.: Global localization and topological map-learning for robot navigation. In: Proceedings of the Seventh International Conference on Simulation of Adaptive Behavior (From Animals to Animats 7), pp. 131–140 (2002)
22. Pych, J., Chang, Q., Colon-Rivera, C., Haag, R., Gold, P.: Acetylcholine release in the hippocampus and striatum during place and response training. Learning & Memory 12(6), 564–572 (2005)
23. Martel, G., Blanchard, J., Mons, N., Gastambide, F., Micheau, J., Guillou, J.: Dynamic interplays between memory systems depend on practice: The hippocampus is not always the first to provide solution. Neuroscience 150(4), 743–753 (2007)
24. Meyer, J., Guillot, A., Girard, B., Khamassi, M., Pirim, P., Berthoz, A.: The Psikharpax project: Towards building an artificial rat. Robotics and Autonomous Systems 50(4), 211–223 (2005)
25. Guazzelli, A., Corbacho, F.J., Bota, M., Arbib, M.A.: Affordances, motivations, and the world graph theory, pp. 435–471. MIT Press, Cambridge (1998)
26. Girard, B., Filliat, D., Meyer, J.A., Berthoz, A., Guillot, A.: Integration of navigation and action selection in a computational model of cortico-basal ganglia-thalamo-cortical loops. Adaptive Behavior 13(2), 115–130 (2005)
A Minimalistic Model of Visually Guided Obstacle Avoidance and Path Selection Behavior

Lorenz Gerstmayr 1,2, Hanspeter A. Mallot 1, and Jan M. Wiener 1,3
1 Cognitive Neuroscience, University of Tübingen, Auf der Morgenstelle 28, D-72076 Tübingen, Germany
2 Computer Engineering Group, University of Bielefeld, Universitätsstr. 25, D-33615 Bielefeld, Germany
3 Centre for Cognitive Science, University of Freiburg, Friedrichstr. 50, D-79098 Freiburg, Germany
Abstract. In this study we present an empirical experiment investigating obstacle avoidance and path selection behavior in rats and a number of visually guided models that could account for the empirical data. In the experiment, the animals were repeatedly released into an open arena containing several obstacles and a single feeder that was marked by a large visual landmark. We recorded and analyzed the animals' trajectories as they approached the feeder. We found that the animals adapted their paths according to the specific obstacle configurations, not only to avoid the obstacles that were blocking the direct path, but also to select optimal or near-optimal trajectories. On the basis of these results, we then develop and present a series of minimalistic models of obstacle avoidance and path selection behavior that are based purely on visual input. In contrast to standard approaches to obstacle avoidance and path planning, our models do not require a map-like representation of space.

Keywords: Spatial cognition, obstacle avoidance, path selection, biologically inspired model.
1 Introduction
Selecting a path to approach a goal while avoiding obstacles is a fundamental spatial behavior. Surprisingly, few studies have investigated the underlying mechanisms and strategies in animals or humans (but see [1,2]). In the robotics community, in contrast, obstacle avoidance and path selection is a vivid field of research and several models have been developed (for an overview see [3,4]). These models usually require rich spatial information: for example, the distances and directions to the goal and the obstacles have to be known, and often a 2d map of the environment has to be generated to select a trajectory to the goal. We believe that in many situations successful navigation behavior can also be achieved using very sparse spatial information directly obtained from vision, without map-like representations of space. In this article, we present a series of minimalistic visually guided models that closely predict empirical results on path selection and obstacle avoidance behavior in rats.
Obstacle avoidance methods which are related to the models proposed in the following can be divided into two main categories. The first group models the goal as an attractor whereas each obstacle is modelled as a repellor. Thus, the position of each obstacle has to be known, and the model's complexity depends on the number of obstacles. This group of methods is influenced by potential field methods [3,4], which treat the robot as a particle moving in a vector field. The combination of attractive and repulsive forces can be used to guide the agent towards the goal while avoiding obstacles. Potential fields suffer from several limitations: the agent can get trapped in local minima, lateral obstacles can have a large influence on the agent's path towards the goal, and the approach predicts oscillating trajectories in narrow passages [5]. Several improvements of the original method have been proposed to overcome these drawbacks [6]. Potential fields have also been used to model prey-approaching behavior in toads [7].

The task of goal approaching and obstacle avoidance can also be formulated as a dynamical system [8]. The movement decision is generated by solving a system of differential equations. Again, the goal is represented as an attractor whereas each obstacle is modelled as a repellor. The model has been used to explain data obtained in human path selection experiments [2]. In this model, route selection emerges from on-line steering rather than from explicit path planning. In comparison to the potential field method, the dynamical approach predicts smoother paths and does not get trapped in local minima. A further extension of the model was tested in real robot experiments [6].

The second class of obstacle avoidance methods relies only on distance information at the agent's current position and does not assume that the exact position of each obstacle is known. The family of vector-field histogram (VFH) methods [9,10,11] uses an occupancy grid as a representation of the agent's environment. In a first processing step, obstacle information is condensed into a 1d polar histogram. In this representation, candidate corridors are identified and the corridor which is closest to the goal direction is selected. The VFH+ method additionally considers the robot's dynamics: corridors which cannot be reached due to the robot's movement constraints are rejected [10]. The VFH* method incorporates a look-ahead verification based on a map of the agent's environment to prevent trap situations due to dead ends [11]: for determining the movement decision, it takes the consequences of certain movement decisions into account.

In the following, we present an exploratory empirical study investigating obstacle avoidance and path selection behavior in rats (Sec. 2). We then (Sec. 3) present a series of minimalistic visually guided models of obstacle avoidance and path selection that could account for the empirical data. In contrast to the models introduced above, the proposed models (1) act purely on visual input, (2) do not require map-like representations of space, and (3) in the case of the most basic 1d model, do not require any distance information to generate obstacle avoidance and path selection behavior. Based on our findings, we finally conclude (Sec. 4) that reliable obstacle avoidance is possible with such minimalistic models and that our models implicitly solve the path-planning problem.
2 Part 1: Behavioral Study
In this part we present an exploratory study examining path selection and obstacle avoidance behavior in rats. For this, animals were trained to receive a food reward at a landmark visible from the entire experimental arena. Rats were repeatedly released into this arena and their paths approaching this landmark were recorded. Placing obstacles between the start position and the landmark allowed us to systematically investigate how rats reacted to the obstacles during target approach. The behavioral results from this experiment are the basis for the development of a series of visually guided models of obstacle avoidance and path selection behavior in the second part of this manuscript.
2.1 Material and Methods
Animals. Six Long Evans rats (rattus norvegicus), approximately 7 weeks old at the beginning of the study, weighing between 150 and 200 g, participated in the study. They were housed individually under constant temperature and humidity.
Fig. 1. Experimental setup
Apparatus. The apparatus consisted of an open area (140 × 140 cm), separated from the remaining laboratory by white barriers (height 40 cm) and surrounded by a black curtain (see Fig. 1). Within this area up to 6 obstacles (brown 0.5 l bottles) were distributed. Food was available from a small feeder that was placed directly under a black-white striped cylinder (25 cm diameter, 80 cm in height). The cylinder was suspended from the ceiling about 40 cm above the ground and was visible in the entire arena. A transparent start box was placed in one of the corners of the arena. At the beginning of each trial, rats were released by opening the door of the start box. Their trajectories were recorded by a tracking system registering the position of a small reflector foil that was attached to a soft leather harness the animals were wearing (sampling rate: 50 Hz).
Procedure. Prior to the experiments, the animals were familiarized with the experimental setup by repeatedly placing them in the arena and allowing them to explore it for 180 sec. The obstacles, feeder and landmark, and the start box were randomly repositioned within the arena on a daily basis. After familiarization, the animals were trained to retrieve cocoa flavored cereal (Kellogg's) that was located in the feeder under the landmark. For each trial, the feeder (incl. landmark) as well as the obstacles were randomly distributed in the arena. Before each trial, the rats were placed in the start box and released after 15 seconds. A trial lasted until the food reward was found or until 600 seconds had passed. Animals were given 4 training trials each day for a period of 10 days. In addition to the food reward during the experiments, the rats received 15 g of rodent food pellets per day. The procedure during the test phase was identical to the training phase, but animals received 8 trials per day for a given test configuration of obstacles and feeder. For each trial, the rats' trajectories were recorded until the food reward was obtained. Each day, the release box was randomly repositioned in one of the corners of the arena; the positions of feeder and obstacles were adjusted accordingly. Each rat was tested in each of the 20 test configurations (see Figs. 2, 3, and 7). The rats were subdivided into 2 groups with 3 animals each that were exposed to the test configurations in different orders.

Analysis. For each configuration, we evaluated the percentage of trials in which a single animal passed the obstacles on the left or on the right side. For configurations in which the obstacles created a gap (see Fig. 3) we also calculated the percentage of trials in which animals passed through that gap. Altogether, 67 trials (6.98 %) were removed from the final data set (893 trials remaining). This was due to the following reasons: (1) in 8 trials the tracking system failed; (2) in the remaining 59 trials the rats either left the open area by running towards and touching the surrounding walls, or they turned more than 180°, thus running in a loop. In all these cases, the animals did not behave in a goal-directed manner (i.e., did not approach the feeder position).
2.2 Results
Figs. 2, 3, and 7 display the entire set of trajectories for all configurations as well as the relative frequencies for passing the obstacles on the left side or the right side or passing through the gap. It is apparent from these figures that the rats adapted their path selection behavior according to the specific configuration. In the following, we present a detailed analysis of how the configurations influenced path selection behavior. Of specific interest are the questions of whether animals minimized path length, how animals reacted to gaps of different sizes, and whether animals built up individual preferences to pass obstacle configurations on the left or right side. Finally, we extract motion parameters from the recorded trajectories that will be used in the second part, in which we present a visually guided model of path selection and obstacle avoidance behavior.
Fig. 2. Asymmetric configurations: rats’ chosen trajectories are displayed in the upper row, the predictions of the 1d-model (see Sec. 3) are displayed below. The black and gray horizontal bars depict the animals (upper) or the models (lower) behavior with respect to passing the obstacles on the left (black) or the right (light gray) side.
Distance minimization. Fig. 2 displays the asymmetric configurations. Passing the obstacles on the right and on the left side resulted in paths of unequal length. For these configurations we evaluated whether animals showed a preference for the shorter alternative. Animals preferred the shorter over the longer alternative in 76.53 % of the runs (t-test against chance level (50 %), t(5)=8.09, p<0.001). Gap size. In configurations 13 to 16 the obstacles were arranged such that they created a gap (see Fig. 3). The width of the gap was either 32 cm (configurations 13 and 14) or 14 cm (configurations 15 and 16). Rats’ behavior in choosing the path through the gap depended on the width of the gap. In configurations 13 and 14 (wide gap) they ran through the gap in 83.76 % of the runs as compared to 36.20 % of the runs for the configurations 15 and 16 (narrow gap; t-test: t(5)=3.00, p=0.03). Symmetric configurations. In the symmetric configurations (see Fig. 3) passing the obstacle on the right side and on the left side resulted in paths of equal
Fig. 3. Symmetric configurations: rats’ chosen trajectories are displayed in the upper row, the predictions of the 1d-model (see Sec. 3) are displayed below. The gray shaded horizontal bars depict the animals (upper) or the models (lower) behavior with respect to passing the obstacles on the left (black) or the right (light gray) side, or passing through the gap (middle gray).
length. The symmetric configurations allowed us to investigate whether rats developed individual preferences to pass the obstacles on the left side or the right side. Three rats displayed an overall tendency to pass obstacles on the left side; the remaining three rats displayed a tendency to pass obstacles on the right side. While the individual preferences are moderate at the beginning of the experiment, they strongly increase over time (i.e., over experimental sessions, r=0.89, p<0.01).

Locomotor behavior. To adjust the movement parameters of our visually guided models of obstacle avoidance and path selection behavior and to compare their behavior to the empirical data presented above, it was necessary to estimate the rats' navigation velocity and turning behavior from the recorded trajectories. To do so, we hand-selected 6 runs for each rat (36 runs in total) for which the tracking data were precise and error-free. We then calculated the distance covered by the rat between 2 successive samples (tracker frequency: 50 Hz). Furthermore,
Fig. 4. Left: histogram of rats’ distance covered between 2 subsequent samples; right: histogram of change in orientation between 2 subsequent samples. The vertical gray lines mark the mean distance and the mean orientation change, respectively.
we calculated the rats' change in orientation (turning rate) between successive samples. Fig. 4 displays the results of this analysis. The average distance covered was 2.1 cm per timestep, which corresponds to a velocity of 105 cm/sec; the average turning rate was ±9.25° per timestep.
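For illustration, the extraction of these movement parameters from a tracked trajectory could be sketched as follows (assumed array layout; not the authors' analysis code):

import numpy as np

def locomotor_parameters(xy):
    """Per-sample step length (cm) and absolute change of orientation (deg)
    for a trajectory given as an (n, 2) array of tracked positions at 50 Hz."""
    steps = np.diff(xy, axis=0)                         # displacement between samples
    dist = np.linalg.norm(steps, axis=1)                # step length per sample
    heading = np.degrees(np.arctan2(steps[:, 1], steps[:, 0]))
    dtheta = np.diff(heading)
    dtheta = (dtheta + 180.0) % 360.0 - 180.0           # wrap to (-180, 180]
    return dist, np.abs(dtheta)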
2.3 Discussion
In this part of the work we presented an exploratory study examining rats' path selection and obstacle avoidance behavior. Animals were released from a start box into an open arena with a number of obstacles and a feeder marked by a large landmark. Rats avoided the obstacles and approached the feeder quickly and efficiently. In fact, in over 75 % of the trials in which path alternatives differed in length, the animals showed a preference for the shorter alternative. These empirical results demonstrate that the animals reacted to the specific target configurations. The fact that animals minimized path length is remarkable to some extent, as the additional energy expenditure when taking detours or suboptimal paths is estimated to be rather small in this scenario. Nevertheless, the animals did not adopt a general strategy that could account for the entire set of configurations, such as moving towards and along the walls, but decided on the exact path on a trial-by-trial basis. It has to be noted, however, that for the symmetric configurations (see Fig. 3) rats built up rather strong individual preferences to pass obstacles on the right or the left side. Such preferences can be explained by motor programs. Rats are well known to develop stereotyped locomotor behavior when repeatedly exposed to the same situation (for an overview see [12]), such as being released from the start box. In other words, the animals made movement decisions already in the start box that were independent of the specific obstacle configuration. However, at some point on their trajectory, the animals reacted to the configuration; otherwise no variance in behavior would have been observed.
3 Part 2: A Visually Guided Model of Obstacle Avoidance and Path Selection
In this part of our paper, we present a series of visually guided models for obstacle avoidance and path selection behavior which are inspired by the experiments presented above. The proposed algorithms were designed to be both minimalistic and biologically plausible models of the rats' behavior. The models are purely reactive, do not build or iteratively update a map-like representation of the environment, and make use only of visual information. For our models, the position of each obstacle with respect to the agent's position does not need to be known; their complexity solely depends on the size of the visual input. By such a bottom-up approach we hope to find out which kind of information is relevant for the rat.

Visual input. As input, our models use a panoramic image with a horizontal field of view of 360°. The vertical field of view covers 90° below the horizon. The rats' visual field above the horizon is neglected as it does not, at least for our setup, contain information necessary for obstacle avoidance. The angular resolution of the images is 1° per pixel. Images are generated by a simple raycaster assuming that the rats' eye level is 5 cm above the ground plane. The process of image generation is sketched in Fig. 5. For each direction of sight, the raycaster computes the distance from the current robot position to the corresponding obstacle (modeled as cylinders, black pixels) or the walls of the arena (white pixels). Since an object's distance to the agent is directly linked to the elevation under which it is imaged, a 2d view of the environment can be computed (Fig. 5, middle). Close-by objects are imaged both larger and under a larger elevation than distant objects. Based on the 2d images, 1d images can be obtained by taking slices of constant elevation below the horizon (gray horizontal lines) out of the 2d image. Depending on the elevation of the slice, the resulting 1d view only contains obstacle information up to a certain distance (Fig. 5, right). In case the slice is taken along the horizon (top right), objects at a very large distance are also imaged. In this case, no depth cue is available because we do not analyze the angular extent of the obstacles and we do not compute optical flow between two consecutive input images. Since the input images only contain visual information about the obstacles, the goal direction (Fig. 5, vertical dark gray line) with respect to the agent's current heading direction (vertical light gray line) is provided as another input parameter.

Model assumptions and restrictions. As outlined in the previous section, the raycaster computes a binary image which only contains the obstacle information (Fig. 5). Assuming such obstacle-segmented images facilitates the further processing of the visual information by our models. Nevertheless, we are aware that such a strong assumption makes it difficult to test the proposed models with a robot operating in a natural environment. Such a test would require further image preprocessing steps such as ground plane segmentation: based on color or texture cues, optical flow, or geometrical considerations, such algorithms can
Fig. 5. Image generation. For detailed explanations see the text; for visualization purposes, the 1d views are stretched in the y-direction.
classify whether image regions lie in the ground plane or not (for reviews see [13,14]). The ground plane could then be assumed to be free space, whereas other image regions would be interpreted as obstacles. Assuming an image preprocessing step is also reasonable from the viewpoint of visual information processing in the brain: lower processing stages usually transform the visual information into a representation which facilitates further processing by higher-level stages [15]. As the goal direction could be derived in earlier stages, we think it is reasonable to pass it as an input parameter to our models.

The behavior is modeled in discrete time steps, each time step corresponding to one sampling cycle of the tracker used. The models also neglect dynamic aspects of the moving rats. The simulated agents move with a constant velocity of 2.1 cm per time step and a maximum turning rate of ±9.25° per time step (compare Sec. 2.2). By limiting the maximum turning rate, aspects of the agent's kinematics are at least partially considered [10], though the simplifications could complicate a real robot implementation.
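Under the assumptions above (obstacles taller than the 5 cm eye level, per-azimuth distances provided by the raycaster), the derivation of a binary 1d view for a given slice elevation can be sketched as follows. This is our own reading of the image-generation step, not the authors' raycaster.

import numpy as np

EYE_HEIGHT = 5.0      # cm above the ground plane

def one_d_view(obstacle_dist, elevation_deg):
    """Binary 1d view (1 = obstacle) from a 360-value array of distances to the
    nearest obstacle per 1-degree azimuth (np.inf where no obstacle is hit).

    A slice at `elevation_deg` below the horizon images the ground plane at
    distance EYE_HEIGHT / tan(elevation); obstacles nearer than that distance
    (and therefore extending above this elevation) appear in the slice.  A
    slice along the horizon (elevation 0) images obstacles at any distance.
    """
    if elevation_deg <= 0.0:
        max_dist = np.inf
    else:
        max_dist = EYE_HEIGHT / np.tan(np.radians(elevation_deg))
    return (np.asarray(obstacle_dist) <= max_dist).astype(np.uint8)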
3.1 1D Model without Depth Information
In this section, we propose a model for obstacle avoidance and goal approaching behavior. As it uses a 1d view of the environment taken along the horizon, no depth cues are available; we therefore refer to it as the 1d model. For our 1d model, we only use the most fundamental building blocks needed to achieve obstacle avoidance and goal approaching behavior. These building blocks are (1) the ability to detect obstacles and (2) the ability to steer [2,6,8]. These two abilities are sufficient to guide the agent towards its goal position; thus, the path planning problem is implicitly solved. In detail, our algorithm includes the following steps (see Fig. 6):

(0) The agent is initialized with a position and an orientation.

(1) Obstacles are enlarged by a constant angle δ which is independent of the agent's distance to the obstacle and of the extent of the obstacle within the input image. The angle δ is the only model parameter of our 1d model. Growing the robot's representation of obstacles is a standard method in mobile robotics: it ensures that the robot passes the real obstacles at a safe distance [4].
Fig. 6. Left: sketch of the 1d model. For explanations see the description above. Right: optimization residuals E depending on the enlargement parameter δ. For details see the section below.
(2) Check whether the goal direction γ is blocked by an obstacle or not. In case it is not blocked, choose α = γ as the desired movement direction for the next simulation step and proceed with step (4).

(3) Determine the angles β_l, β_r between the agent's current heading direction and the borders of the enlarged obstacles. In case abs(β_l) < abs(β_r), choose α = β_l; otherwise use α = β_r. In each simulation step, this method of selecting the next movement direction tries to keep the agent on a straight trajectory by minimizing the change of orientation. This step is similar to the corridor selection of the VFH methods [9,10,11].

(4) Limit the desired movement direction α to the range ±ρ (light gray shaded area). The result α is used as the change of orientation for the next simulation step. After rotation, the agent moves straight ahead (for d = 2.1 cm).

Steps (1) to (4) of the algorithm are repeated until the agent's distance to the goal is smaller than a certain threshold (in our experiments 6 cm). In case the agent does not reach the goal within a maximum number of steps, in case it hits an obstacle, or in case the simulated trajectory extends beyond the limits of the arena, the trial is counted as unsuccessful.
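A simplified Python sketch of one decision step of the 1d model, following steps (1) to (4) above, is given below. The handling of step (3) is our own reading (the free direction requiring the smallest change of orientation is chosen), and the parameter values (δ = 6°, ρ = 9.25°, 2.1 cm per step) are those reported in Sections 2.2 and 3.2; this is an illustration, not the authors' implementation.

import numpy as np

DELTA = 6.0      # obstacle enlargement (deg), best-fitting value from Sec. 3.2
RHO = 9.25       # maximum turning rate per timestep (deg)
STEP = 2.1       # distance travelled per timestep (cm)

def decide_turn(view, goal_dir):
    """Turning angle (deg) for one timestep.

    view     : binary 360-element 1d view, index i = direction (i - 180) degrees
               relative to the current heading (1 = obstacle).
    goal_dir : goal direction relative to the current heading, in degrees.
    """
    v = np.asarray(view, dtype=bool)
    angles = np.arange(360) - 180.0
    # (1) enlarge obstacles by DELTA on both sides
    k = int(round(DELTA))
    blocked = v.copy()
    for shift in range(-k, k + 1):
        blocked |= np.roll(v, shift)
    # (2) if the goal direction is free, head for the goal
    goal_idx = (int(round(goal_dir)) + 180) % 360
    if not blocked[goal_idx]:
        alpha = goal_dir
    else:
        # (3) otherwise turn towards the border of the enlarged obstacle region
        # that requires the smallest change of orientation
        free = np.where(~blocked)[0]
        if free.size == 0:
            return 0.0                      # fully surrounded: no decision possible
        alpha = angles[free[np.argmin(np.abs(angles[free]))]]
    # (4) limit the turn to the maximum turning rate
    return float(np.clip(alpha, -RHO, RHO))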
3.2 Model Evaluation
Parameter optimization. As the goal of our model is to optimally reproduce the data obtained from the behavioral experiments, the enlargement parameter δ was systematically varied in the range δ ∈ {0°, 1°, 2°, . . . , 20°}. For each of the 20 configurations (c ∈ {1, 2, . . . , 20}), 70 trajectories were simulated (starting at 7 release positions with 10 different initial orientations; positions and orientations were equally and symmetrically distributed). As for the rats' trajectories, we analyzed for each simulated trajectory whether the agent passes the obstacles on
the left side, the right side, or through the middle. Depending on the configuration c and the enlargement δ, a vector

h_sim(c, δ) = ( h_L(c, δ), h_M(c, δ), h_R(c, δ) )   (1)

of relative frequencies for passing on the left side, on the right side, or through the middle is obtained. In order to determine the optimal value of δ, the following dissimilarity measure was minimized:

E(δ) = Σ_{c=1}^{20} SSD( h_sim(c, δ), h_rat(c) ).   (2)

The measure computes the sum of squared differences (SSD) between the vectors of relative frequencies h_sim and h_rat for the simulation and the rats' data, respectively. The best fit (Fig. 6, right) was obtained for δ = 6° with an optimization residual of E = 0.989. The resulting trajectories are shown in Figs. 2, 3, and 7; the configurations depicted in Fig. 7 were solely used for adjusting the model parameters.
Fig. 7. Further configurations used only for the parameter optimization. Rats’ chosen trajectories are displayed in the upper row, the predictions of the 1d-model (see Part 2) are displayed below. The gray shaded horizontal bars depict the animals (upper) or the models (lower) behavior with respect to passing the obstacles on the left (black) or the right (light gray) side, or passing through the gap (middle gray).
Correlation between simulation and behavioral data. To assess how well the model fits the behavioral data, we correlated the relative frequencies h_rat(c) and h_sim(c) (for 1 ≤ c ≤ 20). Of the 9 possible combinations, the correlations r_L,L = 0.919, r_M,M = 0.947, and r_R,R = 0.935 are most relevant for our purposes; the mixed combinations all exhibit negative correlations around −0.45. However, this analysis does not distinguish between configurations with and without a gap: the correlation r_M,M is influenced because h_M = 0 is assumed for all configurations without a gap. To overcome this drawback, we separately
correlated the relative frequencies for the configurations with and without a gap. For the first class (configurations 13 to 20) we obtained correlations r_L,L = 0.816, r_M,M = 0.943, and r_R,R = 0.969; for the second class (configurations 1 to 12) we obtained r_L,L = r_R,R = 0.934 (as for these configurations h_R equals 1 − h_L, the correlations r_L,L and r_R,R are identical).

Trap situations. In order to test our model with obstacle configurations other than those tested in the behavioral experiments, we performed tests (Fig. 8) with cluttered environments (configurations 1, 2), a U-shaped obstacle configuration (3), and a configuration in which the agent is completely surrounded by obstacles (4). For each test run, the agents were initialized with identical start positions; the initial orientation was varied in steps of 15°. After initialization, the simulated agent was moved forward for one step; afterwards, the model was applied to predict the agent's path. The results for configurations 1 and 2 show that almost every simulated trajectory reaches the goal position. Some of the paths are not as short as possible because the model also tries to avoid distant obstacles which do not directly block the agent's way towards the goal. In the cases where the agent hits an obstacle, it could not turn fast enough to avoid it due to the limited turning rate. Our 1d model is also able to reach the goal in test configuration 3. This is a test situation in which many obstacle avoidance methods relying on depth information (e.g., potential field approaches) fail due to local minima [5]. Our model fails for condition 4: in this case, no movement decision can be derived because the agent is completely surrounded by obstacles (resulting in a completely black 1d view).
Fig. 8. Trap situations for the 1d model
3.3 Discussion
Our model is capable of producing smooth trajectories that reach the goal position without crashing into the obstacles. Since our model does not contain any noise, the simulated trajectories look much smoother than the rats' trajectories. Comparing the analysis of whether the agent passed on the left side, on the right side, or through the gap with the corresponding behavioral data reveals that the model covers several aspects we outlined in Sec. 2.2. These aspects are discussed in more detail in the following paragraphs; afterwards, we outline the limitations of the 1d model.
Distance minimization. For the asymmetric configurations (Fig. 2), 78.93% of the simulated trajectories pass the obstacles such that the length of the resulting path is minimal (rats' trajectories: 76.53%). Non-optimal paths are due to the model's tendency to predict straight trajectories: since in every time step the change of orientation is kept as small as possible, the agent sometimes passes the obstacles on the side which results in a longer path. This effect is also visible for configurations 19 and 20: there, the shortest path would be to pass through the gap. However, this path would require the agent to turn more than passing on the left or the right of the obstacles.

Gap size. For the behavioral data we observed that the rats more frequently pass through larger gaps. Comparing configurations 13 to 16 (Figure 3) reveals that all simulated trajectories pass through the gap if the gap is large (compared to 83.76% of the rats' trajectories). In case the gap is small, only 15.71% of the simulated trajectories pass through the gap (rats: 36.20%).

Symmetric configurations. Our model did not reproduce the left–right preferences we observed for the symmetric configurations (Fig. 3). As we initialized the simulated agent with symmetrically and equally distributed release positions and orientations, we could not expect to reproduce the rats' preferences. It is left for future work to initialize the model with positions and directions which better reproduce the rats' trajectories.

Model limitations. Although the model is capable of reproducing the results obtained from the behavioral experiments, the comparison between the simulated and the rats' trajectories reveals several aspects which are due to the lack of depth information in our model: (1) the model seems to react earlier to obstacles than the rats, (2) the simulated trajectories pass closer to obstacles than the rats' trajectories, and (3) our model cannot solve trap configuration 4, which could certainly be solved by rats. The latter aspect is due to neglecting the agent's dynamics in our simulation.

(1) Reaction to obstacles. Many simulated trajectories (e.g., for configurations 10, 11, and 12) are first curved, then run straight until they pass the obstacle, turn again, and finally the agent travels along a straight line towards the goal position. In contrast, many rats run straight towards the goal and only later start to avoid the obstacles. They avoid the obstacles on a curved path and also often approach the goal on a curved path. This behavior suggests that the rats try to approach the goal and only start reacting to obstacles when they are at a certain distance from them. Since our model does not incorporate any information about the distance to the obstacles, it tries to avoid obstacles independent of the agent's current distance to them.

(2) Distance while passing obstacles. Comparing the model's and the rats' trajectories also reveals that the simulated agent passes by closer to obstacles than
the rats. This can also be explained by the lack of depth information: independent of the distance to the obstacle, the obstacle is enlarged by δ. If the agent is far away from the obstacle, the enlargement δ is large compared to the size of the obstacle as imaged on the agent's retina. If the agent is close to the obstacle, this enlargement is small compared to the imaged size of the obstacle. For this reason, the agent passes very close to obstacles. For larger δ, gaps between obstacles would be closed by the obstacle growing; in these cases, the agent could no longer pass through gaps. These model properties could be avoided by introducing an enlargement mechanism which depends on the distance to the obstacle.
3.4 Outlook: Models Incorporating Depth Information
In order to overcome the drawbacks of the 1d model outlined in the previous section, we are currently working on extensions of the 1d model which also incorporate depth information. Two of these extensions are briefly described in this section. Both models have in common that the elevation under which obstacles are imaged is used as a depth cue and that the 2d input image is reduced to a 1-dimensional representation of the environment. Again, our models are purely reactive, do not need a map-like representation, and their complexity depends only on the resolution of the input images.

1.5d model. Except for the input image, our 1.5d model is identical to the 1d model described above. In contrast to the 1d model, the 1.5d model uses a view taken out of the 2d image at a constant elevation greater than zero (Figure 5, right). Hence, it only takes obstacles up to a certain distance from the agent's current position into account. Monocular vision as a source of depth information has received attention in the context of robot soccer. A method called "visual sonar" [16,14] searches along radial scan lines in the camera image. If an obstacle is encountered along a scan line, its distance can be computed. This information can then be used for further navigation capabilities such as obstacle avoidance, path planning, or mapping. Like the proposed 1.5d model, the "visual sonar" relies on elevation as a cue for depth information [17]. This depth cue can also be used by frogs and humans [18,19].

Fig. 9 visualizes the results obtained for testing the 1.5d model with the trap situations described in Sec. 3.2. For the experiments, horizontal views taken at an elevation of 30° were used. For the cluttered environments (configurations 1 and 2), the model predicts paths which are shorter than the paths predicted by the 1d model. Since the 1.5d model does not consider distant obstacles, there are situations in which the 1.5d model approaches the goal, whereas the 1d model avoids distant obstacles. Hence, the 1.5d model is able to predict shorter paths. For test configuration 3, our model suffers from the same problems as many reactive obstacle avoidance methods incorporating depth information: due to local minima, the simulated agents head towards the goal; when the obstacle in front of the agent comes into sight, the agent starts to avoid it, but then can no longer reach the goal position. Related work tries to solve this problem by map-building and look-ahead path-planning algorithms [11].
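To illustrate this depth cue, the following minimal sketch recovers distance from the elevation (declination below the horizon) at which an obstacle's ground contact is imaged, assuming a flat ground plane; the eye-height value is an arbitrary assumption, not a parameter of the model.

import numpy as np

def distance_from_elevation(elevation_deg, eye_height=0.05):
    # For a viewpoint at height h above a flat ground plane, an obstacle base
    # imaged at declination e below the horizon lies at distance d = h / tan(e).
    # The eye height is an assumed value.
    return eye_height / np.tan(np.deg2rad(elevation_deg))

With the assumed eye height, views taken at an elevation of 30° would correspond to obstacles at roughly 0.09 units distance; larger elevations correspond to nearer obstacles.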
Fig. 9. Trap situations for the 1.5d model
Fig. 10. Left: sketch of the 2d model. Right: trap situations for the 2d model.
Since the model incorporates depth information, it can solve test condition 4, at least if the initial orientation points towards the open side of the U-shaped obstacle. Due to the restricted movement parameters, the model cannot turn fast enough and hits the obstacles for other initial orientations.

2d model. The 2d model (Fig. 10) we are currently working on uses a 2d view of the environment as shown in Fig. 5. For a set of n horizontal directions of sight ϕi (1 ≤ i ≤ n), the distance di towards the visible obstacle is computed based on the elevation under which the obstacle is imaged. By this step, the 2d image information is reduced to a 1d depth profile. At each direction of sight ϕi, a periodic and unimodal curve (comparable to the von Mises distribution) is placed. The curve's height is weighted by the inverse of di. By summing over all the von Mises curves, a repelling profile is computed. Goal attraction is modeled by an attracting profile with a minimum at the goal direction. Both profiles are summed, and a minimization process searches for the profile's minimum in the range ±ρ around the agent's heading direction. The direction of the minimum, α, is used as the movement direction. The polar obstacle representation is recomputed in each iteration and not updated from step to step.
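A minimal sketch of this movement decision is given below; the concentration κ of the von Mises curves, the range ±ρ, and the exact shape of the attracting profile are illustrative assumptions, not necessarily the values used in the model described above.

import numpy as np

def choose_direction(phi, d, goal_dir, heading, kappa=4.0, rho=np.deg2rad(60)):
    # phi:      viewing directions (radians) of the 1d depth profile
    # d:        obstacle distances d_i along those directions
    # goal_dir: direction towards the goal (radians)
    # heading:  current heading (radians)
    theta = np.linspace(-np.pi, np.pi, 360, endpoint=False)    # candidate directions
    # repelling profile: one von-Mises-shaped bump per viewing direction, weighted by 1/d_i
    repel = np.zeros_like(theta)
    for phi_i, d_i in zip(phi, d):
        repel += (1.0 / d_i) * np.exp(kappa * np.cos(theta - phi_i))
    # attracting profile with its minimum at the goal direction
    attract = -np.cos(theta - goal_dir)
    combined = repel + attract
    # restrict the search to +/- rho around the current heading
    diff = np.angle(np.exp(1j * (theta - heading)))            # wrapped angular difference
    candidates = np.where(np.abs(diff) <= rho, combined, np.inf)
    return theta[np.argmin(candidates)]                        # movement direction alpha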
Fig. 10 also visualizes the model's trajectories obtained for trap configurations 3 and 4. Although the agent gets trapped in configuration 3, many more test trials than for the 1.5d model successfully reach the goal. Since the trajectories were simulated with a relatively large σ, objects are passed at a comparatively large distance. We are currently working on improving the distance weighting as well as the interplay between the repelling and attracting profiles. By these means, we expect our 2d model to perform better than the other models.
4 Conclusion
In this work we presented an exploratory study examining obstacle avoidance and path selection behavior in rats, together with a minimalistic visually guided model that could account for the empirical data. The particular appeal of the model is its simplicity: it neither requires a map-like representation of the goal and obstacles nor does it incorporate depth information. These results demonstrate that reliable obstacle avoidance can be achieved with only two basic building blocks: (1) the ability to approach the goal and (2) the ability to detect whether the course towards the goal is blocked by an obstacle and to avoid that obstacle. While the proposed basic 1d model is capable of reproducing the results of the behavioral experiment described in Sec. 2, a detailed comparison of the simulated trajectories with the empirical data suggests that the rats probably used depth information. This can be concluded from the fact that rats seem to react to obstacles only when they are at a certain distance from them and that rats passed by obstacles at a comparatively large distance. Neither of these aspects can be reproduced by our 1d model. In order to explain these findings, we have presented first ideas (the 1.5d and 2d models) of how depth information can be integrated into our model in a sparse and biologically inspired fashion.
References
1. Fajen, B., Warren, W.: Behavioral dynamics of steering, obstacle avoidance, and route selection. Journal of Experimental Psychology: Human Perception and Performance 29(2), 343–362 (2003)
2. Fajen, B., Warren, W., Temizer, S., Kaelbling, L.P.: A dynamical model of visually-guided steering, obstacle avoidance, and route selection. International Journal of Computer Vision 54(1–3), 13–34 (2003)
3. Choset, H., Lynch, K., Hutchinson, S., Kantor, G., Burgard, W., Kavraki, L., Thrun, S.: Principles of Robot Motion. MIT Press, Cambridge (2005)
4. Siegwart, R., Nourbakhsh, I.: Introduction to Autonomous Mobile Robots. MIT Press, Cambridge (2004)
5. Koren, Y., Borenstein, J.: Potential field methods and their inherent limitations for mobile robot navigation. In: Proceedings of the IEEE Conference on Robotics and Automation, pp. 1398–1404 (1991)
6. Huang, W., Fajen, B., Fink, J., Warren, W.: Visual navigation and obstacle avoidance using a steering potential function. Robotics and Autonomous Systems 54(4), 288–299 (2006)
7. Arbib, M., House, D.: Depth and Detours: An Essay on Visually Guided Behavior. In: Vision, Brain, and Cooperative Computations, pp. 129–163. MIT Press, Cambridge (1987)
8. Schöner, G., Dose, M., Engels, C.: Dynamics of behavior: Theory and applications for autonomous robot architectures. Robotics and Autonomous Systems 16(2–4), 213–245 (1995)
9. Borenstein, J., Koren, Y.: The vector field histogram – fast obstacle avoidance for mobile robots. IEEE Journal of Robotics and Automation 7(3), 278–288 (1991)
10. Ulrich, I., Borenstein, J.: VFH+: Reliable obstacle avoidance for fast mobile robots. In: Proceedings of the IEEE Conference on Robotics and Automation (1998)
11. Ulrich, I., Borenstein, J.: VFH*: Local obstacle avoidance with look-ahead verification. In: Proceedings of the International Conference on Robotics and Automation (2000)
12. Gallistel, C.R.: The Organisation of Learning. MIT Press, Bradford Books, Cambridge (1990)
13. Chen, Z., Pears, N., Liang, B.: Monocular obstacle detection using reciprocal-polar rectification. Image and Vision Computing 24(12), 1301–1312 (2006)
14. Lenser, S., Veloso, M.: Visual sonar: Fast obstacle avoidance using monocular vision. In: Proceedings of the IEEE Conference on Intelligent Robots and Systems, pp. 886–891 (2003)
15. Simoncelli, E.P., Olshausen, B.A.: Natural image statistics and neural representation. Annual Review of Neuroscience 24, 1193–1216 (2001)
16. Horswill, I.D.: Visual collision avoidance by segmentation. In: Proceedings of the IEEE Conference on Robotics and Autonomous Systems, pp. 901–909 (1994)
17. Hoffmann, J., Jüngel, M., Lötzsch, M.: A vision based system for goal-directed obstacle avoidance. In: Nardi, D., Riedmiller, M., Sammut, C., Santos-Victor, J. (eds.) RoboCup 2004. LNCS (LNAI), vol. 3276, pp. 418–425. Springer, Heidelberg (2005)
18. Collett, T.S., Udin, S.B.: Frogs use retinal elevation as a cue to distance. Journal of Comparative Physiology A 163(5), 677–683 (1988)
19. Ooi, T.L., Wu, B., He, Z.J.: Distance determined by the angular declination below the horizon. Nature 414(6860), 197–200 (2001)
Route Learning Strategies in a Virtual Cluttered Environment

Rebecca Hurlebaus¹, Kai Basten¹, Hanspeter A. Mallot¹, and Jan M. Wiener²

¹ Cognitive Neuroscience, University of Tübingen, Auf der Morgenstelle 28, D-72076 Tübingen, Germany
² Center for Cognitive Science, University of Freiburg, Friedrichstr. 50, D-79089 Freiburg, Germany

Abstract. Here we present an experiment investigating human route learning behavior. Specific interest concerned the learning strategies as well as the underlying spatial knowledge. In the experiment, naive participants were asked to learn a path between two locations in a complex, cluttered virtual environment that featured local and global landmark information. Participants were trained for several days until they solved the wayfinding task quickly and efficiently. The analysis of individual navigation behavior demonstrates strong interindividual differences suggesting different route learning strategies: while some participants were very conservative in their route choices, always selecting the same route, other participants showed a high variability in their route choices. In the subsequent test phase we systematically varied the availability of local and global landmark information to gain first insights into the spatial knowledge underlying these different behaviors. Participants showing high variability in route choices strongly depended on global landmark information. Moreover, participants who were conservative in their route choices were able to reproduce the basic form of the learned routes even without any local landmark information, suggesting that their route memory contained metric information. The results of this study suggest two alternative strategies for solving route learning and wayfinding tasks that are reflected in the spatial knowledge acquired during learning.

Keywords: spatial cognition, route learning, navigation.
1 Introduction
Finding the way between two locations is an essential and frequent wayfinding task for both animals and humans. Typical examples include the way from the nest to a feeding site or the route between your home and the office. While several navigation studies, both in real and virtual environments, investigated the form and content of route knowledge (e.g., [1,2,3]), empirical studies investigating the route learning process itself are rather limited (but see [4,5]). A very influential theoretical framework of spatial knowledge acquisition proposes three stages when learning a novel environment [6]. First, landmark knowledge, i.e., knowledge about objects or views that allows places to be identified, is acquired.
In the second stage, landmarks are combined to form route knowledge. With increasing experience in the environment, survey knowledge (i.e., knowledge about distances and directions between landmarks) emerges. According to this model, the mental representation of a route can be conceived as a chain of landmarks or places with associated movement directives (e.g., turn right at the red house, turn left at the street lights). This landmark-to-route-to-survey-knowledge theory of spatial learning has not remained unchallenged: recent findings, for example, demonstrate that repeated exposure to a route did not necessarily result in improved metric knowledge between landmarks encountered on the route [5]. Most participants either had accurate knowledge from the first exposure or they never acquired it. Furthermore, results from route learning experiments in virtual reality suggest two spatial learning processes that act in parallel rather than sequentially [4]: (1) a visually dominated strategy for the recognition of routes (i.e., chains of places with associated movement directives) and (2) a spatially dominated strategy integrating places into a survey map. The latter strategy requires no prior connection of places to routes. Support for parallel rather than sequential learning processes also comes from experiments with rats: depending on the exact training and reinforcement procedure, rats can be trained to approach positions that are defined by the configuration of extramaze cues (cf. the spatially dominated strategy), to follow local visual beacons (cf. the visually dominated strategy), or to execute motor responses (e.g., turn right at the intersection; [7,8]). Evidence for a functional distinction of spatial memories also comes from experiments demonstrating that participants who learned a route by navigation performed better on route perspective tasks, while participants who learned a route from a map performed better on tasks analysing survey knowledge [9]. In any case, route knowledge is usually described as a chain of stimulus-response pairs [10,11], in which the recognition of a place stimulates or triggers a response (i.e., a direction of motion). Places along a route can be recognized by objects but also by views or scenes [12]. Evidence for this concept of route memory mostly comes from experiments in mazes, buildings, or urban environments, in which decision points were well defined (e.g., [1,13,3]). Furthermore, distinct objects (i.e., unique landmarks) are usually presented at decision points. Route learning in open environments, in contrast, has received little attention in humans, but has been convincingly demonstrated in ants [2]. The desert ant Melophorus bagoti is a singly foraging ant and its environment is characterized by cluttered, distributed small grass tussocks. The ants establish idiosyncratic routes while shuttling back and forth between a feeder and their nest. Each individual ant follows a constant route for inbound runs (feeder to nest) and outbound runs (nest to feeder). Usually both routes differ from each other and show a high directionality [14]. In contrast, wood ants can learn bi-directional routes when briefly reversing direction and tracing their path for a short distance [15]. For both ant species, view-dependent learning is essential for route learning in open cluttered environments [16]. View-dependent representations [17] and view-dependent recognition of places have also been demonstrated in humans and have been shown to be relevant for navigation [12].
Most studies investigating route knowledge in humans were conducted in urban environments, in which the number of route alternatives between locations as well as the possible movement decisions at street junctions are rather limited. How do humans behave when faced with a route learning task in open environments lacking road networks, predefined places, and unique objects or landmarks? Are they able to learn their way between two locations in such environments? And if so, what are the underlying route learning processes?
1.1 Synopsis and Predictions
In the following we present a route learning experiment in an open cluttered environment characterized by prismatic objects differing in contour but neither in height nor in texture. The environment did not contain any predefined places, road networks, or unique landmarks. Distal cues were present in the form of four large colored columns and the background texture. Participants' task was to explore the environment and to shuttle between two target locations repeatedly. We monitored participants' navigation and route learning behavior during an extensive training phase. Subsequently, we tested the influence of proximal and distal spatial information on participants' navigational ability. We expected that over an extended period of training, participants would be able to solve the general experimental task (i.e., to navigate quickly and efficiently between the home and the feeder position). It was, however, an open question whether participants established fixed routes (as ants do when faced with such a task in a similar environment [2]) or whether they learned global directions and distances between the relevant locations. The latter alternative would allow solving the task without explicit route knowledge but requires spatial knowledge that is best described as survey knowledge. In contrast to route knowledge, survey knowledge allows for more flexible navigation behavior when shuttling between distant locations. Consequently, one might expect a higher variability of (similarly efficient) route choices between navigations. Moreover, it is possible that different participants adopted or weighted these alternative spatial learning strategies differently. In the test phase, we systematically varied the availability of local and global cues to study which spatial information was relevant for solving the wayfinding task. If participants established fixed routes, they were expected to strongly depend on local (i.e., proximal) spatial information to guide their movements. Hence, if that information was removed in a no-local-objects test, their navigation performance was expected to decrease dramatically. If, on the other hand, participants relied on global directions and distal information to solve the task, we expected their navigation performance to drop when such information was removed by adding fog to the environment.
2 Material and Methods

2.1 Participants
Twenty-one students of the University of Tübingen participated in this study (10 females). The average age was 24 years (range 19–28).
Fig. 1. (a) The virtual environment from the perspective of the participant at the home position. The sphere was visible only in close proximity. The text specified the current task (here: "Search for the feeder!"). A distal landmark (a large column) is visible in the background; (b) A map of the environment: the positions of home and feeder are marked by asterisks; the crossed circles indicate the positions of the colored columns (for illustration, the columns were plotted closer to the center of the environment); (c) The no-local-objects condition; (d) The fog condition.
No participant had prior knowledge of the virtual environment or the experimental hypotheses at the time of testing. They were paid 8 € per hour. One participant (female) had to be excluded because of motion sickness during the first experimental trials.
2.2 Virtual Environment
The virtual environment was generated using the Virtual Environments Library (VeLib, http://velib.kyb.mpg.de/, March 2008). It consisted of a ground plane cluttered with objects of equal height and texture, which differed only in the shape of their ground plate. The background texture consisted of a cloudy sky and separate flat hills. To provide distinct global landmark information, four large columns of different colour (red, blue, green, yellow) were positioned on the four sides of the environment, at a distance of 80 units from the center. The directions of these global landmarks are shown in the environment map in Figure 1b, but they are plotted closer to the obstacles for the sake of clarity. Two additional objects, a red sphere and a blue sphere, that marked the relevant locations (referred to as home and feeder in the following) were placed in the environment with a distance of ~5.5 units between them. These were so-called pop-up objects that were visible only at close proximity (< 0.4 units). An experimental session always started at the blue sphere, which was referred to as the home location. In analogy to experiments with ants (see Introduction), the red sphere was referred to as the feeder. Figure 1a displays the participants' view of the virtual environment.
2.3 Experimental Setup
The virtual environment was presented on a standard 19" computer monitor. Participants were seated in front of the monitor on an office chair at a distance of approximately 80 cm. Using a standard joypad (Logitech RumblePad 2) they steered through the virtual environment. Translation and rotation velocity could be adjusted separately with the two analog controls. Maximum translation velocity was 0.4 units per second; maximum rotation velocity was 26° per second. All participants were instructed how to use the joypad and had the chance to familiarize themselves with the setup.
2.4 Procedure
General Experimental Task and Procedure. The general experimental task was to repeatedly navigate between two target locations, the home (blue sphere) and the feeder (red sphere). During navigation, the target (home or feeder) for the current run was indicated by a text message (e.g., "Go home!"). As soon as the participant moved over the current target (e.g., the blue sphere indicating the home location), the respective text message changed (e.g., "Search for the feeder!"). Runs from home to feeder are referred to as outbound runs; runs from the feeder to home are referred to as inbound runs. Experimental sessions always started at the home position. As participants were naive with respect to the environment, the experiment had an extensive training phase prior to the test phase.

Training-Phase. The training phase consisted of several sessions, during which participants were instructed to repeatedly navigate between home and feeder. At the beginning of each session participants were positioned at the home location. Pilot experiments demonstrated that the experimental task was very difficult in the first run; participants were therefore provided with coarse directional information about the direction from home to the feeder at the beginning of the first session ("seen from home, the feeder is situated between the red and the green distal column"). Participants underwent two training sessions per day with a 5 min break between them. A single training session was completed with the first visit of the home after 20 min. In case participants showed a performance of > 2 runs per
minute (for inbound and outbound runs) after 5 training sessions, they advanced to the test phase. The maximal number of training sessions was 9.

Test-Phase. In the test phase participants were confronted with 2 wayfinding tasks that are reported here and that were always conducted in the same order.

1: No-Local-Objects Condition. Directly after the last training session, participants conducted the no-local-objects condition. They were positioned at the home location and were asked to navigate to the feeder. Upon reaching it, all local objects (i.e., all local landmark information) disappeared, while the global landmarks (i.e., the distal colored columns and the background texture) remained (see Figure 1c). Participants' task was to navigate back to the home location. Once they were convinced they had reached the position of their home, they pressed a button on the joypad. This procedure was repeated three times. After the no-local-objects test, participants were given a 5 min break. If they had already gone through 2 training sessions that day, they conducted the fog test (see below) on the next day.

2: Fog Condition. The fog condition was identical to a training session, but the visibility conditions were altered by introducing fog into the virtual environment: the visibility of the environment decreased with increasing distance. Beyond a distance of 2.0 units, only fog but no other information was perceptible. By these means, global landmarks as well as structures such as view axes or corridors arising from obstacle constellations were eliminated from the visual scene. The fog also covered the ground plane such that it provided no optic flow during navigation. In this modified environment participants had to rely only on local objects in their close proximity to find their way between home and feeder. All participants had 20 minutes to complete as many runs as possible. After that time the fog test stopped with the first visit back at home. Unfortunately, data of three participants had to be excluded from the final data set due to technical problems with the software.
2.5 Data Analysis
During the experiment, participants' positions were recorded with a sampling frequency of 5 Hz, allowing detailed trajectories to be reproduced. In the following we describe the different dependent variables extracted from these trajectories:

1. Performance. Navigation performance is measured in runs per minute. A single run is defined as a navigation from home to the feeder or vice versa. For each experimental session we calculated the average number of runs per minute.

2. Route similarity. The route similarity measure describes how conservative or variable participants were with respect to their route choices. High values (≈ 1) demonstrate that participants repeatedly chose the same route; low values correspond to a high variability in route choices. To calculate the route similarity, we used a two-step method (see the sketch after this list): (1) the raw trail data was
reduced to sequences of numbers; (2) the similarity of these sequences was compared. For the first step, the environment was tessellated into triangles: we reduced each local obstacle to its center point and applied a Delaunay triangulation to this set of points. A unique number was assigned to each resulting triangle. Every run could then be expressed as a sequence of numbers corresponding to the triangles crossed. To compare these sequences, we used an algorithm originally developed to compare genetic sequences [18]. In each case two single sequences are compared. The basic principle is to find the number of matches and relate that to the total length of the sequences (for details see [19]). A complete match results in a value of 1.0. For each participant these comparisons were done separately for all outbound and inbound routes. In the following we present the mean similarity of all comparisons of runs performed in one session.

3. Change in performance. In order to describe navigation performance during the fog test, participants' performance (runs per minute) in the fog condition was divided by their performance in the last training session. Equal performance then results in a value of 1.0, increased navigation performance results in values > 1.0, and decreased performance results in values < 1.0.

4. No-local-objects test. For the homing task in the no-local-objects test, the following variables were evaluated: (a) Homing error: distance between the participant's endpoint and the actual home position. (b) Distance error: air-line distance from the start (feeder location) to the participant's chosen endpoint compared with the air-line distance from the start (feeder location) to the actual home position. (c) Angular error: angle between the beeline of the participant's homing response and the correct direction towards the home location.
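A minimal Python sketch of this two-step route similarity procedure is given below, assuming obstacle center points and sampled trajectories as input; the alignment scoring parameters are illustrative assumptions and not the exact values of the procedure described in [18,19].

import numpy as np
from scipy.spatial import Delaunay

def triangle_sequence(trajectory, obstacle_centers):
    # Step 1: tessellate the environment over the obstacle center points and map
    # the trajectory (N x 2 positions) to the sequence of triangles it crosses.
    tri = Delaunay(obstacle_centers)
    ids = tri.find_simplex(trajectory)             # triangle index per sample (-1 = outside)
    seq = []
    for t in ids:
        if t != -1 and (not seq or t != seq[-1]):  # keep only changes of triangle
            seq.append(int(t))
    return seq

def sequence_similarity(a, b, match=1, mismatch=-1, gap=-1):
    # Step 2: global alignment (Needleman-Wunsch style) of two triangle sequences,
    # normalized by sequence length; the scoring values are assumptions.
    n, m = len(a), len(b)
    score = np.zeros((n + 1, m + 1))
    score[:, 0] = gap * np.arange(n + 1)
    score[0, :] = gap * np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1, j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score[i, j] = max(diag, score[i - 1, j] + gap, score[i, j - 1] + gap)
    return max(score[n, m], 0.0) / max(n, m, 1)    # 1.0 for identical sequences

The mean route similarity of a session would then be the average of sequence_similarity over all pairwise comparisons of the outbound (or inbound) runs of that session.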
3 Results

3.1 Training Phase
Route Learning Performance. 19 out of the total of 20 participants were able to solve the task: they learned to efficiently navigate between the home and the feeder. One participant was removed from the final data set, as he did not reach the learning criterion. This participant also reported being clueless about the positions of the home and the feeder. For the remaining participants, the time to reach the learning criterion differed: four participants reached it after 5 training sessions, 6 participants after 6 sessions, 2 after 7 sessions, 2 after 8 sessions, and 5 participants needed 9 training sessions (6.9 sessions on average). The increase in navigation performance was highly significant for both inbound and outbound runs (see Figure 2; paired t-test, first vs. last training session, inbound: t(18) = 14.26, p < .001; outbound: t(18) = 10.76, p < .001). Figure 3 shows examples of how route knowledge evolved with increasing training sessions for two participants.
Fig. 2. During the training phase participants' performance increased in number of runs per minute with increasing number of sessions for both outbound runs (♦) and inbound runs (∗). Mean values of all participants ± standard error.
At the end of the training phase, all remaining 19 participants solved the task of navigating between home and the feeder reliably and efficiently. For this result and all other results, we did not find any significant gender differences. Since the groups are small (10 female and 11 male participants), small differences, if present, are not ascertainable.

Outbound Runs and Inbound Runs. Participants showed better navigation performance (runs/min) on inbound runs as compared to outbound runs (Wilcoxon signed rank test: p < .01, see Figure 2). In other words, participants found the way from feeder to home faster than the way from home to feeder. It appears that this difference increases with increasing number of sessions. Note, however, that some participants reached the learning criterion already after 5 sessions and proceeded to the test phase. The number of participants therefore decreases in later sessions, which explains the increasing variations in later sessions and could account for the saturation effect.

Constant and Variable Routes. Analysing the chosen trajectories in the last training session in detail reveals remarkable inter-individual differences. While some participants were very conservative in their route choices (see right column in Figure 3), other participants showed a large variability in their choices (see left column in Figure 3). The calculated mean route similarity ranged from .19 in the case of very variable routes to 1.0 for constant routes (mean = .67, std = .24). Figure 4 displays the route similarity values for all participants, revealing a continuum rather than distinct groups. Navigation performance (runs/min) in the last training session was significantly correlated with route similarity: with higher route similarity, navigation performance increased (r = .47, p < .05). Neither navigation speed during the last training session nor the number of sessions needed to reach the performance criterion to enter the test phase correlated significantly with the route similarity values of the last training session (correlation navigation speed and route similarity: r = −.01, p = .97; correlation number of sessions and route similarity: r = −.34, p = .16).
Fig. 3. Evolving route knowledge of two participants. Left column: variable routes with a similarity of 0.55 (mean of outbound and inbound runs of the last session; compare to Fig. 4 and see text); right column: constant route with a similarity of 1.0 (mean of outbound and inbound runs of the last session; compare to Fig. 4 and see text). The lower left corner of each panel gives the measuring unit, the participant, and the session number.
Fig. 4. Route similarity values of all participants in their final session
3.2 No-Local-Objects Condition
In the no-local-objects condition, all obstacles disappeared after the participant reached the feeder. By moving to the estimated position of the home and pressing a button, participants marked the location where they assumed the position of their home.
Fig. 5. Navigation without local landmarks: Participants’ homing error as a function of their route similarity of the last training session
On average, participants produced a homing error of 1.57 units (std = 1.00), an angular error of 16.38 degrees (std = 12.17), and a mean distance error of 0.97 units (std = 0.76). Together, these results suggest that, in principle, participants could solve the homing task. None of the measures of homing performance correlated significantly with the route similarity measure (homing error and route similarity: r = −.13, p = .61, see Fig. 5; angular error and route similarity: r = −.22, p = .36; distance error and route similarity: r = −.42, p = .07). Apparently, participants' performance in solving the homing task was independent of whether or not they had established fixed routes during the training phase. Nevertheless, a closer look at the homing trajectories themselves suggests that participants differed in the strategies they applied to solve the homing task. Figure 6 provides a few examples of homing trajectories. Participants with low route similarity values (i.e., participants showing a high variability in their choices) show more or less straight inbound routes when homing. Participants with high route similarity values (i.e., participants that established fixed routes during training) generate trajectories that are typically curved, not linear. Moreover, their trajectories were often similar to their habitual routes: the shape of the routes they established during training was roughly reproduced, even if the translational or rotational metric did not fit exactly (see Figure 6). In some cases, established routes were close to the beeline between feeder and home. In such cases it cannot be distinguished whether the established route was reproduced or another strategy was used. The same is obviously true for participants showing a high variability in their route choices. While it is not clear how such data could be quantitatively analyzed, Figure 6 demonstrates that some participants with high route similarity values reproduced the form of their habitual routes.
3.3 Fog Condition
In this part of the experiment participants were able to see only obstacles in close proximity; spatial information at larger distances was masked by fog (Fig. 1d). Individual performance (runs per minute) during the fog test was compared with the performance of the last training session (expressed as change in performance). As expected, most participants showed a performance decrease in the fog test (inbound: 14 of 16, outbound: 13 of 16). More interestingly, we found significant correlations between the change in performance and the route similarity in the last training session: participants with low route similarity values show stronger (negative) changes in performance as compared to participants with higher route similarity values (see Figure 7). These correlations were significant for both inbound runs (r = .50, p < .02) and outbound runs (r = .75, p < .001).
4 Discussion
In this work we presented a navigation experiment investigating human route learning behavior in a complex cluttered virtual environment.
Fig. 6. Four examples of behavior in the last training session (left column) and homing behavior in the no-local-objects test (right column). The two top rows show results from a participant with low route similarity values, the lower two rows show examples from a participant with high route similarity values.
In contrast to most earlier studies on route learning and route knowledge (e.g., [3,20,21]), the current environment did not feature a road network with predefined places, junctions (decision points), and unique local landmarks.
Fig. 7. Participants had to navigate in a foggy VR environment, so only objects in close proximity were visible. Shown is the change in performance (runs per minute) of all participants in fog as a function of the similarity of the routes established in the last training session.
The environment was made up of many similarly shaped objects with identical texture and height that were non-uniformly distributed about a large open space. In addition to these local objects, four distal unique landmarks provided global references. Specific interest concerned the question of whether navigators were able to learn their way between two locations in such an environment. Furthermore, we were interested in whether all participants used similar or identical route learning strategies (for example, do navigators establish fixed routes, or do they rather learn the global layout of the environment and choose between different, similarly efficient routes?). In the experiment, participants were trained for several sessions until they were able to efficiently navigate between two locations, the home and the target. After reaching a learning criterion they entered the test phase, during which the availability of spatial information was systematically varied to investigate which spatial information (local or global) participants used to solve the task. All but one participant reached the learning criterion after a maximum of 9 training sessions. Navigation performance (measured as runs per minute) clearly increased with the number of training sessions (see Figure 2). This demonstrates that participants were able to efficiently and reliably navigate in complex cluttered environments lacking predefined places, road networks, and the local landmark information that is usually provided by unique objects (e.g., a large red house) at decision points or road crossings. Comparisons of navigation performance over
the entire training phase revealed differences between outbound runs (home to target) and inbound runs (target to home): specifically, participants found their way faster on inbound runs. This could be explained by the specific significance of the home location, which may result from the fact that each training session started at the home/nest. In central place foragers, like the desert ants, the importance of the nest and its close surroundings is well documented [22]. An alternative explanation for this effect is that the local surroundings of nest and feeder were different (i.e., the spatial distribution of the surrounding obstacles): the nest, for example, was positioned in a larger open space, surrounded by fewer objects, as compared to the feeder. By these means, the nest position might have been recognized from larger distances, resulting in increased performance. Further experiments will have to show whether semantic or spatial (configurational) effects were responsible for the described effect.

The most important result of the training phase is that participants greatly differed with respect to their route choices: using a novel method to compare trajectories (see Section 2.5), we obtained descriptions of the similarity of the traveled paths during the last training session. While some participants were very conservative, selecting the same outbound path and the same inbound path on most runs, others showed a high variability, navigating along many different paths (for examples, see Figure 3). Participants' route similarity values of the last training session were correlated with their navigation performance during that session: participants that established fixed routes during training showed better navigation performance than participants that showed higher variability in their route choices.

How can these inter-individual differences in route similarity and navigation performance be interpreted? Did different participants employ different navigation or learning strategies, relying on different spatial information? Results from the test phase, in which the availability of different types of spatial information was systematically manipulated, allowed for first answers. In the fog condition (see Figure 1d) only obstacles in close proximity were visible. By these means, global spatial information was erased (i.e., distal global landmarks and the spatial information emerging from lined-up obstacles such as visual gateways or corridors). We observed correlations between participants' route similarity values and their performance in the fog condition. Specifically, individuals showing a high variability in route choices showed a clear reduction of navigation performance during the fog condition as compared to the last training session. Individuals with a low variability in route choices, on the other hand, were largely unaffected by the fog. These results suggest that participants with variable route choice behavior strongly relied on distal or global spatial information, while participants exhibiting conservative route choice behavior rather relied on proximal spatial information, as provided by the close-by obstacles or obstacle configurations. A straightforward assumption is that the latter group learned local views (obstacle configurations) and corresponding movement decisions (cf. [23]) during the training phase that were also available during
the fog condition. In other words, route knowledge for these participants would be best described as a sequence of recognition triggered responses [1,3]. If, in fact, participants exhibiting conservative route choice behavior relied on recognition triggered responses, and participants showing variable route choice behavior primarily relied on distal, global spatial information or knowledge, the following behavior would be predicted for the no-local-objects condition: if all local obstacles disappear after reaching the feeder and only the distal global landmarks remain, returning to the home should be impossible for participants relying on recognition triggered responses only. Participants relying on global information, on the other hand, should be able to solve the task. In contrast to these predictions, the results demonstrate that all participants were able to solve the task with a certain accuracy (see Figures 5 and 6). Furthermore, virtually no correlation (r = −.13) was found between participants' route similarities in the last training session and their homing performance in the no-local-objects condition. This disproves the explanations given above: apparently, participants showing conservative route choice behavior did not solely rely on stored views and remembered movement decisions (i.e., recognition triggered responses), but had additional spatial knowledge allowing them to solve the homing task. A detailed inspection of their homing trajectories revealed that some participants reproduced the overall form of their habitual routes from the last training session (see Figure 6). There are two ways of achieving such behavior: (1) participants learned a motor program during training that was replayed during the no-local-objects condition, or (2) they possessed a metric representation of the established routes. While this experiment does not allow distinguishing between these alternatives, informal interviews with participants after the experiment support the latter explanation.

Taken together, we have shown that participants could learn to efficiently navigate between two locations in a complex cluttered virtual environment lacking predefined places, decision points, and road networks. In such unstructured environments a route is best described as a sequence of places defined by views or object configurations [3], rather than as a sequence of places defined by unique single objects. Analyzing participants' navigation behavior, we could show strong interindividual differences that could be related to different navigation or orientation strategies taking different kinds of spatial information into account. Specifically, participants showing a high variability in their route choices depended on distal spatial information, suggesting that they learned global directions and distances between relevant locations. Participants who established fixed routes instead relied on proximal obstacles to guide their movements. However, even if such local spatial information was not available, some were able to reproduce the overall form of their preferred paths. Apparently they learned more than reflex-like recognition triggered responses during training, presumably generating a metric representation of their preferred paths. These results are not in line with the dominant landmark-to-route-to-survey-knowledge framework of spatial knowledge acquisition [6], which states that survey knowledge does not emerge until route knowledge is established. Apparently some participants were able to
learn about distances and directions in the environment without first establishing route knowledge (cf. [5]). The fact that participants' route similarities in their last training session did not fall into two distinct clusters but constituted a continuum furthermore suggests that the two learning strategies sketched above are not exclusive but complementary, existing in parallel (cf. [4]), and that different participants weighted them differently. It is highly likely that these weights are adapted during the course of learning. Further research is needed to answer questions arising from this exploratory study. For example, what triggers the usage of which strategy? How are the strategies related to each other? And how is metric information entangled with the strategies applied?

Acknowledgement. The work described in this paper was supported by the European Commission (FP6-2003-NEST-PATH Project "Wayfinding").
References
1. Janzen, G., van Turennout, M.: Selective neural representation of objects relevant for navigation. Nature Neuroscience 7(6), 572–574 (2004)
2. Kohler, M., Wehner, R.: Idiosyncratic route-based memories in desert ants, Melophorus bagoti: How do they interact with path-integration vectors? Neurobiol. Learn. Mem. 83, 1–12 (2005)
3. Mallot, H., Gillner, S.: Route navigation without place recognition: What is recognized in recognition triggered responses? Perception 29, 43–55 (2000)
4. Aginsky, V., Harris, C., Rensink, R., Beusmans, J.: Two strategies for learning a route in a driving simulator. Journal of Environmental Psychology 17, 317–331 (1997)
5. Ishikawa, T., Montello, D.: Spatial knowledge acquisition from direct experience in the environment: Individual differences in the development of metric knowledge and the integration of separately learned places. Cognitive Psychology 52, 93–129 (2006)
6. Siegel, A., White, S.: The development of spatial representations of large-scale environments. Advances in Child Development and Behavior 10, 9–55 (1975)
7. Restle, F.: Discrimination cues in mazes: A resolution of the 'place-vs-response' question. Psychological Review 64(4), 217–228 (1957)
8. Leonard, B., McNaughton, B.: Spatial representation in the rat: conceptual, behavioural and neurophysiological perspectives. In: Kessner, R., Olton, D.S. (eds.) Comparative Cognition and Neuroscience: Neurobiology of Comparative Cognition. Hillsdale, New Jersey (1990)
9. Taylor, H., Naylor, S., Chechile, N.: Goal-specific influences on the representation of spatial perspective. Memory and Cognition 27, 309–319 (1999)
10. Trullier, O., Wiener, S., Berthoz, A., Meyer, J.A.: Biologically based artificial navigation systems: review and prospects. Progress in Neurobiology 51(5), 483–544 (1997)
11. Kuipers, B.: The spatial semantic hierarchy. Artificial Intelligence 119, 191–233 (2000)
12. Gillner, S., Mallot, H.: Navigation and acquisition of spatial knowledge in a virtual maze. Journal of Cognitive Neuroscience 10, 445–463 (1998)
13. Hölscher, C., Meilinger, T., Vrachliotis, G., Brösamle, M., Knauff, M.: Up the down staircase: Wayfinding strategies and multi-level buildings. Journal of Environmental Psychology 26(4), 284–299 (2006)
14. Wehner, R., Boyer, M., Loertscher, F., Sommer, S., Menzi, U.: Ant navigation: One-way routes rather than maps. Current Biology 16, 75–79 (2006)
15. Graham, P., Collett, T.: Bi-directional route learning in wood ants. Journal of Experimental Biology 209, 3677–3684 (2006)
16. Judd, S., Collett, T.S.: Multiple stored views and landmark guidance in ants. Nature 392, 710–714 (1998)
17. Diwadkar, V., McNamara, T.: Viewpoint dependence in scene recognition. Psychological Science 8, 302–307 (1997)
18. Needleman, S., Wunsch, C.: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J. Mol. Biol. 48, 443–453 (1970)
19. Basten, K., Mallot, H.: Building blocks for trail analysis (in preparation, 2008)
20. Gaunet, F., Vidal, M., Kemeny, A., Berthoz, A.: Active, passive and snapshot exploration in a virtual environment: influence on scene memory, reorientation and path memory. Cognitive Brain Research 11(3), 409–420 (2001)
21. Munzer, S., Zimmer, H., Schwalm, M., Baus, J., Aslan, I.: Computer-assisted navigation and the acquisition of route and survey knowledge. Journal of Environmental Psychology 26(4), 300–308 (2006)
22. Bisch-Knaden, S., Wehner, R.: Landmark memories are more robust when acquired at the nest site than en route: experiments in desert ants. Naturwissenschaften 90, 127–130 (2003)
23. Christou, C.G., Bülthoff, H.H.: View dependence in scene recognition after active learning. Memory and Cognition 27(6), 996–1007 (1999)
Learning with Virtual Verbal Displays: Effects of Interface Fidelity on Cognitive Map Development

Nicholas A. Giudice¹ and Jerome D. Tietz²,*

¹ Department of Spatial Information Science and Engineering, 348 Boardman Hall, University of Maine, Orono, ME 04469
[email protected]
² Department of Psychology, University of California, Santa Barbara, Santa Barbara, CA 93106-9660
Abstract. We investigate verbal learning and cognitive map development of simulated layouts using a non-visual interface called a virtual verbal display (VVD). Previous studies have questioned the efficacy of VVDs in supporting cognitive mapping (Giudice, Bakdash, Legge, & Roy, in revision). Two factors of interface fidelity are investigated which could account for this deficit: spatial language vs. spatialized audio, and physical vs. imagined rotation. During training, participants used the VVD (Experiments 1 and 2) or a visual display (Experiment 3) to explore unfamiliar computer-based layouts and seek out target locations. At test, participants performed a wayfinding task between targets in the corresponding real environment. Results demonstrated that only spatialized audio in the VVD improved wayfinding behavior, yielding performance almost identical to that found in the visual condition. These findings suggest that learning with both modalities led to comparable cognitive maps and demonstrate the importance of incorporating spatial cues in verbal displays.

Keywords: wayfinding, verbal learning, spatialized audio, interface fidelity.
1 Introduction

Most research investigating verbal spatial learning has focused on comprehension of route directions or the mental representations developed from reading spatial texts [1-4]. Owing to this research emphasis, much less is known about the efficacy of verbal information to support real-time spatial learning and navigation. What distinguishes a real-time auditory display from other forms of spatial verbal information is the notion of dynamic updating. In a dynamically-updated auditory display, the presentation of information about a person's position and orientation in the environment changes in register with physical movement.
The authors thank Jack Loomis for insightful comments on the manuscript, Maryann Betty for experimental preparation, Brandon Friedman for assistance in running participants, and Masaki Miyanohara for helping with running participants and data analysis. This work was supported by an NRSA grant to the first author, #1F32EY015963-01.
a sequential list of all the distances and turns at the beginning of a route, as is done with traditional verbal directions, a real-time display provides the user with context-sensitive information with respect to their current location/heading state as they progress along the route. Vehicle-based navigation systems utilizing GPS and speech-based route directions represent a good example of these dynamic displays. Dynamic auditory interfaces also have relevance in navigation systems for the blind, and in this context, they have proven extremely effective in supporting real-time route guidance [see 5 for a review].

Rather than addressing route navigation, the current research uses free exploration of computer-simulated training layouts to investigate environmental learning. The training environments are explored using a non-visual interface called a virtual verbal display (VVD). The VVD is based on dynamically-updated geometric descriptions, verbal messages which provide real-time orientation and position information as well as a description of the local layout geometry [see 6 for details]. A sample output string is: "You are facing West, at a 3-way intersection, there are hallways ahead, left, and behind." If a user executed a 90° left rotation at this t-junction, the VVD would return an updated message to reflect that he/she was now facing South, with hallways extending ahead, left, and right. We know that geometric-based displays are extremely effective for supporting free exploration (open search) in both real and computer-based layouts [7-9]. However, their efficacy for supporting cognitive map development is unclear. That is, participants who trained using a virtual verbal display to search computer-based environments performed significantly worse on subsequent wayfinding tests in the corresponding real environment [7, 8] than subjects who trained and tested exclusively in real environments [9]. These findings suggest that training with a virtual verbal display results in impoverished environmental learning and cognitive map development compared to use of the same verbal information for searching real building layouts. This deficit cannot be attributed to environmental transfer more generally, as previous studies have demonstrated that learning in virtual environments (VEs) transfers to accurate real-world navigation, even with perceptually sparse visual displays similar to our geometric verbal display [10-12].

The current studies investigate several factors of interface fidelity which may account for problems in spatial knowledge acquisition with the VVD. As described by Waller and colleagues [13], interface fidelity refers to how the input and output of information from the virtual display is used, i.e. how one's physical actions affect movement in the VE and how well feedback from the system supports normal perceptual-motor couplings. These interactions can be distinguished from factors relating to environment fidelity, which refers to how well the information rendered in the VE resembles the real environment, e.g. sensory richness, spatial detail, surface features, and field of view [13]. Our previous work with VVDs dealt with environment fidelity, investigating whether describing more of the layout from a given vantage point, called "verbal view depth," would facilitate learning of global structure and aid subsequent wayfinding behavior.
However, the lackluster environmental transfer performance with three levels of verbal view depth, ranging from local to global descriptions, demonstrated that deficits in cognitive map development were not due to availability of environmental information but to the interface itself [7, 8]. The current experiments hold environmental variables constant and manipulate several factors relating to interface fidelity. Experiment 1 compares traditional verbal descriptions, where the message is delivered as a monaural signal to both ears, with
spatialized audio descriptions, where the message is heard as coming from a specific direction, e.g. a hallway to the left would be heard as a description emanating from the navigator's left side. Experiment 2 addresses the influence of body-based information, e.g. physical rotation vs. imagined rotation. Experiment 3 follows the same design as the first two verbal studies but uses a visual display as a control. All experiments incorporate training in computer-based layouts and environmental transfer requiring wayfinding in the corresponding real environment. Our focus is on the transfer tests, as they provide the best index of environmental learning and cognitive map development.
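To make the geometric descriptions introduced above concrete, the following is a minimal sketch of how such messages might be generated from a grid-based corridor map; the layout encoding, function names, and exact wording are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of VVD-style geometric descriptions (illustrative only).
# A corridor layout is modeled as a set of grid cells; the display reports
# the user's heading and which hallways extend from the current node.

HEADINGS = ["North", "East", "South", "West"]          # clockwise compass order
OFFSETS = {"North": (0, 1), "East": (1, 0), "South": (0, -1), "West": (-1, 0)}

def relative_dirs(heading):
    """Map absolute directions onto ahead/right/behind/left for a given heading."""
    i = HEADINGS.index(heading)
    return {HEADINGS[i]: "ahead",
            HEADINGS[(i + 1) % 4]: "right",
            HEADINGS[(i + 2) % 4]: "behind",
            HEADINGS[(i + 3) % 4]: "left"}

def describe(cells, pos, heading):
    """Return a verbal message modeled on the sample string in the text, e.g.
    'You are facing West, at a 3-way intersection, there are hallways ahead, left, behind.'"""
    rel = relative_dirs(heading)
    open_dirs = [rel[d] for d, (dx, dy) in OFFSETS.items()
                 if (pos[0] + dx, pos[1] + dy) in cells]
    open_dirs.sort(key=["ahead", "left", "right", "behind"].index)
    return (f"You are facing {heading}, at a {len(open_dirs)}-way intersection, "
            f"there are hallways {', '.join(open_dirs)}.")

def rotate(heading, turn):
    """Apply a quantized 90-degree rotation ('left' or 'right')."""
    step = 1 if turn == "right" else -1
    return HEADINGS[(HEADINGS.index(heading) + step) % 4]

# Example: a T-junction with corridors to the west, south, and east of (0, 0).
cells = {(0, 0), (-1, 0), (0, -1), (1, 0)}
print(describe(cells, (0, 0), "West"))                    # hallways ahead, left, behind
print(describe(cells, (0, 0), rotate("West", "left")))    # now facing South: ahead, left, right
```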
2 Experiment 1 In this study, blindfolded participants are given a training period where they use verbal descriptions to freely explore unfamiliar computer-based floors of university buildings and seek out four target locations. At test, they must find routes between target pairs in the corresponding real environment. This design is well-suited for addressing environmental learning, as theories of cognitive map development have long emphasized the importance of free exploration and repeated environmental exposure [14, 15]. The wayfinding test represents a good measure of cognitive map accuracy, as performance cannot be accomplished using a route matching strategy. Since no routes are specified during training, accurate wayfinding behavior requires subjects to form a globally coherent representation of the environment, i.e. the trademark of a cognitive map [16]. Our previous work with virtual verbal displays was based exclusively on spatial language (SL), i.e. consistent, unambiguous terminology for describing spatial relations [17]. The problem with any purely linguistic display is that the information provided is symbolic. A description of a door at 3 o’clock in 10 feet has no intrinsic spatial content and requires cognitive mediation to interpret the message. By contrast, a spatialized audio (SA) display is perceptual, directly conveying spatial information about the environment by coupling user movement with the distance and direction of object locations in 3-D space. For instance, rather than describing the location of the door, the person simply hears its name as coming from that location in the environment. Several lines of research support the benefit of spatialized auditory displays. Experiments comparing different non-visual displays with a GPS-based navigation system for the blind have shown that performance on traversing novel routes, finding landmarks, and reaching a goal state is superior when guided with spatialized audio versus spatial language [18-20]. Research has also shown that spatialized auditory displays are beneficial as a navigation aid during real-time flight [21] and for providing non-visual information to pilots in the cockpit of flight simulators [22]. It is predicted that spatialized audio displays will have similar benefits on cognitive map development, especially when training occurs in computer-based environments as are used here. Spatial updating and environmental learning is known to be more cognitively effortful in VEs than in real spaces, [23, 24]. However, recent work suggests that SA is less affected by cognitive load than SL during guidance of virtual routes, yielding faster and more accurate performance in the presence of a concurrent distractor task [25]. These findings indicate that the use of SA in the VVD may reduce the working memory demands associated with virtual navigation, thus increasing resources available for cognitive map development.
To address this issue, Experiment 1 compared environmental learning with virtual verbal displays based on Spatial language descriptions about layout geometry [8] with identical descriptions that added spatial information to the signal, e.g. a hallway on the left would be heard in the left ear. 2.1 Method Participants. Fourteen blindfolded-sighted participants, ages 18-32 (mean = 20.6), balanced equally by gender, ran in the two hour study. Subjects in all experiments were unfamiliar with the test environments, reported normal (or corrected to normal) visual and auditory acuity, gave informed consent, and received course credit for their participation. Environments and Apparatus. Two simulated floors of the UC Santa Barbara Psychology building, and their physical analogs, were used. The computer-based layouts were rendered to be perceptually sparse, with the verbal messages providing information about the user’s facing direction and layout geometry only. The simulated layouts were broken into corridor segments separated by nodes (each segment approximated ten feet in the real space). The floors averaged 100.5 m of hallway extent and 8.5 intersections. Each floor contained four targets which must be found during the search. The name of each target was spoken whenever subjects reached its location in the layout (Ss were told the names of the four targets before starting the trial). Figure 1 shows an illustration of an experimental environment and sample verbal descriptions.
Fig. 1. Experimental layout with target locations denoted. What is heard upon entering an intersection (listed above and below the layout) is depicted in gray. Each arrow represents the orientation of a user at this location.
Participants navigated the virtual environments using the arrow keys on a USB numberpad. Pushing the up arrow (8) translated the user forward and the left (4) and right (6) arrows rotated them in place left or right respectively. Forward movements were made in discrete "steps," with each forward key press virtually translating the navigator ahead one corridor segment (approximately ten feet in the real environment). Left-right rotations were quantized to 90 degrees. Pressing the numpad 5 key repeated the last verbal message spoken and the 0 key served as a "shut-up" function by truncating the active verbal message. Verbal descriptions, based on a female voice, were generated automatically upon reaching an intersection or target location and rotation at any point returned an updated heading, e.g. “facing north”. A footstep sound was played for every forward move when navigating between hallway junctions. Movement transitions took approximately 750 ms. The Vizard 3-D rendering application (www.worldviz.com) was used to coordinate the verbal messages, present a visual map of what was being heard for experimenter monitoring, and to log participant search trajectories for subsequent analyses. 2.2 Design and Procedure A within subjects design was used with participants running in one spatial language and spatialized audio condition, counterbalanced across the two experimental environments. The experiment comprised three phases. During practice, the movement behavior was demonstrated and participants were familiarized with the speech output from the VVD on a visual map depicting what would be spoken for each type of intersection. Training Phase. To start the trial, blindfolded participants stood in the center of a one meter radius circle with four three inch RadioShack speakers mounted on tripods (at a height of 152 cm) placed on the circumference at azimuths of 0° (ahead), 90° (right), 180° (behind) and 270° (left). In the SL conditions, the verbal message was simultaneously presented from the left and right speaker only. With the SA conditions, the participant heard the verbal message as coming from any of the four speakers based on the direction of the hallway being described. The spatialized audio messages were generated by sending the signal from the speaker outputs on the computer’s sound card (Creative Labs Audigy2 Platinum) to a four-channel multiplexer which routed the audio to the relevant speaker. The input device was affixed via Velcro to an 88 cm stand positioned directly in front of them. Subjects were started from an origin position in the layout, designated as "start" and instructed to freely explore the environment using the verbal descriptions to apprehend the space and the input device to affect movement. Their task for the training period was to cover the entire layout during their search and to seek out four hidden target locations. Although no explicit instructions were given about search strategy or specific routes, they were encouraged to try to learn the global configuration of the layout and to be able to navigate a route from any target to any other target. The training period continued until the number of forward moves in their search trajectory equaled three times the number of segments comprising the environment. Participants were alerted when 50 % and 75 % of their moves were exhausted.
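As a rough illustration of the two display modes described in the Training Phase, the sketch below routes each hallway message either to the speaker at that hallway's azimuth (spatialized audio) or to the left and right speakers together (spatial language); the channel handling and function names are assumptions rather than the actual apparatus code.

```python
# Illustrative sketch of routing VVD messages to the four training speakers.
# Speaker channels correspond to the azimuths described in the Method:
# 0 deg = ahead, 90 deg = right, 180 deg = behind, 270 deg = left.
SPEAKER_FOR_DIRECTION = {"ahead": 0, "right": 90, "behind": 180, "left": 270}

def route_message(text, direction=None, mode="SA", play=print):
    """Send a verbal message to the appropriate speaker(s).

    mode 'SA': spatialized audio, the message for a hallway is played only
    from the speaker in that hallway's direction.
    mode 'SL': spatial language, the message is played from the left and
    right speakers simultaneously (a monaural-style presentation).
    `play` stands in for whatever call drives the multiplexer (assumed).
    """
    if mode == "SL" or direction is None:
        for azimuth in (270, 90):              # left and right speakers together
            play(f"[speaker {azimuth} deg] {text}")
    else:
        azimuth = SPEAKER_FOR_DIRECTION[direction]
        play(f"[speaker {azimuth} deg] {text}")

# Example: describing a T-junction in the spatialized-audio condition.
for direction in ("ahead", "left", "behind"):
    route_message(f"hallway {direction}", direction, mode="SA")
```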
Testing Phase. Upon completion of the training period, participants performed the transfer tests. Blindfolded, they were led via a circuitous route to the corresponding physical floor and started at one of the target locations. After removing the blindfold, participants were told they were now facing north, standing at target X and requested to walk the shortest route to target Y. They performed this wayfinding task using vision, no verbal descriptions about the environment or target locations were given. Participants indicated that they had reached the destination by speaking the target’s name (e.g., “I have reached target dog”). To reduce accumulation of error between trials, they were brought to the actual target location for incorrectly localized targets before proceeding. Participants found routes between four target pairs, the order of which were counterbalanced. Analysis. Although our focus was on transfer performance, three measures of search behavior were also analyzed from the training phase in all experiments: 1. Floor coverage percent: the number of unique segments traversed during training divided by the total number of segments in the environment. 2. Unique targets percent: ratio of unique targets encountered during training to the total number of target locations (4). 3. Shortest routes traversed: sum of all direct routes taken between target locations during the search period. A shortest route equals the route between target locations with the minimum number of intervening segments. Two wayfinding test measures were analyzed for all studies during the transfer phase in the real building: 1. Target localization accuracy percent: ratio of target locations correctly found at test to the total number of target localization trials (four). 2. Route efficiency: length of the shortest route between target locations divided by length of the route traveled (only calculated for correct target localization trials). 2.3 Results and Discussion As predicted, training performance using both VVD display modes revealed accurate open search behavior. Collapsing across SL and SA conditions, participants covered 97.3% of the segments comprising each floor, found 97.3% of the target locations and traveled an average of 9.9 shortest routes between targets. By comparison, the theoretical maximum number of shortest routes traveled during the training period, given 100% floor coverage with the same number of moves, is 14.5 (averaged across floors). Results from the inferential tests provide statistical support for the near identical performance observed between inputs; none of the one-way repeated measures ANOVAs conducted for each training measure revealed reliable differences between SL and SA conditions, all ps > .1. Indeed, performance on the training measures was almost identical for all conditions across experiments (see table 1 for comparison of all means and standard errors). These findings indicate that irrespective of training condition, subjects adopted a broadly distributed, near optimal route-finding search strategy.
Table 1. Training Measures of Experiments 1-3 by Condition. Each cell represents the mean (± SEM) on three measures of search performance for participants in experiments 1-3. No significant differences were observed between any of the dependent measures.

Experiment   Condition                       Floor Coverage (%)   Unique Targets Hit (%)   Total Shortest Routes Traversed
1 (N=14)     Spatialized Audio               98.46 (1.35)         98.21 (1.79)             10.71 (0.98)
1 (N=14)     Spatial Language                96.14 (2.98)         96.43 (3.57)             9.07 (1.22)
2 (N=16)     Rotation + Spatialized Audio    98.99 (0.85)         100.00 (0)               12.06 (0.99)
2 (N=16)     Rotation + Spatial Language     99.14 (0.50)         98.44 (1.56)             10.94 (1.11)
3 (N=14)     Visual Control                  97.34 (2.46)         96.43 (2.43)             11.07 (1.29)
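The training and transfer measures defined in the Analysis section are straightforward to compute from logged trajectories; here is a minimal sketch under the assumption that a trajectory is stored as an ordered list of segment identifiers (the data format and names are illustrative).

```python
# Sketch of the training and transfer measures defined above, computed from a
# logged search trajectory. Counting shortest routes between targets would also
# require the layout graph and is omitted here.

def floor_coverage(trajectory, all_segments):
    """Unique segments traversed / total segments in the environment (%)."""
    return 100.0 * len(set(trajectory)) / len(all_segments)

def unique_targets(trajectory, targets):
    """Unique target locations encountered / total targets (%)."""
    return 100.0 * len(set(trajectory) & set(targets)) / len(targets)

def target_localization_accuracy(correct_trials, total_trials=4):
    """Targets correctly found at test / total localization trials (%)."""
    return 100.0 * correct_trials / total_trials

def route_efficiency(shortest_length, traveled_length):
    """Shortest-route length / traveled length (%), correct trials only."""
    return 100.0 * shortest_length / traveled_length

# Example with a toy floor of 10 segments and 4 targets.
segments = list(range(10))
targets = [2, 5, 7, 9]
trajectory = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 8, 7]
print(floor_coverage(trajectory, segments))                     # 100.0
print(unique_targets(trajectory, targets))                      # 100.0
print(route_efficiency(shortest_length=4, traveled_length=5))   # 80.0
```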
Environmental learning/cognitive map development was assessed using a wayfinding test in the physical building. To address the effect of spatialization, one-way repeated measures ANOVAs comparing spatial language and spatialized audio were conducted for the two transfer test measures of target localization accuracy and route efficiency. Participants who trained using spatialized audio in the VVD correctly localized significantly more targets, 76.8% (SE = 6.12) than those who learned with spatial language, 51.8% (SE = 10.63), F(1,13) = 7.583, p = .016, η2 = 0.39. Target localization accuracy in both conditions, average 64.3%, was significantly above chance performance of ~3%, defined as one divided by 33 possible target locations, e.g. a target can be located at any of the 33 segments comprising the average environment, t(27) = 8.94, p < 0.001. Route efficiency for correctly localized targets did not reliably differ between conditions, SA, 96.9% (SE = 1.6) and SL, 95.9% (SE = 2.6), ps >.1. Note that route efficiency is calculated for correctly executed routes only. Thus, the near ceiling performance simply means that the routes that were known were followed optimally. It is likely that this measure would be more sensitive to detecting differences between conditions on floors having greater topological complexity. Experiment 1 investigated the effect of spatialization on environmental learning. Results from the spatial language condition almost perfectly replicated a previous experiment using the same SL condition and near identical design [8]. Both studies showed that training with the VVD led to efficient search behavior but poor wayfinding performance in the real environment, 51.8% target localization accuracy in the current study vs. 51.3% accuracy found previously. These findings support our hypothesis that limitations arising from use of the VVD are not due to problems performing effective searches but to deficits in building up accurate spatial representations in memory. The SA condition, serving as a perceptual interface providing direct access to spatial relations, was tested here because we believed that it would aid cognitive map development. The confirmatory results were dramatic. Participants who trained in computer-based layouts using spatialized audio demonstrated 50% better wayfinding performance in the corresponding real building than when they trained with spatial language. The 76.8% target localization accuracy in the SA condition was also on par
with target localization performance of 80% observed in a previous study after verbal learning in real buildings [9]. This similarity is important as it shows that the same level of spatial knowledge acquisition is possible between learning in real and virtual environments. Our results are consistent with the advantage of spatialized auditory displays vs. spatial language found for route guidance [18-20, 25] and extend the efficacy of spatialized audio displays for supporting cognitive mapping and wayfinding behavior.
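The comparison against chance reported above can be reproduced with a one-sample t-test against the ~3% chance level; the per-participant scores in this sketch are hypothetical placeholders, since only summary statistics appear in the text.

```python
# Sketch of the chance-level comparison: target localization accuracy tested
# against chance of 1/33 (a target could lie at any of the ~33 segments).
# The per-participant accuracy scores below are hypothetical placeholders.
from scipy import stats

chance = 1 / 33                      # ~3% chance performance
accuracy_scores = [0.75, 0.50, 1.00, 0.75, 0.50, 0.75, 0.25, 0.75,
                   0.50, 0.75, 1.00, 0.50, 0.75, 0.50]  # hypothetical data

t_stat, p_value = stats.ttest_1samp(accuracy_scores, popmean=chance)
print(f"t({len(accuracy_scores) - 1}) = {t_stat:.2f}, p = {p_value:.4f}")
```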
3 Experiment 2 Experiment 2 was designed to assess the contribution of physical body movement during virtual verbal learning on cognitive map development. Navigation with our virtual verbal display, as with most desktop virtual environment technologies, lacks the idiothetic information which is available during physical navigation, i.e. body-based movement cues such as proprioceptive, vestibular, and biomechanical feedback. VEs incorporating these cues have greater interface fidelity as the sensorimotor contingencies are more analogous to real-world movement [26]. Various spatial behaviors requiring access to an accurate cognitive map show improved performance when idiothetic information is included. For instance, physical rotation during VE learning vs. imagined rotation benefits tasks requiring pointing to previously learned targets [27, 28], estimation of unseen target distances [29] and updating self orientation between multiple target locations [30]. Path integration is also better in VEs providing proprioceptive and visual information specifying rotation compared to visual information in isolation [31]. The inclusion of idiothetic information has also led to improved performance on cognitive mapping tasks similar to the current experiment, where VE learning is tested during transfer to real-world navigation [32, 33].

Where the previous work has addressed the role of body-based cues with visual displays, Experiment 2 investigates whether similar benefits for verbal learning manifest when physical body rotation is included in the VVD. As with Experiment 1, participants use the VVD to explore computer-based training environments and then perform wayfinding tests in the corresponding real environment. However, rather than using arrow keys to affect imagined rotations and translations during training, participants physically turn in place whenever they wish to execute a change of heading. Translations are still done via the keypad as the benefit of physical translation on VE learning is generally considered nominal. This is consistent with studies in real environments showing that pointing to target locations is faster and more accurate after actual than imagined rotations, whereas errors and latencies tend not to differ between real and imagined translations [34].

We predict that inclusion of idiothetic information in the VVD will yield marked improvements in spatial knowledge acquisition and cognitive map development. In addition to the previous evidence supporting body-based cues, we believe the conversion of linguistic operators into a spatial form in memory is a cognitively effortful process, facilitated by physical movement. Evidence from several studies supports this movement hypothesis. Avraamides and colleagues (2004, Experiment 3) showed that mental updating of allocentric target locations learned via spatial language was
impaired until the observer was allowed to physically move before making their judgments, presumably inducing the spatial representation. Updating object locations learned from a text description is also improved when the reader is allowed to physically rotate to the perspective described by the text [35], with egocentric direction judgments made faster and more accurately after physical, rather than imagined rotation [36]. To test our prediction, this experiment adds real rotation to the spatialized audio and spatial language conditions of Experiment 1. If the inclusion of rotational information is critical for supporting environmental learning from verbal descriptions, wayfinding performance during real-world transfer should be better after training with both physical rotation conditions of the current experiment than was observed in the analogous conditions with imagined rotation of Experiment 1. Furthermore, assuming some level of complementarity between rotation and spatialization, the rotation+spatialized audio (R+SA) condition is predicted to show superior performance to the rotation+spatial language (R+SL) condition.

3.1 Method Sixteen blindfolded-sighted participants, nine female and seven male, ages 18-24 (mean = 19.6) ran in the two-hour study. Experiment 2 employs the same spatial language and spatialized audio conditions as Experiment 1 and adopts the same within Ss design using two counterbalanced conditions, each including a practice, training, and transfer phase. The only difference from Experiment 1 is that during the training phase, participants used real body rotation in the VVD instead of imagined rotation via the arrow keys. Since all intersections were right-angled, left and right rotations always required turning 90° in place. An automatically-updated heading description was generated when their facing direction was oriented with the orthogonal corridor. They could then either continue translating by means of the keypad or request an updated description of intersection geometry. Heading changes were tracked using a three degree-of-freedom (DOF) inertial orientation tracker called 3D-Bird (Ascension Corporation: http://www.ascension-tech.com/products/3dbird.php).

3.2 Results and Discussion To address the effect of rotation on environmental learning and wayfinding performance, one-way repeated measures ANOVAs were conducted for the transfer tests of target localization accuracy and route efficiency. Results indicated a significant difference for target localization only, with the 78.1% (SE = 5.03) accuracy of the rotation+spatialized audio condition found to be reliably better than the 57.8% (SE = 9.33) accuracy of the rotation+spatial language condition, F(1,15) = 4.601, p<.05, η2 = 0.24. Performance on route efficiency did not differ between conditions, R+SA = 98.5% and R+SL = 100%. As discussed in Experiment 1, the results of this measure are far less informative about cognitive map development than the target localization performance. Since we were interested in evaluating whether the physical rotation conditions were better than the same conditions using imagined rotation of Experiment 1, we performed a two-way between subjects ANOVA
comparing target accuracy performance between experiments by spatialized and non-spatialized conditions. This between Ss comparison is appropriate as the subject groups in both experiments were similar in age, sex, educational background and spatial ability (as assessed by the Santa Barbara Sense of Direction Scale, SBSOD). As can be seen in Figure 2, results showed a main effect of target accuracy by spatialization, F(1,28) = 11.753, p<.05, η2 = 0.3, but the more meaningful experiment by spatialization interaction was not significant, F(1,28) = .126, p > .1, η2 = 0.004. Likewise, a one-way ANOVA comparing target localization accuracy collapsed across condition between experiments, thereby directly addressing the influence of rotation factoring out spatialization, was not significant, p > .1.

The results of Experiment 2 paint a clear, yet surprising picture. The addition of physical rotation in the VVD was predicted to significantly benefit spatial knowledge acquisition and cognitive map development, as "real" movement was thought to be particularly important in converting the symbolic verbal messages into a spatial form in memory. While there was a difference in transfer performance between conditions in this experiment, comparison of the data to analogous conditions of Experiment 1 confirms that this difference was driven by the presence of spatialized audio descriptions, not physical rotation. Subjects in the SL condition of Experiment 1 found routes between targets in the real building with 51.8% accuracy. The 57.8% accuracy of the R+SL condition of Experiment 2, which is identical to that condition except for the addition of real vs. imagined body turning during training, represents a small, nonsignificant performance improvement, p > .1. Likewise, the absence of reliable differences between the SA condition of Experiment 1 and the same condition with rotation in Experiment 2 (76.8% vs. 78.1% correct target accuracy respectively) demonstrates that the addition of physical rotation did not benefit environmental learning.
Fig. 2. Comparison of mean target localization accuracy (± SEM) between Experiments 1 and 2. Note: Both experiments compared SL and SA conditions but Experiment 1 used imagined rotation and Experiment 2 (gray bars) used body rotation.
The finding that idiothetic information did not benefit transfer performance was unexpected given previous literature showing that physical body movement during and after verbal learning significantly improves latency and error performance at test [35-37]. Differences in task demands likely contribute to these findings. In the previous studies, subjects learned a series of target locations from text or speech descriptions and then were tested using a pointing-based spatial updating task. The increased weighting of physical movement demonstrated in those studies may be less important with the free exploration paradigm and transfer tests used here, as these tasks do not force updating of Euclidean relations between targets. Thus, the addition of a pointing task between target locations may have shown greater benefit of physical rotation than was evident from our wayfinding task. This needs to be addressed in future experiments as it cannot be resolved from the current data.
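The physical-rotation interface of Experiment 2 hinges on detecting when the tracked facing direction lines up with an orthogonal corridor; a minimal sketch of that check follows, with the alignment tolerance and function names as assumptions not taken from the paper.

```python
# Sketch of heading handling for the physical-rotation condition (Experiment 2).
# Continuous headings from the inertial orientation tracker are snapped to the
# nearest cardinal direction; an updated verbal description would be triggered
# only once the user is aligned with a corridor. The 10-degree tolerance is an
# assumption, not a value reported in the paper.

CARDINALS = {0: "North", 90: "East", 180: "South", 270: "West"}
TOLERANCE = 10.0  # degrees

def aligned_heading(tracker_degrees):
    """Return the cardinal name if the tracked heading is within tolerance
    of a 90-degree corridor direction, otherwise None."""
    snapped = round(tracker_degrees / 90.0) * 90 % 360
    error = abs((tracker_degrees - snapped + 180) % 360 - 180)
    return CARDINALS[snapped] if error <= TOLERANCE else None

# Example: successive tracker readings as a user turns left in place.
for reading in (92.0, 47.0, 8.5):
    heading = aligned_heading(reading)
    if heading is not None:
        print(f"facing {heading}")   # would trigger an updated VVD message
```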
4 Experiment 3 Experiment 3 followed the same design as the previous two studies, but subjects learned the computer-based training environments from a visual display rather than a verbal display. The main goal of Experiment 3 was to provide a good comparison benchmark with the previous two verbal experiments. Specifically, we wanted to investigate whether learning with verbal and visual displays leads to comparable environmental transfer performance, findings which would provide proof of efficacy of the VVD. Our previous experiments, using an almost identical design to the current studies, found that wayfinding performance during environmental transfer was significantly worse after learning from a virtual verbal display than from a visual display [8, Experiment 3, 10]. However, those studies only compared visual learning with a spatial language condition, analogous to that used in Experiment 1. By contrast, the significantly improved transfer performance of the spatialized audio conditions is on par with our previous findings with the visual display. Likewise, the SA conditions in the first two experiments provide perceptual information about the direction of hallways which is better matched with what is apprehended from a visual display. Since the visual display and movement behavior in the previous studies differed slightly from the information and movement of the VVD used here, Experiment 3 was run to serve as a more valid comparison.

4.1 Method Fourteen normally sighted participants, six females and eight males, ages 18-21 (mean = 19.2) ran in the one-hour study. The experimental procedure was identical to the previous studies except that subjects only learned one environment and trained with a visual display instead of the VVD. During training, participants saw the same geometric "views" of the layout on the computer monitor (Gateway VX700, 43.18 cm diagonal) as were previously described with each message from the VVD. The environment was viewed from the center of the monitor and movement was performed via the keypad's arrow keys, as described earlier. Figure 3 shows an example of what would be seen from a 3-way intersection. With each translation, the participant heard the footstep sound and the
Fig. 3. Sample 3-way intersection as seen on a visual display. Information seen from each view is matched to what would be heard in the corresponding message from the VVD.
next corridor segment(s) was displayed with an animated arrow indicating forward movement. With rotations, they saw the viewable segments rotate in place and an updated indication of heading was displayed. In addition, they heard the target names, starting location noise, foot step sound, and percent of training time elapsed via monaural output through the same speakers. This design ensured the visual display was equivalent in information content to what was available in the auditory conditions of experiments 1 and 2. 4.2 Results and Discussion Performance on the transfer tests after visual learning was quite good, resulting in target localization accuracy of 78.6% (SE = 8.6) and route efficiency of 95.6% (SE = 2.4). Given our interest in comparing learning performance between the visual display and the VVD, independent samples t-tests were used to evaluate how wayfinding performance after visual learning compared to the same tests after verbal learning in Experiments 1 and 2. As the presence or absence of spatialized information was the only factor that reliably affected verbal learning performance, the visual learning data was only compared to the combined performance from the spatial language and spatialized audio conditions of the previous experiments, collapsing across imagined and real rotation. Note that these between-subjects comparisons were based on participants drawn from a similar background and who fell within the same range of spatial abilities as measured by the SBSOD scale. As can be seen in Figure 4, the 78.6% (SE = 8.6) target localization performance observed after visual learning was significantly better than the 54.6% (SE = 5.4) performance of the spatial language conditions, t(28) = 2.345, p=.027. By contrast, target localization accuracy in the spatialized audio
conditions, 77.5% (SE = 4.1), was almost identical to performance in the visual condition, t(26) = .116, p=.908. In agreement with the previous studies, route efficiency did not differ significantly between any of the conditions, ps > .1.

Experiment 3 was run to benchmark performance with the VVD against visual learning. Replicating earlier work, transfer performance after learning from a visual display was significantly better than learning with spatial language with a VVD [8, Experiment 3, 10]. However, target localization accuracy in the spatialized audio conditions and the visual condition was nearly identical. This finding suggests that learning with a spatialized audio display and an information-matched visual display builds up a spatial representation in memory which can be acted on in a functionally equivalent manner.
Fig. 4. Comparison of mean target localization accuracy (± SEM) across all experiments. “Spatial language” represents combined data from the two language conditions of Experiments 1 and 2, collapsing across imagined and real rotation. “Spatialized audio” represents the same combined data from the two spatialized conditions of Experiments 1 and 2.
5 General Discussion The primary motivation of these experiments was to investigate verbal learning and cognitive map development using a new type of non-visual interface, called a virtual verbal display. Previous research has demonstrated that VVDs support efficient search behavior of unfamiliar computer-based environments but lead to inferior cognitive map development compared to verbal learning in real environments or learning in visually rendered VEs. The aim of this research was to understand what could account for these differences. Deficits in spatial knowledge acquisition with the VVD
were postulated as stemming from inadequacies of the interface. To address this prediction, two factors influencing interface fidelity, spatialized audio and physical rotation, were compared on a wayfinding task requiring access to an accurate cognitive map. Results showing almost identical performance on the training measures for all conditions across experiments (see Table 1) but widely varying wayfinding accuracy during transfer tests in the real building are informative. Indeed, these findings support the hypothesis that deficits in cognitive map development are related to factors of interface fidelity, rather than use of ineffective search strategies with the VVD.

The most important findings from these studies are the results showing that information about layout geometry conveyed as a spatialized verbal description rather than as spatial language led to a dramatic improvement in cognitive map development. These findings are congruent with previous studies showing an advantage of 3-D spatial displays vs. spatial language during route guidance [18-20, 25]. The current results extend the efficacy of spatialized audio from providing perceptual access to specific landmarks in the surrounding environment for use in route navigation to specifying environmental structure during free exploration in support of cognitive mapping. Of note to the motivations of the current work, wayfinding performance during transfer after learning in the SA conditions in the VVD was on par with performance after learning with an information-matched visual display (Experiment 3) and with verbal learning in real buildings [9]. The similarity of these results suggests that virtual verbal displays incorporating spatialized information can support equivalent spatial knowledge acquisition and cognitive map development. Although comparisons between verbal and visual learning were made between subjects in the current paper, these results are consistent with previous findings demonstrating functionally equivalent spatial representations built up after learning target arrays between the same conditions [38]. Interestingly, the benefit of SA seems to be magnified for open search exploration of large-scale environments vs. directed guidance along routes, as the 50% improvement for spatialized information observed in the current study is much greater than the marginal advantage generally found in the previous real-world route guidance studies. This finding is likely due to the increased cognitive effort known for learning and updating in VEs [23, 24] being offset by the decreased working memory demands of processing spatialized audio vs. spatial language [25].

The effects of including physical rotation vs. imagined rotation in the VVD were investigated in Experiment 2. We expected this factor to have the greatest influence on virtual verbal learning given the importance attributed to idiothetic cues from the inclusion of physical rotation in visually rendered VEs [27, 29, 31, 33], and the importance of physical movement on updating verbally learned target locations [35, 36]. Surprisingly, the inclusion of physical rotation during training with the VVD did not lead to a significant advantage on subsequent wayfinding performance. Indeed, comparison of transfer performance between Experiments 1 and 2 shows that conditions employing spatialized descriptions led to the best verbal learning performance and did not reliably differ whether they employed real or imagined rotation.
As discussed in Experiment 2, this finding may relate to our experimental design and more research is needed to make any definitive conclusions.
For researchers interested in verbal spatial learning, especially in the context of navigation, dynamically-updated virtual verbal displays represent an excellent research tool. They also have important application to blind individuals for remote environmental learning before traveling to a new place or as part of a multi-modal virtual interface for training sighted people in low-light environments. Until now, their efficacy as a research tool or navigation aid was questionable, as VVD training seemed to lead to deficient cognitive map development. However, the results of this paper clearly demonstrate that the VVD can be used to support these tasks and can be as effective as verbal learning in real buildings or from a visual display when spatialized verbal descriptions are used. These findings have clear implications for the importance of incorporating spatialized audio in dynamically-updated verbal interfaces.
References 1. Taylor, H.A., Tversky, B.: Spatial mental models derived from survey and route descriptions. Journal of Memory and Language 31, 261–292 (1992) 2. Denis, M., et al.: Spatial Discourse and Navigation: An analysis of route directions in the city of Venice. Applied Cognitive Psychology 13, 145–174 (1999) 3. Lovelace, K., Hegarty, M., Montello, D.: Elements of good route directions in familiar and unfamiliar environments. In: Freksa, C., Mark, D.M. (eds.) Spatial information theory: Cognitive and computational foundations of geographic information science, pp. 65–82. Springer, Berlin (1999) 4. Tversky, B.: Spatial perspective in descriptions. In: Bloom, P., et al. (eds.) Language and Space, pp. 463–492. MIT Press, Cambridge (1996) 5. Loomis, J.M., et al.: Assisting wayfinding in visually impaired travelers. In: Allen, G.L. (ed.) Applied spatial cognition: From research to cognitive technology, pp. 179–202. Erlbaum, Mahwah (2007) 6. Giudice, N.A.: Navigating novel environments: A comparison of verbal and visual learning, Unpublished dissertation, University of Minnesota, Twin Cities (2004) 7. Giudice, N.A.: Wayfinding without vision: Learning real and virtual environments using dynamically-updated verbal descriptions. In: Conference and Workshop on Assistive Technologies for Vision and Hearing Impairment, Kufstein, Austria (2006) 8. Giudice, N.A., et al.: Spatial learning and navigation using a virtual verbal display. ACM Transactions on Applied Perception (in revision) 9. Giudice, N.A., Bakdash, J.Z., Legge, G.E.: Wayfinding with words: Spatial learning and navigation using dynamically-updated verbal descriptions. Psychological Research 71(3), 347–358 (2007) 10. Giudice, N.A., Legge, G.E.: Comparing verbal and visual information displays for learning building layouts. Journal of Vision 4(8), 889 (2004) 11. Ruddle, R.A., Payne, S.J., Jones, D.M.: Navigating buildings in “desk-top” virtual environments: Experimental investigations using extended navigational experience. Journal of Experimental Psychology: Applied 3(2), 143–159 (1997) 12. Bliss, J.P., Tidwell, P., Guest, M.: The effectiveness of virtual reality for administering spatial navigation training to firefighters. Presence 6(1), 73–86 (1997) 13. Waller, D., Hunt, E., Knapp, D.: The transfer of spatial knowledge in virtual environment training. Presence 7, 129–143 (1998) 14. Piaget, J., Inhelder, B., Szeminska, A.: The child’s conception of geometry. Basic Books, New York (1960)
15. Siegel, A., White, S.: The development of spatial representation of large scale environments. In: Reese, H. (ed.) Advances in Child Development and Behavior. Academic Press, New York (1975) 16. O’Keefe, J., Nadel, L.: The hippocampus as a cognitive map. Oxford University Press, London (1978) 17. Ehrlich, K., Johnson-Laird, P.N.: Spatial descriptions and referential continuity. Journal of Verbal Learning & Verbal Behavior 21, 296–306 (1982) 18. Loomis, J.M., et al.: Personal guidance system for people with visual impairment: A comparison of Spatial Displays for route guidance. Journal of Visual Impairment & Blindness 99, 219–232 (2005) 19. Loomis, J.M., Golledge, R.G., Klatzky, R.L.: Navigation system for the blind: Auditory display modes and guidance. Presence 7, 193–203 (1998) 20. Marston, J.R., et al.: Evaluation of spatial displays for navigation without sight. ACM Transactions on Applied Perception 3(2), 110–124 (2006) 21. Simpson, B.D., et al.: Spatial audio as a navigation aid and attitude indicator. In: Human Factors and Ergonomics Society 49th Annual Meeting, Orlando, Florida (2005) 22. Oving, A.B., Veltmann, J.A., Bronkhorst, A.W.: Effectiveness of 3-D audio for warnings in the cockpit. Int. Journal of Aviation Psychology 14, 257–276 (2004) 23. Richardson, A.E., Montello, D.R., Hegarty, M.: Spatial knowledge acquisition from maps and from navigation in real and virtual environments. Memory & Cognition 27(4), 741– 750 (1999) 24. Wilson, P.N., Foreman, N., Tlauka, M.: Transfer of spatial information from a virtual to a real environment. Human Factors 39(4), 526–531 (1997) 25. Klatzky, R.L., et al.: Cognitive load of navigating without vision when guided by virtual sound versus spatial language. Journal of Experimental Psychology: Applied 12(4), 223– 232 (2006) 26. Lathrop, W.B., Kaiser, M.K.: Acquiring spatial knowledge while traveling simple and complex paths with immersive and nonimmersive interfaces. Presence 14(3), 249–263 (2005) 27. Lathrop, W.B., Kaiser, M.K.: Perceived orientation in physical and virtual environments: Changes in perceived orientation as a function of idiothetic information available. Presence (Camb) 11(1), 19–32 (2002) 28. Bakker, N.H., Werkhoven, P.J., Passenier, P.O.: The effects of proprioceptive and visual feedback on geographical orientation in virtual environments. Presence 8(1), 36–53 (1999) 29. Ruddle, R.A., Payne, S.J., Jones, D.M.: Navigating large-scale virtual environments: What differences occur between helmet-mounted and desk-top displays. Presence 8(2), 157–168 (1999) 30. Wraga, M., Creem-Regehr, S.H., Proffitt, D.R.: Spatial updating of virtual displays during self- and display rotation. Mem. and Cognit. 32(3), 399–415 (2004) 31. Klatzky, R.L., et al.: Spatial updating of self-position and orientation during real, imagined, and virtual locomotion. Psychological Science 9(4), 293–299 (1998) 32. Grant, S.C., Magee, L.E.: Contributions of proprioception to navigation in virtual environments. Human Factors 40(3), 489–497 (1998) 33. Farrell, M.J., et al.: Transfer of route learning from virtual to real environments. Journal of Experimental Psychology: Applied 9(4), 219–227 (2003) 34. Presson, C.C., Montello, D.R.: Updating after rotational and translational body movements: Coordinate structure of perspective space. Perception 23(12), 1447–1455 (1994)
35. de Vega, M., Rodrigo, M.J.: Updating spatial layouts mediated by pointing and labelling under physical and imaginary rotation. European Journal of Cognitive Psychology 13, 369–393 (2001) 36. Avraamides, M.N.: Spatial updating of environments described in texts. Cognitive Psychology 47(4), 402–431 (2003) 37. Chance, S.S., et al.: Locomotion mode affects the updating of objects encountered during travel: The Contribution of vestibular and proprioceptive inputs to path integration. Presence 7(2), 168–178 (1998) 38. Klatzky, R.L., et al.: Encoding, learning, and spatial updating of multiple object locations specified by 3-D sound, spatial language, and vision. Experimental Brain Research 149(1), 48–61 (2003)
Cognitive Surveying: A Framework for Mobile Data Collection, Analysis, and Visualization of Spatial Knowledge and Navigation Practices

Drew Dara-Abrams

Depts. of Geography and Psychology, University of California, Santa Barbara
[email protected]
Abstract. Spatial cognition researchers must, at present, choose between the relevance of the real world and the precision of the lab. Here I introduce cognitive surveying as a framework of computational techniques to enable the automated and precise study of spatial knowledge and navigation practices in everyday environments. I draw on surveying engineering to develop the framework’s components: a hardware platform, data structures and algorithms for mobile data collection, and routines for data analysis and visualization. The cognitive surveying system tracks users with GPS, allows them to label meaningful points as landmarks, and asks them to point toward or estimate the distance to out-of-sight landmarks. The data from these and other questions is then used to produce specific analyses and comprehensive overviews of the user’s spatial knowledge and navigation practices, which will be of great interest to spatial cognition researchers, the developers of location-based services, urban designers, and city planners alike. Keywords: spatial cognition, spatial knowledge, navigation, geographic data collection, GPS tracking.
1 Introduction
Much has been established about how people learn and navigate the physical world thanks to controlled experiments performed in laboratory settings. Such studies have articulated the fundamental properties of “cognitive maps,” the fidelity of the sensory and perceptual systems we depend on, and the set of decisions we make in order to reach a novel destination, for instance. Clear and precise findings certainly, but in the process what has often been controlled away is the worldly context of spatial cognition. To divorce internal mental processes from the external influences of the world is to tell an incomplete story,
This work has been generously supported by the National Science Foundation through the Interactive Digital Multimedia IGERT (grant number DGE-0221713) and a Graduate Research Fellowship. Many thanks to Martin Raubal, Daniel Montello, Helen Couclelis, and, of course, Alec and Benay Dara-Abrams for suggestions.
for what sort of thinking is more intimately and concretely tied to our physical surroundings? And, forgetting the real world also means forgetting that spatial cognition research can be valuable to professionals and ordinary people alike. Building construction and city planning projects are oftentimes so complex that concerns about engineering or budget take all attention away from the impact that the environments will ultimately have on the people that will live and work there. Ecological concerns are now addressed with environmental-impact reports. Spatial cognition research has already identified techniques that could be used to produce similarly useful reports on how a building floor plan can allow visitors to successfully navigate or how a neighborhood can be designed to naturally draw residents together in communal spaces. Non-professionals may not be charged with designing their surroundings, but still, many would appreciate having an opportunity to contribute. Spatial cognition methodology can be used to collect, aggregate, and analyze their input. Also, the same approach can be applied to assist individuals: In-car navigation systems and other location-based services are notoriously inflexible and would certainly be improved if they took into account end-users’ spatial knowledge and other subjective tendencies. What is needed are techniques for precisely measuring spatial cognition in real-world settings and analyzing the behavioral data in an automated fashion so that results are consistent and immediately available to act on. Surveying engineers have perfected this sort of measurement and analysis; the only difference is that whereas a surveyor maps the physical world more or less as it exists, spatial cognition researchers attempt to characterize the world as it is used and remembered by people. Such a sentence is most always followed by a reference to The Image of the City, that slim and intriguing volume by Kevin Lynch (1960). He and his fellow urban planners come to understand three very different American cities by studying and interviewing residents, and by aggregating the results for each city, they produce “images” that represent how an average person might remember Boston, Jersey City, or downtown Los Angeles. These “images” are composed of five elements (paths, edges, districts, nodes, and landmarks), which Lynch details in terms at once understandable to an urban designer and meaningful to a spatial cognition researcher. Unfortunately, it’s less clear how to go about collecting “images of the city” on your own with any precision or consistency, since a fair amount of expert interpretation appears to be involved. Toward that end, let me propose cognitive surveying as an umbrella under which we can pursue the goal of characterizing people’s spatial knowledge and navigation practices in more carefully defined computational and behavioral terms, while still producing results that are understandable to researchers and laypeople alike. In this paper, I will specify the architecture of such a system for behavioral data collection, analysis, and visualization. Much relevant research already exists in spatial cognition, surveying engineering, geographic information science, and urban planning; this framework of cognitive surveying ought to serve well to integrate the pieces.
2 Cognitive Surveying
Cognitive surveying is the measurement of a person's spatial knowledge and navigation practices conducted in a piecemeal fashion in the real world. In computational terms, cognitive surveying can be described as a combination of data structures, algorithms, analysis routines, and visualizations. When implemented as a useable system, some electronics are also involved (although this framework is not tied to the particulars of any one computer system). The novel contribution of cognitive surveying is this integration, and so let's begin by stepping through the framework in its entirety, even if not every component will be required for each application.
2.1 Hardware
The tools of a surveyor have been combined, in recent years, into the single package of a total station, which contains a theodolite to optically measure angles, a microwave or infrared system to electronically measure distances, and a computer interface to record measurements (for surveying background, see Anderson & Mikhail, 1998). Some total stations also integrate GPS units to take measurements in new territory or when targets are out of sight. The equipment remains bulky, yet its functions can also now be approximated with portable, consumer-grade hardware: a GPS unit, an electronic compass, and a mobile computer (as illustrated in Figure 1). If the mobile computer has its own wireless Internet connection, it can automatically upload measurements to a centralized server for analysis. Otherwise, the user can download the measurements to a PC that has an Internet connection. Although it is a somewhat more involved process, asking the user to connect the mobile device to a PC at home, work, or in a lab provides a further opportunity to also assess their spatial knowledge away from the environment, by presenting tasks, like a map arrangement, on the big screen of a PC. (More on these tasks and other measurements momentarily.) Ultimately, cellular phones may be the platform of choice for real-world measurement, since they are connected, ready at hand, and kept charged.
2.2 Mobile Data Collection
With their equipment, surveyors make only a few types of measurements, but by repeating elementary measurements they are able to perform complex operations like mapping property boundaries. The measurement techniques of cognitive surveying are similarly elementary, already widely used in the spatial cognition literature (if rarely used together), and become interesting when repeated according to a plan (detailed in Figure 2). Since all this data is collected in the field, the most fundamental measurement is the user’s position, which can be captured to within a few meters by GPS, cleaned to provide a more accurate fix, and recorded to a travel log (see Shoval & Isaacson, 2006). The GPS information can be supplemented with status information provided by the user
Fig. 1. Hardware components to use for mobile data collection: a GPS unit (worn on the shoulder) and an electronic compass affixed to a mobile computer (worn on a sash or belt clip), linked by Bluetooth and USB, with WiFi/Internet upload of data to a server for analysis and visualization, and USB or serial download to the user's PC for the map arrangement task.
datetime
lat.
long.
measure
value
2008-01-20 134.224 32.332 neighborhood downtown ...
GPS unit
...
...
...
user input: user changes status (e.g., “lost” or “not lost”; “in car” or “on foot”)
algorithm: determine when to ask point measure questions to sufficiently sample a region
algorithm: clean traces
...
electronic compass algorithm: determines when to ask distance estimate questions in order to cover entire environment with sufficient repetition
travel log (records user’s movement every few seconds) datetime lat. long. altitude status 2008-01-20 134.224 32.332 24.5 ft.
completely lost
...
...
...
distance estimates
lat.
target estimated long. landmark distance
...
...
...
algorithm: determines when to ask direction estimate questions in order to cover entire environment with sufficient repetition
direction estimates (questions like “which way to the courthouse?”)
(used to precompute sampling patterns and to clean GPS input)
2008-01-20 134.224 32.332 courthouse 10 units ...
...
base map of environment
(questions like “how far to the courthouse?”)
datetime
...
...
datetime
target compass long. landmark heading
...
...
...
...
user input: user decides to label a landmark
landmarks (points labeled by user) name lat. long. county courthouse 134.254 32.362 ...
lat.
2008-01-20 134.224 32.332 courthouse 24.5 ° ...
algorithm: suggests to user when to label a landmark
correct for magnetic declination
...
...
landmark map arrangement (optional): download landmarks to PC and ask participant to drag and drop dots to make a map
Fig. 2. Data structures and algorithms for mobile data collection
The travel log alone allows us to begin to evaluate navigation practices (to be discussed in Section 3). More complex measurements are required to assess spatial knowledge. These are landmarks, other point measures, direction estimates, and distance estimates (to be discussed in Section 4). Landmarks are points of personal importance
labeled by the user. She may label any point of her choosing; an algorithm can be used to suggest potential landmarks to her as well. Her knowledge of the landmarks’ relative locations is measured by direction and distance questions. From one landmark, she is asked to point to other landmarks, using an electronic compass. (Compass readings will need to be corrected against models of magnetic declination, which are available on-line.) Also, she is asked to judge the distance from her current location to other landmarks by keying in numerical estimates. In addition to labeling landmarks, the user can be asked to provide other point measures. For instance, she can be asked “What’s the name of the neighborhood you’re currently in?” The algorithms that decide when to ask these questions can call on a base map of the environment, not to mention the user’s travel log. Finally, when the user is sitting at a larger computer screen—back home at her PC, say—she can be asked to again consider her landmarks by arranging them to make a map. For all of these tasks, both the user’s answer and her reaction time can be recorded. From these measurements will come a comprehensive data set on an individual’s spatial knowledge and navigation practices to analyze and visualize.
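To make the measurement types above concrete, the following is a minimal Python sketch of the field data structures. The record layouts and field names are illustrative assumptions for exposition, not the paper’s implementation.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TravelLogEntry:          # recorded every few seconds from GPS
    time: datetime
    lat: float
    lon: float
    altitude_m: float
    status: str                # user-supplied, e.g. "lost", "in car"

@dataclass
class Landmark:                # a point labeled by the user
    name: str
    lat: float
    lon: float

@dataclass
class DirectionEstimate:       # compass pointing toward a target landmark
    time: datetime
    lat: float
    lon: float
    target: str
    heading_deg: float         # corrected for magnetic declination
    response_time_s: Optional[float] = None

@dataclass
class DistanceEstimate:        # keyed-in numeric judgment toward a target
    time: datetime
    lat: float
    lon: float
    target: str
    distance: float            # in whatever units the user chooses
    response_time_s: Optional[float] = None

@dataclass
class PointMeasure:            # e.g. "what neighborhood is this?"
    time: datetime
    lat: float
    lon: float
    measure: str
    value: str
```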
2.3 Data Analysis and Visualization
Lynch only produced “images” for groups, but from these point, angle, and distance data can come both individual and aggregate analyses (see Figure 3). Of particular interest will be the routes that people take, the accuracy of their spatial knowledge, and the contents of their spatial knowledge (all of which will be discussed below). While quantitative output will be necessary to run behavioral studies, also important will be visualizations, which are often much more effective at quickly conveying, for instance, the distorted nature of spatial knowledge or the cyclical nature of a person’s movement day after day (see also Dykes & Mountain, 2003; Kitchin, 1996a).
3 Moving, Traveling, Navigating
People move, doing so for any number of reasons, through any number of spaces, at any number of scales. As such, a number of research traditions consider human movement. At the scale of cities and other large environments, time geography studies how the constraints of distance limit an individual, and transportation geography studies the travel of masses between origins and destinations (Golledge & Stimson, 1996). Cognition is certainly involved in both, but memory and cognitive processes are most evident, and most studied, in navigation. More specifically, spatial cognition research often decomposes navigation into locomotion and wayfinding (Montello, 2005), the former being perceptually guided movement through one’s immediate surrounds (walking through a crowded square and making sure to avoid obstacles, say) and the latter being route selection between distant locations (figuring out how to travel from Notre Dame to the Luxembourg Gardens, say). When attempting to understand how an individual
[Fig. 3. Data analysis and visualization. Inputs: data collected from participants (travel log, landmarks, direction and distance estimates, point measures, landmark map arrangements) and environmental data sources (base map of the environment, models of environmental form). Individual-level analyses: routes (characterize the participant’s travel behavior—loop-backs, pauses, shortcuts; determine the most popular route taken between two points; relate routes to the structure of the environment, like Conroy Dalton, 2003) and spatial knowledge accuracy (average across repeated direction and distance estimates; infer missing measurements and propagate error among existing ones; fit together multiple estimates with multidimensional scaling, like Waller & Haun, 2003; relate accuracy to the structure of the environment by correlating with measures from environmental form models, as in Dara-Abrams, 2008). Aggregate analyses: spatial knowledge contents (compute the regularity of landmark use across multiple participants; evaluate distributions of point measures, say, to identify neighborhood areas, drawing polygons around regions of similar points, similar to Montello, Goodchild, Gottsegen, & Fohl, 2003). Visualizations: exploration patterns (map out routes taken by participants and frequency of visits, overlaid on the environment, like Dykes and Mountain’s (2003) Location Trends Extractor) and the “image of the city” (arrange landmarks to fit measurements taken by an individual or a group; outline best-fit neighborhood boundaries; mark highly frequented routes).]
uses and remembers a city, wayfinding behavior is of particular interest. Which locations does a person choose to visit? What routes does she take between those places? Do those routes appear to be derived from certain wayfinding strategies? When is she confident in her wayfinding abilities and when might she instead be feeling lost? Cognitive surveying can allow us to consider these questions in real-world settings thanks to the automated tracking of GPS, the option to collect user input along the way, and the ability to analyze and visualize this data together with users’ custom sets of landmarks, their direction and distance estimates, and any number of other measurements taken in the field.
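As an illustration of the kind of travel-log analysis discussed above, here is a minimal sketch that filters implausible GPS fixes and flags pauses. It reuses the TravelLogEntry record from the earlier sketch; the thresholds and the haversine helper are illustrative assumptions, not parameters from the paper.

```python
import math
from typing import List

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two fixes, in meters."""
    r = 6371000.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dp = math.radians(lat2 - lat1)
    dl = math.radians(lon2 - lon1)
    a = math.sin(dp / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dl / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def clean_log(log: List["TravelLogEntry"], max_speed_mps=50.0):
    """Drop fixes implying an impossible speed (a simple GPS-noise filter)."""
    if not log:
        return []
    cleaned = [log[0]]
    for entry in log[1:]:
        prev = cleaned[-1]
        dt = (entry.time - prev.time).total_seconds()
        if dt <= 0:
            continue
        speed = haversine_m(prev.lat, prev.lon, entry.lat, entry.lon) / dt
        if speed <= max_speed_mps:
            cleaned.append(entry)
    return cleaned

def find_pauses(log, radius_m=30.0, min_duration_s=180.0):
    """Return (start, end) index pairs where the user stayed within radius_m."""
    pauses, i = [], 0
    while i < len(log):
        j = i
        while (j + 1 < len(log)
               and haversine_m(log[i].lat, log[i].lon,
                               log[j + 1].lat, log[j + 1].lon) <= radius_m):
            j += 1
        if (log[j].time - log[i].time).total_seconds() >= min_duration_s:
            pauses.append((i, j))
            i = j + 1
        else:
            i += 1
    return pauses
```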
4 Spatial Knowledge
Spatial knowledge comprises the stored memories we call on when orienting ourselves in a familiar environment, navigating toward a known destination, writing route directions for a visitor from out of town, and so on. Like other aspects of cognition, spatial knowledge can be modeled in a computational fashion (Kuipers, 1978). That, however, is not the goal of cognitive surveying, which is focused on measuring spatial knowledge. In fact, the “read-out” provided by a cognitive surveying system should be of much interest to cognitive scientists who are developing and calibrating computational models of spatial knowledge. Therefore, what is needed for the purposes of cognitive surveying is not a theory of spatial knowledge itself but simply a working abstraction that can be used to measure spatial knowledge. Lynch’s five elements are one such abstraction,
but some are too subjective. By borrowing from surveying engineering, we can fashion a more computationally precise set of elements to use to measure spatial knowledge: landmarks, direction estimates, distance estimates, and regions.

4.1 Landmarks and Other Locations
Land surveyors rely on control points and monuments to fix their measurements. For humans, landmarks are the equivalent identifying features. Like the Eiffel Tower, landmarks are often visually salient, have special meaning, and stand in prominent locations (Sorrows & Hirtle, 1999). Yet just as the plainest pipe may be used as a monument by a surveyor, people often depend on unconventional landmarks when providing route directions or describing their neighborhoods. Allowing each person to identify his own custom set of landmarks is an important way in which cognitive surveying can more closely study the spatial knowledge of individuals rather than aggregates (see Figure 4). Algorithms can be used to suggest potential landmarks to a user. Heuristics that rely on GPS input measure signal loss—since it’s often when entering buildings that sight of the GPS satellites is lost (Marmasse & Schmandt, 2000)—or duration—as people are more likely to remain for a while in or near landmarks (Ashbrook & Starner, 2003; Nurmi & Koolwaaij, 2006). Another approach is to rely on information about the environment itself to identify in advance which locations may make for salient landmarks (Raubal & Winter, 2002) and to then suggest those when the user is nearby.
[Fig. 4. Landmark identification approaches. User input: the user presses a button whenever near a landmark she would like to label. GPS signal loss (see Marmasse & Schmandt, 2000): watch for loss of the GPS signal; when the signal degrades and then disappears, assume the user has entered a building and ask whether the obstructing location is worth labeling. GPS point clustering (after Ashbrook & Starner, 2003): filter points by speed to separate moving points from pause points; when pause points cluster together, ask the user whether this is a meaningful landmark worth labeling. Environmental analysis (after Raubal & Winter, 2002): ahead of time, identify potential landmarks by analyzing base maps, taking into account visual attraction (facade area, shape, color, visibility) and semantic attraction (cultural and historical importance, signage); when the user nears a potential landmark, ask whether it is personally meaningful and worth labeling.]
4.2 Directions and Distances
Once a surveyor has identified measurement points, the next task is to determine the directions and distances among the set. Using triangulation and other trigonometric techniques may mean that some measurements can be inferred from others. People certainly rely on shortcuts and heuristics, too, but whereas
a surveyor strives for accuracy, human spatial knowledge is systematically distorted, presumably in the interest of efficiency (Tversky, 1992). People asked to estimate the direction from San Diego to Reno are likely to draw an arrow pointing northeast, toward the center of Nevada but away from the actual northwestern direction (Stevens & Coupe, 1978). People stopped on the street are significantly more accurate at pointing to distant locations when on an orthogonal street grid than at an odd angle (Montello, 1991a). And people asked to estimate the distance between university buildings judge the more common buildings to be closer to the landmark buildings than vice versa—their distance estimates are asymmetric (Sadalla, Burroughs, & Staplin, 1980). Paper, pencil, and manual pointing dials have served these experimenters well; the studies are controlled and the results are straightforward to interpret, although see Montello, Richardson, Hegarty, and Provenza (1999) and Waller, Beall, and Loomis (2004) on the relative merits of various direction estimate methods, some of which must be analyzed using circular statistics (Batschelet, 1981); also see Montello (1991b) on distance estimate methods. Cognitive surveying can use similarly simple direction and distance estimates, but by automating the data collecting with GPS tracking, an electronic compass, and a mobile computer, the questions can be asked in situ for landmarks drawn from each participant’s custom set. The time taken by the user to complete an estimate can be precisely recorded as well. Surveying Operations for Direction and Distance Estimates. If direction and distance estimates can be made at any time, when and where should the user be queried? To tackle this question, surveying engineering offers us a number of surveying operations (illustrated in Figure 5; see Anderson & Mikhail, 1998; Barnes, 1988). In a traverse, one of the simpler and more accurate surveying operations, control points are placed in a line and at each, measurements are taken backward. Repeatedly asking a person to estimate the distance and direction toward the place they just left may also yield relatively accurate measurements. In fact, this is similar to the “look-back” strategy that children and adults can use to better learn a route (Cornell, Heth, & Rowat, 1992). Asking questions according to a traverse arrangement may thus be more appropriate for testing route knowledge rather than survey knowledge (more on the distinction in a moment). Triangulation, in which measurements are combined following the law of sines, is likely a better method for capturing a person’s knowledge of the configuration of a number of landmarks spread over a large environment (see Kirasic, Allen, & Siegel, 1984, where the authors call their procedure projective convergence; triangulation can be used for similar purposes in indoor settings as well: Hardwick, McIntyre, & Pick, 1976). Land surveyors using triangulation cover their terrain with triangles, measure the inner angles of those triangles, and finally measure the distance of one leg of a triangle. From that baseline distance measurement, the entire network of measurements can be scaled. Human spatial knowledge is hardly as precise as a microwave or infrared distance measuring system, and so to evaluate distance knowledge based on only one distance estimate would not be
[Fig. 5. Surveying operations for collecting direction and distance estimates. Traversing: estimate the distance (d) and angle (α) to the previous landmark visited; its advantages are easy computations and few required estimates, but it is perhaps only useful for measuring routes (see Cornell, Heth, & Rowat, 1992). Triangulation (direction only): estimate directions to other landmarks so that each landmark sits at the vertex of a triangle, then fit the direction estimates together to determine the relative position of each landmark; it can measure survey knowledge for large areas but depends on a large number of estimates, and it is an open question how triangle size affects the data collected. Trilateration: estimate distances from landmark to landmark (note that A–B can be perceived to be a different distance than B–A) and combine the distance estimates using multidimensional scaling to produce a best-fit arrangement of landmarks; advantages, disadvantages, and open questions are similar to triangulation, and distance knowledge is often poorer than direction knowledge (Waller & Haun, 2003). Triangulation (direction and distance): estimate both directions and distances from landmark to landmark and combine the estimates using direction/distance scaling into a best-fit arrangement; this gives a more comprehensive measure of survey knowledge but is computationally intensive, and it is an open question whether missing estimates can be approximated based on others that were in fact taken.]
proper. Thus, triangulation is probably best used for cognitive surveying when performed without any distance estimates or with a number of distance estimates distributed around the triangle network. If using only distance measurements, trilateration—as is performed in more complex forms to derive positions from GPS signals—can be used instead. In any case, the cognitive surveyor has an advantage over the land surveyor: Precise base maps already exist for most urban and natural environments, and so we can use information about streets, paths, and other environmental features to guide the sampling design that determines when to ask users to estimate directions and distances. (More on sampling design in the next section.) As these direction and distance estimates are collected, they can be integrated in order to approximate the user’s overall spatial knowledge and to propagate error among repeated measurements. Multidimensional scaling, or MDS, is one such technique often used to turn pairwise distance and direction estimates into a best-fit two-dimensional configuration (Waller & Haun, 2003). When applying MDS to distance estimates alone, the optimization procedure is effectively trilateration repeated many times. Map Arrangements. In addition to taking multiple direction and distance estimates so that MDS and other fitting routines have more data to work with, it’s ideal to test people on a range of tasks, with the goal of using different methods to converge on their spatial knowledge (see Kitchin, 1996b). Direction and distance questions evaluate knowledge of the relative positions of two locations. To consider people’s overall knowledge of an environment, using a different approach, we can ask them to also make a map arrangement. The user is given dots for each of her landmarks and asked to arrange those dots so that they best approximate the real-life locations of the landmarks. While a map arrangement may be simplistic, the technique has important advantages over sketch mapping (first used by Lynch, 1960): asking people to draw a map is a difficult process to automate, the sketches are tricky to score consistently, and the process conflates spatial knowledge with drawing ability. Like a sketch map, a map arrangement does often serve as a compelling visualization of a person’s distorted spatial knowledge. Moreover, x and y coordinates for each landmark can easily be extracted from a map arrangement, and these measures can be compared with those produced by MDS using bidimensional regression (Friedman & Kohler, 2003). Asking direction and distance questions in the field and doing map arrangements when users have returned home or to the lab will provide us with a more complete understanding of their spatial knowledge.
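As a rough illustration of how pairwise distance estimates might be fit into a two-dimensional configuration, here is a classical (Torgerson) multidimensional scaling sketch using NumPy. It is a generic MDS computation, not the specific scaling procedure of Waller and Haun (2003), and the symmetrization step is an assumption made because human distance estimates can be asymmetric.

```python
import numpy as np

def classical_mds(dist_estimates, n_dims=2):
    """Fit an n_dims configuration to a square matrix of pairwise distance
    estimates (entry [i, j] = estimated distance from landmark i to j)."""
    d = np.asarray(dist_estimates, dtype=float)
    d = (d + d.T) / 2.0                    # symmetrize asymmetric judgments
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n    # centering matrix
    b = -0.5 * j @ (d ** 2) @ j            # double-centered squared distances
    eigvals, eigvecs = np.linalg.eigh(b)
    order = np.argsort(eigvals)[::-1][:n_dims]
    coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))
    return coords                          # rows are best-fit landmark positions

# example: three landmarks with slightly inconsistent estimates
estimates = [[0, 10, 14],
             [9,  0,  8],
             [15, 7,  0]]
print(classical_mds(estimates))
```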
4.3 Learning
People do not come by their spatial knowledge of an environment instantaneously—we learn over time from repeated exposure and novel experience. The most widely accepted theory of spatial microgenesis (as one’s acquisition of spatial knowledge for an environment is called) proposes that people first learn the locations of pointlike landmarks, then learn the linear routes that connect pairs of landmarks, and
finally learn how the landmarks and routes fit into an overall configuration, known as survey knowledge (Siegel & White, 1975). If people follow discrete stages in this manner, they will not begin to acquire metric knowledge, like the direction between a pair of landmarks, until the final stage. Yet longitudinal studies suggest that spatial microgenesis may progress in a continuous manner, without qualitatively different stages (Ishikawa & Montello, 2006). The consistent, automated data collection that a cognitive surveying system offers will be invaluable for studying how people learn an environment over time.
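To show how the repeated, timestamped estimates could be used to track spatial learning, here is a small sketch that computes mean absolute pointing error per session, using the DirectionEstimate and Landmark records from the earlier sketch. The notion of a "session" as a calendar day and the true-bearing helper are illustrative assumptions.

```python
import math
from collections import defaultdict

def true_bearing_deg(lat1, lon1, lat2, lon2):
    """Approximate true bearing from (lat1, lon1) to (lat2, lon2), in degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dl = math.radians(lon2 - lon1)
    x = math.sin(dl) * math.cos(p2)
    y = math.cos(p1) * math.sin(p2) - math.sin(p1) * math.cos(p2) * math.cos(dl)
    return math.degrees(math.atan2(x, y)) % 360

def angular_error(estimated, actual):
    """Smallest absolute difference between two headings, in degrees."""
    return abs((estimated - actual + 180) % 360 - 180)

def pointing_error_by_session(direction_estimates, landmarks):
    """Mean absolute pointing error, grouped by calendar date of the estimate."""
    positions = {lm.name: (lm.lat, lm.lon) for lm in landmarks}
    errors = defaultdict(list)
    for est in direction_estimates:
        if est.target not in positions:
            continue
        t_lat, t_lon = positions[est.target]
        actual = true_bearing_deg(est.lat, est.lon, t_lat, t_lon)
        errors[est.time.date()].append(angular_error(est.heading_deg, actual))
    return {day: sum(v) / len(v) for day, v in sorted(errors.items())}
```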
4.4 Regions
One way by which we learn environments is to subdivide them into meaningful regions (Hirtle, 2003). In the case of cities, these regions are usually neighborhoods, districts, wards, barrios, and so on. Some are official while others are informally defined—the City of London versus the Jets’ turf. Even if a region name is in common parlance, its boundary is still likely vague (Montello, Goodchild, Gottsegen, & Fohl, 2003). Regions may be areas, but like any other polygon, their extents can be approximated by point measures. In other words, users can be asked occasionally “What’s the name of this neighborhood?” and around that sampling of points, polygons of a certain confidence interval can be drawn. As with direction and distance estimates, there is the question of when to ask point measurement questions. A number of sampling designs can be used (as in Figure 6): wait for user input; ask questions at preset temporal intervals; ask questions at uniform, preset spatial locations; and ask questions at preset spatial locations whose selection has been informed by base maps of the environment in question. The best approach is likely a combination of all four.
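The combined sampling approach could look roughly like the following sketch, which reuses the haversine helper from the earlier sketch; the priority order, thresholds, and rate limit are illustrative assumptions rather than parameters given in the paper.

```python
def should_ask_point_question(now, here_lat, here_lon, state,
                              informed_points=None, uniform_points=None,
                              radius_m=100.0, min_gap_s=600.0,
                              user_requested=False):
    """Decide whether to pose a point-measure question right now.

    state holds the time of the last question asked; informed_points are
    sample locations derived from a base map, uniform_points a regular grid
    used as a fallback when no base map is available."""
    # never overwhelm the user: rate-limit questions per time period
    last = state.get("last_question_time")
    if last is not None and (now - last).total_seconds() < min_gap_s:
        return False
    # always honor an explicit user request
    if user_requested:
        state["last_question_time"] = now
        return True
    # informed spatial stratification when a base map is available,
    # otherwise uniform spatial stratification
    candidates = informed_points if informed_points else (uniform_points or [])
    for lat, lon in candidates:
        if haversine_m(here_lat, here_lon, lat, lon) <= radius_m:
            state["last_question_time"] = now
            return True
    return False
```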
[Fig. 6. Sampling approaches. Wait for user input: the user presses a button whenever ready to answer a question. Temporal sampling: every t minutes, ask a question. Uniform spatial stratification: when the user comes within radius r of a potential sample point, ask a question (see Longley et al., 2005, p. 91). Informed spatial stratification: based on a base map, cluster potential sample points where the user is more likely to travel. A combined approach: use informed spatial stratification when a base map is available and otherwise revert to uniform spatial stratification; limit the number of questions asked per time period, to ensure that the user is not overwhelmed; and allow the option of user input.]
Other point measures may be collected and analyzed in a similar manner. For example, in one of the more clever demonstrations of GPS tracking, Christian Nold has logged people’s position and their galvanic skin response with the goal of mapping the physical arousal associated with different parts of a city (see biomapping.net).
This sort of subjective spatial data is highly personal, yet when aggregated it can be of use. Again, take the example of regions. Mapping firms now compete to provide data sets of city neighborhoods to Web search engines, real estate firms, and others who want to organize their spatial data in a more intuitive manner. (For example, see Zillow.com and EveryBlock.com.) A cognitive surveying system is one means by which an individual’s spatial knowledge can be measured and aggregated for these sorts of purposes.
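One simple way to turn labeled point measures into region outlines, as described in Section 4.4, is to take a convex hull around all points given the same label. This is a minimal sketch using SciPy and the PointMeasure record from the earlier sketch; a convex hull is only a crude stand-in for the confidence-interval polygons mentioned above.

```python
import numpy as np
from scipy.spatial import ConvexHull
from collections import defaultdict

def region_outlines(point_measures, min_points=3):
    """Group point measures by answer (e.g., neighborhood name) and return
    a convex-hull polygon, as (lat, lon) vertices, for each label."""
    by_label = defaultdict(list)
    for pm in point_measures:
        by_label[pm.value].append((pm.lat, pm.lon))
    outlines = {}
    for label, pts in by_label.items():
        if len(pts) < min_points:        # a hull needs at least three points
            continue
        pts = np.array(pts)
        hull = ConvexHull(pts)
        outlines[label] = pts[hull.vertices].tolist()
    return outlines
```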
5 Spatial Ability and Other Individual Differences
On the other hand, aggregating spatial knowledge, navigation practices, and other behavioral results can mask individual differences. Some people certainly do better than others at navigating an unfamiliar city, and psychometric tests certainly find that people vary in terms of their spatial abilities (Hegarty & Waller, 2005). One’s sex is sometimes all too quickly targeted as the cause of these differences, leaving aside the role of motivation, a penchant to explore, confidence in staying oriented, or the money to fund travel. These are only a selection of factors that could be considered when analyzing data collected with a cognitive surveying system.
6 User Assistance and Location-Based Services
If individual differences can be understood, those who might benefit from additional assistance can also be helped. Today’s location-based services (LBS), such as in-car navigation systems, are poorly designed and difficult to operate. If LBS can adapt themselves to better fit each particular user, they will be more usable. Accordingly, the first step is to understand the user, and for this, data collection is key. The methods of cognitive surveying can provide LBS with:
– a custom set of landmarks to use when generating route directions (Raubal & Winter, 2002)
– subjective distances that represent how long the user thinks a travel segment will take to traverse (relevant to travel constraint planning: Raubal, Miller, & Bridwell, 2004)
– a rating of the user’s relative performance on route knowledge (collected using a traverse operation) and survey knowledge (collected using a triangulation or trilateration operation), which may indicate what type of instructions to include in his route directions
– a map of what territory is known to the user and what he has yet to explore (which can be applied to formulate route directions of varying detail: Srinivas & Hirtle, 2006)
Using this sort of subjective spatial data may very well help LBS become easier to use and more useful for end users.
7 The Environment
So far we have focused on the knowledge that people carry in their heads and the cognitive processes that they use to navigate. Both are, by definition, intimately tied to the physical world, the information that it offers and the constraints that it imposes. Spatial cognition researchers wish to understand the interplay, while the designers and planners who are charged with creating and enhancing built environments want to understand how those places are used. As cognitive surveying can be performed accurately in real-world settings, such a system can be effective in both cases. The behavioral data being collected and analyzed here is perfectly suited for comparison with computational models of environmental form. In short, an environmental form model captures the patterns of accessibility or visibility for a physical setting like a building interior or a university campus using grid cells (Turner, Doxa, O’Sullivan, & Penn, 2001), lines (Hillier & Hanson, 1984), or other geometric building blocks. Quantitative measures can be computed and extracted for a certain location, a path, or an entire region. Certain environmental form measures have been found to predict the number of pedestrians walking on city streets (Hillier, Penn, Hanson, & Xu, 1993) and the accuracy of students’ spatial knowledge for their university campus (Dara-Abrams, 2008). A cognitive surveying system will help further research on the relationship between human behavior/cognition and models of environmental form. Even without the specificity of an environmental form model, the data collection and analysis of cognitive surveying can inform the work of architects, urban designers, and city planners. Lynch demonstrated that collecting “images of the city” identifies design flaws to remediate, captures reasons underlying residents’ attitudes toward development, and reveals which places are attractive to residents and which are not. These, among other practical outcomes of measuring spatial knowledge and navigation practices, are details that can guide not just the mechanics of design but also the way in which projects are presented and framed to the public. Collecting “images” depends on trained experts, but a cognitive surveying system could be deployed and used by architects and planners, as well as expert cognitive scientists.
8 Further Research and Conclusion
What I am presenting as cognitive surveying is an amalgamation of mobile computer hardware, software for data collection and analysis, ideas for behavioral studies, and practical applications. Many of these components already exist. The novelty is in the framework that unites the behavioral methodology of spatial cognition, the techniques of surveying engineering, the data analysis methods of geographic information science, and the concerns of design professionals. The basic measurements being collected are simple, even simplistic—travel logs, landmarks and other point-based measures, estimated directions and distances
between those landmarks, and so on—but from these, complex descriptions can be constructed and theoretically interesting questions addressed, including:
– When people are allowed to freely travel through an environment, does their spatial knowledge contain the same sort of systematic errors that have been found in lab-based studies?
– When people repeatedly explore an environment, how does their spatial knowledge develop over time? Does their learning follow a fixed set of qualitative stages or instead progressively increase from the beginning?
– How do spatial abilities relate to other factors that may also cause individual differences in spatial knowledge and navigation practices (e.g., regular travel extent, confidence in spatial abilities, sex, demographics)?
– What are the optimal surveying operations and sampling designs for measuring spatial knowledge? Are particular parameters more appropriate for certain circumstances and studies than others? For instance, is knowledge for a long route best tested using a different set of parameters than knowledge for a neighborhood?
– Can models of environmental form predict where people are likely to travel, which features they are likely to remember, and how accurate that spatial knowledge will likely be? If so, can these models be used to better understand which particular properties of real-world environments influence people’s spatial knowledge and navigation practices?
– How can the automated collection of this subjective data improve location-based services and assist the users of other electronic services?
– Will summaries and visualizations of people’s spatial knowledge and navigation practices make for the beginnings of a “psychological-impact report” for environmental design projects?
Cognitive surveying will better enable us to pursue all of these research questions. This paper’s contribution is the framework of cognitive surveying. In the future, I intend to present implemented systems along with results that begin to address the preceding questions. Even as a conceptual framework, cognitive surveying can already help us take spatial cognition research into the real world. We now know what sort of questions to ask of a person and what sort of measurements to record, when to ask each question and when to alternate methods, how to synthesize all these measurements and how to present them for analysis. In addressing such issues, cognitive surveying will allow us to characterize the world as it is remembered and used by people—if not with absolute accuracy, at least with consistency and ease.
References

Anderson, J.M., Mikhail, E.M.: Surveying: Theory and practice. WCB/McGraw-Hill, Boston (1998)
Ashbrook, D., Starner, T.: Using GPS to learn significant locations and predict movement across multiple users. Personal Ubiquitous Computing 7, 275–286 (2003)
Barnes, W.M.: BASIC surveying. Butterworths, London (1988)
Batschelet, E.: Circular statistics in biology. Academic Press, New York (1981)
Conroy Dalton, R.: The secret is to follow your nose. Environment and Behavior 35, 107–131 (2003)
Cornell, E.H., Heth, C.D., Rowat, W.L.: Way finding by children and adults: Response to instructions to use look-back and retrace strategies. Developmental Psychology 28, 328–336 (1992)
Dara-Abrams, D.: Modeling environmental form to predict students’ spatial knowledge of a university campus. Unpublished master’s thesis, University of California, Santa Barbara (2008)
Dykes, J.A., Mountain, D.M.: Seeking structure in records of spatio-temporal behaviour: Visualization issues, efforts and applications. Computational Statistics and Data Analysis 43, 581–603 (2003)
Friedman, A., Kohler, B.: Bidimensional regression: Assessing the configural similarity and accuracy of cognitive maps and other two-dimensional data sets. Psychological Methods 8, 468–491 (2003)
Golledge, R.G., Stimson, R.J.: Spatial behavior: A geographic perspective. Guilford Press, New York (1996)
Hardwick, D.A., McIntyre, C.W., Pick, H.L.: The content and manipulation of cognitive maps in children and adults. In: Monographs of the Society for Research in Child Development, vol. 41(3), pp. 1–55. University of Chicago Press, Chicago (1976)
Hegarty, M., Waller, D.A.: Individual differences in spatial abilities. In: Shah, P., Miyake, A. (eds.) The Cambridge handbook of visuospatial thinking. Cambridge University Press, UK (2005)
Hillier, B., Hanson, J.: The social logic of space. Cambridge University Press, Cambridge (1984)
Hillier, B., Penn, A., Hanson, J., Xu, J.: Natural movement: or, configuration and attraction in urban pedestrian movement. Environment and Planning B 20, 29–66 (1993)
Hirtle, S.C.: Neighborhoods and landmarks. In: Duckham, M., Goodchild, M.F., Worboys, M.F. (eds.) Foundations of geographic information science. Taylor & Francis, London (2003)
Ishikawa, T., Montello, D.R.: Spatial knowledge acquisition from direct experience in the environment: Individual differences in the development of metric knowledge and the integration of separately learned places. Cognitive Psychology 52, 93–129 (2006)
Kirasic, K.C., Allen, G.L., Siegel, A.W.: Expression of configurational knowledge of large-scale environments: Student’s performance of cognitive tasks. Environment and Behavior 16, 687–712 (1984)
Kitchin, R.: Exploring approaches to computer cartography and spatial analysis in cognitive mapping research: CMAP and MiniGASP prototype packages. Cartographic Journal 33, 51–55 (1996a)
Kitchin, R.: Methodological convergence in cognitive mapping research: investigating configurational knowledge. Journal of Environmental Psychology 16, 163–185 (1996b)
Kuipers, B.J.: Modeling spatial knowledge. Cognitive Science 2, 129–153 (1978)
Longley, P.A., Goodchild, M.F., Maguire, D.J., Rhind, D.W.: Geographic information systems and science. Wiley, Chichester (2005)
Lynch, K.: The image of the city. MIT Press, Cambridge (1960)
Marmasse, N., Schmandt, C.: Location-aware information delivery with ComMotion. In: Thomas, P., Gellersen, H.-W. (eds.) Handheld and ubiquitous computing. Springer, Heidelberg (2000)
Montello, D.R.: Spatial orientation and the angularity of urban routes: A field study. Environment and Behavior 23, 47–69 (1991a)
Montello, D.R.: The measurement of cognitive distance: Methods and construct validity. Journal of Environmental Psychology 11, 101–122 (1991b)
Montello, D.R.: Navigation. In: Shah, P., Miyake, A. (eds.) The Cambridge handbook of visuospatial thinking, pp. 257–294. Cambridge University Press, UK (2005)
Montello, D.R., Goodchild, M.F., Gottsegen, J., Fohl, P.: Where’s downtown?: Behavioral methods for determining referents of vague spatial queries. Spatial Cognition and Computation 3, 185–204 (2003)
Montello, D.R., Richardson, A.E., Hegarty, M., Provenza, M.: A comparison of methods for estimating directions in egocentric space. Perception 28, 981–1000 (1999)
Nothegger, C., Winter, S., Raubal, M.: Selection of salient features for route directions. Spatial Cognition and Computation 4, 113–136 (2004)
Nurmi, P., Koolwaaij, J.: Identifying meaningful locations. In: The 3rd Annual International Conference on Mobile and Ubiquitous Systems: Networks and Services (MobiQuitous), San Jose, CA (2006)
Raubal, M., Miller, H.J., Bridwell, S.A.: User-centered time geography for location-based services. Geografiska Annaler B 86, 245–265 (2004)
Raubal, M., Winter, S.: Enriching wayfinding instructions with local landmarks. In: Egenhofer, M., Mark, D. (eds.) Geographic Information Science, pp. 243–259. Springer, Heidelberg (2002)
Sadalla, E.K., Burroughs, W.J., Staplin, L.J.: Reference points in spatial cognition. Journal of Experimental Psychology: Human Memory and Learning 5, 516–528 (1980)
Shoval, N., Isaacson, M.: Application of tracking technologies to the study of pedestrian spatial behavior. The Professional Geographer 58, 172–183 (2006)
Siegel, A.W., White, S.H.: The development of spatial representations of large-scale environments. In: Advances in child development and behavior, vol. 10, pp. 9–55. Academic Press, New York (1975)
Sorrows, M.E., Hirtle, S.C.: The nature of landmarks for real and electronic spaces. In: Freksa, C., Mark, D. (eds.) Spatial Information Theory, pp. 37–50. Springer, Heidelberg (1999)
Srinivas, S., Hirtle, S.C.: Knowledge-based schematization of route directions. In: Barkowsky, T., Knauff, M., Ligozat, G., Montello, D.R. (eds.) Spatial cognition V: Reasoning, action, interaction, pp. 346–364. Springer, Berlin (2006)
Stevens, A., Coupe, P.: Distortions in judged spatial relations. Cognitive Psychology 10, 422–437 (1978)
Turner, A., Doxa, M., O’Sullivan, D., Penn, A.: From isovists to visibility graphs: A methodology for the analysis of architectural space. Environment and Planning B: Planning and Design 28, 103–121 (2001)
Tversky, B.G.: Distortions in cognitive maps. Geoforum 23, 131–138 (1992)
Waller, D.A., Beall, A., Loomis, J.M.: Using virtual environments to assess directional knowledge. Journal of Environmental Psychology 24, 105–116 (2004)
Waller, D.A., Haun, D.B.M.: Scaling techniques for modeling directional knowledge. Behavior Research Methods, Instruments, and Computers 35, 285–293 (2003)
What Do Focus Maps Focus On?

Kai-Florian Richter, Denise Peters, Gregory Kuhnmünch, and Falko Schmid

SFB/TR 8 Spatial Cognition, Universität Bremen, Germany
Universität Freiburg, Germany
{richter,peters,schmid}@sfbtr8.uni-bremen.de, [email protected]
Abstract. Maps are an important, everyday medium to communicate spatial information. We are faced with a great variety of different maps used for different purposes. While many of these maps are task-specific and concentrate on specific pieces of information, they often do not support map reading to extract the information relevant for the task at hand. In this paper, we explore the concept of focus maps. This concept has previously been presented with a restricted scope; however, it covers a range of different kinds of maps that all focus a map user’s attention on the relevant information, be it specific features or areas. We discuss their general properties and the importance of context for designing such maps, and introduce a toolbox for constructing schematic maps that provides a generic way of generating the different kinds of maps discussed. Furthermore, we provide empirical evidence supporting our approach and outline how navigation in 3D virtual environments may benefit from a transfer of the proposed concept of focus maps from 2D to 3D. Keywords: Schematic maps, map design, wayfinding assistance.
1 Introduction
Maps are a dominant medium to communicate spatial information. They are omnipresent in our daily life. In news and ads they point out where specific places are, often in relation to other places; they link events, dates, and other data to locations to illustrate, for example, commercial, historical, or sports developments. For planning holidays or trips to unknown places inside or outside our hometown we often grab a map—or, nowadays, we resort to Internet planners, like Google Maps, or (car) navigation systems. And if we ask someone for directions, we may well end up with a sketch map illustrating the way to take. All these maps display different information for different purposes. Often, they are intended for a specific task. However, the design of the maps does not always reflect this task-specificity. The depicted information may be hard to extract, either because of visual clutter, i.e., a lot of excess information, or because the map user is not properly guided to the relevant information. In this paper, we discuss the concept of focus maps, which is an approach to designing maps that guide a map
user in reading information off a map. Using simple graphical and geometric operations, the constructed maps focus a user’s attention on the relevant information for a given task. This way, we are able to design maps that not only are tailored for the intended task, but also assist a map user in reading them. In the next section, we present approaches to map-based assistance in spatial tasks and illustrate the fundamental concepts underlying our approach, namely schematization and a computational approach to constructing schematic maps. Section 3 explains the concept of focus maps previously presented by [1] in a restricted scope and discusses its generalized aim and properties. Section 4 introduces a toolbox for map construction and the relevant components needed for designing focus maps. This section also shows examples of different kinds of focus maps. In Section 5 we provide empirical evidence supporting our approach; in Section 6 we outline how the concept of focus maps may be transferred to the construction of 3D virtual worlds. The paper ends with conclusions and an outlook on future work in Section 7.
2 Maps and Map-Based Assistance
Maps and map-like representations have been used by humans since ancient times [2]. There is evidence that they are used universally, i.e., across cultures [3]. That is, maps are (or have become) a form of representing space used by almost any human being, just as natural language. Over time, maps have become an everyday product. However, often there is a mismatch between what the map designer has intended and how the map reader actually uses the map [4]. This problem persists even though maps are rarely purely graphical representations, but usually also contain (explanatory) verbal elements [5]. And this problem increases with the increasing use of map-like representations in electronic form. While there is a rich set of rules and guidelines for the generation of paper-based cartographic maps (e.g., [6,7]), these rules are mostly missing for electronic maps presented on websites or on mobile devices. This can be observed in approaches for automatic assistance in spatial tasks. Maps play a major role here; in addition to verbal messages almost all Internet route planners and car navigation systems also provide information on the way to take in graphical form. In research, for example in the areas of human-computer interaction and context awareness, several approaches exist that deal with map-based assistance (e.g., [8,9,10]). Most of these approaches employ mobile devices to present maps; the maps are used as interaction means in location-based services [8,11,9]. Accordingly, this research aims at an automatic adaptation of the maps to the given medium and situation [12]. Questions of context awareness and adaptation to context play an important role [13] (see also the next section). Our work is based on ideas presented by Berendt et al. [14]. They develop a computational approach to constructing maps they term schematic. Schematic maps are representations that are intentionally simplified beyond technical needs to achieve cognitive adequacy [15]. They represent the specific knowledge needed for a given task; accordingly, the resulting maps are task-specific maps [16]. Three
different levels of knowledge are distinguished in this approach: 1) knowledge that needs to be represented unaltered, 2) knowledge that can be distorted but needs to be represented, 3) knowledge that can be omitted [17]. This distinction guides the map construction process in that the required knowledge, called aspects, is selected from existing knowledge prior to map construction and ranked in a depictional precedence [17]. This order guides the construction, for example, in deciding which knowledge may be distorted to solve local conflicts that are due to space limitations in the depictional medium. When reading a schematic map, the reader’s assumptions about this depictional precedence need to match the actually used precedence. Otherwise, map reading may lead to mis- or overinterpretation [18].
3 The Concept of Focus in Map Design
As we have detailed in the last section, maps are important in our everyday life. They are a prime means of communicating spatial information; reading maps is a recurring task. Consequently, assistance systems that use maps as communication means should not only assist in the given spatial task, but also provide assistance in reading the maps. This holds especially since the advent of mobile devices with their small displays as the platform for these assistance systems. In line with the aspect maps approach (see last section), maps as assistance means should concentrate on the relevant information. This serves to reduce the cognitive load of users; they should not need to process spatial information that is not needed for the task at hand. At the same time, however, these maps should also guide their reading. This serves to speed up information processing; by the design of the map, map users should be drawn to the relevant information. We term this design principle of reader guidance focus map. The focus effect is a specific form of schematization. While it does not reduce information represented in a map homogeneously by, for example, removing objects or simplifying geometry over all objects, it reduces the information to be processed by funneling a reader’s attention to the relevant information. Since schematic maps are task-specific [16], what information focus maps focus on is dependent on the task at hand. When the task is to guide a user from location A to location B, maps need to be designed differently from maps that present points of interest in the depicted environment. That is, map design is context dependent; the appearance of the generated map depends on the environment depicted, on the selected information, and on the intended task. Unlike the approaches listed in Section 2 and other “traditional” approaches to context (e.g., [19,20]) that define context by (non-exhaustive) lists of factors whose parametrization is supposed to result in context-adaptive behavior, we take a process-oriented approach to context [21]. Figure 1 provides a diagrammatic view on this approach. It distinguishes between the environment at hand, the environment’s representation (in the context of this paper this is the focus map), and an agent using the representation to interact with the environment—here, this is the map user. Between these three constituents, processes determine
the interactions going on to solve a given task. For example, map reading and interpretation processes determine what information the agent extracts from the map, while processes of selection and schematization determine what information gets depicted in the map by the map designer, i.e., determine the representation. These processes, finally, are determined by the task at hand. The designer selects and schematizes information with a specific task in mind; the map user reads information off the map to solve a specific task. This way of handling context is also flexible with respect to task changes—be it the kind of task or the concrete task at hand. Thus, it may well be the basis for flexibly producing different kinds of maps using the same underlying data, for example, in mobile applications.
Fig. 1. A process-oriented view on context (from [21], modified). It is determined by the interaction between environment (E), representation (R), and agent (A). The task (T) determines the processes that drive this interaction.
What to Focus On. The term focus map stands for representations that guide a map user’s reading processes to the relevant information. However, as just explained, depending on the context there is a great variety of what this relevant information might be. Accordingly, different kinds of maps can be summarized under the term focus map. It is important to note that what is generally depicted on a map, i.e., the (types of) objects shown, is selected in a previous step (see Section 4). The selected features depend on the kind of task as illustrated above; focusing then highlights specific instances of these features, namely those specifically relevant for the actual task. For example, for a wayfinding map the street network as well as landmark features may be selected for depiction; the route connecting origin and destination and those landmarks relevant for the route then may be highlighted using focus effects. Broadly, we can distinguish between maps that focus on specific objects (or object types) and maps that focus on specific areas of the depicted environment (cf. also the distinction between object- and space-schematization in [22]). Focusing
on objects can be achieved by using symbols to represent the relevant objects, for example, landmarks [23,24]. It may also be achieved by object-based schematization, i.e., by altering the appearance of specific objects to either increase or decrease their visibility (see Section 4.2). When focusing on specific areas, all objects in these areas are in focus, independent of their type. Objects in the focused area are highlighted, all other objects are diminished. Such maps may, for example, focus on the route between some origin and destination, funneling a wayfinder’s attention to the route to take [1]. Several different areas can be in focus at the same time, which may be disconnected. This holds also for focusing on multiple routes at the same time to, for example, indicate alternative detours next to the proposed main route. For all the different kinds of focus maps, graduated levels of focus are possible, i.e., it is possible to define several levels of varying focus. In a way, this corresponds to the depictional precedence explained in Section 2; different types of information may be highlighted to different degrees. This may be used to either depict “next-best” information along with the most important information, or to increase the funneling effect by having several layers of increasing focus around an area. With these graduated levels of focus, we can distinguish strong and weak focus. Using a strong focus, there is an obvious, hard difference in presenting features in focus and those that are not. Features in focus are intensely highlighted, those that are not are very much diminished. A weak focus provides a smoother transition between those features in focus and those that are not. The kinds of focus maps presented so far all focus on either objects or areas, i.e., on parts of the depicted environment. They emphasize structural information [25]. However, maps may also be designed such that they emphasize the actions to be performed. Such maps focus on functional information. Wayfinding choreme maps [26] are an example of this kind of map. In designing such maps, the visual prototypes identified by Klippel [25] that represent turning actions at intersections emphasize the incoming and outgoing route-segments at intersections, i.e., the kind of turn due at an intersection. This way, they ease understanding which action to perform, reducing ambiguity and fostering conceptualization of the upcoming wayfinding situations. Combining structural and functional focus, for example, as in chorematic focus maps [27], then results in maps that focus on the relevant information in the relevant areas. Combining structural and functional focus is also employed in generating personalized wayfinding maps. Here, different levels of focus are used in that maps depict information in different degrees of detail (focus) depending on how well known an area is to the wayfinder [28]. Such maps that show transitions between known and unknown parts of an environment are a good example of using multiple levels of focus. The maps consist of three classes of elements of different semantics and reference frames: – One or more familiar paths; those paths are obtained by an analysis of previous trajectories of the wayfinder and map matching. Familiar paths belong to an individual frame of reference, as they describe a previously traveled route between two individually meaningful places. These are the
most restricted elements of the map: only the previously traveled path and prominent places or landmarks along the path are selected and depicted on the resulting map. – Transition points; they describe the transition from familiar to unfamiliar areas and also define the transition between the individual reference frame and a geographic frame of reference. For reasons of orientation and localization, elements of the known part at the transition points are selected and added to the map. – One or more unfamiliar areas; all elements of these areas belong to a geographic frame of reference. This means focus effects can only sensibly be applied to unfamiliar environments, as is further explained below. We apply focus effects differently for each of the three classes of elements. The familiar paths are highly schematized, chorematized (all angles are replaced by conceptual prototypes; see [26]), and scaled down. No focusing is applied to these parts of the map, as there is no additional environmental information depicted that could distract the attention of the wayfinder. These paths only serve as connections between familiar and unfamiliar environments. The purpose of maps based on previous knowledge is to highlight the unknown parts of a route. Accordingly, the transition areas are subject to focus. To enable localization, a transition point has to be clearly oriented and identifiable. This requires resolving ambiguities that may arise. To this end, elements in direct vicinity of the transition points that belong to the known parts of a route are selected and displayed. We apply a strong focus function to these points. This enables a smooth reading of the transition between the different parts. In unfamiliar parts, we display much more environmental information to provide more spatial context. To focus a wayfinder’s attention on the route to travel, we apply focus effects on the route as explained above (see also Section 4.3).
4 Implementation
Focus maps, as a specific kind of schematic map, are part of the toolbox for schematic map design developed in project I2-[MapSpace] of the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition (http://www.sfbtr8.spatial-cognition.de/project/i2/). In this section, we will briefly introduce the basics of this toolbox and the underlying operations for generating focus maps. Section 4.3 then introduces a generic way of generating focus maps and shows examples of the different kinds of focus maps discussed so far.

4.1 Toolbox for Schematic Maps
The toolbox for schematic maps collects functionality for the design of schematic maps in a coherent framework. It comprises fundamental operations, such as
vector-based geometry and building up the required data structures (e.g., extracting a graph from a given street network). The toolbox is able to deal with data given in different formats, for instance, as EDBS- or GML-files (EDBS: http://www.atkis.de; GML: http://www.opengis.net/gml/). There is also functionality provided to export data again, which is also used as one way to communicate between different parts of the toolbox. The main part of the toolbox, though, is the provision of operations for the graphical and geometric manipulation of spatial objects (features represented as points, lines, or polygons). These operations form the basis for the different implemented schematization principles; those operations required for focus maps are explained in more detail in the next subsection. The toolbox is implemented in Lisp. Maps can be produced as Scalable Vector Graphics (SVG; http://www.w3.org/Graphics/SVG/) or in Flash format (http://www.adobe.com/support/documentation/en/flash/documentation.html). SVG is an XML-based graphics format that is highly portable across different platforms and applications. Flash allows for a simple integration of interaction means in the map itself and can be displayed by most modern browsers. Mostly, the operations of the data processing part can be used independently of each other; there is no predefined order of execution. The context model presented in Section 3 (see also Fig. 1) may be used to implement a control module that determines the execution order given a task, agent, and environment.

4.2 Basic Operations for Focus Maps
As for any schematic map to be constructed, the spatial information (e.g., objects or spatial relations) to be depicted needs to be selected. Specific to focus maps, the selection operation also involves determining which parts of this information are to be highlighted. The concrete operation achieving this focus depends on the kind of focus effect aimed for. Focusing on specific objects, for example, is realized simply by type comparison of the objects in the database with the target type. In focusing on specific areas, on the other hand, for every object a focus factor is calculated that depends on the object’s distance to the focus area. The most important operation for designing focus maps is adapted coloring of depicted graphical objects. This operation determines the visual appearance of the map; it works on a perceptual level. This operation is used for any kind of focus map described in Section 3. The coloring operation manipulates the color—the RGB values—of objects before they get depicted. Those objects that are in focus are depicted in full color to make them salient. In contrast, the color of objects not in focus is shifted towards white. This color shift renders these objects less visible as they are depicted in a lighter, more grayish color. As a result, the non-shifted objects stick out, putting them in focus. Additionally, the geometry of objects not in focus may be simplified. This further diminishes their visual appearance, as has been demonstrated by [1]. To this end, the toolbox implements line simplification based on discrete curve evolution [29].
4.3 A Generic Way of Generating Focus Maps
As argued in Section 3, conceptually focus maps cover a range of different effects that put parts of the depicted information in focus. This is reflected in the implementation. A single function allows for the generation of different kinds of focus maps. It provides a uniform, generic interface to designing the different kinds of maps by capturing the fundamental logic of map construction. By setting two parameters, a map designer can determine which focus effects are employed. The first parameter determines which features shall be in focus—either by specifying their boundary (area focus) or by listing them (object focus). The second parameter then determines which kind of focus effect shall be generated—graduated focus or focus on a single (type of) feature. The focus map function performs the requested operations on the map objects to be depicted.
Visual presentation—drawing the map—is realized in the toolbox by another function taking parameters that determine, for example, map size, whether a grid is drawn, and which output format (SVG or Flash) is used. This function takes all objects to be depicted as a list and draws them in the list's order. Thus, additional visual effects can be achieved by carefully choosing the objects' order in the list. Objects at the end of the list are drawn later than those at the beginning, i.e., by ordering objects, effects of (partial) occlusion may be achieved.
In the following, we show some example focus maps generated by the toolbox. For each map, we provide the parameter settings used. We use part of the inner city of Bremen, Germany, as an example environment; the maps depict streets, water bodies, and tramways. The first sample focus map, shown in Figure 2, highlights tramways, while water bodies are strongly diminished. This is achieved by setting the first parameter to ranked-objects; the value ranked-objects corresponds to maps that allow for a rank order in focusing on objects. Accordingly, the second parameter states this rank order, given as a list of weights (0 ≤ w ≤ 1). Here, the weight for tramways is 1, for water bodies 0.2, and for streets 0.5.

Fig. 2. A focus map emphasizing a specific object type. In this example, tramways (the big, black lines) are highlighted, while water bodies (the light gray areas) are strongly diminished.

The sample maps of Figure 3 illustrate effects that focus on specific areas. In Figure 3a, a single route is in focus, while Figure 3b additionally keeps alternative detours in focus to a lesser degree. To achieve this, the first parameter is set to area-focus. The second parameter states the area(s) to focus on. In the case of the example maps shown here, these areas are the routes in focus (given as sequences of coordinates). In case there are multiple routes to focus on, the first route is taken as the main route and the following routes as alternatives with a lesser degree of focus. The chosen example of Figure 3b is an artificial one, emphasizing the effect of focusing on multiple areas. Usually, the additional routes are meant to be possible detours in case the main route is blocked, for example. But having disconnected routes in a single map as shown here helps to make the effects more visible.

Fig. 3. Focus maps emphasizing route information. a) A single route is in focus; b) Multiple routes in focus; one is the main route (the same as in a), two others (the bigger lines) are focused on to a lesser degree.

More formally, fading out of colors is achieved by calculating for each coordinate its distance to the area in focus (here, the route). As (0, 0, 0) is black and (255, 255, 255) is white in the RGB color space, a shift towards white corresponds to adding a factor to each color component. The distance d is defined as the minimal distance between a coordinate c and the focus area f. The three new color components r', g', b' are then calculated as the minimum of 230 and the sum of the old color component (r, g, b, respectively) and the distance d multiplied by a factor k, which determines how quickly colors fade out (i.e., corresponds to strong or weak focus). This sum is normalized by the size of the environment s. An RGB color of (230, 230, 230) corresponds to a light grey that is still visible on a white background.

d = |c − f|
r' = min(230, r + kd/s)
g' = min(230, g + kd/s)
b' = min(230, b + kd/s)
When multiple areas a0, ..., an are present, the secondary areas are integrated as focus objects such that they decrease the added sum again. This is achieved by calculating an additional distance value n that gives the minimal distance (nearness) of a coordinate c to the nearest additional area. However, to restrict the influence of additional areas, we only take those coordinates into account that are nearer to the additional areas than the average distance between all objects and the main focus object (this average distance is denoted by p). The value n is additionally modified by another focus factor j that determines the strength of the additional areas' influence.

n = max(0, p − min_a |c − a|), where the minimum is taken over the additional areas a
r' = min(230, r + kd/s − jn)
g' = min(230, g + kd/s − jn)
b' = min(230, b + kd/s − jn)
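The following Python sketch is our own illustration of these formulas; it is not the toolbox's Lisp implementation, and the function and parameter names are invented for the example. It computes the shifted color of one object coordinate for a main focus area and, optionally, for additional (secondary) areas.

```python
def dist(c, area):
    """Minimal Euclidean distance between coordinate c and a focus area,
    given as a list of coordinates (e.g., a route sampled as points)."""
    return min(((c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2) ** 0.5 for p in area)

def faded_color(color, c, focus, s, k, extra_areas=(), p=0.0, j=0.0):
    """Shift `color` towards white depending on the distance of c to the focus area.

    s: size of the environment (normalization), k: fading factor,
    extra_areas: additional (secondary) focus areas, p: average distance between
    all objects and the main focus object, j: influence of the secondary areas.
    """
    d = dist(c, focus)                      # d = |c - f|
    n = 0.0
    if extra_areas:
        nearest = min(dist(c, a) for a in extra_areas)
        n = max(0.0, p - nearest)           # n = max(0, p - min_a |c - a|)
    # r' = min(230, r + k*d/s - j*n), likewise for g and b
    return tuple(min(230, int(round(v + k * d / s - j * n))) for v in color)

route = [(0, 0), (10, 0), (20, 5)]          # main focus area
print(faded_color((0, 0, 0), (12, 1), route, s=100, k=400))   # near the route: stays dark
print(faded_color((0, 0, 0), (12, 40), route, s=100, k=400))  # far away: fades to grey
```

In terms of the generic interface described above, this corresponds roughly to setting the first parameter to area-focus and passing the route(s) as the second parameter before the drawing function is called.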
5 Empirical Results
In the literature and in our own work, we can find several arguments why focus maps as discussed in the previous sections are beneficial for map reading and, consequently, for task performance. Li and Ho [30], for example, discuss maps for navigation systems that highlight the area a wayfinder is currently in. A user study demonstrates that people consider this highlighting beneficial, especially if a strong focus function is used, i.e., if only the area in the immediate vicinity is highlighted. Along similar lines, the resource-adaptive navigation system developed at Universität Saarbrücken [8,11] adapts the presentation of information to the available display space and the time a user has to extract the required information. The less time there is, the more focused the presentation is.
If there is only a short time to extract information, the presented information is highly schematized and restricted to the route. If time permits, i.e., if a user remains at one spot for some time, the system displays the surroundings with increasing detail. This way, users can quickly and with little effort read off from the device what to do when in a hurry, but are also enabled to re-plan and orient themselves when sufficient time is available [8].
In one of our own studies—which will be the first in a line of studies concerned with the performance of different map types (see Section 7)—Kuhnmünch and Strube [31] tested wayfinding performance with three different schematic maps. Participants had to follow a specific route indicated on printed maps. The route consisted of pathways situated on a campus area (Universität Freiburg, Germany). The route tested had a length of 775 meters and comprised sixteen decision points. Sixteen participants unfamiliar with the campus used a chorematic focus map of the type depicted in Figure 4; the route is indicated by a line connecting the origin and the destination. In the same experiment, we tested two other types of maps (two additional groups with sixteen participants each). These results are not reported here, as we specifically discuss focus maps.
Concerning wayfinding performance, seven participants went astray but finally reached the destination; nine accomplished the task without errors. Furthermore, a post-test asked participants to evaluate the given map and their wayfinding experience with it. Taken together, the focus map yielded good results. On five-point rating scales (0: "I strongly disagree"; 4: "I strongly agree"), participants indicated that the map was easy to use (Mean = 3; Standard Deviation = 0.65), that they succeeded in localizing themselves on the map (Mean = 3; Standard Deviation = 1.07), and that they mostly knew which action to take at decision points (Mean = 3; Standard Deviation = 0.89). Surprisingly, though, none of the participants stated that they had used the contours of buildings for self-localization. Instead, all participants indicated they had used the structure of the pathways and the landmarks given in the map for solving the task. In fact, they experienced such structural information successively while wayfinding and could match it with the structures shown on the map. Presumably, comparing contours of buildings was not considered necessary or was too intricate for this task.

Fig. 4. Sample of a chorematic focus map as used in the study

Of course, these results should not be interpreted as a recommendation to omit buildings from maps. Instead, how useful or essential information on buildings is depends on the task. Another experiment [32] exemplifies this for the task of self-localization with minimal given context. Participants unfamiliar with the same campus area were blindfolded and placed in front of a building. After the blindfold had been removed, they were asked to indicate their position either on a map that only displayed the contours of all the buildings of the campus, or on a map that only depicted all pathways of the campus (none of these maps were focus maps). They were allowed to explore a small area around their current position in order to understand the local configuration of pathways or buildings. If participants indicated the wrong position on the map, they were asked to rethink their answer. Experimenters measured the time until the correct answer was given. As expected, participants with the building map were significantly faster. They could rely on contours and the configuration of buildings as relatively unambiguous landmarks. In contrast, the other participants had to match the experienced small section of the complete path network with the map, which is more ambiguous and, therefore, causes more errors and longer reaction times.
6 Focus in 3D
Schematization methods, including focus effects, can also be transferred to the generation of 3D virtual environments (VEs). Nowadays, these environments are utilized more and more, for example, to visualize geospatial data [33]. Some of these geospatial virtual environments remodel real environments, for example, cities, such as in Google Earth. One of the reasons for this newly emerged trend is the huge amount of available 3D data for producing high-quality virtual environments. These virtual cities can be used not only for entertainment, but they can also provide a new medium for tourism and can be used for training people, for example, in rescue scenarios. A virtual environment "[...] offers the user a more naturalistic medium in which to acquire spatial information, and potentially allows to devote less cognitive effort to learning spatial information than by maps" ([34], p. 275). While this is an important aspect of using VEs for getting acquainted with an environment, several navigational problems have also been identified [34]. Compared to navigational experiences in a real environment, people navigating in a virtual environment get less feedback and information on their movement. This is due to the fact that virtual environments are often presented only on a desktop and movement is controlled by a joystick or a mouse. Vestibular and proprioceptive stimuli are missing in this case [35]. Therefore, people have severe problems in orienting themselves and acquiring survey knowledge. Accordingly, there has been a lot of research trying to improve navigational performance in VEs (e.g., [36,37]). Nevertheless, there are contradicting results on how well survey knowledge can be learned from a virtual environment [34] and on which ways of presenting spatial information in VEs are most efficient.
We believe that a transfer of schematization principles from 2D to 3D representations is a promising way to ease the extraction of the relevant information in VEs and, hence, a promising way to improve navigation performance. One example of this transfer is the use of focus effects in 3D by, for example, fading colors away from the relevant areas and using simplified geometry in areas that are not in focus—similar to the maps depicted in Figure 3. This way, we can form regions of interest, such as a specific route (see Fig. 5 for a sketch of this effect). This focus effect may be used to form several regions of interest by highlighting different features and using different levels of detail. Forming such regions may help to get a better sense of orientation [38].
Fig. 5. Sketched example of how to transfer focus effects to 3D virtual environments
A transfer of focus maps from 2D to 3D environments is also proposed by Neis and Zipf [24] (see also [39]). They present an approach to integrating landmarks in 2D focus maps and outline how to transfer this to 3D virtual environments. However, there are more options for achieving a focus effect than only the use of landmarks. As explained in Section 3, we distinguish between object- and space-schematization [22]. In object-schematization—which also covers the highlighting of specific objects as landmarks—a further option is, for example, to highlight specific features of an object, depending on its role in the given task. Space-schematization can be used to highlight specific areas as explained above, but also to emphasize distance or direction information, such as the transfer of the choreme prototypes for turning actions (see Section 3).
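As a hypothetical sketch of how such a focus effect could be realized in a 3D scene (this is our own illustration, not an implementation taken from [24] or [39]), the following Python function assigns each scene object a color fade factor and a level of detail based on its distance to the route in focus, analogous to the 2D fading of Section 4.3.

```python
def focus_3d(objects, route, s, k=400.0, lod_levels=(0, 1, 2)):
    """Assign a color fade and a level of detail to 3D scene objects based on
    their distance to the route in focus (larger distance: lighter and coarser).

    `objects` maps an object id to its ground-plane position (x, y);
    `route` is a list of (x, y) points; s normalizes distances as in Section 4.3.
    """
    def dist(c, area):
        return min(((c[0] - p[0]) ** 2 + (c[1] - p[1]) ** 2) ** 0.5 for p in area)

    styled = {}
    for obj_id, pos in objects.items():
        d = dist(pos, route)
        fade = min(1.0, k * d / (s * 255.0))      # 0: full color, 1: almost white
        lod = lod_levels[min(len(lod_levels) - 1, int(fade * len(lod_levels)))]
        styled[obj_id] = {"fade": round(fade, 2), "lod": lod}
    return styled

buildings = {"townhall": (12, 3), "warehouse": (80, 95)}
print(focus_3d(buildings, [(0, 0), (20, 5), (40, 10)], s=100))
```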
7 Conclusions
We have discussed and generalized the concept of focus maps previously presented by [1]. Focus maps are specific kinds of schematic maps. The concept of focus maps covers a range of different kinds of maps that all have in common that they guide map reading to the relevant information for a given task. We can distinguish between maps that focus on specific (types of) objects, and those that focus on specific areas. We have illustrated their properties, design principles, and how they relate to our context model. We have introduced a toolbox for the design of schematic maps and shown example maps constructed with this toolbox. We have also outlined how navigation in 3D virtual environments may benefit from a transfer of the concept of focus maps to these representations. In addition to the transfer of focus effects from 2D to 3D representations explained in Section 6, we plan to employ the concept of focus maps in maps that, while primarily presenting the route to take, also provide information on how to recover from accidental deviations from that route. Here, those decision points (intersections) that are considered to be especially prone to errors may be highlighted, and further environmental information, i.e., the surrounding area, may be displayed in more detail than is used for the rest of the map. We reported some empirical studies that support the claims of our approach. Further analyses and empirical studies are required, though, to better understand the properties of focus maps and wayfinding performance with diverse types of maps. For example, we plan to perform eye-tracking studies that will determine whether a map user's map reading is guided as predicted by the employed design principles. We will also further analyze the performance of map users in different wayfinding tasks, such as route following or self-localization, where they are assisted by different types of maps, for example, tourist maps or focus maps. These studies will help to improve design principles for schematic maps and will lead to a detailed model of map usage. Finally, we will evaluate the consequences of transferring focus maps to 3D environments on navigation performance in these environments.
Acknowledgments
This work has been supported by the Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition, which is funded by the Deutsche Forschungsgemeinschaft (DFG). Fruitful discussions with Jana Holsanova, University of Lund, helped to sharpen the ideas presented in this paper. We would also like to thank the participants of a project seminar held by C. Hölscher and G. Strube at Universität Freiburg for providing their empirical results (see [32]).
References

1. Zipf, A., Richter, K.-F.: Using focus maps to ease map reading — developing smart applications for mobile devices. KI Special Issue Spatial Cognition 02(4), 35–37 (2002)
2. Tversky, B.: Some ways that maps and diagrams communicate. In: Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.) Spatial Cognition II - Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, pp. 72–79. Springer, Berlin (2000)
3. Stea, D., Blaut, J.M., Stephens, J.: Mapping as a cultural universal. In: Portugali, J. (ed.) The Construction of Cognitive Maps, pp. 345–358. Kluwer Academic Publishers, Dordrecht (1996)
4. Mijksenaar, P.: Maps as public graphics: About science and craft, curiosity and passion. In: Zwaga, H.J., Boersema, T., Hoonhout, H.C. (eds.) Visual Information for Everyday Use: Design and Research Perspectives, pp. 211–223. Taylor & Francis, London (1999)
5. Tversky, B., Lee, P.U.: Pictorial and verbal tools for conveying routes. In: Freksa, C., Mark, D.M. (eds.) Spatial Information Theory - Cognitive and Computational Foundations of Geographic Information Science, International Conference COSIT, pp. 51–64. Springer, Heidelberg (1999)
6. MacEachren, A.: How Maps Work: Representation, Visualization and Design. Guilford Press, New York (1995)
7. Hirtle, S.C.: The use of maps, images and "gestures" for navigation. In: Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.) Spatial Cognition II - Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, pp. 31–40. Springer, Berlin (2000)
8. Wahlster, W., Baus, J., Kray, C., Krüger, A.: REAL: Ein ressourcenadaptierendes mobiles Navigationssystem. Informatik Forschung und Entwicklung 16, 233–241 (2001)
9. Schmidt-Belz, B.P.S., Nick, A., Zipf, A.: Personalized and location-based mobile tourism services. In: Workshop on Mobile Tourism Support Systems, Pisa, Italy (2002)
10. Kray, C., Laakso, K., Elting, C., Coors, V.: Presenting route instructions on mobile devices. In: International Conference on Intelligent User Interfaces (IUI 2003), pp. 117–124. ACM Press, New York (2003)
11. Baus, J., Krüger, A., Wahlster, W.: A resource-adaptive mobile navigation system. In: IUI 2002: Proceedings of the 7th International Conference on Intelligent User Interfaces, pp. 15–22. ACM Press, New York (2002)
12. Reichenbacher, T.: The world in your pocket — towards a mobile cartography. In: Proceedings of the 20th International Cartographic Conference, Beijing, China (2001)
13. Zipf, A.: User-adaptive maps for location-based services (LBS) for tourism. In: Woeber, K., Frew, A., Hitz, M. (eds.) Proceedings of the 9th International Conference for Information and Communication Technologies in Tourism, Innsbruck, Austria, ENTER 2002. Springer, Heidelberg (2002)
14. Berendt, B., Barkowsky, T., Freksa, C., Kelter, S.: Spatial representation with aspect maps. In: Freksa, C., Habel, C., Wender, K.F. (eds.) Spatial Cognition 1998. LNCS (LNAI), vol. 1404, pp. 157–175. Springer, Heidelberg (1998)
15. Klippel, A., Richter, K.-F., Barkowsky, T., Freksa, C.: The cognitive reality of schematic maps. In: Meng, L., Zipf, A., Reichenbacher, T. (eds.) Map-based Mobile Services - Theories, Methods and Implementations, pp. 57–74. Springer, Berlin (2005)
16. Freksa, C.: Spatial aspects of task-specific wayfinding maps - a representation-theoretic perspective. In: Gero, J.S., Tversky, B. (eds.) Visual and Spatial Reasoning in Design, pp. 15–32. Key Centre of Design Computing and Cognition, University of Sydney (1999)
17. Barkowsky, T., Freksa, C.: Cognitive requirements on making and interpreting maps. In: Hirtle, S.C., Frank, A.U. (eds.) COSIT 1997. LNCS, vol. 1329, pp. 347–361. Springer, Heidelberg (1997)
18. Berendt, B., Rauh, R., Barkowsky, T.: Spatial thinking with geographic maps: An empirical study. In: Czap, H., Ohly, P., Pribbenow, S. (eds.) Herausforderungen an die Wissensorganisation: Visualisierung, multimediale Dokumente, Internetstrukturen, pp. 63–73. Ergon-Verlag, Würzburg (1998)
19. Dey, A.K.: Understanding and using context. Personal and Ubiquitous Computing 5(1), 4–7 (2001)
20. Sarjakoski, L.T., Nivala, A.M.: Adaptation to context - a way to improve the usability of topographic mobile maps. In: Meng, L., Zipf, A., Reichenbacher, T. (eds.) Map-based Mobile Services - Theories, Methods and Implementations, pp. 107–123. Springer, Berlin (2005)
21. Freksa, C., Klippel, A., Winter, S.: A cognitive perspective on spatial context. In: Cohn, A.G., Freksa, C., Nebel, B. (eds.) Spatial Cognition: Specialization and Integration. Number 05491 in Dagstuhl Seminar Proceedings. Internationales Begegnungs- und Forschungszentrum für Informatik (IBFI), Schloss Dagstuhl, Germany (2007)
22. Peters, D., Richter, K.F.: Taking off to the third dimension — schematization of virtual environments. International Journal of Spatial Data Infrastructures Research (accepted); Special Issue GI-DAYS 2007, Young Researchers Forum, Münster
23. Elias, B., Paelke, V., Kuhnt, S.: Concepts for the cartographic visualization of landmarks. In: Gartner, G. (ed.) Location Based Services & Telecartography - Proceedings of the Symposium 2005. Geowissenschaftliche Mitteilungen, TU Vienna, pp. 1149–1155 (2005)
24. Neis, P., Zipf, A.: Realizing focus maps with landmarks using OpenLS services. In: Mok, E., Gartner, G. (eds.) Proceedings of the 4th International Symposium on Location Based Services & TeleCartography. Department of Land Surveying & Geo-Informatics, Hong Kong Polytechnic University (2007)
25. Klippel, A.: Wayfinding choremes. In: Kuhn, W., Worboys, M.F., Timpf, S. (eds.) COSIT 2003. LNCS, vol. 2825, pp. 320–334. Springer, Heidelberg (2003)
26. Klippel, A., Richter, K.-F., Hansen, S.: Wayfinding choreme maps. In: Bres, S., Laurini, R. (eds.) VISUAL 2005. LNCS, vol. 3736, pp. 94–108. Springer, Heidelberg (2006)
27. Klippel, A., Richter, K.F.: Chorematic focus maps. In: Gartner, G. (ed.) Location Based Services & Telecartography. Geowissenschaftliche Mitteilungen, Technische Universität Wien, Wien, pp. 39–44 (2004)
28. Schmid, F.: Personalized maps for mobile wayfinding assistance. In: 4th International Symposium on Location Based Services and Telecartography, Hong Kong (2007)
29. Barkowsky, T., Latecki, L.J., Richter, K.-F.: Schematizing maps: Simplification of geographic shape by discrete curve evolution. In: Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.) Spatial Cognition II - Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, pp. 41–53. Springer, Berlin (2000)
30. Li, Z., Ho, A.: Design of multi-scale and dynamic maps for land vehicle navigation. The Cartographic Journal 41(3), 265–270 (2004)
31. Kuhnmünch, G., Strube, G.: Wayfinding with schematic maps. Data taken from an article in preparation (2008)
32. Ahles, J., Scherrer, S., Steiner, C.: Selbstlokalisation mit Karten und Orientierung im Gelände. Unpublished report from a seminar held in 2007/08 by C. Hölscher and G. Strube. University of Freiburg (2007)
33. Slocum, T., Blok, C., Jiang, B., Koussoulakou, A., Montello, D., Fuhrmann, S., Hedley, N.: Cognitive and usability issues in geovisualization. Cartography and Geographic Information Science 28(1), 61–75 (2006)
34. Montello, D.R., Hegarty, M., Richardson, A.E.: Spatial memory of real environments, virtual environments, and maps. In: Allen, G. (ed.) Human Spatial Memory: Remembering Where, pp. 251–285. Lawrence Erlbaum Associates, Mahwah (2004)
35. Nash, E.B., Edwards, G.W., Thompson, J.A., Barfield, W.: A review of presence and performance in virtual environments. International Journal of Human-Computer Interaction 12(1), 1–41 (2000)
36. Darken, R.P., Sibert, J.L.: A toolset for navigation in virtual environments. In: UIST, pp. 158–165 (1993)
37. Darken, R.P., Sibert, J.L.: Wayfinding strategies and behaviours in large virtual worlds. In: CHI, pp. 142–149 (1996)
38. Wiener, J.M., Mallot, H.A.: 'Fine-to-coarse' route planning and navigation in regionalized environments. Spatial Cognition and Computation 3, 331–358 (2003)
39. Coors, V.: Resource-adaptive interactive 3D maps. In: SMARTGRAPH 2002: Proceedings of the 2nd International Symposium on Smart Graphics, pp. 140–144. ACM, New York (2002)
Locating Oneself on a Map in Relation to Person Qualities and Map Characteristics

Lynn S. Liben1, Lauren J. Myers1, and Kim A. Kastens2

1 Department of Psychology, The Pennsylvania State University, University Park, PA 16802, USA
2 Lamont-Doherty Earth Observatory, Department of Earth & Environmental Sciences, Columbia University, Palisades, NY 10964, USA
[email protected],
[email protected],
[email protected]
Abstract. Adults were taken to various positions on a college campus and asked to mark their locations on a round or square map drawn from either directly overhead or from an oblique angle. In session 1, participants were also given paper and pencil spatial tests to assess their skills in mental rotation (2D figure rotation), spatial visualization (paper folding), and spatial perception (water level). In session 2, participants completed computer-based navigation and mapping tasks. Performance varied widely among participants. Regression analyses showed that spatial skills predicted performance on both campus and computer mapping tasks, but the specific spatial skills that predicted success differed. Across map types, some differences in strategies and speed were observed. Findings show the value of research with both real and simulated environments, and with maps having varying cartographic properties.

Keywords: Spatial cognition, maps, navigation, spatial skills.
1 Introduction

Spatial cognition refers to the myriad of cognitive processes involved in acquiring, storing, representing, and manipulating knowledge about space. The spaces in question may range from small spaces, visible from a single viewpoint and amenable to direct manipulation (e.g., a desk surface littered with objects), to environmental spaces that may be experienced by navigating to multiple vantage points (e.g., a campus or city environment), to geographic or celestial spaces that are rendered visible by amplifiers of human capacities (e.g., maps representing the entire surface of Earth at once, photographs of the far side of the moon) [1]. Cognitive processes concerning space may be supported by a variety of representations ranging from the interior and mental (e.g., mental images of individual objects or landmarks, a survey-like cognitive map) to the external and concrete (e.g., Global Positioning System technology, a room blueprint, a road map). The focus of the research discussed here is on human adults' ability to use external spatial representations (maps) to represent navigable environments. Specifically, we examine adults' success in connecting locations in outdoor (campus or park) environments to locations on a map.
The motivation for our focus on maps is both practical and theoretical. At the practical level, maps are pervasive tools across eras and cultures, and maps are used to teach new generations about how to conceptualize and use the environments in which they live and work [2,3,4,5]. They play a central role in a wide range of disciplines as diverse as epidemiology, geology, geography, and ecology; they are used for common life tasks such as navigating to new locations, interpreting daily news reports, and making decisions about where to buy a house or locate a business [6,7]. Map use and map education may provide important pathways for enhancing users’ spatial skills more generally [5,8,9,10]. Research on map use may thus help to identify what map qualities impede or enhance clarity or use, and may help to identify what qualities of people must be taken into account when designing maps or educational interventions. At the theoretical level, research on map understanding is valuable because maps challenge users’ representational, logical, and – of particular relevance here – spatial concepts. Studying how adults successfully use maps (or become confused by them) may help to identify component spatial processes and strategies, in turn enhancing understanding of basic spatial cognition. In the current research, people were asked to find correspondences between locations in environmental space and locations on a map of that space. Figuring out where one is “on a map” is an essential step for using a map to navigate from one’s current location to another location. It is also an essential step for using a map to record information about spatial distributions of phenomena observed in the field, as when geologists record locations of rock outcrops, ecologists record the nesting areas of a particular species, or city planners record areas of urban blight. There is a relatively large body of research that explores the way that people develop and use mental representations of large environments [11,12,13]. There is also a relatively large body of research that explores the way that people use maps to represent vista spaces, that is, spaces that extend beyond the tabletop, but that can still be seen from a single vantage point or with only minor amounts of locomotion [14,15]. But there has been relatively little work that combines experience in large-scale, navigable spaces with finding one’s location on ecologically valid maps of those spaces. Our work falls at this intersection, and, as enumerated below, was designed to address four major topics: adults’ success and strategies in identifying their current locations on a map, whether these would differ with different map characteristics, whether success would vary with participants’ spatial skills and gender, and, finally, whether patterns of findings would be similar for field and computer mapping tasks. 1.1 Finding Oneself on a Map First, we were interested in examining how well adults carry out the important step in map use of locating themselves on a map when they are in a relatively unfamiliar environmental space and are given a map of that space without verbal information. This is the condition one faces in real life when one is in a new environment with a map labeled in a completely foreign language (as, for example, when an English-literate monolingual is using a map labeled in Japanese or Arabic). 
To collect relevant data, we asked college students (relatively new to campus) to show their locations on a map similar to the one routinely provided to campus visitors. Prior research [16] has shown that many adults head off in the wrong direction
after consulting posted “You Are Here” maps when the map is unaligned with the referent space (i.e., when up on the map does not indicate straight ahead in the space). Would adults likewise have difficulty identifying their own location on a map even if they had the opportunity to manipulate it as they liked? Would they rotate the map as they tried to get their bearings? 1.2 Map Qualities Second, we were interested in examining the effect of map variables on the user’s success in identifying correct locations. Within psychology, research on map use has tended to pay relatively little attention to the particular kind of map used. That is, psychological research has generally examined map performance in relation to person variables (e.g., age, sex, spatial skills) rather than in relation to cartographic variables (e.g., scale, viewing angle, color schemes). Within cartography, research has tended to examine the pragmatic effects of manipulating map variables (i.e., asking which of several maps works best), paying relatively little attention to how perceptual and cognitive theories inform or are informed by the observed effects. One potentially fruitful way to tie these two traditions together is through the concept of embodiment, the notion that our bodies and bodily activities ground some aspects of meaning [17]. There has been considerable work on the importance of embodied action for encoding spatial information from the environment. For example, Hegarty and colleagues [18] reported that kinesthetic experiences associated with moving through the environment contribute to learning spatial layouts. An embodiment perspective also implies that place representations will be relatively more or less difficult to interpret to the degree that they are more or less similar to embodied experience [19]. Consistent with this argument, prior research has shown that preschool children are better able to identify locations on an oblique perspective map than on an overhead map (plan view) of their classroom, and are better able to identify referents on oblique than vertical aerial photographs [19,20,21]. In comparison to plan representations, oblique representations are more consonant with perceptual experiences as humans move through their ecological niche using the sensory and locomotor capacities of their species. To test whether map characteristics have an effect on adult performance, we examined adults’ success in marking their locations on one of four different kinds of campus maps created by crossing two dimensions – viewing angle (varying whether the map was plan vs. oblique) and map shape (varying whether the map was round vs. square). We expected that the difference in viewing angle might show an advantage for the oblique map (following the embodiment argument above). We expected that the difference in shape might advantage the round map because unlike a rectilinear map, it does not implicitly privilege any particular orientation (thus perhaps increasing participants’ propensity to turn the map into alignment with the environment). However, because the two map variables might be expected to interact (because an oblique – but not a plan view map – specifies a particular viewing direction), we did not design this work as a test of a priori predictions, but instead as a means of examining adults’ success and strategies in relation to map type.
1.3 Spatial Skills and the Campus Mapping Task A third goal of our research was to examine whether spatial skills would predict performance on the campus mapping task, and if so, which spatial tasks would have predictive value. Earlier investigators have addressed the relation between spatial abilities and success in learning large-scale spatial layouts [18,22]. Here we extended this approach to tasks that did not require integrating or remembering information gathered across time and space, but instead required participants to link information from the visible, directly perceived environment to a graphic representation of that environment. To select the candidate spatial skills, we drew from the task- and metaanalysis of Linn and Petersen [23] which identified three major kinds of spatial abilities: mental rotation (skill in imagining figures or objects moving through two- or three-dimensional space), spatial perception (skill in representing one’s own or an object’s orientation despite conflicting visual cues or frames of reference), and spatial visualization (skill in solving multi-step spatial tasks by a combination of verbal and visual strategies). In addition, we designed our work to examine whether participant sex would have any predictive value for performance on the mapping task, above and beyond any that might be attributed to differences in measured spatial skills. This question was of interest because of the continuing evidence of gender differences in spatial cognition [24]. 1.4 Simulating Environmental Mapping A final goal of our research was motivated by the practical challenges of studying map-related spatial cognition in the field as in the campus mapping task just described. There are surprisingly frequent changes in field sites even in environments that might be expected to be highly stable. In our work, for example, even over short time spans we have encountered the construction of new buildings, new roads, and new signage, all of which influence the test environment, require a change in routes between locations, and necessitate the preparation of new maps. Outdoor testing is open to the exigencies of weather and daylight; the use of large field sites requires energetic experimenters and participants. The layout of field sites cannot be manipulated to test theoretically interesting questions. It is difficult to identify local participants who do not yet have too much familiarity with the site and equally well it is difficult to identify and transport non-local participants to the site. These and similar concerns led us to join others who have attempted to develop simulated testing environments [19,25] to study environmental cognition. The specific approach taken here was to derive research measures from the software included in the Where Are We? [WAW?] map-skills curriculum developed by Kastens [26]. This software links dynamic images of eye-level views of a park (videotaped as someone walked through a real park) to a plan map of that park. The software allows the user to control the walk through the park (and hence the sequence of scenes shown on the video image) by clicking on arrows beneath the videotaped inset. Arrows (straight, pointing left, pointing right) control whether the video inset shows what would be seen if walking straight ahead, turning left, or turning right. As described in more detail below, using WAW? exercises, we created mapping tasks in which eye-level views of the terrain had to be linked to locations and orientations on
the map. Our goal was first, to explore whether the same kinds of spatial skills (if any) would predict performance on the campus mapping and computer tasks, and second, to examine whether performance on the campus and computer tasks was highly related. 1.5 Summary In summary, this research was designed to provide descriptive data on adults’ success and their strategies in marking maps to indicate their locations in a relatively new campus environment, to determine whether mapping performance or strategies would vary across maps that differed with respect to viewing angle (plan vs. oblique) and shape (square vs. round), to examine whether paper and pencil spatial tasks and participant sex would predict success on the campus mapping task, to explore whether similar person qualities would predict success on a computer mapping task, and to determine whether performance on the field and computer mapping tasks would be highly correlated.
2 Method Students who were new to a large state university campus in the U.S. and were members of the psychology department’s subject pool were recruited to participate in this study. Sixty-nine students (50 women, 19 men; M [SD] age = 18.6 [1.4] years) participated in session 1 for which they received course credit. Most participants (48) took part in this first session within 6 weeks of their arrival on campus, and the remainder did so within 10 weeks of arrival. Self-reported scores on the Scholastic Aptitude Test (SAT) were provided by 44 participants: Ms (SDs) for verbal and quantitative scores, respectively, were 599 (75) and 623 (78). Participants’ race/ethnicity reflected the subject pool which was almost entirely White. Following completion of all session-1 testing, participants were invited to return for session 2 for which they received either additional course credit or $10, as preferred. Of the initial group, 43 students (31 women, 12 men) returned. Session 1 included the outdoor campus mapping activity and paper and pencil spatial tasks; session 2 included the computer mapping tasks. All testing for session 1 was completed first to take advantage of better weather for outdoor testing, and to minimize students’ familiarity with campus for the campus mapping task. 2.1 Campus Mapping Task Participants were greeted in a small testing room in the psychology department where they completed consent forms. They were then given a map of the room and asked to place an arrow sticker on the map so that the point of the arrow would show exactly where they were sitting in the room, and the direction of the arrow would show which direction they were facing. They were told that the experimenter would be using a stopwatch to keep track of how long the activities were taking, but to place the sticker at a comfortable pace rather than attempt to rush. Participants implemented these directions indoors without difficulty. Following this introduction to the procedure, they were told that they would be doing something similar outside as they toured campus.
Participants were then led along a fixed route to five locations on campus. At each, a laminated campus map was casually handed to participants (maps were intentionally unaligned with the space), and participants were asked to place an arrow sticker on the map to show their location and direction. (Because there was some experimenter error in orienting participants at some locations, the directional data were compromised and thus only those data depending on participant location are described here.) Each participant was randomly assigned to use one of four different campus maps described earlier. Both the oblique perspective map (the official campus map) and the plan map were created by the university cartographers except that all labels were removed. All maps were identical in size and scale: square sides and circle diameters were 205 mm, representing approximately 965 m, thus at a scale of approximately 1:4,700. An illustrative map is shown in Fig. 1. At each location, the experimenter recorded whether the participant turned the map from its initial orientation, the time taken to place the sticker on the map (beginning from when the map was handed to the participant), and the map orientation (in relation to the participant’s body) at the moment the sticker was placed. Participants did not have a map as they were led from location to location, and experimenters chatted with participants as they walked to reduce the likelihood that participants would focus on their routes. After all test locations had been visited, the participants returned to the lab where they were given the paper and pencil spatial tasks (described later). Participants were asked to provide their scores on the SAT if they could remember them and were willing to report them.
Fig. 1. Round oblique map. See text for information on map size and scale.
After the session was completed, each map with its sticker was scanned. Of the potential 345 sticker placements (5 stickers for each of 69 participants), 3 stickers from two participants’ maps became dislodged before the maps were scanned and thus full data for the campus map task were available for 67 of the 69 participants. Sticker placements were scored as correct if the tip of the arrow fell within a circle centered on the correct location, with a radius of 6 mm (equivalent to approximately 28 m on the ground). 2.2 Computer Mapping Tasks In session 2 we administered computer mapping tasks drawn from the WAW? curriculum described earlier. One task was drawn from the activity called Are We There Yet? In this activity, the participant is shown a starting position and facing direction on the map, sees on a video inset what would be visible from that position, and is asked to use the arrow keys to navigate to a target location. To ease the participant’s introduction to the software, the navigation task used here was the easiest one available in WAW? The second activity was drawn from the WAW? activity called Lost! In this activity, participants are dropped into the park in some unknown location (i.e., it is not marked on the map), and are asked to discover where they are by traveling around the park via arrow clicks that control which video images are seen. We gave participants two Lost! problems, the first at the easiest level of task difficulty and the second at the most difficult. For all three tasks, we recorded whether or not the problem was solved (i.e., whether the target location was found or whether the location was correctly identified), how many seconds and how many arrow clicks the participant used within the maximum time allotted (8 minutes for each of the tasks). 2.3 Spatial Tasks During session 1, participants were given paper and pencil tests to measure the three spatial skills identified by Linn and Petersen [23]. A paper folding test (PFT) was used to assess spatial visualization [27]. This task shows 20 sequences of between two and four drawings in which a sheet of paper is folded one or more times and then a hole is punched through the layers. Respondents are asked to select which of five drawings shows the pattern of holes that would appear if the paper were then completely unfolded. Scores are the number marked correctly minus one-fourth the number marked incorrectly within the allowed time (here 2 minutes). The test of spatial perception was the water level task (WLT) in which students are given drawings of six tipped, straight-sided bottles and asked to draw a line in each to show where the water would be if the bottle were about half full [28]. Lines drawn within 5° of horizontal were scored as correct. Finally, mental rotation (MR) was assessed by a modified version of the Spatial Relations subtest of the Primary Mental Abilities (PMA) battery [29]. Respondents are shown 21 simple line figures as models. Each model is followed by five similar figures, and respondents are asked to circle any that show the model rotated but not flipped over (i.e., not a mirror image). Scores are the number correctly circled (2 per row) minus those incorrectly circled (up to 3 per row) within the allotted time (here 2 minutes).
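The scoring rules just described are easy to express in code. The following Python sketch is our own illustration (function names and example values other than those stated above are invented); it checks a sticker placement against the 6 mm criterion and computes the penalty-corrected test scores. Note that at the stated scale of approximately 1:4,700, the 6 mm radius indeed corresponds to roughly 28 m on the ground (6 mm × 4,700 = 28,200 mm).

```python
import math

MAP_SCALE = 4700          # approximately 1:4,700
CORRECT_RADIUS_MM = 6     # 6 mm on the map, i.e., about 28 m on the ground

def sticker_correct(placed_mm, target_mm, radius_mm=CORRECT_RADIUS_MM):
    """A placement counts as correct if the arrow tip lies within the radius."""
    return math.dist(placed_mm, target_mm) <= radius_mm

def pft_score(n_correct, n_incorrect):
    """Paper folding test: correct responses minus one-fourth of incorrect responses."""
    return n_correct - n_incorrect / 4.0

def wlt_score(line_angles_deg, tolerance_deg=5.0):
    """Water level task: count lines drawn within 5 degrees of horizontal."""
    return sum(1 for a in line_angles_deg if abs(a) <= tolerance_deg)

def mr_score(n_correct_circled, n_incorrect_circled):
    """PMA mental rotation: correctly circled minus incorrectly circled figures."""
    return n_correct_circled - n_incorrect_circled

print(CORRECT_RADIUS_MM * MAP_SCALE / 1000.0)            # about 28.2 m on the ground
print(sticker_correct((101.0, 54.0), (98.0, 50.0)))      # 5 mm off target: correct
print(pft_score(12, 4), wlt_score([2.0, -7.5, 4.0, 0.5, 90.0, 3.0]), mr_score(30, 5))
```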
3 Results The data are presented below in five sections. First, we offer descriptive data on the performance on the campus mapping task. Second, we address the question of whether performance or strategies on the campus mapping task differed as a function of map type. Third, we address whether performance on the campus mapping task is predicted by participant variables. Fourth, we address the same question for the computer mapping task. Finally, we address the relation between performance on the campus and computer mapping tasks. 3.1 Performance on the Campus Mapping Task College students’ performance on the campus mapping task covered the full range, with some placing none, and others placing all five stickers correctly, M (SD) = 2.2 (1.4). An even more telling index of performance variability is evident in Fig. 2 which shows the locations of erroneous responses for one target location. It is striking not only that many responses are distant from the correct location, but also that many responses fail to show the correct kind of location.
Fig. 2. Erroneous sticker placements (40 black circles) for one target location (star). Omitted are 12 stickers placed correctly and 17 stickers falling within the area defined by adjacent buildings (striped region). Note that some errors were particularly egregious, as in stickers placed in open fields or parking lots.
3.2 Campus Mapping Task and Map Variables Accuracy of Sticker Placements. As explained initially, this research was also designed to examine whether task performance would vary with map qualities of shape and viewing angle. To examine this question, the total number correct served as the dependent variable in a two-way analysis of variance (ANOVA) in which betweensubjects factors were map shape and map angle. Neither main effect nor their interaction was significant. Means (SDs) for round versus square, respectively, were 2.2 (1.3) versus 2.3 (1.5); for plan versus oblique, 2.1 (1.4) versus 2.4 (1.4). Speed of Sticker Placements. As a second means of examining the possible impact of map variables on performance on the campus mapping task, we analyzed the time participants took to place the arrows on the map. A two-way ANOVA showed a significant interaction between map shape and viewing angle, F(1,65)=6.98, p = .010, subsuming a main effect of viewing angle, F(1,65)=7.52, p = .008. Specifically, when the map was square, average response times were significantly longer on the plan than the oblique map, Ms (SDs) in seconds, respectively: 38.7 (21.7) versus 19.1 (9.3), whereas when map shape was round, response times did not differ significantly for the plan and oblique maps, 27.7 (11.5) versus 27.74 (14.6). (If all four map types are entered as four levels of a map-type factor, the average response time was significantly longer for the square plan map than for any other map type among which there were no significant differences.) This pattern holds within individual items and irrespective of accuracy. That is, the reaction times for the square plan map are consistently longer both among individuals who responded correctly and among those who responded incorrectly on a particular item. Map Turning. A third dependent measure examined in relation to map type was use of a map-turning strategy. For this analysis, the dependent measure was the number of locations (0-5) at which participants turned the map rather than leaving it in the orientation in which they received it from the experimenter. A few participants never turned the map or turned it only once (n=4); on average, the map was turned on 3.9 (1.3) items. An ANOVA on the number of turns revealed neither main effects nor interactions with respect to map shape or viewing angle. Means (SDs) for round versus square, respectively were 3.9 (1.2) versus 4.0 (1.4); for plan versus oblique, 4.1 (1.2) versus 3.8 (1.4). Map Orientation. The final behavior examined with respect to map type was how the participant held the map (with respect to the participant’s own body) while placing the sticker. Based on the sides of the square map, we defined as canonical the position shown in Fig. 2 or its 90°, 180°, or 270° rotation. A 2 (map shape) x 2 (map angle) ANOVA on the number of canonical orientations (0-5) revealed a significant main effect of map shape, F(1,65)=5.35, p=.024. More canonical orientations were used by participants with square than with circular maps, Ms (SDs), respectively, 4.0 (1.0) versus 3.3 (1.4). 3.3 Campus Mapping Task and Participant Variables To provide descriptive data on the association between performance on the campus mapping task and participant qualities, we first computed the correlation between the
number of stickers placed correctly on the campus mapping task and scores on each of the three paper and pencil spatial tests. Correlations of sticker accuracy with mental rotation (MR), spatial visualization (PFT), and spatial perception (WLT), respectively, were r(67) = .048, p = .357; r(67) = .321, p = .004; and r(67) = .219, p = .038 (here and below, one-tailed tests were used given directional hypotheses). These correlations reflect data from all participants in session 1, irrespective of whether they were available for session 2. (An identical pattern of results holds if analyses are limited to the 43 participants who took part in both sessions.) As anticipated, performance on the three spatial measures was also correlated: MR with PFT, r(69) = .425, p < .001; MR with WLT, r(68) = .410, p < .001, and PFT with WLT, r(68) = .253, p = .019. (Again, identical patterns hold with the smaller sample as well.) The number of correct sticker placements was then used as the criterion variable for a regression analysis of the campus mapping task. A stepwise regression was performed with the three spatial tests entered on the first step. We entered participant sex on the second step to determine if there were any effects of sex above and beyond those that could be attributed to possible spatial skill differences. Finally, on step three we entered the strategy variable of the number of locations at which the participant turned the map. At the first level of the model, all three predictors together accounted for 15% of the variance, R2 = .15, F(3, 66) = 3.61, p = .018. Within this multiple regression, however, only PFT predicted success (standardized β = .34, p = .010). At the second level of the model, participant sex did not significantly increase the prediction, p-change = .56, although PFT remained a significant predictor (standardized β = .34, p = .010) and the overall model remained significant, R2 = .15, F(4, 66) = 2.76, p = .035. Finally, at the third level of the model, the map-turning strategy significantly improved the prediction, R2-change = .108, p-change = .004 (standardized β = .35, p = .004), and PFT remained a significant predictor (standardized β = .27, p = .033). The final overall model was R2 = .25, F(5, 66) = 6.59, p = .002. 3.4 Computer Mapping Task and Participant Qualities A composite measure of participants’ performance on the computer mapping tasks was created by summing the number of WAW? tasks that were completed correctly within the allotted amount of time. (Similar patterns of results were obtained with time or the number of arrow clicks measures instead.) As in the campus mapping task, we first computed the correlation between performance on the computer mapping task with each of the three paper and pencil spatial tests. Correlations with mental rotation (MR), spatial visualization (PFT), and spatial perception (WLT), respectively, were r(43) = .495, p < .001; r(43) = .317, p = .019; and r(43) = -.009, p = .478. These correlations necessarily reflect data from only those who participated in both session 1 and 2 (when WAW? data were collected). The composite WAW? measure served as the outcome variable for a regression parallel to the one described above, that is, with the spatial tests entered on step 1 and participant sex on step 2 (although the map-turning strategy was not entered on step 3 because there was no corresponding opportunity for map rotation on the computer mapping task). 
As was true in the regression analysis of the campus mapping task, there was a significant effect of spatial measures at step 1, R2 = .30, F(3, 42) = 5.44,
p = .003, but again, participant sex at step 2 did not add significantly to the model after spatial scores had been entered (p-change = .603). However, unlike the prior regression, in this analysis it was MR (standardized β = .52, p = .003) rather than PFT (standardized β = .12, p = .475) that predicted mapping performance on the computer task. 3.5 Relating Performance on Campus and Computer Mapping Tasks An additional goal of this research was to explore the possibility that the computer mapping tasks drawn from WAW? might be a viable substitute for measuring success on mapping tasks in the real, life-size environment. To evaluate this possibility, we computed correlations between scores on the two tasks. Irrespective of which dependent measure is used for the WAW? tasks (number completed, time in seconds, or number of arrow clicks), there was no significant relation between scores on the campus and computer tasks. The highest correlation was between the number of correctly placed stickers on the campus mapping task and the number of correctly completed WAW? tasks, and it was not marginally significant even with a one-tailed test, r(43) = .121, p = .22. Furthermore, what little trend toward an association there was disappears entirely by statistically controlling for scores on the spatial tasks: partial r(39) = .005, p = .487. As an additional means of examining the distinctions or comparability of the two mapping tasks, we compared the patterns of association between success on each mapping task and the success on the paper and pencil spatial tasks. As is evident from the findings described for each of the two mapping tasks taken individually, the regression analyses showed different patterns for the campus and computer mapping tasks. Particularly striking was the finding that MR score predicted performance on the computer mapping task, but not performance on the campus mapping task. To provide data bearing on the question of whether the associations differ in the two tasks, we compared the sizes of the correlations between MR score and performance on campus versus computer tasks. These correlations differed significantly, t(40)=1.73, p <.05. Neither of the other correlations (PFT or WLT) differed significantly between the two mapping tasks.
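The hierarchical regressions reported in Sections 3.3 and 3.4 enter predictors in blocks and evaluate the change in R² at each step. As a rough illustration of that block-wise procedure (using made-up data, not the study's data), the following Python sketch fits nested ordinary least squares models and reports R² and its increase per step.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 67
# Made-up predictors standing in for the spatial tests, participant sex, and map turning.
data = {
    "MR":  rng.normal(size=n),
    "PFT": rng.normal(size=n),
    "WLT": rng.normal(size=n),
    "sex": rng.integers(0, 2, size=n),
    "turns": rng.integers(0, 6, size=n),
}
# Made-up outcome loosely driven by PFT and map turning, as in the campus task.
y = 2 + 0.5 * data["PFT"] + 0.3 * data["turns"] + rng.normal(scale=1.0, size=n)

blocks = [["MR", "PFT", "WLT"], ["sex"], ["turns"]]
predictors, prev_r2 = [], 0.0
for step, block in enumerate(blocks, start=1):
    predictors += block
    X = sm.add_constant(np.column_stack([data[p] for p in predictors]))
    fit = sm.OLS(y, X).fit()
    print(f"step {step}: R^2 = {fit.rsquared:.3f}, "
          f"R^2 change = {fit.rsquared - prev_r2:.3f}")
    prev_r2 = fit.rsquared
```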
4 Discussion
We begin our discussion by commenting on what the empirical data suggest about how well adults can mark a map to show their location in a real, relatively newly encountered campus environment, addressing the question of whether performance differs in relation to the two manipulated map characteristics (viewing angle and map shape). In the course of doing so, we comment on the appearance and distribution of the map-related behaviors observed during the campus mapping task. We then discuss findings from the regression analyses concerning which individual difference variables predict performance on the campus mapping task and performance on the computer mapping task. Finally, we discuss implications of data concerning the relation between performance on the two mapping tasks.
4.1 Performance and Strategies on the Campus Mapping Task and Their Relation to Map Characteristics
The data from the campus mapping task offer a compelling demonstration that many adults are challenged by the request to show their location on a map. The fact that some participants were right at every one of the locations establishes that the task was a solvable one. The fact that some participants were wrong at every one of the locations establishes that the task was not a trivial one. Furthermore, egregious errors (see Fig. 2) suggest that some adults' map-interpretation skills are particularly poor. Although it is perhaps not surprising to see errors like these among preschool and elementary school children [20,30], it is surprising to see them among adults. Based on participants' comments and affective demeanor during testing, we have every reason to believe that all were engaged by the task, and all were trying their best. In addition to providing information on absolute levels of performance, the campus mapping task was of interest as an avenue for testing the possible impact of the map characteristics of map shape and viewing angle. One reason that we thought that map characteristics might lead to different behaviors and different levels of accuracy was that the different map characteristics might be differentially conducive to participants' aligning the map with the space, and research with both adults and children had shown better performance with aligned than unaligned maps [16,31,32]. The current data, however, provided no evidence that map shape affected accuracy on the location tasks or that it affected the number of items on which participants turned the map. This was true even when we limited the comparison to the plan maps, which – unlike the oblique maps – did not imply a particular vantage point. We had also hypothesized that oblique maps – in comparison to plan maps – might elicit better performance insofar as they were more consonant with an embodied view, that is, one more similar to that encountered by humans as they navigate through the environment [19] and given that past research had shown advantages to oblique-perspective representations for children [20,21]. Again, however, there were no significant differences in accuracy or strategies in relation to map angle, either as a main effect or in interaction with map shape. Although there were no differences in accuracy in relation to map type, participants were significantly slower on the square plan map than on any other map type. In addition, square maps were held in canonical positions in relation to participants' bodies significantly more often, implying that these maps were less often aligned with the environmental space. Perhaps the extra time taken for the square plan maps reflects additional time needed for mental rotation with unaligned maps. That the oblique version did not require additional time suggests that participants may (like children) find it easier to work with the oblique map, despite the fact that in most orientations, its vantage point differs from the one experienced in the actual environment. The data do not yet permit definitive conclusions about process, but they do permit the conclusion that additional research on the effects of map characteristics is worthwhile.
4.2 Predictors of Success on Campus and Computer Mapping Tasks
As expected, the regression analyses showed that spatial skills significantly predicted performance on both the campus mapping task and the computer mapping task. Sex
added no additional prediction in either task. Interestingly, the specific spatial skills that predicted performance differed on the two tasks. For the campus mapping task, it was the score on the paper folding task that was the significant predictor. Mental rotation scores added nothing further to the prediction. The reverse held in the computer mapping task. For this task, it was the score on the mental rotation task that predicted task success, and other spatial scores did not add significantly to the prediction. In the taxonomy offered by Linn and Petersen [23], the paper folding task falls within the skill category labeled spatial visualization, which they describe as covering tasks that involve multiple steps, using visual or verbal strategies, or both. It is possible to think of the campus mapping task as one for which varied approaches would indeed be viable. For example, someone might focus on landmark buildings, someone else might focus on the geometric qualities of the streets, someone else might try to figure out the direction walked from some earlier identified spot, some might try to align the map and the space, and so on. In other words, this outdoor task – much like normal map-based navigation – gives the map-user considerable freedom in structuring the task. That mental rotation mattered for performance on the computer mapping task is also easily understood because in this task – unlike the campus mapping task – participants had less control over the visual array and the map. That is, although participants controlled which video clip they saw (by selecting which of three arrows they clicked at every choice point), they had no control over what was seen within the resulting video clip that was played. Thus, once a video clip had been selected by an arrow click, participants saw whatever part of the park was recorded by the camera – at the camera's height, at the camera's angle, at the camera's azimuth, and at the camera's speed of rotation or translation. Furthermore, participants had no control over the orientation of the map: the map of the videotaped park was always in a fixed position, and thus, usually out of alignment with the depicted vista. It is thus not surprising that under these conditions, an ability to handle mental rotation was significantly associated with performance. An additional finding from the regression analysis on the campus mapping task lends further support to the hypothesized importance of participants' own actions for success on the task. Specifically, as reported earlier, participants' use of the map-turning strategy added significant prediction to the score on the campus mapping task even after spatial skills had been entered into the regression model. Aligning a map with the referent space is an epistemic action, defined by Kirsh and Maglio as an action in which an agent manipulates objects in the environment with the goal of acquiring information [33]. As explicated by Kirsh and Maglio for the case of expert Tetris players, epistemic actions serve the user by revealing otherwise inaccessible information or by decreasing the cognitive load required to gain information. For example, it is more time-efficient for Tetris players to rotate a polygon on the screen and visually compare its shape with a candidate nesting place than to do the rotation and comparison mentally.
In our work, we have observed epistemic actions in a task in which adults visited eight outcrops in a field site, and were asked to select which of 14 scale models best depicts the underlying geological structure [34]. As they struggled to select the correct model, some participants rotated candidate models into alignment with a map of the area, rotated candidate models into alignment with the full-scale geological structure, placed two candidate models side by side to facilitate
comparison, and pushed rejected models out of the field of view. Like rotating a Tetris shape or rotating a scale model of a geological structure, rotating a map into alignment with the referent space decreases the cognitive load required to solve the task at hand by substituting direct perception for mental rotation and mental comparison. Use of epistemic actions requires that the agent foresee, before the action is taken, that the action will have epistemic value; such tactical foresight is separate from the spatial skills measured by the paper and pencil tasks, in which the actions are prescribed by the experimenter.
4.3 Computer Screens Are Not Real Environments
The regression findings just discussed provide one line of evidence that the computer mapping task cannot be used as a substitute for the campus mapping task for studying spatial cognition. That is, the finding that different spatial skills predict performance on each of the two mapping tasks implies that the two tasks differ in important ways. This conclusion is bolstered by two other findings: first, that there is a significant difference in the size of the correlation between MR and performance on the campus mapping task versus the computer mapping task, and second, that the correlation between scores on the two mapping tasks is not significant. Taken together, these data imply that it is important to continue to conduct mapping research – as well as map skill education – in real, life-size environments.
5 Conclusions
The data from the present research bear upon adults' success in using one of the most common kinds of spatial representations of large environments – maps – as they observe the environment directly in the field or via another representational medium. Our data show dramatic variability with respect to how well cognitively intact adults (all of whom met the intellectual criteria needed for university admission) succeed in indicating their locations on a map. Although some participants showed outstanding performance, others made serious errors reminiscent of those made by young children [20,32,35]. Our data also bear on questions about research in different kinds of spatial environments. The finding that different spatial skills predicted success on the campus versus computer mapping tasks, coupled with the finding that participants' scores on the two mapping tasks were not significantly correlated, leads to the conclusion that it is unwise to substitute one task for the other. From the pragmatic perspective of conducting behavioral research in environmental cognition, this conclusion is perhaps disheartening. It would ease research significantly if the answer were otherwise. From the perspective of theoretical work on spatial cognition, however, the finding is more intriguing than disheartening. The current findings contribute evidence to the growing conclusion that the skills entailed in solving spatial problems in object or vista spaces do not entirely overlap with skills entailed in solving spatial problems in environmental spaces. Past researchers have shown the importance of testing in real environments even for indoor, built spaces (corridors and rooms) that are highly defined, homogeneous, and rectilinear [18]. Our findings add to the evidence for the importance of testing in larger, more
varied, less clearly defined outdoor environments as well [36]. Outdoor environments provide potential clues (e.g., a nearby building, a distant skyscraper, a river, the position of the sun). But they also present potential challenges including barriers (that may obstruct otherwise useful landmarks), an absence of clear boundaries to define the borders of the space (in contrast to the walls of a room), and vistas that may appear homogenous to the untrained eye (e.g., desert vistas, dense forests, or acres of wheat fields as far as the eye can see). A full understanding of human spatial cognition will thus require studying how people identify and use information that is available within a diverse range of environments. Likewise, the findings from the research described here bear on the role of map characteristics. Although our data do not yet permit firm conclusions about the way that map qualities interact with environmental and person qualities, they do provide strong support for the importance of systematically varying map qualities as we continue to explore the fascinating territory of spatial cognition. Acknowledgments. Portions of this work were supported by National Science Foundation (NSF) grants to Liben (RED95-54504; ESI 01-01758) and to Kastens (ESI-9617852; ESI 01-011086), although no endorsement by NSF is implied. We acknowledge with thanks the contributions of past and current members of the Penn State Cognitive & Social Development Lab, particularly Lisa Stevenson and Kelly Garner who contributed in so many ways to this project.
References 1. Liben, L.S.: Environmental cognition through direct and representational experiences: A life-span perspective. In: Gärling, T., Evans, G.W. (eds.) Environment, cognition, and action, pp. 245–276. Oxford, New York (1991) 2. Downs, R.M., Liben, L.S.: Mediating the environment: Communicating, appropriating, and developing graphic representations of place. In: Wozniak, R.H., Fischer, K. (eds.) Development in context: Acting and thinking in specific environments, pp. 155–181. Erlbaum, Hillsdale (1993) 3. Harley, J.B., Woodward, D. (eds.): The history of cartography: Cartography in prehistoric, ancient and Medieval Europe and the Mediterranean, vol. 1. University of Chicago Press, Chicago (1987) 4. Stea, D., Blaut, J.M., Stephens, J.: Mapping as a cultural universal. In: Portugali, J. (ed.) The construction of cognitive maps, pp. 345–360. Kluwer Academic Publishers, The Netherlands (1996) 5. Uttal, D.H.: Seeing the big picture: Map use and the development of spatial cognition. Dev. Sci. 3, 247–264 (2000) 6. MacEachren, A.M.: How maps work. Guilford, New York (1995) 7. Muehrcke, P., Muehrcke, J.O.: Map use: Reading, analysis, and interpretation, 4th edn. JP Publications, Madison (1998) 8. Davies, C., Uttal, D.H.: Map use and the development of spatial cognition. In: Plumert, J.M., Spencer, J.P. (eds.) The emerging spatial mind, pp. 219–247. Oxford, New York (2007) 9. Liben, L.S.: Education for spatial thinking. In: Damon, W., Lerner, R.(series eds.) Renninger, K.A., Sigel, I.E. (vol. eds.) Handbook of child psychology: Child psychology in practice, 6th edn., vol. 4, pp. 197–247. Wiley, Hoboken (2006)
10. National Research Council: Learning to think spatially: GIS as a support system in the K-12 curriculum. National Academy Press, Washington (2006) 11. Evans, G.W.: Environmental cognition. Psy. Bull. 88, 259–287 (1980) 12. Gärling, T., Golledge, R.G.: Environmental perception and cognition. In: Zube, E.H., Moore, G.T. (eds.) Advances in environment, behavior and design, pp. 203–236. Plenum Press, New York (1987) 13. Kitchin, R., Blades, M.: The cognition of geographic space. I.B. Tauris, London (2002) 14. Montello, D.R.: Scale and multiple psychologies of space. In: Campari, I., Frank, A.U. (eds.) COSIT 1993. LNCS, vol. 716, pp. 312–321. Springer, Heidelberg (1993) 15. Montello, D.R., Golledge, R.G.: Scale and detail in the cognition of geographic information. Report of the specialist meeting of Project Varenius, Santa Barbara, CA, May 14-16, 1998. University of California Press, Santa Barbara (1999) 16. Levine, M., Marchon, I., Hanley, G.: The placement and misplacement of You-Are-Here maps. Env. and Beh. 16, 139–158 (1984) 17. Johnson, M.L.: The meaning of the body. In: Overton, W.F., Mueller, U., Newman, J.L. (eds.) Body in mind, mind in body: Developmental perspectives on embodiment and consciousness, pp. 191–224. Erlbaum, New York (2008) 18. Hegarty, M., Montello, D.R., Richardson, A.E., Ishikawa, T., Lovelace, K.: Spatial abilities at different scales: Individual differences in aptitude-test performance and spatial-layout learning. Intelligence 34, 151–176 (2006) 19. Liben, L.S.: The role of action in understanding and using environmental place representations. In: Rieser, J., Lockman, J., Nelson, C. (eds.) The Minnesota symposium on child development, pp. 323–361. Erlbaum, Mahwah (2005) 20. Liben, L.S., Yekel, C.A.: Preschoolers' understanding of plan and oblique maps: The role of geometric and representational correspondence. Child Dev. 67, 2780–2796 (1996) 21. Plester, B., Richards, J., Blades, M., Spencer, C.: J. Env. Psy. 22, 29–47 (2002) 22. Allen, G.L., Kirasic, K.C., Dobson, S.H., Long, R.G., Beck, S.: Predicting environmental learning from spatial abilities: An indirect route. Intelligence 22, 327–355 (1996) 23. Linn, M.C., Petersen, A.C.: Emergence and characterization of sex differences in spatial ability: A meta-analysis. Child Dev. 56, 1479–1498 (1985) 24. Halpern, D.F.: Sex differences in cognitive abilities, 3rd edn. Erlbaum, Mahwah (2000) 25. Lawton, C.A., Morrin, K.A.: Gender differences in pointing accuracy in computer-simulated 3D mazes. Sex Roles 40, 73–92 (1999) 26. Kastens, K.A.: Where Are We? Tom Snyder Productions, Watertown, MA (2000) 27. Ekstrom, R.B., French, J.W., Harman, H.H.: Manual for kit of factor-referenced cognitive tests. Educational Testing Service, Princeton (1976) 28. Liben, L.S., Golbeck, S.L.: Sex differences in performance on Piagetian spatial tasks: Differences in competence or performance? Child Dev. 51, 594–597 (1980) 29. Thurstone, T.G.: Primary mental abilities for grades 9-12. Science Research Associates, Chicago (1962) 30. Kastens, K.A., Liben, L.S.: Eliciting self-explanations improves children's performance on a field-based map skills task. Cog. and Instr. 25, 45–74 (2007) 31. Bluestein, N., Acredolo, L.: Developmental changes in map-reading skills. Child Dev. 50, 691–697 (1979) 32. Liben, L.S., Downs, R.M.: Understanding person-space-map relations: Cartographic and developmental perspectives. Dev. Psy. 29, 739–752 (1993) 33. Kirsh, D., Maglio, P.: On distinguishing epistemic from pragmatic action. Cog. Sci. 18, 513–549 (1994)
34. Kastens, K.A., Liben, L.S., Agrawal, S.: Epistemic actions in science education. In: Freksa, C., Newcombe, N.S., Gärdenfors, P. (eds.) Spatial cognition VI. Springer, Heidelberg (in press) 35. Liben, L.S., Kastens, K.A., Stevenson, L.M.: Real-world knowledge through real-world maps: A developmental guide for navigating the educational terrain. Dev. Rev. 22, 267–322 (2002) 36. Pick, H.L., Heinrichs, M.R., Montello, D.R., Smith, K., Sullivan, C.N., Thompson, W.B.: Topographic map reading. In: Hancock, P.A., Flach, J., Caird, J.K., Vicente, K. (eds.) Local applications of the ecological approach to human-machine systems, vol. 2, pp. 255–285. Erlbaum, Hillsdale (1995)
Conflicting Cues from Vision and Touch Can Impair Spatial Task Performance: Speculations on the Role of Spatial Ability in Reconciling Frames of Reference
Madeleine Keehner
School of Psychology, University of Dundee, UK
[email protected]
Abstract. In “hand assisted” minimally invasive surgery, the surgeon inserts one hand into the operative site. Surgeons claim anecdotally that seeing their own hand via the laparoscopic camera enhances spatial understanding, but a previous study using a maze-drawing task in indirect viewing conditions found that seeing one’s own hand sometimes helped and sometimes hurt performance (Keehner et al., 2004). Here I present a new analysis exploring the mismatch between kinesthetic cues (knowing where the hand is) and visual cues (seeing the hand in an orientation that is incongruent with this). Seeing one’s left hand as if from the right side of egocentric space (palm view) impaired performance, and this depended on spatial ability (r=-.54). Conversely, there was no relationship with spatial ability when viewing the left hand from the left side of egocentric space (back view). The view-specific nature of the confusion suggests a possible role for spatial abilities in reconciling spatial frames of reference. Keywords: bimodal, cross-modal, visuotactile, frame of reference, mental rotation, spatial ability, kinesthetic, proprioceptive, sensory cues, individual differences.
1 Introduction
This paper presents a new analysis of data originally presented at the Human Factors and Ergonomics Society annual conference [1]. The original motivation for the study was to assess a specific anecdotal claim made by surgeons working under minimally invasive conditions. In laparoscopic or “keyhole” surgery, a special technique is sometimes employed in which one of the small incisions in the patient’s body is slightly enlarged, and the surgeon’s non-preferred hand is inserted through this into the operative site. Under these conditions, the surgeon’s hand becomes visible on the video monitor via the laparoscopic camera, and it can be guided and used like a surgical instrument. Surgeons anecdotally report that seeing their own hand on the video monitor enhances their understanding of the spatial relations within the operative space, in this otherwise spatially demanding domain. This claim is intuitively plausible, and is consistent with prior literature on crossmodal sensory integration in peripersonal space. However, previous studies in this field have typically allowed participants to view their own hands directly, not via
video feedback, and in these studies the angle from which the hand is seen is usually consistent with its actual orientation in space. By contrast, in minimally invasive surgery the surgical camera, or laparoscope, is often placed in an orientation that is at odds with the surgeon's own perspective, producing a view of the hand that is spatially misaligned and incompatible with proprioceptive information. In the 2004 paper, we showed that congruent and conflicting kinesthetic and visual cues sometimes help and sometimes impair performance on a spatial task (maze drawing in indirect viewing conditions). The present new analysis provides novel insights into these effects. In this paper I show that the confusion caused by seeing one’s own hand from an incongruent angle is viewpoint-specific. Moreover, the degree of confusion and even whether any confusion occurs (relative to performance on the same task without seeing the hand) depends strongly on individual differences in spatial abilities. In this paper I discuss possible reasons for the viewpoint specific nature of the confusion and speculate on how spatial abilities might function in reconciling conflicting sensory cues relating to position and orientation in space. 1.1 The Relationship between Vision and Touch Previous studies with humans and primates have shown that the senses of vision and touch have a special relationship. Graziano and Gross have identified bimodal neurons that respond only when the information received through vision and touch corresponds [2]. These specialized visuo-tactile neurons fire when the hand reaches towards an object or location in reachable space that can simultaneously be seen. When the hand or its target location is unseen, these cells do not respond. This finding suggests that highly dexterous higher primates, including humans, have developed specialized bimodal connections between vision and touch, evolved for exploring the world with seen hands. The fundamental nature of the relationship between vision and touch is demonstrated neatly by the crossmodal congruency effect. Driver and colleagues have shown that visual cues, such as LEDs attached to the surface of the hands, can enhance speed of responses to tactile stimuli [3]. This inter-modal facilitatory effect demonstrates the rapid and automatic crosstalk between the two senses, such that spatial cues presented in one modality can speed reactions to spatial cues presented in the other modality. Importantly, this effect follows the hand when it moves in space, such as when the hands are crossed in front of the body, demonstrating that these cross-modal sensory cues are coded in limb centered or body centered spatial coordinates. 1.2 Representation of the Hands in the Body Schema It is well established that the somatosensory cortex of the brain represents the moment-by-moment positions and orientations of body parts as we move our limbs, trunk, and head in space and in relation to each other [4]. The somatosensory cortex receives proprioceptive feedback from muscles, joints, and tendons, and combines these in a representation of the body's configuration and the relationships among different body parts or effectors. This “felt” position and orientation of body parts makes up our internal representation or “body schema”.
Although this internal representation is being constantly updated by proprioceptive information as we move our bodies in space, studies involving mental rotation of hands have shown that there is a “default” representation of hand orientation within the body schema. Sekiyama has shown that the fastest responses in imagined hand rotation tasks are generally to stimuli showing a back view of the hand [5]. This finding suggests that the default or baseline representation of hand orientation in the body schema is the equivalent of having the hand in front of the body with the back of the hand in view, and it indicates that motor imagery processes start from this orientation. Interestingly, there is evidence that this default positional representation develops over time. Funk and colleagues had participants perform a mental rotation task using hand images while they held their hands in either palm up or palm down orientations [6]. In children, the physical orientation of the hands in space affected speed of response – when the actual orientation of their hands matched the target orientation they were fastest, whereas when the two were incongruent they took longer. This suggests that the children started the mental rotation process from an internal representation that matched the physical orientation of their hands in space. By contrast, adults were consistently faster on trials where a back view was presented, regardless of the physical orientation of their hands in space. This suggests that by adulthood the default internal representation of hand orientation (the back view) defines the starting orientation of our imagined hands in the body schema, and is more influential in this kind of motor imagery task than the actual orientation of our hands in space.
1.3 Body Schema Can Be Influenced by Visual Experience
Although these studies demonstrate that by adulthood the body schema is well developed in the sensorimotor cortex, a number of ingenious experiments using fake hands and visual prisms have shown that what we see can influence our internal representation of limb position. Sekiyama had participants wear a visual prism that reversed left and right. This produced a conflict between vision and touch, such that the participant’s right hand looked like their left hand when viewed through the prism. After adaptation the internal representation of the hand had changed in a way that brought it into line with the visual information [7]. This finding demonstrates that visual experience can dramatically affect the body schema representation. Indeed, it has been argued that visual experience may be a key mechanism by which we acquire our default representation of hand orientation in the body schema by the time we reach adulthood, since the back of the hand is the most frequently seen orientation of our own hands as we grasp and manipulate everyday objects [6]. The crossmodal congruency effect described above [3], in which cues from one modality (vision) can speed attention in another modality (touch), has been shown to occur even when the seen “hand” is not the participant’s own. The effect has been demonstrated with fake hands, and occurs even though the participant’s own hand is displaced somewhat from the location of the fake hand, such as being underneath the table on which the fake hand is placed. Studies have shown that one of the most important factors for producing these illusions with false hands is temporal alignment.
If the participant feels a touch at precisely the same moment as they see a fake hand or rubber glove being touched they can experience an illusion whereby they are convinced that it is their own hand that they are seeing [10]. Thus, a perfect match in
terms of timing between what is seen and what is felt seems to be critical in aligning information received through vision and touch. However, this effect is disrupted when the discrepancy between the orientations of the fake hand and the participant’s own hand becomes too great, such as when the fake hand is rotated ninety degrees relative to the participant’s hand [8, 9].
1.4 Spatial Coding and the Parietal Cortex
Our apparently unitary representation of body position in space is generated when the information from all of our sensory modalities is integrated in the brain. Human and monkey studies have shown that this occurs in the posterior parietal cortex, specifically within the intraparietal sulcus (IPS). Areas within monkey IPS are critical for integrating information acquired through vision and touch, and are active in controlling reaching and pointing movements in space. Homologous regions exist in human IPS, and in both species this area appears to play a critical role in creating a representation of the space of our bodies and the space around our bodies, with particular importance in tasks that involve movement of the hands guided by vision [11]. In an ingenious study, monkeys were trained to retrieve food rewards from beneath a glass plate that could be turned clear or opaque at the flick of a switch. After training, neurons in the ventral intraparietal sulcus, which had previously been responding only to proprioceptive information, showed visual responses, indicating that they had become bimodal through the process of associating visual and proprioceptive information. The visual receptive fields persisted even when the view of the arm was obscured, leading the authors to argue that these intraparietal bimodal neurons allow the updating of body images even when the limb is unseen [12]. Sekiyama argues that of all the brain regions containing bimodal neurons, the IPS is perhaps the most important for our internal representation of the body in space [13]. Graziano and colleagues have shown that neurons in parietal area 5 respond to the sight of a fake arm, and furthermore that these neurons can distinguish between plausible and implausible arm orientations and even between a left hand and a right hand by sight alone (e.g., the neurons did not respond when a fake right arm was placed in the same position and orientation as the monkey’s own left arm) [14]. Sekiyama argues that bimodal neurons in this region (unlike other sensorimotor regions) integrate visual and proprioceptive cues when the visual information matches the internal body schema representation, and therefore the parietal cortex and specifically the IPS contains the highest level of representation of the body in space [13]. Thus, the parietal lobe plays a critical role in integrating the many different forms of sensory information that we receive into a high-level, overarching representation of our own body in space. From many different sensory inputs (e.g., head-based, trunk-based, arm-based, and retinocentric frames of reference), the parietal lobe generates a global egocentric frame of reference and a unified internal sense of our position and orientation in space [15, 16].
1.5 The Parietal Lobe and Adaptability of Spatial Frames of Reference
Despite the obvious stability of this representation over time, Sekiyama argues that the body schema is somewhat adaptable [13]. Studies from a range of domains indicate that these adaptations occur in the high-level representation of the IPS. It appears
that the bimodal IPS neurons can take account of changes to information from multiple senses and as a result can alter the internal representation of the body. Such modifications to the body schema are seen in prism adaptation studies, as discussed earlier, in which bimodal neurons of the IPS alter the way that they code the relationship between vision and touch to recalibrate discrepant sensory information caused by wearing prisms [7]. Similar modifications to the body schema are evident in studies involving tool use in monkeys. Research has shown that changes to spatial coding of the limbs result from extensive experience of using long tools or instruments, such that the tips of the tools become coded in the same way as the tips of the limbs [17]. As with prism adaptation, this recalibration of the spatial extent of the limb is reflected in changes at the neural level [18]. Essentially these kinds of flexible processes allow the system to alter the way that different spatial frames of reference operate together, in order to maintain a coherent sense of space and position. Perhaps it is this capacity to adapt to new information, both real and imagined, that allows us to perform everyday spatial tasks. The computational processes involved in encoding, maintaining, and manipulating spatial information include the kinds of spatial transformation processes that psychologists study and that define what we call spatial ability [19]. Standardized tests of spatial ability, as well as everyday operations in the real world, include tasks such as imagining how an object would change if we picked it up and turned it (mental rotation) or imagining how the world would look, and the consequences for our intended actions, if we moved to a different location or orientation in space (perspective shifting). What all of these processes have in common is the requirement to represent, manipulate, update, and reconcile different spatial frames of reference. These flexible processes are among the key determinants of spatial ability, and therefore individuals with better spatial abilities should be better able to reconcile sensory cues that represent conflicting frames of reference. This is the central hypothesis in the present analysis. 1.6 The Set-Up in Our Study and in Typical Hand Assisted Surgery In typical minimally invasive surgery conditions, the surgeon has no direct view of the operative site, but must instead depend on a 2-D image from the laparoscopic camera presented on a monitor. This image lacks binocular depth cues, and furthermore it is quite common for the laparoscope to be inserted into the patient at an angle that differs from the orientation of the surgeon. This means that the viewpoint from which the camera captures the operative site is inconsistent with the surgeon's perspective, and presumably some kind of spatial transformation, such as mental rotation or perspective shifting, must be performed in order to align the two. In extreme cases, the laparoscope may be inserted through a port in the patient's body that produces a view of the operative site that is up to 180° discrepant from the surgeon's perspective. Ideally, the surgeon seeks to minimize the discrepancy between their view and that of the camera, but this is by no means always possible and in any case the angle of the laparoscope is often altered multiple times during a procedure in order to provide unobstructed views of particular structures or to allow a particular instrument to be inserted through a specific port. 
In traditional minimally invasive surgery, the surgeon has no direct contact with the operative site using his or her hands. However, in hand assisted methods, one
hand is inserted into the operative site through a slightly enlarged port in the patient's body. This allows the surgeon to use one hand like a very dexterous instrument, and it also makes the hand visible on the monitor via the laparoscopic camera. These are the conditions that we replicated in our original study. The hand was either inserted into the task space, and thus it appeared on the monitor, or it was not present in the task space and was therefore not visible via the camera. It was not allowed to interfere with the task at all, so that any effect of seeing the hand in view of the camera was due to its presence alone, and not to any benefits that might result from using it to help with the spatial task. In the original paper, we found that both camera angle and spatial ability had main effects on performance. We also found that having the hand in view was helpful, relative to performing the task without the hand in view, for all participants when the camera was inserted from the left side of the workspace. Contrary to this, we found unexpectedly that when the camera was inserted from the right side of the workspace, having the hand in view impaired performance for lower spatial participants only [1]. In what follows, I explore these effects further, and attempt to establish whether there is some qualitative difference in the effects of seeing the hand in view of the camera that depends on how the hand looks. I also examine whether and under which circumstances these effects depend on the spatial abilities of the individual participant. Given the preceding discussion of the importance of spatial ability for reconciling different frames of reference, and the fundamental connection between what we see and what we feel, I predict that spatial ability may be especially important for reconciling incongruent visual and kinesthetic cues and for adapting to inconsistencies between these two sources of information.
2 Method 2.1 Participants Forty right-handed paid volunteers (18 males) were recruited from the UC Berkeley undergraduate population, mean age 20.1 years (SD 2.3 years). 2.2 Apparatus The apparatus was constructed to mimic laparoscopic conditions (see Figure 1). The participant’s view of the workspace was provided by a laparoscope, with the image presented on a monitor at head height. A shield prevented direct view of the workspace. A permanent marker pen was attached to the end of a laparoscopic instrument, whose movements were constrained by a fulcrum (analogous to the point of entry through the abdominal wall). The instrument was offset -45º (+315º) in azimuth. Holding the instrument handle with their right hand, participants used the monitor image to guide the pen tip around a star-shaped maze mounted on a platform. 2.3 Design The independent variables were camera angle and spatial ability. The dependent variable was error difference (number of errors with hand in view minus number of
Fig. 1. Experimental setup. The participant’s view of the maze and the instrument tip was obscured by the shield and they completed the maze drawing task using only the image from the laparoscope, which was positioned either at 90º (left side) or at 270º (right side). On half of the trials, the participant’s left hand was visible in the monitor image.
errors without hand in view). Reported here are two conditions (camera angles 90º and 270º) that were common across three separate experiments. Amalgamating the experiments resulted in a mixed design, in which some participants participated in both camera angle conditions, while others participated in either 90º or 270º but not both. The methodologies of the experiments were identical in all essential design features (instructions and practice trials, apparatus, procedure, counterbalancing of conditions and trials, total number of conditions and trials). 2.4 Procedure The laparoscopic camera was secured at one of two positions (offset in azimuth by 90º or 270º; see Figure 1). On half of the trials, participants were instructed to hold the maze platform so that their left hand appeared on the monitor (the hand did not interfere or help with the task). Participants completed one practice trial at each angle (using a different maze), followed by four experimental trials, two with the hand in view and two without (order ABBA/BAAB). Instructions were given to complete the star mazes as quickly as possible but with as few errors as possible. The order of conditions was counterbalanced using a Latin square design. Spatial visualization ability was assessed using three paper-and-pencil tests: the Mental Rotations Test [20], the Paper Folding Test [21] and the Card Rotations Test [21]. These tests correlated positively (r = .58 to .63), so standardized scores (z-scores) were calculated for the three tests and they were averaged to produce an aggregate measure of
spatial ability. A median split was performed on the aggregate measure to create two groups, defined as high and low spatial ability. Errors were scored blind, by a manual frequency count, after the task was completed. Using the ink trace, one error was allocated for every time the pen crossed the outer border of the maze.
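(As an illustration of the ability-score aggregation and median split just described, the sketch below standardizes the three test scores, averages them, and splits participants at the median; the data file and column names (mrt, pft, crt) are hypothetical placeholders, not the author's code.)

# Illustrative aggregation of the three spatial tests into a composite and a median split.
# File and column names are hypothetical.
import pandas as pd

df = pd.read_csv("spatial_tests.csv")
tests = ["mrt", "pft", "crt"]  # Mental Rotations, Paper Folding, Card Rotations

z = (df[tests] - df[tests].mean()) / df[tests].std(ddof=1)  # z-score each test
df["spatial_agg"] = z.mean(axis=1)                          # aggregate measure

median = df["spatial_agg"].median()
df["spatial_group"] = (df["spatial_agg"] > median).map({True: "high", False: "low"})
# Note: scores exactly at the median are assigned to the low group in this sketch;
# the paper does not say how ties were handled.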
3 Results Previously, we reported main effects on performance of camera angle, hand position, spatial abilities, and the interactions among these variables. In the present analysis a new variable was created to establish whether performance was affected positively or negatively by having the hand in view. In this analysis, performance without the hand in view was used as the baseline and the positive or negative effects of seeing the hand in the monitor were assessed against this. This variable was generated by subtracting the number of errors made without the hand in view from the number of errors made with the hand in view, in each of the two conditions. Thus, a negative error difference indicates that seeing one’s own hand helped performance, whereas a positive error difference indicates that seeing one’s own hand impaired performance. This new variable makes it possible to isolate the effect of seeing the hand in the camera
Fig. 2. Difference in errors with hand in view versus without hand in view, under the two viewing orientations, split by high and low spatial participants (median split of aggregate ability measure). Error bars represent +/- 1 SEM.
view, and to determine whether the effect is negative or positive, relative to not seeing the hand. In all analyses the variables met assumptions of normality. Figure 2 represents this difference in errors with the hand in view versus without the hand in view under the two viewing orientations, split by high and low spatial participants (median split of aggregate spatial ability measure). This plot indicates that qualitatively different patterns of errors occurred when the camera was positioned to show the back view of the hand versus when it showed the palm view of the hand. When the back view was visible (90º), seeing the hand improved performance for all participants. An independent samples t-test showed that this effect did not differ for higher and lower spatial participants, t(25) = .67, p = .51, n/s. By contrast, when the palm view was visible (270º), seeing the hand impaired performance for lower spatial participants (more errors) but it somewhat helped performance for higher spatial participants (fewer errors), and this difference between higher and lower spatial participants was significant, t(21) = -2.93, p = .008. Four separate one-sample t-tests with alpha adjusted for multiple comparisons tested these effects against zero. This analysis showed that all of the error differences except one were significantly different from zero (t = -3.77 to 3.98, p = .003 to .002, in all significant cases). Thus, in the 90º condition, seeing the hand was significantly beneficial to both low and high spatial participants. By contrast, in the 270º condition, seeing the hand significantly impaired low spatial participants, whereas it did not significantly affect high spatial participants, either positively or negatively.
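(To make the derived error-difference variable and the tests just reported concrete, the sketch below computes the difference score for each camera angle and runs the group and one-sample comparisons with scipy; the data file and column names are hypothetical placeholders rather than the author's analysis code, and the alpha adjustment for the four one-sample tests is indicated only as a comment.)

# Illustrative error-difference analysis; all file and column names are hypothetical.
import pandas as pd
from scipy import stats

df = pd.read_csv("maze_errors.csv")
df["err_diff"] = df["errors_hand"] - df["errors_no_hand"]  # negative = seeing the hand helped

for angle in (90, 270):
    sub = df[df["camera_angle"] == angle]
    hi = sub.loc[sub["spatial_group"] == "high", "err_diff"]
    lo = sub.loc[sub["spatial_group"] == "low", "err_diff"]
    print(angle, stats.ttest_ind(hi, lo))      # high vs. low spatial groups
    print(angle, stats.ttest_1samp(hi, 0.0))   # is the effect different from zero?
    print(angle, stats.ttest_1samp(lo, 0.0))
# With four one-sample tests, alpha would be adjusted for multiple comparisons
# (e.g., Bonferroni: .05 / 4), as in the paper.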
Fig. 3. Relationship between spatial ability and error difference (hand versus no hand) in the two viewing conditions (left panel: 90º; right panel: 270º). Points below the dotted line indicate performance that was better with the hand in view than without, and points above the dotted line indicate performance that was poorer with the hand in view than without. The solid line represents the best-fit regression line.
These patterns were explored further using correlational analyses. Figure 3 shows the relationships between spatial ability and error difference (hand minus no hand) in the two viewing conditions. The dotted line indicates the level of errors at which there was no effect, either positive or negative, of seeing the hand relative to not seeing the hand. Points below this line indicate that seeing the hand helped performance, whereas points above this line indicate that seeing the hand hurt performance, relative to not seeing the hand. The solid line is the best-fit regression line.
Figure 3 indicates that there was no systematic relationship between spatial ability and error difference in the 90º view condition (back view), r = -.007, p = .97, n/s. By contrast, in the 270º condition, Figure 3 shows a clear linear relationship between spatial ability and error difference, indicating that as spatial ability decreases, the detrimental effect of seeing the hand increases. This correlation was significant, r = -.54, p = .008.
4 Discussion
This new analysis of the data from these experiments reveals that qualitatively different effects occurred depending on the view of the hand that was available to the participant in each trial. In the 90° view condition, when the back of the hand was visible, all participants benefited from seeing their own hand in the monitor. Furthermore, in this condition there was no significant correlation with individual differences in spatial ability. By contrast, the effects in the 270º condition, in which the palm view of the hand was visible, were quite different. In this condition, low spatial participants were significantly impaired when they saw their own hand in the monitor, while high spatial participants did not experience any significant benefit or detriment. Moreover, there was a strong correlation between spatial ability and the effect of seeing the hand. In this condition, individual differences in spatial ability strongly predicted whether an individual became confused by the sight of their own hand. If we compare overall performance, it is clear from Figure 3 that more benefit is gained from seeing the back view of the hand (90° condition) – more of the data points fall below the dotted line, indicating that people are better off with the sight of the hand than without it. By contrast, in the 270° condition around half of all participants do worse when they see their own hand than when they do not (data points above the dotted line), and these are primarily lower spatial individuals. What is responsible for the enhancement of performance in the 90° condition and the apparent confusion in the 270° condition, both caused by seeing one's own hand? It would appear that the view of the hand in the 90° condition is sufficiently aligned with how the hand feels (its internal representation in the body schema) that it does not cause confusion. In terms of previous research with monkeys, this might be analogous to the responses that occur in bimodal neurons only when the visual and kinesthetic information are sufficiently compatible [2, 8]. Perhaps this harmonious “visuotactile” representation of the hand in space is what helps the participant better understand the spatial relations of the task when the hand appears in the camera image compared to when it is not present. By contrast, the view of the hand in the 270° condition does not allow the participant to compensate for the camera misalignment. In fact, for individuals with poor spatial abilities it caused more confusion than when the hand was not visible at all. This suggests that how the hand looks in this condition is fundamentally at odds with how it feels. In other words, it is not possible to reconcile this view of the hand with the internal representation of the hand’s position in the body schema. In this sense, this condition seems analogous to previous studies where false hands were placed in orientations that were too incongruent with the “felt” hand position to allow the illusion of unity to prevail [8].
Figure 4 shows how the hand looks from these two camera orientations. There are at least two possible reasons why the palm view of the hand should be so difficult to reconcile with the internal representation, compared to the back view. One possibility is that because the default representation of the hand in the body schema is the back view [5], the 90° view can be more readily integrated with it, whereas the 270° palm view causes too much of a conflict. Another possibility is that the appearance of the hand, and especially the angle of the wrist, in the 270° view is a biomechanically awkward position for the left hand to adopt, and therefore it is difficult to perceive the seen hand as one’s own left hand. In fact, it is almost impossible to move the left hand in such a way as to produce this view of it under normal circumstances, whereas it is relatively easy to orient the left hand in such a way as to produce a view similar to that in the 90° viewing condition. This account is consistent with previous research on mental rotation of hands, which has shown that motor imagery (imagined movements of body parts) is subject to the same biomechanical constraints as real movements of the body in space [22]. These two accounts are not mutually exclusive. Indeed, given that extended visual experience of seeing the hands in particular orientations can influence the internal body schema [6, 7], it seems plausible that they might, if anything, be mutually reinforcing.
Fig. 4. View of the hand from the 90º camera orientation (left) and the 270º camera orientation (right)
Why is spatial ability so important in the 270° view condition? If we assume that the confusion in this condition arises from difficulties with reconciling conflict between two incompatible frames of reference (visual and kinesthetic), this gives us an interesting insight into what kinds of abilities psychometric spatial tests such as the ones we used may be tapping. Perhaps one of the key components of cognitive spatial abilities is the ability to represent, manipulate, update, and reconcile different spatial frames of reference. It has been claimed that all spatial manipulation tasks essentially involve manipulating relations among three spatial frames of reference: egocentric, object-centered, and environmental [23]. For example, mental rotation tests require the test taker to transform the orientation of an object around its intrinsic axes and then update the transformed object-centered frame of reference in relation to stable reference frames of the environment and the self (egocentric). Paper folding tests require the test taker to manipulate internal parts of an object with respect to the
object’s overall frame of reference. Tests of spatial orientation, which involve egocentric perspective shifts, require the test taker to transform and update their own egocentric reference frame with respect to stable environmental and object-centered frames of reference. Thus, it is possible that individuals who performed poorly on the psychometric spatial ability tests that we administered were generally poor at such processes, and therefore also had particular difficulty reconciling the conflict between the visual and kinesthetic frames of reference in the 270° condition. Although somewhat speculative, this interpretation is consistent with what we know about brain regions involved in integrating multiple frames of reference. Spatial information from many different sensory sources is integrated in posterior parietal cortex into a coherent whole [15, 16]. It has also been shown that spatial transformation tasks such as mental rotation involve these same parietal regions [24-28], and moreover, individual differences in parietal activation have been shown to correlate with individual differences in spatial abilities [29]. Thus, it may be that an essential function of this region is to encode, represent, manipulate, and reconcile different spatial frames of reference. While more research is needed to demonstrate that these effects are replicable within a single experiment and with a larger sample, the present analysis does raise some interesting potential avenues to pursue. Future studies could establish the parameters of congruent versus conflicting visual and kinesthetic cues. For example, is there a degree of rotation of the image of the hand at which the information changes from being primarily helpful to primarily hurtful in these kinds of tasks (at least for lower spatial individuals)? Another interesting future question is whether extended visual experience of the hand in apparently incongruous orientations can overcome confusion such as that observed in the 270º palm-view condition. Could this view of the hand eventually become integrated with the body schema representation, such as occurs in prism adaptation studies [7], and consequently help in spatial reasoning tasks such as these, even for individuals with poorer spatial abilities? If replicable, the implications of these findings for hand-assisted minimally invasive surgery are clear. In previous studies we have found that laparoscopic surgeons span the same wide range of spatial abilities as the general population [30], because the domain of medicine does not pre-select for these abilities. Therefore it is likely that surgeons using these methods will be subject to the same effects that were evident here (at least in the beginning of their laparoscopic experience – we do not know about possible effects of extended experience with these methods). Thus, while in some conditions seeing the hand in the operative view may be helpful, as surgeons claim, in other circumstances it may actually impair their understanding of the spatial relations of the operative space. Knowing how to avoid these conditions with judicious laparoscope placement might be an important applied outcome of this line of research. Finally, these findings shed light on the interface between vision and touch and the multimodal nature of our apparently unitary internal representation of the space around us. They also highlight the importance of individual differences.
The data suggest that spatial ability is a key variable, and should be included in theoretical accounts of how, and how well, people generate, maintain, and manipulate their mental representations of space.
References 1. Keehner, M., Wong, D., Tendick, F.: Effects of viewing angle, spatial ability, and sight of own hand on accuracy of movements performed under simulated laparoscopic conditions. In: Proceedings of the Human Factors and Ergonomics Society’s 48th Annual Meeting, pp. 1695–1699 (2004) 2. Graziano, M.S.A., Gross, C.G.: A bimodal map of space - somatosensory receptive-fields in the macaque putamen with corresponding visual receptive-fields. Experimental Brain Research 97(1), 96–109 (1993) 3. Driver, J., Spence, C.: Attention and the crossmodal construction of space. Trends in Cognitive Sciences 2, 254–262 (1998) 4. Penfield, W., Rasmussen, T.L.: The cerebral cortex of man. MacMillan, New York (1955) 5. Sekiyama, K.: Kinesthetic aspects of mental representations in the identification of left and right hands. Perception and Psychophysics 32, 89–95 (1982) 6. Funk, M., Brugger, P., Wilkening, F.: Motor processes in children’s imagery: the case of mental rotation of hands. Developmental Science 8(5), 402–408 (2005) 7. Sekiyama, K., et al.: Body image as a visuomotor transformation device revealed in adaptation to reversed vision. Nature 407, 374–377 (2000) 8. Graziano, M.S.A.: Where is my arm? Proceedings of the National Academy of Sciences 96, 10418–10421 (1999) 9. Maravita, A., Spence, C., Driver, J.: Multisensory integration and the body schema: Close to hand and within reach. Current Biology 13, R531–R539 (2003) 10. Pavani, F., Spence, C., Driver, J.: Visual capture of touch: Out-of-the-body experiences with rubber gloves. Psychological Science 11(5), 353–359 (2000) 11. Grefkes, C., Fink, G.R.: The functional organization of the intraparietal sulcus in humans and monkeys. Journal of Anatomy 207, 3–17 (2005) 12. Obayashi, S., Tanaka, M., Iriki, A.: Subjective image of invisible hand coded by monkey intraparietal neurons. Neuroreport. 11(16), 3499–3505 (2000) 13. Sekiyama, K.: Dynamic spatial cognition: Components, functions, and modifiability of body schema. Japanese Psychological Research 48(3), 141–157 (2006) 14. Graziano, M.S.A., Cooke, D.F., Taylor, C.S.R.: Coding the location of the arm by sight. Science 290, 1782–1786 (2000) 15. Cohen, Y.E., Anderson, R.A.: A common reference frame for movement plans in the posterior parietal cortex. Nature Reviews Neuroscience 3, 553–562 (2002) 16. Colby, C.L.: Action-oriented spatial reference frames in cortex. Neuron. 20, 15–24 (1998) 17. Maravita, A., et al.: Tool-use changes multimodal spatial interactions between vision and touch in normal humans. Cognition 83, B25–B34 (2002) 18. Iriki, A., Tanaka, M., Iwamura, Y.: Coding of modified body schema during tool use by macaque postcentral neurones. NeuroReport 7(14), 2325–2330 (1996) 19. Hegarty, M., Waller, D.: Individual differences in spatial abilities. In: Miyake, A., Shah, P. (eds.) The Cambridge handbook of visuospatial thinking. Cambridge University Press, Cambridge (2005) 20. Vandenberg, S.G., Kuse, A.R.: Mental rotations, a group test of three-dimensional spatial visualization. Perceptual & Motor Skills 47, 599–604 (1978) 21. Ekstrom, R.B., et al.: Manual for kit of factor-referenced cognitive tests. Educational Testing Service, Princeton (1976) 22. Parsons, L.M.: Imagined spatial transformations of one’s hands and feet. Cognitive Psychology 19, 178–241 (1987)
23. Zacks, J.M., Michelon, P.: Transformations of visuospatial images. Behavioral and Cognitive Neuroscience Reviews 4(2), 96–118 (2005) 24. Zacks, J.M., Vettel, J.M., Michelon, P.: Imagined viewer and object rotations dissociated with event-related fMRI. Journal of Cognitive Neuroscience 15(7), 1002–1018 (2003) 25. Carpenter, P.A., et al.: Graded functional activation in the visuospatial system with amount of task demand. Journal of Cognitive Neuroscience 11(1), 9–24 (1999) 26. Harris, I.M., et al.: Selective right parietal lobe activation during mental rotation. Brain 123, 65–73 (2000) 27. Podzebenko, K., Egan, G.F., Watson, J.D.G.: Widespread dorsal stream activation during a parametric mental rotation task, revealed with functional magnetic resonance imaging. Neuroimage 15, 547–558 (2002) 28. Keehner, M., et al.: Modulation of neural activity by angle of rotation during imagined spatial transformations. Neuroimage 33, 391–398 (2006) 29. Lamm, C., et al.: Differences in the ability to process a visuo-spatial task are reflected in event-related slow cortical potentials of human subjects. Neuroscience Letters 269, 137– 140 (1999) 30. Keehner, M., et al.: Spatial ability, experience, and skill in laparoscopic surgery. American Journal of Surgery 188(1), 71–75 (2004)
Epistemic Actions in Science Education
Kim A. Kastens 1,2, Lynn S. Liben 3, and Shruti Agrawal 1
1 Lamont-Doherty Earth Observatory of Columbia University
2 Department of Earth & Environmental Sciences, Columbia University
3 Department of Psychology, The Pennsylvania State University
[email protected],
[email protected],
[email protected]
Abstract. Epistemic actions are actions in the physical environment taken with the intent of gathering information or facilitating cognition. As students and geologists explain how they integrated observations from artificial rock outcrops to select the best model of a three-dimensional geological structure, they occasionally take the following actions, which we interpret as epistemic: remove rejected models from the field of view, juxtapose two candidate models, juxtapose and align a candidate model with their sketch map, rotate a candidate model into alignment with the full scale geological structure, and reorder their field notes from a sentential order into a spatial configuration. Our study differs from prior work on epistemic actions in that our participants manipulate spatial representations (models, sketches, maps), rather than non-representational objects. When epistemic actions are applied to representations, the actions can exploit the dual nature of representations by manipulating the physical aspect to enhance the representational aspect. Keywords: spatial cognition, epistemic actions, science education.
1 Introduction Kirsh and Maglio [1] introduced the term "epistemic action" to designate actions which humans (or other agents) take to alter their physical environment with the intent of gathering information and facilitating cognition.1 Epistemic actions may uncover information that is hidden, or reduce the memory required in mental computation, or reduce the number of steps involved in mental computation, or reduce the probability of error in mental computation. Epistemic actions change the informational state of the actor, as well as the physical state of the environment.
Magnani [24] used a similar term, "epistemic acting," more broadly, to encompass all actions that provide the actor with additional knowledge and information, including actions that do not alter anything in the environment (e.g., "looking [from different viewpoints]," "checking," "evaluating," "feeling [a piece of cloth]".) Roth [25] (p. 142) used "epistemic action" to refer to sensing of objects and "ergotic action" to refer to manipulating objects in a school laboratory setting. In this paper, we use the term "epistemic action" in the original sense of Kirsh and Maglio.
Kirsh and Maglio contrasted epistemic actions with "pragmatic actions," those taken to implement a plan, or implement a reaction, or in some other way move oneself closer to a goal. Kirsh and Maglio [1] explicated their ideas in terms of the computer game Tetris. They showed that expert players make frequent moves that do not advance the goal of nestling polygons together into space-conserving configurations, but do gain information. For example, a player might slide a falling polygon over to contact the side of the screen and then count columns outwards from the side to determine where to drop the polygon down to fit into a target slot. For a skilled player this backtracking maneuver is more time-efficient than waiting for the polygon to fall low enough for the judgment to be made by direct visual inspection. At a different point in the game, a player might rotate a polygon through all four of the available configurations before selecting a configuration. Kirsh and Maglio showed that such physical rotation, followed by direct perceptual comparison of the polygon and the available target slots, is more time-efficient than the corresponding mental rotation. As an individual player's skill increases from novice to expert, the frequency of such "extraneous" moves increases [2]. In this paper, we apply the concept of epistemic actions to science and science education. Scientists and science students manipulate objects in the physical world in the course of trying to solve cognitively demanding puzzles. We argue that epistemic actions, in the sense of Kirsh and Maglio [1], are an underappreciated tool that scientists use, and that science students could be taught to use, to enhance the efficiency of their cognitive effort. We begin by showing examples of participant actions that we believe to be epistemic, which emerged in our own study of spatial thinking in geosciences. We then describe epistemic actions in other domains of science education, and conclude by offering some generalizations and hypotheses about how epistemic actions may work.
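To make the time-efficiency argument concrete, the toy Python sketch below compares the two Tetris strategies just described. All per-operation costs (keypress, perceptual check, 90° of mental rotation) are invented illustrative values, not figures from Kirsh and Maglio, and charging mental rotation cumulatively per orientation is our own simplifying assumption.

```python
# Toy comparison of physical vs. mental rotation in Tetris, in the spirit of
# Kirsh and Maglio's argument. All costs are invented illustrative values
# (milliseconds), not empirical estimates.

KEYPRESS_MS = 150                # assumed cost of one rotation keypress
PERCEPTUAL_CHECK_MS = 100        # assumed cost of visually comparing piece to slot
MENTAL_ROTATION_MS_PER_90 = 350  # assumed cost of 90 degrees of mental rotation

def physical_strategy_ms(orientations: int = 4) -> float:
    """Rotate the falling piece through every orientation and look each time."""
    return orientations * (KEYPRESS_MS + PERCEPTUAL_CHECK_MS)

def mental_strategy_ms(orientations: int = 4) -> float:
    """Imagine each non-upright orientation (cumulative 90-degree rotations),
    then perform one perceptual check per orientation."""
    rotation_cost = sum(k * MENTAL_ROTATION_MS_PER_90 for k in range(1, orientations))
    return rotation_cost + orientations * PERCEPTUAL_CHECK_MS

if __name__ == "__main__":
    print(f"physical: {physical_strategy_ms():.0f} ms")  # 1000 ms under these values
    print(f"mental:   {mental_strategy_ms():.0f} ms")    # 2500 ms under these values
```

Under these invented costs the "extraneous" physical rotations win easily; the point of the sketch is only that the comparison is a matter of relative per-step costs, which is what the empirical work measured.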
2 Epistemic Actions in Our Geoscience Field Study Our study [3] investigates how students and professional geologists gather and record spatial information from rock outcrops scattered across a large field area, and then integrate that information to form a mental model of a geological structure, keeping in mind that the structure is partly eroded and mostly buried. Participants observe and take notes on eight artificial outcrops constructed on a campus, then select from an array of fourteen 3-D scale models to indicate which they think could represent the shape of a structure formed by the layered rocks in the eight outcrops. The scale models vary systematically on key attributes, including convex/concave, circular/elongate, symmetric/asymmetric, open/closed, and shallow/deep. Participants are videotaped as they make their selection and explain why they chose the selected model and rejected the other models. Based on their comments and body language, students find this task difficult but engaging, and all appear to be trying determinedly to solve the puzzle posed to the best of their ability. As detailed elsewhere [4], students use abundant deictic (pointing) gestures to indicate features on their notes, a model or group of models, a real-world direction, or the outcrops in that real-world direction. For example, a student points over his shoulder
to indicate the location of the most steeply-dipping outcrops. They also make frequent use of iconic gestures, while discussing or describing attributes of an observed outcrop, a specific model, a group of models, or a hypothesized structure. For example, a student uses a cupped hand to convey her interpretation that the structure is concave upwards. In addition to abundant deictic and iconic gestures, the videotapes also document instances in which participants spontaneously move their hands in ways that do not have apparent communicative value, manipulating the objects available to them in a manner that we interpret as "epistemic actions." 2.1 Situation #1: Participant Moves Rejected Models Out of View Participants frequently begin their reasoning process by eliminating individual models or categories of models, for example, all the convex models. In many cases, they merely point out the rejected models with deictic gesture, or describe the rejected category in words (i.e., "it can't be convex"). But in some cases, they go to considerable effort to remove the rejected models from their field of view, for example by setting them off to the side (Fig. 1), or handing them to the experimenter. We infer that they are seeking to decrease their perceptual and cognitive load by decreasing the complexity of the visual array and by reducing the number of possibilities that are actively competing for their attention. These actions serve to address one of the basic problems of visual attention, namely that there is a limited capacity for processing information. Although there is a considerable research literature showing that humans are able to focus attention on some rather than other stimuli within a particular visual array [5], at least some processing is necessary when there are competing stimuli, and thus any actions that reduce that competition may be expected to simplify the task [6]. 2.2 Situation #2: Participant Moves Two Candidate Models Side by Side As participants progress through their reasoning process, they may take two candidate models out of the array and place them side by side (Fig. 2.) We infer that this action is intended to facilitate comparing and contrasting attributes of the two models. The side-by-side comparison technique is employed when the two models differ subtly; for example, in Fig. 2 the two models are both concave, both elongate, both steepsided, both closed, and differ only in that one is symmetrical along the long axis while the other is asymmetrical. Based on eye movements of people who were asked to recreate spatial patterns of colored blocks working from a visually-available model, Ballard, Hayhoe, Pook and Rao [7] concluded that their participants adopted a "minimum memory strategy" when the model and under-construction area were close together. They kept in mind only one small element of the model (for example, the color of the next block), and relied on repeated revisits back and forth between the model and the under-construction block array. The revisits allowed them to acquire information incrementally and avoid even modest demands on visual memory. Ballard, et al.'s participants overwhelmingly favored this minimal memory strategy even though it was more time-consuming than remembering multiple aspects of the model, and even though they were instructed to complete the task as quickly as possible. When Ballard, et al. increased the distance between model and copy, use of the minimal memory strategy decreased.
Fig. 1. Participant places rejected models out of field of view. We infer that the purpose of this action is to decrease the number of visually-available comparisons.
Fig. 2. After rejecting most models, this participant took the remaining two candidate models out of the array and placed them side-by-side, to facilitate comparison of details.
We hypothesize that by moving two subtly-different models side-by-side, our participants enabled a minimal memory strategy to efficiently compare and contrast attributes of the models incrementally, without relying on visual memory to carry the entire model shape as attention is transferred from model to model. 2.3 Situation #3: Participant Moves Candidate Model Adjacent to Inscriptions In some cases, participants place a candidate 3-D model side by side with their inscriptions (field notes) (Fig. 3). We infer that this juxtaposition facilitates the process of comparing observation (in the notes) with interpretation (embodied in the candidate 3-D model), presumably through enabling the minimal memory strategy as described above. Participants' inscriptions took many forms [3], including a map of the field area with outcrop locations marked. Among the participants who had a map, we noted an additional epistemic action: participants rotated the map and candidate model such that the long axis of the model was oriented parallel to the long axis of the cluster of outcrop positions marked on the map (Fig. 3). This alignment allowed a direct perceptual
Fig. 3. This participant has placed her inscriptions (notes) side by side with a candidate model to facilitate comparison between her recorded observations and her candidate interpretation
Fig. 4. This participant, an expert, rotates several candidate models so that the long axis of the model aligns with the long axis of the full-scale structure
comparison of inscriptions and model, without requiring the additional cognitive load of mental rotation, as in the case of Kirsh and Maglio's [1] Tetris players. 2.4 Situation #4: Participant Rotates Model to Align with the Referent Space In a few cases, a participant spontaneously rotated a model or models to align with the full-scale structure formed by the outcrops in the perceptual space2 (Fig. 4). As in Situation #3, we hypothesize that the alignment achieved by physical rotation enabled a direct comparison, eliminating the cognitive load of mental rotation. An interesting aspect of Situation #4 is that the full-scale structure was not perceptually available to compare with the model structure. Only 2 of the 8 outcrops were visible to the participants as they made and defended their model selection.
Fig. 5. While observing the eight outcrops, this participant recorded observations onto blank sheets of paper “sententially,” that is, sequenced from top to bottom, left to right on the paper, like text in a book. When confronted with the integrative task, she tore up her inscriptions into small rectangles with one outcrop per rectangle, and reorganized them into a map-like spatial arrangement. (Note: in order to show the reader both the spatial arrangement of the paper scraps and the details of the sketch, this figure was constructed by scanning the student’s inscriptions and superimposing the scanned sketches onto a video screen shot).
After completing their explanation of their model selection, all participants were asked by the experimenter to rotate their selected model into alignment with the full-scale structure. In this paper, we are referring to individuals who spontaneously elected to align their model with the structure before being asked to do so by the experimenter.
We hypothesize that as they moved through the field area from outcrop to outcrop and then back to the starting place, some participants acquired or constructed an embodied knowledge of the outcrop locations and configuration, and that embodied knowledge is somehow anchored to, or superimposed upon, the landscape through which they moved. 2.5 Situation #5: Participant Rips Up Inscriptions, and Reorders Them in Space In the no-map condition of our experiment [3], participants recorded their observations onto blank paper. Some participants situated their observations spatially to form a sketch map of the field area, and others recorded their observations "sententially" [8], in chronological order on the page from top to bottom, left to right, like text in a book. One participant, a novice to field geology, recorded her observations sententially, sketching each outcrop as she visited it. Then, when she was confronted with the selection task, she spontaneously tore up her papers so that each outcrop sketch was on a separate scrap of paper, and arranged the scraps spatially into a rough plan view of the outcrop locations (Fig. 5).
3 Other Occurrences of Epistemic Actions in Science Education In the laboratory or "hands-on" component of a well-taught science education program, students are engaged in manipulating physical objects while thinking hard – conditions that may tend to foster use of epistemic actions. And indeed, we can envision epistemic actions across a range of science fields. For example:
• Elementary school children grow bean plants in paper cups. They place their bean plants in a row along the window sill such that each plant gets the same amount of sunlight. Each child waters his or her bean plant by a different amount each day. Initially, they arrange the plants in alphabetical order by child's name. Then, as the plants sprout and begin to grow, they rearrange the bean plants in order by amount of daily watering, to make it easier to see the relationship between amount of water and growth rate.
• High school chemistry students arrange their test tubes in a test tube rack in order so that the tube that received the most reagent is farthest to the right.
• College paleontology students begin their study of a new taxonomic group by arranging fossils across the lab table in stratigraphic order from oldest to youngest, to make it easier to detect evolutionary trends in fossil morphology.
• Earth Science students begin their study of igneous rocks by sorting a pile of hand samples into a coarse-grained cluster and a fine-grained cluster, to reinforce the conceptual distinction between intrusive rocks (which cooled slowly within the Earth's crust and thus have large crystals) and extrusive rocks (which cooled quickly at the Earth's surface and thus have small crystals).
• Elementary school geography students, or high school Earth Science students, rotate the map of their surroundings until map and referent are aligned. This makes it easier to see the representational and configurational correspondences between map and referent space, without the effort of mental rotation, which is known to be a cognitively demanding task [9].
4 Discussion 4.1 Are Epistemic Actions Consciously Purposeful? The participants in our study produced the actions described above spontaneously, as they struggled to puzzle their way through a spatially-demanding task that most found difficult. Some participants first asked whether it was OK to move or turn the models, which suggests that they knew in advance that such actions would be beneficial. They valued these actions sufficiently that they were willing to risk rejection of a potentially forbidden move, and they anticipated that the experimenter might see these actions as being of sufficient value to outlaw. 4.2 Are Epistemic Actions Always Spatial? All of the examples of epistemic actions we have provided thus far, and the original Tetris examples of Kirsh and Maglio [1], have involved spatial thinking, that is, thinking that finds meaning in the shape, size, orientation, location, direction, or trajectory of objects, processes, or phenomena, or the relative positions in space of multiple objects, processes, or phenomena. Spatial examples of epistemic actions seem most obvious and most powerful. But is this association between epistemic actions and spatial thinking inevitable? Are all epistemic actions in service of spatial thinking? No. It is possible to think of counter-examples of epistemic actions that seek nonspatial information. An everyday example would be placing two paint chips side by side to make it easier to determine which is darker or more reddish, seeking information about color. The science equivalent would be placing a spatula full of dirt or sediment next to the color chips in the Munsell color chart [11]. 4.3 Taxonomies of Epistemic Actions Kirsh [12] developed a classification scheme for how humans (or other intelligent agents) can manage their spatial environment: (a) spatial arrangements that simplify choice; (b) spatial arrangements that simplify perception; and (c) spatial dynamics that simplify internal computation. Our Situation #1, in which participants remove rejected 3-D models from view, is a spatial arrangement that simplifies choice. Situations #2 and #3, in which participants juxtapose two items to simplify comparison, are spatial arrangements that simplify perception. Situations #3 and #4 from the outcrop experiment, plus the case of rotating a map to align with the terrain, simplify internal computation by eliminating the need for mental rotation. Kirsh's scheme classified epistemic actions according to the change in cognitive or informational state of the actor. Epistemic actions could also be classified by the nature of the change to the environment: (a) relocate/remove/hide objects, (b) cluster objects, (c) juxtapose objects, (d) order or array objects, (e) rotate/reorient objects. Considering both classification schemes together yields a two-dimensional matrix for categorizing epistemic actions (Table 1). Each cell in the matrix of Table 1 describes benefits obtained by the specified change to the environment (row) and change to the cognitive state of the actor (column).
Table 1. Two-dimensional taxonomy of epistemic actions
(Rows: change to environment; columns: changes to cognitive state of actor, after Kirsh)
Change to environment | Simplify choice | Simplify perception | Simplify cognition
Remove or hide object(s) | Fewer apparent choices | Less visual input, fewer visual distractions | Fewer pairwise comparisons required
Cluster objects | Choice is among few clusters (e.g., concave vs. convex) rather than among many individuals | Easier to see within-group similarities; easier to see between-group differences | Fewer attributes that need to be considered
Juxtapose objects | – | Easier to see differences and similarities | Less demand on visual memory
Order or array objects | Easier to select end members (e.g., largest, smallest) or central "typical" example | Easier to see trends (e.g., bean plant growth by watering rate) and correlations | No need for mental re-ordering
Rotate/reorient objects | – | Easier to see correspondences | No need for mental rotation
"Juxtapose objects" appears at first glance to be a special case of "cluster objects," but we have separated them because the information gained and the nature of the change of cognitive state may be different. The value-added of juxtaposing two similar objects is that it is easier to perceive similarities and differences, without the cognitive load of carrying a detailed image of object 1 in visual memory while the gaze is shifted laterally to object 2 [7]. The value-added of clustering objects into groups is that one can then reason about a small number of groups rather than a larger number of individual objects. An example of the latter would be separating the trilobites from the brachiopods in a pile of fossils; an example of the former would be juxtaposing two individual trilobite samples to compare their spines. The taxonomy of Table 1 has been structured to accommodate a variety of tasks and to allow extension as new observations accrue from other studies. 4.4 Epistemic Actions and the Duality Principle of Representations Kirsh's [12] taxonomy of actions to manage space was based on observation of people playing games and engaging in everyday activities such as cooking, assembling furniture, and bagging groceries. In the case of science or science education, we suggest that epistemic actions can enhance cognition in a manner not explored by Kirsh: epistemic actions can exploit or enhance the dual nature of representations. A spatial representation, such as a map, graph, or 3-D scale model, has a dual nature: it is, simultaneously, a concrete, physical object, and a symbol that represents
something other than itself [13-18]. We suggest three ways in which epistemic actions can exploit or enhance the dual nature of representations:
1. The action can rearrange or reorder the physical aspect of the representation so that the referential aspect of the representation is more salient and/or has more dimensions.
2. The action can rearrange or reorder the physical aspect of the materials so that a more useful representation replaces a less useful representation.
3. The action can create a dual-natured representation from what had previously been mere non-representational objects.
Mechanism (1): Manipulate the Physical Representation to Enhance or Foreground its Referential Meaning. In Situation #4 of the artificial outcrop experiment, an expert rotates candidate 3-D scale models to align with the full-scale structure. Before rotation, the correct model accurately represented the full-scale structure with respect to the attributes of concave/convex, elongate/circular, steep-sided/gentle-sided, symmetric/asymmetric, and closed/open. After rotation, the model accurately represented the full-scale structure with respect to all of those attributes, and also with respect to alignment of the long axis. In other words, manipulating the physical object transformed the representation into a more complete or more perfect analogy to the referent structure. The same is true of rotating a map to align with the represented terrain [19]. In addition to creating a new correspondence (alignment) where none had existed previously, rotating the correct model to align with the referent space makes the other correspondences more salient, and easier to check or verify. On the other hand, if the model chosen is an incorrect model (for example, open-ended rather than closed-contoured), the discrepancy between model and full-scale structure becomes harder to overlook when the long axes of the model and referent are brought into alignment. Mechanism (2): Manipulate the Physical Representation to Create a More Useful Representation. In Situation #5 of the artificial outcrop experiment, the participant had initially arranged annotated sketches of each outcrop onto her paper such that the down-paper dimension represented the temporal sequence in which the eight outcrops had been visited and the observations had been made. Upon receiving the task directions and seeing the choice array, she apparently realized that this was not a useful organizational strategy. She physically destroyed that organization schema. Then she physically reorganized the fragments into a more task-relevant spatial arrangement, in which positions of outcrop sketches represented positions of full-scale outcrops. This participant apparently had the ability to think of her inscriptions as both (a) a concrete object that could be torn into pieces and reordered, and (b) a set of symbolic marks standing for individual outcrops. Mechanism (3): Manipulate the Physical World to Carry Representational Meaning. In several of the examples described above, the objects have no representational significance before the epistemic action. The epistemic action creates representational significance where none had previously existed. For example, in the case of the children's growing bean plants, as a consequence of the epistemic action, the spatial dimension parallel to the window sill becomes a
representation of water per unit time. The vertical dimension, the height of each plant, becomes a representation of growth rate as a function of watering rate. The entire array of plants becomes a living bar graph. In the case of the fossils arranged on the table, the spatial dimension along the line of fossils acquires two representational aspects, which run in parallel: geologic time and evolutionary distance. In the case of the igneous rocks, the two piles of rocks, fine-grained and coarse-grained, represent the fundamental division of igneous rocks into extrusive and intrusive products of cooling magma. Within each pile, the rocks could further be ordered according to the percentage of light-colored minerals, an indicator of silica content. Kirlik [20] presents a compelling non-science example, in which a skilled short-order cook continuously manipulates the positions of steaks on a grill, such that the near-far axis of the grill (from the cook's perspective) represents doneness requested by the customer, and the distance from the left-hand edge of the grill represents time remaining until desired doneness. This skilled cook need only monitor the perceptually-available attribute of distance from the left edge of the grill, and need not try to perceive the hidden attribute of interior pinkness, nor try to remember the variable attribute of elapsed-duration-on-grill. A less skilled cook in the same diner created only one axis of representation (the near-far requested-doneness axis), and the least skilled cook had no representations at all, only steaks.
5 Conclusions and Directions for Further Research Cowley and MacDorman [21] make the case that capability and tendency to use epistemic actions is an attribute that separates humans from other primates and from androids. If so, then we might expect that the most cognitively demanding of human enterprises, including science, would make use of this capability. In reflecting on the significance of their work, Maglio and Kirsh [2] note (p. 396) that "it is no surprise…that people offload symbolic computation (e.g., preferring paper and pencil to mental arithmetic…), but it is a surprise to discover that people offload perceptual computation as well." This description applies well to science education. Science and math educators have long recognized the power of "offloading symbolic computation," and explicitly teach the techniques of creating and manipulating equations, graphs, tables, concept maps, and other symbolic representations. However, science educators have generally not recognized or emphasized that humans can also "set up their external environment to facilitate perceptual processing" (p. 396). All science reform efforts emphasize that students should have ample opportunities for "hands-on" inquiry [22]. But we are just beginning to understand what students should do with those hands in order to make connections between the physical objects available in the laboratory or field-learning environment and the representations and concepts that lie at the heart of science. We hypothesize that epistemic actions may be a valuable laboratory inquiry strategy that could be fostered through instruction and investigated through research. Questions for future research include the following: Can instructors foster epistemic actions in their students? If so, do student learning outcomes on laboratory
activities improve? Is there individual variation in the epistemic actions found useful by different science students or scientists, as Schwan and Riempp [23] have found during instruction on how to tie nautical knots? Do those scientists who have reputations for "good hands in the lab" make more epistemic actions than those who do not, by analogy with the strategic management of one's surrounding space that Kirsh [12] found to be an attribute of expertise in practical domains? Acknowledgements. The authors thank the study participants for their thoughts and actions, G. Michael Purdy for permission to use the grounds of Lamont-Doherty Earth Observatory, T. Ishikawa, M. Turrin and L. Pistolesi for assistance with data acquisition, L. Pistolesi for preparing the illustrations, and the National Science Foundation for support through grants REC04-11823 and REC04-11686. The opinions are those of the authors and no endorsement by NSF is implied. This is Lamont-Doherty Earth Observatory contribution number 7171.
References 1. Kirsh, D., Maglio, P.: On distinguishing epistemic from pragmatic action. Cog. Sci. 18, 513–549 (1994) 2. Maglio, P., Kirsh, D.: Epistemic action increases with skill. In: Proceedings of the 18th annual meeting of the Cognitive Science Society (1996) 3. Kastens, K.A., Ishikawa, T., Liben, L.S.: Visualizing a 3-D geological structure from outcrop observations: Strategies used by geoscience experts, students and novices [abstract]. Geological Society of America Abstracts with Program, 171–173 (2006) 4. Kastens, K.A., Agrawal, S., Liben, L.S.: Research in Science Education: The Role of Gestures in Geoscience Teaching and Learning. In: Geosci, J. (ed.) (2008) 5. Broadbent, D.E.: Perception and Communication. Oxford University Press, Oxford (1958) 6. Desimone, R., Duncan, J.: Neural mechanisms of selective visual attention. Ann. Rev. of Neurosci. 18, 193–222 (2000) 7. Ballard, D.H., Hayhoe, M.M., Pook, P.K., Rao, R.P.N.: Deictic codes for the embodiment of cognition. Beh. & Brain Sci. 20, 723–767 (1997) 8. Larkin, J.H., Simon, H.A.: Why a diagram is (sometimes) worth ten thousand words. Cog. Sci. 11, 65–99 (1987) 9. Shepard, R.N., Metzler, J.: Mental Rotation of Three-Dimensional Objects. Sci. 171, 701– 703 (1971) 10. Liben, L.S., Downs, R.M.: Understanding Person-Space-Map Relations: Cartographic and Developmental Perspectives. Dev. Psych. 29, 739–752 (1993) 11. Goodwin, C.: Practices of Color Classification. Mind, Cult., Act. 7, 19–36 (2000) 12. Kirsh, D.: The intelligent use of space. Artif. Intel. 73, 31–68 (1995) 13. Goodman, N.: Languages of art: An approach to a theory of symbols. Hackett, Indianapolis (1976) 14. Potter, M.C.: Mundane Symbolism: The relations among objects, names, and ideas. In: Smith, N.R., Franklin, M.B. (eds.) Symbolic functioning in childhood, pp. 41–65. Lawrence Erlbaum Associates, Hillsdale (1979) 15. DeLoache, J.S.: Dual representation and young children’s use of scale models. Child Dev. 71, 329–338 (2000)
16. Liben, L.S.: Developing an Understanding of External Spatial Representations. In: Sigel, I.E. (ed.) Development of mental representation: theories and applications, pp. 297–321. Lawrence Erlbaum Associates, Hillsdale (1999) 17. Liben, L.S.: Education for Spatial Thinking. In: Renninger, K.A., Sigel, I.E. (eds.) Handbook of child psychology, 6th edn., vol. 4, pp. 197–247. Wiley, Hoboken (2006) 18. Uttal, D.H., Liu, L.L., DeLoache, J.S.: Concreteness and symbolic development. In: Balter, L., Tamis-LeMonde, C.S. (eds.) Child Psychology: A Handbook of Contemporary Issues, pp. 167–184. Psychology Press, New York (2006) 19. Liben, L.S., Myers, L.J., Kastens, K.A.: Locating oneself on a map in relation to person qualities and map characteristics. In: Freska, C., Newcombe, N.S., Gärdenfors, P. (eds.) Spatial Cognition VI. LNCS, vol. 5248. Springer, Heidelberg (2008) 20. Kirlik, A.: The ecological expert: Acting to create information to guide action. In: The Conference on Human Interaction with Complex Systems, Piscataway, NJ (1998) 21. Cowley, S.J., MacDorman, K.F.: What baboons, babies and Tetris players tell us about interaction: A biosocial view of norm-based social learning. Cog. Sci. 18, 363–378 (2006) 22. National Research Council.: National Science Education Standards. National Academy Press, Washington (1996) 23. Schwan, S., Riempp, R.: The cognitive benefits of interactive videos: learning to tie nautical knots. Learn. and Instr. 14, 293–305 (2004) 24. Magnani, L.: Model-based and manipulative abduction in science. Found. of Sci. 9, 219– 247 (2004) 25. Roth, W.M.: From epistemic (ergotic) actions to scientific discourse: The bridging function of gestures. Prag. and Cogn. 11, 141–170 (2003)
An Influence Model for Reference Object Selection in Spatially Locative Phrases Michael Barclay and Antony Galton Exeter University, Exeter, UK
[email protected]
Abstract. A comprehensive influence model for choosing a reference object in a spatially locative phrase is developed. The model is appropriate for a Bayesian network implementation and intended as a step toward machine learning of spatial language. It takes its structure from the necessary steps a listener must take in utilising spatial communication and contains, as variables, parameters derived from the literature concerning characteristics of landmarks for wayfinding as well as reference objects in general. Practical limitations on the implementation and training of the model are discussed. Keywords: Bayesian, reference, spatial, locative.
1 Introduction
1.1 Reference Objects in Spatially Locative Phrases
A spatially locative phrase is one in which a reference object, a spatial preposition and an object to be located (herein the "target") are combined for the purpose of identifying the position of the target to the listener. Hence "The cup is on the table" and "The bird flew over the hill" are both examples of simple spatially locative phrases, with the cup and the bird respectively as targets, and the table and the hill as reference objects, further information and any temporal reference being provided by the verb. In the landmark domain a phrase such as "The meeting place is in front of the cathedral" is clearly in this category. The problem addressed by this paper, illustrated in Fig. 1, is to identify a suitable reference object from the many present in a scene, and so take the first step in forming a spatially locative phrase. In Fig. 1, in answer to the question "where is the man?", the answers "by the skip" or "in front of the white house" are acceptable, but what about "on the sidewalk" or "to the right of the road"? If Talmy's categorisation (see [1] and Sect. 2.1) of reference objects is considered, the road and the sidewalk should be suitable candidates, but it is apparent that they do not serve to locate the man as well as the skip or the white house. It should be noted that this is not the same issue as involved in generating referring expressions (see for example [2], [3], [4]). The target object in this paper is assumed to be unique: it needs to be located, not disambiguated.
Fig. 1. Influences on search-space optimisation
A candidate reference object may or may not be ambiguous, and this leads to a variety of issues which are discussed in Sect. 3.4. Even an ambiguous candidate reference object must be treated differently from a referent in a referring expression because it has a purpose in helping to locate the target.
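To make the selection problem concrete, the Python sketch below sets out the kind of data a reference-selection model would operate over for a scene such as Fig. 1. The class names and fields are our own illustrative assumptions and are not defined in the paper.

```python
# Illustrative data structures for the reference-selection problem: a scene
# containing a unique target and several candidate reference objects, from
# which one reference must be chosen to build a locative phrase.
# All names and fields here are assumptions for illustration only.
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    name: str            # e.g. "skip", "white house", "sidewalk"
    position: tuple      # (x, y, z) location in a schematised 3-D scene
    size: float          # simple scalar size proxy
    mobile: bool = False # whether the object is easily moved

@dataclass
class Scene:
    target: SceneObject
    candidates: list = field(default_factory=list)  # possible reference objects

@dataclass
class LocativePhrase:
    target: SceneObject
    preposition: str     # e.g. "by", "in front of"
    reference: SceneObject

# Example corresponding to "The man is by the skip."
man = SceneObject("man", (3.0, 1.0, 0.0), 0.5, mobile=True)
skip = SceneObject("skip", (4.0, 1.5, 0.0), 2.0)
scene = Scene(target=man, candidates=[skip])
phrase = LocativePhrase(target=man, preposition="by", reference=skip)
```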
1.2 Spatial Language Generation
While much work on the characteristics of landmarks and some on the characteristics of reference objects in general has been undertaken (see Sect. 2), there has been no attempt to combine these into a model suitable for machine learning of the task of reference selection and hence of an important part of spatial language generation and scene description. This paper proposes such a model. Machine learning of spatial preposition use has been attempted by Regier [5], Lockwood et al. [6] and Coventry et al. [7] among others, but these systems have ‘pre-selected’ references. Machine learning of reference selection takes place to an extent in the “Describer” system (Roy [8]). This uses a 2-dimensional scene set with limited influences on reference object choice, although it is impressive in its scope, tackling reference choice, target (referent) disambiguation and preposition assignment simultaneously. The “VITRA” system [9] is an ambitious scene description system including all elements of spatial language, but it does not include machine learning. Implementations of the model described in this paper are expected to be trained with, and tested on, 3-dimensional scenes which are schematised (see Herskovits [10]) representations of reality.
1.3 Scope of the Investigation
Discourse and Context. Returning to Fig. 1, it can be seen that answering "on the sidewalk" to the question "where is the man?" is appropriate if the man had previously been in the road with a bus approaching. A discourse process or context may have raised the salience of the sidewalk to the point where it was a better reference than the skip or the pink house. For the purposes of the current paper the scenes for which reference choice is to be made are presumed to be memoryless, i.e., the time leading up to the point at which a scene is described is unknown or non-existent. The multiple added complexities of describing a scene during, or as part of, a discourse will be the subject of future work. Functionally Locative Phrases. As noted by Miller and Johnson-Laird [11], phrases of the type "the bird is out of the cage" conform to the template of a spatially locative phrase but provide no effective communication about the location of the target object. Instead the purpose of the phrase is to convey the information that the cage is not fulfilling its normal containment function. The same can be said to be true of the phrase "The flowers are in the vase", which, if the vase is as mobile as the flowers, conveys only the information that the vase is performing its containment function. If someone is trying to find the flowers the phrase "the flowers are in the bedroom" is more likely to be helpful than "the flowers are in the vase". If "the bird is in the cage" and the cage is fixed and its location known then the cage is also a good spatial reference for the bird. In this paper the assumption is made that the aim is to select a good spatial reference, irrespective of any functional relationship, and that the existence of a functional relationship between a target and a reference does not of itself make the reference more suitable (see [12]).
1.4 Structure of Paper
The remainder of this paper is structured as follows: Section 2 gives an overview of the literature from linguistics and landmarks concerning the characteristics of reference objects. Section 3 develops an influence model for reference object selection starting from the function of a spatially locative phrase. Section 4 discusses possible computational methods for implementing the model and issues surrounding these.
2 Reference Object Characteristics
2.1 Linguistic Investigations
Miller and Johnson-Laird [11] note that the scale of the reference and target objects is important in selection of a reference: “It would be unusual to say
that the ashtray is by the town-hall". Talmy [1] lists attributes of located and reference objects, and states that relative to the located object the reference is:
1. More permanently located
2. Larger
3. Geometrically more complex in its treatment
4. Earlier on the scene / in memory
5. Of lesser concern / relevance
6. More immediately perceivable
7. More backgrounded when located object is perceived
8. More independent
Thus the reference is likely to be somewhat bigger, if not vastly so, than the target object. This scale issue is discussed in Sect. 3.3, permanence and perceivability in Sect. 3.2. These are not intended as absolute categorisations and the model developed in this paper embodies the concept that the influences can be traded against each other. For instance the phrase “the bicycle is leaning on the bollard” uses as a reference an object smaller than the target (less appropriate) but more permanently located (more appropriate). Bennett and Agarwal [13] investigate the semantics of ‘place’ and derive a logical categorisation of reference attributes. De Vega et al. [14] analyse Spanish and German text corpora and (with a restricted range of prepositions) find that reference objects are more likely to be solid and countable (i.e., not a substance like ‘snow’). It should be noted that the corpora were taken from novels rather than first-hand descriptions of real scenes. Recent experimental work by Carlson and Hill [12] indicates that the geometric placement of a reference is a more important influence than a conceptual link between target and reference, and that proximity and joint location on a cardinal axis (e.g., target directly above or directly to the left of reference) are preferred (see Sect. 3.3). The experiments were carried out using 2-dimensional object representations on a 2-dimensional grid. Earlier work by Plumert et al. [15] focusses on hierarchies of reference objects in compound locative phrases but also finds that in particular the smallest reference in the hierarchy might be omitted if the relationship between it and the target did not allow sufficient extra information to be provided (see Sect. 3.4).
2.2 Landmark Characteristics
A considerable body of work on landmarks exists, including the role of landmarks in cognitive mapping and structuring of space, which cannot be comprehensively reviewed here. Of more relevance to the present paper is the practical matter of selecting landmarks when giving way-finding instructions and much of this work can be related directly to general reference object selection, augmenting the work from linguistics. Various researchers have looked at the nature of objects chosen as landmarks and derived characteristics of good landmarks. Burnett et al. [16] deal with the
case of urban navigation. Based on interviewing subjects who have chosen particular landmarks in an experimental setting, they derive the following characteristics of good landmarks:
1. Permanence
2. Visibility
3. Usefulness of location
4. Uniqueness
5. Brevity of description
They also note that most landmarks do not exhibit all of the desired characteristics; indeed, the most frequently used landmarks, traffic lights, are ubiquitous rather than unique. This is discussed in Sect. 3.4. The factors which contribute to “visual and cognitive salience” in urban wayfinding are investigated by Raubal and Winter [17] and Nothegger et al. [18], who test automatically selected landmarks against those selected by humans. The measure of saliency for visual features is complex. Nothegger et al. [18] point out that using deviation from a local mean or median value (for example in a feature such as building colour) to represent salience does not hold for asymmetric quantities such as size, where bigger is usually better than smaller. Cognitive salience, including cultural or historic significance, is in practice related to the issue of prior knowledge of the landmark by the listener and is discussed in Sect. 3.2. Winter [19] adds advance visibility to the list of desirable characteristics for landmarks, citing both way-finder comfort and reduced likelihood of reference frame confusion as reasons. Sorrows and Hirtle [20], along with singularity (sharp contrast from the environment), prominence (visibility) and cultural or historic significance, which are picked up in the lists already mentioned, also list accessibility and prototypicality as characteristics of landmarks. Accessibility (as in the junction of many roads) may make a landmark more frequently used and may lead to the accretion of other characteristics useful for way-finding, but it probably mostly denotes usefulness of location, which is further discussed in Sect. 3.3. Prototypicality is an important factor because, without specific knowledge of a landmark or reference, categorical knowledge is required. A church which looked like a supermarket would be a problematic reference. Tezuka and Tanaka [21] note that landmark use is relative to the task at hand, mode of transport and time of day. A good landmark for pedestrian navigation is not necessarily good for car drivers. This seems always to be expressible in terms of visibility etc. but highlights the need for speed, conditions and viewpoint to be taken into account in assessing visibility. Also cultural factors, preferences and, according to Klabunde and Porzel [22], social status may affect landmark choice. In [21] a reinforcement mechanism is proposed whereby landmark usage effectively improves the goodness of the landmark. The initial choice of a landmark
which subsequently becomes much used would presumably have been made because it displayed characteristics of a good landmark. However, an object’s prior use as a landmark may cause continuation of use even if an otherwise more suitable landmark appears. A related case is noted in [20], “turn left where the red barn used to be”, where the use of the landmark outlives the landmark itself.
3 Processing a Locative Phrase
3.1 Three Primary Influences on Reference Object Suitability
The three primary influences on reference object suitability can be derived from the necessary steps a listener must take on hearing a locative phrase, with the addition of a cost function. Presented with a locative phrase and the task of finding the target object the listener must do two things:
1. Locate the reference object.
2. Search for the target object in the region constrained by combining the reference object location with the spatial preposition.
Making the assumption that the speaker intends his communication to be effective, or at least is trying to cooperate with the listener, it will follow that the speaker will have chosen the reference object to be easily locatable, and also that, in conjunction with the preposition, the reference will optimise the region in which the listener must search for the located object. There is some evidence for this cooperation with (or consideration for) the listener in spatial communication (see Mainwaring et al. [23]) in the adoption of reference frames that reduce the mental effort required by the listener. The functional basis of this analysis also leads to the addition of a third criterion for reference objects, the communication cost of using them (see Grice [24] on brevity and giving the optimum amount of information). Communication cost will be an important consideration if a potential reference is ambiguous or if the time taken for the communication is comparable to the time the listener will take to locate the object. The following sections expand on the aspects of reference objects which contribute to the three primary influences shown in Fig. 2.
Fig. 2. Three primary influences on reference suitability
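As a minimal sketch of how the three influences of Fig. 2 might combine, the Python fragment below scores one candidate reference with a weighted sum. The weighted-sum form and the weights are our own placeholders; the paper envisages learning these relationships in a Bayesian network rather than fixing a formula by hand.

```python
# Minimal sketch: combine the three primary influences of Fig. 2 into a
# suitability score for one candidate reference. A weighted sum is used
# purely for illustration; the paper proposes learning these relationships
# with a Bayesian network rather than fixing them by hand.

def reference_suitability(locatability: float,
                          search_space_optimisation: float,
                          communication_cost: float,
                          weights=(0.4, 0.4, 0.2)) -> float:
    """All inputs are assumed normalised to [0, 1];
    higher communication cost lowers suitability."""
    w_loc, w_search, w_cost = weights
    return (w_loc * locatability
            + w_search * search_space_optimisation
            - w_cost * communication_cost)

# Hypothetical values for "by the skip" vs. "to the right of the road" in Fig. 1:
print(reference_suitability(0.8, 0.9, 0.2))  # skip: easy to find, small search space
print(reference_suitability(0.9, 0.3, 0.3))  # road: easy to find, poor search space
```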
3.2 Influences on Reference Locatability
Specific and Categorical Knowledge. For a reference object to be locatable the listener must either have specific knowledge of the object in question, or have categorical knowledge of the type of object in question so that it is apparent when visually encountered. Specific knowledge may substitute for or enhance categoric knowledge: for instance, in the case of “The National Gallery” specific knowledge of the building in question would be required by the listener if we accept that there is no visual category for art galleries. “St Paul’s Cathedral” may be specifically known but it is also clearly a member of its category and hence could be referred to as “the Cathedral” in some circumstances. Since the influence model is for reference choice it is more appropriate to term these two primary influences on reference locatability “reference apparency” for the case where categoric knowledge only is assumed and “degree of belief in listener specific knowledge” for the case where specific knowledge is relied on. These are shown in Fig. 3.
Fig. 3. Influences on reference locatability
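The structure of Fig. 3 can be written down directly as a parent list for each node, which is roughly the form a Bayesian-network toolkit expects. The sketch below transcribes the arrows as we read them from the figure and the surrounding text; it supplies no conditional probability tables, which would still have to be specified or learned.

```python
# Structure of Fig. 3 as a mapping from each node to its parents (influences),
# transcribed from the figure and the accompanying discussion. This encodes
# structure only; the quantitative relationships are left unspecified.
FIG3_PARENTS = {
    "reference visibility": ["reference size", "reference obscurance",
                             "reference visual contrast", "listener approach"],
    "reference persistence": ["reference mobility", "target mobility",
                              "temporal relevance (listener presence)"],
    "reference apparency": ["reference visibility", "reference persistence",
                            "reference prototypicality"],
    "degree of belief in listener specific knowledge": [
        "reference general significance", "knowledge of listener's past locales"],
    "reference locatability": ["reference apparency",
                               "degree of belief in listener specific knowledge"],
    "reference suitability": ["reference locatability",
                              "search-space optimisation", "communication cost"],
}

print(FIG3_PARENTS["reference locatability"])
```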
Degree of Belief in Listener’s Specific Knowledge. From the studies of landmarks discussed in Sect. 2.2, the value of cultural or historic significance in landmark choice is clear and can be simply represented as an influence on the speaker’s degree of belief in listener’s specific knowledge as such. (A historically significant reference is more likely to be known, this is termed “reference general significance” in Fig. 3) The second influence on the speaker’s degree of belief in listener specific knowledge comes from the speaker’s knowledge of the listener. For instance in Shaftesbury, “Gold Hill”, a well known landmark, would be useful in giving directions to visitors or locals; “Shooters hill”, less well known, would only be of use if the speaker knew that the listener was local to Shaftesbury. Sorrows and Hirtle [20] give a related example relating reference choice to frequency of visits to a building. This influence is included in Fig. 3 as “speaker’s knowledge of listener’s past locales”. Indeed this factor may influence more than just reference object choice for a simple locative phrase and may influence whether a simple or complex locative phrase can be used (see Sect. 4.5). For example “In front of St Martin in the Fields” is likely to be replaced by “In front of the church at the north east corner of Trafalgar square” for a listener unacquainted with the church in question. Reference Apparency. For a reference to be apparent to a listener who has categoric knowledge of the object in question it must be a good representative of the category (be prototypical [20]), must be visible, and must persist until it is no longer required as a reference (be persistent, or permanent [16]). Note that although ambiguity may be thought to have a direct influence on apparency, the way it is proposed to deal with ambiguous references results in them influencing either communication cost or search space optimisation. This is discussed in Sect. 3.4. It is an open question as to whether persistence should be a direct influence on apparency or considered an influence on visibility as “visible at the time required”. As it should make no difference to the model output (although it may reduce comprehensibility) it is left as a direct influence at present. Prototypicality. This is a complex area and initial computer implementations of the influence model will not include this parameter. Size, geometry and presence or absence of features will all influence prototypicality. Further study of relevant literature and consideration of methods of representation will be required before this can be brought within the scope of the model. At present reference objects are assumed to be recognisable members of their category. Visibility. This is affected by many factors identified in the landmark studies cited in Sect. 2.2. Size, obscurance, brightness, colour contrast, shape factor are all relevant. The speed of travel of the listener [21] and the direction of approach of the listener [19] are also important. These are included in Fig. 3 as the influences “reference size”, “reference obscurance”, “reference visual contrast” and “listener approach”. Note that even something as seemingly simple as size may
have multiple influences. For size, bounding box volume, convex hull volume, actual volume, maximum dimension and sum of dimensions are all possible candidate influences. The apparent size, the area projected toward the speaker, may in some cases be more important than the actual size. Raubal and Winter [17] note this in the case of building façades, for instance. These are omitted from Fig. 3 for simplicity, although they will be included in model implementations. Persistence. Following Talmy [1] and the work by de Vega et al. [14] it is clear that both the target object and candidate reference object mobility influence reference choice. Intuitively, the reference object is expected to be more stable (see [25]) than the target. Also important, as pointed out by Burnett [16], is when the listener will need to use the reference to find the target. If in Fig. 1 the target object is the post box and the listener will not be at the scene for some time, then the pink house, rather than the skip (which may be removed), will be a better reference even though the skip is nearer and plainly visible. This factor is summarised as “Temporal relevance (listener presence)” in Fig. 3.
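Several of the size-related candidate influences just listed (bounding-box volume, maximum dimension, sum of dimensions, and apparent size as projected area) are straightforward to compute from an object's bounding box. The sketch below assumes an axis-aligned box and a viewer looking along the y-axis; actual volume and convex-hull volume are omitted because they need the full geometry. These are illustrative feature definitions, not the authors' implementation.

```python
# Candidate size features for a reference object, computed from an
# axis-aligned bounding box (extents in metres). Assumes the listener looks
# roughly along the y-axis, so "apparent size" is approximated by the x-z face.

def size_features(extent_x: float, extent_y: float, extent_z: float) -> dict:
    return {
        "bounding_box_volume": extent_x * extent_y * extent_z,
        "max_dimension": max(extent_x, extent_y, extent_z),
        "sum_of_dimensions": extent_x + extent_y + extent_z,
        "apparent_area_from_y": extent_x * extent_z,  # projected toward the viewer
    }

print(size_features(6.0, 8.0, 10.0))  # e.g. a small house
```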
3.3 Searching for the Target Object
This is a conceptually simpler task than finding the reference object. Less knowledge about the listener is required and the target object itself, its characteristics and function, do not come into focus until it is found (when, as Talmy [1] notes, “the reference object is backgrounded”). Scene Scale. As already noted, Miller and Johnson-Laird [11] point out that the scale of the reference and located objects is important in determining whether a reference is appropriate. It is proposed here, following Plumert et al. [15], that this is due to the influence on the search space. Choosing a large reference may make the reference more apparent but may leave the listener a difficult task finding the target object as, along with any preposition, it defines too large a region of interest (e.g., “the table is near Oxford”). Reference size must be treated carefully as, dependent on geometry, the search space may vary considerably. To say a target object is “next to the train” defines a large search area but to say that it is “in front of the train” defines a much smaller area. Computational models illustrating this can be seen in Gapp [26]. Geometry here is effectively a shorthand term for what might be termed “projected area in the direction of the target”. A further important influence on search space is the location of the listener relative to a target object of a given size. As Plumert et al. [15] point out, if the target object is a safety pin and the listener is more than a few yards away, there may be no single suitable reference. This factor is included with reference size and geometry and target object size as influences on “scene scale” (see Fig. 4), which in turn influences search space. The real effect of some critical combination of a small target object and a distant listener will be to suppress the suitability of all reference objects and force the decision to use a compound locative phrase containing more than one reference. This is discussed in Sect. 4.5.
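The scene-scale point can be illustrated with a deliberately crude proxy for the search region picked out by a preposition: for a long reference such as a train, "next to" yields a far larger area than "in front of". The region formulas below are naive placeholders, not the computational models cited (e.g. Gapp [26]).

```python
# Crude proxies for the search area defined by a reference object plus a
# preposition, to illustrate why reference size and geometry matter for
# search-space optimisation. The area formulas are naive placeholders.

def search_area(preposition: str, ref_length: float, ref_width: float,
                band: float = 2.0) -> float:
    """Approximate search area (square metres) around a rectangular reference.
    'next to' searches a band along the whole perimeter; 'in front of'
    searches a band across one end only."""
    if preposition == "next to":
        return 2 * (ref_length + ref_width) * band
    if preposition == "in front of":
        return ref_width * band
    raise ValueError(f"no region model for {preposition!r}")

# A 100 m x 3 m train:
print(search_area("next to", 100.0, 3.0))      # 412 square metres
print(search_area("in front of", 100.0, 3.0))  # 6 square metres
```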
[Figure omitted: influence diagram with nodes reference ambiguity, disambiguation by grouping, reference size, target size, reference geometry, reference proximity, reference/target topology, reference locatability, target obscurance, target visibility, listener location, cardinal axis placement, reference location, scene scale, search-space optimisation, communication cost, and reference suitability.]
Fig. 4. Influences on search-space optimisation
Reference Location. This is likely to affect search-space optimisation in two ways. Firstly, the simple proximity of the reference to the target reduces the search space; secondly, the presence of the target on a cardinal axis (where the reference is the origin) appears to make the search easier. This is apparent in Carlson and Hill [12] and also in studies on preposition appropriateness (see for instance [27]). Intuitively, given the preposition "above" and a reference, the listener will locate the reference and move his eyes up from there until the target is encountered. Given a reference and the direction "above and to the left" the process is much more involved. Proximity and cardinal axis placement are shown in Fig. 4 as influencing reference location, which in turn influences search space.

Reference/Target Topology. From the study by Plumert et al. [15] it is clear that, as well as the geometry of the reference and target, the topological relationship between them is also important. If a target object was "on the book on the table", the book was more likely to be included as a reference than if the target was "near the book on the table" (in which case the target was simply "on the table"). The search space would appear to be comparable, but
in the case where the target was "on the book" the extra communication cost of using the two references was considered worthwhile by the speaker. It is possible that there is a perceived chance of confusion, in that an object "on A which is on B" is not necessarily seen as "on B" (i.e., "on" is not always accepted as transitive, although this is not necessarily the same as Miller and Johnson-Laird's limited transitivity [11]). The reference/target topology influence is included in the model at present pending further testing of its relevance. The inclusion of reference ambiguity along with disambiguation by grouping in Fig. 4 is discussed in Sect. 3.4.

3.4 Communication Cost
Reference Innate Cost. The costs of simple references such as "hill", "house" or "desk" are typically fairly comparable. However, references can be parts of objects (see [14]), such as "the town hall steps", or regions, such as "the back of the desk". The distinction between this form of reference and a compound reference is that there is still only a single preposition (in contrast to "in front of the town hall by the steps"). It is clear that references of this nature incur cost both for the speaker and the listener, and in a computational model a cost function will be required to prevent over-specification of a reference (e.g., "The back right hand corner of the desk") when a less specific reference ("on the desk") would be sufficient. Sufficiency here is clearly related to the difficulty of the search task. How these costs are quantified in the model, beyond a simple count of syllables (which will be used in initial implementations; see the sketch at the end of this section), needs further investigation.

Search Task Difficulty. It was earlier noted that communication cost would become important if the time taken for the communication approached that required for the speaker to locate the target. As noted, this is a factor in the results reported by Plumert et al. [15]. The study concluded that a secondary reference might be omitted because the target was "in plain view", although the topological relationships involved were also a factor (see Sect. 3.3). Much of the search task difficulty is already expressed in the model as search-space optimisation and does not require re-inclusion as a factor influencing communication cost; however, some factor is required in the model to represent the speed of visual search the listener is capable of. This should be more or less constant for human listeners and, if not, would require the speaker to know whether the listener was much slower or quicker than normal, which is outside the scope of the model at this point. As a constant it should be incorporated automatically into the weights of the model as it is learned and so is not explicitly included in Fig. 5.

Reference Ambiguity. Two possibilities exist for a speaker confronted with an ambiguous reference in the case of a spatially locative phrase, as opposed to the case in which the object is the intended referent in a referring expression, when disambiguation is mandatory. Consider a scene such as that in Fig. 6. The speaker can choose to aggregate the ambiguous references into a single unambiguous
[Figure omitted: influence diagram with nodes reference ambiguity, disambiguation by specification, reference locatability, reference innate cost, search-space optimisation, communication cost, and reference suitability.]
Fig. 5. Influences on Communication Cost
reference, as in "The bus-shelter is in front of the grey houses", or disambiguate, as in "The bus-shelter is in front of the second grey house". The first of these alternatives creates a reference with different size and geometry and hence has the effect included in Fig. 4. The second increases the communication cost of using the reference object. Methods for disambiguation and algorithms for arriving at suitable phrases are addressed in the literature on referring expressions (see for instance [2]); for an empirical study of disambiguation using spatial location see [28]. As an example of the balancing influences between communication cost and search-space optimisation, consider the use of traffic lights as 'landmarks' in way-finding instructions. Burnett [16] notes the frequency of their use in spite of their ubiquity and the impossibility of disambiguating them except by the use of a count specifier (e.g., "turn right at the third set of traffic lights"). The communication cost of the count specifier is quite high both for the speaker and the listener, but the precision with which the traffic lights optimise the search-space for the target (effectively the "right turn" in this example) makes them a suitable reference. This influence on communication cost is shown in Fig. 5 as "disambiguation by specification".
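As a concrete illustration of the innate-cost measure mentioned under "Reference Innate Cost" above, the following minimal sketch approximates communication cost by a syllable count. The vowel-group heuristic used here is an assumption of this sketch, not the specific counting method adopted in the implementation.

import re

def syllable_count(word: str) -> int:
    # Rough estimate: each maximal run of vowels is treated as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def innate_cost(phrase: str) -> int:
    # Communication cost of a reference phrase as its total estimated syllables.
    return sum(syllable_count(w) for w in re.findall(r"[a-zA-Z]+", phrase))

print(innate_cost("on the desk"))                              # 3
print(innate_cost("the back right hand corner of the desk"))   # 9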
4 Discussion: Implementation Possibilities for the Influence Model

4.1 Computational Structure
The problem as stated in Sect. 1.1, "to identify a suitable reference object from the many present in a scene", is essentially a classification task. As such, a wide variety of well-documented techniques is available for solving the problem. The least flexible and most error-prone would be to reduce the model by hand to an
Fig. 6. “The bus shelter is in front of ?”
algorithm with constant values decided from (necessarily) piecemeal experimentation. Some form of supervised machine learning approach would seem more promising. From the machine learning literature, decision trees, neural networks, probabilistic (Bayesian) networks, and kernel machines would all appear to be potential candidates for a computational structure. With a view to retaining, as well as to some extent validating, the semantics of the derived variables in the model, a decision tree or probabilistic network approach would be favoured, although a constrained neural network approach such as that of [5] could be used. A Bayesian network approach has been adopted initially, without any further claim being made for its superiority.
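To indicate how such a network might operate once trained, the fragment below hand-codes a tiny discretized polytree relating two evidence variables to a hidden visibility node and then to suitability, and marginalises over the hidden node. The variable set, discretization, and all probability values are hypothetical placeholders for illustration; they are not the learned model.

# Hypothetical network fragment: all conditional probabilities are invented.
p_visibility = {   # P(visibility | reference size, reference obscurance)
    ("large", "unobscured"): {"high": 0.9, "low": 0.1},
    ("large", "obscured"):   {"high": 0.4, "low": 0.6},
    ("small", "unobscured"): {"high": 0.6, "low": 0.4},
    ("small", "obscured"):   {"high": 0.1, "low": 0.9},
}
p_suitable = {     # P(reference suitability | visibility)
    "high": 0.8,
    "low": 0.2,
}

def suitability(size: str, obscurance: str) -> float:
    # Marginalise the hidden visibility variable out of the polytree fragment.
    vis = p_visibility[(size, obscurance)]
    return sum(prob * p_suitable[level] for level, prob in vis.items())

candidates = {"pink house": ("large", "unobscured"), "skip": ("small", "obscured")}
for name, evidence in candidates.items():
    print(name, round(suitability(*evidence), 2))   # 0.74 vs. 0.26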
4.2 Model Output
The influence model, as illustrated in Figs. 2, 3, 4 and 5, when implemented as a Bayesian network and trained, will return a suitability value for each reference object in a scene, rather than a single chosen reference object. The number of values assigned to the various variables in the network will make it more or less likely that more than one candidate reference object will have the same suitability value. Since it is clear that in a typical scene there would be no consensus reference chosen by a group of human subjects, having several judged suitable by the model is not unreasonable. What is also clear is that in some scenes there will be no suitable single reference, and this must be reflected in the model output. The absence of a suitable reference would be the trigger for the formation of a compound locative phrase or description; this is briefly discussed in Sect. 4.5. The model must evaluate the candidate reference objects in some order, and how a suitable order is determined is not clear. The evaluation order may prove to be important if the first suitable reference is to be returned or if pruning of
the evaluation is required (i.e., ignoring references that are clearly unsuitable). Evidence from research on visual search (see for example Horowitz and Wolfe [29]), although not directly applicable to the reference choice task, may help guide experiments in this area.

4.3 Complexity of the Model
The model can easily be represented as a polytree (i.e., no more than one path connects any two variables) and, although many of the observable parameters are by nature continuous, initial versions of the model can use discretized values. The number of variables and the number of values required for a given variable will be the subject of future experimentation. Considering the model as presented in Figs. 2, 3, 4 and 5, there are 20 evidence variables and 10 hidden variables as well as the output variable. As an indication of the model size, if all variables are 5-valued the model can be represented with 20·5^1 + 6·5^3 + 3·5^4 + 2·5^5 = 8975 values (as opposed to approximately 4.7×10^14 for the conditional probability table with no hidden variables). Representation and evaluation of a model of this scale and form is straightforward.
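The parameter count quoted above follows directly from the polytree structure; the two-line check below reproduces it, with the grouping of hidden and output variables by number of parents (six with two parents, three with three, two with four) taken as implied by the breakdown in the formula rather than stated explicitly in the text.

# 20 five-valued evidence variables (priors) plus 11 five-valued hidden/output
# variables grouped by number of parents.
total = 20 * 5**1 + 6 * 5**3 + 3 * 5**4 + 2 * 5**5
print(total)           # 8975
print(f"{5**21:.2e}")  # ~4.77e+14: full table over the 21 observable variables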
4.4 Training Data Set
A scene corpus for training and testing the model is currently under development. The scenes depicted in Figs. 1 and 6 have been taken from the corpus. The design and construction of the corpus is described in [30]. The practical limit for the initial size of the corpus is likely to be around 2000 test cases in some 500 scenes. The training data available for the model limits the influences that can be included in it to some extent. Considering Fig. 3, the following simplifications are made:
1. There is no way at present of including learnable measures of "reference general significance" or "speaker knowledge of listener's past locales". A simple mechanism for tagging some of the objects in the scene as specifically known to the listener will be used initially.
2. As noted, there is no attempt to measure or learn "prototypicality". Prototypicality by itself will, in all probability, require a model of similar complexity to the one developed here.
3. "Listener approach" is limited to a "scene entry point" at present.
4. However, learnable information about target and reference mobility is available, as the training scenes are sequences of pictures of locations with some objects moving and some not.
5. Also, although not strictly limited by the training data, only a simple measure of communication cost related to utterance length will be used. This clearly does not express many aspects of the mental effort involved (in disambiguation in particular), which will need to be the subject of future work.
The corpus contains scenes derived from real situations and scenes designed to test specific elements of the model. These designed scenes include, for instance,
Fig. 1, in which the sidewalk is inappropriate (due to its linear extension) as a reference for the location of the man, although it is the closest object to him. All objects and parts of objects in a scene are potential references; other limitations on the representation of scenes are described in [30].

4.5 Simple and Compound Locative Phrases
As noted in Sect. 4.2, if no suitable single reference is found by the model, a compound locative phrase is required. Various possible algorithms for this can be investigated using the model as described, in an iterative fashion. For instance, this could be achieved by conceptually moving the listener within the scene to a point closer to the target object and selecting an appropriate reference object, then making this reference the new target object with the listener moved further towards his initial position. Whether the model would be effective in this task without some learned concept of scale-spaces (see for example [31]), and how this learning would be achieved if required, are open questions. Also problematic is the case where a reference object requires disambiguation by use of a second reference, as in "the keys are on the desk under the window" in a room with more than one desk. As noted, the work of Roy [8] addresses this in a simple context, disambiguating both the target and reference objects as necessary. A reference must be provided that is suitable for the desired primary reference and unsuitable for any distractors. In practice, using the model to achieve this may be easier than detecting the problem in the first place and recognising that, though ambiguous, the desk is still a good reference because of its conventional use in defining a space (indeed in typifying a scale) where objects are collected. Cases of reference combinations that are not hierarchical, such as "the library is at the intersection of 5th Street and 7th Avenue", will also need to be the subject of future work.
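A sketch of the iterative strategy just described is given below. The functions best_reference (the influence model's top-ranked reference and its suitability for a given listener position) and midpoint (a point conceptually closer to the current target) are placeholders assumed for illustration; the sketch is one possible algorithm, not the one adopted in the implementation.

def compound_locative_chain(target, listener, best_reference, midpoint,
                            threshold=0.5, max_links=3):
    # Chain references when no single reference is suitable: choose a reference
    # that works for an imagined listener placed closer to the target, then
    # treat that reference as the new target from the listener's true position.
    chain, current = [], target
    for _ in range(max_links):
        ref, suitability = best_reference(current, listener)
        if suitability >= threshold:
            return chain + [ref]          # suitable from the true position: done
        imagined = midpoint(listener, current)
        ref, _ = best_reference(current, imagined)
        chain.append(ref)
        current = ref                     # now locate the reference itself
    return chain

# e.g. for an unfamiliar church this might yield
# ["church", "north-east corner of Trafalgar Square"].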
5 Conclusions, Further Work
Reviewing recent literature on landmarks and references enables a list of relevant characteristics to be drawn up. Reasoning from this list and from the function of spatially locative phrases allows an organisation to be imposed on the characteristics which is lacking in the literature. The resulting model will enable effective probabilistic modelling and machine learning of reference (and hence landmark) suitability. The speaker's assumed knowledge of the listener in initial implementations of the model is (for practical reasons) limited to the listener's location, his temporal requirements for the spatial information, and whether he has prior knowledge of a candidate reference object. A more sophisticated model of the listener would repay investigation. The model could also be expanded to include the different cases where multiple or compound locative phrases are planned by the speaker. As noted in the discussion, development of an automated system for reference object choice based on the analysis in this paper is currently under way. Initial
results from a limited model, containing some 8 variables relating to target and reference geometry, trained with a 320-case scene corpus, suggest that results from the full model will be very worthwhile.
References
[1] Talmy, L.: Toward a Cognitive Semantics. MIT Press, Cambridge (2000)
[2] Dale, R., Reiter, E.: Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science 19, 233–263 (1995)
[3] Duwe, I., Kessler, K., Strohner, H.: Resolving ambiguous descriptions through visual information. In: Coventry, K.R., Olivier, P. (eds.) Spatial Language. Cognitive and Computational Perspectives, pp. 43–67. Kluwer Academic Publishers, Dordrecht (2002)
[4] van Deemter, K., van der Sluis, I., Gatt, A.: Building a semantically transparent corpus for the generation of referring expressions (2006)
[5] Regier, T.: The human semantic potential: Spatial language and constrained connectionism. MIT Press, Cambridge (1996)
[6] Lockwood, K., Forbus, K., Usher, J.: SpaceCase: A model of spatial preposition use. In: Proceedings of the 27th Annual Conference of the Cognitive Science Society (2005)
[7] Coventry, K.R., Cangelosi, A., Rajapakse, R., Bacon, A., Newstead, S., Joyce, D., Richards, L.V.: Spatial prepositions and vague quantifiers: Implementing the functional geometric framework. In: Proceedings of Spatial Cognition Conference (2004)
[8] Roy, D.K.: Learning visually-grounded words and syntax for a scene description task. Computer Speech and Language 16(3) (2002)
[9] Herzog, G., Wazinski, P.: Visual translator: Linking perceptions and natural language descriptions. Artificial Intelligence Review 8, 175–187 (1994)
[10] Herskovits, A.: Schematization. In: Olivier, P., Gapp, K.-P. (eds.) Representation and Processing of Spatial Expressions, pp. 149–162. Lawrence Erlbaum Associates (1998)
[11] Miller, G.A., Johnson-Laird, P.N.: Language and perception. Harvard University Press (1976)
[12] Carlson, L.A., Hill, P.L.: Processing the presence, placement, and properties of a distractor in spatial language tasks. Memory and Cognition 36, 240–255 (2008)
[13] Bennett, B., Agarwal, P.: Semantic categories underlying the meaning of 'place'. In: Winter, S., Duckham, M., Kulik, L., Kuipers, B. (eds.) COSIT 2007. LNCS, vol. 4736. Springer, Heidelberg (2007)
[14] de Vega, M., Rodrigo, M.J., Ato, M., Dehn, D.M., Barquero, B.: How nouns and prepositions fit together: An exploration of the semantics of locative sentences. Discourse Processes 34, 117–143 (2002)
[15] Plumert, J.M., Carswell, C., DeVet, K., Ihrig, D.: The content and organization of communication about object locations. Journal of Memory and Language 34, 477–498 (1995)
[16] Burnett, G.E., Smith, D., May, A.J.: Supporting the navigation task: characteristics of good landmarks. In: Proceedings of the Annual Conference of the Ergonomics Society. Taylor & Francis, Abington (2001)
[17] Raubal, M., Winter, S.: Enriching wayfinding instructions with local landmarks. In: Egenhofer, M.J., Mark, D.M. (eds.) GIScience 2002. LNCS, vol. 2478, pp. 243–259. Springer, Heidelberg (2002)
[18] Nothegger, C., Winter, S., Raubal, M.: Computation of the salience of features. Spatial Cognition and Computation 4, 113–136 (2004)
[19] Winter, S.: Route adaptive selection of salient features. In: Kuhn, W., Worboys, M., Timpf, S. (eds.) COSIT 2003. LNCS, vol. 2825. Springer, Heidelberg (2003)
[20] Sorrows, M., Hirtle, S.: The nature of landmarks for real and electronic spaces. In: Freksa, C., Mark, D. (eds.) Spatial Information Theory: Cognitive and Computational Foundations of GIS. Springer, Heidelberg (1999)
[21] Tezuka, T., Tanaka, K.: Landmark extraction: A web mining approach. In: Cohn, A.G., Mark, D.M. (eds.) COSIT 2005. LNCS, vol. 3693. Springer, Heidelberg (2005)
[22] Klabunde, R., Porzel, R.: Tailoring spatial descriptions to the addressee: a constraint-based approach. Linguistics 36(3), 551–577 (1998)
[23] Mainwaring, S.D., Tversky, B., Ohgishy, M., Schiano, D.J.: Descriptions of simple spatial scenes in English and Japanese. Spatial Cognition and Computation 3(1), 3–43 (2003)
[24] Grice, H.P.: Logic and conversation. In: Cole, P., Morgan, J. (eds.) Syntax and Semantics: Speech Acts, vol. 3, pp. 43–58. Academic Press, New York (1975)
[25] Vandeloise, C.: Spatial Prepositions. University of Chicago Press (1991)
[26] Gapp, K.P.: An empirically validated model for computing spatial relations. Künstliche Intelligenz, pp. 245–256 (1995)
[27] Regier, T., Carlson, L.: Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General 130(2), 273–298 (2001)
[28] Tenbrink, T.: Identifying objects on the basis of spatial contrast: An empirical study. In: Freksa, C., Knauff, M., Krieg-Brückner, B., Nebel, B., Barkowsky, T. (eds.) Spatial Cognition IV: Reasoning, Action, Interaction. International Conference Spatial Cognition 2004, pp. 124–146. Springer, Heidelberg (2005)
[29] Horowitz, T.S., Wolfe, J.M.: Search for multiple targets: Remember the targets, forget the search. Perception and Psychophysics 63, 272–285 (2001)
[30] Barclay, M.J., Galton, A.P.: A scene corpus for training and testing spatial communication systems (in press, 2008)
[31] Montello, D.R.: Scale and multiple psychologies of space. In: Frank, A.U., Campari, I. (eds.) COSIT 1993. LNCS, vol. 716, pp. 312–321. Springer, Heidelberg (1993)
Tiered Models of Spatial Language Interpretation
Robert J. Ross
SFB/TR 8 Spatial Cognition, Universität Bremen, Germany
[email protected]
Abstract. In this paper we report on the implementation and evaluation of an interactive model of verbal route instruction interpretation. Unlike previous work, this approach takes a generalised plan construction view of the interpretation process, which sees the elements of verbal route instructions as context-enhanced specifications of physical action that move an agent through a mental or physical search space. We view such an approach as essential for effective dialogic wayfinding assistance. The model has been developed within a modular framework of spatial language production and analysis which we have developed to explore different reasoning and representation facilities in the spatial language cognition process. We describe the developed cognitive spatial model, the interpretation of individual actions based on explicit linguistic and extra-linguistic spatial context, and the interactive plan construction process which builds these individual processes into a route interpretation mechanism. Finally, we report on a recent evaluation study that was conducted with an initial implementation of the described models.

Keywords: Spatial Language, Language Interpretation, Embodied Action.
1 Introduction
While particular semantics and schema based models of spatial language use have been proposed in the literature [1,2], as well as layered spatial representation and reasoning models [3,4], and a wealth of qualitative and quantitative models of spatial reasoning (see [5] for a review), the processing of spatial language remains challenging both because of the complexities of spatial reasoning and because of the inherent difficulties of language processing arising from the remarkable efficiency of spoken communication (see [6] for a discussion). For the development of sophisticated linguistically aware spatial applications, it is not only necessary to develop spatial reasoning systems, but also to identify the properties of spatial language - particularly with respect to which elements of language are left under-specified for efficient communication, and which hence must be retrieved through other mechanisms. In this paper we will consider this problem of moving from the surface language form to embodied processing for verbal route instructions. Route interpretations, like scene descriptions, involve the semantic conjunction, or complexing,
of a number of smaller-grained spatial language types such as spatial locating expressions and action descriptions; thus their interpretation requires a clear understanding of these individual language types. Before outlining our general approach to this problem, we first review some of the known properties of verbal route instructions and computational approaches to their interpretation and modelling.

1.1 The Structure of Verbal Routes
The properties of route instructions as monologic explanations have been studied extensively in the discourse analysis [7,8], discourse semantics [9], and spatial cognition communities [10]. Denis [7], for example, characterized routes as schematized descriptions of motion, starting with an initial localization, followed by one or more instances of route segments which move a hearer along a path, finalised by an orientation of the hearer towards the destination. Denis, however, also characterised route instructions as making extensive use of landmark reference and action descriptions. Prévot [9] took a more detailed semantic approach to the analysis of route interpretation, concluding that verbalised routes are highly underspecified structurally when introduced through dialogue and that hearers can only produce the intended interpretation through the application of many layers of context - including both discourse and situational information. More recently, the semantic and surface form features of route instructions in a dialogic rather than monologic setting have been discussed at length by Shi & Tenbrink [11]. We will not repeat such analysis here, but will instead note some of the defining features of the individual motion process and spatial locating utterances which compose a route instruction. In general, motion process sentences typically mark spatial elements such as a general direction of motion, or include more specific trajectory constraints which mark particular landmarks for the path of motion. Moreover, qualitative or quantitative extents for motions are frequently found in surface language. In the surface form, trajectory landmarks are discourse referents which can have arbitrarily complex descriptions to aid landmark resolution by the hearer. While such descriptions can include non-spatial features, geo-positional information such as projective or ordinal relationships can be present, e.g.,
(1) go into the second room, it's the one after John's office, on the left
Trajectory constraints, as well as the spatial characterization of landmarks which play a role in such constraints (e.g., "into" or "after" above), are subject to some choice of perspective and reference frame. To process spatial expressions in a principled way, each piece of this information must be identified. While it is possible to hard-code some choices, in general we must select these parameters from the surface form as provided by the user, or retrieve them from either discourse or situational context. One issue which is problematic for previous approaches is that the resolution of such parameters does not rely on a static domain model, but is instead quite dynamic.
1.2 Computational Models of Route Interpretation
A number of approaches to the computational interpretation of routes from a language understanding perspective have been considered in recent years. One approach to this problem from the formal spatial modelling community is tightly bound to Denis's [7] and Tversky's [12] characterization of certain route types as schematized structures. Such models of routes, typified by Werner et al. [10], and later by Krieg-Brückner's [3] Route Graph, focus on a spatially structural rather than action-oriented view of routes. Such a structurally oriented view of space led to the exploration of direct mappings of formally specified logical representations of verbalised routes against spatial representations coded as route graphs, aided in part through the use of qualitative spatial calculi (see for example [13] and [14]). Such an approach essentially conflates verbalised route instructions with the spatial model which they are to guide a user through, relying on a pre-formalisation of the verbalised route which is not trivially achievable in practical language systems. Such approaches, in themselves, also offer no mechanisms to fall back to clarification dialogues in the case of ambiguities which cannot be resolved within context alone. Within the applied robotics domain, on the other hand, a more pragmatic approach to the route interpretation problem has been investigated. Works such as Lauria's [15] interpretation of routes by a simple robot in a toy environment, or Mandel et al.'s [16] interpretation of route instructions against a detailed spatial representation, have pursued an action-oriented view of route instructions more in keeping with the instruction-based nature of route descriptions proposed by Denis, but, as with the formal modelling accounts, they often assume perfect user input, fail to integrate the spatial reasoning algorithm with dialogic capabilities, and, in the case of Mandel et al., require unnaturally long route instructions to aid the pruning of a search space which, while well configured for robot navigation and localization, was highly unsuited to the requirements of natural language processing. Moreover, this approach presupposed a specification of the user's route instruction as a ground structural formalism which already assumed a schematized view of routes considerably abstracted from the realities of surface spatial language. More recently, Shi et al. [17] attempted to evaluate the relative merits of the formal spatial and robot-oriented views of route interpretation through a common action description framework. However, no attempt was made to reconcile these two methodologies against the realities of under-specified natural language, nor was there any attempt to incorporate notions of quantitative information or spatial extent in route description.

1.3 Towards an Action-Oriented View of Route Interpretation
In this paper we will adhere to an action-oriented view of route interpretation in keeping with Denis's main construal of routes. Moreover, while we recognize the schematization of long routes into a series of 'segments' as a general characteristic of route descriptions in unknown environments, and one which has particular significance for the production of clear unambiguous routes [18], we argue that
the robust resolution of verbalised routes by artificial systems should not be overly dependent on such well-formed segmentation. Our resolution of ambiguous circumstances relies on a search-space technique based on Mandel et al.'s methodology, but applied to spatial representations more suited to the processing of spatial language, so as to reduce the overall search space and hence be practical even for short language inputs. Moreover, our goal is to tease out the various stages of spatial language interpretation so as to provide a more methodological blueprint for scalable spatial language interpretation. We proceed in Section 2 with a description of the logical surface of language, which forms the first layer in a clear model of spatial language interpretation. Then in Section 3 we move from language itself to issues of embodiment in terms of representation and spatial action. With the elements of spatial language and embodiment established, in Section 4 we explore the relation between the two through a description of the language interpretation process for simple spatial actions, before extending this model to the larger-grained structures of routes. Section 5 considers the application and evaluation of the models described through a modular framework for spatial language production and analysis, which we report on in the context of a recent evaluation study. We conclude with a general discussion and report on ongoing and future work.
2 The Logical Surface of Route Instructions
In many approaches to the computational interpretation of spatial language there exists an implicit assumption of a direct mapping or isomorphism between the surface meaning of language and the types used for model-specific spatial reasoning. While such a view is appealing for its simplicity, it unfortunately belies the complexity of spatial language, spatial reasoning, and the relationships between the two. An alternative view is to subscribe strictly to a "two level semantics" view of knowledge representation within a spatial language system, within which the first level or "Linguistic Semantics" captures the direct surface meaning of spatial language, and a second, world-oriented level or "Conceptual Semantics" captures contextualized spatial knowledge and the reasoning process. A mapping then exists between the two, with the complexity of the mapping being a function of the particular conceptual organization. For the modelling presented here we subscribe to the second view, thus assuming the surface form of spatial language to be given in terms of a logical formalism which abstracts over the surface language in terms of types and roles that describe the spatial import of language. For this interface we adopt the Generalised Upper Model (GUM), a "Linguistic Ontology" or formal theory of the world whose categories are motivated by the lexicogrammatical evidence of natural language. In particular we assume the categories provided by the latest version of GUM, which has been specifically extended to provide a comprehensive account of the natural language semantics of spatial language. Those extensions, described in detail by Bateman et al. [2], are rooted in the traditions of formal spatial language semantics (e.g., [19]) and more descriptive
accounts of spatial phenomena in language (e.g., [1]), resulting in category types which are wholly motivated by the distinctions made by language in its construal of space. The range of categories defined for spatial language description within GUM has been defined in depth elsewhere, and we will make no attempt to replicate such detail here; instead, we simply highlight some of the more salient modelling points so as to explain the role of the surface logical form in spatial language interpretation. Within GUM, the central organizational unit is the Configuration - which can broadly be considered a semantic categorization of sentence types based on the constituents of those sentences. Of the configurations, those of most relevance to the linguistic description of routes are the subclasses of NonAffectingSpatialDoing, i.e., OrientingChange and DirectedMotion, which provide surface semantics for expressions of directed motion. Such configurations define typical dependent semantic participants, e.g., the performer of the action, the direction within which the action is to be made, and so forth. Of particular relevance to the specification of spatial routes in language is the category of GeneralizedRoute, which, through playing the role of route within a configuration, specifies the trajectory elements captured by a single utterance. Based on grammatical evidence, a generalized route is defined as consisting of minimally one of a source, destination, or path placement role. Roles such as source and destination are in turn filled by so-called GeneralizedLocation entities which comprise both the relatum of an expression and the dynamic spatial preposition, or, semantically, the spatial modality given by the term. To illustrate the surface spatial semantics provided by GUM, consider the following directed motion expression along with its surface spatial semantics provided in a frame-like formalism:

(2) Go along the corridor to the kitchen

    (SL1 / NonAffectingDirectedMotion
      processInConfiguration (G1 / Going)
      route (GR1 / GeneralizedRoute
        pathPlacement (GL1 / GeneralizedLocation
          relatum (C1 / Corridor)
          hasSpatialModality (SM1 / PathIndicatingExternal))
        destination (GL2 / GeneralizedLocation
          relatum (K1 / Kitchen)
          hasSpatialModality (SM2 / GeneralDirectional))))

It should be noted that such a surface semantics is not necessarily a model of the real world in the sense of referential semantics, but is better thought of as a logical specification of surface language. Furthermore, while the surface spatial semantics attempts to cover all surface spatial meaning given by an utterance, some elements of a complete spatial meaning, such as perspective, are not marked overtly by natural language and hence, as we will shortly see, must be retrieved from extra-linguistic context during the spatial language interpretation process.
Before considering the contextualization process, we first turn to the embodied non-linguistic representation and reasoning of the spatial context assumed by our model.
3 The Embodiment Model

3.1 Spatial Representation
The heterogeneous nature of the reasoning types employed by human and artificial agents dictates that a single homogeneous model of space and intention for non-communicative perception, action, localisation, and navigation, as well as for communicative and cognitive reasoning processes, is not practical. A multi-tiered representation which separates out distinct knowledge types, as argued for, for example, by [4,3,20], is a useful means of achieving the required diversity of reasoning types. In the following we outline the layered representation used in our processing of spatial language. Our implemented spatial model follows from the graph-based organization of space proposed by Krieg-Brückner [3], which was also placed within an ontological characterization of spatial knowledge by Bateman [20]. For present purposes we organize the agent's spatial information as a three-layer structure:

    SM = {RS, CS, GS}    (1)

where: (a) RS or Region Space is a metric model of the agent's environment composed of a Cartesian coordinate based grid system within which each grid element can be either unoccupied or occupied by one or more Environment Things, some of which can be abstract; (b) CS or Concept Space is a knowledge-base-like representation of Environment Things and their non-spatial characteristics and relationships; and (c) GS or Graph Space is a structural abstraction of navigable space which sets up possible navigable accessibilities and visibility relations between environment things as marked by decision points. The explicit distinction between CS and RS is motivated principally by pragmatic concerns, in that by separating out the physical geo-spatial properties of the model from the non-spatial we can on the one hand make use of the most appropriate reasoning system for the information type, e.g., ontological reasoners for non-spatial category-like information, and explicit spatial reasoning techniques for the spatial properties of space - such a distinction also follows from a more principled organization of spatial content for robotic systems [20]. For current purposes, we adopt a metric RS layer, but we can in principle replace this with a more abstracted or qualitative representation of space without breaking the general organization of our agent's knowledge system. With respect to the GS layer, one minor difference between our applied model and Krieg-Brückner's Route Graph formalism is that, following Denis [7] and Kuipers [4], and general principles of ontological organization of space, our notion of a Route Graph place or Decision Point is strictly zero-dimensional and does not import notions such as reference system from associated regions.
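A minimal sketch of how the three layers of SM might be represented as data structures is given below; the class and field names are illustrative assumptions made for this sketch and do not reproduce the implemented NavSpace data model.

from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

@dataclass
class RegionSpace:
    # Metric grid: each occupied cell lists the EnvironmentThings covering it
    # (possibly several, since abstract and concrete things may overlap).
    cells: Dict[Tuple[int, int], Set[str]] = field(default_factory=dict)

@dataclass
class ConceptSpace:
    # Non-spatial knowledge: category and attribute information per thing.
    categories: Dict[str, str] = field(default_factory=dict)
    attributes: Dict[str, Dict[str, str]] = field(default_factory=dict)

@dataclass
class GraphSpace:
    # Zero-dimensional decision points and the accessibility edges between them.
    decision_points: Dict[str, Tuple[float, float]] = field(default_factory=dict)
    edges: List[Tuple[str, str]] = field(default_factory=list)

@dataclass
class SpatialModel:
    rs: RegionSpace
    cs: ConceptSpace
    gs: GraphSpace

sm = SpatialModel(RegionSpace(), ConceptSpace(), GraphSpace())
sm.cs.categories["room_a"] = "Office"            # hypothetical entry
sm.rs.cells[(12, 7)] = {"room_a", "corridor_1"}  # overlapping things in one cell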
Fig. 1. Illustration of the various levels of the NavSpace model: (a) raw image; (b) region space; (c) graph space; (d) concept space
For illustration, Figure 1 depicts a simplified office environment in abstract terms, along with the three representational layers of the spatial model. The office environment, described in more detail in Section 5, is illustrated in Figure 1(a); note here that the darker area in the centre of the environment is free space. The Region Space layer for the same environment is illustrated in Figure 1(b); note here the overlapping of abstract entities both with each other and with concrete entities. Figure 1(c) similarly shows the Graph Space, which includes the Route Graph like structuring. Finally, Figure 1(d) illustrates a highly simplified view of the concept space model (the actual concept space model, even for the office environment used here, is considerably more detailed, but has been simplified in the figure for visual clarity). While spatial representation layers similar to the Region Space and Graph Space are commonly used for practical robot navigation and localization, i.e., occupancy grids and Voronoi graphs, it should be made clear that the presence of the GS and RS layers here, like that of the CS layer, is motivated in the current model entirely by the necessities of spatial language interpretation.

3.2 Action Schemas
If we assume that the core units of route descriptions are motion expressions which correspond to actions which should be performed by an agent to reach a
goal, then we must define those actions in a meaningful way. Such definitions require, amongst other factors, a suitable choice of action granularity and relevant parametrization, as well as the traditional notions of applicability constraints and effects. For the current model, we have chosen a granularity and enumeration of action close to the conception of spatial action in human language as identified in the Generalized Upper Model introduced earlier. We will refer to these action types as action schemas, but it should be noted that the types of action schemas and GUM configurations are not one to one; action schemas necessarily introduce full spatial information including perspective, and are also, as will be seen below, marginally finer grained than GUM configurations. Excluding non-spatial features such as start time, performer of the action and so forth, we can define the generalized form of a directed motion action schema as follows:

    Motion(direction, extent, pathConstraint)    (2)

where:
– direction ∈ Direction is a direction expressed with respect to the mover's perspective and intrinsic reference frame.
– extent ∈ GeneralizedExtent is a quantitatively or qualitatively expressed extent for the movement.
– pathConstraint ∈ {Place, Modality} is an expression of a path constraint broadly equivalent to the GeneralizedLocation entities which play the roles of source, destination, and path placement in the Generalized Upper Model.

For each action there is also an implicit source which is the starting point of any motion. The source of a motion, typically omitted from surface language, is necessarily required to define an action. Trivially, the source of motion_i is equal to the final location of motion_{i-1}. Furthermore, certain pragmatic constraints hold on which parameters of a motion action schema may be set. For example, the specification of an extent without either a direction or a pathConstraint is not permitted. Furthermore, explicit definitions of extent and a path constraint must not be contradictory with respect to the world model. While action schemas are similar in conception and composition to configurations within the Generalised Upper Model, action schemas are more finely grained, typically decomposing a single GUM motion configuration into multiple action schemas; e.g., the configuration given for Sentence 2 earlier is given by two distinct action schema instances within the embodiment model, one capturing the path placement constraint, while the other captures the destination constraint. Multiple action schemas are then given a logical structuring with ordering and conditional operators. We must also define the effects of such schemas. The defining characteristic of a movement is a profile of probable locations of the mover following the initialization of the action. While there are logical symbolic ways to define such results, our approach follows Mandel et al. [16] in that we give a probable location of
the mover as a function of the starting pose and the action schemas considered, as follows:

    p(x_j, y_k, o_l) = f_schema(x_0, y_0, o_0)    (3)

where (x_0, y_0, o_0) denotes the starting pose of the agent (location on the Cartesian plane and orientation, respectively), p(x_j, y_k, o_l) denotes the probability of eventual occupation of an arbitrary pose, and the motion profile of each schema is determined empirically as a function of the supplied parameters.
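The following sketch shows one way the Motion schema of Eq. (2) and a motion profile in the sense of Eq. (3) could be represented. The deterministic spread of probability over under-, exact- and over-shoot poses is a placeholder assumption for illustration only, since the actual profiles are stated to be determined empirically.

import math
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

Pose = Tuple[float, float, float]   # (x, y, orientation in radians)

@dataclass
class Motion:
    # Directed-motion action schema: Motion(direction, extent, pathConstraint).
    direction: Optional[str] = None        # ego-centric, e.g. "forward", "left"
    extent: Optional[float] = None         # metres, if given quantitatively
    path_constraint: Optional[str] = None  # landmark id, e.g. "corridor_1"

TURN = {"forward": 0.0, "left": math.pi / 2, "right": -math.pi / 2, "back": math.pi}

def motion_profile(schema: Motion, start: Pose) -> Dict[Pose, float]:
    # Placeholder for f_schema: spread probability over a few end poses around
    # the nominal end point of the motion (real profiles are empirical).
    x0, y0, o0 = start
    o1 = o0 + TURN.get(schema.direction or "forward", 0.0)
    d = schema.extent if schema.extent is not None else 3.0
    profile: Dict[Pose, float] = {}
    for scale, p in ((0.8, 0.2), (1.0, 0.6), (1.2, 0.2)):   # under/at/over-shoot
        pose = (round(x0 + scale * d * math.cos(o1), 2),
                round(y0 + scale * d * math.sin(o1), 2),
                round(o1, 2))
        profile[pose] = profile.get(pose, 0.0) + p
    return profile

print(motion_profile(Motion(direction="left", extent=2.0), (0.0, 0.0, 0.0)))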
4 Route Interpretation through Interactive Plan Construction
While action schemas and the logical form of language share common features, the mapping function between the two is non-trivial and is highly dependent on forms of spatial and discourse context. To illustrate, Figure 2 schematically depicts an office environment with a robot (oval with a straight line to indicate orientation). In such environments, the identification of discourse referents
Fig. 2. Illustration of a spatial situation wherein parametrization of action schemas based solely on surface language fails because of multiple salient and relevant objects
Fig. 3. Illustration of a spatial situation wherein parametrization of action schemas based solely on surface language fails because of ambiguous perspective
typically used as landmarks within route instructions can depend on a range of spatial factors such as visual saliency, proximity, or accessibility relations. This can be seen in that, if the robot were told to "enter the office", it is highly likely that the office in question is directly ahead rather than behind the agent or to its right. Moreover, the mapping process can also involve the application of non-physical context to enrich the surface information provided. To illustrate, consider Figure 3, which depicts an agent situated at a junction while being given instructions by an operator who has a survey perspective equivalent to the figurative view presented in Figure 3. In such a case, and where the instructee is aware of the instructor's non-route perspective, an instruction such as "turn to the left" can have alternative meanings. While explicit clarification through dialogue is possible, the more efficient solution is to apply contextual information in the transformation process. In general, the relationship between surface spatial language and interpretation is not a simple mapping, but rather a more complex function. In the case of our action schemas then:

    action schema = f(context, surface)    (4)
where f is a complex grounding and enrichment process involving, on the one hand, the resolution of objects from the speaker's model which match the (sometimes partial and ambiguous) descriptions given in the spatial language surface form, while on the other hand requiring the application of context to supplement the information not provided in the surface form. This grounding and enrichment process can itself require linguistic interaction when context alone cannot provide a unique action schema. Here we will sketch and illustrate the spatial elements of such a function. An appropriate action schema must first be selected from a schema inventory provided by the embodiment layer. In the case of fully specified surface forms, this decision can be based directly on mappings between configuration types and action schema types. However, if a surface form is missing key information such as a verb or other indications of configuration type, the choice of action schema must instead be based on the schema type whose parameters are maximally filled by the information given in the surface form. Similarly, rule-based decomposition of configurations into series of action schemas is necessary, for example, in the case of surface forms which supply complex route information. Since the surface spatial form typically provides concept descriptions rather than ground concepts from the agent's spatial model, grounding functions must be applied to action schema parameters. The grounding functions are themselves dependent on a wide variety of spatial and non-spatial factors, the automatic and complete definition or categorization of which is arguably an AI-complete problem in that it requires human-level intelligence. However, within practical models we can of course make various simplifications and assumptions which provide an adequate approximation of actual grounding models. Typically within the route interpretation domain the parameters to be grounded include landmarks. As illustrated by Figure 3, since action schemas depend on directions
and motion constraints defined largely in terms of the agent's ego-centric perspective, the grounding process must also include the application of perspective and reference frame transformations to directions provided in surface form spatial descriptions. For single action instructions, if, during the grounding process, a unique parametrization of the action schema can be made, then the action may be committed to by the agent immediately. Whereas if no suitable parametrization is found, or if multiple solutions exist, then clarification through spoken dialogue is necessary to resolve the inherent ambiguity. For multiple action schemas, as typified by complete route instructions, we must adopt an incremental integration approach which composes a process structure from the supplied information:
1. Construct multiple ungrounded action schemas through decomposition and augmentation of surface spatial language configurations.
2. For action schema 1, apply the grounding process as per single schema grounding, storing final (position, probability) tuples for the action.
3. For action schema i + 1, take the most probable location tuples from action schema i and supply them as input parameters to the grounding of action schema i + 1.
4. If for action schema n one solution exists where the probability of location (p) is greater than a threshold (t), the sequence of grounded action schemas can be committed to by the agent.
This method, similar to the search algorithm applied by [16], essentially moves a set of most probable locations through a physical search space, seeking the most probable final destination given the set of action specifications supplied. However, since the search space in our case has been simplified to a conceptual graph structure which includes information on explicit junctions etc., rather than a more fine-grained Voronoi graph which treats all nodes equally, the search process is considerably simplified, resulting in even short route interpretations providing accurate results. Moreover, the current model offers a simple backtracking solution for the case where, for action schema n, the number of solutions is greater than one, or where no solution exists. In this case, rather than rejecting the user's request, the interpretation algorithm may backtrack to the last action segment where no unique solution exists and compose a clarification question relevant to that point.
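The incremental integration procedure in steps 1-4 above can be sketched as follows. The grounding function, beam size, and commitment threshold are placeholders assumed for this sketch; it is illustrative rather than the implemented algorithm.

from typing import Callable, Dict, List, Tuple

Pose = Tuple[float, float, float]
Profile = Dict[Pose, float]                 # end pose -> probability
GroundFn = Callable[[object, Pose], Profile]

def interpret_route(schemas: List[object], start: Pose, ground: GroundFn,
                    threshold: float = 0.5, beam: int = 3):
    # Thread the most probable end locations of schema i into the grounding of
    # schema i+1 (steps 2-3); commit only if a sufficiently probable final
    # location emerges (step 4), otherwise flag a segment for clarification.
    frontier: Profile = {start: 1.0}
    for i, schema in enumerate(schemas):
        next_frontier: Profile = {}
        best = sorted(frontier.items(), key=lambda kv: -kv[1])[:beam]
        for pose, p in best:
            for end, q in ground(schema, pose).items():
                next_frontier[end] = next_frontier.get(end, 0.0) + p * q
        if not next_frontier:
            return ("clarify", i)           # no solution for this segment
        frontier = next_frontier
    final_pose, prob = max(frontier.items(), key=lambda kv: kv[1])
    if prob > threshold:
        return ("commit", final_pose)
    return ("clarify", len(schemas) - 1)    # ambiguous final location

# e.g. interpret_route(schemas, start=(0.0, 0.0, 0.0), ground=ground_in_context)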
5 Model Application and Evaluation
The models of spatial action and route interpretation described earlier have been partially implemented within a modular framework of spatial language production and analysis, and evaluated as a whole in a user study with the developed system.
Fig. 4. The Corella Dialogue Framework & Navspace Application
5.1 Implementation Description
Figure 4 presents an overview of the system, which comprises on the one hand an application-independent dialogue processing framework, and on the other a domain-specific robot simulation application based on the models described in earlier sections. The dialogue framework, named Corella, is broadly based on the principles of Information State processing [21] and fine-grained grammatical mechanisms [22]. Various choices within the dialogue architecture - including the choices of input and output components, linguistic semantics formalisms, and grammars for analysis and generation - have been outlined elsewhere [23]; thus, for reasons of space, we will not consider them further here. In earlier work [23], direct integration with the robotic wheelchair platform described by [16] was undertaken. However, for the currently presented work, we have developed an application model based around a simplified simulation environment, so as to best study the representation of space for linguistic interaction, as well as issues of domain interface design, without having to consider low-level interfaces and models which have been designed for robot-specific navigation and localization issues. Following our background in developing language interfaces for intelligent wheelchairs, the current NavSpace design models a simulated robot in a schematized office environment. Typical interaction with the system involves using an interface similar to Figure 5 to direct the wheelchair around the environment with typed free-form natural language input. The system can also communicate with the user through a text window, with system communicative acts displayed to the user within the same interaction window.
Fig. 5. The Interaction Window
5.2 Study 1: Human-Human Gold Standard
As a data gathering exercise we first ran a human-human study in our research group to gather linguistic and behavioural data as a gold standard of human-human route interpretation performance. The study, described and analysed in detail by Goschler et al. [24], required pairs of native German speakers to play the roles of "route giver" (RG) and "route follower" (RF), thus emulating a route interpretation task for a user and an intelligent wheelchair. Participants, placed in separate physical rooms, interacted through a chat interface and two views on a shared spatial environment. Within a given trial, each participant viewed the shared simulated environment from a plan perspective which included corridors, named and unnamed rooms, and a simulated wheelchair's position in the environment at any given time. The RG, but not the RF, was also made aware of a given target destination for each trial through the highlighting of a portion of the screen. The RF, on the other hand, was given physical control over the wheelchair via a joystick, and could thus move the wheelchair towards the target based on typed instructions from the RG. Figure 5 depicts a screen similar to that presented to the RG and RF during each trial, the key difference being a renaming of the chat window label "Rollstuhl" (Wheelchair) to "Partner" for the human-human trials. Given that (a) both RG and RF were situated in the same spatial model and had no reason to think otherwise; (b) RG and RF were given continuous real-time information on the wheelchair's pose; and (c) the RF was manually controlling the wheelchair and was thus required to move his/her hands from the joystick to the keyboard to type,
we expected limited, if any, explicit dialogic interaction. However, as detailed in [24], a surprising amount of dialogue was observed, including explicit acknowledgements, clarifications, and challenges. Thus, in addition to allowing the collection of data, the human-human study illustrates the complexities of spatial language processing even for competent human speakers.

5.3 Study 2: Human-Computer Interaction
To evaluate the spatial language interpretation models, we ran a second study with human participants interacting with the dialogue system implementation. In this study participants played the role of Route Giver (RG) while the system itself played the role of Route Follower (RF) in a scenario, set-up, and spatial configuration practically identical to those of the human-human study just described. The application included many of the model features discussed earlier, including the transformation of linguistic semantics into action schemas, and the spatially context-sensitive resolution of referenced landmarks. Reference frame identification and transformation was, however, not included. The interface presented to the participant is shown in Figure 5. In total, thirteen participants took part in this study, each of whom was a native German speaker and an undergraduate student of Applied Linguistics at the University of Bremen. As with the human-human study, each participant was given written instructions informing them of their objectives. Participants took part in 11 trials, the first trial being a 'test run' after which questions could be directed to the experimenter. For each trial the wheelchair avatar was positioned within a room in the environment and a target location was highlighted for the participant to see. The same 11 start and end points were used as in the human-human study. Unlike in the human-human trials, a time-out mechanism was included in the experiment scenario to (a) encourage participants to complete the task in a timely fashion, and (b) prevent participants becoming stuck in a trial due to failure of the dialogue system. The time-out was set to four minutes (apart from the test trial), after which the trial was aborted and the next trial loaded. While a detailed linguistic analysis of the results has not yet been completed, we can report here on the success rates and on observed limitations in the implemented model. The study was broadly split into two groups: group one for testing and group two for evaluation. Group one involved 7 of the 13 participants, who were used for isolating technical issues and extending the input grammars towards common language strategies used by the participants. For the second group of 6, the system design was left constant so as to allow comparison between participants. Of group one, 5 of 7 participants aborted the study before partaking in all 11 trials, due to system errors. However, for group two only 1 of 6 participants had to abort early. For group two we had a total of 58 trials (5 complete sets of 11 plus one set of 3). Of these 58 trials, 50 (86%) were successfully completed by the participants within the time-out window. As predicted both by the results of our own gold standard study [24] and by earlier studies with humans in a Wizard of Oz scenario [11], a wide variety
of language strategies were used by participants to direct the RF to the destination. Following an initial review of the results corpus, it is clear that while many communicative failures were due to insufficient grammatical coverage, many other errors stem from the system’s limited verbal feedback in cases of uncertain interpretation. We believe that generating feedback based on the explicit grounded interpretations maintained in the action schemas could therefore significantly improve success rates; we are currently extending our model to achieve just this.
6 Discussion and Outlook
The aim of the work presented in this paper has been to investigate complete models of spatial language interpretation which systematically consider issues of language modelling and representation as well as issues of embodiment. We argue that the elements of spatial and linguistic reasoning must be separated out through appropriate modularization of the cognitive backbone, but that clear modelling of the relationships between the different reasoning layers is necessary. The mapping between different reasoning layers is non-trivial, being highly dependent on spatial, discourse, and other forms of context. We have proposed an action-oriented view of the interpretation of spatial motion instructions and route instructions. While an action-oriented perspective is in some ways a simplification of previous, more formalized views, we believe that it is ideally suited to the modelling of a range of complex spatial constraints which can be captured within route instructions. Such an action-oriented view does not, however, preclude the use of sophisticated qualitative modelling and reasoning techniques within the spatial processing system of such agents; indeed, such models are essential for the more complex forms of spatial reasoning necessary to ground the various parameters of the action schemas assumed by our models. We simply assume the interface to spatial motion to be defined in terms of actions, rather than making what we consider a premature transition to a structural formalism. As mentioned, the system implementation used in the reported user studies assumes the system’s ego-perspective in all interpretation tasks; a detailed model of perspective choice based on spatial and discourse context is therefore currently under development. Furthermore, while various approximations of the action schema profiles have been developed for our prototype, a more systematic coverage of dynamic spatial relations would be highly beneficial. Despite both of these limitations, and the necessity of providing a broad overview of the problems addressed rather than a precise fine-grained model, we believe the models presented here to be a useful intermediate step towards more detailed models of spatial language interpretation for artificial agents.
Acknowledgments. I gratefully acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) through the Collaborative Research Center SFB/TR 8 Spatial Cognition - Project I5-[DiaSpace]. I also thank Prof. John Bateman for his useful comments on early drafts of this paper.
References
1. Levinson, S.C.: Space in language and cognition: explorations in cognitive diversity. Cambridge University Press, Cambridge (2003)
2. Bateman, J., Hois, J., Ross, R., Tenbrink, T., Farrar, S.: The Generalized Upper Model 3.0: Documentation. SFB/TR8 internal report, Collaborative Research Center for Spatial Cognition, University of Bremen, Germany (2008)
3. Krieg-Brückner, B., Frese, U., Lüttich, K., Mandel, C., Mossakowski, T., Ross, R.J.: Specification of route graphs via an ontology. In: Proceedings of Spatial Cognition 2004, Chiemsee, Germany (2004)
4. Kuipers, B.: The spatial semantic hierarchy. Artificial Intelligence 19, 191–233 (2000)
5. Cohn, A., Hazarika, S.: Qualitative spatial representation and reasoning: an overview. Fundamenta Informaticae 43, 2–32 (2001)
6. Pickering, M.J., Garrod, S.: Towards a mechanistic psychology of dialogue. Behavioural and Brain Sciences 27(2), 169–190 (2004)
7. Denis, M.: The Description of Routes: A Cognitive Approach to the Production of Spatial Discourse. Cahiers de Psychologie Cognitive 16, 409–458 (1997)
8. Denis, M., et al.: Spatial discourse and navigation: An analysis of route directions in the city of Venice. Applied Cognitive Psychology 13, 145–174 (1999)
9. Prévot, L.: Topic structure in route explanation dialogues. In: Proceedings of the workshop “Information structure, Discourse structure and discourse semantics” of the 13th European Summer School in Logic, Language and Information (2001)
10. Werner, S., Krieg-Brückner, B.: Modelling navigational knowledge by route graphs. In: Freksa, C., Habel, C., Wender, K. (eds.) Spatial Cognition 2000. LNCS (LNAI), vol. 1849, pp. 295–317. Springer, Heidelberg (2000), http://www.springer.de
11. Shi, H., Tenbrink, T.: Telling Rolland where to go: HRI dialogues on route navigation. In: WoSLaD Workshop on Spatial Language and Dialogue, October 23-25, 2005 (2005)
12. Tversky, B., Lee, P.U.: How space structures language. In: Freksa, C., Habel, C., Wender, K.F. (eds.) Spatial Cognition 1998. LNCS (LNAI), vol. 1404, pp. 157–176. Springer, Heidelberg (1998)
13. Bateman, J., Borgo, S., Lüttich, K., Masolo, C., Mossakowski, T.: Ontological modularity and spatial diversity. Spatial Cognition & Computation 7(1) (2007)
14. Krieg-Brückner, B., Shi, H.: Orientation calculi and route graphs: Towards semantic representations for route descriptions. In: Proc. International Conference GIScience 2006, Münster, Germany (2006)
15. Lauria, S., Kyriacou, T., Bugmann, G., Bos, J., Klein, E.: Converting natural language route instructions into robot executable procedures. In: Proceedings of the 2002 IEEE Int. Workshop on Robot and Human Interactive Communication, Berlin, Germany, pp. 223–228 (2002)
16. Mandel, C., Frese, U., Röfer, T.: Robot navigation based on the mapping of coarse qualitative route descriptions to route graphs. In: Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS 2006) (2006)
17. Shi, H., Mandel, C., Ross, R.J.: Interpreting route instructions as qualitative spatial actions. In: Spatial Cognition V. LNCS (LNAI), vol. 14197. Springer, Heidelberg (2007), http://www.springeronline.com
18. Richter, K.F., Klippel, A.: A model for context-specific route directions. In: Proceedings of Spatial Cognition 2004 (2004)
19. Eschenbach, C.: Geometric structures of frames of reference and natural language semantics. Spatial Cognition and Computation 1(4), 329–348 (1999)
20. Bateman, J., Farrar, S.: Modelling models of robot navigation using formal spatial ontology. In: Proceedings of Spatial Cognition 2004 (2004)
21. Larsson, S.: Issue-Based Dialogue Management. Ph.D. dissertation, Department of Linguistics, Göteborg University, Göteborg (2002)
22. Steedman, M.J.: The syntactic process. MIT Press, Cambridge (2000)
23. Ross, R.J., Shi, H., Vierhuf, T., Krieg-Brückner, B., Bateman, J.: Towards Dialogue Based Shared Control of Navigating Robots. In: Proceedings of Spatial Cognition 2004, Germany. Springer, Heidelberg (2004)
24. Goschler, J., Andonova, E., Ross, R.J.: Perspective use and perspective shift in spatial dialogue. In: Proceedings of Spatial Cognition 2008 (2008)
Perspective Use and Perspective Shift in Spatial Dialogue
Juliana Goschler, Elena Andonova, and Robert J. Ross
Universität Bremen, Germany
{goschler,robertr}@informatik.uni-bremen.de, [email protected]
Abstract. Previous research has shown variability in spatial perspective and the occurrence of perspective shifts to be common in monologic descriptions of spatial relationships, and in route directions, in particular. Little is known, however, about preferences and the dynamics of use of route vs. survey perspectives as well as perspective shifts in dialogue. These were the issues we addressed in a study of dialogic interaction where one participant instructed the other on how to navigate a wheelchair avatar in a shared environment towards a goal. Although there was no clear preference for one of the two perspectives overall, dialogues tended to evolve from an early incremental, local, ego-based strategy towards a later more holistic, global, and environment-oriented strategy in utterance production. Perspective mixing was also observed for a number of reasons, including the relative difficulty of spatial situations and changes across them, navigation errors by the interlocutor, and verbal reactions by the interlocutor. Keywords: Spatial Language, Perspective.
1 Introduction
Communication about the world, even in its simplest form, can easily turn into a problem-solving task because form and function do not match unequivocally in language systems. Multiple forms may correspond to one and the same function or meaning, and multiple functions may be associated with one and the same verbal expression. In addition, the same referential object or scene can trigger a number of different perceptual and conceptual representations [1], or a certain arrangement of objects can be perceived and conceptualized in multiple ways. For example, in a study of goal-directed dialogue [2], different description schemes were used by participants in reference to a maze and movement in it (path, coordinate, line, and figural schemes). Similarly, in a study of how people describe complex scenes with multiple objects, participants’ choices varied significantly [3] depending on the nature of the spatial array. Thus, we wanted to investigate how people deal with these issues in a dialogic spatial task. Specifically, we were interested in their perspective-taking and how they would solve occurring ambiguities and misunderstandings.
Multiple perspectives, or ways of speaking about the world and the entities that populate it, are reflected at different levels of language, e.g., in lexical and syntactic alternatives, but also in variation at a conceptual level. In spatial reference, different conceptualizations can be seen in the choices of spatial perspective and frames of reference. Perspective taking involves abstracting from the visual scene, or schematization [4], and it has been interpreted as occurring at the level of microplanning of utterances [5,6] rather than macroplanning (deciding on what information to express, e.g., which landmarks and their relations are to be mentioned). Therefore, while being related to lexical and grammatical encoding, it carries conceptual choices beyond them. Regarding spatial perspective, two views have been defined by Tversky [7]: on the narrow view, perspective is realized through the choice of reference system, variously classified as deictic, intrinsic, and extrinsic; as egocentric and allocentric in wayfinding; or as relative, intrinsic, and absolute in Levinson’s framework [8]. On the broad view, perspective choices refer to the use of reference systems in extended spatial descriptions (e.g., of a room, apartment, campus, town). Spatial perspective of this kind has also been categorized in alternative ways. In a binary classification schema, embedded perspective refers to a viewpoint within the environment and goes well together with verbs of locomotion and terms with respect to landmarks’ spatial relations to an agent, while external perspective takes a viewpoint external to the environment and is commonly associated with static verbs and cardinal directions [9]. In a tripartite framework of spatial perspective, the route perspective/tour is typical of exploring an environment with a changing viewpoint, the gaze perspective is associated with scanning a scene from a fixed viewpoint outside an environment (e.g., describing a room from its entrance), and in the survey perspective a scene or a map is scanned from a fixed viewpoint above the environment [7,6]. Variability in perspective is an important feature of spatial language. Previous studies have considered several individual, environmental, and learning factors as a source of this kind of variation in verbal descriptions. Mode of knowledge acquisition has been shown to affect perspective choices in spatial memory: for example, participants who studied maps gave more accurate responses later to survey perspective tasks, whereas participants who were navigating gave more accurate responses to route perspective tasks [10]. In addition, in these experiments, spatial goals (route vs. survey) were also shown to affect perspective choices. Taylor & Tversky [6] tested the influence of four environmental features on spatial perspective choices and found that although overall most participants’ descriptions followed a survey or a mixed perspective, preference for the use of route perspective rather than mixed was enhanced in environments that contained a single path (vs. multiple paths) and environments that contained landmarks of a single size scale (vs. landmarks of varying size). The other two environmental features that were manipulated (overall size and enclosure) did not produce any clear pattern of preferences in their participants’ descriptions.
Variability in perspective choices is frequently accompanied by perspective-switching behavior: participants tend to mix perspectives quite regularly, for example, 27 out of 67 participants in Taylor & Tversky’s first experiment and 74 out of 192 participants in their second experiment mixed perspectives in their descriptions [6]. There are multiple reasons why a speaker may switch from one perspective to another. Perspectives are problem-ridden and fit certain situations and tasks better than others. For example, in a deictic system which derives from a speaker’s viewpoint, interlocutors must keep track of their partners’ viewpoints throughout the spatial discourse, or, in an intrinsic system, the success of coordination between interlocutors depends on having a shared image of the object in order to identify intrinsic sides and relations unambiguously [11]. However, consistency in spatial description, including spatial perspective, has also been identified as an important factor in choices of reference frame and in lexical and syntactic means of expression. Vorwerg, for example, found considerable internal, within-participant consistency in reference frames as well as language features [12]. Consistency in the use of perspective may be useful in at least two ways: by offering more cognitive ease for the speaker, who can proceed with a given perspective or reference frame that has already been activated (successfully), and more cognitive ease for the addressee, by providing coherence to the spatial discourse. Tversky et al. found that in comprehension, reading times and immediate verification times are both decreased by consistency of perspective in texts [7]. They conclude that the cognitive cost of switching perspective is relatively small and transient, while there is also a cognitive cost involved in retaining the same perspective, for example, changes in the viewpoint within a route perspective, and that on balance, switching perspective may be more effective in communication than not switching perspective. The paradox of having advantages both for switching perspective and for staying in the same perspective can only be resolved by analysis of specific situations and interactions [11]. In sum, previous studies have been able to account for some important aspects of perspective choice and perspective switching. However, little is known about two related issues that could enhance the ecological validity of research on the topic. First, previously published research has focused on verbal descriptions by individuals in a monologic format (even if participants were told that their descriptions would later be used by future addressees in a different task). There are a few exceptions: Schober’s study [13] showed that speakers set spatial perspectives differently with actual addressees than with imaginary ones. Striegnitz et al., in their study on perspective and gesture in direction-giving dialogues [14], point out that the use of survey perspective increases in answers to clarification questions and in re-descriptions of an already given route description. Second, online interaction involving simultaneous verbal exchanges and physical movement along a route in a given environment has not yet been studied systematically. It is not immediately clear to what extent speakers’ choices in such a format of interaction (both dialogic and online) would replicate existing findings.
For example, switching perspective in dialogue can take forms not found in single-participant descriptions, e.g., spatial reference may occur as part of the
joint cooperative activities of the two interlocutors, or spatial perspective negotiation may emerge as a natural feature of their interaction. In this paper we will consider perspective taking in the spatial domain in such a dialogic situation, and more specifically, we will examine the differences between the use of route and survey perspectives.
2 Perspective Use in Spatial Dialogue
When two interlocutors refer to one and the same spatial array, they select a frame of reference or a perspective for the description. Thus, in dialogue, perspective use and perspective switching are part of overall coordination. The need to align perspectives may arise because interlocutors have different viewing positions (vantage points) with respect to a scene, or because the terms referring to objects’ spatial relations may be ambiguous or underspecified. In our study, we kept the vantage point invariable, but there were two possible perspectives on the scene, namely survey and route perspective: participants could look at the map and refer to the main directions as left, right, up, and down in a survey perspective; or they could take the perspective of the wheelchair avatar and refer to the main directions with left, right, forward, backward in a route perspective. The availability of these two different perspectives on spatial scenes leads in many situations to ambiguous utterances in route or location descriptions, e.g., the meaning of left and right may differ in route and survey perspective. Whenever people have to deal with two-dimensional representations of three-dimensional space, this problem is likely to occur. Thus, the data we collected with participants who navigated a wheelchair avatar on a map on the computer screen point to more general problems when people have to use maps of any kind. For example, if one is told to go left in the position indicated by the wheelchair avatar in Fig. 1, this could be interpreted as an instruction to turn to one’s intrinsic left and then continue movement (in the route perspective), or to move in the direction one is already facing, in which case left would be employed as a term in the survey perspective on this map. The term left could receive the same interpretation under both readings only when the instruction-follower’s orientation is aligned with the bodily axes of the speaker (the two perspectives would then be conflated). How do interlocutors manage to align perspectives and communicate successfully then? Alignment or marking of perspective can be achieved explicitly by giving a verbal signal of the choice or of the switch of perspective. Previous research has indicated that this is rare. However, previous studies have mostly focused on individuals giving a spatial description to an imaginary rather than a real interlocutor. Dialogue, on the other hand, offers the addressee the possibility to explicitly confirm or question a perspective choice or even initiate a switch that may not have otherwise occurred. In the corpus of data we collected, participants did so by saying “from my point of view”, “(to the left) on the map”, or “if you look at the picture” to express that they were using the survey perspective.
Table 1. Examples of linguistic markers of route and survey perspective in the corpus

Route Perspective            | Survey Perspective
vom Rollstuhl aus            | von dir/mir aus gesehen
vom Fahrer aus               | auf der Karte
so wie der Fahrer es sieht   | wenn du auf das Bild guckst
wieder (links, rechts)       | oben, unten (links, rechts)
dann links, rechts           | obere, untere
hinter                       | über, unter
vor                          | oberhalb, unterhalb
Rückseite, Vorderseite       | hoch, runter
nach vorne                   | ganz (links, rechts, oben, unten)
vorwärts, rückwärts          | der rechte/linke Flur
The route perspective was signalled by phrases such as “seen from the wheelchair”, “seen from the driver”, etc. (Table 1). There are in fact some further linguistic markers for perspective that can give the interaction partners clues about which perspective their dialogue partner is taking. For example, while the terms left and right (Ger., “links”, “rechts”) are perspectively ambiguous, up/above and down (Ger., “hoch”/“nach oben”; “runter”/“nach unten”) are not. Alignment of perspective, however, may also be achieved implicitly, without any verbal reaction. In the case of real-world tasks such as navigation, for example, tacit agreement (and alignment of perspective) may also occur at the level of subsequent task-relevant non-verbal action (e.g., physical movement) by the instruction-follower, which indicates that the previous speaker’s utterance was treated as felicitous enough and ensuing action could be initiated. Most of the participants in our study did not refer explicitly to their perspective choices at all, and still managed to take the same perspective and accomplish the task.
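To make the marker-based distinction concrete, the following Python sketch tags an utterance as route, survey, mixed, or ambiguous using a small subset of the markers from Table 1. It is purely illustrative: the marker lists, the substring matching, and the treatment of perspectively ambiguous terms such as links/rechts are simplifying assumptions and do not reproduce the coding scheme actually applied to the corpus.

# Illustrative perspective tagger based on a subset of the markers in Table 1.
ROUTE_MARKERS = ["vom rollstuhl aus", "vom fahrer aus", "nach vorne",
                 "vorwärts", "rückwärts", "hinter"]
SURVEY_MARKERS = ["auf der karte", "von mir aus gesehen", "von dir aus gesehen",
                  "oben", "unten", "hoch", "runter", "oberhalb", "unterhalb"]

def tag_perspective(utterance):
    """Return 'route', 'survey', 'mixed', or 'ambiguous' for a German utterance."""
    text = utterance.lower()
    has_route = any(marker in text for marker in ROUTE_MARKERS)
    has_survey = any(marker in text for marker in SURVEY_MARKERS)
    if has_route and has_survey:
        return "mixed"
    if has_route:
        return "route"
    if has_survey:
        return "survey"
    # Bare 'links'/'rechts' are perspectively ambiguous on this map.
    return "ambiguous"

print(tag_perspective("aso, dann auf der Karte nach rechts"))   # -> survey
print(tag_perspective("fahr den Flur nach rechts ganz durch"))  # -> ambiguous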
3 Method
In order to examine how people deal with the problem of perspective when more than one is possible and appropriate to use, we elicited a small corpus of typed interaction by giving participants a shared spatial task. To accomplish this task, which consisted of the navigation of an iconic wheelchair on the schematized map of an office building, participants had to interact with a partner. Participants’ utterances were then analyzed with respect to the use of the route and survey spatial perspectives.
3.1 Participants
Participants were 22 sixteen- to seventeen-year-old students at a local high school. All of them were native speakers of German. Dialogue partners communicated in same-sex dyads (5 male, 6 female).
3.2 Apparatus
A networked software application was used to allow participant pairs to communicate via a chat interface whilst being provided with a view on a schematised spatial environment which at any time included the location of an iconic wheelchair. Within a given environment the wheelchair avatar location was controllable via a joystick. Movement behaviour simulated two-wheel differential-drive robot movement, with both angular and translational velocities proportional to joystick position. Movement of the avatar was constrained by the presence of walls in the environment, but no doors were explicitly modelled, thus allowing the avatar to be freely moved in and out of spaces. Text-based rather than spoken communication was used to best simulate the communication channel conditions typical of our computational dialogue implementations (see [15]). While this communication mode results in more favourable transcription and analysis times, we are aware of the differences introduced by typed rather than spoken language. To alleviate these effects a history-less chat interface was used, in which partner utterances were removed after a time proportional to the length of the utterance. Participants typed into one text box labelled Ich (“I” in German), while their partner’s text was presented in an appropriately labelled text box. In addition to recording all typed utterances, the application logged the position and orientation of the wheelchair at regular intervals. The software was set up on two terminals in separate rooms which were connected over a network. Terminals were identical apart from the presence of the joystick controller at one terminal only.
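The movement model can be made concrete with a short sketch. The following Python fragment shows one plausible unicycle-style pose update of the kind a two-wheel differential drive reduces to, with angular and translational velocities proportional to the joystick axes; the gain constants, axis conventions, and time step are illustrative assumptions, and collision with walls is not handled.

import math

V_MAX = 0.8   # assumed maximum translational speed (m/s)
W_MAX = 1.5   # assumed maximum angular speed (rad/s)

def update_pose(x, y, theta, joy_x, joy_y, dt):
    """Advance the avatar pose by one time step; joy_x, joy_y are joystick axes in [-1, 1]."""
    v = V_MAX * joy_y            # forward/back deflection -> translational velocity
    w = W_MAX * joy_x            # left/right deflection   -> angular velocity
    theta += w * dt
    x += v * math.cos(theta) * dt
    y += v * math.sin(theta) * dt
    return x, y, theta

# Example: half-forward joystick deflection with a slight turn, over a 50 ms step.
print(update_pose(0.0, 0.0, 0.0, 0.25, 0.5, 0.05))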
3.3 Stimuli
Each dyad participant was given a view of a schematised indoor environment on the computer screen. The same spatial environment was available to both speakers to minimize explicit negotiation of the map. The same map was used throughout all trials and with all dyads. The map, depicted in Figure 1, included unnamed locations, 6 named locations, and the position of the wheelchair avatar at any given time. One participant’s view also indicated a target location for the shared task through the highlighting of one room on the screen.
3.4 Procedure
Two participants at a time were placed at the terminals and provided with separate written instructions. Instructions required the participant with the joystick to act as an instruction-follower in the interaction by imagining that they were situated in the environment with their partner, who is in the wheelchair and is instructing them towards a goal. The instruction-giver on the other hand was asked to imagine being situated in the wheelchair and giving instructions towards a goal.
Fig. 1. Interface window. The goal area is identified only on the instruction-giver’s map.
The complete task consisted of 11 trials, where within each trial the instruction-giver directed the instruction-follower towards the goal. Each trial began with the wheelchair avatar located within a room but facing an exit onto a corridor. Participants were then free to communicate via the chat interface. No time-out was used for the trial, but instructions did request that participants attempt the task as quickly as possible. Once participants had successfully navigated the wheelchair avatar to the target room, the screen went blank. After two seconds, the map reappeared with a new starting position of the wheelchair in one of the rooms and a new target room. The same 11 start and end point configurations were used across all dyads, in a different pseudo-randomized order for each dyad. While the task structure is similar to the Map Task [16], and in particular its text-mediated realization by Newlands et al. [17], in that both tasks involve the description of routes between two interlocutors, there are important differences between the tasks with respect to our research goals. The Map Task purposefully introduces disparities between the maps used by interlocutors to solicit explicit discussion of the spatial arrangements presented. While this results in interesting dialogue structure, it also complicates the rationale behind explicit perspective shifts, which, as we will see, exist even with the isomorphic spatial representations
present in our task. Moreover, it has been our aim to analyse communication in an interaction situation which is more directly related to our targeted application domain of route following assistance systems [15].
4 Results and Discussion
The main research questions in this study were related to the choice of perspective made by the instruction-giver and instruction-follower in these dialogues, how their choices changed over time, especially in terms of the general efficiency of interaction (measured here in number of utterances spoken before the goal was reached), how much coordination there was between interlocutors, and the patterns underlying shifts from route to survey perspective and vice versa. There are several features of the design that are very likely to have influenced the choice of perspective. One was the setup with the map on the screen, which was positioned vertically in front of the participants. This should trigger the use of the survey perspective since it is the one aligned with their own bodily axes. It is unambiguous and cognitively “cheaper” because there is no need for mental rotation to account for the orientation of the wheelchair. That is why the survey perspective could have been expected to dominate. On the other hand, participants may have been biased towards the use of the route perspective by the task, in which movement in a wheelchair with its clear intrinsic front and back was involved. In addition, participants were explicitly encouraged to take the perspective of the wheelchair in the task instructions. The interaction in the eleven dyads on 11 trials each yielded a corpus of 121 dialogues and a total of 1301 utterances, the majority of which (1121) were task-related. As the focus of this study was on perspective use, only the 552 utterances indicating a spatial perspective were included in the analyses (49.24% of all task-related utterances). Other task-related utterances included incremental route instructions by the instruction-giver such as go on or stop, go out of the room or similar, and clarification questions by the instruction-follower such as what?, where to?, or here?.
4.1 Preferred Perspective Use
In order to examine the preference for one of the two perspectives (route vs. survey), we first classified all task-related utterances indicating a spatial perspective into the following categories: (a) utterances with route perspective, (b) utterances with survey perspective, (c) utterances with mixed perspective, and (d) utterances with conflated perspective where the description is valid in both route and survey perspectives. Only a small percentage was either mixed (1.59%) or conflated (7.58%) and these were excluded from subsequent analyses. Thus, the data could be analysed in terms of a binary choice between route and survey perspective utterances, yielding a mean percent use of route perspective as a measure and 462 utterances to be included in the analysis.
Table 2. Number of utterances with spatial perspective and mean percent use of utterances in route perspective produced by instruction-givers and instruction-followers

                                  Instruction Giver | Instruction Follower | Total
N utt. in spatial perspective            414        |          48          |  462
N utt. in Route perspective              294        |          20          |  314
N utt. in Survey perspective             120        |          28          |  148
Mean % use of Route perspective         71.01%      |        41.67%        | 67.97%
As a result, the overall mean percent use of route perspective in this corpus was established as 67.97% (SD=46.71%). Participants produced a total of 314 utterances in the route perspective and 148 utterances in the survey perspective. A breakdown by dialogic role showed that instruction-givers dominated these dialogues (Table 2). This is not surprising given that it was their task to provide directions to their conversational partners. In addition, the task requirements for the instruction-follower were such that producing speech came in addition to joystick navigation, which took up a relatively large amount of their time and effort, hence their limited participation in these exchanges. Furthermore, instruction-givers’ utterances were mostly in the route perspective while the opposite was observed for the instruction-followers, whose utterances in the survey perspective outnumbered those in the route perspective (Table 2). Although this descriptive analysis reveals a general preference for route-perspective utterances, based on instruction-givers’ utterances, it is not informative of the dynamics of use across interactions, and for this purpose, further analyses were conducted on the data averaged for each trial and dyad, yielding 121 means (11 dyads x 11 trials). One issue that could be addressed in this analysis was the relative dominance of use of the two perspectives across dyads and speakers. A one-way ANOVA revealed a great deal of variation across dyads with respect to the mean percent use of route perspective (F(10,100) = 12.29, p < .001). On average, dyads produced 53.61% of their utterances in a trial in the route perspective (SD = 43.43%). However, while the mean percent use of route perspective in some dyads was as low as 0%, i.e., all their relevant utterances were framed in the survey perspective, in others it was as high as 94.91%. As these figures indicate, although route perspective utterances were far more numerous in the corpus as a whole, if perspective choices are examined within dyads, we find a considerable amount of variation and no clear dominance of one of the two perspectives (M=53.61%).
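As an illustration of the aggregation behind these figures, the following Python sketch computes the percent use of route perspective per dyad and trial from per-utterance codes and runs a one-way ANOVA with dyad as the grouping factor. The toy records and column names are assumptions for the sake of the example; they are not the study data.

import pandas as pd
from scipy.stats import f_oneway

# Toy records: one row per utterance already coded as 'route' or 'survey'.
utterances = pd.DataFrame([
    {"dyad": 1, "trial": 1, "perspective": "route"},
    {"dyad": 1, "trial": 1, "perspective": "survey"},
    {"dyad": 1, "trial": 2, "perspective": "route"},
    {"dyad": 2, "trial": 1, "perspective": "survey"},
    {"dyad": 2, "trial": 2, "perspective": "survey"},
    {"dyad": 2, "trial": 2, "perspective": "route"},
])

# Mean percent use of route perspective per dyad x trial (121 such means in the study).
pct_route = (utterances.assign(is_route=utterances["perspective"].eq("route"))
                       .groupby(["dyad", "trial"])["is_route"].mean() * 100)

# One-way ANOVA on the per-trial percentages, with dyad as the grouping factor.
groups = [values.to_numpy() for _, values in pct_route.groupby(level="dyad")]
f_stat, p_value = f_oneway(*groups)
print(pct_route)
print(f"F = {f_stat:.2f}, p = {p_value:.3f}")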
4.2 Speaker Coordination and Efficiency
Clearly, there were significant differences between instruction-givers and instruction-followers on the total number of utterances and on the mean percent use of route perspective (Table 2). However, conversational partners also tend to converge in their choices and develop mutually acceptable strategies of reference and description schemes [18,2,19].
Table 3. Correlations among trial number, number of utterances by the interlocutors, mean percent use of route perspective, and mean percent perspective shift. *p < .05, †p < .01, ‡p < .001. Note. Trial - trial number; IG Utt - number of instruction-givers’ utterances; IF Utt - number of instruction-followers’ utterances; ALL Utt - number of all utterances in a dyad; %Route - mean percent use of route perspective in utterances; %Shift - mean percent perspective shift; n.s. - not significant.

           Trial    IG Utt   IF Utt   ALL Utt   %Route
IG Utt     -.19*
IF Utt     -.20*    .86‡
ALL Utt    -.20*    .98‡     .94‡
%Route     -.28†    .28†     .25†     .28†
%Shift     n.s.     .22*     .27†     .24†      n.s.
This is why we also explored the degree to which there was coordination between instruction-givers and instruction-followers in individual dyads in terms of ‘talkativeness’, i.e., number of utterances, and in terms of preferences for a certain perspective. A correlation analysis revealed a strong positive correlation between the average number of utterances with a spatial perspective produced by instruction-givers and the number of such utterances by instruction-followers (r = .86, p < .001; see Table 3). This confirms the existence of a high degree of coordination between interlocutors in the individual dyads in terms of the number of their dialogic contributions: the more the instruction-giver spoke, the more the instruction-follower said as well, and vice versa. However, no statistically significant correlation emerged between the mean percent use of route perspective by instruction-givers and by instruction-followers. At first glance, this may appear to suggest that instruction-givers and instruction-followers had different perspective preferences and maintained these preferences irrespective of their interlocutors’ utterances. However, a more detailed analysis reveals that the two speaker roles were associated with different types of dialogic contributions: whereas instruction-givers’ utterances with spatial perspective were mostly examples of direction-giving (assertions & action-directives in the terminology of DAMSL [20]), instruction-followers’ spatial-perspective contributions were frequently queries, information requests, or reformulations of the directions given by instruction-givers. Thus, they did not function as a contribution to joint referential activities by repetition or imitation (convergence) but by offering alternative formulations or questioning incomplete and ambiguous directions by instruction-givers. At the same time, efficiency and economy of effort was a major consideration for instruction-followers, who were always doing joystick navigation simultaneously. Therefore, instruction-followers’ utterances in the same perspective would have been redundant and relatively inefficient. In fact, efficiency of communication should increase across trials according to the collaborative model, or the framework of joint common ground [18]. Previous findings have demonstrated such effects as far as reference to the same pictorial stimuli is concerned; for example, in a study of Tangram shape descriptions, Krauss [21] found that later references (noun phrases) to the same figure tended to be shorter. Similarly, in Clark’s [18] referential communication task, the
director and the matcher became more efficient not only from one trial to the next but also from the beginning to the end of a trial (measured in number of words). In our study, we addressed the question of changes in the efficiency of interaction over time in a correlation analysis which revealed a negative correlation between trial number, on the one hand, and the average number of utterances produced on a trial by dyads, the number of instruction-givers’ utterances, and the number of instruction-followers’ utterances, on the other hand. These correlations reached significance but were rather weak (Table 3) and need to be interpreted in view of the differences across experimental designs and measures. The studies using the referential communication task mentioned above examine efficiency in reference to the same shape, object, or stimulus more generally, selected from a limited and pre-specified set of options which were visually available to both participants, whereas in our study, although the task remained the same (giving route directions in a certain map), the routes themselves, i.e., the positioning of their start and end points on the map, varied on each trial. Still, the overall result is clear: efficiency in terms of shorter dialogues increases across the span of the experimental session.
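Correlations of the kind reported in Table 3 can be reproduced with a few lines of code. The values below are synthetic placeholders with the same per-trial structure (one row per trial, with the dyad's utterance count and percent route perspective), not the study's data.

from scipy.stats import pearsonr

# Synthetic per-trial records: (trial number, utterances in the dyad, % route perspective).
rows = [
    (1, 14, 80.0), (2, 12, 75.0), (3, 11, 66.7), (4, 10, 60.0),
    (5, 10, 55.0), (6,  9, 50.0), (7,  8, 40.0), (8,  9, 45.0),
    (9,  7, 35.0), (10, 7, 33.3), (11, 6, 30.0),
]
trial   = [r[0] for r in rows]
n_utt   = [r[1] for r in rows]
p_route = [r[2] for r in rows]

for name, values in [("ALL Utt", n_utt), ("%Route", p_route)]:
    r, p = pearsonr(trial, values)
    print(f"Trial vs {name}: r = {r:.2f}, p = {p:.3f}")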
4.3 Perspective Use across the Interaction Span
Apart from considerable inter-dyad diversity on the use of perspective and perspective shifts, there was also much variation across experimental trials, ranging from 77.36% mean use of route perspective on the first trial to 30% on the last trial. In order to examine this decline in the use of route perspective systematically, we turn next to the analysis of perspective choices and perspective shift depending on how long the interlocutors in a dyad have already interacted in this task. The measure for this is the number of the experimental trial (1-11, where 1 is the first trial and 11 is the last trial for a dyad). The correlation analysis confirmed the existence of a weak to moderate negative correlation between trial number and mean percent use of route perspective (r = -.28, p < .01). On early trials, utterances in the route perspective tended to be produced more than on later trials overall. This correlation illustrates the tendency of speakers to opt for the survey perspective more and more as time elapsed. In addition, as Table 3 shows, there were significant positive correlations between the different efficiency measures (number of utterances overall, by instruction-givers, and by instruction-followers) and the mean percent use of route perspective, indicating that dyads who spoke on average more on a trial also tended to have a preference for route-perspective utterances. Bearing in mind that later dialogues also tended to be shorter, as discussed earlier, these correlations taken as a whole point towards survey-perspective utterances being more economical in this task and suggest, as before, that over time, participants tended to be more efficient in their interactions by communicating about routes less and less in the initially popular yet eventually relatively inefficient route perspective.
4.4 Perspective Shift/Switch
Perspective shift was coded systematically in the following way: every use of a certain perspective by an interlocutor was checked against the speaker’s latest utterance in a spatial perspective on a given trial. If the perspective in the utterance differed from the perspective used in the previous utterance, it was coded as a perspective shift, e.g., from route to survey or vice versa. Shifts across speakers were not included in the analysis. In order to examine the distribution of perspective shifts across trials and dyads, we calculated the mean percent perspective shift on each trial for each dyad. Although the overall percentage of perspective shift was relatively low (M=8.78%), there was considerable variation across dyads (SD=15.94%), ranging from 0% to 67% switches. We found that although perspective shift did not correlate with trial number, it did correlate positively with the three measures of efficiency (number of utterances overall, utterances by instruction-givers, and utterances by instruction-followers). Trials on which participants were ‘high-volume’ speakers also tended to elicit more perspective shifts in their utterances. That perspective switches occur at all is not surprising and has been described by Tversky et al. [7]. They argue that after a while, both perspectives are conceptually represented and available in speakers’ minds. The question is, when and why do switches occur? The factors influencing perspective choices can be of a spatial or communicative nature. Thus, certain changes in the spatial situation could be responsible for the occurrence of perspective shifts. In addition, both the interlocutor’s verbal and non-verbal behaviour can lead to a perspective switch. If the instruction-giver is faced with behaviour by the instruction-follower that does not follow the plan as outlined and intended by the instruction-giver, for example, a turn in the wrong direction, this could be a reason for the speaker to shift perspective in order to achieve better mutual understanding. We indeed found occurrences of misunderstandings and mistakes that might have triggered a perspective shift, as is likely to have happened in this piece of conversation:
(1) a. Instructor: fahr den flur nach rechts ganz durch. Dann nach links in den 2. flur und in den letzten raum rechts.
       (drive through the corridor to the right. Then to the left into the 2nd corridor and into the last room on the right-hand side)
    b. Instructor: falscher raumM
       (wrong room)
    c. Instructee: wo denn?
       (where?)
    d. Instructor: nach oben
       (up)
    e. Instructor: jez revhts
       (now right)
    f. Instructor: und in den raum rein der jez neben dir is
       (and into the room which is now beside you)
    g. Instructor: genau
       (exactly)
In this example, the instruction-giver first describes the way in the route perspective but then the instruction-follower makes a navigation error which the instruction-giver points out by uttering falscher raum (E., “wrong room”). When the instruction-follower asks wo denn? (E., “where?”), the instruction-giver immediately switches to the survey perspective by saying “nach oben” (E., “up”). The following example shows how the use of a certain perspective by one of the interlocutors could influence their partner to use the same perspective, i.e., to align with them:
(2) a. Instructor: so, jetzt wieder links, dann den gang nach rechts direkt
       (so, now left again, then the corridor to the right directly)
    b. Instructor: links
       (left)
    c. Instructee: ich bin nu oben links in der Ecke
       (I am in the upper left corner)
    d. Instructor: aso, dann auf der karte nach rechts
       (ok, then to the right on the map)
    e. Instructor: genau, jetzt nach unten
       (exactly, now down)
Here again, the instruction-giver starts by using the route perspective, but the instruction-follower interrupts with a description of her location in the survey perspective. The instruction-giver switches to survey, marking the use of this perspective explicitly by saying auf der Karte (E., “on the map”). The interlocutor, in this case the instruction-follower, can even explicitly ask for directions in another perspective than the one used by the instruction-giver:
(3) a. Instructor: jtyt nah rechts
       (now to the right)
    b. Instructee: oben oder unten
       (top or bottom)
    c. Instructor: in den nachsten gang nach rechts
       (into the next corridor to the right)
    d. Instructor: unten
       (bottom)
    e. Instructee: ok
       (ok)
    f. Instructor: gradeaus
       (straight)
had given directions in the route perspective. After answering this particular question in the survey perspective, the instruction-giver switches back to using the route perspective. Thus, the analysis of the corpus data shows that the verbal behavior of the interlocutor exerts an influence on perspective choices and can lead to perspective shifts. It remains to be studied in future research to what extent perspective shifts can be caused by different kinds of verbal behavior, non-verbal behavior (spatial action), and how exactly these diverse factors interact.
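The shift-coding scheme described at the beginning of Section 4.4 (each perspective-bearing utterance is compared with the same speaker's previous perspective-bearing utterance in the trial, and cross-speaker differences are not counted) can be sketched as follows. The input format and the choice of all perspective-bearing utterances as the percentage base are simplifying assumptions.

def percent_shifts(utterances):
    """utterances: list of (speaker, perspective) pairs for one trial in temporal
    order, with perspective in {'route', 'survey'}. Returns the percentage of
    utterances coded as a within-speaker perspective shift."""
    last_by_speaker = {}
    shifts = 0
    for speaker, perspective in utterances:
        previous = last_by_speaker.get(speaker)
        # Only a change relative to the same speaker's last perspective counts.
        if previous is not None and perspective != previous:
            shifts += 1
        last_by_speaker[speaker] = perspective
    return 100.0 * shifts / len(utterances) if utterances else 0.0

trial = [("IG", "route"), ("IG", "route"), ("IF", "survey"),
         ("IG", "survey"), ("IG", "survey")]
print(percent_shifts(trial))  # 20.0: one shift (IG: route -> survey) in five utterances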
5 Conclusions
Previous findings have shown variability in spatial perspective and perspective shift to be ubiquitous in monologic descriptions of spatial relationships and in spatial instructions such as those found in route directions. This study explored these issues in a dialogic online navigation task which enhances the ecological validity of such interactions. The analyses reveal that within a route instruction task of this kind, survey and route spatial perspectives are more or less equally likely to occur, although there is a clear tendency for instruction-givers to have an initial preference for route-perspective descriptions, which, however, gradually gives way to the more economical and efficient use of survey-perspective instructions. This is, in effect, a reflection of a trend away from a rather incremental, local, ego-based strategy towards a more holistic, global, and environment-oriented strategy in producing directions. Our results also point towards a great deal of coordination among speakers, even though the instruction-followers’ verbal contributions were limited because of the nature of the task and the requirement to navigate via a joystick. In addition, the findings support communicative models that account for increased efficiency as a result of joint effort across the lifespan of an interaction. Our data confirm the occurrence of mixing of perspective in spatial language on a regular basis (on approximately 9% of all trials) and thus show that this phenomenon is not restricted to monological spatial descriptions. The correlation between perspective shifts and the number of utterances needed before the spatial goal is reached reflects the tendency for speakers in more efficient dialogues to stay within one perspective and minimize the number of switches in describing and negotiating a route. As a whole, we identified several driving forces behind perspective shifts in dialogues, including the relative difficulty of specific spatial situations and changes across situations, navigation errors by the interlocutor, and explicit and implicit verbal reactions by the interlocutor. Controlled experimental paradigms in future research need to disentangle these diverse influences.
Acknowledgements. We gratefully acknowledge the support of the Deutsche Forschungsgemeinschaft (DFG) through the Collaborative Research Center SFB/TR 8 Spatial
Cognition - Project I5-[DiaSpace]. We would also like to thank the students and teachers of the Ganderkesee Gymnasium, Bremen, for their participation in our study.
References
1. Clark, E.: Conceptual perspective and lexical choice in acquisition. Cognition 64, 137 (1997)
2. Garrod, S.C., Anderson, A.: Saying what you mean in dialogue: a study in conceptual and semantic co-ordination. Cognition 27, 181–218 (1987)
3. Andonova, E., Tenbrink, T., Coventry, K.: Spatial description, function, and context (submitted)
4. Tversky, B., Lee, P.U.: How space structures language. In: Freksa, C., Habel, C., Wender, K.F. (eds.) Spatial Cognition 1998. LNCS (LNAI), vol. 1404, pp. 157–176. Springer, Heidelberg (1998)
5. Levelt, W.J.M.: Speaking: From intention to articulation. MIT Press, Cambridge (1989)
6. Taylor, H., Tversky, B.: Perspective in spatial descriptions. Journal of Memory and Language 35, 371–391 (1996)
7. Tversky, B., Lee, P., Mainwaring, S.: Why do speakers mix perspectives? Spatial Cognition and Computation 1, 399–412 (1999)
8. Levinson, S.C.: Space in language and cognition: explorations in cognitive diversity. Cambridge University Press, Cambridge (2003)
9. Kriz, S., Hegarty, M.: Spatial perspective in spoken descriptions of real world environments at different scales. In: Proceedings of the XXVII Annual Meeting of the Cognitive Science Society, Stresa, Italy (2005)
10. Taylor, H., Naylor, S., Chechile, N.: Goal-specific influences on the representation of spatial perspective. Memory & Cognition 27, 309–319 (1999)
11. Levelt, W.: Perspective Taking and Ellipsis in Spatial Descriptions. In: Bloom, P., Peterson, M., Nadel, L., Garrett, M. (eds.) Language and Space, pp. 77–109. MIT Press, Cambridge (1996)
12. Vorwerg, C.: Consistency in successive spatial utterances. In: Coventry, K., Tenbrink, T., Bateman, J. (eds.) Spatial Language and Dialogue. Oxford University Press, Oxford (in press)
13. Schober, M.F.: Spatial perspective taking in conversation. Cognition 47(1), 1–24 (1993)
14. Striegnitz, K., Tepper, P., Lovett, A., Cassel, J.: Knowledge representation for generating locating gestures in route directions. In: Spatial Language in Dialogue. Oxford University Press, Oxford (2008)
15. Ross, R.J.: Tiered models of spatial language interpretation. In: Proceedings of Spatial Cognition 2008, Freiburg, Germany (2008)
16. Anderson, A.H., Bader, M., Bard, E.G., Boyle, E.H., Doherty, G.M., Garrod, S.C., Isard, S.D., Kowtko, J.C., McAllister, J.M., Miller, J., Sotillo, C.F., Thompson, H.S., Weinert, R.: The HCRC Map Task corpus. Language and Speech 34(4), 351–366 (1992)
17. Newlands, A., Anderson, A.H., Mullin, J.: Adapting communicative strategies to computer-mediated communication: an analysis of task performance and dialogue structure. Applied Cognitive Psychology 17(3), 325–348 (2003)
18. Clark, H.H., Wilkes-Gibbs, D.: Referring as a Collaborative Process. Cognition 22(1), 1–39 (1986)
19. Pickering, M.J., Garrod, S.: Towards a mechanistic psychology of dialogue. Behavioural and Brain Sciences 27(2), 169–190 (2004)
20. Core, M.G., Allen, J.F.: Coding dialogues with the DAMSL annotation scheme. In: Traum, D. (ed.) Working Notes: AAAI Fall Symposium on Communicative Action in Humans and Machines, pp. 28–35. AAAI, Menlo Park, California (1997)
21. Krauss, R., Weinheimer, S.: Changes in reference phrases as a function of frequency of usage in social interaction. Psychonomic Science (1964)
Natural Language Meets Spatial Calculi
Joana Hois and Oliver Kutz
SFB/TR 8 Spatial Cognition, University of Bremen, Germany
{joana,okutz}@informatik.uni-bremen.de
Abstract. We address the problem of relating natural language descriptions of spatial situations with spatial logical calculi, focusing on projective terms (orientations). We provide a formalism based on the theory of E-connections that connects natural language and spatial calculi. Semantics of linguistic expressions are specified in a linguistically motivated ontology, the Generalized Upper Model. Spatial information is specified as qualitative spatial relationships, namely orientations from the double-cross calculus. This linguistic-spatial connection cannot be adequately formulated without certain contextual, domain-specific aspects. We therefore extend the framework of E-connections in two ways: (1) external descriptions narrow down the class of intended models, and (2) context-dependencies inherent in natural language descriptions feed finite descriptions of necessary context information back into the representation. Keywords: Spatial language, Spatial calculi, Ontologies, E-connections.
1 Introduction
We are aiming at a formal specification of connections between linguistic representations and logical theories of space. Language covers various kinds of spatial relationships between entities. It can express, for instance, orientations between them (“the cat sat behind the sofa”), regions they occupy (“the plant is in the corner”), shapes they commit to (“the terrace is surrounded by a wall”), or distances between them (“ships sailed close to the coast”). Formal theories of space also cover various types of relations, such as orientations [1], regions [2,3], shapes [4], or even more complex structures, such as map hierarchies [5]. Compared to natural language, spatial theories focus on one particular spatial aspect and specify its underlying spatial logic in detail. Natural language, on the other hand, comprises all of these aspects, and has thus to be linked to a number of different spatial theories. This linking has to be specified for each aspect and each spatial logic, identifying relevant information necessary for a linking or mapping function. This process involves contextual as well as domain-specific knowledge. Our overall aim is to provide a general framework for identifying links between language and space as a generic approach to spatial communication, independent of the concrete kinds of applications in which it is used. It should be applicable to any spatial context in connection with human-computer interaction,
be it geographic applications for way-finding and locating, city guides using maps, home/office automation applications, paths and spatial guidance, or architectural design planners. In particular, rather than attempting to integrate the most general spatial theories, we propose to use, in a modular way, various specialised (qualitative) spatial logics supporting dedicated and optimised reasoning algorithms. In this paper, we analyse the linking between natural language and one specific aspect of space, namely orientation information for static spatial situations. We concentrate on static descriptions throughout this article, because dynamic descriptions (as they are defined in the linguistic ontology) do not differ from static descriptions with respect to their orientation-based locations: in “I am going to the left” and “The stove is to the left” the “to the left” refers to the same leftness in terms of the orientation. Moreover, most information about locatives is given by static descriptions of locations rather than dynamic movement [6]. We define links between language and a spatial theory concerning orientations, showing examples of linguistic projective terms, such as “A is to the right of B”, “A is sitting to B’s left”, or “A is straight ahead”. These types of terms are specified in a linguistic ontology, the Generalized Upper Model [7], and linked with the necessary non-linguistic information of the orientation calculus [8]. In order to apply this representation to spatial maps, we introduce spatial orientations according to four basic projective, two-dimensional directions (left, right, front, back), which are distinguished and formalised. In particular, spatial entities are reducible to points and refer to material objects with finite dimensions. We will introduce the linguistic ontology and its representation of spatial relationships in the next section. In Section 3, the connection between linguistic semantics, the double-cross calculus, and relevant link-related aspects will be analysed using natural language examples. Finally, in Section 4, we will introduce an extension of the framework of E-connections to formalise all these aspects in a modular way, which can be represented as a structured logical theory in the system Hets for heterogeneous specification.
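Before turning to the formal treatment, the kind of orientation information at issue can be illustrated with a small sketch that classifies a point as lying to the left, right, front, or back of an oriented, point-like reference object. This is a deliberately simplified four-sector test, not the double-cross calculus itself, and the tie-breaking rule and names are illustrative assumptions.

import math

def orientation_relation(ref_pos, ref_heading, target_pos):
    """Classify target_pos as 'front', 'back', 'left', or 'right' relative to a
    point-like reference at ref_pos facing ref_heading (radians)."""
    dx = target_pos[0] - ref_pos[0]
    dy = target_pos[1] - ref_pos[1]
    forward = dx * math.cos(ref_heading) + dy * math.sin(ref_heading)    # along the heading
    leftward = -dx * math.sin(ref_heading) + dy * math.cos(ref_heading)  # perpendicular to it
    if abs(forward) >= abs(leftward):
        return "front" if forward >= 0 else "back"
    return "left" if leftward > 0 else "right"

# A point one unit to the left of a reference facing along the positive x-axis.
print(orientation_relation((0.0, 0.0), 0.0, (0.0, 1.0)))  # -> left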
2 Linguistic Spatial Semantics
Natural language groups spatial relations into different categories according to certain aspects, which can be related to specific spatial theories that deal with these aspects. A linguistic categorisation of spatial relationships on the basis of linguistic evidence, empirical research, and grammatical indications has been developed in detail in the Generalized Upper Model GUM [7,9], a linguistically motivated ontology. Linguistic ontologies structure language into groups of categories and relations by their semantics, i.e. categories are not based on lexemes but meanings. As a formal theory, GUM is axiomatised in first-order logic, parts of which can also be expressed in description logics (DLs) such as SROIQ [10], underlying the Web Ontology Language OWL 2.0. GUM’s signature, i.e. its set of non-logical symbols, contains categories (unary predicates) and relations (binary predicates).
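To give a flavour of this kind of axiomatisation, the following description-logic style axioms are a purely illustrative sketch, not GUM's actual axioms; they use category and relation names introduced in Section 2.1 below, and the choice of ranges (e.g., Element for locatum) is an assumption.

\[ \mathit{SpatialLocating} \sqsubseteq \mathit{Configuration} \sqcap \exists\,\mathit{locatum}.\mathit{Element} \sqcap \exists\,\mathit{placement}.\mathit{GeneralizedLocation} \]
\[ \mathit{GeneralizedLocation} \sqsubseteq \exists\,\mathit{spatialModality}.\mathit{SpatialModality} \]

In first-order terms, the first axiom reads: \(\forall x\,(\mathit{SpatialLocating}(x) \rightarrow \mathit{Configuration}(x) \wedge \exists y\,(\mathit{locatum}(x,y) \wedge \mathit{Element}(y)) \wedge \exists z\,(\mathit{placement}(x,z) \wedge \mathit{GeneralizedLocation}(z)))\).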
Fig. 1. Relations in GUM of an utterance example of a static spatial situation
GUM captures linguistic semantics of spatial expressions while nevertheless rendering this organisation independently of specific spatial logics. Also, its categorisation is not simply based on groups of spatial prepositions, but on linguistic characteristics of spatial relations (grammatical or inherent), linguistic evidence, and empirical data. Therefore, the development of GUM has been carried out with respect to empirical results in human-computer interaction [11,7,12] and general linguistic research [13,14,15,16,17]. Utterances of spatial situations are specified as instances in GUM. We refer the reader to an overview of GUM in [9] and specific spatial components in [7].
2.1 Linguistic Specifications in the Generalized Upper Model
An utterance expressing a static spatial description is instantiated in GUM as a SpatialLocating. This category is a subclass of Configuration, a category that represents activities or states of affairs usually expressed at the level of the clause. They are defined according to their possible relations within the ontology, i.e. defined by the entities that participate in the activity or state of affairs. In principle, a single static description is specified by an instance of Configuration. Specific parts of the description (what, where, who, how, etc.) are specified by instances of Element, and their roles within the Configuration are specified by instances of relations (actor, manner, process, attribute, etc.). Subcategories of Configuration that represent spatial activities or conditions are divided into static spatial situations and dynamic spatial situations. In the following, we will concentrate on the former, the SpatialLocating. This GUM category defines at least the following three relations:
1. The relation locatum in GUM relates the SpatialLocating to its located object within the spatial linguistic description. In the example "the chair is to the right of the table" (see Fig. 1), "the chair" is at a specific spatial position and represents the locatum [7] (also called the "referent" in [13]), i.e. the entity that is located somewhere.
2. The relation processInConfiguration relates the SpatialLocating to its process, the action or condition entity, which is usually expressed by a verbal group, indicating tense, polarity and modal aspects [17]. In the example in Fig. 1, the process corresponds to "is".
3. The relation placement relates the SpatialLocating to the location of the locatum. This location is represented by the GUM category GeneralizedLocation. It refers to "to the right of the table" in the example. A GeneralizedLocation specifies the spatial position of a locatum and consists of a spatial term, e.g. a spatial preposition, and an entity that corresponds to the reference object. Hence, the GeneralizedLocation defines two relations: spatialModality (spatial relation) and relatum (reference object). In the example, the spatialModality is expressed by "to the right of" and the relatum is expressed by "the table". The relatum, however, may remain implicit in natural language discourse [12], such as in the example "the chair is to the right", i.e. to the right of an undefined relatum, be it the speaker, listener or another entity. In case multiple relata are described together with the same spatial modality, they fill the relation relatum as a collection. Binding the relatum and the spatialModality in the placement relation is a design issue rather than a logical constraint. This encapsulation allows convenient combinations of multiple locations expressed within one configuration: in the example "The plant is in the corner, by the window, next to the chair.", one SpatialLocating defines three placements. This is even more important as soon as placements are modified by expressing spatial perspectives, spatial accessibility, extensions or enhancements of the spatial relation. The utterance "The plant is to the front left of the chair, right here in the corner." combines two relations (front and left) with respect to one relatum (the chair), while a second relatum (in the corner) is combined with possible access information (right here). Moreover, modifications that are encapsulated together with the placement are easier to compare in case of re-use of spatial placements, e.g. throughout a dialogue discourse. Moreover, the GeneralizedLocation retains its structure independently of the configuration. It is equally specified in "he goes to the right of the chair" (dynamic spatial configuration) and "he stands to the right of the chair" (static spatial configuration), related by different relations (destination and placement). Types of spatial relationships between locatum and reference objects are described by the category SpatialModality. Linguistically, this category corresponds to a preposition, an adverb, an adjective, or parts of the verb. It is subdivided into several categories that are primarily grouped into (1) relations expressing distance between entities, (2) functional dependencies between entities, and (3) positions between entities relative to each other depending on particular properties of the entities (such as intrinsic front side, size, shape). There are, however, intersections between these three general groups. Subcategories that refer particularly to spatial relationships based on orientations are subsumed under ProjectionRelation, describing positions between entities relative to each other depending on particular orientation-based properties of the entities.
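To make this structure concrete, the following sketch shows one way the example of Fig. 1 could be written down as data. This is an illustrative rendering only: the Python encoding and the field layout are our own choices and not GUM's actual axiomatisation or serialisation; only the category and relation names are taken from the text above.

# Illustrative sketch (not GUM's actual serialisation) of the utterance
# "the chair is to the right of the table" as a SpatialLocating instance
# with its three relations.
utterance = {
    "category": "SpatialLocating",                 # subclass of Configuration
    "locatum": {"category": "Element", "lexeme": "the chair"},
    "processInConfiguration": {"category": "Element", "lexeme": "is"},
    "placement": {
        "category": "GeneralizedLocation",
        "spatialModality": {"category": "RightProjectionExternal",
                            "lexeme": "to the right of"},
        "relatum": {"category": "Element", "lexeme": "the table"},
    },
}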
2.2 Orientation-Related Linguistic Spatial Relationships
Projective relations are distinguished along three dimensions and can be divided into horizontal and vertical directions [18].
Fig. 2. Projective horizontal relations in GUM
In order to reason (and talk) about map-like representations, it suffices to concentrate on horizontal relations, which can be distinguished along lateral and frontal directions. Lateral projections comprise the directions left and right, frontal projections comprise front and back. All four ontological categories of horizontal ground (atomic) projective relations, namely LeftProjection, RightProjection, FrontProjection, and BackProjection, can be expressed as an internal or external relationship [19]. Internal projective relations inherit from the category Parthood (topological) and refer to internal projections between locatum and relatum, such as "A is in the left (part) of C" or "B is in the front of C". External projective relations inherit from the categories Disjointness (topological) and SpatialDistance and refer to external projections between locatum and relatum, such as "A is to the left of C" or "B is in front of C" (compare Fig. 3). Furthermore, the category FrontProjectionExternal also inherits from the category Access, as external front projections imply functional access between locatum and relatum. An overview of the projective categories and their hierarchical dependencies in GUM is shown in Fig. 2. These categories are pairwise disjoint; for instance, FrontalProjection is disjoint with LateralProjection. They can, however, be extended (in GUM terminology), i.e. an instance of FrontProjectionInternal ("front") in "A is in the front left" is extended by an instance of LeftProjectionInternal ("left"). Spatial modalities can also be enhanced (in GUM terminology) by additional entities, e.g. distance information in "A is 10 meters to the left". Hence, GUM represents linguistic characterisations of orientations, which have to be associated with concrete spatial situations in order to yield a fully contextualised interpretation. In the next section, we will introduce an orientation-based spatial calculus and link this representation to GUM's projective categories. We will also identify missing aspects needed to minimise ambiguity in such a connection, namely context-dependent and domain-specific information.
Fig. 3. Internal and external projective relations
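The inheritance links just described can be summarised as follows. The rendering is ours (GUM itself is axiomatised in first-order logic and OWL); only the category names and the superclass relations stated above are taken from the text.

# Superclass links of the projective categories as described above
# (illustrative summary only, not GUM's actual axiomatisation).
superclasses = {
    "LeftProjection":          ["LateralProjection"],
    "RightProjection":         ["LateralProjection"],
    "FrontProjection":         ["FrontalProjection"],
    "BackProjection":          ["FrontalProjection"],
    # internal variants additionally inherit from Parthood (topological)
    "LeftProjectionInternal":  ["LeftProjection", "Parthood"],
    # external variants inherit from Disjointness and SpatialDistance
    "LeftProjectionExternal":  ["LeftProjection", "Disjointness", "SpatialDistance"],
    # external front projections additionally imply functional access
    "FrontProjectionExternal": ["FrontProjection", "Disjointness",
                                "SpatialDistance", "Access"],
}
# FrontalProjection and LateralProjection, for instance, are declared disjoint.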
3 Orientation Calculi and External Aspects
Spatial calculi address specific aspects of space, such as regions, orientations, shapes, etc., in order to provide formal representations as well as automatic reasoning techniques. General overviews of such representations are given in [20] and [21]. The calculi most relevant for mapping GUM's projective categories are those involving orientations, since the linguistic projective relations described above refer to orientations within a spatial situation.1 Many well-known spatial calculi for orientations have been studied in the literature, among them the double-cross calculus [8], the star calculus [22], the line segment-based calculus [23], and a model for positional relations2 [24]. Such calculi are intended to be used for either static or dynamic relationships. They refer either to point-based or region-based spatial entities. They are based either on geometric or cognitive factors. The approach described in this paper maps orientations expressed in natural language to orientations represented in the double-cross calculus.
3.1 The Double-Cross Calculus
[8] introduces a ternary calculus of spatial orientations, the so-called double-cross calculus (DCC) [21]. In DCC, 15 relations are distinguished between an observer at position A, who is oriented (or moves) towards an entity at position B (compare Fig. 4). The 15 orientation relations are defined along three axes motivated by human cognitive characteristics: A and B are called the perspective point and the reference point respectively in [21]. They determine the front-back axis. Orthogonal to this axis are two further axes specified by A and B. Another entity located at some position C can then be described according to one of the 15 orientations.
Fig. 4. DCC's 15 qualitative orientation relations according to [8]
Some of the correspondences between GUM and DCC are readily inferred: given an utterance, the perspective from where the relationship holds refers to an entity at A, the relatum refers to an entity at B, and the locatum refers to an entity located with respect to one of the 15 orientation relations determined by the spatial modality. The perspective, however, is often underspecified in utterances and might refer to the speaker, the listener, some other entity, or B. Which frame of reference [13] underlies the utterance is often not explicitly given. Also, in case the relatum is missing, B has to be inferred by other implicit or contextual
1 Although cardinal directions, i.e. north, east, south, west, are also related to orientations in some calculi, they are different from linguistic projective terms as introduced above and should thus be investigated separately.
2 [24] use projective relations in their model, which do not correspond to linguistic projective relations as they are used in GUM (e.g. "surround", "inside", "outside" are not linguistic projective terms in GUM).
information. The perspective (A) and the relatum (B) can even be identical: in this case, the perspective point and the reference point coincide (i.e. A = B). The reference frame will automatically be intrinsic, and the orientation has to be determined by the intrinsic front. Even though GUM's spatial relationships, then, are linked almost directly with DCC's orientations, especially by means of the inherent distinction between front/back and right/left projections, a missing perspective and relatum of an utterance have to be inferred and mapped to a DCC representation. What exactly these missing links are, and how an adequate mapping can be constructed by taking other information into account, is described in the following.
3.2 External Spatial Aspects in Linguistic Semantics
As GUM's linguistic specification is strongly based on concepts indicated by natural language, it does not entail enough information in order to map linguistic elements directly to entities of the spatial calculus. Hence, a mapping function from language to (models of) space needs additional information: [6] identifies eight parameters necessary to interpret a linguistic utterance. Among them are speaker and addressee, their locations and a view- or vantage point. Although [6] argues that orientations of speakers and addressees can be derived from their locations, this derivation is not specified in more detail, and as orientations are highly important in interpreting projective terms, our mapping has to specify them directly. Still missing are also the intrinsic fronts of the reference object: projective linguistic terms can be interpreted along intrinsic orientations of objects independent of the location and orientation of speaker or listener. Before we introduce links between corresponding linguistic and spatial entities, we start with examples of natural language utterances from a scene description. They motivate the missing aspects not given in the utterance.
Fig. 5. Room layout of a scene description task, introduced in [7]. Arrows indicate intrinsic orientations of objects.
Table 1. Example of utterances of native English speakers from the spoken experiment and their representation in GUM. Utterances are cited without padding and pauses.
utterance | locatum | spatialModality | relatum
1. the armchair is almost directly to my right | armchair | RightProjectionExternal | me
2. with the table just in front of it | table | FrontProjectionExternal | it
3. and diagonally behind me to the right is the table | table | BackProjectionExternal / RightProjection | me / –
4. the stove is directly to our left | stove | LeftProjectionExternal | us
5. and to the right of that is the fridge | fridge | RightProjectionExternal | that
6. there is a table to the right | table | RightProjectionExternal | –
7. further to the right a little bit in front is a living room | living room | RightProjection + FrontProjection | –
8. directly in front, there are two tables | tables | FrontProjectionExternal | –
9. from here the television is diagonally to the right | television | RightProjectionExternal (perspective: here) | –
The examples are taken from a series of experiments involving human-robot interaction in which participants were asked to explain a spatial situation to a robot. A detailed description of the experimental design is given in [7]. Fig. 5 shows the room layout of the experiment. Here, the position of speaker and listener coincides, i.e. they share the same perspective. In Table 1, an excerpt from the corpus data is given, in which participants refer to positions of objects in the room along their projective relationships. Utterances from the corpus often lack information about relatum and perspective; such information is commonly omitted in natural language and has to be determined by other contextual or domain-specific factors. Even though the positions of locatum, relatum, and perspective point have to be determined with respect to these external factors, links between projective spatial modalities and DCC relations can be defined in general: a concrete mapping, for instance, from a LeftProjection to the DCC orientations 2–6 is not affected by the position of A.
3.3 Non-linguistic Spatial Aspects of Projective Relations in GUM
The utterance "the armchair is (almost directly)3 to my right" shows an example where the locatum (armchair) is located to the right (RightProjectionExternal) of the speaker (relatum: me) (see Table 1). This utterance refers to an intrinsic frame of reference, where the perspective coincides with the relatum, i.e. the speaker, related to the position A in DCC. The locatum is then located at a point with one of the orientations 8–12 in DCC; A and B are identical. In this example, information about the speaker's identity with "my (right)" and the frame of reference has to be added to the mapping function.
3 Although GUM specifies modifications such as "almost directly" as enhancements of the spatial relation, we disregard them for a general mapping function, as they have minor impact on orientations (i.e. left does not become right).
The next sentence "with the table just in front of it (the armchair)" also refers to an intrinsic frame of reference, but with the armchair as origin, i.e. the armchair refers to A in DCC (see also Fig. 6), which also coincides with B. In this case, the locatum (table) is located at a position with one of the orientations 1–4 and 10–14. Hence, information about the armchair's intrinsic front and the frame of reference has to be taken into account. In the case of a relative frame of reference, as in "to the right of that (the stove) is the fridge", the perspective point A is indicated by the speaker, the reference point B is indicated by the relatum (stove), and the locatum (fridge) is indicated by a point that refers to one of the orientations 10–12 in DCC. Here, the frame of reference, the possibility of the stove having an intrinsic front, and the perspective, i.e. the position of the speaker, are relevant for the mapping. If the relatum has no intrinsic front, it follows that a relative frame of reference applies. Otherwise, the choice of the underlying frame of reference is based on user preferences (extracted from the dialogue history) and the likelihood of an intrinsic vs. relative frame of reference (according to the contextual descriptions). In cases where the relatum is missing—e.g. the relatum of "further to the right" is omitted in Example 7—it is usually possible to determine its position by considering the preceding utterances. Hence, the sequence of utterances may give implicit information about missing entities in GUM's representation, and thus has to be considered throughout the construction of the mapping between GUM and DCC. Similarly, in Example 9, the given perspective "here" can either be interpreted as a reference to the speaker or to the position that has just been described in a previous sentence, though a relative frame of reference can be assumed for explicit perspectives. Given the corpus data, we conclude that the following parameters are involved in mapping the linguistic semantics of an utterance to a spatial situation:
Fig. 6. DCC orientations of different entities: different perspectives cause different projective relationships. The DCC orientations in the left figure are based on the perspective of the speaker (participant), while the orientations in the right figure are based on intrinsic orientations of objects with intrinsic fronts and a changed orientation of the speaker. Objects are implicitly reduced to points defined by their centre.
1. Position and orientation of speaker and listener
2. Reference system (relative or intrinsic) and origin (perspective)
3. Domain-specific knowledge of entities (e.g. possibility of intrinsic fronts, their orientations and granularity)
4. Dialogue history (sequence of utterances)
A linguistic representation in GUM together with these parameters can then be mapped to the location of the perspective point A, the reference point B, and possible orientations towards the position of the located entity in DCC. The formalisation of this mapping is described in the following.
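Before turning to that formalisation, the intended role of the four parameters can be pictured informally. The following sketch is ours: the function, the field names, and the gum_to_dcc lookup are invented for illustration and do not anticipate the E-connection-based formalisation of Section 4.

# Informal sketch of the GUM-to-DCC mapping; all names are illustrative.
# Here the placement fields (relatum, spatialModality) are assumed to be
# simple name strings rather than full GUM instances.
def map_to_dcc(spatial_locating, context):
    """Determine DCC perspective point A, reference point B and candidate
    orientation relations for the locatum of a SpatialLocating instance."""
    placement = spatial_locating["placement"]
    relatum = placement["relatum"]
    if relatum is None:
        # parameter 4: fall back on the dialogue history for an omitted relatum
        relatum = context["dialogue_history"][-1]["relatum"]
    # parameter 3: domain-specific knowledge about intrinsic fronts
    has_front = relatum in context["intrinsic_fronts"]
    # parameter 2: a relative frame applies if the relatum has no intrinsic
    # front; otherwise a context-dependent preference decides
    frame = "relative" if not has_front else context.get("preferred_frame", "intrinsic")
    # parameter 1: the speaker supplies the perspective point in a relative
    # frame; in an intrinsic frame, A coincides with the relatum (B)
    B = context["positions"][relatum]
    A = context["speaker_position"] if frame == "relative" else B
    # link to DCC: e.g. a LeftProjection is linked to orientations 2-6; the
    # exact set also depends on the configuration of A and B
    orientations = context["gum_to_dcc"][placement["spatialModality"]]
    return A, B, orientations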
4 Multi-dimensional Formalisms and Perspectivism
The formation of multi-dimensional formalisms, i.e. formalisms that combine the syntax and semantics of different logical systems in order to create a new hybrid formalism, is a difficult and complex task in general; compare [25] for an overview. 'Classical' formalisms that have been used for formalising natural language statements involving modalities are counterpart theory [26] and modal predicate logics. However, both of these formalisms, apart from being computationally rather difficult to deal with, are not particularly suited to (qualitative) spatial reasoning, as they do not, in their standard formulations, provide a dedicated spatial component, either syntactically or semantically. Similarly, the semantically tight integration that product logics of space and modality provide does not support the sometimes loose or unsystematic relationships that natural language modelling requires. From the discussion so far, it follows that there are three desiderata for the envisaged formalisation:
1. To be able to represent various aspects of space and spatial reasoning, it needs to be multi-dimensional. However, in order to keep the semantics of the spatial calculi intact, the interaction between the formalisms needs to be loose initially, but also fine-tunable and controllable.
2. It needs to account for common sense knowledge that would typically be formalised in a domain ontology, and allow the interaction between components to be restricted further.
3. It needs to account for context information not present in the representation using linguistic semantics.
The general idea of counterpart relations being based on a notion of similarity, however, gives rise to a framework of knowledge representation languages that seems quite well-suited to meet these requirements, namely the theory of E-connections [27,28], which we sketch in the next section.
4.1 From Counterparts to E-Connections
In E-connections, a finite number of formalisms talking about distinct domains are ‘connected’ by relations relating entities in different domains, intended to capture different aspects or representations of the ‘same object’. For instance,
an 'abstract' object o of a description logic L1 (e.g. an instance in GUM defining a linguistic item) can be related via a relation R to its life-span in a temporal logic L2 (a set of time points) as well as to its spatial extension in a spatial logic L3 (a set of points in a topological space, for instance). Essentially, the language of an E-connection is the (disjoint) union of the original languages enriched with operators capable of talking about the link relations. The possibility of having multiple relations between domains is essential for the versatility of this framework, the expressiveness of which can be varied by allowing different language constructs to be applied to the connecting relations. E-connections approximate the expressivity of products of logics 'from below' and could be considered a more 'cognitively adequate' counterpart theory. E-connections have also been adopted as a framework for the integration of ontologies in the Semantic Web [29], and, just as DLs themselves, offer an appealing compromise between expressive power and computational complexity: although powerful enough to express many interesting concepts, the coupling between the combined logics is sufficiently loose for proving general results about the transfer of decidability: if the connected logics are decidable, then their (basic) connection will also be decidable. More importantly in our present context, they allow the heterogeneous combination of logical formalisms without the need to adapt the semantics of the respective components. Note that the requirement of disjoint signatures of the formal languages of the component logics is essential for the expressivity of E-connections. What this boils down to is the following simple fact: while more expressive E-connection languages allow one to express various degrees of qualitative identity, for instance by using number restrictions on links to establish partial bijections, they lack the means to express 'proper' numerical trans-module identity. For lack of space we can only sketch the formal definitions, and present only the two-dimensional case, but compare [28]: we assume that the languages L1 and L2 of two logics S1 and S2 are disjoint. To form a connection C^E(S1, S2), fix a non-empty set of links E = {Ej | j ∈ J}, which are binary relation symbols interpreted as relations connecting the domains of models of S1 and S2. The basic E-connection language is then defined by enriching the respective languages with operators for talking about the link relations. A structure $\mathfrak{M} = \langle \mathfrak{W}_1, \mathfrak{W}_2, \mathcal{E}^{\mathfrak{M}} = (E_j^{\mathfrak{M}})_{j \in J} \rangle$, where $\mathfrak{W}_i = (W_i, \cdot^{\mathfrak{W}_i})$ is an interpretation of $S_i$ for $i \in \{1, 2\}$ and $E_j^{\mathfrak{M}} \subseteq W_1 \times W_2$ for each $j \in J$, is called an interpretation for C^E(S1, S2). Given a concept C of logic S2, denoting a subset of W2, the semantics of the basic E-connection operator is
$(\langle E_j \rangle^1 C)^{\mathfrak{M}} = \{\, x \in W_1 \mid \exists y \in C^{\mathfrak{M}} \text{ such that } (x, y) \in E_j^{\mathfrak{M}} \,\}$
Fig. 7 displays the connection of an ontology with a spatial logic for regions such as S4u, by means of a single link relation E which we might read as 'is the spatial extension of'.
Fig. 7. A two-dimensional connection
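As a toy illustration of this semantics (the sets and the link relation below are invented for this example and carry no ontological claim), the operator amounts to taking the preimage of a sort-2 set under the link relation:

# Toy illustration of the E-connection operator semantics given above.
# W1 could hold ontology instances, W2 spatial regions; E links an object
# to the regions making up its spatial extension.
W1 = {"country_A", "capital_of_A"}
W2 = {"region_1", "region_2"}
E = {("country_A", "region_1"), ("country_A", "region_2"),
     ("capital_of_A", "region_2")}
assert all(x in W1 and y in W2 for (x, y) in E)

def diamond_E(C):
    """(<E>^1 C)^M = { x in W1 | there is y in C^M with (x, y) in E^M }."""
    return {x for (x, y) in E if y in C}

print(diamond_E({"region_2"}))   # {'country_A', 'capital_of_A'}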
As follows from the complexity results of [28], E-connections add substantial expressivity and interaction to the component formalisms. However, it is also clear that many properties related to (2) and (3) above cannot directly be formalised in this framework. The next section sketches an extension to E-connections that adds these expressive means, called perspectival E-connections.
4.2 Perspectival E-Connections
We distinguish three levels of interaction between the two representation languages S1 and S2:
1. internal descriptions: axioms formulated in the link language;
2. external descriptions: axioms formulated in an external description language, reasoning over the same signature, but in a richer logic. They add interaction constraints not expressible in (1), motivated by general domain knowledge;
3. context descriptions: a class of admissible models needs to be finitely specified; here, not a unique model needs to be singled out in general, but a description of a class of models compatible with a situation (a context).
There are several motivations for such a modular representation: it (i) respects differences in the epistemic status of the modules; (ii) reflects different representational layers; (iii) displays different computational properties of the modules; (iv) facilitates independent modification and development of modules; (v) allows structuring techniques developed in algebraic specification theory to be applied; etc. The general architecture of perspectival E-connections is shown in Fig. 8. For an E-connection of GUM with DCC, the internal descriptions cover the axioms of GUM and the constraint systems of DCC. Moreover, basic interactions can be axiomatised, e.g. mappings from GUM elements to DCC points need to be functional.
4.3 Layered Expressivity: External Descriptions and Context
The main distinction between external and contextual descriptions is not technical but epistemic. External descriptions are meant to enforce necessary interactions between ontological and spatial dimensions, while contextual descriptions add missing context information. The formal languages used to enforce these
Fig. 8. Architecture of Perspectival E-connections
constraints will typically be different. Similar to conceptual spaces [30], they are intended to reflect different representational dimensions or layers of a situation.
External Descriptions. An example, taken from [27], is the following constraint: "The spatial extension of the capital of every country is included in the spatial extension of that country". This is a rather natural condition in an E-connection combining a DL describing geography conceptually and a qualitative calculus for regions. Unfortunately, a basic E-connection C^E(ALCO, S4u) is not expressive enough to enforce such a condition. However, it can be added as an external description if we assume that the external language allows quantification:
$\forall x \forall y \; \big( \mathit{capital\_of}(x, y) \rightarrow E(x) \subseteq E(y) \big)$
In this case, the external description does not affect the decidability of the formalism, as shown in [28]. Of course, this is not always the case: the computational benefits of using E-connections as the basic building block in a layered representation can get lost in case the external descriptions are too expressive. While a general characterisation of decidability-preserving constraints is difficult to give, this can be dealt with on a case-by-case basis. In particular, the benefits of a modular design remain regardless of this issue. Similarly to the above example, when combining GUM with DCC, assuming Φ axiomatises a LeftProjection ("left of") within a SpatialLocating configuration, we need to enforce that elements participating in that configuration are mapped to elements of DCC models restricted to the five 'leftness' relations of DCC (see Section 3.2):
$\forall x, y, z \; \Big( \Phi(x, y, z) \rightarrow \bigvee_{i=2}^{6} L_i\big(E(x), E(y), E(z)\big) \Big)$
This would be a typical external description for C^E(GUM, DCC). Note that any internal description can be turned into an external one in case the external language is properly more expressive. However, the converse may be the case as well. For a (set of) formula(s) χ, denote by Mod(χ) the class of its models. An
external description Ψ may now be called internally describable just in case there is a finite set X of internal descriptions such that Mod(Ψ) = Mod(X).
Contextual Descriptions. Assume an E-connection C = C^E_L(S1, S2) with link language L is given, and where Sig(C) denotes its signature, i.e. its set of non-logical symbols, including link relations. Moreover, assume S is a finite set of situations for C. Now, from an abstract point of view, a context oracle (or simply an oracle) is any function f mapping situations for an E-connection to a subclass of its models, picking out the class of models compatible with a situation: f : S → P(Mod(Sig(C))), where P denotes the powerset operation. This restricts the class of models for C^E_L(S1, S2) independently of the link language L and the external description language. For practical applications, however, we need to assume that these functions are computable, and that the classes {f(s) | s ∈ S} of models they single out can be finitely described by a context description language for S. For combining GUM and DCC, the context description language simply needs to add the missing items discussed at the end of Section 3.3, i.e. fix the position of the speaker, the reference system, etc., relative to a situation s. Clearly, there are many options for how to internalise the contextual information into an E-connection. We have mentioned a language for specifying descriptions of finite models, but there are many other possibilities. For instance, [31] discuss several formal logics that have been designed specifically for dealing with contextual information, and compare their expressive power. Moreover, it might turn out that different contextual aspects require different logics or languages of context to be adequately formalised. Such problems, however, are left for future work.
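As a toy illustration (the field names and the compatibility test are invented for this example), an oracle can be realised as a computable filter over a finite set of candidate models, keeping those that agree with the contextual items listed at the end of Section 3.3:

# Toy illustration of a context oracle; all names are invented.
situation = {
    "speaker_position": (0, 0),
    "speaker_orientation": "north",
    "frame_of_reference": "relative",
}

def compatible(model, situation):
    # A deliberately minimal check: the model must place the speaker where
    # the situation says the speaker is.
    return model["speaker_position"] == situation["speaker_position"]

def oracle(situation, candidate_models):
    """f : S -> P(Mod(Sig(C))), realised as a filter over a finite set."""
    return [m for m in candidate_models if compatible(m, situation)]

models = [{"speaker_position": (0, 0)}, {"speaker_position": (3, 1)}]
print(oracle(situation, models))   # keeps only the first model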
4.4 Perspectival E-Connections in Hets
The Heterogeneous Tool Set Hets [32] provides analysis and reasoning tools for the specification language HetCasl, a heterogeneous extension of Casl supporting a wide variety of logics [33].
Fig. 9. Perspectival E-connections as a structured theory in Hets
In particular, OWL-DL, relational schemes, sorted first-order logic FOLms, and quantified modal logic QS5 are covered. The DCC composition tables and GUM have already been formalised in Casl, and it has also been used successfully to formally verify the composition tables of qualitative spatial calculi [34]. As should be clear from the discussion so far, E-connections can essentially be considered as many-sorted heterogeneous theories: component theories can be formulated in different logical languages (which should be kept disjoint or sorted), and link relations are interpreted as relations connecting the sorts of the component logics.4 Fig. 9 shows perspectival E-connections as structured logical theories in the system Hets. Here, dotted arrows denote the extra-logical or external sources of input for the formal representation, i.e. for the description of relevant context and world knowledge; black arrows denote theory extensions, and dashed arrows a pushout operation into a (typically heterogeneous) colimit theory of the diagram (see [35,36,37] for technical details).
5 Conclusions and Future Work
We have investigated the problem of linking spatial language as analysed in a linguistically motivated ontology with spatial (qualitative) calculi, by mapping GUM's projective spatial relationships to DCC's orientations. We concluded that various aspects important for this connection but omitted or not given explicitly in the linguistic semantics need to be added to the formal representation. Moreover, we argued that these additional aspects can be divided into domain-specific (world-dependent) and contextual (situation-dependent) aspects. We defined an approach, called perspectival E-connections, for connecting all these heterogeneous modules into a structured heterogeneous theory. Perspectival E-connections now provide us with a formal framework for defining relationships between spatial language and calculi. This is not limited to the aspect of orientation discussed in detail in this paper. Rather, it can be carried out in the same way to deal with aspects covered by alternative orientation calculi, as well as calculi for distances, topology, shapes, etc. Here, the interplay between various such spatial calculi and GUM's respective treatment of the relevant non-projective spatial language has to be analysed.
Acknowledgements
Our work was carried out in the DFG Transregional Collaborative Research Center SFB/TR 8 Spatial Cognition, project I1-[OntoSpace]. Financial support by the Deutsche Forschungsgemeinschaft is gratefully acknowledged. The authors would like to thank John Bateman and Till Mossakowski for fruitful discussions.
4 The main difference between various E-connections now lies in the expressivity of the 'link language' L connecting the different logics. This can range from a sub-Boolean logic, to various DLs, or indeed to full first-order logic.
References
1. Moratz, R., Dylla, F., Frommberger, L.: A Relative Orientation Algebra with Adjustable Granularity. In: Proceedings of the Workshop on Agents in Real-Time and Dynamic Environments (IJCAI 2005) (2005)
2. Casati, R., Varzi, A.C.: Parts and Places – The Structures of Spatial Representation. MIT Press, Cambridge (1999)
3. Cohn, A.G., Bennett, B., Gooday, J., Gotts, N.M.: Representing and Reasoning with Qualitative Spatial Relations. In: Stock, O. (ed.) Spatial and Temporal Reasoning, pp. 97–132. Kluwer Academic Publishers, Dordrecht (1997)
4. Schlieder, C.: Qualitative Shape Representation. In: Geographic Objects with Indeterminate Boundaries, pp. 123–140. Taylor & Francis, London (1996)
5. Kuipers, B.: The Spatial Semantic Hierarchy. Artificial Intelligence 19, 191–233 (2000)
6. Kracht, M.: Language and Space. Book manuscript (2008)
7. Bateman, J., Tenbrink, T., Farrar, S.: The Role of Conceptual and Linguistic Ontologies in Discourse. Discourse Processes 44(3), 175–213 (2007)
8. Freksa, C.: Using Orientation Information for Qualitative Spatial Reasoning. In: Frank, A.U., Campari, I., Formentini, U. (eds.) Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, pp. 162–178. Springer, Berlin (1992)
9. Bateman, J.A., Henschel, R., Rinaldi, F.: Generalized Upper Model 2.0: Documentation. Technical report, GMD/Institut für Integrierte Publikations- und Informationssysteme, Darmstadt, Germany (1995)
10. Horrocks, I., Kutz, O., Sattler, U.: The Even More Irresistible SROIQ. In: Knowledge Representation and Reasoning (KR 2006), pp. 57–67 (2006)
11. Shi, H., Tenbrink, T.: Telling Rolland Where to Go: HRI Dialogues on Route Navigation. In: WoSLaD Workshop on Spatial Language and Dialogue, Delmenhorst, Germany, October 23–25 (2005)
12. Tenbrink, T.: Space, Time, and the Use of Language: An Investigation of Relationships. Mouton de Gruyter, Berlin (2007)
13. Levinson, S.C.: Space in Language and Cognition: Explorations in Cognitive Diversity. Cambridge University Press, Cambridge (2003)
14. Herskovits, A.: Language and Spatial Cognition: An Interdisciplinary Study of the Prepositions in English. Studies in Natural Language Processing. Cambridge University Press, London (1986)
15. Coventry, K.R., Garrod, S.C.: Saying, Seeing and Acting. The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Psychology. Psychology Press, Hove (2004)
16. Talmy, L.: How Language Structures Space. In: Pick, H., Acredolo, L. (eds.) Spatial Orientation: Theory, Research, and Application, pp. 225–282. Plenum Press, New York (1983)
17. Halliday, M.A.K., Matthiessen, C.M.I.M.: Construing Experience Through Meaning: A Language-Based Approach to Cognition. Cassell, London (1999)
18. Vorwerg, C.: Raumrelationen in Wahrnehmung und Sprache: Kategorisierungsprozesse bei der Benennung visueller Richtungsrelationen. Deutscher Universitätsverlag, Wiesbaden (2001)
19. Winterboer, A., Tenbrink, T., Moratz, R.: Spatial Directionals for Robot Navigation. In: van der Zee, E., Vulchanova, M. (eds.) Motion Encoding in Language and Space. Oxford University Press, Oxford (2008)
20. Cohn, A.G., Hazarika, S.M.: Qualitative Spatial Representation and Reasoning: An Overview. Fundamenta Informaticae 43, 2–32 (2001)
21. Renz, J., Nebel, B.: Qualitative Spatial Reasoning Using Constraint Calculi. In: Aiello, M., Pratt-Hartmann, I., van Benthem, J. (eds.) Handbook of Spatial Logics, pp. 161–215. Springer, Dordrecht (2007)
22. Renz, J., Mitra, D.: Qualitative Direction Calculi with Arbitrary Granularity. In: Zhang, C., Guesgen, H.W., Yeap, W.K. (eds.) PRICAI 2004. LNCS (LNAI), vol. 3157, pp. 65–74. Springer, Heidelberg (2004)
23. Dylla, F., Moratz, R.: Exploiting Qualitative Spatial Neighborhoods in the Situation Calculus. In: Freksa, C., Knauff, M., Krieg-Brückner, B., Nebel, B., Barkowsky, T. (eds.) Spatial Cognition IV: Reasoning, Action, Interaction. International Conference Spatial Cognition 2004. Springer, Heidelberg (2005)
24. Billen, R., Clementini, E.: Projective Relations in a 3D Environment. In: Sester, M., Galton, A., Duckham, M., Kulik, L. (eds.) Geographic Information Science, pp. 18–32. Springer, Heidelberg (2006)
25. Gabbay, D., Kurucz, A., Wolter, F., Zakharyaschev, M.: Many-Dimensional Modal Logics: Theory and Applications. Studies in Logic and the Foundations of Mathematics, vol. 148. Elsevier, Amsterdam (2003)
26. Lewis, D.K.: Counterpart Theory and Quantified Modal Logic. Journal of Philosophy 65 (1968); repr. in: Loux, M.J. (ed.) The Possible and the Actual, Ithaca (1979); also in: Lewis, D.K.: Philosophical Papers 1, Oxford (1983), pp. 113–126
27. Kutz, O.: E-Connections and Logics of Distance. PhD thesis, The University of Liverpool (2004)
28. Kutz, O., Lutz, C., Wolter, F., Zakharyaschev, M.: E-Connections of Abstract Description Systems. Artificial Intelligence 156(1), 1–73 (2004)
29. Cuenca Grau, B., Parsia, B., Sirin, E.: Combining OWL Ontologies Using E-Connections. Journal of Web Semantics 4(1), 40–59 (2006)
30. Gärdenfors, P.: Conceptual Spaces – The Geometry of Thought. Bradford Books. MIT Press, Cambridge (2000)
31. Serafini, L., Bouquet, P.: Comparing Formal Theories of Context in AI. Artificial Intelligence 155, 41–67 (2004)
32. Mossakowski, T., Maeder, C., Lüttich, K.: The Heterogeneous Tool Set. In: Grumberg, O., Huth, M. (eds.) TACAS 2007. LNCS, vol. 4424, pp. 519–522. Springer, Heidelberg (2007)
33. CoFI (The Common Framework Initiative): Casl Reference Manual. Springer (2004), freely available at http://www.cofi.info
34. Wölfl, S., Mossakowski, T., Schröder, L.: Qualitative Constraint Calculi: Heterogeneous Verification of Composition Tables. In: Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2007), pp. 665–670. AAAI Press, Menlo Park (2007)
35. Kutz, O., Mossakowski, T., Codescu, M.: Shapes of Alignments: Construction, Composition, and Computation. In: International Workshop on Ontologies: Reasoning and Modularity (at ESWC) (2008)
36. Kutz, O., Mossakowski, T.: Conservativity in Structured Ontologies. In: 18th European Conf. on Artificial Intelligence (ECAI 2008). IOS Press, Amsterdam (2008)
37. Codescu, M., Mossakowski, T.: Heterogeneous Colimits. In: Boulanger, F., Gaston, C., Schobbens, P.Y. (eds.) MoVaH 2008 (2008)
Automatic Classification of Containment and Support Spatial Relations in English and Dutch
Kate Lockwood, Andrew Lovett, and Ken Forbus
Qualitative Reasoning Group, Northwestern University
2133 Sheridan Rd, Evanston, IL 60208
{kate, andrew-lovett, forbus}@northwestern.edu
Abstract. The need to communicate and reason about space is pervasive in human cognition. Consequently, most languages develop specialized terms for describing relationships between objects in space – spatial prepositions. However, the specific set of prepositions and the delineations between them vary widely. For example, in English containment relationships are categorized as in and support relationships are classified as on. In Dutch, on the other hand, three different prepositions are used to distinguish between different types of support relations: op, aan, and om. In this paper we show how progressive alignment can be used to model the formation of spatial language categories along the containment-support continuum in both English and Dutch. Keywords: Cognitive modeling, spatial prepositions.
1 Introduction
Being able to reason and communicate about space is important in many human tasks from hunting and gathering to engineering design. Virtually all languages have developed specialized terms to describe spatial relationships between objects in their environments. In particular, we are interested in spatial prepositions. Spatial prepositions are typically a closed class of words and usually make up a relatively small part of a language. For example, in English there are only around 100 spatial prepositions. Understanding how people assign spatial prepositions to arrangements of objects in the environment is an interesting problem for cognitive science. Several different aspects of a scene have been shown to contribute to spatial preposition assignment: the geometric arrangement of objects, the typical functional roles of objects (e.g. [9]), whether those functional relationships are being fulfilled (e.g. [4]), and even the qualitative physics of the situation (e.g. [5]). The particular elements that contribute to prepositions and how they are used to divide the space of prepositions have been found to vary widely between languages (e.g. [1, 2]). This paper shows how progressive alignment can be used to model how spatial prepositions are learned. Progressive alignment uses the structural alignment process of structure-mapping theory to construct generalizations from an incremental stream of examples. The specific phenomenon we model here is how people make distinctions
along the containment-support continuum in both English and Dutch, based on a psychological experiment by Gentner and Bowerman [11]. To reduce tailorability in encoding the stimuli, we use hand-drawn sketches which are processed by a sketch understanding system. We show that our model can learn to distinguish these prepositions, using (as people do) semantic knowledge as well as geometric information, and requiring orders of magnitude fewer examples than other models of learning spatial prepositions. The next section describes the Gentner and Bowerman study that provided the inspiration for our experiments. Section 3 reviews structure-mapping theory, progressive alignment, and the analogical processing simulations we use in our model. It also summarizes the relevant aspects of CogSketch, the sketch understanding system we used to encode the stimuli, and the ResearchCyc knowledge base we use for common sense knowledge. Section 4 describes the simulation study. We conclude by discussing related work, broader issues, and future work.
2 Gentner and Bowerman's Study of English/Dutch Prepositions
Gentner and Bowerman [11] were testing the Typological Prevalence hypothesis that the frequency with which distinctions and categories are found across the world's languages provides a clue to conceptual "naturalness" and how easy that particular distinction is to learn. To explore this, they focused on a subset of spatial prepositions in English and Dutch. The English and Dutch languages divide the support-containment continuum quite differently. In English there are two prepositions: in is used for containment relationships and on is used for support relationships. However, Dutch distinguishes three different forms of support. The prepositions for Dutch and English are outlined in Table 1 below.
Table 1. Table showing the containment and support prepositions in English and Dutch. Drawings here are taken from the original Gentner and Bowerman paper.
English | Dutch | Relationship
on | op | support from below
on | aan | hanging attachment
on | om | encirclement with contact
in | in | containment
(The Example column of the original table contained drawings.)
Bowerman and Pederson found in a previous study [1] that some ways of dividing up the containment-support continuum are very common crosslinguistically while others are relatively rare. English follows a more linguistically common approach by grouping all support relations together into the on category, while the Dutch op-om-aan distinction is extremely rare. Both use the very common containment category. Following the Typological Prevalence Hypothesis, both English and Dutch children should learn the common and shared category of in around the same time. It should take Dutch children longer to learn the rare aan/op/om distinctions for support than it takes the English children to learn the common on category.
2.1 Experiment
They tested children in five age groups (2, 3, 4, 5, and 6 years old) as well as adults who were native speakers of English and Dutch. Each subject was shown a particular arrangement of objects and asked to describe the relationship in their native language. In the original experiment, 3-dimensional objects were used. So, for example, a subject would be shown a mirror on the wall of a doll house and asked "Where is the mirror?". The set of all stimuli is shown in Table 2 below.
Table 2. Stimuli from the Gentner and Bowerman study
op/on: cookie on plate, toy dog on book, bandaid on leg, raindrops on window, sticker on cupboard, lid on jar, top on tube, freckles on face
aan/on: mirror on wall, purse on hook, clothes on line, lamp on ceiling, handle on pan, string on balloon, knob on door, button on jacket
om/on: necklace on neck, rubber band on can, bandana on head, hoop around doll, ring on pencil, tube on stick, wrapper on gum, ribbon on candle
in/in: cookie in bowl, candle in bottle, marble in water, stick in straw, apple in ring, flower in book, cup in tube, hole in towel
The results of the study were consistent with the Typological Prevalence hypothesis. Specifically, Dutch children are slower to acquire the op, aan, om system of support relations than English children are to learn the single on category. Both groups of children learned the in category early and did not differ in their proficiency using the term. Across all prepositions, English-speaking 3 to 4 year old children used the correct preposition 77% of the time, while the Dutch children used the correct preposition 43% of the time. Within the Dutch children, the more typical op category was learned sooner than the rarer aan and om categories. For a more detailed description of the results, please see the original paper.
2.2 Motivation for Our Simulation Study
Modeling these results in detail is a daunting challenge for cognitive simulation. To accurately capture the developmental trajectories of learning over multiple years, for example, requires constructing a body of stimuli whose statistical properties are based
on hypotheses about the commonalities of experiences in the world. No cognitive simulation has ever operated at that scale. There are practical challenges as well as methodological challenges: automatic encoding of stimuli becomes essential, for instance, whereas most cognitive simulations operate over hand-coded representations. Consequently, in this paper we focus on a simpler, but still difficult, question: Can progressive alignment be used to learn the spatial language containment/support categories in both English and Dutch? We use the Gentner & Bowerman stimuli as a starting point, a known set of good examples for each of these categories.
3 Simulation Background
Several existing systems were used in our simulation. Each is described briefly here.
3.1 Simulating Similarity Via Analogical Matching
We use Gentner's structure-mapping theory of analogy and similarity [12]. In structure-mapping, analogy and similarity are defined in terms of a structural alignment process operating over structured, relational representations. Our simulation of comparison for finding similarity is the Structure-Mapping Engine [8], which is based on structure-mapping theory. SME takes as input two cases, a base and a target. It produces as output between one and three mappings describing the comparison between base and target. Each mapping consists of: (1) correspondences between elements in the base and elements in the target; (2) a structural evaluation score, a numerical characterization of how similar the base and target are; and (3) candidate inferences, conjectures about the target made by projecting partially-mapped base structures. There is considerable psychological evidence supporting structure-mapping theory, including modeling visual similarity and differences [13, 17], and SME has been used to successfully model a variety of psychological phenomena.
3.2 Progressive Alignment and SEQL
Progressive alignment constructs generalizations by incremental comparisons, assimilating examples that are sufficiently similar into generalizations. These generalizations are still rather concrete, and do not contain variables. Attributes and relationships that are not common "wear away", leaving the important commonalities in the concepts. Probabilities are associated with each statement in the generalization, which provides a way of identifying what aspects of the description are more common (and hence more central) to the generalization. We model progressive alignment via SEQL [14, 15, 20], which uses SME as a component. SEQL creates generalizations from an incoming stream of examples. A generalization context consists of a set of generalizations and examples for a concept. For example, in learning spatial prepositions, there would be one generalization context per preposition. All scenes described with the word op, for example, would be processed in the op context. There can be more than one generalization per context, since real-world concepts are often messy and hence disjunctive. When a new example arrives, it is compared against every generalization in turn, using SME. If it is sufficiently close to one of them (as determined by the assimilation threshold), it is assimilated into that generalization. The probabilities associated with
statements that match the example are updated, and the statements of the example that do not match the generalization are incorporated, but with a probability of 1/n, where n is the number of examples in that generalization. If the example is not sufficiently close to any generalization, it is then compared against the list of unassimilated examples in that context. If the similarity is over the assimilation threshold, the two examples are used to construct a new generalization, by the same process. An example that is determined not to be sufficiently similar to either an existing generalization or an unassimilated example is maintained as a separate example.
3.3 CogSketch
CogSketch1 is an open-domain sketch understanding system. Each object in a CogSketch sketch is a glyph. Glyphs have ink and content. The ink consists of polylines, i.e., lists of points representing what the user drew. The content is a symbolic token used to represent what the glyph denotes. In CogSketch, users indicate the type of the content of the glyph in terms of concepts in an underlying knowledge base. This is one form of conceptual labeling. The knowledge base used for this work is a subset of the ResearchCyc KB, which contains over 30,000 concepts. In addition to conceptual labels, the contents of glyphs can also be given names. A name is a natural language string that the user can use to refer to the content of the glyph. CogSketch automatically computes a number of qualitative spatial relations and attributes for glyphs in a sketch. The relations computed include the RCC-8 qualitative relations [3] that describe all possible topological relations between two-dimensional shapes (e.g. disconnected, edge-connected, partially-overlapping). RCC-8 relations are also used to guide the computation of additional spatial relationships such as positional relations like right/left. CogSketch also computes two types of glyph groups: connected glyph groups and contained glyph groups. Connected glyph groups consist of a set of glyphs whose ink strokes intersect. A contained glyph group consists of a single container glyph and all of the glyphs fully contained within it.
3.4 ResearchCyc
Consider the sketch below showing the stimulus "freckles on face". If you just look at the topological relationship between the freckle glyphs and the face glyph, they clearly form a contained glyph group with the face as the container and the freckles as the insiders. As work by Coventry and others has shown [6], geometric properties are not sufficient to account for the way people label situations with spatial prepositions. A purely geometric account would declare freckles to be in the face, but we actually say freckles are on/op faces. To model such findings, we must use real-world knowledge as part of our simulation. For example, we know that freckles are physically part of a face. We use knowledge from the ResearchCyc KB2 as an approximation for such knowledge. Freckles, for example, are a subclass of PhysiologicalFeatureOfSurface, providing the semantic knowledge that, combined with geometric information, enables us to model spatial preposition judgments.
1 Available online at http://spatiallearning.org/projects/cogsketch_index.html. The publicly available version of CogSketch comes bundled with the OpenCyc KB as opposed to the ResearchCyc KB which was used for this work.
2 http://research.cyc.com/
Fig. 1. Sketch of the spatial arrangement “freckles on face”. If you examine just the geometric information, the freckles are in the area delineated by the face.
As the world's largest and most complete general knowledge base, ResearchCyc contains much of the functional information needed about the figure and ground objects in our stimuli.
4 Experiment
4.1 Materials
All 32 original stimuli from the Gentner and Bowerman study were sketched using CogSketch. Each sketch was stored as a case containing: (1) the automatically computed qualitative spatial relationships and (2) information about the types of objects in the sketch. In the original experiment subjects were cued as to which object should be the figure (e.g. "where is the mirror") and which should be the ground. To approximate this, each sketch contained two glyphs, one named figure and one named ground, and these names were used by the model. Recall that names in CogSketch are just strings that are used to refer to the objects. Each object was also conceptually labeled using concepts from the ResearchCyc KB. For instance, in the mirror on the wall stimulus, the mirror was declared to be an instance of the concept Mirror and the wall was labeled as an instance of WallInAConstruction. When people learn to identify spatial language categories in their native languages, they learn to focus on the relationships between objects, and to retain only the important features of the objects themselves rather than focusing on the surface features of the objects. As noted above, having conceptual labels and a knowledge base allows us to simulate this type of knowledge. For each conceptual label, additional concepts from its genls hierarchy were extracted from ResearchCyc. The genls hierarchy specifies subclass/superclass relationships between all the concepts of the KB. So, for example, Animal and Dog would both be genls of Dachshund. Here we were particularly interested in facts relating to whether objects were surfaces or containers – and this was particularly important for ground glyphs. The original facts were removed (in our example "Dachshund" would be deleted) to simulate abstraction away from specific object types to more important semantic categories. In the original study, the physical objects used as stimuli were manipulated to make the important relationships more salient to subjects. We approximated this by drawing our sketches so as to highlight the important relationships for the individual
spatial language categories. For example, the sketches for aan that required showing a connection by fixed points were drawn from an angle that made the connectivity between the parts observable. Figure 2 below shows two aan sketches: knob aan door and clothes aan line. They are drawn from perspectives that allow the system easy access to the point-contact relationship.
Fig. 2. Two sketched stimuli showing objects drawn from different angles to make the point connections salient
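For illustration, a single encoded stimulus might look roughly like the following case of facts. This is a hypothetical rendering: the relation and concept names mimic the vocabulary used in this paper, but the exact set of facts CogSketch produces is richer and may differ, and the superclass fact shown is invented for the example.

```python
# Hypothetical case for the "mirror on wall" stimulus.
mirror_on_wall = [
    ("isa", "figure", "Mirror"),                   # conceptual label of the figure glyph
    ("isa", "ground", "WallInAConstruction"),      # conceptual label of the ground glyph
    ("genls", "WallInAConstruction", "Surface"),   # illustrative superclass fact from the KB
    ("rcc8-EC", "figure", "ground"),               # illustrative topological relation
    ("above", "figure", "ground"),                 # illustrative positional relation
]
# In the model, the most specific conceptual labels are later removed, leaving the
# genls-derived categories (e.g. surfaces vs. containers) and the spatial relations.
```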
4.2 Method
The basic spatial category learning algorithm is this: For each word to be learned, a generalization context is created. Each stimulus representing an example of that word in use is added to the appropriate generalization contexts using SEQL. (Since we are looking at both Dutch and English, each example will be added to two generalization contexts, one for the appropriate word in each language.) Recall that SEQL can construct more than one generalization, and can include unassimilated examples in its representation of a category. We model the act of assigning a spatial preposition to a new example E as follows. We let the score of a generalization context be the maximum score obtained by using SME to compare E to all of the generalizations and unassimilated examples in that context. The word associated with the highest-scoring generalization context represents the model’s decision. To test this model, we did a series of trials. Each trial consisted of selecting one stimulus as the test probe, and using the rest to learn the words. The test probe was then labeled as per the procedure above. The trial was correct if the model generated the intended label for that stimulus. There were a total of 32 trials in English (8 for in and 24 for on) and 32 trials in Dutch (8 each for in, op, aan, and om), one for each stimulus sketch.
4.3 Results
The results of our experiment are shown below. The generalizations and numbers given are for running SEQL on all the sketches for a category. The table below summarizes the number of sketches that were classified correctly, for each preposition the
Table 3. Summary of correct labels for each preposition category tested

English:  in 6 (75%)    on 21 (87%)
Dutch:    in 6 (75%)    op 7 (87%)    aan 6 (75%)    om 8 (100%)
Table 4. Number of exemplars and generalizations for each generalization context
                  English         Dutch
                  in    on        in    op    aan    om
Generalizations    2     6         2     2      3     3
Exemplars          2     0         2     2      0     2
number is out of 8 total sketches except for English on which has 24 total sketches. All results are statistically significant (P < 10^-4), except for the English in (P < 0.2), which is close. For an in-depth discussion of the error patterns, see Section 4.4. Recall that within each generalization context, SEQL was free to make as many generalizations as it liked. SEQL was also able to keep some cases as exemplars if they did not match any of the other cases in the context. The table below summarizes the number of generalizations and exemplars for each context.

Best Generalization IN, Size: 3 (candle in bottle, cookie in bowl, marble in water)
--DEFINITE FACTS: (rcc8-TPP figure ground)
--POSSIBLE FACTS: 33%: (Basin ground)  33%: (Bowl-Generic ground)
Fig. 3. One of the generalizations for English in along with the sketches for the component exemplars
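A rough computational reading of such a generalization is sketched below. SEQL itself aligns cases structurally with SME before merging them; the simplification here treats each case as a flat set of facts, so it only mimics the bookkeeping of definite and probability-weighted possible facts, where a fact occurring in k of the n merged cases receives probability k/n (1/n in the simplest case mentioned in Section 3.2). The fact tuples are illustrative.

```python
from collections import Counter

def merge_cases(cases):
    """Merge cases into one generalization: facts shared by every case become
    definite facts; the remaining facts are kept as possible facts with
    probability k/n, where k is the number of cases that contain the fact."""
    n = len(cases)
    counts = Counter(fact for case in cases for fact in set(case))
    definite = [fact for fact, k in counts.items() if k == n]
    possible = {fact: k / n for fact, k in counts.items() if k < n}
    return definite, possible

# Illustrative cases for English "in" (fact vocabulary loosely following Figure 3):
cases = [
    [("rcc8-TPP", "figure", "ground"), ("Bottle", "ground")],        # candle in bottle
    [("rcc8-TPP", "figure", "ground"), ("Bowl-Generic", "ground")],  # cookie in bowl
    [("rcc8-TPP", "figure", "ground"), ("Basin", "ground")],         # marble in water
]
definite, possible = merge_cases(cases)
# definite -> [("rcc8-TPP", "figure", "ground")]; each remaining fact gets probability 1/3
```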
At first the amount of variation within the contexts might seem surprising. However, since the stimuli were chosen to cover the full range of situations for each context it makes more sense. Consider the Dutch category op. The 8 sketches for this one generalization included very different situations: clingy attachment (e.g. sticker op cupboard), traditional full support (e.g. cookie op plate) and covering relationships (e.g. top op jar). Two of the English generalizations are shown in the figures below. For each generalization the cases that were combined are listed followed by the facts and associated probabilities.

Best Generalization ON, Size: 2 (top on tube, lid on jar)
--DEFINITE FACTS: (Covering-Object figure)  (above figure ground)
--POSSIBLE FACTS: 50%: (definiteOverlapCase figure ground)  50%: (rcc8-PO figure ground)  50%: (rcc8-EC figure ground)
Fig. 4. Sample generalizations for English on along with the component sketches
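The labeling procedure of Section 4.2 can be sketched as follows. The actual model scores a probe against each context with SME's structural evaluation; here a simple fact-overlap score is used as a stand-in, and a context is just a list of items (generalizations or unassimilated exemplars), each reduced to a flat fact set.

```python
def overlap_score(example, item):
    """Dice-style fact overlap, a crude stand-in for an SME similarity score."""
    a, b = set(example), set(item)
    return 2 * len(a & b) / (len(a) + len(b))

def context_score(example, context):
    """Score of a generalization context: the best match over its generalizations
    and unassimilated exemplars."""
    return max(overlap_score(example, item) for item in context)

def label(example, contexts):
    """Pick the word whose generalization context scores highest."""
    return max(contexts, key=lambda word: context_score(example, contexts[word]))

# contexts maps each word (e.g. "in", "on", or the Dutch prepositions) to its items;
# a leave-one-out trial removes one sketch, learns the contexts from the rest, and
# checks whether label(probe, contexts) returns the intended preposition.
```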
4.4 Error Analysis Closer examination of the specific errors made by SEQL is also illuminating. For example, both the Dutch and English experiments failed on two in stimuli. It was the same two stimuli for both languages: flower in book, and hole in towel. The first case, flower in book, is hard to represent in a sketch. In the original study, actual objects were used making it easier to place the flower in the book. It is not surprising that this case failed given that it was an exemplar in both in contexts and did not share much structure with other stimuli in that context. Hole in towel fails for a different reason. The ResearchCyc knowledge base does not have any concept of a hole. Moreover, how holes should be considered in spatial relationships seems different than for physical objects. Many of our errors stem from the small size of our stimuli set. For contexts that contained multiple variations, there were often only one or two samples of each. An
interesting future study will be to see how many stimuli are needed to minimize error rates. (Even human adults are not 100% correct on these tasks.) Interestingly, om is one of the prepositions that is harder for Dutch children to learn (it covers situations of encirclement with support). However, it was the only Dutch preposition for which our system scored 100%. This again is probably explainable by sample size. Since the entire context contained only cases of encirclement with support, there was more in common between all of the examples. 4.5 Discussion Our results suggest that progressive alignment is a promising technique for modeling the learning of spatial language categories. Using a very small set of training stimuli (only 7 sketches in some cases) SEQL was able to correctly label the majority of the test cases. An examination of the results and errors indicates that our model, consistent with human data, uses both geometric and semantic knowledge in learning these prepositions. SEQL is able to learn these terms reasonably well, even with far less data than human children, but on the other hand, it is given very refined inputs to begin with (i.e., sketches). As noted below, we plan to explore scaling up to larger stimulus sets in future work.
5 Related Work
There has been considerable cognitive science research into spatial prepositions, including a number of computational models. Most computational models (cf. [16, 18, 10]) are based only on geometric information, which means that they cannot model findings of Coventry et al. [6] and Feist and Gentner [9], who showed that semantic knowledge of functional properties is also crucial. Prior computational models have also focused only on inputs consisting of simple geometric shapes (squares, circles, triangles, etc.). We believe our use of conceptually labeled sketches is an interesting and practical intermediate point between simple geometric stimuli and full 3D vision. We also differ from many other models of spatial language use in the number of training trials required. Many current models use orders of magnitude more trials than we do. We are not arguing that people learn spatial preposition categories after exposure to only 7 examples. After all, children have a much harder task than the one we have modeled here: they have many more distractions and a much richer environment from which to extract spatial information. On the other hand, we suspect that requiring 10^3-10^4 exposures, as current connectionist models need, is psychologically implausible. For example, one model requires an epoch of 2100 stimuli just to learn the distinction above/below/over/under for one arrangement of objects (a container pouring a liquid into a bowl/plate/dish) [7]. The actual number of trials that is both sufficient and cognitively plausible remains an open question and an interesting problem for future work.
6 Conclusions and Future Work Our model was able to successfully learn the support-containment prepositions in both Dutch and English with a small number of training trials. We see three lines of
investigation suggested by these results. First, we would like to expand our experiments to include more relationships (e.g. under, over, etc.). Second, we would like to expand to other languages. For example, Korean uniquely divides the containment relationship into tight fit and loose fit relations. Third, we are in the process of building a sketch library of more instances of spatial relations. With more sketches, we will have additional evidence concerning the coverage of our model. There is also clearly a tradeoff between using a cognitively plausible number of training examples and having enough training examples to get good generality. For example, enough examples are needed to automatically extract the important object types and features (e.g. containers) and to ignore the spurious ones (e.g. that something is edible). We are planning future experiments to examine this issue by varying the number of training trials used. It will also be interesting to see if we can use the same set of experiments to model the development of spatial language categories in children by varying the availability of different types of information.
Acknowledgments. This work was sponsored by a grant from the Intelligent Systems Program of the Office of Naval Research and by The National Science Foundation under grant no: SBE-0541957, The Spatial Intelligence and Learning Center. The authors would like to thank Dedre Gentner and Melissa Bowerman for access to their in-press paper and stimuli.
References 1. Bowerman, M., Pederson, E.: Crosslinguistic perspectives on topological spatial relationships. In: The 87th Annual Meeting of the American Anthropological Association, San Francisco, CA (paper presented, 1992) 2. Bowerman, M.: Learning How to Structure Space for Language: A Crosslinguistic Perspective. In: Bloom, P., Peterson, M.A., Nadel, L., Garrett, M.F. (eds.) Language and Space, pp. 493–530. MIT Press, Cambridge (1996) 3. Cohn, A.: Calculi for Qualitative Spatial Reasoning. In: Pfalzgraf, J., Calmet, J., Campbell, J.A. (eds.) AISMC 1996. LNCS, vol. 1138, pp. 124–143. Springer, Heidelberg (1996) 4. Coventry, K.R., Prat-Sala, M., Richards, L.V.: The Interplay Between Geometry and Function in the Comprehension of ‘over’, ‘under’, ‘above’, and ‘below’. Journal of Memory and Language 44, 376–398 (2001) 5. Coventry, K.R., Mather, G.: The real story of ‘over’? In: Coventry, K.R., Oliver, P. (eds.) Spatial Language: Cognitive and Computational Aspects, Kluwer Academic Publishers, Dordrecht (2002) 6. Coventry, K.R., Garrod, S.C.: Saying, Seeing and Acting: The Psychological Semantics of Spatial Prepositions. Essays in Cognitive Science Series. Lawrence Erlbaum Associates, Mahwah (2004) 7. Coventry, K.R., Cangelosi, A., Rajapakse, R., Bacon, A., Newstead, S., Joyce, D., Richards, L.V.: Spatial prepositions and vague quantifiers: Implementing the functional geometric framework. In: Proceedings of Spatial Cognition Conference. Springer, Germany (2005) 8. Falkenhainer, B., Forbus, K., Gentner, D.: The Structure-Mapping Engine. In: Proceedings of the Fifth National Conference on Artificial Intelligence, pp. 272–277. Morgan Kaufmann, San Francisco (1986)
9. Feist, M.I., Gentner, D.: On Plates, Bowls, and Dishes: Factors in the Use of English ‘in’ and ‘on’. In: Proceedings of the 20th Annual Conference of the Cognitive Science Society (1998) 10. Gapp, K.P.: Angle, distance, shape and their relationship to projective relations. In: Moore, J.D., Lehman, J.F. (eds.) Proceedings of the Seventeenth Annual Conference of the Cognitive Science Society, pp. 112–117. Lawrence Erlbaum Associates Inc., Mahwah (1995) 11. Gentner, D., Bowerman, M.: Why Some Spatial Semantic Categories are Harder to Learn than Others: The Typological Prevalence Hypothesis (in press) 12. Gentner, D.: Structure-Mapping: A theoretical framework for analogy. Cognitive Science 7, 155–170 (1983) 13. Gentner, D., Markman, A.B.: Structure mapping in analogy and similarity. American Psychologist 52, 42–56 (1997) 14. Halstead, D., Forbus, K.: Transforming between Propositions and Features: Bridging the Gap. In: Proceedings of AAAI, Pittsburgh, PA (2005) 15. Kuehne, S., Forbus, K., Gentner, D., Quinn, B.: SEQL: Category learning as progressive abstraction using structure mapping. In: Proceedings of the 22nd Annual Meeting of the Cognitive Science Society (2000) 16. Lockwood, K., Forbus, K., Halstead, D., Usher, J.: Automatic Categorization of Spatial Prepositions. In: Proceedings of the 28th Annual Conference of the Cognitive Science Society (2006) 17. Markman, A.B., Gentner, D.: Commonalities and differences in similarity comparisons. Memory & Cognition 24(2), 235–249 (1996) 18. Regier, T.: The human semantic potential: Spatial language and constrained connectionism. MIT Press, Cambridge (1996) 19. Regier, T., Carlson, L.A.: Grounding spatial language in perception: An empirical and computational investigation. Journal of Experimental Psychology: General 130(2), 273–298 (2001) 20. Skorstad, J., Gentner, D., Medin, D.: Abstraction Process During Concept Learning: A Structural View. In: Proceedings of the 10th Annual Conference of the Cognitive Science Society (1988)
Integral vs. Separable Attributes in Spatial Similarity Assessments Konstantinos A. Nedas and Max J. Egenhofer National Center for Geographic Information and Analysis and Department of Spatial Information Science and Engineering University of Maine Boardman Hall, Orono, ME 04469-5711, USA
[email protected],
[email protected]
Abstract. Computational similarity assessments over spatial objects are typically decomposed into similarity comparisons of geometric and non-geometric attribute values. Psychological findings have suggested that different types of aggregation functions—for the conversions from the attributes’ similarity values to the objects’ similarity values—should be used depending on whether the attributes are separable (which reflects perceptual independence) or whether they are integral (which reflects such dependencies among the attributes as typically captured in geometric similarity measures). Current computational spatial similarity methods have ignored the potential impact of such differences, however, treating all attributes and their values homogeneously. Through a comprehensive simulation of spatial similarity queries, the impact of psychologically compliant (which recognize groups of integral attributes) vs. deviant (which fail to detect such groups) methods has been studied, comparing the top-10 items of the compliant and deviant ranked lists. We found that only for objects with very small numbers of attributes—no more than two or three attributes per object—is the explicit recognition of integral attributes negligible, but the differences between compliant and deviant methods become progressively worse as the percentage of integral attributes increases and the number of groups in which these integral attributes are distributed decreases.
1 Introduction Similarity assessment implies a judgment about the semantic proximity of two or more entities. In a rudimentary form, this process consists of a decomposition of the entities under comparison into elements in which the entities are the same, and elements in which they differ (James 1890). People perform such tasks based on their intuitions and knowledge; however, their judgments are often subjective and display no strict mathematical models (Tversky 1977). Formalized similarity assessments are critical ingredients of Naive Geography (Egenhofer and Mark 1995), which serves as the basis for the design of intelligent GISs that will act and respond much like a person would. The challenge for machines to perform similarly is the translation of a qualitative similarity assessment into the quantitative realm of similarity scores,
typically within the range of 0 (worst match) to 1 (best match). This paper addresses similarity within the context of spatial database systems. Spatial similarity assessment is commonly based on the comparisons of spatial objects, which are typically characterized by geometric (Bruns and Egenhofer 1996) and thematic (Rodríguez and Egenhofer 2004) attributes. Geometric attributes are associated with the objects’ shapes and sizes, while thematic attributes capture non-spatial information. For example, the class of Rhodes is island, its name and population are thematic attributes, while a shape description such as the ratio of the major and minor axes of its minimum bounding rectangle provides values for its geometric attributes. The same dichotomy of spatial and thematic characteristics applies to relations. For example, Rhodes, which is disjoint from the Greek mainland and located 650km southeast of Thessaloniki, has a smaller population than Athens. Spatial similarity assessments consider the objects’ characteristics and relations. The similarity of two spatial objects is typically computed with a distance (i.e., dissimilarity) measure that is defined upon the objects’ representations. To yield cognitively plausible results, this estimate must match with people’s notions of object similarity (Gärdenfors 2000). A critical aspect in this process is the role of an aggregation function, which combines atomic judgments (i.e., comparisons of pairs of attribute values) into an overall composite measure for pairs of objects. Separable attributes are perceptually independent as they refer to properties that are obvious, compelling, and clearly perceived as two different qualities of an entity (Torgerson 1965). Conversely, integral attributes create a group when their values are conceptually correlated, but lack an obvious separability (Ashby and Townsend 1986). Conceptual correlation implies that the values of such attributes are perceived as a single property, independent of their attributes’ internal representations (e.g., as a set of concomitant attributes). While general-purpose information systems employ primarily separable attributes, such as age, job title, salary, and gender in a personnel database, a significant amount of integral attributes may be hidden in the representational formalisms that GISs employ to model the complex topological relations of spatial objects (Egenhofer and Franzosa 1995; Clementini and di Felice 1998). The set of possible integral attributes grows with metric refinements of topological relations (Egenhofer and Shariff 1998; Nedas et al. 2007). Psychological research has converged to a consensus that aggregation functions should differ depending on whether the atomic judgments are made on separable or integral attributes (Attneave 1950; Nosofsky 1986; Shepard 1987; Nosofsky 1992; Takane and Shibayama 1992; Hahn and Chater 1997; Gärdenfors 2000). Since the recognition of the integral attributes and the form of the aggregation function affect the rankings at the object level, spatial information systems should employ a psychologically compliant model (i.e., a model that accounts for integral attributes) for similarity assessments using psychologically correct aggregation functions to determine the similarity of a result to a query. Most of the current studies and prototypes, however, do not account for integral attributes as they use psychologically deviant methods, making no distinctions between separable and integral attributes.
Would the incorporation of psychologically compliant provisions into a formalized spatial similarity assessment yield different similarity results? To answer this question, this paper sets up a similarity simulation that generates a broad spectrum of experimental results for spatial similarity queries. This simulation provides a rationale
for deciding whether spatial similarity assessments should employ psychologically compliant measures that recognize integral vs. separable attributes, or whether this distinction has no impact on computational similarity comparisons. The remainder of this paper reviews similarity measures (Section 2), describes the experimental and analytical setup for the simulation (Section 3), and presents an in-depth interpretation of the results (Section 4). Conclusions and future work are discussed in Section 5.
2 Similarity Measures Similarity-based information retrieval goes beyond the determination of an exact match between queries and stored data. It provides the users with a range of possible answers, which are the most similar to the initial requests and, therefore, the most likely to satisfy their queries. The results of such spatial queries are ranked (Hjaltason and Samet 1995) according to similarity scores, enabling exploratory access to data by browsing, since users usually know only approximately what they are looking for. Such similarity-based retrieval also relieves users from the burden of reformulating a query repeatedly until they find useful information. 2.1 Similarity at the Attribute Level The core of a similarity mechanism’s inferential abilities is at the attribute level. By exploiting the differences among attribute values of objects and relations, a similarity algorithm can reason about the degree of difference or resemblance of a result to a query. When the query consists of a constraint on an atomic value of a single attribute, the process of similarity assessment takes place at the attribute level, while for a query that consists of multiple such constraints, a similarity assessment takes place at the object level. In both cases, the results are objects; the difference, however, is that in the latter case the individual similarity scores that were produced separately for each attribute must somehow be combined to a meaningful composite. Dey et al. (2002) developed simple similarity measures for attribute values to identify duplicates for the same entity in databases. Rodríguez and Egenhofer (2003) combined distinguishing features of entities with their semantic relations in a hierarchical network and created a model that evaluates similarity among spatial concepts (i.e., entity classes). Based on theories for reasoning with topological, metric, and directional relations several computational models for spatial relation similarity have been developed (Egenhofer 1997; Egenhofer and Shariff 1998; Goyal and Egenhofer 2001), including their integrations into qualitative (Bruns and Egenhofer 1996; Li and Fonseca 2006) and quantitative (Gudivada and Raghavan 1995; Nabil et al. 1996) similarity measures. Most established similarity measures are derived in an ad-hoc manner, guided by experience and observation. In this sense, they are concerned with similarity from a pragmatic, not a cognitive, point of view. Different roles of attributes have a profound impact on similarity assessments. Some attributes are perceived as separable, while others are perceptually indistinguishable. Such perceptual indistinguishability must not be confused with correlation. For instance, people’s heights and weights may have a positive correlation as attributes, but they are probably separable quantities in perception. Integral attributes are
produced by artificial decompositions of quantities or qualities that make no intuitive sense, but often serve to describe representations in information systems, with color (e.g., RGB and CMYK) being the classic example. Integral spatial attributes typically occur for high-level abstractions, such as shape or spatial relation. Shape is often captured through a series of metric parameters, for instance for elongation and perforation (Wentz 2000), or combinations of deformations, such as stretching and bending (Basri et al. 1998). When judging shape, these detailed spatial properties are typically perceived in combination, rendering shape an integral spatial attribute. In the same vein, spatial relations (Bruns and Egenhofer 1996), or more specifically topological relations (Egenhofer and Franzosa 1995), are spatial representations that contain several integral dimensions. While a user perceives one topological relation (Figure 1), there are a dozen spatial attributes—from coarse topological properties (Egenhofer and Herring 1991) over detailed topological properties (Egenhofer and Franzosa 1995) to detailed metric refinements (Egenhofer and Shariff 1998, Nedas et al. 2007) in order to differentiate spatial relations.
Fig. 1. Representing a topological relation at progressively finer levels of detail
2.2 Similarity at the Object Level
Findings from psychology about the way that people perceive the nature of similarity, its properties, and its relationship to peripheral notions, such as difference and dissimilarity, are largely ignored in computational similarity assessments. The focus on the computational feasibility and efficiency, while dismissing cognitive elements, renders the plausibility of such approaches to human perception questionable. The similarity of one object to another is an inverse function of the distance between the objects in a conceptual space, that is, the collection of one or more domains (Gärdenfors 2000). Attribute weights that indicate each dimension’s salience within the space offer a refined similarity assessment. The distance in a conceptual space indicates dissimilarity, which should be compatible with people’s judgments of overall dissimilarity; therefore, its correct calculation becomes important. Following widely accepted psychological research (Attneave 1950; Torgerson 1965; Shepard 1987; Ashby and Lee 1991; Nosofsky 1992; Gärdenfors 2000), the perceived
interpoint distances between the objects’ point representations in that space should be computed either by a Euclidean metric or a city-block metric (also known as the Manhattan distance). Which one to employ depends on whether one deals with integral or separable dimensions. Integral dimensions are strongly unanalyzable and typically perceived as a single stimulus. For instance, the proximity of two linear objects may be described with a number of measures that associate the boundaries and interiors of the objects (Nedas et al. 2007), but the closeness relation may be perceived as one stimulus by the users that inspect the lines. Hence, a set of integral dimensions constitutes in essence one multi-dimensional attribute (Torgerson 1965). Separable dimensions, on the other hand, are different and distinct properties (e.g., length and height) that are perceptually independent (Ashby and Lee 1991). It has been suggested and experimentally confirmed (Attneave 1950; Torgerson 1965; Shepard 1987) that, with respect to human judgments for similarity, a Euclidean metric performs better with integral dimensions, whereas a city-block metric matches more closely separable dimensions. Perceptually separable dimensions are expected to have a higher frequency of occurrence in databases; therefore, in the general case the composite dissimilarity indicator between two objects will be calculated by the weighted average of individual dissimilarities along each of the dimensions. For a group of n integral attributes, however, a Euclidean metric should be adopted to derive the dissimilarity of the objects with respect to this integral group. Therefore, the combination of the n concomitant attributes of an integral group should yield one dissimilarity component rather than n individual components in the composite measure (Figure 2).
Fig. 2. Combining the dissimilarity values d4 and d5 of two integral attributes (Attribute 4 and Attribute 5) into a single dissimilarity component, before summing it up with the dissimilarity values d1 … d3 (of the separable attributes 1…3) to determine the overall dissimilarity D between a DB Object and a Query Object
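Read computationally, the aggregation of Figure 2 can be sketched as follows. This is a minimal illustration with uniform attribute weights and made-up attribute names; the paper's simulation additionally applies salience weights and assumes the atomic dissimilarities are already normalized to [0, 1].

```python
import math

def aggregate_dissimilarity(d, integral_groups):
    """Combine per-attribute dissimilarities d (attribute name -> value in [0, 1])
    into one object-level dissimilarity: every integral group is collapsed with a
    Euclidean metric into a single component, and that component is then summed
    city-block style with the separable attributes (uniform weights here)."""
    grouped = {a for group in integral_groups for a in group}
    total = sum(v for a, v in d.items() if a not in grouped)   # separable attributes
    for group in integral_groups:
        total += math.sqrt(sum(d[a] ** 2 for a in group))      # one component per group
    return total

d = {"a1": 0.2, "a2": 0.5, "a3": 0.1, "a4": 0.3, "a5": 0.4}
compliant = aggregate_dissimilarity(d, integral_groups=[["a4", "a5"]])  # d4, d5 form one group
deviant   = aggregate_dissimilarity(d, integral_groups=[])              # plain city-block sum
```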
3 Object Similarity Simulation
The impact of integral vs. separable attributes on object-based similarity comparisons is evaluated for a comprehensive series of query scenarios. Factors that may influence the outcome include the number of objects compared, the number of attributes of each object, and the distribution of separable vs. integral attributes. The intended exhaustive
character of these experiments was a prohibitive factor in locating real-world datasets that accommodate all tested scenarios. Hence, the assessment relies on simulations with synthetic datasets and queries that are randomly generated with the Sensitivity Analyzer for Similarity Assessments SASA (Nedas 2006). This software prototype served as a testbed to examine different processing strategies for an exhaustive set of similarity queries. The experiment’s results comprise a ranked list for a compliant method and another one for a deviant method. We introduce tailored measures for comparing the relevant portions of such ranked lists (Section 3.1). In SASA these synthetic constructs were originally populated with random values that followed different statistical distributions each time (e.g., uniform, normal). The experimental set-up included five parameters (Section 3.2), which offered a wide range of variations for the distribution of separable and integral attributes. The underlying distribution of the actual data had a negligible effect on the final results. The distribution of random values is, therefore, kept constant and assumed to be uniform throughout this study. Likewise, a consideration of different attribute types in the simulated databases is immaterial for the purposes of the experiment, because the algorithm performs atomic value assessments, yielding a dissimilarity measure between 0 and 1 regardless of the attribute type. The focus of the experiment, however, is to examine how such atomic dissimilarities should be combined to create scores of aggregate dissimilarity. The experiment was conducted several thousand times and the results were averaged in order to make the rank-list measures converge to their medium values. The number of repetitions was determined empirically, such that successive experiment executions yielded results with less than 1% deviation. In order to summarize the test results effectively, a 4-dimensional rendering was developed (Section 3.3), which fixes two of the five parameters—the database size and the distribution policy—and visualizes the other three parameters through a 3-D-plus-color diagram.
3.1 Comparisons of Ranked Lists
Most approaches to compute the deviations between two ranking lists (Mosteller and Rourke 1973; Gibbons 1996) rely on statistical tests, which consider the entire range of the lists. An evaluation of ranking lists produced from database queries or web search queries is different, however, as they focus only on the first few ranks, because the relevance of retrieved items decreases rapidly for lower ranks. For the experiments in this study, the relevant portion of the ranking list was defined as the ten best hits. This decision was partially based on the experimental outcomes that people retain no more than five to nine items in short term memory (Miller 1956). The 7+/-2 rule refers to unidimensional stimuli; therefore, people are expected to be able to retain this number of results in short term memory only for very simple queries. This choice was also based on the typical strategy of current web-search engines, which present ten items per page, starting from the most relevant. Therefore, the set of the ten best results is not only easy to browse and inspect, but also convenient in the sense that users can memorize it to a large degree and perform swift comparative judgments about the relevance of each match to their query.
As the database size grows, the ranks of the ten best results are determined based on finer differences of their similarity values. If one also considers that psychologically compliant methods approximate better, but do not necessarily model human perception
exactly, then a measure of incompatibility that relies only on rank differences would be strict. A more practical and objective indicator of the incompatibility between two methods considers instead the overlap of common objects within the relevant portion of the ranking lists. This measure, denoted by O, expresses the percentage of the common items within the ten best results that the compared methods produce. The selection of this measure is also further justified by the fact that each of the items in the relevant portion is equally accessible to the users (i.e., ten results per page). The actual rank differences are examined as a secondary, less crucial index of incompatibility. They are used as an additional criterion when the overlap measure provides borderline evidence for that purpose. The rank differences are assessed using a modified Spearman Rank Correlation (SRC) test. This test is an appropriate statistic for ordinal data, provided that its resulting coefficient is used only to test a hypothesis about order (Stevens 1951). The SRC coefficient R, with x_i and y_i as the rank orders of item i in two compared samples that contain n items each (Equation 1), takes a value between –1 and +1, where +1 indicates perfect agreement between two samples (i.e., the elements are ranked identically), while –1 signals complete disagreement (i.e., the elements are ranked in inverse order). A value of 0 means that there is no association between the two samples, whereas other values than 0, 1, and –1 would indicate intermediate levels of correlation.

R = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}    (1)

The SRC coefficient and similar statistics are designed for evaluations of ranking lists that contain exactly the same elements. Hence, it cannot be readily applied to tests that require a correlation value between a particular subsection of the ranking lists. This observation is essential, because the items in the relevant portion of the lists will only incidentally be the same for two different methods. To enable the comparison of lists with different numbers of entries, a modified SRC coefficient is computed as
Fig. 3. Overlap percentage O and modified Spearman Rank Correlation coefficient R' for the relevant portion of two ranking lists
follows: first, the different elements in the two lists are eliminated and R (Equation 1) is computed for the common elements that remain. Then, the modified coefficient R' is calculated by multiplying R with the overlap percentage O (Figure 3). This second step is necessary in order to avoid misleading results. For example, when among the top ten items only one common element exists, R = 1, but R' = 0.1. Methods that produce very similar results are characterized by positive values of the measures O and R', close to 1, whereas methods that produce very dissimilar results are characterized by an overlap value close to 0 and by a modified SRC coefficient value close to 0 or negative. 3.2 Test Parameters
The dissimilarities of the ranks for an object query with different methods are captured through the incompatibility measures O and R', which are each functions of five variables n, m, p, g, and d.
• Variable n is the number of objects in the database, determining the database size. The experiments were conducted for the set N = {1,000, 5,000, 25,000, 100,000}, so that each database size increases approximately one order of magnitude over its predecessor. A dataset of 1,000 objects was adopted as a characteristic case of a small database, a dataset of 100,000 objects as a characteristic case of a large database, with datasets of 5,000 and 25,000 objects as representatives of medium-small and medium-large databases, respectively.
• Variable m is the number of attributes that participate in the similarity assessment of a database object to a query object. The set examined is M = {2, 5, 10, 20, 30, 40, …, 100} and accounts for the most simple and complex modeled objects. The case of queries on a single attribute is omitted, because it is irrelevant for this investigation. One integral attribute is undefined, because it essentially degenerates to one separable attribute.
• Variable p is the percentage of integral attributes out of the total number of attributes m. The actual number of integral attributes is, therefore, p⋅m. In this manner, p also indirectly determines the number of separable attributes. The percentages taken are p = {0%, 10%, 20%, …, 100%}. The two extreme values represent the cases where all attributes are separable (0%) and integral (100%).
• Variable g is the number of integral groups in which the integral attributes are distributed. Its values are constrained by the specific instantiations of m and p. For example, for objects with ten attributes (m = 10), four of which are integral (p = 40%), there could be one group of four attributes or two groups of two attributes. For the experiment, g has a range from 1 to 50. The smallest value occurs in various settings, starting with m = 2 and p = 100%. The largest value occurs only if m = 100 and p = 100%.
• Variable d is the group distribution policy, specifying how a number of integral attributes p⋅m is distributed into g integral groups. For some configurations there could be numerous such possibilities. For instance, eight integral attributes that are distributed into two groups can yield several different allocations, such as 6-2, 5-3, and 4-4. Preliminary experimentation indicated that the similarity results could be affected by the distribution policy, especially for larger percentages of integral
attributes. This parameter is treated as a binary variable taking the values “optimal” and “worst.” An optimal distribution policy tries to distribute the integral attributes evenly, such that each integral group contains approximately the same number of attributes (Figure 4a), whereas a worst-case distribution policy creates disproportionately-sized groups by assigning as many attributes as possible to one large integral group, while populating the remaining groups with the smallest number of attributes (Figure 4b). The binary treatment of the group distribution policy allows inferences about the behavior of this variable between its two extreme settings, while keeping the number of produced diagrams within realistic limits.
Fig. 4. Splitting integral attributes into groups using (a) an optimal and (b) a worst distribution policy
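The two policies can be made concrete with a short sketch. This is our own reading of the description above; the minimum group size of two attributes is an assumption derived from the remark in Section 3.2 that a one-attribute integral group is undefined.

```python
def split_integral_attributes(k, g, policy="optimal"):
    """Distribute k integral attributes over g groups (k >= 2 * g assumed).
    "optimal" spreads the attributes as evenly as possible; "worst" makes one
    group as large as possible and keeps the remaining groups at the minimum
    size of two (Figure 4)."""
    if policy == "optimal":
        base, extra = divmod(k, g)
        return [base + 1] * extra + [base] * (g - extra)
    return [k - 2 * (g - 1)] + [2] * (g - 1)

# 8 integral attributes in 2 groups: optimal -> [4, 4], worst -> [6, 2]
```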
3.3 Visualization of Similarity Scores
A specific instantiation of the variables n, m, p, g, and d represents a possible database configuration and is referred to as a db scenario. The simultaneous interaction of all variables involved for such db scenarios and their effect on the ranks cannot be accommodated by the representational capabilities of typical 2-dimensional or 3-dimensional visualization techniques due to the large number of diagrams that would have to be produced. In order to visualize the results effectively, while keeping the number of produced diagrams within acceptable bounds, a 4-dimensional visualization technique was employed. For each 4-dimensional diagram, the database size n and the distribution policy d are kept fixed, while the remaining variables are allowed to vary within a 3-dimensional cubic space. The axes X, Y, and Z of this space correspond to the number of integral groups g, the number of attributes m, and the percentage of integral attributes p, respectively. Each point in the cubic space signifies, therefore, a db scenario determined by the instantiation of the triple (m, p, g) that defines the point, and the fixed values of n and d. The color assigned to a db scenario (i.e., point) embeds a fourth dimension in the visualization, which represents the measurement of O (i.e., the overlap) or R' (i.e., the modified SRC coefficient) between the two compared methods for that db scenario. Warm colors (in the range of red) display shades of high similarity, while cool colors (in the range of blue) indicate shades of high dissimilarity. Since there are two incompatibility measures, four database sizes, and two distribution policies, a total of sixteen diagrams was produced for each experiment.
The 4-dimensional diagrams (Figure 5) correspond to the scenario of a database of 1,000 objects, each with 40 attributes—20 separable and 20 integral. The latter are distributed in 10 groups through an optimal distribution policy (i.e., each group contains 2 attributes). For the db scenario of point A in Figure 5 the overlap measure is approximately 40%, whereas A’s value of R' is approximately 0.2. A triangular half of the volumes of the produced cubes is not populated with measurements, because it corresponds to non-applicable db scenarios. For example, point B in Figure 5 is such a db scenario, because it is impossible to allocate 60 integral attributes within 40 groups. Realizable db scenarios are located within the remaining half of the cube. Since the values of the variables m, p, and g are discrete, the realizable db scenarios form a dense grid, rather than a continuous surface. The diagrams, however, use continuous color-rendered surfaces instead—produced by interpolating the grid values—in order to facilitate the interpretation of the results. Furthermore, the cube is sliced at regular intervals along the Z-axis to reveal the patterns in its interior.
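As a small concreteness check, the feasibility constraint that empties the triangular half of the cube can be written down directly. This is our own sketch; it only encodes the rule that every integral group needs at least two attributes.

```python
def realizable(m, p, g):
    """A db scenario (m, p, g) is realizable only if its p*m integral attributes
    can be split into g groups of at least two attributes each; with no integral
    attributes there is nothing to group.  E.g. point B in Figure 5 (m = 100,
    p = 60%, g = 40) fails because 60 < 2 * 40."""
    k = round(p * m)                     # number of integral attributes
    return k == 0 if g == 0 else k >= 2 * g
```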
Fig. 5. A 4-dimensional diagram depicting the measures (a) O and (b) R' (color figure available at http://www.spatial.maine.edu/~max/similarity/4D-0.pdf)
4 Experiment Results and Interpretation The test results (Figure 6) indicate a definitive pattern of gradual variation. The deviant method in this experiment is a manifestation of the Manhattan distance function with no integral groups recognized. Hence, the number of aggregated terms is always equal to the total number of attributes m. Furthermore, each term contributes equally to the similarity score assigned to each object of the database. As the variables change, the form of the compliant method becomes more or less similar to the pattern
of the deviant method. The interactions behind these deviations explain the outcome illustrated in the diagrams. The main conclusion is that the measures O and R' become progressively worse as the percentage of integral attributes increases and the number of groups in which these integral attributes are distributed decreases. When either or both trends occur, the aggregated terms with the compliant method reduce to a number much less than m. For example, for one separable attribute, nine integral attributes, and three groups, the deviant method aggregates ten terms and the compliant four terms. Moreover, the effect of the one remaining separable attribute with the compliant method is disproportionate on the final score compared to that of the other attributes. As the number of groups increases, the measures have a greater concordance, because the impact of such isolated attributes on the final score diminishes. This observation also explains the dissonance to the deterioration pattern observed at the highest layer of the optimal distribution policy diagrams, where such separable attributes disappear. The even distribution of integral attributes into groups makes the compliant method behave similarly to the deviant at this layer. For example, consider a query with ten attributes, all of which are integral and must be distributed in five groups. The deviant approach will aggregate all ten attributes as separable. The compliant will first separate the ten attributes in groups of two, aggregate each group, and combine the resulting five terms to derive the object's similarity. For a single group, the compliant method becomes identical to the Euclidean distance function. The trend of deterioration, however, is not interrupted at the highest layer of the diagrams for the worst distribution policy because the group sizes with this policy differ drastically. In this case, the smaller integral groups continue to have a disproportionate influence on the final similarity score. The more uniform the distribution into groups is, the less significant the effects on the measures O and R' become. The wavy patterns at the higher layers of the optimal-distribution diagrams also support this conclusion. Such effects are due to the alternating exact and approximate division of integral attributes into groups. For example, for nine integral attributes and three groups the division is exact with three attributes in each group, while for ten or eleven integral attributes, the groups differ in size. In the diagrams of the worst distribution policy where group sizes remain consistently imbalanced, the small stripes of temporary improvements disappear. Excluding the wavy patterns and the case of all attributes being integral, the measures appear to be invariant to the group distribution policy elsewhere. The results worsen slightly with an increase in the number of attributes; however, the influence of this variable is much more subtle compared to the others. When the attribute number is very small, the methods are often identical, because the attributes are insufficient to form integral groups (e.g., for two attributes and up to a 50% percentage of integral attributes). This observation explains the cause for the very high values of O and R' detected at the rightmost edge of the diagrams. The compared methods also yield progressively different outcomes as the database size increases.
This result was anticipated, because two functions are expected to demonstrate approximately the same degree of correlation regardless of the sample size with which they are tested. Hence, if the entire ranking lists were considered (i.e., if the lists contained all database objects), and assuming all other variables equal, the two compared methods would exhibit on average the same correlation, regardless of
Fig. 6. Overview of the results acquired from the experiment (color figure available at http://www.spatial.maine.edu/~max/similarity/4D-1.pdf)
the database size. Increasing the number of objects in the database, while keeping the size of the relevant portion constant, leaves more potential for variations within the ten best results and explains why the overlaps and correlations decline for larger databases. Both O and R' take a value of 1 at the lowest layer where all attributes are separable and the compared methods coincide. For all other db scenarios, the modified Spearman Rank Correlation coefficient R' has a lower value than the overlap O. This result is not surprising considering that R' is a stricter measure than O. The diagrams suggest that the correct recognition of integral attributes and groups is immaterial for smaller datasets as long as the percentage of integral attributes remains below 40%. For the largest database considered this limit drops to around 20%. At these percentages, O and R' have values of 0.5 and 0.2, respectively. Such values constitute borderline measurements, because they imply that only half of the retrieved objects in the relevant portion are the same and that these common objects are ranked very differently. The need for different treatments of separable vs. integral attributes is also corroborated by the actual sizes of real-world geographic databases, which are often much larger than the largest dataset in this experiment. Only for objects with very small numbers of attributes—no more than two or three attributes for the objects—is the recognition of integral attributes negligible.
5 Conclusions Computational similarity assessments among spatial objects typically compare the values of corresponding attributes and relations employing distance functions to capture dissimilarities. Psychological findings have suggested that different types of aggregation functions—for the conversions from the attributes’ similarity values to the objects’ similarity values—should be used depending on whether the attributes are separable (which reflects perceptual independence) or whether they are integral (which reflects a dependency among the attributes). Current computational similarity methods have ignored the potential impact of such differences, however, treating all attributes and their values homogeneously. An experimental comparison between a psychologically compliant approach (which recognizes groups of integral attributes) and a psychologically deviant approach (which fails to detect such groups) showed that the rankings produced with each method are incompatible. The results do not depend per se on the correlation of the attribute dimensions. It is the choice of the aggregation function that yields the object similarities depending on whether the attributes are perceptually distinguishable or not and, therefore, the perceptual plausibility of the obtained results will be affected if one ignores the perceptual "correlation" of the attribute dimensions. The simulations showed that even for a modest amount of integral attributes within the total set of attributes considered, the dissimilarities are pronounced, particularly in the presence of a single integral group or a small number of them. This trend worsens for large-scale databases. Both scenarios correspond closely to spatial representations and geographic databases. The structure of the current formalisms used to represent detailed topological, directional, and metric relations is often based on criteria other than a one-to-one correspondence between the representational primitives employed
and human perception. Such formalisms are likely to contain one or few integral groups within their representation. Furthermore, geographic databases are typically large, in the order of 10^5 or 10^6 objects. This result is, therefore, significant, because it suggests that existing similarity models may need to be revised such that new similarity algorithms must consider the possible presence of perceptually correlated attributes. Future work should consider the impact of these findings beyond highly-structured spatial databases to embrace the less rigid geospatial semantic web (Egenhofer 2002), which is driven by ontologies. Similarity relations fit well into an ontological framework, because it is expected that people who commit to the same ontology perceive identically not only the concepts that are important in their domain of interest, but also the similarity relations that hold among these concepts. This alignment of individual similarity views towards a common similarity view is emphasized by the fact that ontologies already have an inherent notion of qualitative similarity relations among the concepts that they model. This notion is reflected in their structure (i.e., in the way they specify classes and subclasses) and in the properties and roles that are attributed to each concept. Formalizing similarity within ontologies would be a step forward in the employment of ontologies not only as means for semantic integration, but also as tools for semantic management, and would help their transition from symbolic to conceptual constructs.
Acknowledgments This work was partially supported by the National Geospatial-Intelligence Agency under grant numbers NMA401-02-1-2009 and NMA201-01-1-2003.
References Ashby, F., Lee, W.: Predicting Similarity and Categorization from Identification. Journal of Experimental Psychology: General 120(2), 150–172 (1991) Ashby, F., Townsend, J.: Varieties of Perceptual Independence. Psychological Review 93(2), 154–179 (1986) Attneave, F.: Dimensions of Similarity. American Journal of Psychology 63(4), 516–556 (1950) Basri, R., Costa, L., Geiger, D., Jacobs, D.: Determining the Similarity of Deformable Shapes. Vision Research 38, 2365–2385 (1998) Bruns, T., Egenhofer, M.: Similarity of Spatial Scenes. In: Kraak, M.-J., Molenaar, M. (eds.) Seventh International Symposium on Spatial Data Handling (SDH 1996), Delft, The Netherlands, pp. 173–184. Taylor & Francis, London (1996) Clementini, E., di Felice, P.: Topological Invariants for Lines. IEEE Transactions on Knowledge and Data Engineering 10(1), 38–54 (1998) Dey, D., Sarkar, S., De, P.: A Distance-Based Approach to Entity Reconciliation in Heterogeneous Databases. IEEE Transactions on Knowledge and Data Engineering 14(3), 567–582 (2002) Egenhofer, M.: Query Processing in Spatial-Query-by-Sketch. Journal of Visual Languages and Computing 8(4), 403–424 (1997)
Egenhofer, M.: Towards the Semantic Geospatial Web. In: Voisard, A., Chen, S.-C. (eds.) 10th ACM International Symposium on Advances in Geographic Information Systems, McLean, VA, pp. 1–4 (2002) Egenhofer, M., Franzosa, R.: On the Equivalence of Topological Relations. International Journal of Geographical Information Systems 9(2), 133–152 (1995) Egenhofer, M., Mark, D.: Naive Geography. In: Kuhn, W., Frank, A.U. (eds.) COSIT 1995. LNCS, vol. 988, pp. 1–15. Springer, Heidelberg (1995) Egenhofer, M., Shariff, R.: Metric Details for Natural-Language Spatial Relations. ACM Transactions on Information Systems 16(4), 295–321 (1998) Gärdenfors, P.: Conceptual Spaces: The Geometry of Thought. MIT Press, Cambridge (2000) Gibbons, J.: Nonparametric Methods for Quantitative Analysis. American Sciences Press, Syracuse (1996) Goyal, R., Egenhofer, M.: Similarity of Cardinal Directions. In: Jensen, C., Schneider, M., Seeger, B., Tsotras, V. (eds.) Proceedings of the Seventh International Symposium on Spatial and Temporal Databases, Los Angeles, CA. LNCS, vol. 2121, pp. 36–55. Springer, Heidelberg (2001) Gudivada, V., Raghavan, V.: Design and Evaluation of Algorithms for Image Retrieval by Spatial Similarity. ACM Transactions on Information Systems 13(1), 115–144 (1995) Hahn, U., Chater, N.: Concepts and Similarity. In: Lamberts, K., Shanks, D. (eds.) Knowledge, Concepts, and Categories, pp. 43–92. MIT Press, Cambridge (1997) Hjaltason, G., Samet, H.: Ranking in Spatial Databases. In: Egenhofer, M.J., Herring, J.R. (eds.) SSD 1995. LNCS, vol. 951, pp. 83–95. Springer, Heidelberg (1995) James, W.: The Principles of Psychology. Holt, New York (1890) Li, B., Fonseca, F.: TDD: A Comprehensive Model for Qualitative Similarity Assessment. Spatial Cognition and Computation 6(1), 31–62 (2006) Miller, G.: The Magical Number Seven, Plus or Minus Two: Some Limits on Our Capacity for Processing Information. The Psychological Review 63(1), 81–97 (1956) Mosteller, F., Rourke, R.: Sturdy Statistics: Nonparametric & Order Statistics. Addison-Wesley, Menlo Park (1973) Nabil, M., Ngu, A., Shepherd, J.: Picture Similarity Retrieval using the 2D Projection Interval Representation. IEEE Transactions on Knowledge and Data Engineering 8(4), 533–539 (1996) Nedas, K.: Semantic Similarity of Spatial Scenes. Ph.D. Dissertation, Department of Spatial Information Science and Engineering, University of Maine (2006) Nedas, K., Egenhofer, M., Wilmsen, D.: Metric Details for Topological Line-Line Relations. International Journal of Geographical Information Science 21(1), 21–24 (2007) Nosofsky, R.: Attention, Similarity, and the Identification-Categorization Relationship. Journal of Experimental Psychology: General 115(1), 39–57 (1986) Nosofsky, R.: Similarity Scaling and Cognitive Process Models. Annual Review of Psychology 43(1), 25–53 (1992) Rodríguez, A., Egenhofer, M.: Determining Semantic Similarity among Entity Classes from Different Ontologies. IEEE Transactions on Knowledge and Data Engineering 15(2), 442–456 (2003) Rodríguez, A., Egenhofer, M.: Comparing Geospatial Entity Classes: An Asymmetric and Context-Dependent Similarity Measure. International Journal of Geographical Information Science 18(3), 229–256 (2004) Shepard, R.: Toward a Universal Law of Generalization for Psychological Science. Science 237(4820), 1317–1323 (1987)
310
K.A. Nedas and M.J. Egenhofer
Stevens, S.: Mathematics, Measurement, and Psychophysics. In: Stevens, S. (ed.) Handbook of Experimental Psychology, pp. 1–49. John Wiley & Sons, Inc., New York (1951) Takane, Y., Shibayama, T.: Structures in Stimulus Identification Data. In: Ashby, F. (ed.) Probabilistic Multidimensional Models of Perception and Cognition, pp. 335–362. Earlbaum, Hillsdale (1992) Torgerson, W.: Multidimensional Scaling of Similarity. Psychometrika 30(4), 379–393 (1965) Tversky, A.: Features of Similarity. Psychological Review 84(4), 327–352 (1977) Wentz, E.: Developing and Testing of a Trivariate Shape Measure for Geographic Analysis. Geographical Analysis 32(2), 95–112 (2000)
Spatial Abstraction: Aspectualization, Coarsening, and Conceptual Classification Lutz Frommberger and Diedrich Wolter SFB/TR 8 Spatial Cognition Universität Bremen Enrique-Schmidt-Str. 5, 28359 Bremen, Germany {lutz,dwolter}@sfbtr8.uni-bremen.de
Abstract. Spatial abstraction empowers complex agent control processes. We propose a formal definition of spatial abstraction and classify it by its three facets, namely aspectualization, coarsening, and conceptual classification. Their characteristics are essentially shaped by the representation on which abstraction is performed. We argue for the use of so-called aspectualizable representations which enable knowledge transfer in agent control tasks. In a case study we demonstrate that aspectualizable spatial knowledge learned in a simplified simulation empowers strategy transfer to a real robotics platform. Keywords: abstraction, knowledge representation, knowledge transfer.
1 Introduction
Abstraction is one of the key capabilities of human cognition. It enables us to conceptualize the surrounding world, build categories, and derive reactions from them to cope with a certain situation. Complex and overly detailed circumstances can be reduced to much simpler concepts, and only then does it become feasible to deliberate about which conclusions to draw and which actions to take. Certainly, we want to see such abstraction capabilities in intelligent artificial agents too. This requires us to implement abstraction principles in the knowledge representation used by the artificial agent. First of all, abstraction is a process transforming a knowledge representation. But how can this process be characterized? We can distinguish three different facets of abstraction. For example, it is possible to regard only a subset of the available information, or the level of detail of every bit of information can be reduced, or the available information can be used to construct new, more abstract entities. Intuitively, these types of abstraction are different and lead to different results as well. Various terms have been coined for abstraction principles, distributed over several scientific fields like cognitive science, artificial intelligence, architecture, linguistics, geography, and many more. Among others we find the terms granularity [1,2], generalization [3], schematization [4,5], idealization [5], selection [5,6], amalgamation [6], or aspectualization [7]. Unfortunately, some of these terms define overlapping concepts, different ones sometimes have the same meaning, or a single term is
used for different concepts. Also, these terms are often not distinguished in an exact manner or are only defined by giving examples. In this work we take a formal view from a computer scientist's perspective. We study abstraction as part of knowledge representation. Our primary concern is the representation of spatial knowledge, yet we aim at maintaining a perspective as general as possible, allowing adaptation to other domains. Spatial information is rich and can be conceptualized in a multitude of ways, making its analysis challenging as well as relevant to applications. Handling of spatial knowledge is essential to all agents acting in the real world. One contribution of this article is a formal definition of abstraction processes: aspectualization, coarsening, and conceptual classification. We characterize their properties and investigate the consequences that arise when using abstraction in agent control processes. Applying the formal framework to a real application in robot navigation exemplifies its utility. Appropriate use of abstraction allows knowledge learned in a simplified computer simulation to be transferred to a control task with a real autonomous robot. Aspectualizable knowledge representations, which we introduce and promote in this paper, play a key role. The exemplified robot application shows how abstraction principles empower intelligent agents to transfer decision processes and thereby cope with unfamiliar situations. Put differently, aspectualizable knowledge representations enable knowledge transfer. This paper is organized as follows: In Section 2 we give our definition of the spatial abstraction paradigms and discuss the role of abstraction in knowledge representation and its utility in agent control tasks. Section 3 covers the case study of learning navigational behavior in simulation and transferring it to a real robot. The paper ends with a discussion of formal approaches to spatial abstraction and their utility (Section 4) and a conclusion.
2 A Formal View on Facets of Abstraction
The term abstraction is etymologically derived from the Latin words “abs” and “trahere”, so the literal meaning is “drawing away”. However, if we talk about abstraction in the context of information processing and cognitive science, abstraction covers more than just taking away something, because it is not intended merely to reduce the amount of data. Rather, abstraction is employed to put the focus on the relevant information. Additionally, the result is supposed to generalize and to be useful for a specific task at hand. We define abstraction as follows: Definition 1. Abstraction is the process or the result of reducing the information of a given observation in order to achieve a classification that omits all information that is irrelevant for a particular purpose. We first concentrate on information reduction. Let us say that all potential values of a knowledge representation are elements of a set S which can be regarded as
a Cartesian product of features from different domains: S = D1 × D2 × … × Dn. We call s = (s1, …, sn) ∈ S a feature vector, and every si is a feature. S is also called state space and its elements are states. Abstraction is a non-injective function κ : S → T mapping the source space S to a target set T. Non-injectiveness is important, as otherwise no reduction (n:1 mapping) is possible. In the case of S being finite it holds that |S| > |Image(κ)|. Without loss of generality we will assume in the following, simply to ease readability, that all domains D are of the same kind: S = Dn. In the following we will formally classify abstraction into three different categories: aspectualization, coarsening, and conceptual classification.
2.1 Aspectualization
Aspects are semantic concepts. They are pieces of information that represent certain properties. For example, if we record the trajectory of a moving robot, we have a spatio-temporal data set denoting at what time the robot visited which place. Time and place are two different aspects of this data set. Aspectualization singles out such aspects. Definition 2. Aspectualization is the process or result of explicating certain aspects of an observation purely by eliminating the others. Formally, it is defined as a function κ : Dn → Dm (n, m ∈ N, n > m): κ(s1 , s2 , . . . , sn ) = (si1 , si2 , . . . , sim ), ik ∈ [1, n], ik < ik+1 ∀k . Thus, aspectualization projects Dn to Dm . Example 1. An oriented line segment s in the plane is represented as a point (x, y) ∈ R2 , a direction θ ∈ [0, 2π], and a length l ∈ R: s = (x, y, θ, l). The reduction of this line segment to an oriented point is an aspectualization with κa (x, y, θ, l) = (x, y, θ). Aspects may span over several features si . However, to be able to single out an aspect from a feature vector by aspectualization, it must be guaranteed that no feature refers to more than one aspect. We call this property aspectualizability: Definition 3. If an aspect is exclusively represented by one or more components of a feature vector s ∈ S (that is: no si refers to more than one aspect), then we call S aspectualizable regarding this aspect. Example 2. The oriented line segment representation in Example 1 can be bijectively mapped from point, angle, and length to two points (x1 , y1 , x2 , y2 ). Then aspectualization as defined in Def. 2 cannot single out the length of the line segment, because length is not represented explicitly. S is not aspectualizable regarding length.
[Fig. 1 panel labels: aspectualization (e.g., by focusing on shape, disregarding object color); conceptual classification (e.g., by grouping objects to form new objects); coarsening (e.g., by reducing spatial resolution)]
Fig. 1. Iconographic illustration of the three abstraction principles aspectualization, coarsening, and conceptual classification applied to the same representation
2.2 Coarsening
When the set of values a feature can take is reduced, we speak of a coarsening: Definition 4. Coarsening is the process or result of reducing the details of information of an observation by lowering the granularity of the input space. Formally, it is defined as a function κ : Dn → Dn (n ∈ N), κ(s) = (κ1(s1), κ2(s2), …, κn(sn)) with κi : D → D and at least one κi being not injective. The existence of a non-injective κi ensures that we have an abstraction. Example 3. An important representation in the area of robot navigation is the occupancy grid [8], a partition of 2-D or 3-D space into a set of discrete grid cells. A function κ : R2 → R2, κ(x, y) = (⌊x⌋, ⌊y⌋) is a coarsening that maps any coordinate to a grid cell of an occupancy grid.
2.3 Conceptual Classification
Conceptual classification is the most general of the three proposed abstraction facets. It can utilize all components of the input to build new entities:
Definition 5. Conceptual classification abstracts information by grouping semantically related features to form new abstract entities. Formally, it is defined as a non-injective function κ : Dn → Dm (m, n ∈ N), κ(s1, s2, …, sn) = (κ1(s1,1, s1,2, …, s1,h1), κ2(s2,1, s2,2, …, s2,h2), …, κm(sm,1, sm,2, …, sm,hm)) with κi : Dhi → D and hi ∈ {1, …, n}, whereby i ∈ {1, …, m}. Conceptual classification subsumes the other two abstraction concepts: If all κi have the form κi : D → D and m = n, a conceptual classification is a coarsening; and if all κi have the form κi(sj) = sj, i ≤ j, m < n, and κi = κj ⇒ i = j, then a conceptual classification becomes an aspectualization. Example 4. Data gathered from a laser range finder comes as a vector of distance values and angles to obstacles in the local surrounding, which can be represented as 2-D points in a relative coordinate frame around the sensor. Abstraction of these points to line segments by the means of a line detection algorithm (as, for example, described in [9]) is a conceptual classification. To sum up, Fig. 1 illustrates aspectualization, coarsening, and conceptual classification in an iconographic way.
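To make the three definitions more concrete, the following minimal Python sketch (not part of the original text; the grid width and the qualitative thresholds are illustrative assumptions) applies each facet to the oriented line segment s = (x, y, θ, l) from Example 1.

```python
import math

# Feature vector from Example 1: an oriented line segment s = (x, y, theta, l)
s = (3.7, 1.2, math.pi / 4, 2.5)

def aspectualize(s):
    """Aspectualization (Def. 2): keep only the oriented point (x, y, theta)."""
    x, y, theta, l = s
    return (x, y, theta)

def coarsen(s, cell=1.0):
    """Coarsening (Def. 4): keep all components but lower their granularity;
    here each component is snapped to a grid of width `cell` (non-injective)."""
    return tuple(math.floor(v / cell) * cell for v in s)

def conceptually_classify(s):
    """Conceptual classification (Def. 5): build new abstract entities from
    several components, here a qualitative description of the segment."""
    x, y, theta, l = s
    orientation = "horizontal" if abs(math.sin(theta)) < 0.5 else "vertical"
    size = "short" if l < 1.0 else "long"
    return (orientation, size)

print(aspectualize(s))           # (3.7, 1.2, 0.7853981633974483)
print(coarsen(s))                # (3.0, 1.0, 0.0, 2.0)
print(conceptually_classify(s))  # ('vertical', 'long')
```

Aspectualization only drops components, coarsening keeps the dimensionality but reduces the number of possible values per component, and conceptual classification combines several components into new entities at once.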
2.4 Abstraction and Representation
Intuitively, aspectualization and coarsening describe two very different processes: the first one reduces the number of features of the input, the latter one the variety of instances for every single feature. While aspectualization necessarily reduces the dimensionality of a representation, coarsening preserves dimensionality. Depending on the representation of the feature vector, coarsening can produce a result that is equivalent to an aspectualization, though: let one or more mappings κi in a coarsening be defined as mappings to a single constant value ci ∈ D. Assume all other mappings to be the identity function. Now, consider an aspectualization that retains exactly the components not mapped to single constant values ci by the coarsening. Obviously, this aspectualization has a canonical embedding in the result of the coarsening. We illustrate this by an example: Example 5. As in Example 1, an oriented line segment in the plane is represented as a point (x, y) ∈ R2, a direction θ ∈ [0, 2π], and a length l ∈ R: s = (x, y, θ, l). κa is defined as in Example 1. The function κc(x, y, θ, l) = (x, y, θ, 1) is a coarsening, and it trivially holds: {κc(x, y, θ, l)} = {(x, y, θ, 1)} ≅ {(x, y, θ)} = {κa(x, y, θ, l)}. It is also possible to transform a knowledge representation such that a coarsening can be expressed by an aspectualization. For example, this is the case when abstraction operates on a group:
Theorem 1. If κc is a coarsening on a group, for example (S, +), then there exists an isomorphism ϕ : S → S′ and an aspectualization κa : S′ → T such that the diagram with ϕ : S → S′, κc : S → T, and κa : S′ → T commutes, i.e., κa ∘ ϕ = κc. Proof. Choose ϕ(s) = (s + κc(s), κc(s)), ϕ⁻¹(t1, t2) = t1 + (−t2), and κa(t1, t2) = t2, and define (S′, ⊕) with S′ = Image(ϕ) and t ⊕ u = ϕ(ϕ⁻¹(t) + ϕ⁻¹(u)) for each t, u ∈ S′. Checking that (S′, ⊕) is a group and ϕ a homomorphism is straightforward. We illustrate this theorem by the following example: Example 6. Coordinates (x, y) ∈ R2 can be bijectively mapped to a representation (⌊x⌋, x − ⌊x⌋, ⌊y⌋, y − ⌊y⌋) which features the integer and decimal places separately. The function κ(⌊x⌋, x − ⌊x⌋, ⌊y⌋, y − ⌊y⌋) = (⌊x⌋, ⌊y⌋) is an aspectualization with the same result as the coarsening in Example 3. Note that Theorem 1 does not introduce additional redundancy into the representation. If we allowed for introducing redundancy, we could bijectively create new representations by concatenating s and an arbitrary abstraction κ(s), with the effect that any abstraction, including conceptual classification, could always be achieved by an aspectualization from this representation. Therefore, we do not regard this kind of redundancy here. Not every representation allows for coarsening, as the following example shows: Example 7. Commercial rounding is defined by a function f : R₀⁺ → R₀⁺, f(x) = ⌊x + 0.5⌋. f is a coarsening. If, similar to Example 6, x ∈ R is represented as (⌊x⌋, x − ⌊x⌋), then commercial rounding can neither be expressed by aspectualization (because the representation is not aspectualizable regarding this rounding) nor by coarsening (because the abstraction function operates on both components ⌊x⌋ and x − ⌊x⌋ of the feature vector, which contradicts Def. 4). So even if commercial rounding reduces the number of instances in half of the components, this abstraction cannot be expressed as a coarsening under this representation following Def. 4. It then must be seen as a conceptual classification, which is the most general of the three facets of abstraction presented here.
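As a numeric illustration of this correspondence (an added sketch, not part of the original text), the following Python snippet checks that the coarsening of Example 3 and the aspectualization on the re-represented coordinates of Example 6 produce the same result.

```python
import math

def coarsen(x, y):
    """Coarsening from Example 3: map a coordinate to its grid cell."""
    return (math.floor(x), math.floor(y))

def phi(x, y):
    """Re-representation from Example 6: integer and decimal places stored
    as separate components."""
    return (math.floor(x), x - math.floor(x), math.floor(y), y - math.floor(y))

def aspectualize(r):
    """Aspectualization on the re-represented vector: keep the integer parts."""
    xi, _, yi, _ = r
    return (xi, yi)

p = (3.7, 1.2)
assert coarsen(*p) == aspectualize(phi(*p))  # identical abstraction results
print(coarsen(*p), aspectualize(phi(*p)))    # (3, 1) (3, 1)
```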
Different abstraction paradigms, even if they describe distinct processes, can thus lead to the same result: the applicability of a specific abstraction principle relies heavily on the given representation, and usually different types of abstraction can be utilized to achieve the same effect. Thus, the choice of an abstraction paradigm is tightly coupled with the choice of the state space representation. In the following we argue for an action-centered view when choosing appropriate representations.
2.5 Abstraction in Agent Control Processes
Abstraction, as we define it, is not a blind reduction of information, but comes with a particular purpose. It is applied to ease solving a specific problem, and the concrete choice of abstraction is implied by the approach to master the task. If we want to utilize spatial abstraction in the context of agent control tasks, we try to reach three goals:
1. Significantly reducing the size of the state space the agent is operating in
2. Eliminating unnecessary details in the state representation
3. Subsuming similar states to unique concepts
The first goal, reducing state space size, is a mandatory consequence of the latter two, which must be seen in the context of action selection: the question whether a detail is "unnecessary" or whether two states are "similar" depends on the task of the agent:
– A detail is considered unnecessary if its existence does not affect the action selection of the agent.
– Two states are considered to be similar if the agent should select the same action in any of the states.
This action-centered view expands classical definitions of similarity, as given, for example, by Fred Roberts [10]: two states s and s′ are indistinguishable (written s ∼ s′) if there is a mapping f : S → R and an ε ∈ R⁺ with s ∼ s′ ⇔ |f(s) − f(s′)| < ε. Roberts' concept is data driven whereas ours is action driven in order to account for the task at hand. States may be very near with respect to a certain measure, but nevertheless require different actions in certain contexts. Grid-based approaches, achieved by coarsening, easily bear the danger of not being able to provide an appropriate state separation due to missing resolution in critical areas of the state space. Furthermore, a "nearness" concept as presented by Roberts is again a matter of representation and may only be appropriate in homogeneous environments. Abstraction shall ease the agent's action selection process. If those details are eliminated that are irrelevant for the choice of an action, the difficulty and processing time of action selection are reduced, and action selection strategies may become applicable to a broader range of scenarios. When choosing an abstraction paradigm for a given data set, the result must be regarded in the context of accessibility of information. The goal of abstraction must be to enable easy access to the relevant information. Which piece of information is relevant, of course, depends on the task at hand: a computer-driven navigation control may require different concepts than a system interacting with a human being. Abstraction retains information that is relevant for a certain purpose. Therefore, it can never be regarded as purely data driven, but requires a solid a-priori concept of the problem to solve and, consequently, the actions to take. As we have seen in Section 2.4, we can use different abstraction paradigms to achieve the same effect, given an appropriate state space representation. From
the viewpoint of accessibility we now argue for preferring the use of aspectualizable representations, as relevant aspects are clearly separated and easy to access, and aspectualization itself is a computationally simple process. Accessibility eases knowledge extraction: Section 3.3 will show an example of an algorithm that makes use of the aspectualizability of a representation. Once again, aspectualizability can be achieved by abstraction. In particular, conceptual classification is a powerful means. So abstraction helps to create representations that allow for distinguishing different aspects by using aspectualization.
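To spell out the contrast between the data-driven and the action-driven notion of similarity discussed above, here is a small illustrative Python sketch; the mapping f, the threshold ε, and the toy policy are assumptions introduced only for this example.

```python
def indistinguishable(s1, s2, f, eps):
    """Roberts' data-driven notion: s1 ~ s2 iff |f(s1) - f(s2)| < eps."""
    return abs(f(s1) - f(s2)) < eps

def action_equivalent(s1, s2, policy):
    """Action-driven notion used here: two states are similar iff the agent
    should select the same action in both of them."""
    return policy(s1) == policy(s2)

# Toy states: distance (in metres) to the nearest obstacle straight ahead.
f = lambda s: s
policy = lambda s: "turn" if s < 1.0 else "go_straight"

print(indistinguishable(0.9, 1.1, f, eps=0.5))  # True: metrically close
print(action_equivalent(0.9, 1.1, policy))      # False: different actions required
```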
3 Knowledge Transfer of Simulation Strategies to a Real Robot
In this section we show how abstraction supports knowledge transfer. We regard the problem of transferring navigation skills learned with reinforcement learning (RL) [11] in a simple simulator to a real robot, and demonstrate that, on the one hand, abstraction allows us to cope with real-world concepts in the same way as with simulated ones and, on the other hand, that the transfer of knowledge benefits from the aspectualizability of the state space representation. The key benefit of this approach is that learning is much more efficient in simple simulations than in real environments or complex simulations thereof. In particular, we show that the use of an aspectualizable representation empowers the derivation of aspectualizable behavior that is key to successful knowledge transfer.
3.1 The Task
The task considered here is the following: A simulated robot shall learn to find a specified location s∗ ∈ S within an unknown environment (see Fig. 2, left, for a look at the simulation testbed). This scenario is formalized as a Markov Decision Process (MDP) ⟨S, A, T, R⟩ with a continuous state space S = {(x, y, θ) | x, y ∈ R, θ ∈ [0, 2π)} where each system state is given by the robot's position (x, y) and an orientation θ, an action space A of navigational actions the agent can perform to move to another state, a transition function T : S × A × S → [0, 1] denoting a probability distribution that performing an action a at state s will result in state s′, and a reward function R : S → R, where a positive reward will be given when a goal state s∗ ∈ S is reached and a negative one if the agent collides with an obstacle. A solution to this MDP is a policy π(s) that assigns an action to take to any state s. RL is a frequently used method to compute such a strategy. After successful learning, following π in a greedy way will now bring the robot from every position in the world to the goal location s∗. In general, this strategy π is bound to the goal state the agent learned to reach. For mastering another task only differing in s∗, the whole strategy would need to be re-learned from scratch, including low-level skills such as turning actions and collision avoidance—the knowledge gained in the first learning task would not be applicable. The challenge of avoiding this and re-using parts of the gained
Fig. 2. Left: a screenshot of the robot navigation scenario in the simulator, where the strategy is learned. Right: a Pioneer 2 in an office building, where the strategy shall be applied. The real office environment offers structural elements not present in the simulator: open space, uneven walls, tables, and other obstacles.
knowledge of a learning task and transferring it to another one has recently been labeled transfer learning, and several approaches have been proposed to tackle this problem [12,13,14, e.g.]. We will describe how such transfer capabilities can be achieved by spatial state space abstraction and we will point out how abstraction mechanisms allow for knowledge transfer in a more general sense: Learned navigation knowledge is not only transferable to a similar task with another goal location, but abstraction allows us to operate on the same abstract entities in quite different tasks. We will show that the spatial state space abstraction approach even allows for bridging the gap between results gained in a simple simulator and real robotics just by the use of spatial abstraction.
3.2 Learning a Policy in Simulation
In our simulation scenario, the robot is able to perceive walls around it as line segments within a certain maximum range. This perception is disturbed by noise such that every line segment is detected as several smaller ones. The agent can also identify the walls. In our simulator, this is modeled in a way that every wall has a unique color and the agent perceives the color of the wall. The robot is capable of performing three actions: moving forward and turning a few degrees either to the left or to the right. Turning includes a small forward movement, and some noise is added to all actions. There is no built-in collision avoidance or any other navigational intelligence provided. For learning we use the reinforcement learning paradigm of Q-learning [15]. The result is a Q-function that assigns an expected overall reward to any state-action pair (s, a) and a policy π(s) = argmax_a Q(s, a) that delivers the action with the highest expected reward for every state s.
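A minimal tabular Q-learning sketch over an abstracted observation space and the three action primitives is shown below (added for illustration; the learning rate, discount factor, and epsilon-greedy exploration are assumptions of this sketch, not the exact settings used in the experiments).

```python
import random
from collections import defaultdict

ACTIONS = ["forward", "turn_left", "turn_right"]   # the three action primitives
Q = defaultdict(float)                             # Q[(o, a)] -> expected return
ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.1             # illustrative hyperparameters

def select_action(o):
    """Epsilon-greedy action selection for an abstracted observation o."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(o, a)])

def q_update(o, a, reward, o_next):
    """One Q-learning step:
    Q(o,a) <- Q(o,a) + alpha * (r + gamma * max_b Q(o',b) - Q(o,a))."""
    best_next = max(Q[(o_next, b)] for b in ACTIONS)
    Q[(o, a)] += ALPHA * (reward + GAMMA * best_next - Q[(o, a)])

def greedy_policy(o):
    """pi(o) = argmax_a Q(o, a), used after learning."""
    return max(ACTIONS, key=lambda a: Q[(o, a)])
```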
Fig. 3. Neighboring regions around the robot in relation to its moving direction. Note that the regions R1 , . . . , R5 in the immediate surroundings (b) overlap R10 , . . . , R16 (a). The size of the grid defining the immediate surroundings is given a-priori. It is a property of the agent and depends on its size and system dynamics (for example, the robot’s maximal speed). In this work, only the thick drawn boundaries in (a) are regarded for building the representation.
This learning task is a complex one, because the underlying state space is large and continuous, and reinforcement learning processes are known to suffer from performance problems under these conditions. Thrun and Schwartz stated that, to be able to adapt RL to complex tasks, it is necessary to discover the structure of the world and to abstract from its details [16]. In any case, a sensible reduction of the state space will be beneficial for any RL application. To achieve that structural abstraction, we make use of the observation that navigation in space can be divided into two different aspects: goal-directed behavior towards a task-specific target location, and generally sensible behavior that is task-independent and the same in any environment [17]. According to [12], we refer to the first as problem space and to the latter as agent space. It is especially agent space that encodes structural information about the world that persists in any learning task, and therefore this knowledge is worth transferring to different scenarios. The structure of office environments as depicted in Fig. 2 is usually characterized by walls, which can be abstracted as line segments in the plane. Even more, it is the relative position information of line segments with respect to the robot's moving direction that defines structural paths in the world and leads to sensible action sequences for a moving agent. Thus, for encoding agent space, we use the qualitative representation RLPR (Relative Line Position Representation) [17]. Inspired by the "direction relation matrix" [18], the space around the agent is partitioned into bounded and unbounded regions Ri (see Fig. 3). Two functions τ : N → {0, 1} and τ′ : N → {0, 1} are defined: τ(i) denotes whether there is a line segment detected within a region Ri, and τ′(i) denotes whether a line spans from a neighboring region Ri+1 to Ri. τ is used for the bounded sectors in the immediate vicinity of the agent (R1 to R5 in Fig. 3(b)): objects that appear there have to be avoided in any case. The position of detected line segments in R10 to R16 (Fig. 3(a)) is helpful information for general orientation and mid-term planning, so τ′ is used for R10 to R16. This abstraction from line segments in the simulator to a vector of RLPR values is a conceptual classification.
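The sketch below (added for illustration, not from the original text) shows one way this conceptual classification could be computed. The helper predicate in_region, which tests whether a detected line segment intersects region Ri, is a hypothetical placeholder, and the assignment of the presence test τ to R1–R5 and the spanning test τ′ to R10–R16 follows the simplified reading given above.

```python
INNER = [1, 2, 3, 4, 5]              # bounded sectors in the immediate vicinity
OUTER = [10, 11, 12, 13, 14, 15, 16]

def tau(segments, i, in_region):
    """tau(i): 1 iff some detected line segment lies within region R_i."""
    return int(any(in_region(seg, i) for seg in segments))

def tau_prime(segments, i, in_region):
    """tau'(i): 1 iff some line segment spans from region R_{i+1} into R_i
    (approximated here by intersecting both regions; in_region is assumed
    to return False for region indices that do not exist)."""
    return int(any(in_region(seg, i) and in_region(seg, i + 1) for seg in segments))

def rlpr(segments, in_region):
    """Concatenate presence values for the inner sectors and spanning values
    for the outer sectors into one qualitative feature vector psi_r."""
    return tuple(tau(segments, i, in_region) for i in INNER) + \
           tuple(tau_prime(segments, i, in_region) for i in OUTER)
```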
[Fig. 4 schema: environmental data from simulation → (conceptual classification) → landmark-enriched RLPR representation → (Ψ1 aspectualization) → problem space, and → (Ψ2 aspectualization) → agent space]
Fig. 4. Abstraction principles used to build an aspectualizable state space representation
For representing problem space it is sufficient to encode the qualitative position of the agent within the world. We do this by representing a circular order of detected landmarks, as for example proposed in [19]. Therefore we regard a sequence of detected wall colors ci at seven discrete angles around the robot: ψl(s) = (c1, …, c7). As suggested in Section 2.5, we now use these two conceptual classifications to create an aspectualizable state space representation by concatenating ψl and ψr. The result ψ(s) is the landmark-enriched RLPR representation: ψ(s) = (ψl(s), ψr(s)) = (c1, …, c7, τ(R1), …, τ(R5), τ′(R10), …, τ′(R16)). We call the new emerging state space O = Image(ψ) the observation space. It is a comparably small and discrete state space, fulfilling the three goals of abstraction we defined in Section 2.5. The RLPR-based approach has been shown to outperform metrical representations that rely on distances or absolute coordinates with regard to learning speed and robustness [17]. For an example of deriving RLPR values refer to Fig. 5. So conceptual classification is employed twice for both problem and agent space to create a compact state space representation. ψ(s) is aspectualizable regarding the two aspects of navigation (see Fig. 4). Let us now investigate how to take advantage of that to transfer general navigation knowledge to a new task.
3.3 Extracting General Navigation Behavior
The learning process results in a policy π : O → A that maps any o ∈ O to an action to take when the agent observes o. The corresponding Q-values are stored in a lookup table. We now want to apply this policy to a totally different domain, the real world, where we cannot recognize the landmarks encountered
during learning. So the policy must provide sensible actions to take in the absence of known landmarks. This is the behavior that refers to the aspect of general navigation behavior or agent space. It has to be singled out from π. By design, ψ(s) is aspectualizable with regard to agent space, and the desired information is easily accessible. An aspectualization κ(o) = κ(ψl(s), ψr(s)) = ψr(s) provides structural world information for any observation. That is, structurally identical situations share an identical RLPR representation. A new Q-function Qπ′ for a general, aspectualized policy π′ for arbitrary states with the same aspect ψr(s) can be constructed by Q-value averaging over states with identical ψr(s), which are easily accessible because of the aspectualizability of O. Given a learned policy π with a value function Qπ(o, a) (o ∈ O, a ∈ A), we construct a new policy π′ with Qπ′(o′, a) (o′ ∈ O′, a ∈ A) in a new observation space O′ = Image(ψr), with the following function [20]:
Qπ′(o′, a) = ( Σ_{c ∈ {ψl(s)}} (max_{b∈A} |Qπ((c, o′), b)|)⁻¹ · Qπ((c, o′), a) ) / |{((c, o′), a) | Qπ((c, o′), a) ≠ 0}|
This is a weighted sum over all possible landmark observations c (in reality, of course, only the visited states have to be considered, because Qπ(o, a) = 0 for the others, so the computational effort is very low). It is averaged over all state-action pairs where the information is available, that is, the Q-value is not zero. A weighting factor scales all values according to the maximum reward over all actions. This procedure has been applied to a policy learned in the simulated environment depicted in Fig. 2 for 40,000 learning episodes. For the exact experimental conditions of learning the policy in simulation refer to [20]. The resulting policy Qπ′ can then be used to control a real robot as shown in the following section.
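A direct transcription of this averaging into Python (illustrative only; the dictionary layout of the Q-table is an assumption of this sketch) could look as follows.

```python
from collections import defaultdict

def aspectualized_q(Q, actions):
    """Build Q_pi'(o', a) from Q_pi((c, o'), a) by the weighted average above.
    Q maps ((c, o_prime), a) to a value; zero entries carry no information.
    Each contribution is scaled by 1 / max_b |Q((c, o'), b)|, and the sum is
    divided by the number of informative entries."""
    numerator = defaultdict(float)
    count = defaultdict(int)
    for ((c, o_prime), a), value in Q.items():
        if value == 0:
            continue
        scale = max(abs(Q.get(((c, o_prime), b), 0.0)) for b in actions)
        numerator[(o_prime, a)] += value / scale
        count[(o_prime, a)] += 1
    return {key: numerator[key] / count[key] for key in numerator}
```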
3.4 Using the Aspectualized Strategy on a Mobile Robot
Controlling a mobile robot with a strategy learned in simulation requires sensory input to be mapped to the same domain as used in the simulation. This can be accomplished in a straightforward manner given that an abstract intermediate representation is constructed on the robot from raw sensor data. We now detail this approach using a Pioneer-2 type robot equipped with a laser range finder. Laser range finders detect obstacles around the robot (the field of view of the sensor used on our robot is 180°). By measuring laser beams reflected by obstacles one obtains a sequence of (in our case) 361 points in local coordinates. We use the well-known iterative split-and-merge algorithm that is commonly used in robotics to fit lines to scan data (see [9])—another conceptual classification, as we have seen in Example 4. With respect to parameters of the procedure we point out that a precise line fitting is not required [20]. Rather, we want to make sure that all obstacles detected by the laser range scanner get represented by lines, even if this is a crude approximation. The detected line configuration is then mapped every 0.25 seconds to the RLPR representation and fed into the learned strategy to obtain the action primitive to perform. See Fig. 5 for a view
[Fig. 5 region values: τ(R1) = 1, τ(R2) = 0, τ(R3) = 0, τ(R4) = 1, τ(R5) = 0; τ′(R10) = 1, τ′(R11) = 0, τ′(R12) = 0, τ′(R13) = 0, τ′(R14) = 1, τ′(R16) = 1]
Fig. 5. Screenshot: Abstraction to RLPR in the robot controller. Depicted are the qualitative regions (see Fig. 3) and the interpreted sensor data which has been acquired from the robot position shown in Fig. 2 right. The overall representation for this configuration is ψr (s) = {1, 0, 0, 1, 0, 1, 0, 0, 0, 1, 1}.
on the line detection of the laser range finder data and the corresponding RLPR representation. Fig. 6 gives an overview of the development of representations in both the simulator and the robot application. In the simulation, three action primitives (straight on, turn left, turn right) have been used that always move the robot some fixed distance. Rather than implementing this step-wise motion on the real robot, we mapped the actions to commands controlling the wheel speeds in order to obtain continuous motion. Additionally, movement is smoothed by averaging the most recent wheel speed commands to avoid strong acceleration/deceleration, which the robot drive cannot handle well. We applied the averaging to the last 8 actions, which (given the 0.25 second interval of wheel commands) yields a time of 2 seconds before reaching the wheel speed associated with the action primitive. In accordance with the robot's size and motion dynamics, the inner regions of the RLPR grid (Fig. 3(b)) have been set to 60 cm in front and 30 cm to both the left and the right of the robot. We analyzed the behavior of the Pioneer 2 robot with the learned policy in our office environment. In contrast to the simple simulation environment, the office environment presents uneven walls, open spaces of several meters, plants, and furniture like a sofa or bistro tables. The robot shows a reasonable navigation behavior, following corridors in a straight line and turning smoothly around curves. It also showed the ability to cope with structural elements not present in the simulated environment, such as open space or tiny obstacles. In other words, general navigation skills learned in simulation have been transferred to the real-world environment. The robot only got stuck when reaching areas with a huge amount of clutter
[Fig. 6 diagram labels: efficient learning in abstracted simulation; realization in real world environment; artificial robot simulation; mobile robot with laser range finder; landmark configuration; geometric representation of obstacle outlines; landmark-enriched RLPR representation; RLPR representation; action selection based on learned strategy]
Fig. 6. Evolution of spatial representations in both simulation and real robot application. Abstraction techniques enable both scenarios to operate on the RLPR representation to achieve a reasonable action selection.
Fig. 7. Pioneer 2 entering an open space, using the aspectualized policy learned in the simulator. It shows a reasonable navigation behavior in a real office environment, driving smoothly forward and safely around obstacles.
(such as hanging leaves of plants) and in dead ends where the available motion primitives do not allow for collision-free movement anymore. Because the original task was goal-oriented (searching for a specific place), the robot also showed a strong tendency to move forward and thus actively explore the environment instead of just avoiding obstacles. This generally sensible navigation behavior could now,
for example, be used as a basis for learning new goal-oriented tasks on the robotics platform. Fig. 7 gives an impression of the robot experiment.
4 Discussion
Performing abstraction is a fundamental ability of intelligent agents, and different facets of abstraction have thus been addressed in previous work from various scientific fields, considering a rich diversity of tasks. First, we comment on a critical remark by Klippel et al.: In their thorough study on schematization, they state that "there is no consistent approach to model schematization" [4]. We believe that by our formal definitions of abstraction principles the manifold terms used to describe abstraction can very well be classified and related. The insight that abstraction can be divided into different categories has been mentioned before. Stell and Worboys present a distinction of what they call "selection" and "amalgamation" and formalize these concepts for graph structures [6]. Our definition of aspectualization and coarsening corresponds to selection and amalgamation, which Stell and Worboys describe as being "conceptually distinct" types of generalization. Regarding this, we pointed out that this conceptual distinctness applies only to the process of abstraction and not to the result, as we could show that the effect of different abstraction paradigms critically depends on the choice of the initial state space representation. Bertel et al. also differentiate between different facets of abstraction ("aspectualization versus specificity", "aspectualization versus concreteness", and "aspectualization versus integration"), but without giving an exact definition [7]. "Aspectualization versus specificity" corresponds to our definition of aspectualization, and "aspectualization versus concreteness" to coarsening. However, our definition of aspectualization is tighter than the one given by Bertel et al.: According to them, aspectualization is "the reduction of problem complexity through the reduction of the number of feature dimensions". In our definition, it is also required that all the other components remain unchanged. The notion of schematization, which Leonard Talmy describes as "a process that involves the systematic selection of certain aspects of a referent scene to represent the whole disregarding the remaining aspects" [21], is tightly connected to our definition of aspectualization. If we assume the referent scene to be aspectualizable according to Def. 3, then the process mentioned by Talmy is aspectualization as defined here. Annette Herskovits defines the term schematization in the context of linguistics as consisting of three different processes, namely abstraction, idealization, and selection [5]. According to our definition, abstraction and selection would both be an aspectualization, and idealization refers to coarsening. The action-centered view on abstraction we introduced in Section 2.5 is also shared by the definition of categorizability given by Porta and Celaya [22]. The authors call an environment categorizable if "a reduced fraction of the available inputs and actuators have to be considered at a time". In other words: In a
categorizable environment, an abstraction can be achieved that maps states requiring identical action selection to identical representations.
5 Conclusion
In this article we classify abstraction by three distinct principles: aspectualization, coarsening, and conceptual classification. We give a formal definition of these principles for classifying and clarifying the manifold concept names for abstraction found in the literature. This enables us to show that knowledge representation is of critical importance and thus must be addressed in any discussion of abstraction. Identical information may be represented differently, and, by choosing a specific representation, different types of abstraction processes may be applicable and lead to an identical result. Also, as abstraction is triggered by the need to perform a certain task, abstraction can never be regarded as purely data driven, but it requires a solid a-priori concept of the problem to solve and, consequently, the actions to take. We introduce the notion of aspectualizability in knowledge representations. Aspectualizable knowledge representations are key to enabling knowledge transfer. By designing an aspectualizable representation, it is possible to transfer navigation knowledge learned in a simplified simulation to a real-world robot setting. Acknowledgments. This work was supported by the DFG Transregional Collaborative Research Center SFB/TR 8 "Spatial Cognition" (project R3-[Q-Shape]). Funding by the German Research Foundation (DFG) is gratefully acknowledged. The authors would like to thank Jan Oliver Wallgrün, Frank Dylla, and Jae Hae Lee for inspiring discussions. We also thank the anonymous reviewers for pointing us to further literature from different research communities.
References 1. Hobbs, J.R.: Granularity. In: Proceedings of the Ninth International Joint Conference on Artificial Intelligence (IJCAI), pp. 432–435 (1985) 2. Bittner, T., Smith, B.: A taxonomy of granular partitions. In: Montello, D. (ed.) Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science (COSIT), pp. 28–43. Springer, Berlin (2001) 3. Mackaness, W.A., Chaudhry, O.: Generalization and symbolization. In: Shekhar, S., Xiong, H. (eds.) Encyclopedia of GIS (2008) 4. Klippel, A., Richter, K.F., Barkowsky, T., Freksa, C.: The cognitive reality of schematic maps. In: Meng, L., Zipf, A., Reichenbacher, T. (eds.) Map-based Mobile Services – Theories, Methods and Implementations, pp. 57–74. Springer, Berlin (2005) 5. Herskovits, A.: Schematization. In: Olivier, P., Gapp, K.P. (eds.) Representation and Processing of Spatial Expressions, pp. 149–162. Lawrence Erlbaum Associates, Mahwah (1998) 6. Stell, J.G., Worboys, M.F.: Generalizing graphs using amalgamation and selection. In: Güting, R.H., Papadias, D., Lochovsky, F. (eds.) SSD 1999. LNCS, vol. 1651, pp. 19–32. Springer, Heidelberg (1999)
7. Bertel, S., Vrachliotis, G., Freksa, C.: Aspect-oriented building design: Toward computer-aided approaches to solving spatial constraint problems in architecture. In: Allen, G.L. (ed.) Applied Spatial Cognition: From Research to Cognitive Technology, pp. 75–102. Lawrence Erlbaum Associates, Mahwah (2007) 8. Moravec, H.P., Elfes, A.E.: High resolution maps from wide angle sonar. In: Proceedings of the IEEE International Conference on Robotics and Automation (ICRA), St. Louis, MO (1985) 9. Gutmann, J.S., Weigel, T., Nebel, B.: A fast, accurate and robust method for self-localization in polygonal environments using laser range finders. Advanced Robotics 14(8), 651–667 (2001) 10. Roberts, F.S.: Tolerance geometry. Notre Dame Journal of Formal Logic 14(1), 68–76 (1973) 11. Sutton, R.S., Barto, A.G.: Reinforcement learning: an introduction. In: Adaptive Computation and Machine Learning. MIT Press, Cambridge (1998) 12. Konidaris, G.D., Barto, A.G.: Building portable options: Skill transfer in reinforcement learning. In: Proceedings of the Twentieth International Joint Conference on Artificial Intelligence (IJCAI) (2007) 13. Taylor, M.E., Stone, P.: Cross-domain transfer for reinforcement learning. In: Proceedings of the Twenty Fourth International Conference on Machine Learning (ICML 2007), Corvallis, Oregon (2007) 14. Torrey, L., Shavlik, J., Walker, T., Maclin, R.: Skill acquisition via transfer learning and advice taking. In: Fürnkranz, J., Scheffer, T., Spiliopoulou, M. (eds.) ECML 2006. LNCS (LNAI), vol. 4212, pp. 425–436. Springer, Heidelberg (2006) 15. Watkins, C., Dayan, P.: Q-learning. Machine Learning 8, 279–292 (1992) 16. Thrun, S., Schwartz, A.: Finding structure in reinforcement learning. In: Tesauro, G., Touretzky, D., Leen, T. (eds.) Advances in Neural Information Processing Systems: Proceedings of the 1994 Conference, vol. 7. MIT Press, Cambridge (1995) 17. Frommberger, L.: A generalizing spatial representation for robot navigation with reinforcement learning. In: Proceedings of the Twentieth International Florida Artificial Intelligence Research Society Conference (FLAIRS 2007), Key West, FL, USA, pp. 586–591. AAAI Press, Menlo Park (2007) 18. Goyal, R.K., Egenhofer, M.J.: Consistent queries over cardinal directions across different levels of detail. In: Tjoa, A.M., Wagner, R., Al-Zobaidie, A. (eds.) Proceedings of the 11th International Workshop on Database and Expert System Applications, Greenwich, UK, pp. 867–880 (2000) 19. Schlieder, C.: Representing visible locations for qualitative navigation. In: Carrete, N.P., Singh, M.G. (eds.) Qualitative Reasoning and Decision Technologies, Barcelona, Spain, pp. 523–532 (1993) 20. Frommberger, L.: Generalization and transfer learning in noise-affected robot navigation tasks. In: Neves, J., Santos, M.F., Machado, J.M. (eds.) EPIA 2007. LNCS (LNAI), vol. 4874, pp. 508–519. Springer, Heidelberg (2007) 21. Talmy, L.: How language structures space. In: Pick Jr., H.L., Acredolo, L.P. (eds.) Spatial Orientation: Theory, Research, and Application. Plenum, New York (1983) 22. Porta, J.M., Celaya, E.: Reinforcement learning for agents with many sensors and actuators acting in categorizable environments. Journal of Artificial Intelligence Research 23, 79–122 (2005)
Representing Concepts in Time* Martin Raubal Department of Geography, University of California, Santa Barbara 5713 Ellison Hall, Santa Barbara, CA 93106, U.S.A.
[email protected]
Abstract. People make use of concepts in all aspects of their lives. Concepts are mental entities, which structure our experiences and support reasoning in the world. They are usually regarded as static, although there is ample evidence that they change over time with respect to structure, content, and relation to real-world objects and processes. Recent research considers concepts as dynamical systems, emphasizing this potential for change. In order to analyze the alteration of concepts in time, a formal representation of this process is necessary. This paper proposes an algebraic model for representing dynamic conceptual structures, which integrates two theories from geography and cognitive science, i.e., time geography and conceptual spaces. Such representation allows for investigating the development of a conceptual structure along space-time paths and serves as a foundation for querying the structure of concepts at a specific point in time or for a time interval. The geospatial concept of ‘landmark’ is used to demonstrate the formal specifications. Keywords: Conceptual spaces, time geography, concepts, representation, algebraic specifications.
1 Introduction Humans employ concepts to structure their world, and to perform reasoning and categorization tasks. Many concepts are not static but change over time with respect to their structure, substance, and relations to the real world. In addition, different people use the same or similar concepts to refer to different objects and processes in the real world, which can lead to communication problems. In this paper, we propose a novel model to represent conceptual change over time. The model is based on a spatiotemporal metaphor, representing conceptual change as movement along space-time paths in a semantic space. It thereby integrates conceptual spaces [1] as one form of conceptual representation within a time-geographic framework [2]. Formal representations of dynamic concepts are relevant from both a theoretical and practical perspective. On the one hand, they allow us to theorize about how people's internal processes operate on conceptual structures and result in their alterations over time. On the other hand, they are the basis for solving some of the current pressing research questions, such as in Geographic Information Science (GIScience) and
* This paper is dedicated to Andrew Frank, for his 60th birthday. He has been a great teacher and mentor to me.
the disciplines concerned with ontologies. In GIScience, questions addressing which geospatial concepts exist, how to trace their developmental patterns, model their interactions (such as merging), and how to represent and process them computationally are of major importance [3]. Research on ontologies has focused on dynamic ontologies (see, for example, http://dynamo.cs.manchester.ac.uk/) for services to be integrated within the semantic web [4]. If we consider ontologies as explicit specifications of conceptualizations [5], then formal representations of dynamic concepts can be utilized for translation into ontologies. Section 2 presents related work regarding concepts, and introduces conceptual spaces and time geography as the foundations for the proposed model. In Section 3, we define our use of representation and describe the metaphorical mapping from time-geographic elements to entities and operations in semantic space. We further elaborate on the difference between changes within and between conceptual spaces. Section 4 presents a computational model of conceptual change in terms of executable algebraic specifications. Within this model, the mappings of entities and operations are specified at the level of conceptual spaces, which consist of quality dimensions. Section 5 applies the formal specifications to represent the change of a person's geospatial concept of 'landmark' over time. The final section presents conclusions and directions for future research.
2 Related Work This section starts with an explanation of the notion of concepts and their importance for categorization. We then introduce conceptual spaces and time geography as the underlying frameworks for representing concepts in time. 2.1 Concepts There are several conflicting views on concepts, categories, and their relation to each other across and even within different communities. From a classical perspective, concepts have been defined as structured mental representations (of classes or individuals), which encode a set of necessary and sufficient conditions for their application [6]. They deal with what is being represented and how such information is used during categorization [7]. Barsalou et al. [8] view concepts as mental representations of categories and point out that concepts are context dependent and situated. For example, the concept of a chair is applied locally and does not cover all chairs universally. From a memory perspective, "concepts are the underlying knowledge in long-term memory from which temporary conceptualizations in working memory are constructed." [8, footnote 7] It is important to note the difference between concepts and categories: a concept is a mental entity, whereas a category refers to a set of entities that are grouped together [9]. Concepts are viewed as dynamical systems that evolve and change over time [8]. New sensory input leads to the adaptation of previous concepts, such as during the interactive process of spatial knowledge acquisition [10]. Neisser's [11] perceptual cycle is also based on the argument that perception and cognition involve dynamic
cognitive structures (schemata in his case rather than explicit concepts). These are subject to change as more information becomes available. Here, we use concepts within the paradigm of cognitive semantics, which asserts that meanings are mental entities—mappings from expressions to conceptual structures, which refer to the real world [12-14]. The main argument is therefore that a symbolic representation cannot refer to objects directly, but only through concepts in the mind. This difference between objects, concepts, and symbols is often expressed through the semiotic triangle [15]. 2.2 Conceptual Spaces The notion of conceptual space was introduced as a framework for representing information at the conceptual level [1]. Such representation rests on the aforementioned foundation of cognitive semantics. Conceptual spaces can be utilized for knowledge representation and sharing, and support the paradigm that concepts are dynamical systems [16]. Sowa [17] argued that conceptual spaces are a promising geometrical model for representing abstract concepts as well as physical images. Furthermore, conceptual spaces may serve as an explanatory framework for results from neuroscientific research regarding the representational structure of the brain [1]. A conceptual space is a set of quality dimensions with a geometrical or topological structure for one or more domains. Domains are represented through sets of integral dimensions, which are distinguishable from all other dimensions. For example, the color domain is formed through the dimensions hue, saturation, and brightness. Concepts cover multiple domains and are modeled as n-dimensional regions. Every object or member of the corresponding category is represented as a point in the conceptual space. This allows for expressing the similarity between two objects as the spatial distance between their points. Recent work has focused on representing actions and functional properties in conceptual spaces [18]. In [19], a methodology to formalize conceptual spaces as vector spaces was presented. Formally, a conceptual vector space is defined as Cn = {(c1, c2, …, cn) | ci ∈ C} where the ci are the quality dimensions. A quality dimension can also represent a whole domain, and in this case cj = Dn = {(d1, d2, …, dn) | dk ∈ D}. Vector spaces have a metric and therefore allow for the calculation of distances between points in the space. This can also be utilized for measuring distances between concepts, either based on their approximation by 'prototypical points' or 'prototypical regions' [20]. In order to calculate these semantic distances between instances of concepts, all quality dimensions of the space must be represented in the same relative unit of measurement. Assuming a normal distribution, this is ensured by calculating the z scores for these values, also called z-transformation [21]. For specifying different contexts one can assign weights to the quality dimensions of a conceptual vector space. This is essential for the representation of concepts as dynamical systems, because the salience of dimensions may change over time. Cn is then defined as {(w1c1, w2c2, …, wncn) | ci ∈ C, wj ∈ W} where W is the set of real numbers. 2.3 Time Geography People and resources are available only at a limited number of locations and for a limited amount of time. Time geography focuses on this necessary condition at the core of human existence: "How does my location in space at a given time affect my
ability to be present at other locations at other times?" It defines the space-time mechanics by considering different constraints for such presence—the capability, coupling, and authority constraints [2]. The possibility of being present at a specific location and time is determined by people's ability to trade time for space, supported by transportation and communication services. Space-time paths depict the movement of individuals in space over time. Such paths are available at various spatial (e.g., house, city, country) and temporal granularities (e.g., decade, year, day) and can be represented through different dimensions. Figure 1 shows a person's space-time path during a day, representing her movements and activity participation at three different locations. The tubes depict space-time stations—locations that provide resources for engaging in particular activities, such as sleeping, eating, and working. The slope of the path represents the travel velocity. If the path is vertical, then the person is engaged in a stationary activity.
[Fig. 1 labels: Time, Geographical space (axes); Home, Mall, Doctor's office (space-time stations)]
Fig. 1. Space-time path of a person’s daily activities
Three classes of constraints limit a person's activities in space and time. Capability constraints limit an individual's activities based on her abilities and the available resources. For example, a fundamental requirement for many people is to sleep between six and eight hours at home. Coupling constraints require a person to occupy a certain location for a fixed duration to conduct an activity. If two people want to meet at a café, then they have to be there at the same time. In time-geographic terms, their paths cluster into a space-time bundle. Certain domains in life are controlled through authority constraints, which are fiat restrictions on activities in space and time. A person can only shop at a mall when the mall is open, such as between 10am and 9pm. All space-time paths must lie within space-time prisms (STP). These are geometrical constructs of two intersecting cones [22]. Their boundaries limit the possible locations a path can take based on people's abilities to trade time for space. Figure 2 depicts a space-time prism for a scenario where origin and destination have the same location. The time budget Δt = t2 − t1 defines the interval in which a person can move away from the origin, limited only by the maximum travel velocity. The interior of the
[Fig. 2 labels: Time and Geographical space axes; PPS between t1 and t2; PPA in geographical space]
Fig. 2. Space-time prism as intersecting cones
prism defines a potential path space (PPS), which represents all locations in space and time that can be reached by the individual during Δt. The projection of the PPS onto geographical space results in the potential path area (PPA) [23].
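As a small illustration of these constructs (an added sketch, not part of the original text), the Python snippet below checks membership in the PPS and the PPA for the special case of Fig. 2, where origin and destination coincide; the coordinates, units, and maximum speed are arbitrary assumptions.

```python
import math

def dist(p, q):
    """Euclidean distance between two locations in geographical space."""
    return math.hypot(p[0] - q[0], p[1] - q[1])

def in_pps(origin, p, t, t1, t2, v_max):
    """Space-time point (p, t) lies inside the prism of Fig. 2 iff it can be
    reached from the origin after t1 and the origin can be reached again by t2."""
    return (t1 <= t <= t2 and
            dist(origin, p) <= v_max * (t - t1) and
            dist(origin, p) <= v_max * (t2 - t))

def in_ppa(origin, p, t1, t2, v_max):
    """Location p lies inside the potential path area (projection of the PPS)
    iff a round trip from the origin fits into the time budget t2 - t1."""
    return 2 * dist(origin, p) <= v_max * (t2 - t1)

# Illustrative numbers: a 30-minute budget (0.5 h) at 5 km/h walking speed.
home = (0.0, 0.0)
cafe = (1.0, 0.5)   # kilometres
print(in_ppa(home, cafe, t1=0.0, t2=0.5, v_max=5.0))         # True
print(in_pps(home, cafe, t=0.25, t1=0.0, t2=0.5, v_max=5.0))  # True
```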
3 A Spatio-temporal Metaphor for Representing Concepts in Time In this section, we first give a definition of representation, which applies to the model presented here. The metaphorical mapping from time-geographic to semantic-space elements is then explained. A formal model for the resulting semantic space will be developed in the next section. 3.1 Representational Aspects Different definitions of what a representation is have been given in the literature. In this paper, we commit to the following: “A world, X, is a representation of another world, Y, if at least some of the relations for objects of X are preserved by relations for corresponding objects of Y.” [24, p.267] In order to avoid confusion about what is being represented how and where regarding conceptual change over time, we distinguish between two representations—the mental world and the mental model—, according to [24]. The mental world is a representation of the real world and concerned with the inner workings and processes within the brain and nervous system (i.e., inside the head). Here, we formally specify a possible mental model as a representation of the mental world2. The goal is to be able to use this model to explain the processes that lead to the change of concepts in time. In this sense, we are aiming for informational equivalence [24], see also [25] and [26] for examples from the geospatial domain. 2
A mental model is therefore a representation of a representation of the real world—see Palmer [24] for a formal demonstration of this idea.
3.2 Metaphorical Mapping The proposed mental model for representing conceptual change in time is based on a spatio-temporal metaphor. The power of spatial metaphors for modeling and comprehending various non-spatial domains has been widely demonstrated [27-30]. From a cognitive perspective, the reason for such potential is that space plays a fundamental role in people’s everyday lives, including reasoning, language, and action [31]. Our representation of conceptual change in a mental model is based on the metaphorical projection of entities, their relations, and processes from a spatio-temporal source domain to a semantic target domain. As with all metaphors, this is a partial mapping, because source and target are not identical [30]. Concepts are represented as n-dimensional regions in conceptual spaces, which can move through a semantic space in time. The goal of this metaphor is to impose structure on the target domain and therefore support the explanation of its processes. Table 1. Metaphorical projection from time-geographic to semantic-space elements
Time-geographic elements → Semantic-space elements
geographic space → semantic space
geographic distance (dgeog) → semantic distance (dsem)
space-time path (ST-path) → semantic space-time path (SST-path)
space-time station (STS) → semantic space-time station (SSTS)
space-time prism (STP) → semantic space-time envelope (SSTE)
coupling constraint → semantic coupling constraint
authority constraint → contextual constraint
potential path space (PPS) → semantic potential path space (SPPS)
More specifically, individual time-geographic elements are being mapped to elements in the semantic space (Table 1, Figure 3). Geographic space is being mapped to semantic space, which can be thought of as a two- or three-dimensional attribute surface as used in information visualization [32, 33]. Both conceptual spaces and semantic spaces have a metric, which allows for measuring semantic distances dsem between concepts and conceptual spaces [19]. Conceptual spaces (CS1 and CS2 in Figure 3) move along semantic space-time paths (SST-path), vertical paths thereby signifying stationary semantics, i.e., no conceptual change involving a change in dimensions but changes in dimension values are possible (see Section 3.3). Such stationarity corresponds to a semantic space-time station (SSTS). The semantic spacetime envelope (SSTE) and semantic potential path space (SPPS) define through their boundaries, how far a conceptual space (including its concept regions) can deviate from a vertical path and still represent the same or similar semantics. Crossing the boundaries corresponds to conceptual change. It is important to note that these boundaries are often fuzzy and indeterminate [34]. The extent of the SSTE is a function of time depending on the changes in the semantic space as defined above. The partial mapping from source to target domain includes two constraints. Coupling constraints are being mapped to semantic coupling constraints, which specify the interaction of conceptual spaces (and concepts) based on the coincidence of their
Fig. 3. Representation of moving conceptual spaces in a semantic space over time (vertical axis: time; horizontal axis: semantic space; shown: CS1 and CS2 moving along SST-path1 and SST-path2, an SSTS, an SSTE, the SPPS, a semantic coupling, and the distance dsem). For clarity reasons, the concept regions are only visualized once (during semantic coupling).
semantic space-time paths (i.e., semantic space-time bundling). Such coincidence signifies high (significant overlap of concept regions, see Figure 3) or even total conceptual similarity, e.g., when two different concepts merge into one over time, such as the abstract political concepts of Eastern and Western Germany. Authority constraints are being mapped to contextual constraints. Similar to fiat restrictions on activities in space and time, there exist legal definitions, such as traffic codes or data transfer standards, which create fiat conceptual boundaries. For example, the definition and meaning of terms, such as parcel or forest, depend on the legal system of the responsible administration—see also the discussion of institutional reality in [35]. The same symbol can therefore relate to different concepts represented by different dimensions or different regions in a conceptual space. 3.3 Within- and between-Conceptual-Space Changes Our proposed mental model allows for representing conceptual change over time from two perspectives, namely (a) change of the geometrical structure of concepts as ndimensional regions within one conceptual space and (b) changes between different conceptual spaces. Case (a) presumes that no change of quality dimensions has occurred in the conceptual space, therefore allowing only for movement of the concept region within this particular space—caused by a change in dimension values. One can
then measure the semantic distance between a concept c at time ti and the same concept at time ti+1. Three strategies for calculating semantic similarity between conceptual regions, including overlapping concepts, have been demonstrated in [20] and can be applied here. These methods differ in that for each vector of c(ti) one or several corresponding vectors of c(ti+1) are identified. Case (b) applies to mappings between conceptual spaces, leading to a change in quality dimensions. These mappings can either be projections, which reduce the complexity of the space by reducing its number of dimensions, or transformations, which involve a major change of quality dimensions, such as the addition of new dimensions. As shown in [36], projections (Equation 1) and transformations (Equation 2) can be expressed as partial mappings, with C, D denoting conceptual spaces and m, n the number of quality dimensions. For projections, the semantics of the mapped quality dimensions must not change or can be mapped by rules.

Rproj: Cm → Dn, where n < m and Cm ∩ Dn = Dn    (1)
Rtrafo: Cm → Dn, where (n ≤ m and Cm ∩ Dn ≠ Dn) or (n > m)    (2)
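For illustration only, Equations 1 and 2 can be read as a simple test on the quality-dimension sets of source and target space. The following sketch is an assumption of this reading (names and the representation of dimensions as strings are not part of the paper's formalization):

import Data.List (intersect, sort)

data MappingKind = Projection | Transformation deriving (Show, Eq)

-- Classify a mapping between two conceptual spaces, each given as a list of
-- quality-dimension names, following the reading of Equations (1) and (2).
classifyMapping :: [String] -> [String] -> MappingKind
classifyMapping cm dn
  | n < m && sort (cm `intersect` dn) == sort dn = Projection      -- D^n is contained in C^m
  | otherwise                                    = Transformation  -- dimensions added or changed
  where
    m = length cm
    n = length dn

-- e.g. classifyMapping ["area","shape","cultural","color"] ["area","shape","color"]  ==> Projection
--      classifyMapping ["area","shape","visibility"] ["area","shape","color"]        ==> Transformation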
4 Formal Model of Conceptual Change in Time

This section develops a computational mental model for representing conceptual change in time according to the presented spatio-temporal metaphor. We take an algebraic approach to formally specify the mappings of entities and operations at the level of conceptual spaces (which represent the conceptual regions). These specifications will be used in Section 5 to demonstrate the applicability of the formal model.

4.1 Algebraic Specifications

Our method of formalization uses algebraic specifications, which present a natural way of representing entities and processes. Algebraic specifications have proven useful for specifying data abstractions in spatial and temporal domains [25, 37-39]. Data abstractions are based on abstract data types, which are representation-independent formal definitions of all operations of a data type [40]. Entities are described in terms of their operations, depicting how they behave. Algebraic specifications written in an executable programming language can be tested as a prototype [41]. The tool chosen here is Hugs, an interpreter for the purely functional language Haskell [42], which includes types, type classes, and algebraic axioms. Haskell provides higher-order capabilities and one of its major strengths is strong typing: every object has a particular type and the compiler checks that operations can only be applied to certain types.

4.2 Formal Model

A conceptual space is formally specified3 as a data type, together with its attributes. Every conceptual space has an identifier Id, a Position in the semantic space at a
3 The complete Hugs code including the test data for this paper is available at http://www.geog.ucsb.edu/~raubal/Downloads/CS.hs. Hugs interpreters can be downloaded freely from http://www.haskell.org.
given Time, and consists of a number of quality dimensions (list [Dimension]). Every Dimension has a Name and a range of values (ValueRange) with a given Unit, e.g., dimension weight with values between 0 and 250 kg. Here, we define Position as a coordinate pair in a 2-dimensional semantic space and Time through discrete steps.

data ConceptualSpace = NewConceptualSpace Id Position Time [Dimension]
data Dimension = Dimension Name ValueRange Unit

We can now define a type class with common functions for conceptual spaces. These functions can be simple operations to observe properties, such as the current position of a conceptual space (getConceptualSpacePosition), but also more complex operations that specify the elements, processes, and constraints described in Section 3. The abstract type signatures are implementation-independent and can therefore be implemented for different types of conceptual spaces. Here, we make the data type ConceptualSpace as specified above an instance of the class.

class ConceptualSpaces cs where
  getConceptualSpacePosition :: cs -> Position

instance ConceptualSpaces ConceptualSpace where
  getConceptualSpacePosition (NewConceptualSpace id pos t ds) = pos

Conceptual change happens through movement of conceptual spaces along space-time paths in the semantic space (and through movement of conceptual regions within conceptual spaces). Conceptual spaces move to new positions only if there is a change in dimensions (dsNew), otherwise they are stationary. The semanticDistance function calculates either how far one conceptual space has moved in the semantic space during a particular time interval, or the distance between two different conceptual spaces (such as dsem in Figure 3). It is currently implemented for 2-D Euclidean distance (dist) but different instances of the Minkowski metric can be used instead, depending on the types of dimensions and spaces [1]. A SemanticSpaceTimePath is constructed by finding (filtering) all conceptual space instances for a particular Id and ordering them in a temporal sequence.

class ConceptualSpaces cs where
  moveConceptualSpace :: cs -> [Dimension] -> ConceptualSpace
  semanticDistance :: cs -> cs -> Distance
  constructSemanticSpaceTimePath :: Id -> [cs] -> SemanticSpaceTimePath

instance ConceptualSpaces ConceptualSpace where
  moveConceptualSpace (NewConceptualSpace id pos t ds) dsNew =
    if ds == dsNew then (NewConceptualSpace id pos newT ds)
                   else (NewConceptualSpace id newPos newT dsNew)
  semanticDistance (NewConceptualSpace id pos t ds) (NewConceptualSpace id2 pos2 t2 ds2) =
    dist pos pos2
  constructSemanticSpaceTimePath i cs = NewSemanticSpaceTimePath id css
    where id = i
          css = filter ((i ==) . getConceptualSpaceId) cs

Semantic space-time stations are specified as special types of SemanticSpaceTimePaths—similar to the representation of space-time stations in [43]—i.e., consisting of conceptual space instances with equal positions (but potential temporal gaps). The derivation of a SemanticSpaceTimeStation is based on the sorting function sortConceptualSpaces, which orders conceptual spaces according to their positions.

class SemanticSpaceTimePaths sstPath where
  constructSemanticSpaceTimeStation :: sstPath -> [ConceptualSpace]

instance SemanticSpaceTimePaths SemanticSpaceTimePath where
  constructSemanticSpaceTimeStation (NewSemanticSpaceTimePath id cs) = sortConceptualSpaces cs

The data type SemanticSpaceTimeEnvelope is defined by a Center (of type Position) and a Boundary for each time step. The projection of the SSTE to semantic space results in a region (equivalent to the PPA from time geography), whose boundary delimits a semantic similarity area. Note that contrary to semantic space-time stations, semantic potential path spaces—which result from integration over a sequence of SSTE slices—cannot have gaps. One can now determine algorithmically whether a conceptual space falls inside the boundary or not (which identifies conceptual change).

data SemanticSpaceTimeEnvelope = NewSemanticSpaceTimeEnvelope Center Time Boundary

Semantic coupling constraints are represented through the semanticMeet function. It determines whether two instances of conceptual spaces interact at a given time step. This definition leaves room for integrating semantic uncertainty by specifying a threshold for the semantic distance (epsilon), within which the conceptual spaces are still considered to be interacting; see also [44]. Contextual constraints are fiat boundaries in the semantic space and can therefore be represented by the Boundary type.

class ConceptualSpaces cs where
  semanticMeet :: cs -> cs -> Bool

instance ConceptualSpaces ConceptualSpace where
  semanticMeet cs1 cs2 =
    (getConceptualSpaceTime cs1 == getConceptualSpaceTime cs2)
      && (semanticDistance cs1 cs2 <= epsilon)
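The containment test mentioned above is not spelled out in this excerpt. A minimal sketch of one possible reading, assuming that the Boundary of an envelope slice is a radius around its Center in the 2-D semantic space, and reusing the paper's observer functions and its 2-D Euclidean dist (the names EnvelopeSlice and insideEnvelope are illustrative, not from the downloadable code):

type EnvelopeSlice = (Position, Time, Double)   -- (center, time step, radius)

-- A conceptual space instance falls inside the envelope at a given time step
-- if its position lies within the radius around the center.
insideEnvelope :: EnvelopeSlice -> ConceptualSpace -> Bool
insideEnvelope (center, tEnv, radius) cs =
  getConceptualSpaceTime cs == tEnv
    && dist center (getConceptualSpacePosition cs) <= radius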
5 Application: Geospatial Concept Change in Time

The formal model in the previous section provides executable specifications of the represented elements and processes for conceptual change based on the geometrical framework of conceptual spaces. In order to demonstrate the model with respect to analyzing the change of conceptual structures in time, we apply it to the use case of representing the concept of ‘landmark’ within the particular scenario of wayfinding in a city [45], where façades of buildings are often used as landmarks. Geospatial concepts, such as lake, mountain, geologic region, street, or landmark, differ in many qualitative ways from other concepts, due to their spatio-temporal nature [46, 47]. Their structure in terms of represented meaning changes for individual persons over time and may also differ between cultures, e.g., classifications of landscapes [48]. In the following, the change of a person’s conceptual structure of ‘landmark’ (in terms of façade as described above) over time is represented with respect to the change of quality dimensions in a semantic space. Based on previous work, we specify the dimensions façade area fa (square meters), shape deviation sd (deviation from minimum bounding rectangle in percent), color co (three RGB values), cultural importance ci (ordinal scale of 1 to 5), and visibility vi (square meters) [19, 45].

fa = (Dimension "area" (100,1200) "sqm")
sd = (Dimension "shape" (0,100) "%")
co = (Dimension "color" (0,255) "RGB")
ci = (Dimension "cultural" (1,5) "importance")
vi = (Dimension "visibility" (0,10000) "sqm")
Fig. 4. Change of a person’s conceptual structure of ‘landmark’ over time (vertical axis: time; horizontal axis: semantic space; the SST-path passes t1 {fa, sd, vi}, t2 {fa, sd, ci, co}, t3 {fa, sd, co}, and t4 {fa, sd, vi}, with the semantic distances dcs1-cs2, dcs1-cs3, and dcs2-cs3 indicated)
Four time steps are considered, which results in four instances of the conceptual space4. In this scenario, the person’s ‘landmark’ concept comprises three quality dimensions at time t1 (cs1). Through experience and over the years, the person has acquired a sense of the cultural importance of buildings (cs2)—a building may be famous for its architectural style, therefore being a landmark—adding this new dimension and also the significance of color. Next, because the person’s interests vary, cultural importance vanishes again (cs3). Over time, due to physiological changes resulting in color blindness, the person’s concept structure changes back to the original one, eliminating color and again including visibility. Figure 4 visualizes these conceptual changes over time.

cs1 = NewConceptualSpace 1 (3,1) 1 [fa,sd,vi]
cs2 = NewConceptualSpace 1 (6,3) 2 [fa,sd,ci,co]
cs3 = NewConceptualSpace 1 (4,2) 3 [fa,sd,co]
cs4 = NewConceptualSpace 1 (3,1) 4 [fa,sd,vi]

The formal specifications can now be used to query the temporal conceptual representation in order to find conceptual changes and when they happened, and what semantics is represented by a particular conceptual structure at a specific time. We can infer that the semantic change from cs1 at time 1 to cs2 at time 2 (transformation with two new dimensions) is larger than the change from cs1 at time 1 to cs3 at time 3 (transformation with one new dimension) by calculating the respective semantic distances (dcs1-cs2 and dcs1-cs3 in Figure 4). The change resulting from the move between time 2 and 3 (dcs2-cs3) is due to a projection, involving a reduction to three dimensions. Similarity is thereby a decaying function of semantic distance, which depends on the semantic space. The interpretation of semantic distance is domain-dependent and may be determined through human participants tests [49].

semanticDistance cs1 cs2
3.605551
semanticDistance cs1 cs3
1.414214
semanticDistance cs2 cs3
2.236068

We can further construct the semantic space-time path for the conceptual space under investigation from the set of all available conceptual space instances (allCs). The result (only the very beginning is shown below, for space reasons) is a list of the four conceptual space instances with Id=1 in a temporal sequence. This SST-path is visualized in Figure 4.

constructSemanticSpaceTimePath 1 allCs
[NewSemanticSpaceTimePath 1 [NewConceptualSpace 1 …]

Applying the constructSemanticSpaceTimeStation function to the SST-path derives all conceptual space instances with equal positions but potentially temporal gaps, such as cs1 and cs4.

4
The quantitative values for the positions of conceptual spaces in the semantic space are for demonstration purposes. Their determination, such as through similarity ratings from human participants tests, is left for future work.
constructSemanticSpaceTimeStation (constructSemanticSpaceTimePath 1 allCs)
[NewConceptualSpace 1 (3.0,1.0) 1 [Dimension "area" (100.0,1200.0) "sqm", Dimension "shape" (0.0,100.0) "%", Dimension "color" (0.0,255.0) "RGB"],
 NewConceptualSpace 1 (3.0,1.0) 4 [Dimension "area" (100.0,1200.0) "sqm", Dimension "shape" (0.0,100.0) "%", Dimension "color" (0.0,255.0) "RGB"]]
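The sorting function sortConceptualSpaces underlying this derivation is not reproduced in the excerpt. A minimal sketch of one plausible reading, assuming a station is a group of instances sharing a recurring position (names and the simplified (position, time step) representation are illustrative, not from the paper's code):

import Data.List (groupBy, sortBy)
import Data.Ord (comparing)

type Pos = (Double, Double)

-- Group conceptual space instances of an SST-path by equal positions; groups
-- whose position recurs form semantic space-time stations.
stationsByPosition :: [(Pos, Int)] -> [[(Pos, Int)]]   -- (position, time step)
stationsByPosition css =
  filter ((> 1) . length)
    (groupBy (\a b -> fst a == fst b)
             (sortBy (comparing fst) css))

-- e.g. stationsByPosition [((3,1),1), ((6,3),2), ((4,2),3), ((3,1),4)]
--        ==> [[((3,1),1), ((3,1),4)]]   (cs1 and cs4 form one station)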
6 Conclusions and Future Work

This paper presented a novel computational model to represent conceptual change over time. The model is based on a spatio-temporal metaphor, utilizing elements from time geography and conceptual spaces. Conceptual change is represented through movement of conceptual spaces along space-time paths in a semantic space. We developed executable algebraic specifications for the mapped entities, relations, and operations, which allowed demonstrating the model through an application to a geospatial conceptual structure. This application showed the potential of the formal representation for analyzing the dynamic nature of concepts and their changes in time. The presented work suggests several directions for future research:

• The formal model needs to be extended to represent conceptual regions within the conceptual spaces. This will allow the application of semantic similarity measures, such as the ones proposed in [20], to determine semantic distances between individual concepts anchored within their corresponding conceptual spaces.

• The quantification of conceptual change depends on the representation of the semantic space, which we have modeled as a two-dimensional attribute surface. More research in cognitive science and information science is required to establish cognitively plausible, semantic surface representations (similar to those developed in the area of information visualization) for different domains that can be used within our proposed model. This will also determine the distance and direction when moving a conceptual space due to a change in its quality dimensions.

• Conceptual regions often do not have crisp boundaries; therefore, their representation must take aspects of uncertainty into account. Uncertainty also propagates when applying operations such as intersection to concept regions. Future work must address these issues based on the time-geographic uncertainty problems identified in [43].

• The semantic space is a similarity space, i.e., distance represents similarity between concepts. This leads to the question of whether disparate concepts, such as roundness and speed, can be compared at all. A possible solution is to make concepts comparable only when they are within a certain threshold distance: if this is exceeded, then the similarity is zero. Another way is to specifically include infinite distance. It is essential to account for the given context in which concepts are compared. The context can be represented through different dimension weights.

• The formal specifications serve as the basis for implementing a concept query language, which can be tested in different application domains. This will help in understanding various concept dynamics, more specifically, the characterization and prediction of conceptual change through time.
• In this work we utilized Gärdenfors’ [1] notion of conceptual spaces as a geometric way of representing information at the conceptual level. Different views on the nature of conceptual representations in the human cognitive system exist, such as the ideas of mental images [50] or schematic perceptual images extracted from modes of experience [8]. Could such images be represented in or combined with conceptual spaces? Would such a combination be similar to a cognitive collage [51]? Human participants tests may help assess the validity of geometrical representations of concepts and point to potential limitations of conceptual spaces as a representational model.
Acknowledgments The comments from Carsten Keßler and three anonymous reviewers provided useful suggestions to improve the content of the paper.
Bibliography 1. Gärdenfors, P.: Conceptual Spaces - The Geometry of Thought. MIT Press, Cambridge (2000) 2. Hägerstrand, T.: What about people in regional science? Papers of the Regional Science Association 24, 7–21 (1970) 3. Brodaric, B., Gahegan, M.: Distinguishing Instances and Evidence of Geographical Concepts for Geospatial Database Design. In: Egenhofer, M., Mark, D. (eds.) Geographic Information Science - Second International Conference, GIScience 2002, Boulder, CO, USA, September 2002, pp. 22–37. Springer, Berlin (2002) 4. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. In: Scientific American, pp. 34–43 (2001) 5. Gruber, T.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition 5(2), 199–220 (1993) 6. Laurence, S., Margolis, E.: Concepts and Cognitive Science. In: Margolis, E., Laurence, S. (eds.) Concepts - Core Readings, pp. 3–81. MIT Press, Cambridge (1999) 7. Smith, E.: Concepts and induction. In: Posner, M. (ed.) Foundations of cognitive science, pp. 501–526. MIT Press, Cambridge (1989) 8. Barsalou, L., Yeh, W., Luka, B., Olseth, K., Mix, K., Wu, L.: Concepts and meaning. In: Beals, K., et al. (eds.) Parasession on conceptual representations, pp. 23–61. University of Chicago, Chicago Linguistics Society (1993) 9. Goldstone, R., Kersten, A.: Concepts and Categorization. In: Healy, A., Proctor, R. (eds.) Comprehensive handbook of psychology, pp. 599–621 (2003) 10. Piaget, J., Inhelder, B.: The Child’s Conception of Space. Norton, New York (1967) 11. Neisser, U.: Cognition and Reality - Principles and Implications of Cognitive Psychology. Freeman, New York (1976) 12. Lakoff, G.: Cognitive Semantics, in Meaning and Mental Representations. In: Eco, U., Santambrogio, M., Violi, P. (eds.), pp. 119–154. Indiana University Press, Bloomington (1988) 13. Green, R.: Internally-Structured Conceptual Models in Cognitive Semantics. In: Green, R., Bean, C., Myaeng, S. (eds.) The Semantics of Relationships - An Interdisciplinary Perspective, pp. 73–89. Kluwer, Dordrecht (2002)
14. Kuhn, W., Raubal, M., Gärdenfors, P.: Cognitive Semantics and Spatio-Temporal Ontologies. Spatial Cognition and Computation 7(1), 3–12 (2007) 15. Ogden, C., Richards, I.: The Meaning of Meaning: A Study of the Influence of Language Upon Thought and of the Science of Symbolism. Routledge & Kegan Paul, London (1923) 16. Barsalou, L.: Situated simulation in the human conceptual system. Language and Cognitive Processes 5(6), 513–562 (2003) 17. Sowa, J.: Categorization in Cognitive Computer Science. In: Cohen, H., Lefebvre, C. (eds.) Handbook of Categorization in Cognitive Science, pp. 141–163. Elsevier, Amsterdam (2006) 18. Gärdenfors, P.: Representing actions and functional properties in conceptual spaces. In: Ziemke, T., Zlatev, J., Frank, R. (eds.) Body, Language and Mind, pp. 167–195. Mouton de Gruyter, Berlin (2007) 19. Raubal, M.: Formalizing Conceptual Spaces, in Formal Ontology in Information Systems. In: Varzi, A., Vieu, L. (eds.) Proceedings of the Third International Conference (FOIS 2004), pp. 153–164. IOS Press, Amsterdam (2004) 20. Schwering, A., Raubal, M.: Measuring Semantic Similarity between Geospatial Conceptual Regions. In: Rodriguez, A., et al. (eds.) GeoSpatial Semantics - First International Conference, GeoS 2005, Mexico City, Mexico, November 2005, pp. 90–106. Springer, Berlin (2005) 21. Devore, J., Peck, R.: Statistics - The Exploration and Analysis of Data, 4th edn. Duxbury, Pacific Grove (2001) 22. Lenntorp, B.: Paths in Space-Time Environments: A Time-Geographic Study of the Movement Possibilities of Individuals. Lund Studies in Geography, Series B (44) (1976) 23. Miller, H.: Modeling accessibility using space-time prism concepts within geographical information systems. International Journal of Geographical Information Systems 5(3), 287– 301 (1991) 24. Palmer, S.: Fundamental aspects of cognitive representation. In: Rosch, E., Lloyd, B. (eds.) Cognition and categorization, pp. 259–303. Lawrence Erlbaum, Hillsdale (1978) 25. Frank, A.: Spatial Communication with Maps: Defining the Correctness of Maps Using a Multi-Agent Simulation. In: Freksa, C., et al. (eds.) Spatial Cognition II - Integrating Abstract Theories, Empirical Studies, Formal Methods, and Practical Applications, pp. 80–99. Springer, Berlin (2000) 26. Frank, A.: Pragmatic Information Content: How to Measure the Information in a Route Description. In: Duckham, M., Goodchild, M., Worboys, M. (eds.) Foundations of Geographic Information Science, pp. 47–68. Taylor & Francis, London (2003) 27. Lakoff, G., Johnson, M.: Metaphors We Live By. University of Chicago Press, Chicago (1980) 28. Kuipers, B.: The ’Map in the Head’ Metaphor. Environment and Behaviour 14(2), 202– 220 (1982) 29. Kuhn, W.: Metaphors Create Theories for Users. In: Frank, A.U., Campari, I. (eds.) Spatial Information Theory: Theoretical Basis for GIS, pp. 366–376. Springer, Berlin (1993) 30. Kuhn, W., Blumenthal, B.: Spatialization: Spatial Metaphors for User Interfaces. GeoinfoSeries, vol. 8. Department of Geoinformation, Technical University Vienna, Vienna (1996) 31. Lakoff, G.: Women, Fire, and Dangerous Things: What Categories Reveal About the Mind. The University of Chicago Press, Chicago (1987) 32. Skupin, A.: Where do you want to go today [in attribute space]? In: Miller, H. (ed.) Societies and Cities in the Age of Instant Access, pp. 133–149. Springer, Dordrecht (2007)
33. Skupin, A., Fabrikant, S.: Spatialization Methods: A Cartographic Research Agenda for Non-Geographic Information Visualization. Cartography and Geographic Information Science 30(2), 95–119 (2003) 34. Burrough, P., Frank, A., Masser, I., Salgé, F.: Geographic Objects with Indeterminate Boundaries. GISDATA Series. Taylor & Francis, London (1996) 35. Frank, A.: Ontology for spatio-temporal Databases. In: Koubarakis, M., et al. (eds.) Spatiotemporal Databases: The Chorochronos Approach, pp. 9–77. Springer, Berlin (2003) 36. Raubal, M.: Mappings For Cognitive Semantic Interoperability. In: Toppen, F., Painho, M. (eds.) AGILE 2005 - 8th Conference on Geographic Information Science, pp. 291–296. Instituto Geografico Portugues (IGP), Lisboa (2005) 37. Winter, S., Nittel, S.: Formal information modelling for standardisation in the spatial domain. International Journal of Geographical Information Science, 2003 17(8), 721–742 (2003) 38. Raubal, M., Kuhn, W.: Ontology-Based Task Simulation. Spatial Cognition and Computation 4(1), 15–37 (2004) 39. Krieg-Brückner, B., Shi, H.: Orientation Calculi and Route Graphs: Towards Semantic Representations for Route Descriptions. In: Raubal, M., et al. (eds.) Geographic Information Science, 4th International Conference GIScience 2006, Muenster, Germany, pp. 234– 250. Springer, Berlin (2006) 40. Guttag, J., Horowitz, E., Musser, D.: The Design of Data Type Specifications. In: Yeh, R. (ed.) Current Trends in Programming Methodology, pp. 60–79. Prentice-Hall, Englewood Cliffs (1978) 41. Frank, A., Kuhn, W.: Specifying Open GIS with Functional Languages. In: Egenhofer, M., Herring, J. (eds.) Advances in Spatial Databases (SSD 1995), pp. 184–195. Springer, Portland (1995) 42. Hudak, P.: The Haskell School of Expression: Learning Functional Programming through Multimedia. Cambridge University Press, New York (2000) 43. Miller, H.: A Measurement Theory for Time Geography. Geographical Analysis 37(1), 17–45 (2005) 44. Ahlqvist, O.: A Parameterized Representation of Uncertain Conceptual Spaces. Transactions in GIS 8(4), 493–514 (2004) 45. Nothegger, C., Winter, S., Raubal, M.: Selection of Salient Features for Route Directions. Spatial Cognition and Computation 4(2), 113–136 (2004) 46. Smith, B., Mark, D.: Geographical categories: an ontological investigation. International Journal of Geographical Information Science 15(7), 591–612 (2001) 47. Brodaric, B., Gahegan, M.: Experiments to Examine the Situated Nature of Geoscientific Concepts. Spatial Cognition and Computation 7(1), 61–95 (2007) 48. Mark, D., Turk, A., Stea, D.: Progress on Yindjibarndi Ethnophysiography. In: Winter, S., et al. (eds.) Spatial Information Theory, 8th International Conference COSIT 2007, Melbourne, Australia, pp. 1–19. Springer, Berlin (2007) 49. Hahn, U., Chater, N.: Understanding Similarity: A Joint Project for Psychology, CaseBased Reasoning, and Law. Artificial Intelligence Review 12, 393–427 (1998) 50. Kosslyn, S.: Image and brain - The resolution of the imagery debate. MIT Press, Cambridge (1994) 51. Tversky, B.: Cognitive Maps, Cognitive Collages, and Spatial Mental Model. In: Frank, A., Campari, I. (eds.) Spatial Information Theory: Theoretical Basis for GIS, pp. 14–24. Springer, Berlin (1993)
The Network of Reference Frames Theory: A Synthesis of Graphs and Cognitive Maps Tobias Meilinger Max-Planck-Institute for Biological Cybernetics Spemannstr. 44, 72076 Tübingen, Germany
[email protected]
Abstract. The network of reference frames theory explains the orientation behavior of human and non-human animals in directly experienced environmental spaces, such as buildings or towns. This includes self-localization, route and survey navigation. It is a synthesis of graph representations and cognitive maps, and solves the problems associated with explaining orientation behavior based either on graphs, maps or both of them in parallel. Additionally, the theory points out the unique role of vista spaces and asymmetries in spatial memory. New predictions are derived from the theory, one of which has been tested recently. Keywords: graph; cognitive map; spatial memory; reference frame; route knowledge; survey knowledge; self-localization; environmental space.
1 Introduction Orientation in space is fundamental for all humans and the majority of other animals. Accomplishing goals frequently requires moving through environmental spaces such as forests, houses, or cities [26]. How do navigators accomplish this? How do they represent the environment they traveled? Which processes operate on these representations in order to reach distant destinations or to self-localize when lost? Various theories have been proposed to explain these questions. Regarding the underlying representation these theories can be roughly classified into two groups which are called here graph representations and cognitive maps. In the following paper, I will explain graph representations and cognitive maps. I will also highlight how graph representations and cognitive maps fail to properly explain orientation behaviour. As a solution I will introduce the network of reference frames theory and discuss it with respect to other theories and further empirical results. 1.1 Graph Representations and Cognitive Maps In graphs the environment is represented as multiple interconnected units (e.g., [4], [19], [20], [45], [48]; see Fig. 1). A node within such a graph, for example, represents a location in space or a specific sensory input encountered, such as a view. An edge within a graph typically represents the action necessary to reach the adjacent node. Graphs are particularly suitable for explaining navigating and communicating routes (i.e., a sequence of actions at locations or views which allows navigators at a location .
Fig. 1. Visualizations of a graph representation where an environment is represented as multiple interconnected units (left) and a cognitive map where an environment is represented within one reference frame (right)
A to reach B without necessarily knowing where exactly B is relative to A). This could be, for example: turn right at the church, then turn left at the next intersection, etc. The knowledge expressed in these sequences is called route knowledge. A cognitive map, on the other hand, assumes that the environment is represented within one single metric frame of reference, (i.e., all locations within the environment can be expressed by coordinates of one single coordinate system; see Fig. 1; [2], [7], [30]; cf., [28], [33]).1 A cognitive map has to be constructed from several different pieces of information encountered during navigation. The case of learning a cognitive map from a physical map which provides the information already within one frame of reference is not considered here. A cognitive map is especially suited to provide direct spatial relations between two locations, without necessarily knowing how to get there, for example, the station is 300 meters to the east of my current location. This type of knowledge is known as survey knowledge. Survey knowledge is necessary for tasks such as shortcutting or pointing to distant locations.

1.2 Problems with Graph Representations and Cognitive Maps

Graph representations and cognitive maps are especially suited to represent route and survey knowledge, respectively. The other side of the coin is, however, that they also have their specific limitations. These will now be described in detail. Graph representations (1) do not represent survey knowledge, (2) often ignore metric relations given in perception, and (3) often assume actions are sufficient to explain route knowledge. The main limitation of graph representations is that there is no survey knowledge expressed at all. Using a graph representation, navigators know how to reach a location and have the ability to choose between different routes. Graph representations, however, do not give navigators any cue as to where their goal is

1
Often the term cognitive map is used for the sum of all spatial representations. Contrary to that, cognitive map is understood here as a specific spatial representation, namely storing spatial information within one reference frame. A reference frame here is not understood as the general notion of representing something relative to ones’ own body (= egocentric) vs. relative to other objects (= allocentric), but a reference frame is considered as one single coordinate system (cf. [15]). Nevertheless, a reference frame can be egocentric or allocentric.
located in terms of direction or distance. This problem originates from the fact that graph representations normally do not represent metric knowledge at all. This is despite the fact that navigators, human and non-human alike, are provided with at least rough distance estimates, especially by their visual system and by proprioceptive cues during locomotion. Some graph models ignore this already available information and instead assume that a navigator stores raw or only barely processed sensory data ([4], [20]). As a final point, actions themselves ([19], [20], [45]) cannot be sufficient to explain route knowledge. Rats can swim a route learned by walking [18]. Cats can walk a route learned while being passively carried along a route [10]. We can cycle a path learned by walking. Even for route knowledge, the edge of a graph representing how to get from one node to the next has to be more abstract than a specific action. However, not only graph representations are limited. Cognitive maps (1) have problems in explaining self-localization and route knowledge. (2) There is a surprising lack of evidence that proves non-human animals have cognitive maps at all. (3) Human survey navigation is not always consistent with a cognitive map, and (4) cognitive maps are necessarily limited in size. Self-localizing based exclusively on a cognitive map can only take the geometric relations into account that are displayed there, (e.g., the form of a place). The visual appearance of landmarks is almost impossible to represent within a cognitive map itself. This information has to be represented separately and somehow linked to a location within the cognitive map. This is probably one reason why simultaneously constructing a map while staying localized within this map (SLAM) is considered a complicated problem in robotics [42]. Similarly, planning a route based on a cognitive map alone is also not trivial, as possible routes have to be identified first [16]. Another issue is that cognitive maps seem to be limited to human navigation. If animals had cognitive maps, they would easily be able to take novel shortcuts, (i.e., directly approach a goal via a novel path without using updating or landmarks visible from both locations). However, the few observations arguing for novel shortcuts in insects and mammals have been criticized because they do not exclude alternative explanations and could not be replicated in better controlled experiments [1]. For example, in the famous experiment by Tolman, Ritchie and Kalish [43], the rats’ shortcutting behavior can be explained by assuming they directly approached the only available light source within the room. Although the discussion of whether non-human animals are able to make novel shortcuts has yet to be settled, such shortcutting behavior should be fairly common if orientation were based on a cognitive map. This is clearly not the case. Similarly, a human shortcutting experiment within an “impossible” virtual environment casts doubt upon a cognitive map as the basis for such survey navigation [34]. In this experiment, unnoticeable portals within the virtual environment teleported participants to another location within the environment. They could, therefore, not construct a consistent two-dimensional map of this environment. Still, participants were able to shortcut quite accurately. The last shortcoming of cognitive maps is that we have to use many of them anyway.
We surely do not have one and the same cognitive map (reference frame) to represent the house we grew up in, New York and the Eiffel Tower. At one point, we have to use multiple cognitive maps and (probably) represent relations between them. Graph representations and cognitive maps have specific advantages and limitations. Graphs are good for representing route knowledge. However, they do not explain survey
knowledge. Contrary to that, cognitive maps are straight forward representations of survey knowledge. They are, however, not well suited for self-localization and route knowledge and fail to explain some human and non-human orientation behavior. As a solution to these limitations, often both representations are assumed in parallel to best account for the different behaviors observed ([2], [4], [30], [45]; see also [12], [28], [33]). However, assuming two representations in parallel also poses difficulties. First, the last three arguments against cognitive maps also argue against theories which assume both graphs and cognitive maps. In addition, according to “Occam’s razor” (law of parsimony), one simple representation is preferable to multiple representations when explaining behavior. Multiple representations of one environment also raise the question of how these representations are connected. A house for example, can be represented within a graph and a cognitive map. The house-representation in the map should refer to the corresponding house representation within the graph and not to a representation of another house. First, this correspondence has to be specified somehow, for example, via an association which results in even more information to be represented. Second, it is a non-trivial problem to keep the correspondences free of error. A theory has to state how this is accomplished. In conclusion, neither a graph representation, nor a cognitive map alone is sufficient to convincingly explain orientation behavior in humans and non-human animals. Both representations together also pose tremendous difficulties. As a solution to these problems, I would like to propose the network of reference frames theory which combines graphs and cognitive maps within one representation. This theory described in Chapter 2 avoids the problems which were already mentioned. Together with processes operating on this representation, it explains self-localization, route navigation and survey navigation. Furthermore, this theory can also explain other effects which have not yet been pointed out. This will be described in Chapter 3 where it will also be compared to other theories.
2 The Network of Reference Frames Theory In this chapter, I will describe the network of reference frames theory in terms of the representations and the processes acting on those, and how these are used for different tasks, such as navigation, survey knowledge, etc. 2.1 Representation The network of reference frames theory describes the memory representation acquired by human and non-human animals when locomoting through environmental spaces such as the country side, buildings, or cities. It also describes how this representation is used for self-localization, route and survey navigation. The theory is a fusion between graph representations and cognitive maps (cf., Fig.2). It assumes that the environment is encoded in multiple interconnected reference frames. Each reference frame can be described as a coordinate system with a specific orientation. These reference frames form a network or graph. A node within this network is a reference frame referring to a single vista space. Vista spaces surround the navigator and can be perceived from .
Fig. 2. A visualization of the network of reference frame theory. Reference frames correspond to single vista spaces. They are connected via perspective shifts which specify the translation and rotation necessary to get from one reference frame to the next one.
one point of view, for example, a room, a street or even a valley [26].2 This means that the basic unit in the network is always the reference frame of a vista space. Within this vista space reference frame, the location of objects and the surrounding geometry are specified. The edges in the network define the so called perspective shift necessary to move from one reference frame to the next. Such a perspective shift consists of both a translation and a rotation component, for example, moving forward 150 meters and then turning right 90°. Perspective shifts all point to another reference frame,3 they may differ in precision and the association strength with which they connect the two reference frames. The more familiar a navigator is with an environment, the more precise the perspective shifts will become and the more strongly the perspective shift will connect two reference frames. The network of vista space reference frames connected via perspective shifts is stored in long-term memory. Several processes shape or operate on this memory. These processes are encoding, reorientation by recognition, route navigation, and survey navigation. In the following they will be described in detail (for a summary see Table 1). 2.2 Encoding First Time Encounter. Encoding describes the process of constructing a representation of an environmental space through initial and continued contact. It is assumed 2
Vista spaces extend to the back of a navigator (although nothing might be represented there). While other senses such as audition or information from self motion may be used to construct a representation of a vista space, the main source to do so will be vision. 3 Humans are able to imagine how a perceived or a remembered vista space looks like from a different perspective. Such an imaginary shift in perspective within a vista space is not what is called perspective shift in the network of reference frames theory. Here a perspective shift, first, is stored in memory and is not imagined online, and second, a perspective shift always connects two vista spaces and does not occur within one vista space.
that encoding happens automatically. When navigating through an environmental space for the first time, we perceive vista spaces within the environmental space. This perceived vista space corresponds to a reference frame. The orientation of that reference frame is either determined by the view from which the vista space was experienced in the first place (cf., [20], [46]) or it is determined by the salient geometry of that vista space ([28], [33]). In daily life, these two directions usually coincide. For example, when entering a street or a house, our first view of the street or house is usually aligned with the geometry of the surrounding walls. Accessing such a reference frame will be easier and lead to an improved performance when one is aligned with the orientation of this reference frame, (e.g., looking down the street), than when not aligned, (e.g., facing a house in the street). Within this reference frame, the geometry of the enclosure is encoded (e.g., walls, hedges, houses or large objects). In addition to the geometry, locations of objects, such as landmarks, can be located within such a reference frame of a vista space. After encoding an individual reference frame, a navigator moves on and encodes other reference frames corresponding to other vista spaces. These vista spaces do not necessarily have to be adjacent. A perspective shift will connect the two vista space reference frames, (i.e., the translations and rotations necessary to get from the first reference frame to the second). This perspective shift can be derived (1) from the visual scene itself, (2) from updating during navigating between the two vista spaces, and (3) from global landmarks visible from both vista spaces. Deriving the perspective shift from the visual scene can be shown in an example such as standing in the corridor of a house and watching the kitchen door.

Table 1. Summary of the representation and processes assumed in the network of reference frames theory

Representation
Network (graph): consisting of nodes connected by edges (see Fig. 2)
Node: a reference frame with an orientation specifying locations and orientations within a vista space; within this reference frame, objects and the geometric layout are encoded
Edge: perspective shift, i.e., translation and rotation necessary to move to the next reference frame; perspective shifts point to the next reference frame and differ in precision and association strength

Processes
Encoding: first time experience or the geometry of a vista space define the orientation of a new reference frame; the visual scene itself, updating, or global landmarks can provide the perspective shift to the next vista space reference frame; familiarity increases the accuracy of the perspective shifts and the association strength of these connections
Self-localization by recognition: recognizing a vista space by the geometry or landmarks it contains provides location and orientation within this vista space and the current node/reference frame within the network
Route navigation by activation spread: an activation spread mechanism provides a route from the current location to the goal; during wayfinding, reference frames on the route are preactivated and, therefore, recognized more easily; recently visited reference frames are deactivated
Survey navigation by imagination: imagining connected vista spaces not visible step-by-step within the current reference frame; allows retrieving direction and straight line distance to distant locations; this can be used for shortcutting or pointing
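The theory itself is stated verbally. Purely as an illustration, the representation summarized in Table 1 can be read as a labelled graph; the following sketch, in the functional notation used by the preceding paper in this volume, makes that reading concrete (all names, fields, and numeric types are assumptions of this illustration, not part of the theory):

data ReferenceFrame = ReferenceFrame
  { frameId     :: Int
  , orientation :: Double                         -- orientation of the vista space reference frame
  , landmarks   :: [(String, (Double, Double))]   -- objects located within the frame
  }

data PerspectiveShift = PerspectiveShift
  { fromFrame   :: Int
  , toFrame     :: Int
  , translation :: (Double, Double)               -- how far to move within the source frame
  , rotation    :: Double                         -- how much to turn to align with the target frame
  , precision   :: Double                         -- grows with familiarity (Section 2.2)
  , association :: Double                         -- strength used for route selection (Section 2.4)
  }

type ReferenceFrameNetwork = ([ReferenceFrame], [PerspectiveShift])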
The kitchen door provides us with the information of where (translational component) and in which orientation (rotational component) the kitchen is located with respect to the reference frame of the corridor. Extracting the perspective shift from the visual scene itself, however, only works for adjacent vista spaces with a visible connection. For non-adjacent vista spaces, updating can provide the perspective shift. In doing so, one’s location and orientation within the current reference frame is updated while moving away from its origin, (i.e., navigators track their location and orientation relative to the latest encoded vista space). When encoding a new reference frame, the updated distance and orientation within the former reference frame provides the necessary perspective shift to get from the first reference frame to the next. In that sense, updating can provide the “glue” connecting locations in an environmental space (cf., [17]). Updating can also work as a lifeline saving navigators from getting lost. As long as navigators update the last reference frame visited, they are able to return to the origin of the last encoded reference frame, (i.e., they are oriented). A third possibility to get a perspective shift when already located in the second reference frame is by self-localizing with respect to a global landmark also visible from the first vista space reference frame, for example, a tower or a mountain top. Self-localizing provides a navigator with the position and orientation with respect to the reference frame in which the global landmark was first experienced. This is the perspective shift necessary to get from the first reference frame to the second one.

Repeated Visits. Re-visiting an environmental space can add new perspective shifts to the network and will increase the precision and association strength of existing perspective shifts (for the latter see 2.4). Walking a new route to a familiar goal will form a new chain of reference frames and perspective shifts connecting the start and goal. That way, formerly unconnected areas, such as city districts, can be connected. When walking a known route in reverse direction, the theory assumes that new perspective shifts are encoded in a backward direction. Then two reference frames A and B are connected with two perspective shifts, one pointing from A to B and the other one pointing from B to A. In principle, inverting one perspective shift would be sufficient to get the opposite perspective shift. However, such an inversion process is assumed to be error-prone and costly; therefore, it is usually not applied. When navigating an existing perspective shift along its orientation repeatedly, no new perspective shift is encoded, but the existing perspective shift becomes more precise. This increase in precision corresponds to a shift from route knowledge to more precise survey knowledge. The precision of survey knowledge is directly dependent upon the precision of the perspective shift (for a similar model for updating see [6]). For many people, perspective shifts will be imprecise after the first visit, and therefore, highly insufficient, (e.g., for pointing to distant destinations). However, they still accurately represent route knowledge, (i.e., indicate which reference frame is connected with which other reference frame). When the perspective shifts become more precise after repeated visits, survey knowledge will also become more precise (cf., [25]; see 2.5).
This corresponds with the original claim that route knowledge usually develops earlier than survey knowledge (e.g., [36]). However, survey knowledge does not have to develop at all (e.g., [24]) or can in principle also be observed after just a few learning trials (e.g., [27]). Correspondingly, the perspective shifts may be precise enough for pointing or other survey knowledge tasks after little experience
or they may remain imprecise even after an extended experience. Here, large differences between individuals due to the sense of direction can be expected (cf., [9], [35]). Updating global orientation while navigating an environmental space will result in more precise perspective shifts, and therefore, improve survey knowledge. It follows that people with a good sense of direction will also acquire precise survey knowledge quicker. Similarly, environments which ease such updating will lead to more precise perspective shifts and improve survey knowledge accordingly. This facilitation can be gained, for example, by uniform slant, distant landmarks, or a grid city, which all have been shown to enhance orientation performance (e.g., [25], [32]). 2.3 Self-localization by Recognition When someone gets lost within a familiar environmental space, the principal mode of reorientation will be by recognizing a single vista space within this environment (for self-localizing by the structure of environmental spaces see [21], [38]). A vista space can be recognized by its geometry or by salient landmarks located within (cf. [3]). First, recognizing a vista space provides navigators, with their location and their orientation within this vista space. Second, recognizing a vista space provides navigators with their location within the network, (i.e., in which node or vista space reference frame they are located). Their position in terms of direction and distance with respect to currently hidden locations in the environmental space however, has to be inferred from memory. This will be explained in the section on survey navigation by imagination further below. 2.4 Route Navigation by Activation Spread Route navigation means selecting and traveling a route from the current location to a goal. The network of reference frames theory assumes an activation spread mechanism to explain route selection which was proposed by Chown et al. [4] as well as Trullier et al. [45]. Within the network, activation from the current reference frame (current node) spreads along the perspective shifts (edges) connecting the various reference frames (nodes). If the activation reaches the goal node, the route transferring the activation will be selected, (i.e., a chain of reference frames connected with perspective shifts). Here, the association strength of perspective shifts is important. The association strength is higher for the most navigated perspective shifts. Activation will be spread faster along those edges that are higher in association strength. If several possible routes are encoded within the network, the route that spreads the activation fastest will be selected for navigation. This route must not necessarily be the shortest route or the route with the least number of nodes. As the activation propagates easier via highly associated edges, such familiar routes will be selected with higher probability. During navigation, the perspective shift provides navigators with information about where to move next, (i.e., perform the perspective shift). If the perspective shift is rather imprecise, navigators will only have an indicated direction in which to move. Moving in this direction, they will eventually be able to recognize another vista space reference frame. By updating the last reference frame visited, it will prevent navigators from getting lost. Pre-activating reference frames to come and de-activating
already visited reference frames will facilitate recognition. When successfully navigating a known route, its perspective shifts will become more accurate and their association strengths will increase, making it more probable that the route will be selected again. The described process is probably sufficient to explain most non-human route navigation. It is also plausible that such a process is inherited in humans and applied for example, when navigating familiar environments without paying much attention. However, humans can certainly override this process and select routes by other means. 2.5 Survey Navigation by Imagination Survey knowledge tasks such a pointing or shortcutting require that relevant locations are represented within one frame of reference, (e.g., the current location and the goal destination). The network of reference frames theory assumes that this integration within one frame of reference occurs online within working memory. This is only done when necessary and only for the respective area. For example, when pointing to a specific destination, only the area from the current location to the destination is represented. In this framework, the integration within one frame of reference happens during the retrieval of information and not during encoding or elaboration, as with a cognitive map. The common reference frame is available only temporarily in working memory and is not constantly represented in long term memory. The integration itself is done by imagining distant locations as if the visibility barriers of the current vista space were transparent. The current vista space can be the one physically surrounding the navigator or another vista space that is imagined. From the current vista space’s reference frame, a perspective shift provides the direction and orientation of the connected reference frame. With this information, the navigator imagines the next vista space within the current frame of reference, (i.e., this location is imagined in terms of direction and distance from the current vista space). This way, the second vista space is included in the current reference frame. Now, a third vista space can be included using the perspective shift connecting the second and the third vista space reference frames. That way, every location known in the surrounding environmental space can be imagined. Now, the navigator can point to this distant location, determine the straight line distance, and try to find a shortcut.
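To make the two processes above concrete, the following Python sketch (a minimal illustration, not an implementation from this paper) encodes a small network of vista space reference frames whose edges are perspective shifts: a translation, a rotation, and an association strength. Route navigation approximates activation spread as a cheapest-path search in which activation travels faster along strongly associated shifts, and survey navigation by imagination chains the perspective shifts along the selected route so that a distant location can be expressed, and pointed to, within the current reference frame. All names, the example vista spaces, and the 1/strength cost model are illustrative assumptions.

import heapq
import math

# A perspective shift places the next reference frame within the current one:
# a translation (dx, dy in metres), a rotation (radians), and an association
# strength that grows with how often the shift has been navigated.
class Shift:
    def __init__(self, dx, dy, rot, strength=1.0):
        self.dx, self.dy, self.rot, self.strength = dx, dy, rot, strength

# network of vista space reference frames: node -> {neighbour: Shift}
network = {
    "plaza": {"alley": Shift(20.0, 0.0, math.radians(90), strength=3.0)},
    "alley": {"yard": Shift(15.0, 5.0, math.radians(-45), strength=1.0)},
    "yard": {},
}

def select_route(net, start, goal):
    """Route navigation: activation spread approximated as a cheapest-path
    search; activation travels faster along strongly associated shifts."""
    frontier = [(0.0, start, [start])]
    visited = set()
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt, shift in net[node].items():
            heapq.heappush(frontier, (cost + 1.0 / shift.strength, nxt, path + [nxt]))
    return None

def imagine(net, path):
    """Survey navigation by imagination: chain the perspective shifts along a
    path so that the last vista space is expressed in the first reference frame."""
    x, y, heading = 0.0, 0.0, 0.0
    for a, b in zip(path, path[1:]):
        s = net[a][b]
        x += s.dx * math.cos(heading) - s.dy * math.sin(heading)
        y += s.dx * math.sin(heading) + s.dy * math.cos(heading)
        heading += s.rot
    return math.hypot(x, y), math.degrees(math.atan2(y, x))  # distance, bearing

route = select_route(network, "plaza", "yard")
print(route, imagine(network, route))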
3 The Network of Reference Frames Theory in the Theoretical and the Empirical Context 3.1 The Network of Reference Frames Theory Compared to Graph Representations and Cognitive Maps The network of reference frames theory is a fusion between graph representations and cognitive maps. Multiple reference frames or cognitive maps are connected with each other within a graph structure. As in graph representations, the basic structure is a network or graph. However, in contrast to most existing graph models ([4], [19], [20], [45], [48]), metric information is included within this graph. This is done for the
nodes, which consist of reference frames, as well as for the edges (i.e., the perspective shifts, which represent translations and turns). Such a representation avoids the problems associated with the mentioned graph representations (see 1.2): (1) Most importantly, it can explain survey knowledge, as metric relations are represented, contrary to other graph models. (2) Representing metric relations also uses information provided by perception. Depth vision and other processes allow us to perceive the spatial structure of a scene. This information is stored and not discarded as in other graph models. (3) Perspective shifts represent abstract relations that can be used to guide walking, cycling, driving, etc. No problem of generalizing from one represented action to another action occurs, as in other graph representations. The network of reference frames theory also avoids problems of the cognitive map (see 1.2): (1) It can explain self-localization and route navigation in a straightforward manner, which is difficult for cognitive maps. (2) An environmental space is not encoded within one reference frame as with a cognitive map. The representation, therefore, does not have to be globally consistent. So, contrary to cognitive maps, shortcutting is also possible when navigating "impossible" virtual environments [34]. (3) The lack of clear evidence for survey navigation in non-human mammals and insects can be easily explained. According to the network of reference frames theory, these animals are not capable of imagining anything, or they do not do so for survey purposes. However, survey navigation relies on the same representation as self-localization and route navigation. Only the additional process of imagining operates on this representation. This process might even have evolved for completely different purposes than navigation. Contrary to that, cognitive map theory has to assume that an additional representation (i.e., a cognitive map) evolved only in humans, specifically for orientation. These are much stronger assumptions. (4) Imagining distant destinations within working memory involves a lot of computation. Survey tasks are, therefore, effortful and error prone, which most people can probably confirm. In contrast, this daily-life observation is not plausible with a cognitive map: deriving the direction to distant locations from a cognitive map is rather straightforward and should not be more effortful than, for example, route navigation.⁴ The network of reference frames theory also has advantages compared to assuming both a graph and a cognitive map in parallel (see 1.2):⁵ here survey navigation is again explained by the cognitive map part. This does not avoid the last three problems mentioned in the last paragraph.⁶ In addition, the network of reference frames theory makes fewer assumptions. On a rough scale, it only assumes one representation; the
4 As an alternative to simply reading out survey relations from a cognitive map, mental travel has been proposed as an alternative process [2]. Mental travel can be considered more effortful and is, therefore, much more plausible. For the network of reference frames theory, continuous mental travel within the area of an encoded vista space can be imagined. Between nonadjacent vista spaces, this should be rather difficult. 5 Some theories assuming both a network representation and a global cognitive map are skeptical regarding the necessity of and the evidence for such a cognitive map ([16], [31]). 6 In his theory, Poucet [31] assumes a network layer with pairwise metric relations between places. This representation can be used to compute shortcuts and avoids the problems mentioned with cognitive maps. However, Poucet also proposes a global integration within a cognitive map, leading again to the mentioned problems. In addition, it is unclear which of the two metric representations determines survey navigation.
combination of graphs and maps assumes two representations. More specifically, graphs and maps need to connect corresponding elements, for example elements which represent the same house. These connections are additional and potentially error prone. A last problem with cognitive maps already mentioned is that we must have multiple cognitive maps anyway, because we cannot represent the whole world within one single cognitive map. As we do use reference frames to represent spatial locations, the question is what spatial area such reference frames usually encode. Here, it is proposed that this basic unit consists of a vista space.

3.2 Vista Space Reference Frames as the Basic Unit in the Representation of Environmental Spaces

Representing a space in multiple interconnected units works with units of different size. Using large units such as towns results in large packages of information which might be difficult to process as a whole. On the other hand, smaller units such as individual objects result in an exponential increase in the relations between the units which have to be represented. Many experiments show that humans are able to represent vista spaces within one frame of reference (e.g., [12], [28], [33]). So the main question is whether navigators use vista spaces or whether they also use larger units (e.g., buildings or city districts) to represent locations within one reference frame. Updating experiments indicate that a surrounding room is always updated during blindfolded rotations. This is not necessarily the case for the whole surrounding campus, suggesting that the relevant unit is smaller than a campus [47]. The network of reference frames theory predicts that there are no common reference frames for units larger than vista spaces. Other theories on spatial orientation in robots [50] and rodents [44] also rely on the visible area as the basic element.⁷ Several arguments support vista spaces as the basic unit in spatial orientation: (1) Vista spaces are the largest unit provided directly by visual perception, and (2) they are directly relevant for navigation. (3) Visibility is correlated with wayfinding performance. (4) Hippocampal place cells are likely related to vista spaces, and (5) our own experiments show that participants encode a simple environmental space not within one reference frame, but use multiple reference frames in the orientation predicted by the network of reference frames theory. Vista spaces can be experienced from only one point of view. In order to represent environmental spaces, such as buildings and cities, we have to move around (the case of learning from paper maps is not considered here). When encoding units larger than vista spaces, several percepts have to be integrated. Such integration is not done spontaneously [8]. Vista spaces are also the most relevant unit for navigation. Route decisions have to be taken within a vista space. When lost, self-localization is usually accomplished by recognizing the geometry or landmarks within a specific vista space
7 In Yeap's theory [50], all vista spaces are directly adjacent to each other and are connected via exits. Survey relations computed from that representation are, therefore, correct when the forms of the individual vista spaces are correct. In the network of reference frames theory, the preciseness of survey relations depends on the preciseness of the perspective shifts. In addition, Yeap assumes a hierarchical structuring on top of the basic vista space level. Touretzky and Redish [44] do not say anything about environmental spaces. They also assume that multiple, simultaneously active reference frames represent one vista space.
(e.g., [3]). Short cutting is difficult, because it encompasses more than just one vista space. In contrast, selecting the direct path to a necessarily visible location within a vista space is trivial. Visibility is also correlated with behavior. More vista spaces, (i.e., corridors on a route), lead to larger errors in Euclidean distance estimation [41]. Learning a virtual environmental space is easier with a full view down a corridor than when visual access is restricted to a short distance, which results in more vista spaces that need be encoded [38]. Place cells in human and rodent hippocampus seem to represent a location in a vista space ([5], [30]). Place cells fire every time a navigator crosses a specific area independent of head orientation. This area is relative to the surrounding boundaries of a vista space and is adjusted when changing the overall size or shape of the vista space [29]. One and the same place cell can be active in different vista spaces, and can therefore, not encode one specific location in an environmental space [37]. In conclusion, a set of place cells is a possible neuronal representation of locations within one frame of reference. This frame is likely to be limited to a vista space. In addition to arguments from the literature, we recently tested the prediction from the network reference frames theory concerning the importance of vista space reference frames [23]. This prediction incorporated, first, that a vista space is the largest unit encoded within one single reference frame, and second, that the orientation of such a vista space reference frame is important, (i.e., that navigators perform better when they are aligned with that orientation). Participants learned a simple immersive virtual environmental space consisting of seven corridors by walking in one direction. In the testing phase, they were teleported to different locations in the environment and were asked to self-localize and then point towards previously learned targets. As predicted by the network of reference frames theory, participants performed better when oriented in the direction in which they originally learned each corridor, (i.e., when they were aligned with an encoded vista space reference frame). If the whole environment was encoded within one single frame of reference, this result could not be predicted. One global reference frame should not result in any difference at all (cf., [12]) or participants should perform better when aligned with the orientation of this single global reference frame as predicted by reference axis theory ([28], [33]). No evidence for this could be observed. Participants seem to encode multiple local reference frames for each vista space in the orientation they experienced this vista space (which coincided with its geometry). 3.3 Egocentric and Allocentric Reference Frames The reference frames in the network of reference frames theory correspond to vista spaces and they are connected via perspective shifts. Are these relations egocentric or allocentric? Egocentric and allocentric reference frames have been discussed intensively over the last few years (e.g., [28], [46]). In an egocentric reference frame locations and orientations within an environment are represented relative to the location and orientation of a navigator’s body in space [15]. This is best described by a polar coordinate system. An allocentric reference frame is specified by a space external to a navigator. 
Here, object-to-object relations are represented, in contrast to the object-to-body relations in the egocentric reference frame. An allocentric reference frame is best described by a Cartesian coordinate system.
In principle, the network of reference frames theory is compatible with egocentric as well as allocentric reference frames. With egocentric reference frames, elements within a vista space are encoded relative to the origin of the egocentric reference frame by vectors (and additional rotations if the relative bearing matters). Perspective shifts are just egocentric vectors which point to another reference frame instead of an object within the vista space. Despite in principle being compatible with egocentric reference frames, the network of reference frames theory is better classified as allocentric. This decision is based on five arguments: (1) The origin which is quite prominent in polar coordinate systems does not play a role in the network of reference frames theory. No performance differences are predicted whether a navigator is located at the origin or at another location within a vista space reference frame. A polar coordinate system would suggest that this makes a difference. (2) Contrary to the origin, the orientation of a reference frame does make a difference according to the network of reference frames. When aligned with this orientation, participants should perform better and do so (see 3.2). Such an orientation, however, is more prominent in Cartesian coordinate systems, than it is in polar coordinate systems. (3) The orientation of a reference frame originates either from the initial experience with a vista space or from the vista space’s main geometric orientation, (e.g., the orientation of the longer walls of a room). In principle, the main geometric orientation might never have been experienced directly, (i.e., a navigator was never aligned with the surrounding walls). Still, the geometry might determine the orientation of the reference frame (cf., [33]). Although this is a highly artificial situation, such a reference frame has to be allocentric. (4) Within a vista space reference frame, the geometry of the boundaries of this vista space is encoded. It has been shown that the room geometry is encoded as a whole (i.e., allocentrically not by egocentric vectors; e.g., [46]). So at least some of the relations within a vista space are allocentric anyway. (5) Although perspective shifts can be understood as egocentric vectors (plus rotations), they are intuitively better described as relations between locations in an environmental space, (i.e., allocentric relations), rather then relations between egocentric experiences. In summary, the arguments suggest that the network of reference frames theory is better understood as allocentric than as egocentric. 3.4 The Relation between Vista Space Reference Frames: Network vs. Hierarchy Hierarchic theories of spatial memory have been very prominent (e.g., [4], [11], [40], [50]). In such views, smaller scale spaces are stored at progressively lower levels of the hierarchy. Contrary to these approaches, the network of reference frames theory does not assume environmental spaces are organized hierarchically, but assumes environmental spaces are organized in a network. There is no higher hierarchical layer assumed above a vista space. All vista spaces are equally important in that sense. This does not exclude vista spaces themselves from being organized hierarchically. Hierarchical graph models or hierarchical cognitive maps still face most of the problems discussed in 3.1. However, one argument for hierarchical structuring is based on clustering effects. 
In clustering effects, judgments within a spatial region differ from judgments between or across spatial regions. For instance, within a region, distances are estimated faster and judged to be shorter, or locations are
remembered lying more to the center of such a region than they were seen before. Many of these clustering effects have been examined for regions within a vista space or a whole country usually learned via maps (e.g., [40]). They are, therefore, not relevant here. However, clustering effects are also found in directly experienced environmental spaces. Experiments show that distance judgments [11] and route decisions between equal length alternatives [49] are influenced by regions within the environmental space. These effects cannot be explained by the network of reference theory alone. A second categorical memory has to be assumed which represents a specific region (cf., [13]). Judgments must be based at least partially on these categories and not on the network of reference frames only. These categories might consist of verbal labels such as “downtown” [22]. As a prediction, no clustering effects for directly learned environmental spaces should be observed when such a category system is inhibited, (e.g., by verbal shadowing). 3.5 Asymmetry in Spatial Memory The perspective shifts assumed by the network of reference frames theory are not symmetric. They always point from one vista space to another and are not inverted easily. Tasks accessing a perspective shift in its encoded direction should be easier and more precise than tasks that require accessing the perspective shift in the opposite direction - at least as long as there is no additional perspective shift encoded in the opposite direction. This asymmetry can explain the route direction effect in spatial priming and different route choices for wayfinding there and back. After learning a route presented on a computer screen in only one direction, recognizing pictures of landmarks is faster when primed with a picture of an object encountered before the landmark than when primed with an object encountered after the landmark (e.g., [14]). According to the network of reference frames theory the directionality of perspective shifts speeds up activation spread in the direction the route was learned. Therefore, priming is faster in the direction a route was learned. Asymmetries are also found in path choices. In a familiar environment, navigators often choose different routes on the way out and back (e.g., [39]). According to the network of reference frames theory, different perspective shifts usually connect vista spaces on a route out and back. Due to different connections, different routes can be selected when planning a route out compared to planning the route back. The network of reference frames theory explains asymmetries on the level of route knowledge. However, it also predicts an asymmetry in survey knowledge. Learning a route mainly in one direction should result in an improved survey performance, (i.e., faster and more precise pointing), in this direction compared to the opposite direction. This yet has to be examined.
4 Conclusions The network of reference frames theory is a synthesis from graph representations and cognitive maps. It resolves problems that exist in explaining the orientation behavior of human and non-human animals based on either graphs, maps or both of them in parallel. In addition, the theory explains the unique role of vista spaces as well as
asymmetries in spatial memory. New predictions from the theory concern, first, the role of orientation within environmental spaces, which has been tested recently, second, the lack of clustering effects in environmental spaces based on the assumed memory alone, and third, an asymmetry in survey knowledge tasks. Further experiments have to show whether the network of reference frames theory will prove of value in these and other cases. Acknowledgements. This research was supported by the EU grant “Wayfinding” (6th FP - NEST). I would like to thank Heinrich Bülthoff for supporting this work, Bernhard Riecke, Christoph Hölscher, Hanspeter Mallot, Jörg Schulte-Pelkum and Jack Loomis for discussing the ideas proposed here, Jörg Schulte-Pelkum for help with writing and Brian Oliver for proof-reading.
References 1. Bennett, A.T.D.: Do animals have cognitive maps? Journal of Experimental Biology 199, 219–224 (1996) 2. Byrne, P., Becker, S., Burgess, N.: Remembering the past and imagining the future: a neural model of spatial memory and imagery. Psychological Review 114, 340–375 (2007) 3. Cheng, K., Newcombe, N.S.: Is there a geometric module for spatial orientation? Squaring theory and evidence. Psychonomic Bulletin & Review 12, 1–23 (2005) 4. Chown, E., Kaplan, S., Kortenkamp, D.: Prototypes location, and associative networks (PLAN): Towards a unified theory of cognitive mapping. Cognitive Science 19, 1–51 (1995) 5. Ekstrom, A., Kahana, M., Caplan, J., Fields, T., Isham, E., Newman, E., Fried, I.: Cellular networks underlying human spatial navigation. Nature 425, 184–187 (2003) 6. Fujita, N., Klatzky, R.L., Loomis, J.M., Golledge, R.G.: The encoding-error model of pathway completion without vision. Geographical Analysis 25, 295–314 (1993) 7. Gallistel, C.R.: The organization of learning. MIT Press, Cambridge (1990) 8. Hamilton, D.A., Driscoll, I., Sutherland, R.J.: Human place learning in a virtual Morris water task: some important constraints on the flexibility of place navigation. Behavioural Brain Research 129, 159–170 (2002) 9. Hegarty, M., Waller, D.: Individual differences in spatial abilities. In: Shah, P., Miyake, A. (eds.) The Cambridge Handbook of Visuospatial Thinking, pp. 121–169. Cambridge University Press, Cambridge (2005) 10. Hein, A., Held, R.: A neural model for labile sensorimotor coordination. In: Bernard, E.E., Kare, M.R. (eds.) Biological prototypes and synthetic systems, vol. 1, pp. 71–74. Plenum, New York (1962) 11. Hirtle, S.C., Jonides, J.: Evidence of hierarchies in cognitive maps. Memory & Cognition 13, 208–217 (1985) 12. Holmes, M.C., Sholl, M.J.: Allocentric coding of object-to-object relations in overlearned and novel environments. Journal of Experimental Psychology: Learning, Memory and Cognition 31, 1069–1078 (2005) 13. Huttenlocher, J., Hedges, L.V., Duncan, S.: Categories and particulars: prototype effects in estimating spatial location. Psychological Review 98, 352–376 (1991) 14. Janzen, G.: Memory for object location and route direction in virtual large-scale space. The Quarterly Journal of Experimental Psychology 59, 493–508 (2006)
15. Klatzky, R.L.: Allocentric and egocentric spatial representations: Definitions, distinctions, and interconnections. In: Freska, C., Habel, C., Wender, K.F. (eds.) Spatial cognition - An interdisciplinary approach to representation and processing of spatial knowledge, pp. 1–17. Springer, Berlin (1998) 16. Kuipers, B.: The spatial semantic hierarchy. Artificial Intelligence 119, 191–233 (2000) 17. Loomis, J.M., Klatzky, R.L., Golledge, R.G., Philbeck, J.W.: Human navigation by path integration. In: Golledge, R.G. (ed.) Wayfinding behavior, pp. 125–151. John Hopkins Press, Baltimore (1999) 18. MacFarlane, D.A.: The role of kinesthesis in maze learning. University of California Publications in Psychology 4 277-305 (1930); (cited from Spada, H. (ed.) Lehrbuch allgemeine Psychologie. Huber, Bern (1992) 19. McNaughton, B.L., Leonard, B., Chen, L.: Cortical-hippocampal interactions and cognitive mapping: A hypothesis based on reintegration of parietal and inferotemporal pathways for visual processing. Psychbiology 17, 230–235 (1989) 20. Mallot, H.: Spatial cognition: Behavioral competences, neural mechanisms, and evolutionary scaling. Kognitionswissenschaft 8, 40–48 (1999) 21. Meilinger, T., Hölscher, C., Büchner, S.J., Brösamle, M.: How Much Information Do You Need? Schematic Maps in Wayfinding and Self Localisation. In: Barkowsky, T., Knauff, M., Ligozat, G., Montello, D.R. (eds.) Spatial Cognition V, pp. 381–400. Springer, Berlin (2007) 22. Meilinger, T., Knauff, M., Bülthoff, H.H.: Working memory in wayfinding - a dual task experiment in a virtual city. Cognitive Science 32, 755–770 (2008) 23. Meilinger, T., Riecke, B.E., Bülthoff, H.H.: Orientation Specificity in Long-Term-Memory for Environmental Spaces (submitted) 24. Moeser, S.D.: Cognitive mapping in a complex building. Environment and Behavior 20, 21–49 (1988) 25. Montello, D.R.: Spatial orientation and the angularity of urban routes: A field study. Environment and Behavior 23, 47–69 (1991) 26. Montello, D.R.: Scale and multiple psychologies of space. In: Frank, A.U., Campari, I. (eds.) Spatial information theory: A theoretical basis for GIS, pp. 312–321. Springer, Berlin (1993) 27. Montello, D.R., Pick, H.L.: Integrating knowledge of vertically aligned large-scale spaces. Environment and Behavior 25, 457–484 (1993) 28. Mou, W., Xiao, C., McNamara, T.P.: Reference directions and reference objects in spatial memory of a briefly viewed layout. Cognition 108, 136–154 (2008) 29. O’Keefe, J., Burgess, N.: Geometric determinants of the place fields of hippocampal neurons. Nature 381, 425–428 (1996) 30. O’Keefe, J., Nadel, L.: The hippocampus as a cognitive map. Clarendon Press, Oxford (1978) 31. Poucet, B.: Spatial cognitive maps in animals: New hypotheses on their structure and neural mechanisms. Psychological Review 100, 163–182 (1993) 32. Restat, J., Steck, S.D., Mochnatzki, H.F., Mallot, H.A.: Geographical slant facilitates navigation and orientation in virtual environments. Perception 33, 667–687 (2004) 33. Rump, B., McNamara, T.P.: Updating Models of Spatial Memory. In: Barkowsky, T., Knauff, M., Ligozat, G., Montello, D.R. (eds.) Spatial Cognition V, pp. 249–269. Springer, Berlin (2007) 34. Schnapp, B., Warren, W.: Wormholes in virtual reality: What spatial knowledge is learned for navigation? In: Proceedings of the 7th Annual Meeting of the Vision Science Society 2007, Sarasota, Florida, USA (2007)
35. Sholl, J.M., Kenny, R.J., DellaPorta, K.A.: Allocentric-heading recall and its relation to self-reported sense-of-direction. Journal of Experimental Psychology: Learning, Memory, and Cognition 32, 516–533 (2006) 36. Siegel, A.W., White, S.H.: The development of spatial representations of large-scale environments. In: Reese, H. (ed.) Advances in Child Development and Behavior, vol. 10, pp. 10–55. Academic Press, New York (1975) 37. Skaggs, W.E., McNaughton, B.L.: Spatial Firing Properties of Hippocampal CA1 Populations in an Environment Containing Two Visually Identical Regions. Journal of Neuroscience 18, 8455–8466 (1998) 38. Stankiewicz, B.J., Legge, G.E., Mansfield, J.S., Schlicht, E.J.: Lost in Virtual Space: Studies in Human and Ideal Spatial Navigation. Journal of Experimental Psychology: Human Perception and Performance 37, 688–704 (2006) 39. Stern, E., Leiser, D.: Levels of spatial knowledge and urban travel modeling. Geographical Analysis 20, 140–155 (1988) 40. Stevens, A., Coupe, P.: Distortions in judged spatial relations. Cognitive Psychology 10, 422–437 (1978) 41. Thorndyke, P.W., Hayes-Roth, B.: Differences in spatial knowledge acquired from maps and navigation. Cognitive Psychology 14, 560–589 (1982) 42. Thrun, S., Burgard, W., Fox, D.: Probabilistic Robotics. MIT Press, Cambridge (2005) 43. Tolman, E.C., Ritchie, B.F., Khalish, D.: Studies in spatial learning. I. Orientation and the short-cut. Journal of Experimental Psychology 36, 13–24 (1946) 44. Touretzky, D.S., Redish, A.D.: Theory of rodent navigation based on interacting representations of space. Hippocampus 6, 247–270 (1996) 45. Trullier, O., Wiener, S.I., Berthoz, A., Meyer, J.-A.: Biologically based artificial navigation systems: Review and prospects. Progress in Neurobiology 51, 483–544 (1997) 46. Wang, F.R., Spelke, E.S.: Human spatial representation: insights form animals. Trends in Cognitive Sciences 6, 376–382 (2002) 47. Wang, R.F., Brockmole, J.R.: Simultaneous spatial updating in nested environments. Psychonomic Bulletin & Review 10, 981–986 (2003) 48. Werner, S., Krieg-Brückner, B., Herrmann, T.: Modelling Navigational Knowledge by Route Graphs. In: Habel, C., Brauer, W., Freksa, C., Wender, K.F. (eds.) Spatial Cognition 2000. LNCS (LNAI), vol. 1849, pp. 295–316. Springer, Heidelberg (2000) 49. Wiener, J., Mallot, H.: Fine-to-coarse route planning and navigation in regionalized environments. Spatial Cognition and Computation 3, 331–358 (2003) 50. Yeap, W.K.: Toward a computational theory of cognitive maps. Artificial Intelligence 34, 297–360 (1988)
Spatially Constrained Grammars for Mobile Intention Recognition

Peter Kiefer

Laboratory for Semantic Information Technologies, Otto-Friedrich-University Bamberg, 96045 Bamberg, Germany
[email protected]
Abstract. Mobile intention recognition is the problem of inferring a mobile agent’s intentions from her spatio-temporal behavior. The intentions an agent can have in a specific situation depend on the spatial context, and on the spatially contextualized behavior history. We introduce two spatially constrained grammars that allow for modeling of complex constraints between space and intentions, one based on Context-Free, one based on Tree-Adjoining Grammars. We show which of these formalisms is suited best for frequently occurring intentional patterns. We argue that our grammars are cognitively comprehensible, while at the same time helping to prune the search space for intention recognition. Keywords: Intention recognition, Mobile assistance systems.
1 Introduction
The problem of inferring an agent's intentions from her behavior is called the intention recognition problem. The closely related problem of plan recognition has been discussed in the AI literature for many years [1]. Approaches for plan recognition differ in the way the domain and possible plans are represented. While early work tended to be quite general, like Kautz's event hierarchies [2], current research is typically concerned with specialized use cases (e.g. [3]) and efficient inference (e.g. [4]). A class of intention recognition problems with a specific need for efficient inference is mobile intention recognition. We observe a mobile user's trajectory and try to ‘guess’ what intentions she has in mind. These mobile problems are different, not only because of the restricted computational and cognitive resources [5]. Mobile intention recognition problems also differ from ‘traditional’ use cases because mobile behavior happens in space. This has a number of implications. One is that we have knowledge about the spatial context, about spatial objects, their relations, and spatial constraints. A glance at current research on the inverse problem, spatio-temporal planning, gives us an idea of what these constraints can look like: Seifert et al. discuss an interactive assistance system that supports spatio-temporal planning tasks [6]. In their example they describe the constraints that need to be considered when planning a trip: the temporal order of activities,
the time needed for traveling from A to B, and spatial constraints about what actions can be performed at which location. Important about Seifert’s approach is that the chosen hierarchical spatial structure offers a cognitively appealing way of interaction between user and planning system, while at the same time helping to prune the search space. In this paper, we will see that complex constraints between intentions and space not only give us a rich toolbox to formalize typical behavioral patterns in mobile intention recognition, but can also speed up inference. We choose formal grammars to represent intentions so that the intention recognition problem becomes a parsing problem. Grammars are, in general, cognitively easy to understand and make the connection between expressiveness and complexity explicit. The main contribution of this paper is the combination of spatial constraints with Tree Adjoining Grammars (TAG), a formalism from natural language processing (NLP) that falls in complexity between context-free and context-sensitive grammars (CFG, CSG). The idea to apply grammar formalisms from NLP to plan/intention recognition is also followed by Geib and Steedman [7], and in own previous work [8]. In difference to these approaches, our spatially constrained grammars allow the formalization of complex, non-local constraints between intentions and space (and not only between intentions). The rest of this paper is structured as follows: in section 2 we explain which steps are necessary to state a mobile intention recognition problem as a parsing problem. In this context we review Spatially Grounded Intentional Systems (SGIS) [9]. In section 3, we explain which important use cases cannot be handled with SGIS, and proceed over Spatially Constrained Context-Free Grammars (SCCFG) to Spatially Constrained Tree-Adjoining Grammars (SCTAG). Using real motion track data from the location-based game CityPoker we discuss which general spatio-temporal behavior patterns are handled best with which formalism. The paper closes with a discussion of related work (section 4) and an outlook on questions that remain open (section 5).
2 From Spatio-temporal Behavior to Intentions

2.1 Mobile Intention Recognition
The fact that mobile behavior happens in space and time has two main implications: one is that we can make use of spatial information. We not only know the absolute coordinates of a user's behavior, but also the spatial context. With a suitable spatial model we can say that the behavior happened, for instance, in a specific region, on a road, or close to a point of interest. We also have information about the spatial relations between these objects [10], like intersect, overlap, or north of. Depending on the specific use case, these spatial objects also bear a certain semantics: ‘a restaurant is a place where I can have the intention to eat something’. This is very similar to the basic intuition behind activity-based spatial ontologies [11]. However, inferring the agent's intention directly from her position is too simple in many situations: a mobile user passing
Fig. 1. Segmented motion track with classified behavior sequence from a CityPoker game. (The player enters from the right.)
by a restaurant does not necessarily have the intention to eat there. Schlieder calls this spatio-temporal design problem room crossing problem [9]. This leads us to the second implication of spatio-temporality: the gap between sensor input (e.g. position data from a GPS device) and high-level intentions (e.g. ‘find a restaurant’ ) is extremely large. It is not possible to design an intelligent intention recognition algorithm that works directly on pairs of (latitude/longitude). To bridge this gap, we use a multi-level architecture with the level of behaviors as intermediate level between position and intention. We process a stream of (lat/lon)-pairs as follows: 1. Preprocessing. The quality of the raw GPS data is improved. This includes removing points with zero satellites, and those with an impossible speed. 2. Segmentation. The motion track is segmented at the border of regions, and when the spatio-temporal properties (e.g. speed, direction) of the last n points have changed significantly [12]. 3. Feature Extraction. Each segment is analyzed and annotated with certain features, like speed and curvature [13]. 4. Classification. Using these features, each motion segment is classified to one behavior. We can use any mapping function from feature vector to behaviors, for instance realized as a decision tree. As output we get a stream of behaviors. In the example from Fig. 1 we distinguish the following behaviors: riding (br ), standing (b0 ), sauntering (bs ), curving (bc ), and slow-curving (bcs ). This track was recorded in the location-based game CityPoker. In this game, two players are trying to find (physical) playing cards
which are hidden in a city. The gaming area is structured by five rectangular cache regions. In each cache region there are three potential cache coordinates (one is drawn as a circle in Fig. 1). Cards are only hidden in one of the three potential caches. Players can find out about the correct cache by answering a multiple-choice question. Once they have arrived at the cache, they perform a detail search in the environment, under bushes, trees, or benches, until they finally find the cards. They may then trade one card against one from their hand, and continue in the game. For a complete description of the game, refer to [9]. The reason why this game is especially well suited as an exemplary use case is that CityPoker is played by bike at high speed. The user's cognitive resources are bound by the traffic, and she cannot interact properly with the device (a J2ME enabled smartphone, localized by GPS). Similar situations occur in other use cases, like car navigation or maintenance work. Depending on the intention recognized, we want to select an appropriate information service automatically. For instance, if we recognize the intention FindWay, we will probably select a map service. It is up to the application designer to decide whether to present the service with information push, or just to ease the access to this service (‘hotbutton’). We will not discuss the step of mapping intentions to information services any further in this paper.
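As a concrete illustration of steps 3 and 4 of the pipeline described above (steps 1 and 2, preprocessing and segmentation, are assumed to have been done already), the following Python sketch derives a speed and a crude curvature measure for each motion segment and maps them to the behavior symbols of Fig. 1 with a hand-written decision tree. The thresholds, the curvature definition, and all names are invented for illustration and are not the values used in the CityPoker system.

import math
from dataclasses import dataclass
from typing import List, Tuple

Fix = Tuple[float, float, float]        # (x metres, y metres, t seconds), already projected

@dataclass
class Segment:
    fixes: List[Fix]
    region: str                         # region from the spatial model

def features(seg: Segment) -> dict:
    """Step 3: annotate a motion segment with speed and a crude curvature measure."""
    pts = seg.fixes
    path_len = sum(math.hypot(b[0] - a[0], b[1] - a[1]) for a, b in zip(pts, pts[1:]))
    duration = max(pts[-1][2] - pts[0][2], 1e-6)
    beeline = math.hypot(pts[-1][0] - pts[0][0], pts[-1][1] - pts[0][1])
    curvature = 1.0 - beeline / path_len if path_len > 0 else 0.0
    return {"speed": path_len / duration, "curvature": curvature}

def classify(f: dict) -> str:
    """Step 4: a hand-written decision tree from features to behaviors (invented thresholds)."""
    if f["speed"] < 0.3:
        return "b0"                                          # standing
    if f["speed"] < 1.5:
        return "bcs" if f["curvature"] > 0.3 else "bs"       # slow-curving / sauntering
    return "bc" if f["curvature"] > 0.3 else "br"            # curving / riding

def behaviour_stream(segments: List[Segment]) -> List[Tuple[str, str]]:
    """Pipeline output: (behavior, region) pairs handed to the parser."""
    return [(classify(features(s)), s.region) for s in segments]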
2.2 Parsing Behavior Sequences
The stream of behaviors described above serves as input to a parsing algorithm. Using behaviors as terminals and intentions as non-terminals, we can write rules of a formal grammar that describe the intentions of an agent in our domain. Most plan recognition approaches have followed a hierarchical structure of plans/intentions (e.g. [14,15]). We should say something about the difference between plans and intentions although an elaborate discussion of this issue is beyond the scope of this paper. In line with common BDI agent literature, we see intentions as ‘states of mind’ which are directed ‘towards some future state of affairs’ ([16, p.23]). We see ‘plans as recipes for achieving intentions.’ [16, p.28]. We can say that a rule in our grammar describes a plan, while each non-terminal stands for one intention. Thus, the aim of intention recognition is to find out (at least) the current intention. In CityPoker, for instance, a player will certainly have the intention to Play. At the beginning of each game, the members of a team discuss their strategy. Playing in CityPoker means exchanging cards in several cache regions, so we model a sequence of intentions as follows: GotoRegion HandleRegion, GotoRegion HandleRegion, and so on. In the cache region players find themselves a comfortable place to stand, answer a multiple-choice question, and select one out of three caches, depending on their answer. In the cache, they search a playing card which is hidden in the environment (see the behavior sequence in Fig. 1). A context-free production system for CityPoker is listed in Fig. 21 . Grammar rules like these are modular and intuitively understandable, also for noncomputer scientists. Formal properties of grammars are well-known, and parsing 1
Rules with a right-hand side of the form (symbol1 |...|symboln )+ are a simplified notation for ‘an arbitrary sequence of symbol1 , ..., symboln , but at least one of them’.
algorithms exist. The choice of the formalism depends on the requirements of the use case. We briefly recall that with a CFG we can express patterns of the form aⁿbⁿ. As argued in [9], most intention recognition use cases need at least this expressiveness. A typical example is leaving the same number of regions as entered before (enterⁿ leaveⁿ). Note that parsing a stream of behaviors must be done incrementally, i.e. with an incomplete behavior sequence. We can find the currently active intention in the parse tree by choosing the non-terminal which is the direct parent of the current behavior.
2.3 Reducing Parsing Ambiguities by Adding Spatial Knowledge
When parsing formal grammars we easily find ourselves in a situation where the same input sequence may have two or more possible parse trees, i.e. more than one possible interpretation. This is especially true when parsing an incomplete behavior sequence incrementally. One way to deal with ambiguity is probabilistic grammars [17], where we have to determine a probability for each rule in the grammar. A spatial way of ambiguity reduction is proposed by Schlieder in [9]: SGIS are context-free production systems, like that in Fig. 2, with the extension that each rule is annotated with a number of regions in which it is applicable. We call this the spatial grounding of rules. For instance, a HandleCache intention is grounded in all regions of type cache. We modify all rules accordingly. An SGIS rule for the original rule (12) would look as follows:

HandleCache → SearchCards DiscussStrategy   [grounding: cache_{1,1}, ..., cache_{5,3}]

This reduces the number of possible rules applicable at each position in the behavior sequence, thus avoiding many ambiguities. Figure 3 shows two possible interpretations for the behavior sequence from Fig. 1: without spatial knowledge we could not decide which of the two interpretations is correct. For parsing in SGIS we replace the pure behavior stream (beh_1, beh_2, beh_3, ...) by a stream of behavior/region pairs: ((beh_1, reg_1), (beh_2, reg_2), (beh_3, reg_3), ...). Each behavior is annotated with the region in which it occurs. Also the non-terminals in the parse tree are annotated with a region (Intention, region), with the meaning that all child-intentions or child-behaviors of this intention must occur in that region. SGIS are a short form of writing rules of the following form (where Symbol can be an intention or a behavior):

(Intention, reg_x) → (Symbol_1, reg_x) ... (Symbol_n, reg_x)
That means, we cannot write rules for arbitrary combinations of regions. In addition, we require that another rule can only be inserted at an intention Symboli if the region of the other rule is (transitive) child in the partonomy, i.e. in the above rule we can only insert productions with a region regy part of regx (which includes the same region: regy .equals(regx )). SGIS have been designed for partonomially structured space. The nesting of rules follows closely the nesting of regions
Production Rules for CityPoker

(1) Play → DiscussStrategy Continue
(2) DiscussStrategy → b0
(3) Continue → ε | GotoRegion HandleRegion Continue
(4) GotoRegion → (br | b0 | bc)+
(5) HandleRegion → SelectCache GotoCache HandleCache
(6) SelectCache → FindParkingPos AnswerQuiz
(7) FindParkingPos → (br | bc | bcs)+
(8) AnswerQuiz → b0
(9) GotoCache → (SearchWayToC | NavigateTowardsC)+
(10) SearchWayToC → (b0 | bcs | bs)+
(11) NavigateTowardsC → (br | bc)+
(12) HandleCache → SearchCards DiscussStrategy
(13) SearchCards → (CrossCache | DetailSearch)+
(14) CrossCache → (br)+
(15) DetailSearch → (b0 | bcs | bs | bc)+

Fig. 2. Context-free production rules for intention recognition in CityPoker
Fig. 3. Parsing ambiguity if we had no spatial knowledge (see track from Fig. 1). Through spatial disambiguation in SGIS we can decide that the bottom parse tree is correct.
and sub-regions in the spatial model. The CityPoker partonomy is structured as follows: the game area contains five rectangular cache regions, each of which in turn contains three caches. SGIS deliberately restrict us in what we can express: we cannot write rules for arbitrary pairs of behavior and region. This makes sense from a spatial point of view (the agent cannot ‘beam’ herself), as well as from a cognitive point of view: as in Seifert et al. [6], the knowledge engineer is working with a representational formalism that resembles a structure of space preferred by many individuals: a hierarchical one [18].
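A minimal sketch of the spatial grounding check follows, under the assumption that the partonomy is stored as a simple child-to-parent dictionary: a rule is applicable only if the region of the current observation is part of (or identical to) one of the regions the rule is grounded in. The region names, the flat rule encoding, and the shortened groundings are illustrative, not taken from the SGIS implementation.

# Regions form a partonomy: child -> parent.
partonomy = {"cache_1_1": "cache_region_1", "cache_region_1": "gameboard"}

def part_of(region, ancestor):
    """True if `region` equals `ancestor` or lies below it in the partonomy."""
    while region is not None:
        if region == ancestor:
            return True
        region = partonomy.get(region)
    return False

# An SGIS rule: left-hand intention, right-hand symbols, and the regions in
# which the rule is grounded (groundings shortened for the sketch).
rules = [
    ("HandleCache", ["SearchCards", "DiscussStrategy"], {"cache_1_1"}),
    ("HandleRegion", ["SelectCache", "GotoCache", "HandleCache"], {"cache_region_1"}),
]

def applicable(rules, current_region):
    """Keep only rules whose grounding contains a region that the current
    observation region is part of; this is the spatial filter SGIS adds."""
    return [r for r in rules
            if any(part_of(current_region, g) for g in r[2])]

print([lhs for lhs, _, _ in applicable(rules, "cache_1_1")])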
3 Spatially Constrained Grammars
3.1 Spatially Constrained Context-Free Grammars
SGIS is a formalism with which we can model a variety of spatio-temporal intention recognition problems. With the spatial grounding of rules we can formalize spatial constraints of type part of. Constraints about the temporal order of intentions are formalized implicitly through the order of right-hand symbols in the production rules. However, the restrictions of SGIS hinder us from expressing frequently occurring use cases. Consider the motion track in Fig. 1: the agent enters the cache, shows some searching behavior, and then temporarily leaves the circular cache to the south. Knowing the whole motion track, we can decide that this is better described as an accidental leaving, i.e. no intention change, than as a ChangePlan intention². For an incremental algorithm, it is not clear at the moment of leaving whether the agent will return. It is also not necessary that the intermediate behavior is located in the parent cache region of the cache. Finally, entering just any cache is not sufficient for accidental leaving; we require that cache to be the same as the one left before. We would need the following rule:

(HandleCache, cache_{1,1}) → (SearchCards, cache_{1,1}) (accidental leaving behavior, [unconstrained]) (SearchCards, cache_{1,1})

We cannot formulate this in SGIS, but still it makes no sense to write rules for arbitrary pairs of (intention, region). We have already argued against this maximum of complexity in Section 2.3. At this point, we can add another argument: we would have to write a plethora of similar rules for each cache in our game. What we would need to formalize the accidental leaving pattern elegantly is the following rule, where an identical constraint requires the regions of the two SearchCards symbols to be the same:

HandleCache → SearchCards Confused SearchCards   [identical: region of first SearchCards = region of second SearchCards]
2 A player in CityPoker who has given a wrong answer to the quiz will be searching at the wrong cache and will probably give up after some time. He will then head for one of the other caches. The ChangePlan intention was omitted from Fig. 2 for reasons of clarity.
We can easily find other examples of the pattern ‘a certain behavior/intention occurs in a region which has a spatial relation r to another region where the agent has done something else before’. For instance, we can find use cases where it makes sense to detect a ReturnToX intention if the agent has forgotten the way back to some place. We could define this as ‘the agent shows a searching behavior in a region which touches a region she has been to before’:

ClothesShopping → ExamineClothes HaveABreak ReturnToShop   [touches: the region of ReturnToShop touches the region of ExamineClothes]

The definition of a new spatial context-free grammar that handles these examples is quite straightforward.

Definition 1. A Spatially Constrained Context-Free Grammar is defined as SCCFG = (CFG, R, SR, GC, NLC), where
– CFG is a context-free grammar (I, B, P, S), defined over intentions I and behaviors B, with production rules P and start symbol S (the top-level intention)
– R is a set of regions
– SR is a set of spatial relations, where each relation r ⊆ R × R
– GC ⊆ P × R is a set of grounding constraints (as in SGIS [9])
– NLC is a set of spatial non-local constraints. Each constraint has a type from the spatial relations SR and is defined for two right-hand symbols of one production rule from P.

We introduce the grounding constraints to make SCCFG a real extension of SGIS. However, we will not always need them, as in the CityPoker example. The reason is that CityPoker regions are typed according to their level in the partonomy (cache part of cache region part of gameboard). With an SCCFG we can rewrite the rules from Fig. 2 without spatial grounding in a specific region, but with part of and identical relations, for instance for rules (5) and (12):

HandleRegion → SelectCache GotoCache HandleCache   [identical and part of constraints between the regions of the right-hand symbols]
HandleCache → SearchCards DiscussStrategy   [identical: both symbols take place in the same cache]

SCCFGs obviously have a higher expressiveness than SGIS. We can express more spatial relations than part of, and create a nesting of relations by applying the production rules. In contrast to SGIS, the nesting of constraints is not necessarily accompanied by a corresponding nesting of regions in the partonomy. The example above for rule (5) shows that we could also infer new relations from those we know (HandleCache must be part of SelectCache). In principle, we could define an SCCFG for a non-partonomial spatial structure, although this might make the model cognitively more demanding.
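The following sketch shows one way the non-local constraints of an SCCFG could be represented and tested, using the ClothesShopping example from above. The relation table, the toy adjacency set behind touches, and the encoding of constraints as index pairs over the right-hand side are assumptions made for illustration, not the paper's data structures.

from dataclasses import dataclass
from typing import List, Tuple

# Spatial relations over regions; `touches` is backed by a toy adjacency set.
ADJACENT = {("shop", "cafe"), ("cafe", "shop")}

RELATIONS = {
    "identical": lambda a, b: a == b,
    "touches":   lambda a, b: (a, b) in ADJACENT,
}

@dataclass
class Rule:
    lhs: str
    rhs: List[str]
    # non-local constraints: (relation, index of first rhs symbol, index of second)
    nlc: List[Tuple[str, int, int]]

# ClothesShopping -> ExamineClothes HaveABreak ReturnToShop, with the constraint
# that ReturnToShop happens in a region touching the region of ExamineClothes.
rule = Rule("ClothesShopping",
            ["ExamineClothes", "HaveABreak", "ReturnToShop"],
            [("touches", 0, 2)])

def satisfied(rule: Rule, regions: List[str]) -> bool:
    """Check the rule's non-local constraints against the regions in which the
    right-hand symbols were recognized (one region per rhs symbol)."""
    return all(RELATIONS[rel](regions[i], regions[j]) for rel, i, j in rule.nlc)

print(satisfied(rule, ["shop", "park", "cafe"]))   # True: cafe touches shop
print(satisfied(rule, ["shop", "park", "park"]))   # False: constraint violated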
Fig. 4. Substitution (left) and adjoining (right) on a TAG (taken from [20, Fig. 2.2])
3.2 Cross-Dependencies: A Parallel to NLP
Quite frequently, players in CityPoker do not change a playing card although they have found it. They memorize the types of cards they have found and their exact position, and continue in the game. For a number of reasons it might make sense to change cards in another cache region first. Sometimes they return to that cache region at some later time in the game to change a card (without the effort of answering the quiz, the cache search, and so on). An intelligent assistance system should recognize the intention RevisitRegion and offer an appropriate information service. The crossed return to region pattern we would like to model in this use case looks as follows:

HandleRegion₁ HandleRegion₂ RevisitRegion₁ HandleRegion₃ RevisitRegion₂   [identical constraints link each RevisitRegion to the earlier HandleRegion with the same index, so the two dependencies cross]

What we need for this is a possibility to create cross-dependencies. A constrained context-free grammar, like SCCFG, can have cross-dependencies, but only static ones which are defined directly in the rules. No new cross-dependencies can evolve during parsing through the operations offered for CFGs. Modeling all possibilities for cross-dependencies statically in the rules is infeasible, even for CityPoker. Note that more than two constraints might be crossing, and not all HandleRegion intentions are followed by a corresponding RevisitRegion. As explained in [7] and [8], similar cross-dependencies occur in NLP. In some natural languages, cross-dependencies are possible between grammatical constructs. If a certain tense, case, or other grammatical form is chosen for the front non- or pre-terminal, we have to choose a corresponding construct for the back non- or pre-terminal. To handle such cross-dependencies, the NLP community has developed formalisms with an extended domain of locality: ‘By a domain of locality we mean the elementary structures of a formalism over which dependencies such as agreement, subcategorization, filler-gap, etc. can be specified.’ ([19, p.5]). In the following, we introduce one of these formalisms and convert it to a spatially constrained one.
Tree-Adjoining Grammars
Mildly Context-Sensitive Grammars (MCSG) are a class of formal grammars with common properties [21]. Their expressiveness falls between CFGs and CSGs, and
they support certain kinds of dependencies, including crossed and nested dependenciess. They are polynomially parsable and thus especially attractive for mobile intention recognition. Tree-Adjoining Grammars (TAG), first introduced in [22], are a MCSG with an especially comprehensible way of modeling dependencies. The fundamental difference to CFGs is that TAGs operate on trees, and not on strings. A good introduction to TAG is given by Joshi and Schabes in [20]. They define TAG as follows. Definition 2. A Tree-Adjoining Grammar is defined as TAG = (NT, Σ, IT, AT, S), where – NT are non-terminals – Σ are terminals. – IT is a finite set of initial trees. In an initial tree, interior nodes are labeled by non-terminals. The nodes on the frontier (leaf nodes) are labeled by either terminals, or non-terminals. A frontier node labeled with a non-terminal must be marked for substitution. We mark substitution nodes with a ↓. – AT is a finite set of auxiliary trees. In an auxiliary tree, interior nodes are also labeled by non-terminals. Exactly one node at the frontier is the foot node, marked with an asterisk ∗. The foot node must have the same label as the root node. All other frontier nodes are either terminals or substitution nodes, as in the initial trees. – S is a distinguished non-terminal (starting symbol). The two operations defined on TAGs are substitution and adjoining (see Fig. 4). Adjoining is sometimes also called adjunction. Both operations work directly on trees. Substitution is quite straightforward: we can place any initial tree (or any tree that has been derived from an initial tree) headed with a symbol X into a substitution node labeled with X↓. It is the adjoining operation that makes TAGs unique: we can adjoin an auxiliary tree labeled with X into an interior node of another tree with the same label. This operation works as follows: (1) we remove the part of the tree which is headed by the interior node, (2) replace it by the auxiliary tree, and (3) attach the partial tree which was removed in step 1 at the foot node. The language defined by a TAG is a set of trees. By traversing a tree we can certainly also interpret it as a String, just like traversing a parse tree of a CFG. If, just for a moment, we try to interpret the two operations as operations on Strings, we see that substitution just replaces a non-terminal by a number of symbols. This is exactly as applying a production rule in a CFG. Adjoining manipulates a String in a more intricate way: a part of the old String (the terminals of the grey tree in Fig. 4) becomes surrounded by new Strings to the left and to the right (by the left and right handside of the X∗ in the auxiliary trees). Joshi and Schabes later add to their definition of TAG the following Adjoining Constraints: Selective Adjunction, Null Adjunction, and Obligatory Adjunction. Every non-terminal in any tree may be constrained by one of these. Selective Adjunction restrains the auxiliary tree that may be adjoined at that node to a
Fig. 5. Initial tree (α) and auxiliary trees (β and γ) in a SCTAG for CityPoker
Selective Adjunction restricts the auxiliary trees that may be adjoined at a node to a given set of auxiliary trees. Obligatory Adjunction does the same, but additionally forces an adjunction at that node. Null Adjunction disallows any adjunction at that node. These local constraints are important for writing sensible grammars, but are not discussed further here because our focus is on non-local constraints. A discussion of the formal properties of TAGs, their differences from other grammars, a corresponding automaton, and parsing algorithms can be found in a number of publications, e.g., [20,23,24]. For our use case it should be clear that (1) we can easily rewrite any CFG as a TAG, (2) TAGs are more expressive than CFGs, and (3) writing a TAG is not necessarily more complicated than writing a CFG: instead of writing a number of production rules, we write a number of trees.
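To make the two operations concrete, the following minimal Python sketch represents TAG trees and implements substitution and adjoining as just described. It is an illustrative toy, not the paper's implementation; class and function names are hypothetical, and the example trees are only loosely modeled on the CityPoker trees of Fig. 5.

class Node:
    def __init__(self, label, children=None, subst=False, foot=False):
        self.label = label                  # terminal or non-terminal symbol
        self.children = children or []      # [] means a frontier node
        self.subst = subst                  # substitution node (X↓)
        self.foot = foot                    # foot node of an auxiliary tree (X*)

def substitute(tree, initial):
    """Replace each substitution node whose label matches the initial tree's root."""
    if tree.subst and tree.label == initial.label:
        return initial
    return Node(tree.label, [substitute(c, initial) for c in tree.children],
                tree.subst, tree.foot)

def adjoin(tree, aux, label, done=None):
    """Adjoin auxiliary tree 'aux' at the first interior node labeled 'label':
    (1) remove the subtree below that node, (2) put 'aux' in its place,
    (3) re-attach the removed subtree at the foot node of 'aux'."""
    done = done if done is not None else [False]

    def attach_at_foot(node, subtree):
        if node.foot:
            return subtree
        return Node(node.label, [attach_at_foot(c, subtree) for c in node.children],
                    node.subst, node.foot)

    if not done[0] and tree.label == label and tree.children and not tree.subst:
        done[0] = True
        return attach_at_foot(aux, tree)
    return Node(tree.label, [adjoin(c, aux, label, done) for c in tree.children],
                tree.subst, tree.foot)

# Schematic example (tree shapes simplified relative to Fig. 5): adjoining a
# 'Continue' auxiliary tree into the 'Play' initial tree stretches the material
# around the foot node, as illustrated in Fig. 6.
alpha = Node("Play", [Node("DiscussStrategy", subst=True),
                      Node("Continue", [Node("GotoRegion", subst=True),
                                        Node("HandleRegion", subst=True)])])
beta = Node("Continue", [Node("Continue", foot=True),
                         Node("RevisitRegion", subst=True)])
derived = adjoin(alpha, beta, "Continue")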
3.4 Spatially Constrained Tree-Adjoining Grammars
Definition 3. A Spatially Constrained Tree-Adjoining Grammar is defined as SCTAG = (TAG, R, SR, GC, NLC), where
– TAG = (I, B, IT, AT, S) is a Tree-Adjoining Grammar defined over intentions I and behaviors B,
– R is a set of regions,
– SR is a set of spatial relations, where each relation r ⊆ R × R,
– GC ⊆ (IT ∪ AT) × R is a set of grounding constraints,
– NLC is a set of spatial non-local constraints; each constraint has a type from the spatial relations SR and is defined for two nodes in one tree from IT ∪ AT.
This definition applies the idea of spatial constraints to TAGs: the non-local constraints are now defined between nodes in initial or auxiliary trees. The idea of specifying non-local dependencies in TAG is not new; in earlier work on TAGs, Joshi describes this concept as 'TAGs with links' [23, Section 6.2]. During the operations of substitution and adjoining, the non-local constraints remain in the tree and become stretched if necessary. Adjoining may also lead to
Fig. 6. Adjoining in an SCTAG can lead to cross-dependencies of constraints. Non-crossing spatial constraints are omitted for reasons of clarity.
cross-dependencies such as those needed for modeling the crossed return to region pattern. Figure 5 shows part of an SCTAG that handles the revisiting of cache regions in CityPoker; non-local spatial constraints are displayed as dotted lines. A complete grammar for this use case would convert all context-free rules from Fig. 2 to trees and add them to the grammar; this step is trivial. Figure 6 demonstrates how cross-dependencies evolve through two adjoining operations.
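As an illustration of Definition 3 (again a hypothetical sketch rather than the paper's implementation), the SCTAG components and the check of a single spatial non-local constraint can be written roughly as follows; the relation names and the region_of mapping are placeholders.

from dataclasses import dataclass

@dataclass
class NonLocalConstraint:
    relation: str          # a type from SR, e.g. "identical" or "part-of"
    node_a: object         # two nodes in one elementary tree (IT ∪ AT)
    node_b: object

@dataclass
class SCTAG:
    trees: list            # IT ∪ AT
    regions: set           # R
    spatial_relations: dict # SR: relation name -> set of (r1, r2) region pairs
    grounding: dict        # GC: elementary tree -> region
    constraints: list      # NLC: NonLocalConstraint objects

def constraint_holds(grammar, c, region_of):
    """'region_of' maps a node to the region in which its recognized behavior
    occurred (None while the node is not yet grounded).  A constraint that is
    only partially grounded cannot be violated yet."""
    r1, r2 = region_of(c.node_a), region_of(c.node_b)
    if r1 is None or r2 is None:
        return True
    return (r1, r2) in grammar.spatial_relations.get(c.relation, set())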
3.5 Parsing Spatially Constrained Grammars
For parsing a spatially constrained grammar, we modify existing parsing algorithms. CFGs are typically handled with chart-based parsers, like the well-known Earley algorithm [25]. An algorithm for parsing TAGs, based on the Cocke-Younger-Kasami algorithm, was proposed in [24], with polynomial worst- and average-case complexity. Unfortunately, this complexity is O(n⁶) and thus
quite high. Joshi presents a TAG parser that adopts the idea of Earley and improves the average-case complexity [20]. We build the parsers for SCCFG and SCTAG on these Earley-like parsers. Earley parsers work on a chart in which the elementary constructs of the grammar are kept: production rules for CFGs, trees for TAGs. A dot in each of these chart entries marks the position up to which the construct has been recognized; in Joshi's parser the 'Earley dot' traverses trees, not strings. Earley parsers work in three steps: scan, predict, and complete. Predict checks for possible derivations and adds them to the chart. Scan reads the next symbol from the stream and matches it with the chart entries. Complete passes the recognition of rules up the tree until finally the starting symbol has been recognized. The TAG parser has a fourth operation, called 'adjoin', to handle the adjoining operation. Our point is that adding spatial constraints to such a parser will not make it slower but faster, because spatial constraints give us more predictive information: 'Any algorithm should have enough information to know which tokens are to be expected after a given left context' [20, p. 36]. Knowing the spatial context of the left-hand terminals, we can discard those hypotheses that are not consistent with the spatial constraints. We add this step after each scan operation.
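The following self-contained sketch shows where such a spatial filter sits in an Earley-style recognizer. It recognizes a plain CFG over behavior symbols (non-nullable rules assumed) and applies a caller-supplied spatially_consistent hook after each scan step; it is a simplified illustration, not the SCCFG/SCTAG parser itself, and the grammar fragment in the comments is hypothetical.

def earley_recognize(grammar, start, tokens, spatially_consistent=lambda item, obs: True):
    # grammar: dict mapping non-terminal -> list of right-hand sides (tuples of symbols),
    #   e.g. {"Play": [("DiscussStrategy", "Continue")],
    #         "Continue": [("GotoRegion", "HandleRegion")]}
    # tokens: list of (behavior, region) pairs; item: (lhs, rhs, dot, origin)
    chart = [set() for _ in range(len(tokens) + 1)]
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(len(tokens) + 1):
        changed = True
        while changed:                                   # closure: predict + complete
            changed = False
            for item in list(chart[i]):
                lhs, rhs, dot, origin = item
                if dot < len(rhs) and rhs[dot] in grammar:      # predict
                    for prod in grammar[rhs[dot]]:
                        new = (rhs[dot], prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            changed = True
                elif dot == len(rhs):                           # complete
                    for parent in list(chart[origin]):
                        plhs, prhs, pdot, porigin = parent
                        if pdot < len(prhs) and prhs[pdot] == lhs:
                            new = (plhs, prhs, pdot + 1, porigin)
                            if new not in chart[i]:
                                chart[i].add(new)
                                changed = True
        if i < len(tokens):                              # scan, then spatial filter
            behavior, region = tokens[i]
            for item in chart[i]:
                lhs, rhs, dot, origin = item
                if dot < len(rhs) and rhs[dot] == behavior:
                    advanced = (lhs, rhs, dot + 1, origin)
                    # Hypotheses inconsistent with the observed region are dropped here.
                    if spatially_consistent(advanced, (behavior, region)):
                        chart[i + 1].add(advanced)
    return any(item[0] == start and item[2] == len(item[1]) and item[3] == 0
               for item in chart[len(tokens)])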
4 Related Work
We started this paper by saying that approaches for intention recognition differ in the way the domain and possible intentions are represented. A number of formalisms have been proposed for modeling the mental state of an agent, ranging from finite state machines [26] to complex cognitive modeling architectures, like the ACT-R architecture [27]. With our formal grammars, which lie between these two extremes, we try to strike a balance between expressiveness and computational complexity. Using formal grammars to describe structural regularities is common, not only in NLP, but also in areas like computer vision [28] and action recognition [29]. Pynadath's state-dependent grammars constrain the applicability of a rule depending on a general state variable [17]. The generality of this state variable leads to an explosion of the symbol space when a parsing algorithm is applied, so an inference mechanism is chosen instead which translates the grammar into a Dynamic Bayes Network (DBN). Choosing a grammatical approach means using grammars not only for syntax description, but implicitly assigning a certain semantics (in terms of intentions and plans). Linguistics is also concerned with semantics, both on the sentence level and on the level of discourse. Webber et al. [30], as one example from the literature on discourse semantics, argue that multiple, possibly overlapping, semantic relations are common in discourse semantics. By using (lexicalized) TAG they describe these relations without the need for building multiple trees.
SGIS
  Dependencies supported: Nested: Yes (only part-of relation); Cross: No
  Typical spatial intention pattern: Sub-intentions are located in the same or in sub-regions of their parent intention.
  Example: part-of R1R2R3R4R2R1

SCCFG
  Dependencies supported: Nested: Yes; Cross: No (unless statically defined in productions)
  Typical spatial intention pattern: Accidental leaving pattern.
  Example: part-of touches R1R2R3R1R4R5R1R2

SCTAG
  Dependencies supported: Nested: Yes; Cross: Yes
  Typical spatial intention pattern: Crossed return to region pattern.
  Example: part-of touches R1R2R3R1R4R1R2R1R5

Fig. 7. A hierarchy of spatial grammars for mobile intention recognition
Approaches based on probabilistic networks, like DBNs, have been widely applied in plan recognition research, ranging from Charniak and Goldman's Plan Recognition Bayesian Networks [31] to hierarchical Markov models as used by Liao et al. in the BELIEVER system [32]; the semantics of 'goal' in the latter publication is 'target location', without a complex intention model. Bui proposes the Abstract Hidden Markov Memory Model for plan recognition in an intelligent office environment [33]. Geo-referenced DBNs are proposed in [34] to fuse sensory data and cope with the problem of inaccurate data. Intention recognition approaches also differ in the way space is represented: the simplest model consists of a number of points of interest with circular or polygonal areas around them [35,26]. Others add a street network to these locations [32], use spatial tessellation [36], or formalize space with Spatial Conceptual Maps [37]. The quality of our intention recognition relies on good preprocessing. Converting a motion track into a qualitative representation has been done by a number of researchers, for instance [38]; the authors also compare a number of approaches to generalization. For the classification of segments in Fig. 1 we used a simple decision tree. The set of behavior types we are interested in was chosen manually; the automatic detection of motion patterns is the concern of the spatio-temporal data mining community, see e.g. [39]. One concern of computational and formal linguistics is to find approaches that closely resemble the human conceptualization of language. Steedman, for instance, argues that planned action and natural language are related systems that share the same operations: functional composition and type-raising [40]. Combinatory Categorial Grammar (CCG) is a mildly context-sensitive formalism that supports these operators. Using a 'spatialized' version of CCG for mobile intention recognition could be worthwhile. We chose TAG in this paper because we believe that TAGs are cognitively more appealing for knowledge engineers not familiar with NLP concepts.
5 Conclusion and Outlook
We have presented a hierarchy of formal grammars for mobile intention recognition: SGIS, SCCFG, and SCTAG. With increasing expressiveness we can handle a larger number of spatio-temporal patterns that frequently occur in mobile intention recognition scenarios such as CityPoker. Our grammars allow the knowledge engineer to specify complex intention/space relations using intuitive spatial relations, instead of writing arbitrarily complex rules over behavior/region tuples. Figure 7 gives an overview of the three formalisms. We have only sketched the principle of parsing; currently, we are specifying the parsing algorithm for SCTAG formally. As a next step we will evaluate the algorithm on the restricted computational resources of a mobile device. In this paper we treated all spatial relations as arbitrary relations and only mentioned that we could use them for inference; this is also an issue for our future work. Adding temporal constraints, such as 'the duration between these two intentions may not be longer than a certain Δt', could be worthwhile. Another issue that remains open is recognizing when the agent spontaneously changes her intention [15].
Acknowledgements. I would like to thank Klaus Stein for the discussions on the algorithmic possibilities of SCTAG parsing. Christoph Schlieder's motivating and constant support of my PhD research made this work possible.
References
1. Schmidt, C., Sridharan, N., Goodson, J.: The plan recognition problem: An intersection of psychology and artificial intelligence. Artificial Intelligence 11(1-2), 45–83 (1978) 2. Kautz, H.A.: A Formal Theory of Plan Recognition. PhD thesis, University of Rochester, Rochester, NY (1987) 3. Jarvis, P.A., Lunt, T.F., Myers, K.L.: Identifying terrorist activity with ai plan-recognition technology. AI Magazine 26(3), 73–81 (2005) 4. Bui, H.H.: Efficient approximate inference for online probabilistic plan recognition. Technical Report 1/2002, School of Computing Science, Curtin University of Technology, Perth, WA, Australia (2002) 5. Baus, J., Krueger, A., Wahlster, W.: A resource-adaptive mobile navigation system. In: Proc. 7th International Conference on Intelligent User Interfaces, San Francisco, USA, pp. 15–22. ACM Press, New York (2002) 6. Seifert, I., Barkowsky, T., Freksa, C.: Region-Based Representation for Assistance with Spatio-Temporal Planning in Unfamiliar Environments. In: Location Based Services and TeleCartography, pp. 179–192. Springer, Heidelberg (2007) 7. Geib, C.W., Steedman, M.: On natural language processing and plan recognition. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI), pp. 1612–1617 (2007)
8. Kiefer, P., Schlieder, C.: Exploring context-sensitivity in spatial intention recognition. In: Workshop on Behavior Monitoring and Interpretation, 40th German Conference on Artificial Intelligence (KI-2007), CEUR, vol. 296, pp. 102–116 (2007), ISSN 1613-0073 9. Schlieder, C.: Representing the meaning of spatial behavior by spatially grounded intentional systems. In: Rodríguez, M.A., Cruz, I., Levashkin, S., Egenhofer, M.J. (eds.) GeoS 2005. LNCS, vol. 3799, pp. 30–44. Springer, Heidelberg (2005) 10. Egenhofer, M.J., Franzosa, R.D.: Point-set topological relations. International Journal of Geographical Information Systems 5(2), 161–174 (1991) 11. Kuhn, W.: Ontologies in support of activities in geographical space. International Journal of Geographical Information Science 15(7), 613–631 (2001) 12. Stein, K., Schlieder, C.: Recognition of intentional behavior in spatial partonomies. In: ECAI 2004 Workshop 15: Spatial and Temporal Reasoning (16th European Conference on Artificial Intelligence) (2005) 13. Schlieder, C., Werner, A.: Interpretation of intentional behavior in spatial partonomies. In: Freksa, C., Brauer, W., Habel, C., Wender, K.F. (eds.) Spatial Cognition III. LNCS (LNAI), vol. 2685, pp. 401–414. Springer, Heidelberg (2003) 14. Kautz, H., Allen, J.F.: Generalized plan recognition. In: Proc. of the AAAI conference 1986 (1986) 15. Geib, C.W., Goldman, R.P.: Recognizing plan/goal abandonment. In: Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI), pp. 1515–1517 (2003) 16. Wooldridge, M.: Reasoning About Rational Agents. MIT Press, Cambridge (2000) 17. Pynadath, D.V.: Probabilistic Grammars for Plan Recognition. PhD thesis, The University of Michigan (1999) 18. Hirtle, S., Jonides, J.: Evidence of hierarchies in cognitive maps. Memory and Cognition 13(3), 208–217 (1985) 19. Joshi, A.K., Vijay-Shanker, K., Weir, D.: The convergence of mildly context-sensitive grammar formalisms. Technical Report MS-CIS-90-01, Department of Computer and Information Science, University of Pennsylvania (1990)
28. Chanda, G., Dellaert, F.: Grammatical methods in computer vision: An overview. Technical Report GIT-GVU-04-29, College of Computing, Georgia Institute of Technology, Atlanta, GA, USA (November 2004), ftp://ftp.cc.gatech.edu/pub/gvu/tr/2004/04-29.pdf 29. Bobick, A., Ivanov, Y.: Action recognition using probabilistic parsing. In: Proc. of the Conference on Computer Vision and Pattern Recognition, pp. 196–202 (1998) 30. Webber, B., Knott, A., Stone, M., Joshi, A.: Discourse relations: A structural and presuppositional account using lexicalised TAG. In: Proc. of the 37th Annual Meeting of the Association for Computational Linguistics (ACL 1999), pp. 41–48 (1999) 31. Charniak, E., Goldman, R.P.: A Bayesian model of plan recognition. Artificial Intelligence 64(1), 53–79 (1993) 32. Liao, L., Patterson, D.J., Fox, D., Kautz, H.: Learning and inferring transportation routines. Artificial Intelligence 171(5-6), 311–331 (2007) 33. Bui, H.H.: A general model for online probabilistic plan recognition. In: Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI) (2003) 34. Brandherm, B., Schwartz, T.: Geo referenced dynamic Bayesian networks for user positioning on mobile systems. In: Strang, T., Linnhoff-Popien, C. (eds.) LoCA 2005. LNCS, vol. 3479, pp. 223–234. Springer, Heidelberg (2005) 35. Ashbrook, D., Starner, T.: Using GPS to learn significant locations and predict movement across multiple users. Personal and Ubiquitous Computing 7(5), 275–286 (2003) 36. Gottfried, B., Witte, J.: Representing spatial activities by spatially contextualised motion patterns. In: RoboCup 2007, International Symposium, pp. 329–336. Springer, Heidelberg (2007) 37. Samaan, N., Karmouch, A.: A mobility prediction architecture based on contextual knowledge and spatial conceptual maps. IEEE Transactions on Mobile Computing 4(6), 537–551 (2005) 38. Musto, A., Stein, K., Eisenkolb, A., Röfer, T., Brauer, W., Schill, K.: From motion observation to qualitative motion representation. In: Habel, C., Brauer, W., Freksa, C., Wender, K.F. (eds.) Spatial Cognition 2000. LNCS (LNAI), vol. 1849, pp. 115–126. Springer, Heidelberg (2000) 39. Laube, P., van Kreveld, M., Imfeld, S.: Finding REMO - detecting relative motion patterns in geospatial lifelines. In: Developments in Spatial Data Handling, Proceedings of the 11th International Symposium on Spatial Data Handling, pp. 201–215 (2004) 40. Steedman, M.: Plans, affordances, and combinatory grammar. Linguistics and Philosophy 25(5-6), 725–753 (2002)
Modeling Cross-Cultural Performance on the Visual Oddity Task
Andrew Lovett, Kate Lockwood, and Kenneth Forbus
Qualitative Reasoning Group, Northwestern University
2133 Sheridan Rd., Evanston, IL, 60201, USA
{andrew-lovett@, kate@cs., forbus@}northwestern.edu
Abstract. Cognitive simulation offers a means of more closely examining the reasons for behavior found in psychological studies. This paper describes a computational model of the visual oddity task, in which individuals are shown six images and asked to pick the one that doesn’t belong. We show that the model can match performance by participants from two cultures: Americans and the Mundurukú. We use ablation experiments on the model to provide evidence as to what factors might help explain differences in performance by the members of the two cultures. Keywords: Qualitative representation, analogy, cognitive modeling, oddity task.
1 Introduction
A central problem in studying spatial cognition is representation. Understanding and modeling the visual representations people construct for the world around them is a difficult challenge for cognitive science. Dehaene and colleagues [7] made important progress on this problem by designing a study which directly tests what features people represent when they look at geometric figures in a visual scene. Their study utilized the Oddity Task methodology: participants were shown an array of six images and asked to pick the image that did not belong (e.g., see Fig. 1). By varying the diagnostic spatial feature, i.e., the feature that distinguished one image from the other five, they were able to test which features their participants were capable of representing and comparing. Dehaene and colleagues ran their study on multiple age groups within two populations: Americans and the Mundurukú, an indigenous group in South America. They found that while the Americans performed better overall, the Mundurukú appeared to be capable of encoding the same spatial features. The Mundurukú performed above chance on nearly all of the 45 problems, and their pattern of errors correlated highly with the American pattern of errors. Dehaene concluded from the results that many spatial features are universal in human representation. However, several questions remain: (1) What makes one problem harder than another? (2) Why is it that, despite the high correlation between population groups, some problems seem especially hard for Americans, while other problems seem especially hard for the Mundurukú? (3) To what extent can questions (1) and (2) be answered in terms of the process of encoding representations, versus the process of operating over those representations to solve problems?
Fig. 1. Six example problems from the Oddity Task (problems A–F)
This paper presents a cognitive model designed to explore these questions. Our model is based upon two core claims about spatial cognition: (1) When people encode a visual scene, they focus on the qualitative attributes and relations of the objects in the scene [11]. This provides them with a more abstract, more robust representation than one filled with quantitative details about the scene. (2) People compare low-level visual representations using the same mapping process used to perform abstract analogies. Our model of comparison is based on Gentner's [14] structure-mapping theory of analogy. Our model uses four components to simulate the oddity task from end to end. We use a modified version of CogSketch¹ [13], a sketch understanding system, to automatically construct qualitative representations of sketches and other two-dimensional stimuli. We use the Structure-Mapping Engine (SME) [8], a computational model of structure-mapping theory, to model comparison and similarity judgments. We use two additional components based on structure-mapping theory: MAGI [9], which models symmetry detection, and SEQL [18], which models analogical generalization. Using this approach, we have modeled human performance on geometric analogy problems [25] (problems of the form "A is to B as C is to …?"); a subset of the Raven's Progressive Matrices [20], a visually-based intelligence test; and basic visual comparison tasks [19,21]. However, the Dehaene task offers a unique opportunity in that it was designed to isolate specific spatial features and check for their presence or absence in one's representation. This paper presents our cognitive model of performance on the Oddity Task and uses it to study factors that contribute to difficulty on the task. In comparing the model with human results, we focus on two population groups: American children aged 8-13, and the full set of Mundurukú of all ages. We consider these groups because their overall performance on the 45 problems in the Dehaene study was comparable: 75% for the Americans and 67% for the Mundurukú. We provide evidence for what might distinguish these groups from each other via ablation studies using the model.
¹ Publicly available at http://www.spatialintelligence.org/projects/cogsketch_index.html
We begin by briefly reviewing the components of our model. We then show how these component models are combined in our overall model of the Oddity Task. We analyze the results produced by running the model on the 45 problems from the original study, and use ablation studies to explore possible explanations for performance differences between the two groups. We close with a discussion of related and future work.
2 Modeling Comparison Via Analogy
Our model of comparison is based on Gentner's [14] structure-mapping theory of analogy. According to structure-mapping, people compare two objects by aligning the common structure in their representations of the objects. Comparison is guided by a systematicity bias; that is, people prefer mappings that place deeper structure with more higher-order relations into correspondence. Structure-mapping has been used to explain and predict a variety of psychological phenomena, including visual similarity and differences [15,22]. Next we describe three computational models based on structure-mapping, each of which is used in the present study.
2.1 Structure-Mapping Engine
The Structure-Mapping Engine (SME) [8,10] is a computational implementation of structure-mapping theory. It takes as input two cases, a base and a target. Each case is a symbolic representation consisting of entities, attributes of entities, and relations; there are both first-order relations between entities and higher-order relations between other relations. SME returns one to three mappings between the base and target. Each mapping has three components: a set of correspondences between elements in the base and elements in the target; a structural evaluation score, which estimates the degree of similarity between the cases; and a set of candidate inferences, i.e., inferences about the target supported by the mapping and unaligned structure in the base.
2.2 MAGI
MAGI [9] is a model of symmetry detection based upon SME. Essentially, it identifies symmetry in a representation by comparing the representation to itself, while avoiding perfect self-matches. MAGI is important in modeling spatial cognition because it is often necessary to identify axes of symmetry in a visual scene, or in a specific object.
2.3 SEQL
SEQL [18] is a model of analogical generalization. SEQL is based upon the idea that individuals learn generalizations for categories through a process of progressive alignment [16], in which instances of a category are compared and the commonalities are abstracted out as a direct result of the comparison. Given a set of cases, SEQL can build one or more generalizations from them by comparing them via SME and eliminating the structure that fails to align between cases, leaving only the structure that is common across all the cases in the generalization. Because the generalization is in the
same form as individual case representations, new cases can be compared to the generalization to measure their similarity to a category.
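As a toy illustration of this idea (SEQL actually aligns structure via SME rather than intersecting literal facts, so this is only a caricature), a case can be pictured as a set of relational facts and a generalization as the facts shared by all cases:

def generalize(cases):
    """cases: iterable of fact sets, e.g. {("parallel", "e1", "e2"), ...}.
    Assumes corresponding edges already share names, which SME would establish."""
    cases = [set(c) for c in cases]
    common = cases[0]
    for case in cases[1:]:
        common &= case          # structure that fails to align is dropped
    return common

triangle_a = {("straight", "e1"), ("straight", "e2"),
              ("perpendicular-corner", "e1", "e2")}
triangle_b = {("straight", "e1"), ("straight", "e2"),
              ("perpendicular-corner", "e1", "e2"),
              ("edges-same-length-corner", "e1", "e2")}
print(generalize([triangle_a, triangle_b]))   # the same-length relation is abstracted out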
3 Modeling Qualitative Representation Via CogSketch
One of our core claims is that people use qualitative spatial representations when reasoning over or comparing images. While quantitative data, such as the exact sizes of objects or the exact orientation of edges, may vary widely, even between images of the same object, qualitative relations are much more consistent. For example, nearly every face contains an eye to the right of another eye, with both eyes above a nose and a mouth. The key to qualitative representation is to encode what Biederman calls the nonaccidental properties [4]. These are the relations that are unlikely to have occurred by accident in a sketch. For example, two lines chosen at random are unlikely to have exactly the same orientation. Therefore, when two lines are parallel, this is unlikely to have occurred by random chance, and so it is probably significant. There is abundant evidence that people encode qualitative relations corresponding to nonaccidental properties in visual scenes. For example, parallel lines are salient for children as young as three [1]. Adults and infants can distinguish between concave and convex shapes, a qualitative distinction [3], and humans have been shown to have a preference for objects aligned with a vertical or horizontal axis, as opposed to those with an arbitrary orientation [2]. Huttenlocher and colleagues [17] have shown that when individuals memorize a point's location in a circle, they pay special attention to which quadrant of the circle the point lies in, again a qualitative distinction. While it is obviously the case that individuals are capable of encoding quantitative information in addition to these qualitative relations, the qualitative relations appear particularly well-suited to spatial problem-solving, as they can be easily encoded symbolically and used to compare different scenes. Thus, in our present work we explore the hypothesis that spatial tasks can be solved relying exclusively on qualitative representations.
We see qualitative spatial representations as hierarchical (e.g., [24]). Each of the shapes in an image can have its own attributes and relations. At the same time, each of the edges that make up that shape can also have its own attributes and relations. This gives rise to two representational foci: the shape representation and the edge representation. A further claim we are evaluating with the current study is that these two representational foci will never be used together. That is, a comparison or other operation will always run on either an image's shape representation or its edge representation.
3.1 CogSketch
CogSketch [13] is a sketch understanding system based upon the nuSketch [12] architecture. Users sketch a series of glyphs, or objects in a sketch. CogSketch then computes a number of qualitative spatial relations between the glyphs, building up a structural representation of the sketch that corresponds to the shape representation. CogSketch can also decompose a glyph into its component edges and construct a representation of the qualitative relations between the glyph's edges. This corresponds to the edge representation.
Many of the spatial relations in the shape representation (e.g., relative position, containment) are computed based on the relative position and topology of the glyphs. However, some shape relations can only be computed by first decomposing a glyph into its edges and constructing the glyph's edge representation. By comparing two glyphs' edge representations using SME, CogSketch can identify the corresponding edges in the two glyphs' shapes. These correspondences can be used to determine whether the two glyphs are the same shape, and whether one glyph's shape is a transformation of the other (e.g., a rotation or a reflection). Furthermore, a glyph's edge representation can be compared to itself via MAGI to identify axes of symmetry.

Table 1. Qualitative vocabulary for the edge representation
Edge Attributes:
• Straight/Curved/Ellipse
• Axis-aligned (horizontal or vertical)
• Short/Med/Long (relative length)
Edge Relations:
• Concave/convex corner
• Perpendicular corner
• Edges-same-length corner
• Intersecting
• Parallel
• Perpendicular
3.2 Representing the Oddity Task Stimuli
In order to model the Oddity Task, we examined the Dehaene [7] stimuli and identified a set of qualitative attributes and relations that appeared to be important for solving the problems. All attributes and relations had to be among those that could be computed automatically by CogSketch.

Table 2. Qualitative vocabulary for the shape representation
Shape Attributes:
• Closed shape
• Convex shape
• Circle shape
• Empty/Filled
• Axis (Symmetric, Vertical, and/or Horizontal)
Shape Relations:
• Right-of/Above (relative position)
• Containment
• Frame-of-Reference
• Shape-proximity-group
• Same-shape
• Rotation-between
• Reflection-between
Line-Line Relations:
• Intersecting
• Parallel
• Perpendicular
Line-Point Relations:
• Intersecting
• Colinear
• Centered-On
Table 1 summarizes qualitative attributes and relations for the edge representation. Many relations are based on corners between edges; the other relations can only hold for edges that are not connected by a corner along the shape. Table 2 summarizes attributes and relations for shapes. Empty/filled is a simplification of shape color; it refers to whether the shape has any fill color. Frame-of-Reference relations are used when a smaller shape is located inside a larger, symmetric shape (i.e., a circle). The inner shape's location is described in terms of which quadrant of the larger shape it is located in; additionally, the inner shape may lie exactly along the larger shape's axes of symmetry. Shape-proximity-group refers to shapes grouped together based on the Gestalt law of proximity [26]. Currently, grouping by proximity is only implemented for circles. Line/Line and Line/Point relations apply only to special shape types. Line/Line relations are for shapes that are simple, straight lines (thus these relations are a subset of the edge relations). Line/Point relations are for when a small circle lies near a line. The centered-on relation applies when the circle lies at the center of the line. This relation is essentially a special case of the frame-of-reference relation for a dot lying at the center of a circle. Axes of symmetry, same-shape, rotation-between, and reflection-between are all computed by comparing shapes' edge representations, as described above. Reflections are classified as X-Axis-Reflections, Y-Axis-Reflections, and Other-Reflections.
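For illustration only, an image's two representational foci might be written as fact sets along the following lines; the predicate names are patterned on Tables 1 and 2, but this is not CogSketch's actual output format, and the identifiers e1..e3, s1, s2 are made up.

# Edge-level representation of a single right-angled triangle:
edge_rep = {
    ("straight", "e1"), ("straight", "e2"), ("straight", "e3"),
    ("perpendicular-corner", "e1", "e2"),
    ("convex-corner", "e2", "e3"),
    ("short", "e1"), ("long", "e3"),
}

# Shape-level representation of an image with a filled dot inside a circle:
shape_rep = {
    ("circle-shape", "s1"), ("closed-shape", "s1"), ("empty", "s1"),
    ("filled", "s2"),
    ("containment", "s1", "s2"),                       # the dot s2 lies inside circle s1
    ("frame-of-reference", "s2", "s1", "upper-left"),  # which quadrant of s1 holds s2
}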
4 Modeling the Oddity Task
Our approach to performing the Oddity Task is to identify what is common across the images in an array by generalizing over their representations with SEQL. Individual images can then be compared to the generalization using SME. If one image is noticeably less similar to the generalization, then it must be the odd image out. Most of the time (e.g., Problem B in Fig. 1), the odd image out lacks a qualitative feature that is present in the other five images, in this case parallel lines. However, in some cases (e.g., Problem C), the odd image out possesses an extra feature beyond those found in the other images.
4.1 Theoretical Claims of the Model
Our model of the oddity task is based on the following theoretical claims:
1) People encode qualitative, structural representations of visual scenes and use these representations to perform visual tasks.
2) For a given problem, people will focus on a particular representational level (either the shape level or the edge level) in solving that problem.
3) Qualitative spatial representations are compared via structure-mapping, as implemented in SME.
4) People will identify the common features across a set of images via analogical generalization, as implemented in SEQL.
Note that these claims are general enough to apply to many spatial tasks. However, they are not detailed enough to fully specify how any task would be completed. Thus,
it is necessary to make additional modeling assumptions in order to fill out a complete computational model of the task.
4.2 Modeling the Process
Our model attempts to pick out the image that does not belong by performing a series of Generalize/Compare trials. In each trial, the system constructs a generalization from a subset of the images in the array (either the top three or the bottom three). This generalization represents what is common across all of these images. For example, consider the right-angled triangle problem (Fig. 1, Problem A). The generalization built from the three top images will describe three connected edges, with two of the edges being perpendicular. In the rightmost top image, the two perpendicular edges form an edges-same-length corner, but this relation will have been abstracted out because it is not common to all three images. The generalization is then compared to each of the other images in the array, using SME. The model examines the similarity scores for the three images, looking for a particular pattern of results: two of the images should be quite similar to the generalization, while the third image, lacking a key feature, should be less similar. In this case, the lower middle triangle will be less similar to the generalization because it lacks a right angle.
Similarity is based on SME's structural evaluation score, but it must be normalized, and there are two different ways to normalize it. Similarity scores can be normalized based only on the size of the generalization (gen-normalized). This score measures how much of the generalization is present in the image being compared, and is ideal for noticing whether an image lacks some feature of the generalization. Alternatively, similarity scores can be normalized based on both the size of the generalization and the size of the image's representation (fully-normalized). This score measures both how much of the generalization is present in the image and how much of the image is present in the generalization. While more complex than gen-normalized scores, fully-normalized scores are necessary for noticing an odd image out that possesses an extra qualitative feature that the other images lack. For example, this allows the model to pick out the image with parallel lines from the other five images without parallel lines (Fig. 1, Problem C).
4.3 Controlling the Processing
In each Generalize/Compare trial, the model must make three choices. The first is which subset of the images to generalize over (either the top three images or the bottom three). The second is whether to use gen-normalized or fully-normalized similarity scores. The third is whether to use edge representations or shape representations; recall that we are predicting that edge representations and shape representations will never be combined in a single comparison. These choices are made via the following simple control mechanism: (1) To ensure that the results are not dependent on the order of the images in the array, trial runs are attempted in pairs, one based on generalizing from the top three images and one based on generalizing from the bottom three images. (2) Because the gen-normalized similarity score is simpler, it is always attempted first. (3) The model
chooses whether to use edge or shape representations based on the makeup of the first image. If the image contains multiple shapes, or if the image contains an elliptical shape consisting of only a single edge (e.g., a circle), then a shape representation is used. Otherwise, an edge representation is used. Note, however, that an edge representation will be quickly abandoned if it is impossible to find a good generalization across images, as indicated by different images having different numbers of edges. After the initial pair of trials is run, the model looks for a sufficient candidate. Recall that each Generalize/Compare run produces three similarity scores for the three images that have been compared to the generalization. A sufficient candidate is chosen when the lowest-scoring image has a similarity score noticeably lower than the other two (< 95% of the second lowest-scoring image), meaning the image is noticeably less similar to the generalization. In cases where a sufficient candidate is not found, the model will attempt additional trials. (1) If the model was previously run using edge representations, it will try using shape representations. (2) The model will try using a fully-normalized similarity score, to see if the odd image out possesses an extra feature. At this point, if no sufficient candidate has been identified, the model gives up (this is the equivalent of a person guessing randomly, but we do not allow the model to make such guesses).
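A compact sketch of this control strategy is given below. The encode functions and the fact-set similarity measure are placeholders for CogSketch and for SME's normalized structural evaluation scores; only the top-three/bottom-three trial pairing, the gen-/fully-normalized distinction, and the 95% sufficiency threshold are taken from the text, and the ordering of the fallback trials is simplified.

def generalize(cases):                      # as in the earlier sketch
    return set.intersection(*[set(c) for c in cases])

def score(image, gen, fully_normalized=False):
    """Placeholder for SME's normalized structural evaluation scores."""
    if not gen:
        return 0.0
    overlap = len(image & gen)
    g = overlap / len(gen)                  # gen-normalized: generalization found in image
    if not fully_normalized:
        return g
    return (g + overlap / max(len(image), 1)) / 2   # also penalizes extra image structure

def sufficient_candidate(scores, ratio=0.95):
    """Index of the odd image if its score is noticeably (< 95%) below the second lowest."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    lowest, second = scores[order[0]], scores[order[1]]
    return order[0] if second > 0 and lowest < ratio * second else None

def oddity_task(images, encoders):
    """images: six stimuli; encoders: representations to try in turn (e.g. edge, then shape).
    Fallback ordering is simplified relative to the control mechanism described above."""
    for fully_normalized in (False, True):              # gen-normalized attempted first
        for encode in encoders:                         # switch representational focus
            for gen_idx, test_idx in (((0, 1, 2), (3, 4, 5)), ((3, 4, 5), (0, 1, 2))):
                gen = generalize([encode(images[i]) for i in gen_idx])
                scores = [score(encode(images[i]), gen, fully_normalized) for i in test_idx]
                pick = sufficient_candidate(scores)
                if pick is not None:
                    return test_idx[pick]
    return None                                         # the model gives up (no guessing)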
5 Simulation
We evaluated our model by running it on the 45 problems from the Dehaene [7] study. The original stimuli, in the form of PowerPoint slides, were copied and pasted into CogSketch, which automatically converted each PowerPoint shape into a glyph. Four of the 45 problems were touched up in PowerPoint to ease the transition: lines or polygons that had been drawn as separate parts and then grouped together were redrawn as a single shape. Five additional problems were modified after being pasted into CogSketch. In all five cases, we removed simple edges which had been added to the images of the problem to help illustrate an angle or reflection to which participants were meant to attend. Because the model was unable to understand the information these lines were meant to convey, they would have served only as distracters. Aside from the changes to these nine problems, no changes were made to the stimuli which had been run on human participants.
In analyzing the results, we consider first the model's overall accuracy, including the correlation between its performance and that of both the American participants and the Mundurukú participants. We then use the model to identify four factors that could contribute to problem difficulty. We examine the correlation between these factors and human performance on the subset of problems that are correctly solved by the model.
5.1 Model Accuracy
Our model correctly solves 39/45 problems. Note that chance performance would be 7.5/45. Furthermore, there is a strong correlation between the model's performance and the performance of the human participants. Table 3 shows the Pearson correlation coefficient between the model and each of the human populations. As the table shows, the
model correlates better with the American participants. However, there is also a high correlation with the Mundurukú participants. The coefficient of determination, which is computed by squaring the correlation coefficient, indicates the percentage of the variance in one variable which is accounted for by another. In this case, the coefficient of determination between the model and the Mundurukú participants is (.493² = .243), meaning the model accounts for about ¼ of the variance in the performance of the Mundurukú participants.

Table 3. Correlations between the model and the American and Mundurukú participants

             Americans   Mundurukú
Model        .656        .493
Americans    *           .758
Mundurukú    .758        *
Fig. 2 plots the performance of the two populations and the model. As the figure shows, the six problems on which the model fails are among the hardest for both populations. The one clear exception is problem 21 (see Fig. 3). Although the model fails on this problem, the Mundurukú performed quite well on it (86% accuracy).
Fig. 2. Performance of Americans, Mundurukú, and our model on the Oddity Task
Discussion. Fig. 3 shows the six problems which our model fails to solve. As the percentages show, these problems were for the most part quite difficult for both the Americans and the Mundurukú, with performance on some problems little or no higher than chance (17%).
Fig. 3. The six problems the model fails to solve. Above each problem the average accuracy for the Americans and the Mundurukú are listed, respectively, followed by the number of the correct answer: problem 21 (55% / 88%, answer 4), problem 22 (13% / 48%, answer 5), problem 34 (37% / 18%, answer 6), problem 38 (60% / 23%, answer 6), problem 44 (31% / 23%, answer 4), and problem 39 (24% / 20%, answer 1).

Overall, these six problems can be roughly broken down into three categories based on what is required to solve them. First, problem 22 requires encoding whether the dot lies along the axes of the asymmetric quadrilateral. Our model simply does not encode this relation; nor, it appears, do Americans, as they actually fall below chance on this problem. Interestingly, the Mundurukú are well above chance; at this time, it is difficult to say why they are better at solving this problem.
Problems 38 and 44 both require identifying a rotation between shapes found in different images. Our model only looks for rotations between shapes within a single image. As the percentages show, the participants, and particularly the Mundurukú, had difficulty solving these problems. We believe the most likely reason is that it did not occur to them to look for rotations between shapes in different images. Problems 21, 34, and 39 all appear to require encoding a quantitative relation between shapes: a percentage distance along an edge for 21, a number of degrees of rotation for 34, and a ratio between two shapes' sizes for 39. The fact that participants had so much trouble with these problems supports our prediction that individuals primarily encode and compare qualitative spatial features. The one exception here was problem 21, which was reasonably difficult for the Americans but actually quite easy for the Mundurukú. As with problem 22, it is difficult to say why the Mundurukú performed so well on this problem. It may be that they are better at dividing a space (either a line or a quadrilateral) into smaller parts and qualitatively encoding which of those smaller parts a dot lies along.
5.2 Modeling Problem Difficulty
We analyzed problem difficulty on the 39 problems that the model correctly solves. We used the model to identify four factors that could contribute to difficulty. For this paper, we focus on factors related to encoding the stimuli. The factors are:
(1) Shape Comparison: Some problems (e.g., Fig. 1, Problem D) require constructing edge representations of two shapes and comparing them in order to identify a relation between the shapes (e.g., a rotation or a reflection). This may be difficult because it involves switching between the edge and shape representations, and because it requires conducting an additional comparison with SME before one begins comparing the six images.
(2) Shape Symmetry: Some problems (e.g., Fig. 1, Problem E) require comparing a shape's edge representation to itself, via MAGI, in order to identify an axis of symmetry. This could be difficult for similar reasons.
(3) Shape Decomposition: Several problems (e.g., Fig. 1, Problem A) require decomposing shapes into edges in order to represent each image at the edge representation level. It is possible that this will be difficult for individuals because there may be a temptation to consider closed shapes only at the shape representation level.
(4) Shape Grouping: A couple of problems (e.g., Fig. 1, Problem F) require grouping shapes together based on the Gestalt rule of proximity. Normally, one would assume this was easy, but preliminary analysis indicated it might be difficult for the Mundurukú participants.
We used the model to produce a measure for each difficulty factor on each problem via ablation; for example, we ran the model with the ability to conduct shape comparisons turned off in order to identify the problems on which shape comparisons were required. We then attempted to find a difficulty function, based on the four factors, which correlated highly with each of the human populations. This was done by performing an exhaustive search over all possible linear weights for the four factors in the range of 0 to 15.
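A minimal sketch of this search is shown below. The per-problem factor indicators (obtained from the ablation runs) and the human scores are placeholders; correlating the weighted difficulty with error rates, the Pearson implementation, and the integer weight grid are our assumptions about the procedure, not the authors' code.

from itertools import product

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy) if vx and vy else 0.0

def best_difficulty_function(factor_matrix, human_error, max_weight=15):
    """factor_matrix: per-problem list of four 0/1 factor indicators from the ablations.
    human_error: per-problem error rate for one population (assumed target measure)."""
    best_weights, best_r = None, -1.0
    for weights in product(range(max_weight + 1), repeat=4):
        difficulty = [sum(w * f for w, f in zip(weights, factors))
                      for factors in factor_matrix]
        r = pearson(difficulty, human_error)
        if r > best_r:
            best_weights, best_r = weights, r
    return best_weights, best_r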
Results. The optimal difficulty function for the American participants is shown in Table 4 (the weight for each factor is normalized based on the size of the largest weight). In addition to the weight of each factor, the table shows the individual contribution of each factor to the correlation between the function and human performance. This was computed by removing a factor from the difficulty function and considering the drop in the function's correlation with the human population. As Table 4 shows, the difficulty function had an overall correlation of .667 with the American participants. This means that the function explains (.667² = 44%) of the variance in human performance on the 39 problems. Most of the contribution to this correlation comes from shape comparison and shape symmetry. It appears that the American participants had a great deal of difficulty with problems that required decomposing shapes into edges and comparing the edge representations to identify relations between shapes, or symmetry within a single shape. Shape decomposition also contributed to the correlation, suggesting that the participants had some difficulty with the problems requiring focusing on the edge representations of closed shapes.

Table 4. Relative contribution of factors to our difficulty function for American performance

Factor                Weight in Function   Contribution to Correlation
Shape Comparison      .69                  .163
Shape Symmetry        1.0                  .267
Shape Decomposition   .38                  .062
Shape Grouping        .08                  .001
Overall               ---                  .667
The optimal difficulty function for the Mundurukú participants is shown in Table 5. This difficulty function had a correlation of .637 with the human data, indicating it accounts for (.637² = 41%) of the variance in the Mundurukú performance. By far, the most important factor was shape comparison. The other contributing factor was shape grouping, suggesting that the Mundurukú participants might have some difficulty with problems requiring grouping elements together based on proximity. This is surprising, as Gestalt grouping is generally thought to be a basic, low-level operation. Note that the Mundurukú participants had no trouble with problems requiring estimating relative distances, as indicated by their high performance on problem 21 (Fig. 3).

Table 5. Relative contribution of factors to our difficulty function for Mundurukú performance

Factor                Weight in Function   Contribution to Correlation
Shape Comparison      1.0                  .393
Shape Symmetry        .29                  .018
Shape Decomposition   .14                  .009
Shape Grouping        .71                  .081
Overall               ---                  .637
Table 6 shows the correlation between each difficulty function and each population group. As expected, each difficulty function correlates far better with the population group for which it was built. The fact that there is still a relatively high correlation between the American function and the Mundurukú performance, and between the Mundurukú function and the American performance, most likely results from the fact that both groups have a great deal of trouble with problems requiring shape comparison.

Table 6. Correlations between difficulty function and population

Difficulty Function   American Participants   Mundurukú Participants
American Function     .667                    .402
Mundurukú Function    .427                    .637
Discussion. One of our original goals was to use the model to identify differences between the two populations. Our two difficulty functions appear to have accomplished this. The difficulty function for American participants suggests that they tend to encode images holistically. They tend to have trouble when a problem requires breaking a shape down into its edge representation. This may be because the academic training in basic shapes encourages Americans to look at shapes as a whole, rather than explicitly considering the individual edges that make up a shape. The Mundurukú participants, in contrast, appear to encode stimuli more analytically. They are better able to consider shapes in terms of their component edges; most noticeably, they are better at using a shape’s edges to identify axes of symmetry. However, they had difficulty seeing groups of shapes holistically in this task.
6 Related Work
Several AI systems have been constructed to explore visual analogy. Croft and Thagard's DIVA [5] uses a 3D scene graph representation from computer graphics as a model of mental imagery. That is, the system "watches" animation in the computer graphics system in order to perceive its mental imagery. Analogy is carried out via a connectionist network over the hierarchical structure of the scene graph. DIVA's initial inputs, unlike ours, are generated by hand. Their background knowledge is also hand-generated specifically for their simulation, unlike our use of the same knowledge base across many simulation systems and experiments. DIVA has only been tested on a handful of examples, and to the best of our knowledge, has not been used to model specific psychological findings. Davies and Goel's Galatea [6] uses a small vocabulary of primitive visual elements (line, circle, box) plus a set of visual transformations over them (e.g., move, decompose) to describe base and target descriptions, and uses a copy/substitution algorithm to model analogy, carrying sequences of transformations from one description to the other. All of Galatea's inputs are hand-generated, as is its background knowledge, and it has only been tested on a few
examples. Mitchell and Hofstadter's Copycat [23] modeled analogy as an aspect of high-level perception, using comparisons between letter strings as the domain. Copycat was domain-specific, and even the potential correspondences between items were hand-coded (the slipnet), making it less flexible than SME, which is domain-independent.
7 Discussion
We have described a model of the Oddity Task, using CogSketch to automatically encode stimuli in terms of qualitative spatial representations, MAGI to detect symmetry, and SME and SEQL to carry out the task itself. We showed that this combination of modules can achieve behavior comparable to the participants in Dehaene et al.'s study of American and Mundurukú performance on the same stimuli. Furthermore, we were able to provide some evidence about possible causes for performance differences between the groups, through statistical analysis of ablation experiments on the model. We find these results quite exciting on their own, but they are also part of a larger pattern. That is, similar combinations of qualitative representations and analogical processing have already been used to model a variety of visual processing tasks [19,20,25]. This study lends further evidence for our larger hypotheses, that (1) qualitative attributes and relations are central to human visual encoding and (2) people compare low-level visual representations using the same mapping process they use for abstract analogies. The study also lends support to the proposal that (3) comparison operations are performed using either a shape representational focus or an edge representational focus.
We plan to pursue two lines of investigation in future work. First, this paper focused on difficulties related to encoding. Our model suggests difficulties involving comparisons may also be implicated. For example, a problem might be harder because the six images in the array are less similar, making alignment and generalization production more difficult. We plan to explore how well aspects of the comparison process can explain the variance. Of particular interest is whether their contributions are universal, or whether there will be cultural differences. Second, we plan on using these analyses to construct more detailed models of specific groups performing this task (i.e., children and adults, as well as both cultures). Comparing these models to each other, and to models of similar spatial tasks, could help identify general processing constraints on such tasks. This may shed light on how universal human spatial representations and reasoning are, both across cultures and across tasks.
Acknowledgements. This work was supported by NSF SLC Grant SBE-0541957, the Spatial Intelligence and Learning Center (SILC). We thank Elizabeth Spelke for providing the original oddity task stimuli.
References 1. Abravanel, E.: The Figure Simplicity of Parallel Lines. Child Development 48(2), 708–710 (1977) 2. Appelle, S.: Perception and Discrimination as a Function of Stimulus Orientation: The Oblique Effect in Man and Animal. Psychological Bulletin 78, 266–278 (1972) 3. Bhatt, R., Hayden, A., Reed, A., Bertin, E., Joseph, J.: Infants’ Perception of Information along Object Boundaries: Concavities versus Convexities. Experimental Child Psychology 94, 91–113 (2006) 4. Biederman, I.: Recognition-by-Components: A Theory of Human Image Understanding. Psychological Review 94, 115–147 (1987) 5. Croft, D., Thagard, P.: Dynamic Imagery: A Computational Model of Motion and Visual Analogy. In: Magnani, L., Nersessian, N. (eds.) Model-based Reasoning: Science, Technology, Values, pp. 259–274. Kluwer/Plenum (2002) 6. Davies, J., Goel, A.K.: Visual Analogy in Problem Solving. In: Proceedings of the International Joint Conference on Artificial Intelligence, pp. 377–382 (2001) 7. Dehaene, S., Izard, V., Pica, P., Spelke, E.: Core Knowledge of Geometry in an Amazonian Indigene Group. Science 311, 381–384 (2006) 8. Falkenhainer, B., Forbus, K., Gentner, D.: The Structure-Mapping Engine. In: Proceedings of the Fifth National Conference on Artificial Intelligence (1986) 9. Ferguson, R.W.: MAGI: Analogy-Based Encoding Using Regularity and Symmetry. In: Proceedings of the 16th Annual Conference of the Cognitive Science Society, pp. 283–288 (1994) 10. Forbus, K., Oblinger, D.: Making SME Greedy and Pragmatic. In: Proceedings of the Cognitive Science Society (1990) 11. Forbus, K., Ferguson, R., Usher, J.: Towards a Computational Model of Sketching. In: Proceedings of the 2001 Conference on Intelligent User Interfaces (IUI-2001) (2001) 12. Forbus, K., Lockwood, K., Klenk, M., Tomai, E., Usher, J.: Open-Domain Sketch Understanding: The nuSketch Approach. In: AAAI Fall Symposium on Making Pen-based Interaction Intelligent and Natural (2004) 13. Forbus, K., Usher, J., Lovett, A., Wetzel, J.: CogSketch: Open-Domain Sketch Understanding for Cognitive Science Research and for Education. In: Proceedings of the Eurographics Workshop on Sketch-Based Interfaces and Modeling (2008) 14. Gentner, D.: Structure-Mapping: A Theoretical Framework for Analogy. Cognitive Science 7(2), 155–170 (1983) 15. Gentner, D., Markman, A.B.: Structure Mapping in Analogy and Similarity. American Psychologist 52, 42–56 (1997) 16. Gentner, D., Loewenstein, J.: Relational Language and Relational Thought. In: Amsel, E., Byrnes, J.P. (eds.) Language, Literacy, and Cognitive Development: The Development and Consequences of Symbolic Communication. Lawrence Erlbaum Associates, Mahwah (2002) 17. Huttenlocher, J., Hedges, L.V., Duncan, S.: Categories and Particulars: Prototype Effects in Estimating Location. Psychological Review 98(3), 352–376 (1991) 18. Kuehne, S., Forbus, K., Gentner, D., Quinn, B.: SEQL: Category Learning as Progressive Abstraction Using Structure Mapping. In: Proceedings of the 22nd Annual Meeting of the Cognitive Science Society (2000) 19. Lovett, A., Gentner, D., Forbus, K.: Simulating Time-Course Phenomena in Perceptual Similarity via Incremental Encoding. In: Proceedings of the 28th Annual Meeting of the Cognitive Science Society (2006)
20. Lovett, A., Forbus, K., Usher, J.: Analogy with Qualitative Spatial Representations Can Simulate Solving Raven’s Progressive Matrices. In: Proceedings of the 29th Annual Conference of the Cognitive Science Society (2007)
21. Lovett, A., Sagi, E., Gentner, D.: Analogy as a Mechanism for Comparison. In: Proceedings of Analogies: Integrating Multiple Cognitive Abilities (2007)
22. Markman, A.B., Gentner, D.: Commonalities and Differences in Similarity Comparisons. Memory & Cognition 24(2), 235–249 (1996)
23. Mitchell, M.: Analogy-making as Perception: A Computer Model. MIT Press, Cambridge (1993)
24. Palmer, S.E.: Hierarchical Structure in Perceptual Representation. Cognitive Psychology 9(4), 441–474 (1977)
25. Tomai, E., Lovett, A., Forbus, K., Usher, J.: A Structure Mapping Model for Solving Geometric Analogy Problems. In: Proceedings of the 27th Annual Conference of the Cognitive Science Society (2005)
26. Wertheimer, M.: Gestalt Theory. In: Ellis, W.D. (ed.) A Sourcebook of Gestalt Psychology, pp. 1–11. The Humanities Press, New York (1924/1950)
Modelling Scenes Using the Activity within Them Hannah M. Dee, Roberto Fraile, David C. Hogg, and Anthony G. Cohn School of Computing, University of Leeds, Leeds LS2 9JT, United Kingdom {hannah,rf,dch,agc}@comp.leeds.ac.uk Abstract. This paper describes a method for building visual “maps” from video data using quantized descriptions of motion. This enables unsupervised classification of scene regions based upon the motion patterns observed within them. Our aim is to recognise generic places using a qualitative representation of the spatial layout of regions with common motion patterns. Such places are characterised by the distribution of these motion patterns as opposed to static appearance patterns, and could include locations such as train platforms, bus stops, and park benches. Motion descriptions are obtained by tracking image features over a temporal window, and are then subjected to normalisation and thresholding to provide a quantized representation of that feature’s gross motion. Input video is quantized spatially into N × N pixel blocks, and a histogram of the frequency of occurrence of each vector is then built for each of these small areas of scene. Within these we can therefore characterise the dominant patterns of motion, and then group our spatial regions based upon both proximity and local motion similarity to define areas or regions with particular motion characteristics. Moving up a level we then consider the relationship between the motion in adjacent spatial areas, and can characterise the dominant patterns of motion expected in a particular part of the scene over time. The current paper differs from previous work which has largely been based on the paths of moving agents, and therefore restricted to scenes in which such paths are identifiable. We demonstrate our method in three very different scenes: an indoor room scenario with multiple chairs and unpredictable unconstrained motion, an underground station featuring regions where motion is constrained (train tracks) and regions with complicated motion and difficult occlusion relationships (platform), and an outdoor scene with challenging camera motion and partially overlapping video streams. Keywords: Learning, Spatial relations, Computer vision, Modeling behaviour, Qualitative reasoning.
1
Introduction and Motivation
The ability to reason about the things we see in video streams is influenced by our ability to break down the spatial structure of such scenes into semantically meaningful regions. In our day-to-day talk about behaviour (“The chicken
crossed the road”, for example) we discuss regions (roads) which might be visually determined by clear kerb stones and line markings. However, these regions could also be functionally determined: it is easy to imagine some dirt path which has no clear visible boundaries, but which is still a road by virtue of the cars driven along it regularly (much to the peril of chickens). In this sense, roads and paths can be identified as much by typical patterns of motion as by physical structures. There are certain things we can find out from motion patterns which would be very difficult to discover through the analysis of static scene structures. For example, whilst it is possible to imagine a hypothetical scene analysis system that could identify roads and roundabouts from static images, determining what side of the road people drive on or which way around the roundabout people travel would require analysis of motion. Within the field of Computer Vision there is a body of work concerning the modelling of scene structure through tracking visible agents, and this work identifies such emergent, functional paths. In scenes with limited behavioural repertoires (Fernyhough et al. [7] call these “strongly stylised domains”) and in which the behaviour of interest is detectable from trajectories alone, such systems work well. In scenes where finer grained ideas of motion are of interest (such as around chairs and benches, which we might be interested in detecting as the loci of sitting and standing activities) trajectory based systems have difficulties. In areas where behaviour is not as constrained (such as on a train platform, where paths have little meaning) the trajectory based systems also have difficulties. Strong occlusion is also a problem for trajectory based systems, and much work considers the problem of maintaining tracks through occlusion. In this paper we sidestep this difficult problem by using what we call “tracklets”, which are short indicative bursts of motion, and by working at the level of image features rather than tracked unitary objects. The current paper makes two contributions: we apply feature based tracking (as used in the activity modelling community, e.g. in [13]) to the problem of modelling scene geography, and we do this within a qualitative framework to extract descriptions that can be used within Qualitative Spatial Reasoning (QSR) systems. This allows us to label regions of unconstrained scenes, some of which are difficult for computer vision systems to handle.
2
Related Work
Whilst there is a large literature on modelling spatial regions using a priori ideas about space and motion, or previously crafted maps, the current paper falls in the category of scene modelling from automated analysis of video. Work in scene modelling has thus far concentrated on the analysis of gross patterns of motion, such as the trajectories of tracked people (or other moving objects) or on optical flow patterns. Systems which work at the level of the entire trajectory are able to construct models of the way in which agents move through the scene. Johnson and
Hogg in [10] create models of activity within an environment for prediction and classification of human behaviour by learning behaviour vectors describing typical motion. Stauffer and Grimson [19] take trajectories and perform a hierarchical clustering, which brings similar motion patterns together (activity near a loading bay, pedestrians on a lawn). Makris and Ellis in e.g. [16] learn scene models including entrances and exits, and paths through the scene, and use these for tracker initialisation and anomaly detection; a similar approach is used in [17]. The rich scene models obtained from trajectory modelling have been used to inform either about the observed behaviour (e.g., typicality detection as in [10]) or about the scene (e.g., using models of trajectory to determine the layout of cameras in a multi camera system as in [11]), or to detect specific behaviour patterns of interest (e.g., elderly people falling down [17]). These systems all rely upon tracking moving objects through the scene and upon having entire tracks, which means that they are susceptible to tracker errors (such as those caused by occlusion). Because of the underlying reliance on background subtraction these systems are also very susceptible to camera shake. In many of these systems several hours of training data are required. Xiang and Gong, in [21], use pixel change history combined with dynamic probabilistic networks (specifically, various types of hidden Markov model (HMM)) to learn temporal and causal relationships between observed image patterns. Their work differs from ours in that they aim to detect and model events and the relationships between them directly and statistically. We are interested in modelling the spatial structure of a scene symbolically. A related approach centred more upon scene modelling is that presented in [1], working directly from image data using a forest of HMMs, and learning activity patterns (regions of scene in which there is increased activity). Activity discovery or recognition, however, generally works on a smaller scale, dealing with features or patterns of motion rather than trajectories, and is concerned with determining whether a particular video sequence contains an example of a learned activity (running, jumping, or more fine grained activities such as particular tennis shots). Efros et al., in [6], present early work on activity modelling using flow where they determine motion type and pose for “30 pixel man”, in which database data is labelled and then matched to input video using normalised optical flow. Laptev, in [13], describes his “Space-Time Interest Points”, which are spatio-temporal features developed from the 2D spatial-only Harris interest points, and with Pérez [14] more recently extends this to deal with the classification of even finer grained actions (smoking and drinking). Dalal et al. in [5] use similar techniques (based upon histograms of gradients) for the detection of humans in video; their trained detector uses both appearance and motion cues so can be seen as using activity to aid detection. Gryn et al. in [8] use hand-crafted direction maps to detect particular patterns of interest (using a particular door, or making an illegal left turn at an intersection). These direction maps are regularly spaced vector fields representing the direction of motion at locations of interest, and are scene specific, detecting motion in a particular image plane location. Colombo et al. in [4] take a different tack,
modelling regular scene changes on a smaller temporal scale (escalator movement, cyclic advertising hoardings) as part of the background model using Markov models – modelling scene motion in order to be able to ignore it. Our work is related to many of these approaches: we learn histograms from feature data, and use these to build models of activity within the scene. The method characterises scene projections by patterns of accumulated motion over an extended period as opposed to short-term motion patterns used in most earlier work (e.g. motion history templates). The aim of this paper is to show that these techniques can be used to move towards qualitative scene representations, which will facilitate qualitative reasoning about scene activities.
3
Feature Detection and Tracking
The “KLT” tracker is suggested in [18] building upon work from [15,20] and is based upon the insight that feature selection for tracking should be guided by the tracker itself: the features we should be tracking are precisely those features which are easy to track. Whilst in general the KLT feature tracker is good at following features across the image, there is a trade-off between longevity and reliability. This trade-off provides a practical bound on the length of our descriptors, or tracklets: by reinitialising the tracker every M frames (in the current implementation, M = 50) we have tracks which are reliable but long enough to provide a descriptive representation of feature motion. These M frame tracks are then split into two, which gives us a pair of temporally linked short (yet reliable and descriptive) tracklets which with M = 50 last one second each. We believe these tracklets comprise a promising representational level, but in the current work we do not exploit this fully, and consider just the displacement
Fig. 1. A frame of video showing two sets of tracklets: most recent (just completed) in blue; previous in green. These give a robust indication of motion in the image plane without committing to any object segmentation or scale.
between start and end points, using the tracklet as a robust means of getting to this. In order to do this, we look at the gross motion within each tracklet, thresholding on the angle from the vertical θ and the distance travelled d between first and last points. This descriptor is one of up, up-right, right, down-right, …, or still. Tracklets are classified as still if their total movement d is below a threshold α: in the current implementation α = 2 pixels, which we find allows for considerable camera shake whilst still detecting most significant motion. This calculation is set out in Table 1. This directional quantization is similar to the system described in [8], although they work with optical flow rather than tracked features and match their motion descriptors to hand-crafted templates.

Table 1. An illustration of the direction calculations with associated thresholds. θ is the angle between start and end points of the tracklet; d is the total displacement between start and end points (in pixels). α has been set to two pixels in the current implementation.

Label        Short label   Classification criteria
Still        S             d < α
Up           U             −π/8 < θ < π/8
Up-right     UR            π/8 < θ ≤ 3π/8
Right        R             3π/8 < θ ≤ 5π/8
Down-right   DR            5π/8 < θ ≤ 7π/8
...          ...           ...
Up-left      UL            −π/8 ≥ θ > −3π/8
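As a rough illustration of this quantization step, here is a minimal Python sketch that classifies a tracklet by the gross displacement between its first and last points; the function name, the point format, and the use of rounded π/4 sectors are illustrative assumptions, while the α = 2 pixel threshold and the eight labels follow Table 1.

    import math

    # Direction labels in clockwise order starting from "up"; image y grows downwards.
    LABELS = ["U", "UR", "R", "DR", "D", "DL", "L", "UL"]

    def quantize_tracklet(points, alpha=2.0):
        """Classify a tracklet (a list of (x, y) pixel points) by its gross motion."""
        (x0, y0), (x1, y1) = points[0], points[-1]
        dx, dy = x1 - x0, y1 - y0
        if math.hypot(dx, dy) < alpha:
            return "S"                               # still: total displacement below alpha
        # Angle from the vertical, measured clockwise; -dy because image y points down.
        theta = math.atan2(dx, -dy)
        sector = int(round(theta / (math.pi / 4))) % 8   # eight sectors of width pi/4
        return LABELS[sector]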
4 Spatial Quantization: Histograms of Features
In order to characterise patterns of motion across the scene, we collect frequency counts for each of these directional quanta in different image regions. We do this by dividing the scene into N × N pixel bins and by accumulating a histogram of each directional relation in each bin based upon the start of the tracklet. In the current implementation, N is 16 - this works well in the scenes under consideration allowing us to avoid a large proportion of empty bins, yet generate regions detailed enough to capture the structure of the scene. A figure illustrating the types of histogram we observe is shown in Figure 2. These histograms are learned through observation over a period of time: the scenes in our experiments are all over 10 minutes long, with the longest being about 30 minutes. We have applied these processes (feature tracking and then spatial histogramming) to three video datasets. The datasets are: a 30 minute video taken in a university common room, featuring chairs in which people occasionally sit and
Fig. 2. A screenshot from the chair dataset with grid overlaid, showing histograms calculated from different scene cells. Cell A near the top of the door does not see much movement, and the movement that is observed is R and L corresponding to the opening and closing motion of the door. Cell B is at a medium height on the wall behind the chairs and sees motion both to the left and the right due to people moving backwards and forwards behind the row of chairs. C, in the door region, has a major peak in its histogram corresponding to motion to the left, due to people opening the door and going out through it, and a less pronounced peak at R, presumably corresponding to the door closing again.
drink tea or coffee; a 30 minute video from the UK Home Office “i-LIDS” (Imagery library for intelligent detection systems [9]) dataset of an underground station, including platform, train track region and a bench where passengers occasionally sit and wait for trains; and a 14 minute video of a busy roundabout intersection, taken from the top of a 20 metre pole using an experimental 2 camera setup (containing considerable camera shake as a result). We have not attempted to correct any issues with these datasets by pre-processing. These will be called the chair, i-LIDS and roundabout scenes. Figure 3 shows the generated histogram information presented as a bitmap for each scene and for each direction.
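A minimal sketch of the spatial binning described above, assuming tracklets are supplied as lists of (x, y) points and reusing the quantize_tracklet sketch from Section 3; the 16-pixel cell size matches the current implementation, and the data structures are our own.

    from collections import Counter, defaultdict

    CELL = 16  # N x N pixel bins, with N = 16 as in the current implementation

    def accumulate_histograms(tracklets, alpha=2.0):
        """Build a direction histogram for each CELL x CELL cell of the scene.

        Each tracklet (a list of (x, y) points) is assigned to a cell by its start point,
        and that cell's histogram of direction labels is incremented accordingly.
        """
        hists = defaultdict(Counter)
        for points in tracklets:
            x0, y0 = points[0]
            cell = (int(x0) // CELL, int(y0) // CELL)
            hists[cell][quantize_tracklet(points, alpha)] += 1
        return hists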
5
Dominant Patterns of Motion
Having calculated the motion histograms for each cell of the input video, the next stage is to use these to segment the visual scene into regions characterised by similar patterns of motion. This section describes two methods for achieving this: the first uses one direction alone (the significant direction) as a basis for clustering, and the second uses unsupervised learning techniques (K-means) to determine which prototypical direction histograms best partition the space.
Row annotations for Fig. 3 (columns, from left to right: Chair, taken within a busy room; the i-LIDS underground station dataset; Roundabout, featuring an experimental multi-camera setup and extraneous metadata). The top row shows thumbnails of the input videos. With the frequency maps associated with “still”, you see a representation of where features are generally found. Up: here we see the chairs in the chair scene, movement on the platform and the bench underground, and an artifact of the split screen in the roundabout scene. Up-left: here we see the chairs in the chair scene, movement around the bench in the underground scene and the exit to the roundabout at the left of the image. Left: we see some movement behind the chairs, some movement in the train area of the underground scene, and the nearground side of the roundabout. Down-left: here we see movement associated with the chairs, the platform edge (this is an artifact of the feature tracker), and the relevant portion of the roundabout. Down: here we see movement associated with the chairs (again), the platform of the tube scene, and a small amount of movement in the far ground of the roundabout scene. Down-right: here we see the chairs, the far part of the platform and the bench, and the far entrance onto the roundabout. Right: here we see the area behind the chairs, the far platform of the tube scene, and the upper part of the roundabout (farthest from the camera). Up-right: here we see the chairs in the chair scene, the edge of the platform in the tube scenario (see Up-left), and the right entrance of the roundabout scene.
Fig. 3. Histogrammed direction data (one row per bin) showing evident patterns of motion within each input scene
5.1
Using Dominant Direction to Categorise Regions
The simplest means of categorising motion within a square is just to consider each direction independently: essentially treating each directional histogram bin as a separate “image”. A small amount of standard image pre-processing is carried out on each channel (normalisation to scale each direction so frequency counts fall in the range [0, 1], median filtering to smooth, and morphological opening to create coherent regions). The results are then thresholded at the median value for that direction. This results in a binary image for each direction showing those parts of the scene where the amount of motion in that direction is significant. To create a single segmentation based upon principal direction of motion, we use Markov Random Fields (MRFs) [2,3,12] in the place of simple thresholding. MRFs provide a graph-based formalism for segmentation through energy minimisation: in this case we define energy as a function of both the input frequency histogram (the data term) and the cost of labelling adjacent squares differently (the smoothness term):

    C(f) = C_data(f) + C_smooth(f)    (1)

We use a smoothness term which penalises adjacent labels which are different and does not penalise adjacent labels which are the same (thus encouraging uniform regions). We have a smaller penalty for labels which are “one out”, which has the effect of lowering the penalty for adjacent regions with adjacent directions (right and up-right, for example). This can be thought of as decreasing the penalty term for labels which are conceptual neighbours as well as physical neighbours. Equation 2 provides details of the smoothness term for two adjacent squares i and j; k is a constant set in these experiments to be 0.5.

    C_smooth(i, j) = k     if i and j have different labels
    C_smooth(i, j) = k/2   if i and j are conceptual neighbours    (2)
    C_smooth(i, j) = 0     if i and j have the same label

The advantage of the MRF framework is that it provides a means of creating smooth segmentations which preserve sharp boundaries where they exist; the main disadvantage when used in this context is that it does not allow multiple labels per element. Figure 4 shows the binary images generated by thresholding for each direction for each scene, and the MRF-generated joint segmentation.
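A small sketch of the pairwise cost in Equation 2, assuming the eight direction labels are encoded as integers 0–7 in cyclic order (up, up-right, ..., up-left) so that conceptual neighbours are labels one step apart; k = 0.5 follows the text, everything else is illustrative.

    def smoothness_cost(label_i, label_j, k=0.5, n_labels=8):
        """Pairwise MRF cost C_smooth(i, j) for two adjacent cells (Equation 2)."""
        if label_i == label_j:
            return 0.0
        diff = abs(label_i - label_j) % n_labels
        if min(diff, n_labels - diff) == 1:   # conceptual neighbours, e.g. right and up-right
            return k / 2
        return k

In a full implementation this pairwise cost would be combined with a data term derived from the per-cell frequency histograms and minimised with a standard energy-minimisation solver.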
5.2 Region Classification Using Unsupervised Learning
Whilst considering individual directions provides an informative segmentation of the scene region it requires areas to be characterised by movement in just one direction. This is not, in many cases, a valid assumption. Rather than consider each direction independently, this section describes the use of unsupervised clustering of the histograms themselves, to provide a single label for each bin. We cluster the motion histograms using the K-means
Comments on the rows of Fig. 4 (columns: Chair, i-LIDS, Roundabout): the Up-left and Up-right directions highlight the exit to the roundabout and the traffic passing the exit. Left motion and right motion both highlight the “train” area of the i-LIDS dataset, and the foreground and background sections of the roundabout. Down-left, left, and down-right seem to highlight the bench in the i-LIDS scene. Note the clear identification of the far side of the roundabout. The final row shows the Markov Random Field segmentation.
Fig. 4. Considering each direction independently. Final row shows the result of using an MRF to combine these to form an overall segmentation, rather than using a set threshold on each direction alone.
Rows of Fig. 5 (columns: Chair, i-LIDS, Roundabout): K-means (K=9) as a raw scene partition; K-means followed by MRF smoothing.
Fig. 5. Learned motion patterns used for scene partitioning, with clusters learned for each scene. Colour coding in this figure is chosen within each scene: darker regions in one scene are not necessarily related to darker regions in another.
Columns of Fig. 6, from left to right: Input, K=10, K=12, K=14.
Fig. 6. Learned motion patterns used for scene partitioning, with clusters learned across all scenes. Despite different values of K, the bench in the i-LIDS scene has been identified as similar in motion pattern to the chairs in the chair scene. In this figure, the colour coding changes between values of K but is consistent across scenes. For example, in the K=10 column the dark grey region which makes up the majority of the column corresponds to a vector representing very little motion.
algorithm, and then we use these clusters as the basis for our segmentation. As before, we use a Markov random field to smooth the segmentation. We use a smoothness term which does not consider conceptual neighbours, as it is more difficult to determine an ordering on the 8-dimensional input vectors (the dimensions being: up; up-left; left; down-left; down; down-right; right; and up-right). Thus the
Fig. 7. Illustrations of the learned cluster centres. The size of the arrow is proportional to the frequency with which that direction was observed. These illustrations are clusters learned across all scenes when K=10.
smoothness term has a penalty for neighbouring squares which differ in category, and no penalty for neighbouring squares which are the same. The distance measure used is Euclidean distance between histograms. Figure 5 shows the partitioning of each scene given by the use of K-means clustering, and the same partitioning after application of an MRF. The images in Figure 5 illustrate segmentations obtained by training on each scene individually. The motivation for this is that we might expect the motion patterns of vehicles at a roundabout to be different to those of people in an underground station, or in a university common room. However we might also expect there to be a certain amount of similarity in motion between the scenes. Applying K-means to all three datasets at once provides us with motion descriptors which are not individually tailored to each scene but which capture similarities between motion in each, and the results of this are shown in Figure 6. Figure 6 includes diagrams drawn with different values of K (the number of clusters). In each of these, similar patterns appear. Figure 7 shows cluster centres learned across all scenes when K=10, corresponding to the second column in Figure 6. This figure shows quite clearly that the observed patterns do not correspond to single dominant directions, but often to pairs of opposites.
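A hedged sketch of the clustering stage, assuming each cell's histogram is available as an 8-vector of direction frequencies; scikit-learn's KMeans is used purely for illustration (the paper does not name an implementation), and Euclidean distance between normalised histograms matches the distance measure described above.

    import numpy as np
    from sklearn.cluster import KMeans

    def cluster_cells(histograms, k=9, seed=0):
        """Cluster per-cell direction histograms into k prototypical motion patterns.

        histograms -- dict mapping a cell index to an 8-vector of direction frequencies
        Returns the cell keys, their cluster labels, and the learned cluster centres,
        which can then be smoothed with an MRF as in Section 5.1.
        """
        cells = list(histograms)
        X = np.array([histograms[c] for c in cells], dtype=float)
        X /= np.maximum(X.sum(axis=1, keepdims=True), 1e-9)   # normalise each histogram
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X)
        return cells, km.labels_, km.cluster_centers_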
6
Evaluation
Informally, various scene elements can be identified – in the i-LIDS scene, the track region is clear, in the chair scene, the chairs are clear, and in the roundabout there is an obvious structure in the right place. More formal evaluation is difficult as the generation of ground truth for motion segmentation is not a trivial matter. We are concerned not with the way in which the scene is superficially structured, but the way in which people interact with the scene as they move around. For example, whilst the roundabout dataset is indeed a roundabout, the majority of traffic goes straight across and turning traffic is fairly uncommon. In the i-LIDS dataset, the platform has a number of associated motion patterns, which differ from region to region (in some areas,
Rows of Fig. 8, from top: screen shot from video; rough “ground-truth”; MRF based on dominant direction; K-means followed by MRF smoothing, learned individually from each scene; K-means across all three scenes followed by MRF smoothing.
Fig. 8. Ground truth with various segmentations: dominant direction, motion patterns learned per scene and motion patterns learned across all scenes
hardly anybody waits, but in others there are often people milling around). Despite these acknowledged difficulties we believe that comparison with a hand-marked-up ground truth is the best way to evaluate this work and have generated a simple region-based segmentation against which to compare our output. This is shown in Figure 8, alongside various outputs. From Figure 8 we can see that many of the identified ground truth image regions have parallels in the segmentations. The MRF based upon dominant direction alone is the least like the ground truth segmentations; whilst it is possible to find similarities it would be generous to say that these segmentations were clear. With the segmentations learned for each scene individually the scene structure is more evident. The chair scene in particular has clearly highlighted the chairs as regions of heightened motion (although not the door). Within the i-LIDS dataset there is an unexpected distinction between regions of the train platform; the middle area where most people chose to wait is associated with a different cluster centre to the far and nearground, and there appears to be some form of emergent “path” heading to and from the bench. The edge of the platform and the train region have both emerged from the observed data. In the roundabout
scene the near and far sections stand out very well, as does the left hand feeder branch to the roundabout. Finally considering the segmentations created by learning over all scenes simultaneously (the final line of images in Figure 8) we can begin to detect similarities between the regions defined in each scene. Whilst we cannot claim to have constructed something that can detect chairs and benches it is however fair to say that the clusters associated with the chairs in the chair scene (marked as pale grey in the ground truth) also seem to be associated with the bench in the i-LIDS scene (marked as black in the ground truth). The roundabout scene is not segmented as clearly in the combined segmentations as in the individual segmentations, presumably as this scene contains strongly directional motion (each section effectively being a one-way street).
7
Conclusions and Future Directions
This paper has presented a novel approach for the unsupervised learning of spatial regions from motion patterns. Our aim is to create segmentations of input video which correspond to semantically meaningful regions in an unsupervised fashion, and then to use these semantically meaningful regions within a qualitative spatial reasoning (QSR) framework. We have made considerable progress towards this aim, and have generated segmentations which correspond in part to ground truth segmentations of three experimental scenes. Our method is robust to camera shake and background changes in a way that the existing path based systems are not (due to their reliance on some form of background model). Further investigation is required to determine which varieties of input are most useful to this type of system: the directional histograms used here could be augmented by information about speed, for example, and we are investigating ways to further exploit the tracklet representation. We have carried out informal investigations in the variation of histogram bin size (resulting in the 16 by 16 bins reported here) but a more thorough study could be useful, and the optimal size will almost certainly be scene dependent. The use of overlapping bins or pyramidical representations is also something we wish to pursue. Perhaps more interestingly, further investigation is needed into the detection of common patterns across different scenes, perhaps within a supervised or semi-supervised machine learning framework. The similarity between segmentation of the bench in the i-LIDS dataset and the chairs in the chair dataset is a promising sign, and it would be an interesting experiment to collect video of many scenes containing chairs or benches and see if we can learn their associated motion patterns from observation. The scenes under consideration in this paper contain various types of motion constrained in various ways, and perhaps because of this the two broad approaches outlined in this paper (dominant direction vs. K-means clustering) perform differently in each scene. The dominant direction thresholding results in clear images of the roundabout scene, which is an example of what Fernyhough called a strongly stylised domain. As such we should expect strong directions to
emerge. There are certain aspects of a roundabout which cannot be modelled in terms of dominant direction alone; what we have is a sequence of observations caused by motion of objects in the real world subject to certain spatial and temporal constraints. Incorporating temporal information in some way might also detect patterns in the i-LIDS scene such as those caused by trains entering and passengers leaving the station.
Acknowledgements This work was supported by EPSRC project LAVID, EP/D061334/1. We would like to thank Joe Sinfield for assistance with data collection and Mark Conmy for technical assistance.
References
1. Bicego, M., Cristiani, M., Murino, V.: Unsupervised scene analysis: a hidden Markov model approach. Computer Vision and Image Understanding (CVIU) 102, 22–41 (2006)
2. Boykov, Y., Kolmogorov, V.: An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26(9), 1124–1137 (2004)
3. Boykov, Y., Veksler, O., Zabih, R.: Efficient approximate energy minimization via graph cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 20(12), 1222–1239 (2001)
4. Colombo, A., Leung, V., Orwell, J., Velastin, S.A.: Markov models of periodically varying backgrounds for change detection. In: Visual Information Engineering, London, UK (2007)
5. Dalal, N., Triggs, B., Schmid, C.: Human detection using oriented histograms of flow and appearance. In: Proc. European Conference on Computer Vision (ECCV), pp. 428–441 (2006)
6. Efros, A.A., Berg, A.C., Mori, G., Malik, J.: Recognizing action at a distance. In: Proc. International Conference on Computer Vision (ICCV), Nice, France (2003)
7. Fernyhough, J.H., Cohn, A.G., Hogg, D.C.: Generation of semantic regions from image sequences. In: Proc. European Conference on Computer Vision (ECCV), Cambridge, UK, pp. 475–484 (1996)
8. Gryn, J.M., Wildes, R.P., Tsotsos, J.: Detecting motion patterns via direction maps with application to surveillance. In: Workshop on Applications of Computer Vision, pp. 202–209 (2005)
9. Home Office Scientific Development Branch, UK: i-LIDS: Imagery library for intelligent detection systems, http://scienceandresearch.homeoffice.gov.uk/hosdb/cctv-imaging-technology/video-based-detection-systems/i-lids/
10. Johnson, N., Hogg, D.C.: Learning the distribution of object trajectories for event recognition. Image and Vision Computing 14(8), 609–615 (1996)
11. KaewTraKulPong, P., Bowden, R.: Probabilistic learning of salient patterns across spatially separated, uncalibrated views. In: Intelligent Distributed Surveillance Systems, pp. 36–40 (2004)
12. Kolmogorov, V., Zabih, R.: What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 26(2), 147–159 (2004)
13. Laptev, I.: On space-time interest points. International Journal of Computer Vision 64(2/3), 107–123 (2005)
14. Laptev, I., Pérez, P.: Retrieving actions in movies. In: Proc. International Conference on Computer Vision (ICCV) (2007)
15. Lucas, B.D., Kanade, T.: An iterative image registration technique with an application to stereo vision. In: International Joint Conference on Artificial Intelligence, pp. 674–679 (1981)
16. Makris, D., Ellis, T.: Learning semantic scene models from observing activity in visual surveillance. IEEE Transactions on Systems, Man and Cybernetics 35(3), 397–408 (2005)
17. McKenna, S.J., Charif, H.N.: Summarising contextual activity and detecting unusual inactivity in a supportive home environment. Pattern Analysis and Applications 7(4), 386–401 (2004)
18. Shi, J., Tomasi, C.: Good features to track. In: Proc. Computer Vision and Pattern Recognition (CVPR), pp. 593–600 (1994)
19. Stauffer, C., Grimson, E.: Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI) 22(8), 747–757 (2000)
20. Tomasi, C., Kanade, T.: Detection and tracking of point features. Technical Report CMU-CS-91-132, Carnegie Mellon (1991)
21. Xiang, T., Gong, S.: Beyond tracking: Modelling activity and understanding behaviour. International Journal of Computer Vision 67(1), 21–51 (2006)
Pareto-Optimality of Cognitively Preferred Polygonal Hulls for Dot Patterns Antony Galton School of Engineering, Computing and Mathematics, University of Exeter, UK
Abstract. In several areas of research one encounters the problem of generating an outline that is in some way representative of the spatial distribution of a pattern of dots. Several different algorithms have been published which can generate such outlines, but the detailed evaluation of such algorithms has mostly concentrated on their computational and mathematical properties, while the adequacy of the resulting outlines themselves has been left as a matter of informal human judgment. In this paper it is proposed to investigate the perceptual acceptability of outlines independently of any particular algorithm for generating them, in order to determine objective criteria for evaluating outlines from the full range of possibilities in a way that is conformable to human intuitive assessments. For the sake of definiteness it is assumed that the outline to be produced is a simple closed polygon whose vertices are elements of the given dot pattern, all remaining elements of the dot pattern being in the interior of the polygon. It is hypothesised that to produce a cognitively acceptable outline one should seek simultaneously to minimise both the area and perimeter of the polygon, and that therefore the points in area-perimeter space corresponding to cognitively optimal outlines will lie on or close to the Pareto front. A small pilot study was conducted, the results of which lend strong support to the hypothesis. The paper concludes with some suggestions for further more detailed investigations. Keywords: Polygonal hulls, dot patterns, perceived shape, multiobjective optimisation.
1
Introduction
When presented with a two-dimensional pattern of dots such as the one shown in Figure 1, and asked to draw a polygonal outline which best captures the shape formed by the pattern, people readily respond by drawing outlines such as those shown in Figure 2. Interestingly, on first encountering this task, people often tend to imagine that there is a unique solution, ‘the’ outline of the dots; but they will very quickly be persuaded that there is typically no unique best answer. Only the convex hull has any claim to uniqueness, but in very many cases (such as the example shown), the convex hull is a bad solution to the task, since it does not capture the shape that we humans perceive the dots as forming. This is illustrated in Figure 3, where two distinct point-sets, having the shape of the letters ‘C’ and ‘S’, have the same convex hull.
Fig. 1. A simple dot pattern
Fig. 2. Example outlines for the same dot pattern
The problem of representing a dot pattern by means of an outline which in some way captures the shape defined by the pattern, or the region of space occupied by the dots, has been investigated over a number of years by researchers from a number of different disciplines [1,2,3,4,5,6]. There may be several distinct motivations underlying such investigations; for example: – Map generalisation. What appears at one level of detail as a set of discrete points may be better represented, at a coarser level of detail, as a region. The region indicates the general location and configuration of the points, but does not indicate how many points there are or their individual positions.
Fig. 3. Point-sets with the same convex hull
– Region approximation for storage and retrieval efficiency. Geographical regions typically have complex sinuous outlines which place a high load on storage and retrieval capacity when represented digitally. Web-based digital gazetteers require efficient ways of recording the locations associated
with region names; but traditional approximations such as bounding boxes, centroids, or convex hulls are far too crude for most purposes, and detailed information concerning the region’s boundary may be unnecessarily complex, even if available. What is needed is an approximation to a region which may be efficiently generated from available information about points known to lie inside (or outside) the region [7,8]. – Gestalt perception. As humans, we typically perceive a cluster of points as occupying some two-dimensional region in the visual field, and can describe, at least roughly, the outline of the region they occupy. If we want to emulate this capacity of the human visual system in a computer vision system, we need to be able to compute the region from the points. – Representation and reasoning about collective phenomena such as flocking animals, traffic, or crowds. In such cases the ‘ground truth’ consists of a set of individuals each with its own typically point-like location at any given time, but for many purposes it is desirable to think of the phenomenon as a single entity spread out across a region which may be thought of as the ‘spatial footprint’ of the collective [9]. Different motivations may result in different criteria for evaluating the quality of the outlines produced by any proposed method, but it is striking that, for the most part, existing literature on the problem has had little to say about these criteria, focusing rather on the technical details and computational characteristics of different algorithms. The main aim of this paper is to redress this imbalance by focussing on the question of evaluation criteria rather than particular algorithms.
2
Previous Work
As mentioned above, there is already a considerable body of work, much of it in the pattern analysis, computer vision, and geographical information science communities, on defining the shape of dot patterns. A typical paper in this area will propose an algorithm for generating a shape from a pattern of dots, explore its mathematical and/or computational characteristics (e.g., computational complexity), and examine its behaviour when applied to various dot patterns. The evaluation of this behaviour is typically very informal, often amounting to little more than observing that the shape produced by the algorithm is a ‘good approximation’ to the perceived shape of the dots. While lip-service is generally paid to the fact that there is no objective definition of such a ‘perceived shape’, little is said about how to verify this, or indeed, about exactly what it means. The much-cited work of Edelsbrunner et al. [1] introduces the notion of α-shape: whereas the convex hull of a point-set S is the intersection of all closed half-planes containing all the points of S, their ‘α-hull’ is the intersection of all closed discs of radius 1/α containing all points of S (for α < 0 the closed disc of radius 1/α is interpreted as the complement of an open disk of radius −1/α, and for α = 0 it is a half-plane). The α-shape is a piecewise linear curve derived in a straightforward manner from the α-hull. For certain (typically small negative) values of α, the α-shape can come close to capturing the cognitively salient
aspects of the overall distribution of points. The authors go into considerable detail concerning the mathematical properties of these shapes, but almost the only thing stated by way of evaluating the adequacy of the shapes produced by the algorithm is that ‘α-shapes . . . seem to capture the intuitive notion of “finer” or “cruder shape” of a planar pointset’. Similar reticence is shown by others who have followed. Garai and Chaudhuri [2] propose a ‘split-and-merge’ procedure, which starts by constructing the convex hull of the points, and then successively inserts extra edges or smooths over zigzags. The splitting procedure results in a highly jagged outline, which is then made smoother by the merging procedure. But again, the authors say almost nothing on the evaluation of the results of the algorithm, although it is clear that one purpose of reducing the jaggedness of the outline is to improve its cognitive acceptability. Melkemi and Djebali [3] propose the A-shape: given a finite set of points P and a set A disjoint from P, the A-shape of P is obtained from the Voronoi diagram for A ∪ P by joining any pair of points p, q ∈ P whose Voronoi cells border each other and also the Voronoi cell of a point in A. The edges pq are the ‘A-exposed’ edges of the Delaunay triangulation of A ∪ P. The A-shape was introduced ‘with the aim of curing the limits of α-shape’, and the authors have only a little more to say on its evaluation: their explicit aim is to look for a ‘polygonal representation’ that ‘must reflect the forms perceived by a human observer of the dot patterns set’. Chaudhuri et al. [4] also make explicit reference to human visual perception. Their r-shape is obtained by constructing the union U_r of all disks of radius r centred on points of P, and then, for p, q ∈ P, selecting edge pq if the boundaries of the discs centred on p and q intersect on the boundary of U_r; the r-shape of P is the union of the selected edges. In the same paper they discuss the s-shape, obtained by partitioning the space into a lattice of s × s squares and then taking the union of those squares which contain points of P. They confine their attention to regular dot patterns, in which ‘the points are clearly visible as well as fairly densely and more or less evenly distributed’ (unlike, for example, our Figure 1). For such patterns they say that ‘one can perceive the border of the point set’, and see their problem as ‘extracting the border that is compatible with the perceived shape of the input pattern’; they also speak of ‘the intuitive shape of the dot pattern’. This way of speaking seems to imply that there is a unique perceived shape, but they acknowledge that ‘“perceptual structure” of [a dot pattern] S cannot be defined uniquely’, adding that it ‘will vary from one person to another to a small extent’. But no attempt is made to determine the extent of such variation, and in evaluating the results little is said beyond the statement that ‘if ε [a real-valued scaling factor used in their algorithms] lies in the range 0.3–0.5, the extracted border is compatible with the perceptual border of the dot pattern’ — and again, no quantitative measure of degree of compatibility is given. The remainder of their evaluation concerns intrinsic features of the algorithm such as its computational complexity. Galton and Duckham [5] proposed three different algorithms for generating a region (called a ‘footprint’) from a set of points. One, the ‘swinging arm’ method,
generalises the ‘gift-wrap’ algorithm for constructing convex hulls; a line segment of length r is swung about an extremal point of the set until it encounters another point in the set; the two points are joined, and the procedure repeated from the second point, until a closed shape is produced. Additional components of the footprint will be obtained if points in the set lie outside the first component. Similar results can be obtained by joining all pairs of points separated by at most r and then selecting the peripheral joins, resulting in the ‘close pairs’ method. In the third algorithm, a region is produced by successively removing the longest exterior edges from the Delaunay triangulation of the points, subject to the condition that the region remains connected and its boundary forms a Jordan curve. In this work, more attention was paid to the question of evaluation criteria, and nine questions were listed that could be used to help classify different types of solution to the general problem of associating a region with a set of points. But like the work previously reviewed, this paper shied away from any detailed examination of the concept of ‘perceived shape’ other than noting that any such examination must ‘go beyond computational geometry to engage with more human-oriented disciplines such as cognitive science’. Moreira and Santos [6] proposed a ‘concave hull’ algorithm which is an alternative generalisation of the gift-wrap algorithm, in which at any stage only the k nearest neighbours of the latest point added to the outline are considered as candidates for the next addition. They state the problem as that of finding ‘the polygon that best describes the region occupied by the given points’, and acknowledge that the word ‘best’ here is ambiguous, what counts as a best solution being application dependent; but evaluation of the algorithm is largely confined to its computational characteristics and not the adequacy of the results, for which they do little more than refer to the criteria listed in [5]. Outputs from this algorithm (for Pattern 5 in Appendix A) are shown in Figure 4. In work currently in press, Duckham et al. [10] present more detailed evaluation for the Delaunay-based method first presented in [5], leading to a conclusion that ‘normalized parameter values of between 0.05–0.2 typically produce optimal or near-optimal shape characterization across a wide range of point distributions’, but it is acknowledged that what ‘optimal’ means here is both underspecified and somehow connected with ‘a shape’s “visual salience” to a human’. The actual evaluation presented in [10] takes the approach of starting with a well-defined shape, generating a dot pattern from it, and then testing the algorithm’s efficacy at reconstructing the original shape. The purpose of the present paper is to take some first steps towards establishing some principles for evaluating any proposed solution to the problem of determining an outline for a set of points. Whereas previous work has mostly been concerned with proposing particular algorithms for generating outlines, here I propose that, independently of any particular algorithm, we consider a full range of possible outlines, and try to determine what features, describable in objective (e.g., geometrical) terms, influence cognitive judgments as to the suitability of an outline as a depiction of ‘the’ shape defined by the set of points.
Fig. 4. Polygonal hulls generated by the Concave Hull algorithm [6], for k = 4, 5, 6, 7, 8, and 10
3
The Scope of the Inquiry
In order to bring the treatment to manageable proportions, we first make some assumptions about the kind of solution that is being sought. Many, though by no means all, of the published algorithms produce outlines satisfying the following criteria:

1. The outline is a polygon whose vertices are members of the dot pattern.
2. Any member of the dot pattern which is not a vertex of the polygon lies in the interior of the polygon.
3. The boundary of the polygon forms a Jordan curve (so in particular no point is encountered more than once in a full traversal of the boundary).

We shall call such outlines polygonal hulls of the underlying dot pattern; for brevity, we shall usually just refer to them as ‘hulls’. The outlines shown in Figures 2 and 4 are of this kind. We exclude from consideration curvilinear outlines, outlines which exclude one or more points of the dot pattern, outlines which include all points of the dot pattern in their interior, outlines which are topologically non-regular, self-intersecting outlines, etc. Examples of two such excluded outlines are shown in Figure 5. It is obvious that the vertices of the convex hull for any dot pattern will appear as vertices of all of the polygonal hulls for that pattern; and moreover, in any of the polygonal hulls, the convex-hull vertices will appear in the same sequential order around the perimeter. In general we may represent a dot pattern as K ∪ I, where K is the set of vertices of the convex hull and I is the set of dots in the interior of the convex hull. Let the clockwise ordering of convex
Fig. 5. Two non-examples: these do not count as polygonal hulls
hull vertices be p_1, p_2, . . ., p_k, and let the interior dots be q_1, q_2, . . ., q_{n−k}. Then the sequence of vertices of any polygonal hull for the dot pattern will consist of p_1, p_2, . . ., p_k in that order, interspersed with some selection from q_1, q_2, . . ., q_{n−k} in some order. How many polygonal hulls are there for a pattern of n dots? We can easily calculate an upper bound. From the above observations, for an n-point dot pattern whose convex hull has k vertices, we can select a polygonal hull by a sequence of four choices: (1) choosing how many interior dots q_i will be vertices of the hull (say r dots, where 0 ≤ r ≤ n − k); (2) choosing which r of the n − k available interior dots will be vertices of the hull (C(n−k, r) choices); (3) in the clockwise traversal of the hull starting from p_1, choosing which r of the k + r − 1 remaining vertex positions will be assigned to interior dots (C(k+r−1, r) choices); (4) choosing in which order the r interior dots will be assigned to the r vertex positions chosen at the previous step (r! choices). Not every combination of such choices will lead to a polygonal hull (the perimeter of the resulting polygon may be self-intersecting, or some of the dots may lie outside the polygon), but each polygonal hull will arise from exactly one combination of choices. Thus the number of polygonal hulls for an n-point dot pattern with a k-vertex convex hull is at most

    Σ_{r=0}^{n−k} C(n−k, r) · C(k+r−1, r) · r!,

where C(m, r) denotes the number of ways of choosing r items from m.
For the case n = 12 and k = 7, this comes to 86,276; but the 12-point dot pattern shown in Figure 6, with seven vertices in its convex hull, actually has only 5674 polygonal hulls, approximately 6.6% of the upper bound. Even so, the number of polygonal hulls does grow rapidly as the number of dots increases, and for large values of n it becomes impracticable to compute all of them (with n = 16 we are already talking days rather than hours or minutes in the worst case). In reality, however, only a tiny fraction of the polygonal hulls are worth considering as good candidates for the ‘perceived shape’ of the dot pattern. Figure 7 illustrates three of the 5674 polygonal hulls for the dot pattern in Figure 6. The leftmost one is the convex hull. This is easily defined, has wellknown mathematical and computational properties, and might be considered as a useful representation of the dot pattern for some purposes; but as already noted, it does not usually capture the perceived shape of the pattern. The rightmost one provides a very jagged outline which does not correspond to anything that
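The upper bound can be checked directly; this short sketch evaluates the sum above and reproduces the figure of 86,276 quoted above for n = 12 and k = 7 (the function name is illustrative).

    from math import comb, factorial

    def hull_upper_bound(n, k):
        """Upper bound on the number of polygonal hulls for n dots whose convex hull has k vertices."""
        return sum(comb(n - k, r) * comb(k + r - 1, r) * factorial(r)
                   for r in range(n - k + 1))

    print(hull_upper_bound(12, 7))   # 86276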
Fig. 6. Dot Pattern 1
Fig. 7. Example polygonal hulls for Dot Pattern 1
we readily perceive when observing the dots on their own. The middle hull, on the other hand, does seem to capture pretty well a shape that we can readily perceive in the dots. It is certainly not unique in doing so, however, and in the pilot study reported below, only 2 out of 13 subjects drew this as their preferred hull for this pattern of dots. What factors make a polygonal hull acceptable as a representation of the ‘perceived shape’ of a dot pattern? The problem with the convex hull is that it will often include large areas devoid of dots; these are the perceived concavities in the shape, and the convex hull completely fails to account for them. Of all possible hulls, the convex hull simultaneously maximises the area while minimising the perimeter. It is the maximality of the area which causes the problem, since this correlates with the inclusion of the empty spaces represented by the concavities in the perceived outline. At the other extreme, the jagged figure on the right does very well at reducing the area, but at the cost of a greatly extended perimeter. The middle figure seems to strike a better balance, with both area and perimeter taking intermediate values, as shown in Table 1. A cognitively acceptable outline should (a) not contain too much empty space, and (b) should not be too long and sinuous. This suggests that to produce the
Table 1. Area and perimeter measurements for the hulls in Figure 7 (units of measurement arbitrary)

          Area       Perimeter
Hull 1    42761.0    783.5
Hull 2    27163.0    962.5
Hull 3    21032.0    1599.3
optimal outline we should seek to simultaneously minimise both the area and the perimeter. These are, of course, conflicting objectives, since the minimum perimeter (that of the convex hull) corresponds to the maximum area. In the language of multi-objective optimisation theory [11], we seek non-dominated solutions. A polygonal hull with area A1 and perimeter P1 is said to dominate one with area A2 and perimeter P2 (with respect to our chosen objectives of minimising both area and perimeter) so long as (A1 ≤ A2 ∧ P1 < P2 ) ∨ (A1 < A2 ∧ P1 ≤ P2 ). The hulls which are not dominated by any other hulls form what is known as the Pareto set. When plotted in area-perimeter space (‘objective space’) they lie along the Pareto front. This shows up in the graphs as the ‘south-western’ frontier of the set of points corresponding to all the hulls for a given dot pattern. Area-perimeter plots for all eight dot patterns used in the pilot study described below can be found in Appendix B. In these figures, area is plotted along the horizontal axis, perimeter along the vertical; the convex hull, with maximal area and minimal perimeter, corresponds to the point at the extreme lower right. In light of the above considerations, we propose the following Hypothesis: The points in area-perimeter space corresponding to polygonal hulls which best capture a perceived shape of a dot pattern lie on or close to the Pareto front. The next section describes a pilot study which was carried out as a first step in the investigation of this hypothesis.
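The domination relation and the Pareto set can be computed directly from the (area, perimeter) pairs; the following naive quadratic-time sketch uses names of our own choosing.

    def dominates(a, b):
        """True if hull a = (area, perimeter) dominates hull b under simultaneous minimisation."""
        (A1, P1), (A2, P2) = a, b
        return (A1 <= A2 and P1 < P2) or (A1 < A2 and P1 <= P2)

    def pareto_set(hulls):
        """Return the non-dominated (area, perimeter) pairs from a list of hulls."""
        return [h for h in hulls if not any(dominates(other, h) for other in hulls)]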
4 Pilot Study
A small pilot study was carried out to gain an initial estimation of the plausibility of the hypothesis. Eight dot patterns were presented to 13 adult subjects, who were asked to draw a polygonal outline which best captures the shape formed by each pattern of dots. An example dot pattern with two possible polygons was shown (these are our Figures 1 and 2), and more precise rules given as follows: 1. The outline must be a simple closed polygon whose vertices are members of the dot pattern; that is, it must consist of a series of straight edges joining up some or all of the dots, forming a closed circuit.
2. You do not have to include all the given dots as vertices of your outline; but any dots that are not used must be in the interior of the polygon formed, not outside it.
3. The outline must not intersect or touch itself; so outlines such as the two below are not allowed: [here the two non-examples of Figure 5 were given].
The eight dot patterns used in the pilot study are shown in Appendix A. The results of the pilot study are tabulated in Table 2. The rows of the table correspond to the eight dot patterns. For each dot pattern the following data are given:
– The number of dots in the pattern.
– The total number of polygonal hulls for the pattern.
– The number of Pareto-optimal polygonal hulls for the pattern.
– The maximum number of dominators for any individual polygonal hull.
– The number of distinct hulls generated by the subjects: the relevance of this figure is that it shows that the subjects provided a variety of different responses — for none of the dot patterns were there just one or two ‘obvious’ outlines to draw.
– The number of subjects who responded with a Pareto-optimal hull.
– The mean relative domination of the responses — this quantity is explained below.

Table 2. Results of pilot study involving 13 subjects and 8 dot patterns

Pattern  No. of dots  No. of hulls  Pareto-opt. hulls  Max. no. of dominators  Distinct responses  Pareto-opt. responses  Mean rel. dom.
1        12            5674         43                  5186                    9                   8                     0.000252
2        12           14095         81                 13023                   12                   5                     0.002640
3        11            1246         38                   996                    8                  10                     0.004943
4        12            1826         23                  1632                   10                   7                     0.002168
5        13           74710         61                 73205                   12                   6                     0.000139
6        11            3303         29                  3024                   12                   4                     0.000738
7        11            3637         36                  3322                   11                   6                     0.003473
8        11            8308         72                  7630                    5                  11                     0.000323
Our hypothesis was that hulls corresponding to some ‘perceived shape’ of the dot pattern should lie on or close to the Pareto front in the area-perimeter plot. Totalling the figures in the penultimate column of the table, we see that 57 out of the total 104 responses were Pareto-optimal. The figures in the fourth column give the number of Pareto-optimal hulls available for that dot pattern, an indication of the size of the ‘target’ if our hypothesis is correct. The fifth column in the table shows the maximum number of hulls by which any given hull for that dot pattern is dominated: it will be seen that this always falls short of the total number of hulls, but not usually by much.
A measure of the extent to which a hull falls short of being Pareto-optimal is given by the ‘relative domination’, that is, the ratio of the number of hulls which dominate it to the maximum number of hulls that dominate any one hull for that dot pattern. The relative domination for any individual hull is thus obtained by dividing the number of dominators of that hull by the number of dominators of a maximally dominated hull. The relative domination ranges from 0 for a Pareto-optimal hull to 1 for a maximally-dominated hull. For the hypothesis to be corroborated, we should expect the relative domination of subjects’ responses to be consistently close to 0, and this is indeed what we find. The rightmost column of the table shows the mean relative domination across all thirteen subjects, for each dot pattern. The highest individual value for the relative domination was 0.008578, for a response to dot pattern 2 which was dominated by 118 out of the 14,095 hulls for that pattern. Compare this with the jagged rightmost hull in Figure 7, which has a relative domination of 0.2347.

If Pareto-optimality had no influence on the subjects’ selection of polygonal hulls, we should expect the relative frequency of Pareto-optimal hulls selected for any of the dot patterns in the pilot study to approximate the relative frequency of Pareto-optimal hulls in the full set of hulls for that pattern. For example, for pattern 3, only 3% of the hulls are Pareto-optimal, which means that we should expect a Pareto-optimal hull to be chosen by 0.03 × 13 ≈ 0.4 subjects. Summing the corresponding values for all the dot patterns, we would expect about 1.1 out of the 104 responses to lie on the Pareto front, on the hypothesis that Pareto-optimality is not a relevant factor. This should be compared with the 57 Pareto-optimal responses actually observed. A chi-squared test gives χ² = 2872, considerably larger than the value of 10.827 required for statistical significance at the 0.1% level. From our observations, the chance that Pareto-optimality has no influence on subjects’ choices is effectively zero.

In Appendix C are shown, for each dot pattern, the points on the Pareto front (small dots), and the points corresponding to the hulls chosen by the subjects in the pilot study (circles). Comparing these with the full set of hulls illustrated in Appendix B, one obtains a good idea of how closely the hulls drawn by human subjects to represent the perceived shape of the dot pattern adhere to the Pareto front. In conclusion, the results of the pilot study lend considerable support to the hypothesis that the perceived shape of a dot pattern will tend to be Pareto-optimal with respect to minimising both area and perimeter.
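As a quick informal check of the statistics quoted above (this is not the author's analysis code; it simply re-derives the expected count and the chi-squared value from the figures in Table 2):

```python
# (pattern, dots, hulls, Pareto-opt. hulls, max. dominators, distinct, Pareto-opt. responses)
rows = [
    (1, 12,  5674, 43,  5186,  9,  8),
    (2, 12, 14095, 81, 13023, 12,  5),
    (3, 11,  1246, 38,   996,  8, 10),
    (4, 12,  1826, 23,  1632, 10,  7),
    (5, 13, 74710, 61, 73205, 12,  6),
    (6, 11,  3303, 29,  3024, 12,  4),
    (7, 11,  3637, 36,  3322, 11,  6),
    (8, 11,  8308, 72,  7630,  5, 11),
]
subjects = 13
observed = sum(r[6] for r in rows)                        # 57 Pareto-optimal responses
expected = sum(subjects * r[3] / r[2] for r in rows)      # about 1.1 under the null hypothesis
total = subjects * len(rows)                              # 104 responses in all

# One-degree-of-freedom chi-squared over the two categories Pareto-optimal / not.
dev2 = (observed - expected) ** 2
chi2 = dev2 / expected + dev2 / (total - expected)
print(observed, round(expected, 2), round(chi2))          # 57, ~1.1, ~2872
```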
5 Next Steps
The pilot study reported here is limited in both scale and scope. There are many possibilities for further work to examine a range of additional factors with larger-scale experiments. Here we list a number of such possibilities. 1. Choice of dot patterns. The dot patterns used in the pilot study were chosen on the basis of an informal idea that they were in some way ‘interesting’.
As such, they no doubt incorporate an unconscious bias towards patterns of a particular type. To be sure that our results remain valid over the full range of possible dot patterns, it will be necessary to adopt a more principled approach to the selection of the patterns, e.g., using a randomised procedure to generate the patterns. It will also be necessary to investigate larger dot patterns, but the inherent intractability of any algorithm to generate all the hulls for a given pattern would make this impractical for patterns much larger than those already considered. Alternative approaches, involving sampling from the full set of hulls, may have to be considered instead. 2. Choice of experimental procedures. Instead of asking subjects to draw hulls for the dot patterns presented, other tasks may also yield useful information. Examples are (a) Subjects are presented with a selection of possible outlines for a dot pattern and asked to choose the one which, for them, best represents the perceived shape of the pattern. (b) Subjects are presented with pairs of outlines for a given dot pattern and asked to select the preferred outline. (c) Subjects are presented with a selection of possible outlines for a dot pattern and asked to rank them in order of acceptability. (d) Free-form commentary: in any of the above situations, subjects are invited to explain why they judge one outline to be more acceptable than another. 3. Application context. A possible concern with any of the above procedures is that they are assumed to be conducted in the absence of any proposed application context. Subjects are not being asked to rate the outlines as good for anything in particular, but merely what looks ‘right’ to them. A priori, one might suppose that this would prove problematic for some subjects, although in the pilot study it was found that subjects were very willing to treat the task as an abstract exercise without reference to any application. However, most of the subjects in the pilot study were university-educated, many of them actually working in the university, and if a wider-ranging set of subjects is used, this may become a more serious consideration, and it may be appropriate to embed the tasks in some ‘real-world’ problem context (e.g., map generalisation) in order to provide better motivation. 4. Other objective criteria. The results of the pilot study were only examined from the point of view of the area/perimeter minimisation hypothesis. But no doubt other factors are involved: in particular, once it is established that preferred outlines tend to lie on or close to the Pareto front of the area/perimeter graph, the obvious question is what further factors influence exactly whereabouts on the Pareto front the preferred solutions will be found. As the examples in the pilot study show, the Pareto front may take various forms. The point of maximum curvature sometimes assumes the form of a wellmarked ‘knee’, to the right of which the slope is quite gentle, representing a series of hulls with increasing area but similar perimeter. A priori one might
expect the preferred hulls to lie towards the left of this series, near the knee, but the experimental results do not really bear this out. Further investigation is needed to determine what factors influence the location of the optimal hulls along the front. Factors that might be considered include sinuosity (a measure of which is the number of times the outline changes from convex to concave or vice versa as it is traversed), or the number of vertices in the hull. Both of these are to some extent correlated with perimeter, although the correlation is far from exact. One might also wish to investigate other factors such as symmetry, which undoubtedly affect visual salience. 5. Evaluation of algorithms. Having established an appropriate set of criteria for evaluating polygonal hulls, one can then begin experimenting with different algorithms. Many of the published algorithms for producing outlines of dot-patterns yield polygonal hulls in the sense defined in this paper, and an obvious first step would be to investigate to what extent these algorithms tend to produce outlines that are optimal according to the criteria that have been established. In particular, most of the existing algorithms involve a parameter — typically a real-valued length parameter, but in the case of the k-nearest neighbour algorithm of [6], it is a positive integer. It would therefore be interesting to investigate how the objective evaluation criteria vary as the parameter is varied: one could, for example, trace the path followed by an algorithm’s output in area-perimeter space as the parameter runs through the full range of its possible values, and hence find which parameter settings optimise the quality of the output. For the hulls shown in Figure 4, for example, the number of dominators in area-perimeter space are 0, 5, 4, 0, 5, and 0 respectively, suggesting that this algorithm, like our human subjects, is very good at finding hulls on or near the Pareto front. 6. Algorithm design. Going beyond this, one might also ask whether it is possible to design an algorithm with those criteria in mind, that is, to tailor an algorithm to produce hulls which are optimal with respect to the criteria. With larger point sets, one can only expect to identify the Pareto-optimal hulls to some degree of approximation, suggesting that a fruitful approach here might be to use some form of evolutionary algorithm. 7. Extension to three dimensions. Many of the ideas discussed here could probably be generalised to apply to three-dimensional dot patterns. A hull must now be a volume of space bounded by a polyhedral surface rather than an area bounded by a polygonal outline: a ‘polyhedral hull’. Some, but not all, of the algorithms that have been used for generating outlines of twodimensional dot patterns readily generalise to three dimensions; little work has been done on this, though the Power Crust algorithm of [12,13] is not unrelated. There would be obvious practical difficulties in asking experimental subjects to construct polyhedra in space rather than drawing outlines on a piece of paper, but no doubt some suitable experiments could be devised. For the time being, however, the two-dimensional case already offers ample scope for further investigation.
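Item 4 above mentions sinuosity, measured as the number of times the outline changes from convex to concave or vice versa as it is traversed. The following is a minimal sketch of one way to compute such a count; it is my reading of that phrase, not an implementation from the paper.

```python
def sinuosity(polygon):
    """polygon: list of (x, y) vertices of a simple closed polygon, in order.
    Counts sign changes of the turn direction while traversing the outline."""
    n = len(polygon)
    signs = []
    for i in range(n):
        (x0, y0), (x1, y1), (x2, y2) = polygon[i - 1], polygon[i], polygon[(i + 1) % n]
        cross = (x1 - x0) * (y2 - y1) - (y1 - y0) * (x2 - x1)
        if cross != 0:                      # skip collinear vertices
            signs.append(1 if cross > 0 else -1)
    # Compare consecutive turn signs cyclically and count the changes.
    return sum(1 for a, b in zip(signs, signs[1:] + signs[:1]) if a != b)

# A convex quadrilateral has sinuosity 0; adding a reflex 'dent' gives 2.
print(sinuosity([(0, 0), (4, 0), (4, 4), (0, 4)]))            # 0
print(sinuosity([(0, 0), (4, 0), (2, 1), (4, 4), (0, 4)]))    # 2
```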
Acknowledgments
The author wishes to thank Jonathan Fieldsend and Richard Everson for useful comments on an earlier draft of this paper, including advice on multi-objective optimisation.
References
1. Edelsbrunner, H., Kirkpatrick, D.G., Seidel, R.: On the shape of a set of points in the plane. IEEE Transactions on Information Theory IT-29(4), 551–559 (1983)
2. Garai, G., Chaudhuri, B.B.: A split and merge procedure for polygonal border detection of dot pattern. Image and Vision Computing 17, 75–82 (1999)
3. Melkemi, M., Djebali, M.: Computing the shape of a planar points set. Pattern Recognition 33, 1423–1436 (2000)
4. Chaudhuri, A.R., Chaudhuri, B.B., Parui, S.K.: A novel approach to computation of the shape of a dot pattern and extraction of its perceptual border. Computer Vision and Image Understanding 68(3), 257–275 (1997)
5. Galton, A.P., Duckham, M.: What is the region occupied by a set of points? In: Raubal, M., Miller, H.J., Frank, A.U., Goodchild, M.F. (eds.) Geographic Information Science: Proceedings of the 4th International Conference, GIScience 2006, pp. 81–98. Springer, Heidelberg (2006)
6. Moreira, A., Santos, M.: Concave hull: a k-nearest neighbours approach for the computation of the region occupied by a set of points. In: Proceedings of the 2nd International Conference on Computer Graphics Theory and Applications (GRAPP 2007), Barcelona, Spain, March 8-11 (2007)
7. Alani, H., Jones, C.B., Tudhope, D.: Voronoi-based region approximation for geographical information retrieval with gazetteers. International Journal of Geographical Information Science 15(4), 287–306 (2001)
8. Arampatzis, A., van Kreveld, M., Reinbacher, I., Jones, C.B., Vaid, S., Clough, P., Joho, H., Sanderson, M.: Web-based delineation of imprecise regions. Computers, Environment and Urban Systems 30, 436–459 (2006)
9. Galton, A.P.: Dynamic collectives and their collective dynamics. In: Mark, D.M., Cohn, A.G. (eds.) Spatial Information Theory. Springer, Heidelberg (2005)
10. Duckham, M., Kulik, L., Worboys, M., Galton, A.: Efficient generation of simple polygons for characterizing the shape of a set of points in the plane. Pattern Recognition (March 2008) (in press) (accepted for publication)
11. Deb, K.: Multi-objective Optimization Using Evolutionary Algorithms. John Wiley, Chichester (2001)
12. Amenta, N., Choi, S., Kolluri, R.: The power crust. In: Sixth ACM Symposium on Solid Modeling and Applications, pp. 249–260 (2001)
13. Amenta, N., Choi, S., Kolluri, R.: The power crust, unions of balls, and the medial axis transform. Computational Geometry: Theory and Applications 19(2-3), 127–153 (2001)
A The Dot Patterns Used in the Pilot Study
[Eight figures appear here, showing the dot patterns Pattern 1 to Pattern 8 used in the pilot study; the dot patterns themselves are not reproducible in this text rendering.]
B Area-Perimeter Plots for Pilot Study Dot Patterns
[Eight plots appear here, one per dot pattern (Pattern 1 to Pattern 8), each showing all polygonal hulls as points in area-perimeter space, with area (A) on the horizontal axis and perimeter (P) on the vertical axis.]
C Pareto Fronts, with Pilot Study Responses
[Eight plots appear here, one per dot pattern (Pattern 1 to Pattern 8), each showing the points on the Pareto front and the hulls chosen by the subjects in the pilot study, plotted in area-perimeter space (area A on the horizontal axis, perimeter P on the vertical axis).]
Qualitative Reasoning about Convex Relations
Dominik Lücke¹, Till Mossakowski¹,², and Diedrich Wolter¹
¹ SFB/TR 8 Spatial Cognition, Dept. of Computer Science, University of Bremen, P.O. Box 330440, D-28334 Bremen
² DFKI Lab Bremen, Safe & Secure Cognitive Systems, Enrique-Schmidt-Str. 5, D-28359 Bremen
Abstract. Various calculi have been designed for qualitative constraint-based representation and reasoning. Especially for orientation calculi, it happens that the well-known method of algebraic closure cannot decide consistency of constraint networks, even when considering networks over base relations (= scenarios) only. We show that this is the case for all relative orientation calculi capable of distinguishing between “left of” and “right of”. Indeed, for these calculi, it is not clear whether efficient (i.e. polynomial) algorithms deciding scenario-consistency exist. As a partial solution of this problem, we present a technique to decide global consistency in qualitative calculi. It is applicable to all calculi that employ convex base relations over the real-valued space Rn and it can be performed in polynomial time when dealing with convex relations only. Since global consistency implies consistency, this can be an efficient aid for identifying consistent scenarios. This complements the method of algebraic closure which can identify a subset of inconsistent scenarios.
Keywords: Qualitative spatio-temporal reasoning, relative orientation calculi, consistency.
1 Introduction
Since the work of [1] on temporal intervals, constraint calculi have been used to model a variety of aspects of space and time in a way that is both qualitative (and thus closer to natural language than quantitative representations) and computationally efficient (by appropriately restricting the vocabulary of rich mathematical theories about space and time). For example, the well-known region connection calculus by [2] allows for reasoning about regions in space. Applications include geographic information systems, human-machine interaction, and robot navigation. Efficient qualitative spatial reasoning mainly relies on the algebraic closure algorithm. It is based on an algebra of (often binary) relations: using relational composition and converse, it refines (basic) constraint networks in polynomial time. If algebraic closure detects an inconsistency, the original network is surely
inconsistent. If no inconsistency is detected, for some calculi, this implies consistency of the original network — not for all calculi, though. Orientation calculi focus on relative directions in Euclidean space, like “to the left of”, “to the right of”, “in front of”, or “behind of”. They face two difficulties: often, these calculi employ ternary relations, for which the theory is much less developed than for binary ones. Moreover, in this work, we show that algebraic closure can badly fail to approximate the decision of consistency of constraint networks. Hence, we look for alternative ways of tackling the consistency problem. We both refine the algebraic closure method by using compositions of higher arities, and present a polynomial decision procedure for global consistency of constraint networks that consist of convex relations. These two methods approximate consistency from below and above.
2 Qualitative Calculi
Qualitative calculi are employed for representing knowledge about a domain using a finite set of labels, so-called base relations. Base relations partition the domain into discrete parts. One example is distinguishing points on the time line by binary relations such as “before” or “after”. A qualitative representation only captures membership of domain objects in these parts. For example, it can be represented that time point A occurs before B, but not how much earlier nor at which absolute time. Thus, a qualitative representation abstracts, which is particularly helpful when dealing with infinite domains like time and space that possess an internal structure like, for example, Rn. In order to ensure that any constellation of domain objects is captured by exactly one qualitative relation, a special property is commonly required:

Definition 1. Let B = {B1, . . . , Bk} be a set of n-ary relations over a domain D. These relations are said to be jointly exhaustive and pairwise disjoint (JEPD), if they satisfy the properties
1. ∀i, j ∈ {1, . . . , k} with i ≠ j : Bi ∩ Bj = ∅
2. Dn = ⋃i∈{1,...,k} Bi

For representing uncertain knowledge within a qualitative calculus, e.g., to represent that objects x1, x2, . . . , xn are either related by relation Bi or by relation Bj, general relations are introduced.

Definition 2. Let B = {B1, . . . , Bk} be a set of n-ary relations over a domain D. The set of general relations RB (or simply R) is the powerset P(B). The semantics of a relation R ∈ RB is defined as follows:
R(x1, . . . , xn) :⇔ ∃Bi ∈ R : Bi(x1, . . . , xn)

In a set of base relations that is JEPD, the empty relation ∅ ∈ RB is called the impossible relation. Reasoning with qualitative information takes place on the symbolic level of relations R, so we need special operators that allow us to manipulate qualitative knowledge. These operators constitute the algebraic structure of a qualitative calculus.
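As a hedged illustration of Definitions 1 and 2 (not taken from the paper), general relations can be represented directly as sets of base-relation labels; the point calculus on the time line, with base relations <, =, >, serves as a small example:

```python
# Base relations of the point calculus on the time line: a JEPD set of
# binary relations over the rationals.
BASE = frozenset({'<', '=', '>'})

# A general relation is any subset of BASE (Definition 2); its semantics is
# the disjunction of its members.
universal = BASE              # complete uncertainty
impossible = frozenset()      # the impossible relation

not_after = frozenset({'<', '='})
not_equal = frozenset({'<', '>'})

# The set-theoretic operators of the next subsection come for free:
print(not_after & not_equal)                 # frozenset({'<'})
print(not_after | not_equal == universal)    # True
```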
2.1 Algebraic Structure of Qualitative Calculi
The most fundamental operators in a qualitative calculus are those for relating qualitative relations in accordance with their set-theoretic disjunctive semantics. So, for R, S ∈ R, intersection (∩) and union (∪) are defined canonically. The set of general relations is closed under these operators. Set-theoretic operators are independent of the calculus at hand; further operators are defined using the calculus semantics. Qualitative calculi need to provide operators for interrelating relations that are declared to hold for the same set of objects but differ in the order of arguments. Put differently, we need operators which allow us to change perspective. For binary calculi only one operator needs to be defined:

Definition 3. The converse (˘) of a binary relation R is defined as:
R˘ := {(x2, x1) | (x1, x2) ∈ R}

Ternary calculi require more operators to realize all possible permutations of three variables. The three commonly used operators are shortcut, homing, and inverse:

Definition 4. Permutation operators for ternary calculi:
INV(R) := {(y, x, z) | (x, y, z) ∈ R}   (inverse)
SC(R) := {(x, z, y) | (x, y, z) ∈ R}   (shortcut)
HM(R) := {(y, z, x) | (x, y, z) ∈ R}   (homing)
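As a hedged illustration (not from the paper), the three permutation operators can be applied to a ternary relation given extensionally as a set of triples; the small 'betweenness' relation used below is only an example:

```python
def INV(R):  # (x, y, z) in R  ->  (y, x, z)
    return {(y, x, z) for (x, y, z) in R}

def SC(R):   # (x, y, z) in R  ->  (x, z, y)
    return {(x, z, y) for (x, y, z) in R}

def HM(R):   # (x, y, z) in R  ->  (y, z, x)
    return {(y, z, x) for (x, y, z) in R}

# Example relation: z lies strictly between x and y on a line of four points.
points = range(4)
between = {(x, y, z) for x in points for y in points for z in points
           if min(x, y) < z < max(x, y)}

# INV and SC are involutions, HM has order three, so together they generate
# all six permutations of the three argument positions.
assert INV(INV(between)) == between
assert SC(SC(between)) == between
assert HM(HM(HM(between))) == between
```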
Additional permutation operations can be defined, but a small basis that can generate any permutation suffices, given that the permutation operations are strong (see discussion further below) [3]. A restriction to a few operations particularly eases the definition of higher-arity calculi.

Definition 5 ([3]). Let R1, R2, . . . , Rn ∈ RB be a sequence of n general relations in an n-ary qualitative calculus over the domain D. Then the operation
◦(R1, . . . , Rn) := {(x1, . . . , xn) ∈ Dn | ∃u ∈ D : (x1, . . . , xn−1, u) ∈ R1, (x1, . . . , xn−2, u, xn) ∈ R2, . . . , (u, x2, . . . , xn) ∈ Rn}
is called n-ary composition.

Note that for n = 2 one obtains the classical composition operation for binary calculi (cp. [4]), which is usually written as an infix operator. Nevertheless, different kinds of binary compositions have been used for ternary calculi, too.

2.2 Strong and Weak Operations
Permutation and composition operators define relations. Per se it is unclear whether the relations obtained by application of an operation are expressible
in the calculus, i.e. whether the set of general relations RB is closed under an operation. Indeed, for some calculi the set of relations is not closed; there even exist calculi for which no closed set of finite size can exist, e.g. for the composition operation in Freksa’s double cross calculus [5].

Definition 6. Let an n-ary qualitative calculus with relations RB over domain D and an m-ary operation φ : Bm → P(Dn) be given. If the set of relations is closed under φ, i.e. for all B ∈ Bm there is an R ∈ RB such that φ(B) = ⋃B′∈R B′, then the operation φ is called strong.

In qualitative reasoning we must restrict ourselves to a finite set of relations. Therefore, if some operation is not strong in the sense of Def. 6, an upper approximation of the true operation is used instead.

Definition 7. Given a qualitative calculus with n-ary relations RB over domain D and an operation φ : Bm → P(Dn), the operator φ˘ : Bm → RB,
φ˘(B1, . . . , Bm) := {B ∈ B | B ∩ φ(B1, . . . , Bm) ≠ ∅},
is called a weak operation, namely the weak approximation of φ.

Note that the weak approximation of an operation is identical to the original operation if and only if the original operation is strong. Further note that any calculus is closed under weak operations. Applying weak operations can lead to a loss of information, which may be critical in certain reasoning processes. In the literature the weak composition operation is usually denoted by ⋄.

Definition 8. We call an m-ary relation R over Rn convex if, for every choice of x1, . . . , xm−1 ∈ Rn, the set {y ∈ Rn | R(x1, . . . , xm−1, y)} is a convex subset of Rn.
3 Constraint Based Qualitative Reasoning
Qualitative reasoning is concerned with solving constraint satisfaction problems (CSPs) in which constraints are expressed using relations of the calculus. Definitions from the field of CSP are carried over to qualitative reasoning (cp. [6]). Definition 9. Let R be the general relations of a qualitative calculus over the domain D. A qualitative constraint is a formula R(X1 , . . . , Xn ) (also written X1 . . . Xn−1 R Xn ) with variables Xi taking values from the domain and R ∈ R. A constraint network is a set of constraints. A constraint network is said to be a scenario if it gives base relations for all relations R(X1 , . . . , Xn ) and the base relations obtained for different permutations of variables X1 , . . . , Xn must be agreeable wrt. the permutation operations.
One key problem is to decide whether a given CSP has a solution or not. This can be a very hard problem. Infinity of the domain underlying qualitative CSPs inhibits searching for an agreeable valuation of the variables. This is why decision procedures that purely operate on the symbolic, discrete level of relations (rather than on the level of the underlying domain) receive particular interest.

Definition 10. A constraint network is called consistent if a valuation of all variables exists such that all constraints are fulfilled. A constraint network is called n-consistent (n ∈ N) if every solution for n − 1 variables can be extended to an n-variable solution involving any further variable. A constraint network is called strongly n-consistent if it is m-consistent for all m ≤ n. A CSP in n variables is globally consistent if it is strongly n-consistent.

A fundamental technique for deciding consistency in a classical CSP is to enforce k-consistency by restricting the domain of variables in the CSP to mutually agreeable values. Backtracking search can then identify a consistent variable assignment. If the domain of some variable gets restricted down to zero size while enforcing k-consistency, the CSP is not consistent. This procedure, except for the backtracking search (which is not applicable in infinite domains), is also applied to qualitative CSPs [4]. For a JEPD calculus with n-ary relations any qualitative CSP is strongly n-consistent unless it contains a constraint with the empty relation. So the first step in checking consistency would be to test (n + 1)-consistency. In the case of a calculus with binary relations this would mean analyzing 3-consistency, also called path-consistency. This is the aim of the algebraic closure algorithm, which exploits that composition lists all 3-consistent scenarios.

Definition 11. A CSP over binary relations is called algebraically closed if for all variables X1, X2, X3 and all relations R1, R2, R3 the constraint relations
R1(X1, X2),  R2(X2, X3),  R3(X1, X3)
imply R3 ⊆ R1 ⋄ R2.

To enforce algebraic closure, the operation R3 := R3 ∩ (R1 ⋄ R2) (as well as a similar operation for converses) is applied for all variables until a fixpoint is reached. Enforcing algebraic closure preserves consistency, i.e., if the empty relation is obtained during refinement, then the qualitative CSP is inconsistent. However, algebraic closure does not necessarily decide consistency: a CSP may be algebraically closed but inconsistent — even if composition is strong [7]. Algebraic closure has also been adapted to ternary calculi using binary composition [8]. Binary composition of ternary relations involves 4 variables, but it may not be able to represent all 4-consistent scenarios. Scenarios with 4 variables are specified by 4 ternary relations. However, binary composition R1 ⋄ R2 = R3 only involves 3 ternary relations. Therefore, using n-ary composition in reasoning with n-ary relations is more natural (cp. [3]).
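For a binary calculus the refinement step above is easy to sketch in code. The following is a hedged illustration, not the paper's implementation, using the point calculus with base relations <, =, > and its standard composition table; the function names are my own.

```python
from itertools import product

BASE = {'<', '=', '>'}
COMP = {  # composition of base relations of the point calculus
    ('<', '<'): {'<'}, ('<', '='): {'<'}, ('<', '>'): {'<', '=', '>'},
    ('=', '<'): {'<'}, ('=', '='): {'='}, ('=', '>'): {'>'},
    ('>', '<'): {'<', '=', '>'}, ('>', '='): {'>'}, ('>', '>'): {'>'},
}
CONV = {'<': '>', '=': '=', '>': '<'}

def compose(R1, R2):
    """Composition of two general relations, lifted from the base-relation table."""
    return set().union(*(COMP[b1, b2] for b1 in R1 for b2 in R2))

def converse(R):
    return {CONV[b] for b in R}

def algebraic_closure(n, constraints):
    """constraints: {(i, j): set of base relations} over variables 0..n-1.
    Applies R(i,j) := R(i,j) & compose(R(i,k), R(k,j)) until a fixpoint is
    reached; returns the refined network, or False if the empty relation appears."""
    net = {(i, j): set(BASE) for i in range(n) for j in range(n) if i != j}
    for (i, j), R in constraints.items():
        net[(i, j)] &= R
        net[(j, i)] &= converse(R)
    changed = True
    while changed:
        changed = False
        for i, k, j in product(range(n), repeat=3):
            if len({i, j, k}) < 3:
                continue
            refined = net[(i, j)] & compose(net[(i, k)], net[(k, j)])
            if refined != net[(i, j)]:
                net[(i, j)] = refined
                net[(j, i)] = converse(refined)
                changed = True
                if not refined:
                    return False
    return net

# x < y, y < z and x > z cannot hold together; the empty relation appears.
print(algebraic_closure(3, {(0, 1): {'<'}, (1, 2): {'<'}, (0, 2): {'>'}}))   # False
```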
4 Reasoning about Relative Orientation
In this section we give an account of findings for deciding consistency of qualitative CSPs. Our study is based on the LR-calculus (ref. to [9]), a coarse relative orientation calculus. It defines nine base relations which are depicted in Fig. 1. The LR-calculus deals with the relative position of a point C with respect to the oriented line from point A to point B, if A ≠ B. The point C can be to the left of (l), to the right of (r) the line, or it can be on a line collinear to the given one and in front of (f) B, between A and B with the relation (i), or behind (b) A; further it can be on the start-point A (s) or on the end-point B (e). If A = B, then we can distinguish between the relations Tri, expressing that A = C, and Dou, meaning A ≠ C. Freksa’s double cross calculus DCC is a refinement of the LR-calculus and, hence, our findings for the LR-calculus can be directly applied to the DCC-calculus as well. We give negative results on the applicability of existing approaches for qualitative reasoning and discuss how computations on the algebraic level can nevertheless be helpful. We begin with a lower bound of the complexity.
Fig. 1. The nine base relations of the LR-calculus; tri designates the case of A = B = C, whereas dou stands for A = B ≠ C
Theorem 12. Deciding consistency of CSPs in LR is NP-hard.

Proof (sketch). In a straightforward adaptation of the proof given in [10] for the DCC calculus, the NP-hard problem NOT-ALL-EQUAL-3SAT can be reduced to equality of points.

Algebraic closure is usually regarded as the central tool for deciding consistency of qualitative CSPs. For the first qualitative calculi investigated (point calculus [11], Allen’s interval algebra [1]) it turned out that algebraic closure decides consistency for the set of base relations, i.e. algebraic closure gives us a polynomial-time decision procedure for consistency of qualitative CSPs when dealing with scenarios. This leads to the exponential-time algorithm for deciding consistency of general CSPs using backtracking search to refine relations in the CSP
to base relations [1]. Renz pioneered research on identifying larger sets for which algebraic closure decides consistency, thereby obtaining a practical decision procedure [12]. If however algebraic closure is too weak for deciding consistency of scenarios, no approaches are known for dealing with qualitative CSPs on the algebraic level. Unfortunately this is the case for the LR-calculus. Proposition 13. All scenarios only containing the relations l and r are algebraically closed wrt. the LR-calculus with binary composition. Proof. We have a look at the permutations of LR and see that operation operand result INV l r r l SC l r r l HM l l r r the set of {l, r} is closed under all permutations. A look at the binary composition table of LR reveals that all compositions containing only l and r on their left hand side, always have the set {l, r} included in their right hand side: operand 1 operand 2 result l l {b, s, i, l, r} r {f, l, r} l r l {f, l, r} r r {b, s, i, l, r} But with this we can conclude, that Ri,k Rk,j ∩ Ri,j = ∅ for all i, k, j, with Rn,m ∈ {l, r}.
Of course not all LR-scenarios over the relations l and r are consistent. We will show that SCEN := {(A B r C), (A E r D), (D B r A), (D C r A), (D C r B), (D E r B), (D E l C), (E B r A), (E C r A), (E C r B)} is algebraically closed but inconsistent. Algebraic closure directly follows from Prop. 13. We will show that any projection of this scenario to the natural domain R2 of the LR-calculus yields a contradiction. Therefore we construct equations
Fig. 2. Constructing equations
for the relations of the LR-calculus. In R2 the sign of the scalar product ⟨X, Y⟩ determines the relative direction of X and Y. Given three points α, β and γ that are connected by an LR-relation, we can construct a local coordinate system with origin α. It has one base vector going from α to β; we call this vector α. The vector orthogonal to this one and facing to the right is called α′, as shown in Fig. 2. The vector from α to γ is called σ. With this we get that (α β r γ) is true iff ⟨α′, σ⟩ > 0, and (α β l γ) is true iff ⟨α′, σ⟩ < 0; and of course we know that the points α, β, and γ are different points in these cases. The vectors α′ and σ are described by

α′ = (yβ − yα, xα − xβ),   σ = (xγ − xα, yγ − yα).

With this we get

(α β r γ) ⇔ (yβ − yα) · (xγ − xα) + (xα − xβ) · (yγ − yα) > 0
(α β l γ) ⇔ (yβ − yα) · (xγ − xα) + (xα − xβ) · (yγ − yα) < 0.

Scenarios of the LR-calculus are invariant wrt. the operations of translation, rotation and scaling; this means that we can fix two points to arbitrary values, and we chose to set D to (0, 0) and B to (0, 1). With this we obtain the inequations

(1) xA · yE < yA · xE
(2) xC · yA < yC · xA
(3) yE · xC < xE · yC
(4) xC < 0
(5) xE < 0
(6) 0 < xA
In fact more inequations are derivable, but already these ones are not jointly satisfiable and we conclude: Theorem 14. Classical algebraic closure does not enforce scenario consistency for the LR-calculus.
Proof. We consider the algebraically closed LR scenario SCEN and the inequations (1) to (6) that we derived when projecting it into R2, the intended domain of LR. From inequations (1), (6), (4), (5) and (3) we obtain
(xE · yC)/xC < yE < (yA · xE)/xA
and again using inequations (6), (4) and (5) we get yC · xA < xC · yA, contradicting (2). Hence our scenario is not consistent.
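The projection used in this proof is easy to reproduce numerically. The following hedged sketch (not the authors' code; the coordinates chosen for A, C and E are an arbitrary trial placement) evaluates the sign expression above for every constraint of SCEN:

```python
def lr(a, b, c):
    """Side of the directed line a -> b on which c lies, using the sign
    expression derived above: 'r' if positive, 'l' if negative, None if
    the points are collinear or coincide."""
    (xa, ya), (xb, yb), (xc, yc) = a, b, c
    s = (yb - ya) * (xc - xa) + (xa - xb) * (yc - ya)
    return 'r' if s > 0 else 'l' if s < 0 else None

# D and B fixed as in the text; A, C, E are an arbitrary trial placement.
points = {'D': (0.0, 0.0), 'B': (0.0, 1.0),
          'A': (2.0, 0.0), 'C': (-1.0, 3.0), 'E': (-2.0, -1.0)}
SCEN = [('A', 'B', 'r', 'C'), ('A', 'E', 'r', 'D'), ('D', 'B', 'r', 'A'),
        ('D', 'C', 'r', 'A'), ('D', 'C', 'r', 'B'), ('D', 'E', 'r', 'B'),
        ('D', 'E', 'l', 'C'), ('E', 'B', 'r', 'A'), ('E', 'C', 'r', 'A'),
        ('E', 'C', 'r', 'B')]

violated = [c for c in SCEN
            if lr(points[c[0]], points[c[1]], points[c[3]]) != c[2]]
# For this particular placement only (D E l C) fails; since the scenario has
# no model in R2, some constraint is violated for every placement one tries.
print(violated)
```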
As discussed earlier, ternary composition is more natural for ternary calculi than binary composition. Therefore we examined the ternary composition table of the LR-calculus¹ and conclude:

Theorem 15. Algebraic closure wrt. ternary composition does not enforce scenario consistency for the LR-calculus.

Proof. Let us have a closer look at the ternary composition operation wrt. the relations contained in SCEN, namely the relations l and r. Recall that the set {l, r} of LR-relations is closed under all permutation operations. So we only need to consider the fragment of the composition table with triples over l, r: (r, r, r) = {r}, (r, r, l) = {b, r, l}, (r, l, r) = {f, r, l}, (r, l, l) = {i, r, l}, (l, r, r) = {i, r, l}, (l, r, l) = {f, r, l}, (l, l, r) = {b, r, l}, (l, l, l) = {l}. We see that any composition that contains r as well as l in the triple on the left-hand side yields a superset of {r, l} on the right-hand side. So all composable triples that have both l and r on their left-hand side cannot yield an empty set while applying algebraic closure. So, we have to investigate how the compositions (l, l, l) and (r, r, r) are used when enforcing algebraic closure. Enumerating all composable triples (X1 X2 r1 X4), (X1 X4 r2 X3), (X4 X2 r3 X3) and their respective refinement relation (X1 X2 rf X3) yields a list with 18 entries shown in Appendix A. All of those entries list l as refinement relation whenever composing (l, l, l), and analogously for r. Thus, no refinement is possible, and the given scenario is algebraically closed wrt. ternary composition.

We believe that advancing to even higher arity composition will not provide us with a sound algebraic closure algorithm. It turns out, however, that moving to a certain level of k-consistency does indeed make a change.

Remark 16. Of course it is theoretically possible to solve these systems of inequations by quantifier elimination, or by the more optimized Cylindrical Algebraic Decomposition (CAD). Unfortunately the CAD algorithm has a double exponential worst-case running time (even though this can be reduced to polynomial running time with an optimal choice of the involved projections). Our experiments with CAD tools unfortunately were quite disillusioning, since those tools choked on our problems mainly because of the large number of involved variables (consider that each point in our scenarios introduces 2 variables in our systems of inequalities).

¹ Such a table is available via the qualitative reasoner SparQ (ref. to http://www.sfbtr8.spatial-cognition.de/project/r3/sparq/).
5 Deciding Global Consistency
In this section we will generalize a technique from [13] and we will show that this generalization decides global consistency for arbitrary CSPs over m-ary convex relations over a domain Rn. The resulting theorem transfers Thm. 5 of [14] from classical constraint satisfaction to qualitative spatio-temporal reasoning.

Theorem 17 (Helly [15]). Let S be a set of convex regions of the n-dimensional space Rn. If every n + 1 elements in S have a nonempty intersection, then the intersection of all elements of S is nonempty.

Theorem 18. A CSP over m-ary convex relations over a domain Rn is globally consistent, i.e. k-consistent for all k ∈ N, if and only if it is strongly ((m − 1) · (n + 1) + 1)-consistent.

Proof. In the first step of this proof consider an arbitrary CSP over convex m-ary relations that is strongly ((m − 1) · (n + 1) + 1)-consistent. By induction on k, which is the number of variables that can be instantiated in a strongly consistent way, we show that it is k + 1 consistent for an arbitrary k. Assume that for each tuple (X1, . . . , Xk) of these variables a consistent valuation (z1, . . . , zk) exists. For this purpose we define sets

ps(zi1, . . . , zim−1, Ri1,...,is,k+1,is+1,...,im−1) = {z | Ri1,...,is,k+1,is+1,...,im−1(zi1, . . . , zis, z, zis+1, . . . , zim−1)}

with 1 ≤ ij ≤ k and 1 ≤ s ≤ m − 1. By assumption, these are convex regions of the particular space, defined by the assignment of the variables (X1, . . . , Xk) → (z1, . . . , zk) and the particular relation Ri1,...,is,k+1,is+1,...,im−1. Let

P = {ps(zi1, . . . , zim−1, Ri1,...,is,k+1,is+1,...,im−1) | 1 ≤ s ≤ m − 1 ∧ 1 ≤ ij ≤ k}

be the set of all such convex regions. Observe that n + 1 tuples of elements of P are induced by constraints containing up to (m − 1) · (n + 1) different variables. By strong ((m − 1) · (n + 1) + 1)-consistency we know that any n + 1 of these regions have a non-empty intersection. The application of Helly’s Theorem yields

⋂p∈P p ≠ ∅.
Hence a valuation for k + 1 variables exists. The second step of this proof is trivial, since global consistency implies k-consistency for all k ∈ N.

In [7, Prop. 1] it was shown that whether composition is weak or strong is independent of the property of algebraic closure to decide consistency. However, in some cases, these two properties are related:

Theorem 19. In a binary calculus over the real line that
1. has only 2-consistent relations, and
2. has strong binary composition,
algebraic closure decides consistency of CSPs over convex base relations.

Proof (sketch). By Thm. 18 we know that strong 3-consistency decides global consistency. Since composition is strong, algebraic closure decides 3-consistency and, since we have 2-consistency, it decides strong 3-consistency too. Thus algebraically closed scenarios are either inconsistent (containing the empty relation) or globally consistent. Put differently, global consistency and consistency coincide.

Corollary 20. For CSPs over convex {LR, DCC}-relations, strong 7-consistency decides global consistency.

Proof. Follows directly from Thm. 18 for both calculi.
Corollary 21. Global consistency of scenarios in convex {LR, DCC}-relations is polynomially decidable.

Proof. Compute the set of strongly 7-consistent scenarios in constant time (e.g. using quantifier elimination²). The given scenario is strongly 7-consistent iff all 7-point subscenarios are contained in the set of strongly 7-consistent scenarios. By Thm. 18 this decides global consistency.

Unfortunately, consistency and global consistency are not equivalent in the LR-calculus.

Proposition 22. For the LR-calculus not every consistent scenario is globally consistent.

Proof. Consider the consistent scenario {(A B r C), (A B r D), (C D l A), (C D l B), (A B f E), (C D f E)}, which has a realization as shown in Fig. 3 (left); there the lines AB and CD intersect. Now consider the sub-CSP in the variables A, B, C, and D with the solution shown in Fig. 3 (right). We see that the lines AB and CD are parallel, but the constraints (A B f E) and (C D f E) demand that the point E is on the line AB as well as on the line CD. Hence the given scenario is not 5-consistent, and so it is not globally consistent.
² Here we just want to state that the computation is possible; we do not claim to suggest a practical method.
Fig. 3. Illustration for Prop. 22
6 Discussion and Conclusion
We have shown that for relative orientation calculi capable of distinguishing between “left of” and “right of”, like the LR-calculus, the composition table alone is not sufficient for deciding consistency of qualitative scenarios. We have argued that binary composition in ternary calculi in general does not provide sufficient means for generalizing algebraic closure to ternary calculi. Instead ternary composition is required. However, advancing to ternary composition, which can list 4-consistent scenarios and thus allows us to generalize algebraic closure, is still not sufficient for deciding consistency. This is a remarkable result that has implications for several relative orientation calculi to which the given proofs can be transferred:
– LR calculus [16]
– Dipole calculus [17]
– OPRA calculus family [18]
– Double-cross calculus (DCC) [19]
To conclude, for the time being we have no practical method for deciding consistency in any of the listed relative orientation calculi. This may have a dramatic impact on qualitative spatial reasoning: the highly structured spatial domain does not yet help us to implement more effective reasoning algorithms than for general logical reasoning. So far the only backbone for reasoning with relative information is given by a logic-based approach [20]. In future work the practical utility of the presented polynomial-time decision procedure given by Cor. 21 for global consistency needs to be analyzed. While the general problem of deciding consistency of constraint satisfaction problems in LR is NP-hard, it is likely to be easier for scenarios. Therefore, our future work will involve singling out tractable problem classes, and we aim at developing a method for deciding consistency of qualitative constraint satisfaction problems contained in NP, possibly finding a polynomial-time method for scenarios.
Acknowledgements
This work was supported by the DFG Transregional Collaborative Research Center SFB/TR 8 “Spatial Cognition”, projects I4-[Spin] and R3-[Q-Shape]. Funding by the German Research Foundation (DFG) is gratefully acknowledged.
References
1. Allen, J.: Maintaining knowledge about temporal intervals. Communications of the ACM, 832–843 (1983)
2. Randell, D.A., Cui, Z., Cohn, A.: A spatial logic based on regions and connection. In: Nebel, B., Rich, C., Swartout, W. (eds.) KR 1992. Principles of Knowledge Representation and Reasoning, pp. 165–176. Morgan Kaufmann, San Francisco (1992)
3. Condotta, J.F., Saade, M., Ligozat, G.: A generic toolkit for n-ary qualitative temporal and spatial calculi. In: TIME 2006: Proceedings of the Thirteenth International Symposium on Temporal Representation and Reasoning, pp. 78–86. IEEE Computer Society, Los Alamitos (2006)
4. Renz, J., Nebel, B.: Qualitative spatial reasoning using constraint calculi. In: Aiello, M., Pratt-Hartmann, I., van Benthem, J. (eds.) Handbook of Spatial Logics. Springer, Heidelberg (2007)
5. Scivos, A., Nebel, B.: Double-crossing: Decidability and computational complexity of a qualitative calculus for navigation. In: Montello, D.R. (ed.) COSIT 2001. LNCS, vol. 2205, pp. 431–446. Springer, Heidelberg (2001)
6. Dechter, R.: From local to global consistency. Artificial Intelligence 55, 87–108 (1992)
7. Renz, J., Ligozat, G.: Weak composition for qualitative spatial and temporal reasoning. In: van Beek, P. (ed.) CP 2005. LNCS, vol. 3709, pp. 534–548. Springer, Heidelberg (2005)
8. Dylla, F., Moratz, R.: Empirical complexity issues of practical qualitative spatial reasoning about relative position. In: Proceedings of the Workshop on Spatial and Temporal Reasoning at ECAI 2004 (2004)
9. Scivos, A., Nebel, B.: The finest of its class: The natural, point-based ternary calculus LR for qualitative spatial reasoning. In: Spatial Cognition, pp. 283–303 (2004)
10. Scivos, A.: Einführung in eine Theorie der ternären RST-Kalküle für qualitatives räumliches Schließen. Master's thesis, Universität Freiburg (in German) (April 2000)
11. Vilain, M.B., Kautz, H.A., van Beek, P.G.: Constraint propagation algorithms for temporal reasoning: A revised report. In: Readings in Qualitative Reasoning about Physical Systems. Morgan Kaufmann, San Francisco (1989)
12. Renz, J.: Qualitative Spatial Reasoning with Topological Information. LNCS, vol. 2293. Springer, Berlin (2002)
13. Isli, A., Cohn, A.: A new approach to cyclic ordering of 2D orientations using ternary relation algebras. Artificial Intelligence 122(1-2), 137–187 (2000)
14. Sam-Haroud, D., Faltings, B.: Consistency techniques for continuous constraints. Constraints 1, 85–118 (1996)
15. Helly, E.: Über Mengen konvexer Körper mit gemeinschaftlichen Punkten. Jber. Deutsch. Math. Verein 32, 175–176 (1923)
16. Ligozat, G.: Qualitative triangulation for spatial reasoning. In: Proc. International Conference on Spatial Information Theory. A Theoretical Basis for GIS, pp. 54–68 (1993)
17. Moratz, R., Renz, J., Wolter, D.: Qualitative spatial reasoning about line segments. In: Horn, W. (ed.) ECAI 2000. Proceedings of the 14th European Conference on Artificial Intelligence. IOS Press, Amsterdam (2000)
18. Moratz, R., Dylla, F., Frommberger, L.: A relative orientation algebra with adjustable granularity. In: Proceedings of the Workshop on Agents in Real-Time, and Dynamic Environments (IJCAI 2005) (2005)
19. Freksa, C.: Using orientation information for qualitative spatial reasoning. In: Proceedings of the International Conference GIS - From Space to Territory: Theories and Methods of Spatio-Temporal Reasoning in Geographic Space, London, UK, pp. 162–178. Springer, Heidelberg (1992)
20. Renegar, J.: On the computational complexity and geometry of the first order theory of the reals. Part I-III. Journal of Symbolic Computation 13(3), 255–300, 301–328, 329–352 (1992)
A Table of Composable l/r Triples
Each of the 18 composable triples over {l, r}, listed with the constraint to be refined and the (unchanged) result of intersecting its relation with the ternary composition:

(A C l B), (A B l D), (B C l D) → (A C l D): ◦(l, l, l) ∩ {l} = {l}
(A C l E), (A E l D), (E C l D) → (A C l D): ◦(l, l, l) ∩ {l} = {l}
(A C l B), (A B l E), (B C l E) → (A C l E): ◦(l, l, l) ∩ {l} = {l}
(E A l B), (E B l C), (B A l C) → (E A l C): ◦(l, l, l) ∩ {l} = {l}
(C D l B), (C B l A), (B D l A) → (C D l A): ◦(l, l, l) ∩ {l} = {l}
(C D l E), (C E l A), (E D l A) → (C D l A): ◦(l, l, l) ∩ {l} = {l}
(C E l B), (C B l A), (B D l A) → (C E l A): ◦(l, l, l) ∩ {l} = {l}
(E C r B), (E B r A), (B C r A) → (E C r A): ◦(r, r, r) ∩ {r} = {r}
(D A l B), (D B l C), (B A l C) → (D A l C): ◦(l, l, l) ∩ {l} = {l}
(D A l E), (D E l C), (E A l C) → (D A l C): ◦(l, l, l) ∩ {l} = {l}
(A D r B), (A B r C), (B D r C) → (A D r C): ◦(r, r, r) ∩ {r} = {r}
(A D r E), (A E r C), (E D r C) → (A D r C): ◦(r, r, r) ∩ {r} = {r}
(A E r B), (A B r C), (B E r C) → (A E r C): ◦(r, r, r) ∩ {r} = {r}
(C A r B), (C B r E), (B A r E) → (C A r E): ◦(r, r, r) ∩ {r} = {r}
(C A r E), (C E r D), (E A r D) → (C A r D): ◦(r, r, r) ∩ {r} = {r}
(C A r B), (C B r D), (B A r D) → (C A r D): ◦(r, r, r) ∩ {r} = {r}
(D C r B), (D B r A), (B C r A) → (D C r A): ◦(r, r, r) ∩ {r} = {r}
(D C r E), (D E r A), (E C r A) → (D C r A): ◦(r, r, r) ∩ {r} = {r}
Author Index
Agrawal, Shruti 202 Andonova, Elena 250 Arleo, Angelo 39 Avraamides, Marios N.
8
Barclay, Michael 216 Basten, Kai 104 B¨ ulthoff, Heinrich H. 1 Campos, Jennifer L. 1 Carlson, Laura A. 4 Chavarriaga, Ricardo 71 Cohn, Anthony G. 394 Dara-Abrams, Drew 138 Dee, Hannah M. 394 Doll´e, Laurent 71 Egenhofer, Max J.
295
Forbus, Kenneth 283, 378 Fouque, Benjamin 39 Fraile, Roberto 394 Frommberger, Lutz 311 Galton, Antony 216, 409 Gentner, Dedre 7 Gerstmayr, Lorenz 87 Girard, Benoˆıt 71 Giudice, Nicholas A. 121 Goschler, Juliana 250 Guillot, Agn`es 71 Hogg, David C. 394 Hois, Joana 266 Hurlebaus, Rebecca 104 Kastens, Kim A. 171, 202 Keehner, Madeleine 188
Kelly, Jonathan W. 22 Khamassi, Mehdi 71 Kiefer, Peter 361 Kohlhagen, Christian 56 Kuhnm¨ unch, Gregory 154 Kutz, Oliver 266 Liben, Lynn S. 171, 202 Lockwood, Kate 283, 378 Lovett, Andrew 283, 378 L¨ ucke, Dominik 426 Mallot, Hanspeter A. 87, 104 Martinet, Louis-Emmanuel 39 McNamara, Timothy P. 22 Meilinger, Tobias 1, 344 Meyer, Jean-Arcady 39 Mossakowski, Till 426 Myers, Lauren J. 171 Nedas, Konstantinos A. Pantelidou, Stephanie Passot, Jean-Baptiste Peters, Denise 154
295 8 39
Raubal, Martin 328 Reineking, Thomas 56 Richter, Kai-Florian 154 Ross, Robert J. 233, 250 Schmid, Falko
154
Tietz, Jerome D. Wiener, Jan M. Wolter, Diedrich
121 87, 104 311, 426
Zetzsche, Christoph
56