Multimodal Intelligent Information Presentation
Text, Speech and Language Technology VOLUME 27
Series Editors Nancy Ide, Vassar College, New York Jean Véronis, Université de Provence and CNRS, France Editorial Board Harald Baayen, Max Planck Institute for Psycholinguistics, The Netherlands Kenneth W. Church, AT & T Bell Labs, New Jersey, USA Judith Klavans, Columbia University, New York, USA David T. Barnard, University of Regina, Canada Dan Tufis, Romanian Academy of Sciences, Romania Joaquim Llisterri, Universitat Autònoma de Barcelona, Spain Stig Johansson, University of Oslo, Norway Joseph Mariani, LIMSI-CNRS, France
The titles published in this series are listed at the end of this volume
Multimodal Intelligent Information Presentation Edited by
Oliviero Stock ITC-irst, Trento, Italy
and
Massimo Zancanaro ITC-irst, Trento, Italy
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN 1-4020-3050-9 (PB) ISBN 1-4020-3049-5 (HB) ISBN 1-4020-3051-7 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. Sold and distributed in North, Central and South America by Springer, 101 Philip Drive, Norwell, MA 02061, U.S.A. In all other countries, sold and distributed by Springer, P.O. Box 322, 3300 AH Dordrecht, The Netherlands.
Printed on acid-free paper
All Rights Reserved © 2005 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Printed in the Netherlands
TABLE OF CONTENTS

INTRODUCTION

Part I: LIFE LIKE CHARACTERS
I. POGGI, C. PELACHAUD, F. DE ROSIS, V. CAROFIGLIO AND B. DE CAROLIS / Greta. A Believable Embodied Conversational Agent
P. PAGGIO AND B. JONGEJAN / Multimodal Communication in Virtual Environments
M. THEUNE, D. HEYLEN AND A. NIJHOLT / Generating Embodied Information Presentations

Part II: MOBILE PRESENTATIONS
J. BAUS, A. KRÜGER AND C. STAHL / Resource-Adaptive Personal Navigation
O. STOCK, M. ZANCANARO AND E. NOT / Intelligent Interactive Information Presentation for Cultural Tourism
T. RIST / Supporting Mobile Users Through Adaptive Information Presentation

Part III: NATURAL LANGUAGE GENERATION
E. ANDRÉ, K. CONCEPCION, I. MANI AND L. VAN GUILDER / AutoBriefer: A System for Authoring Narrated Briefings
G. CARENINI AND C. CONATI / Generating Tailored Worked-Out Problem Solutions to Help Students Learn from Examples
J. CALDER, A. C. MELENGOGLOU, C. CALLAWAY, E. NOT, F. PIANESI, I. ANDROUTSOPOULOS, C. D. SPYROPOULOS, G. XYDAS, G. KOUROUPETROGLOU AND M. ROUSSOU / Multilingual Personalized Information Objects
P. PIWEK, R. POWER, D. SCOTT AND K. VAN DEEMTER / Generating Multimedia Presentations from Plain Text to Screen Play
C. ZINN, J. D. MOORE AND M. G. CORE / Intelligent Information Presentation for Tutoring Systems

Part IV: VIRTUAL AND AUGMENTED REALITY
B. BELL, S. FEINER AND T. HÖLLERER / Maintaining Visibility Constraints for View Management in 3D User Interfaces
W. SWARTOUT, J. GRATCH, R. HILL, E. HOVY, R. LINDHEIM, S. MARSELLA, J. RICKEL AND D. TRAUM / Simulation Meets Hollywood
P.L. WEISS / Presentation Technologies for People With Disabilities

Part V: FUTURE DIRECTIONS
H. BUNT, M. KIPP, M. MAYBURY AND W. WAHLSTER / Fusion and Coordination for Multimodal Interactive Information Presentation

INDEX
INTRODUCTION
Information Technology often moves slowly, and the potential of innovative concepts takes time to be fully understood. We know what happened with the history of the computer and its directions of development in general. Although the achievements of hardware technology were amazing, software development methodology lagged behind and did not permit full exploitation of the hardware. Even more strikingly, it was not until the Seventies, over three decades into the history of the computer, that the computer came to be recognized as an interactive tool aimed not only at automatic computation but also at assisting human beings in any of their activities. Especially with the appearance of the personal computer and of graphic, intuitive interfaces embodying the concept of direct manipulation, the computer grew closer to naïve users. At the same time, it became evident that much further work was needed in human-computer interaction to make computers friendly and usable by a larger public.

In the following decades, besides investing in the traditional themes - computing power, infrastructure, system development methodology, etc. - Information Technology put effort into improving the intelligence of systems and their capability of understanding ways of communicating that are natural for humans. In the Nineties the web era opened a new phase, based substantially on two concepts: connectivity (one single information space, irrespective of the physical location of resources) and a simple navigational modality of interaction (inspired by the concept of hypertext, itself about as old as the computer). Mobile computing has added one further dimension to Information Technology, and pervasiveness of computing is coming into our lives.

Today everyone realizes that we share many aspects of our routine activities with computers. It is understood that something must be done to make access to information feasible for all (e.g. interpretation of requests more complex than would be possible with a menu-based interface, selection of existing information so that only the most relevant is displayed, etc.). Researchers in the field, however, are realizing that computers have the capability of doing more than just mimicking human-human communication. Presenting information flexibly, integrating various means of communication, and providing intelligent feedback are just some examples.

Intelligent Information Presentation is a field that is assuming an increasing strategic importance. Current intelligent presentation systems are task specific and refer mainly to some specific technologies, but they have something important in common. They all posit that presentations have to take into account the user's characteristics, the user's specific needs and the context in which the interaction takes place. The process of elaborating a presentation starts from some level of internal representation of the content and from some knowledge or experience of previous presentations, and exploits the available means of communication for conveying the information in the most appropriate way. In some cases, the system may even have its own communicative goals and may decide which one to pursue, as well as when and how.
At the root of the theme of Intelligent Information Presentation we can identify several scientific areas, of which at least three are fundamental. Probably the first to be mentioned is Natural Language Generation, the branch of natural language processing that deals with the automatic production of texts. The field is normally described as investigating communicative goals, the dynamic choice of what to say, the planning of the overall rhetorical structure of the text (sometimes called strategic planning), the actual realization of sentences on the basis of grammar and lexicon (sometimes called tactical planning), and so on.

With a similar objective but with different means, the field of Adaptive Hypermedia combines hypertext (hypermedia) and user modeling. In contrast to NLG, existing text or multimedia documents are manipulated. Adaptive Hypermedia systems build a model of the goals, preferences and knowledge of the individual user and use it throughout the interaction to adapt the hypertext to the needs of the user. In other words, AH takes into account the fact that users vary in knowledge, cognitive skills and reasons for searching information. By keeping a model of some aspects of the user's characteristics, the system can adapt to and aid the user in navigating and filtering information that best suits his or her goals.

A third important field is computer graphics; it underwent a fundamental shift toward the end of the Eighties, when it was understood that graphics production should start from internal representations and communicative goals, in a way similar to language production. This shift has opened the possibility of developing multimodal systems that, on the output side, consider the available modalities, and possibly the context and the user characteristics, and operate so that the message is allocated and realized in a coordinated way across several media.

Intelligent Interactive Information Presentation has gone further along that line: it relates to the ability of a computer system to automatically produce multimodal information presentations, taking into account specifics about the user, such as needs, interests and knowledge, and engaging in a collaborative interaction that helps the retrieval of relevant information and its understanding on the part of the user. It may include dimensions such as entertainment and education, opening important connections to areas that were not related to the world of human-computer interaction, such as broadcasting or cinematography.

This vision has led to novel concrete aggregations. This is evident in a number of projects, in Europe as well as in America and Japan, where the teams have included very diverse expertise. The vision has even led to the establishment of novel institutions: a good example is the Institute for Creative Technologies, created at the University of Southern California with the goal of bringing together researchers in simulation technology to collaborate with people from the entertainment industry. The idea was that "more compelling simulations could be developed if researchers who understood state-of-the-art simulation technology worked together with writers and directors who knew how to create compelling stories and characters" (see the chapter by Swartout et al. in this volume).

Looking at the market prospects, many areas of IT will benefit from results in Intelligent Interactive Information Presentation. Future developments in the field are expected to have a significant impact on business output by increasing
productivity and improving the quality of business documents' content through rich user feedback. Training of employees at a company is another area that Intelligent Interactive Information Presentation technology can significantly enhance by improving the learning process. Systems able to provide assistance to customers who face, for instance, technical difficulties with an electronic product can translate into significant savings for businesses, while also providing high-quality, tailored assistance to those customers. They can substantially simplify the presentation of information to non-expert computer users as well as provide services for individuals with various impairments. The tourism industry (about to become the most important business on the web) is an excellent example of the trend toward increasing demand from users for a personalized service. A privileged general area of application is education: Intelligent Information Presentation technology will bring a whole new experience for the user, providing personalized information that allows individuals to learn at their own pace while gaining access to relevant information that efficiently complements their knowledge base. Obviously, interactivity in the entertainment industry can also be significantly enhanced by Intelligent Information Presentation technology, by creating a personalized experience for the user. And many more areas could be identified.

With this book we have put together a collection of highly representative works that show the advancements and the trends in the field. The book is organized along five clusters.

The first cluster encompasses three contributions on information presentation using lifelike characters. The contribution by Poggi, Pelachaud, de Rosis, De Carolis and Carofiglio illustrates the architecture of Greta, a believable embodied conversational agent built in the context of the EU Project MagiCster. Greta engages users in natural conversations. It is provided with a mind, a dialogue manager and a body. The mind triggers an emotion on the basis of the agent's personality, the occurring events and the current dialogue move. The dialogue manager selects the appropriate dialogue move to perform. The body, which includes a 3D face model, is capable of expressing a set of nonverbal communicative functions.

The second chapter, by Paggio and Jongejan, describes the multimodal communication components of a virtual world application developed at the Centre for Language Technology within the Danish research project "Staging of 3D Virtual Worlds". The application is a virtual farm in which the user can interact with an autonomous farmer agent to participate in simple conversations and actions concerning work on the farm. Communication is multimodal in that the agent can respond to verbal input combined with a limited number of deictic, iconic, and turn-taking or turn-giving hand gestures. Speech and gesture inputs are synchronised and, if the relevant syntactic and semantic constraints are satisfied, a unified interpretation of both is generated by a feature-based multimodal parser.

Finally, the work of Theune, Heylen and Nijholt discusses the output modalities available for information presentation by embodied, human-like agents, which include both language and various nonverbal cues such as pointing and gesturing. Nonverbal modalities can be used to emphasize, extend or even replace the language
output produced by the agent. In this chapter, the authors discuss the issues involved in extending a natural language generation system with the generation of nonverbal signals.

The second cluster comprises three contributions on the topic of information presentation on mobile devices. Mobile navigation systems are considered one of the possible breakthrough technologies for broadband wireless Internet access (e.g. UMTS). Although quite sophisticated systems already exist for different means of transportation, e.g. car navigation systems and, more recently, pedestrian navigation systems, these systems only function in a single, well-defined context. They are designed either to be operated by a driver in a car or by a pedestrian in an outdoor scenario. In contrast, personal navigation systems are supposed to work in all of these contexts, regardless of the chosen means of transportation or other external factors. The first chapter of this cluster, by Baus, Krüger and Stahl, focuses on the conceptual and technical requirements of such a personal navigation system. They compare their approach with classical mobile navigation systems and argue that resource-adaptiveness has to be achieved on three different levels of the navigation task: the wayfinding level, the presentation planning level and the presentation rendering level.

Stock, Zancanaro and Not discuss the opportunities and challenges offered by the area of cultural heritage appreciation for innovative, natural-language-centered applications. They introduce the PEACH project, aimed at exploring various technologies for enhancing visitors' experience during their actual visit to a museum.

Finally, in his contribution, Rist discusses the problem of the overwhelming variety of new portable computing and communication devices. This variety adds a new level of complexity to the design of usable information services, since different users may not be equipped equally in terms of output and input capabilities. With a special focus on graphical representations, this contribution sketches several application scenarios taken from projects conducted at DFKI and discusses technical approaches to accomplishing the adaptation tasks that result from device restrictions, such as limited screen real estate and lack of high resolution or color.

The cluster on Natural Language Generation presents four contributions. First, the chapter by André and colleagues describes a system called AutoBriefer which automatically generates multimedia briefings from high-level outlines. The author of the briefing interacts with the system to specify some of the briefing content. In order to be scalable across domains, AutoBriefer provides a graphical user interface that allows the user to create or edit domain knowledge for use in the selection of some of the briefing content. The system then uses declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. The narration employs synthesized audio as well as, optionally, an agent embodying the narrator.

Calder and colleagues address a similar issue in their chapter. They introduce the M-PIRO project, which targets the concept of personalized information objects. M-PIRO's technology allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database and existing
free-text descriptions. The resulting descriptions, produced in three different languages (English, Greek, and Italian), are tailored according to the user's interests, background knowledge, and language skills.

Carenini and Conati describe a framework that helps students learn from examples by generating example problem solutions whose level of detail is tailored to the students' domain knowledge. When presenting a new example to a student, the framework uses natural language generation techniques and a probabilistic student model to selectively introduce gaps in the example solution, so that the student can practice applying rules learned from previous examples in problem-solving episodes of a difficulty adequate to her knowledge. Filling in solution gaps is part of the meta-cognitive skill known as self-explanation (generating explanations to oneself to clarify an example solution), which is crucial for learning effectively from examples.

Finally, Piwek, Power, Scott and van Deemter explore NLG applications in which extra media are brought into play, from the point of view of planning the layout of the document. The main point of the chapter is that the extra media are not simply added to plain text but integrated with it: thus the use of formatting, or pictures, or dialogue may require radical rewording of the text itself.

The fourth cluster comprises two contributions on the topic of Virtual and Augmented Reality. Bell, Feiner and Höllerer describe a view-management mechanism that addresses dynamically changing visibility relationships among moving objects, and show how to apply it to 3D user interfaces. The focus is on augmented reality systems, in which users wearing head-tracked, head-worn displays can view graphics overlaid directly onto the real world. Examples from an implemented view-management testbed are used to demonstrate some of the different ways in which the system and its users can exert control over what is seen in augmented reality. These examples include automatically laid out annotations that provide desired information about the surrounding environment, and a hand-held tracked physical display and head-tracked virtual situation-awareness aid that respond to user control and interact with the annotated environment.

The work of Swartout and colleagues presents the Mission Rehearsal Exercise project, which confronts a soldier trainee with the kinds of dilemmas he might reasonably encounter in a peacekeeping operation. The trainee is immersed in a synthetic world and interacts with virtual humans: artificially intelligent and graphically embodied conversational agents that understand and generate natural language, reason about world events and respond appropriately to the trainee's actions or commands.

The final cluster of the book is about the future. It consists of a contribution by Bunt, Kipp, Maybury and Wahlster that introduces a recommended action plan and a roadmap for the topic of multimodal interactive information presentation.

The occasion that gave impetus to this volume was the European CLASS project (IST-1999-12611). CLASS (http://www.class-tech.org) was launched at the request of the European Commission with the purpose of stimulating cross-project
collaboration among EC-sponsored Human Language Technology projects as well as between those projects and relevant projects world-wide. Among the initiatives of CLASS, two workshops were recently held, in Verona and in Copenhagen, with submissions from Europe, the US, Canada, Australia and Japan. Following the workshops, the authors of the best extended abstracts selected for the proceedings were invited to write a full paper for the present volume; each paper was also reviewed by authors of other papers in the collection. Finally, we would like to thank Ivana Alfaro for her help in this enterprise and Erika Belli for her work on the manuscript.
Oliviero Stock
Massimo Zancanaro
LIFE LIKE CHARACTERS
The research field of lifelike characters is a promising area for investigating challenging issues in Artificial Intelligence and Human-Computer Interfaces. Lifelike characters are a kind of multimodal interface in which the modalities are those natural to human conversation: speech and gestures as well as facial expressions and body postures. The hypothesis is that intelligent user interfaces can take advantage of embodied intelligence to facilitate human-machine interaction. From this point of view, lifelike characters constitute another argument in the long-standing debate on anthropomorphism in the interface. Indeed, human-like embodied interfaces have become popular as the front end to many web sites and as part of many computer applications. However, these interfaces have just the graphical appearance of a body, without any internal representation of the communicative functions of a body, and they often produce far from satisfactory results. The contributions in this cluster take the opposite approach: they all investigate how to properly model the communicative functions of a body in order to produce realistic behavior.

The contribution by Poggi, Pelachaud, de Rosis, De Carolis and Carofiglio illustrates the architecture of Greta, a believable embodied conversational agent built in the context of the EU Project MagiCster. Greta engages users in natural conversations. It is provided with a mind, a dialogue manager and a body. The mind triggers an emotion on the basis of the agent's personality, the occurring events and the current dialogue move. The dialogue manager selects the appropriate dialogue move to perform. The body, which includes a 3D face model, is capable of expressing a set of nonverbal communicative functions.

The second chapter, by Paggio and Jongejan, describes the multimodal communication components of a virtual world application developed at the Centre for Language Technology within the Danish research project "Staging of 3D Virtual Worlds". The application is a virtual farm in which the user can interact with an autonomous farmer agent to participate in simple conversations and actions concerning work on the farm. Communication is multimodal in that the agent can respond to verbal input combined with a limited number of deictic, iconic, and turn-taking or turn-giving hand gestures. Speech and gesture inputs are synchronised and, if the relevant syntactic and semantic constraints are satisfied, a unified interpretation of both is generated by a feature-based multimodal parser.

Finally, the work of Theune, Heylen and Nijholt discusses the output modalities available for information presentation by embodied, human-like agents, which include both language and various nonverbal cues such as pointing and gesturing. Nonverbal modalities can be used to emphasize, extend or even replace the language output produced by the agent. In this chapter, the authors discuss the issues involved in extending a natural language generation system with the generation of nonverbal signals.
I. POGGI, C. PELACHAUD, F. DE ROSIS, V. CAROFIGLIO AND B. DE CAROLIS
GRETA. A BELIEVABLE EMBODIED CONVERSATIONAL AGENT
1. INTELLIGENT BELIEVABLE EMBODIED CONVERSATIONAL AGENTS

A wide area of research on Autonomous Agents is presently devoted to the construction of ECAs, Embodied Conversational Agents (Cassell et al., 2000; Pelachaud & Poggi, 2001). An ECA is a virtual Agent that interacts with a User or another Agent through multimodal communicative behavior. It has a realistic or cartoon-like body and it can produce spoken discourse and dialogue, use voice with appropriate prosody and intonation, exhibit the visemes corresponding to the words uttered, make gestures, assume postures, and produce facial expressions and communicative gaze behavior. An ECA is generally a Believable Agent, that is, one able to express emotion (Bates, 1994) and to exhibit a given personality (Loyall & Bates, 1997). But, according to recent literature (Trappl & Payr, in press; de Rosis et al., in press a), an Agent is even more believable if it can behave in ways typical of given cultures, and if it has a personal communicative style (Canamero & Aylett, in press; Ruttkay et al., in press). This is, in fact, what makes a human a human. Moreover, an ECA must be interactive, that is, it must take User and context into account, so as to tailor the interaction to the particular User and context at hand. In an ECA that fulfils these constraints, the communicative output, that is, the particular combination of multimodal communicative signals displayed (words, prosody, gesture, face, gaze, body posture and movements), is determined by different aspects: a. contents to communicate, b. emotions, c. personality, d. culture, e. style, f. context and User sensitivity. At each moment of a communicative interaction, all of these aspects combine with each other to determine what the Agent will say, and how. In this paper we show how these aspects of an ECA can be modeled in terms of a belief and goal view of human communicative behavior. We then illustrate Greta, an ECA following these principles, which is being implemented in the context of the EU project MagiCster.

2. WHAT AND HOW WE COMMUNICATE

Before focusing on the Agent's dialogue, let us see how a single move can be represented and simulated. An Agent S (Sender) generates a set of beliefs - about
itself or about external objects and events - and has the goal of making Agent A (the Addressee) believe them. To achieve this, S produces a set of communicative signals (for example words, gestures, gaze, head, face and body behavior), taken from its communicative repertoire and temporally ordered in a particular way. Now, what are the beliefs an Agent may conceive to communicate? What is the structure of its communicative repertoire? What are the rules for the temporal ordering and synchronization of different signals? But also: how does the system go from the input - a set of beliefs - to the output - that particular arrangement of words, voice, hands, body and face signals?

Our communicative repertoire is in fact very complex, which allows us to modulate our communication in a very sophisticated way. For example, words have formal and informal variants, connotations, positive or negative, tender or insulting nuances; and not only is the verbal lexicon so rich, but we can also communicate by subtly different intonations, gestures, facial expressions, gaze, posture and spatial behavior. In each move of a dialogue, how do we choose the best way to communicate, the combination of verbal and nonverbal signals that is most fit to express our communicative goal? How do we activate the goal of using a given word, replacing it with a gesture, or using both to communicate our meaning? According to a goal and belief model of social action (Conte & Castelfranchi, 1995), choice, that is, the decision to pursue one goal instead of another, is determined by the relative values attached to the alternative goals. But the value of a goal in its turn stems from the value of its superordinate goals, or from the algebraic sum of the values of two or more of them. So, whoever discovers his car has been stolen might shout. But if this happens to someone who is just starting a job where he needs his car, his shout will be sharper or longer and his utterance more aggressive. In other words, the resources we use in a given communicative situation (a gesture in place of a word, a very colloquial term instead of a more formal one) are determined by a number of permanent and contingent factors.

3. PERMANENT AND CONTINGENT FACTORS

As an Agent enters a communicative interaction, having the goal of communicating some meanings, two kinds of factors affect the final aspect of his/her communication: permanent and contingent ones (Table 1). The former are the goals and resources coming from the Sender's biological and cultural endowment, which are always active; the latter are the goals activated and the resources provided by the contingent situation in which the Sender communicates. Long-lasting internal resources are (1) personality, (2) social identity (age, gender, cultural roots) and (3) cognitive traits. Among them, we may distinguish (4) innate and (5) culturally learned features: innate ones may be (6) a higher or lower capacity for making inferences and different reasoning styles (more abstract, intuitive, or imaginative); or (7) different aptitudes, partly depending on neurological dispositions, towards musical, mathematical, visual or linguistic skills.
Table 1. Factors affecting the choice of communication resources
(From the contents to communicate to the communicative choice.)

PERMANENT (features of the Sender S):
  1. S's personality
  2. S's social identity
  3. S's cognitive traits:
     4. Innate: 6. Inference capacity; 7. Aptitudes
     5. Learned: 8. Knowledge Base; 9. Cultural techniques, norms, values; 10. Communicative repertoire

CONTINGENT (features of the current situation):
  11. Self: 14. Physical resources (motor capacity and energy); 15. Cognitive resources (drunk, concentrated)
  12. Other: 16. Physical resources: 18. Sensory capacity; 19. Media
             17. Cognitive resources: 20. Knowledge Base; 21. Inference capacity; 22. Communicative repertoire; 23. Personality
  13. Situation: 24. Physical setting: 25. Available modalities; 26. Presence of referents
                 27. Social setting: 28. S-A relation (status, role, affect); 29. Type of encounter (service, affective); 30. Relations to others (in public, in private)
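For readers who prefer a programmatic view, the taxonomy of Table 1 could be encoded roughly as follows. This is a minimal sketch for illustration only: the class names, field names and example values are hypothetical choices made here and are not taken from the chapter or from any existing system.

```python
# Illustrative encoding of Table 1 (hypothetical names and values).
from dataclasses import dataclass, field

@dataclass
class PermanentFactors:
    """Long-lasting resources of the Sender S (Table 1, items 1-10)."""
    personality: str = "introverted"
    social_identity: dict = field(default_factory=lambda: {"age": 30, "gender": "f", "culture": "IT"})
    innate_traits: dict = field(default_factory=lambda: {"inference_capacity": 0.8, "aptitudes": ["visual"]})
    learned_traits: dict = field(default_factory=lambda: {
        "knowledge_base": ["medicine"],
        "cultural_norms_values": ["politeness"],
        "communicative_repertoire": ["Italian", "iconic_gestures"],
    })

@dataclass
class ContingentFactors:
    """Resources provided by the current situation (Table 1, items 11-30)."""
    self_state: dict = field(default_factory=lambda: {"physical": "tired", "cognitive": "concentrated"})
    other: dict = field(default_factory=lambda: {"sensory_capacity": "hard_of_hearing",
                                                 "knowledge_base": "novice",
                                                 "personality": "touchy"})
    situation: dict = field(default_factory=lambda: {"available_modalities": ["speech", "gesture"],
                                                     "social_setting": "formal",
                                                     "in_public": True})

# A presentation planner could inspect both factor sets when choosing signals,
# e.g. preferring gestures over speech when the setting is noisy.
factors = (PermanentFactors(), ContingentFactors())
print(factors)
```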
Other cognitive capacities are culturally dependent: culture entails beliefs about the environment (8) and about strategies of behavior typical of a population (see Section 8.), but also norms on how to do things, how to behave, what and how to
communicate (9), and finally the communication repertoires (verbal and nonverbal) one learns from infancy (10). The combination of personality traits, attitudes and culturally learned behavior habits gives rise to style, an idiosyncratic tendency to behave in some peculiar way that allows others to recognize the Agent's identity.

The contingent resources taken into account for the choice of the output communicative behavior are provided, instead, by the intrinsic features of the context (Poggi & Pelachaud, 2000). The presence or absence of these resources in oneself (11), the Other (12) or the Situation (13) activates specific goals in Senders as they plan their discourse and then the communicative signals to exploit. If I am very tired (physical resources, 14) I'll tend to speak softly and not to make conspicuous gestures; if my interlocutor is a bit deaf (sensory capacity of other, 18), I'll speak loudly. If he is a tourist (Other's communicative repertoire, 22) I'll talk slowly; with a student (Knowledge Base, 20) I'll explain things at length; with a not so smart person I'll explain even obvious causal links (inference capacity, 21). And if the interlocutor is a touchy person (personality, 23), I'll be more indirect in my criticism. Besides the features of oneself and the interlocutor, resources present or absent in the environment trigger different goals about what and how to communicate: in a noisy discotheque (physical setting, 24), I'll use gestures instead of words (available modality, 25), while I will not do this on the phone (absence of referents, 26). But more than the physical setting, the social setting (27) is relevant: we use more polite words with a high-status Addressee (S-A relation, 28) or in formal situations (type of encounter, 29), and less colloquial words in public than in private (relation to others, 30).

How do we choose, then, how to communicate in a given situation? The presence or lack of individual resources leads us to choose one signal or combination of signals instead of another. Our hypothesis is that every combination of an Agent's multimodal communicative behavior (that particular word, uttered with a particular intonation, while making that gesture, that gaze, facial expression, and posture) is the result of the final choice of the communicative goal to pursue. This goal is selected among the Agent's different goals, determined in their turn by contingent events (content to communicate, felt emotions, context and interlocutor) and by long-lasting features (the Agent's culture, personality and style).

In Sections 4 through 9 we overview how the above aspects (meanings to communicate, emotion, personality, culture and style) can be viewed according to a goal and belief model of human communicative interaction, and how they may be represented in an ECA. In Section 10 we describe the architecture of the ECA "Greta", while showing how some of the above principles can be applied to construct a Believable Embodied Conversational Agent.

4. MEANINGS TO CONVEY

Let us first overview the beliefs that may form the content of a communicative act. Three classes of meanings can be distinguished (Poggi, 2002b): Information on the World, Information on the Speaker's Identity and Information on the Speaker's Mind.
Information on the World. As we communicate, we provide information on concrete or abstract events, their actors and objects, and the time and space relations among them. Such information is provided mainly through words, but also by gestures or gaze. To mention the referents of our discourse, we may point at them with deictic gestures or gaze; to refer to some properties of objects we may use iconic or symbolic gestures and even gaze (as when we squeeze our eyes to mean 'small' or 'difficult').

Information on the Speaker's Identity. Physiognomic traits of our face, eyes and lips, the acoustic features of our voice and often our posture provide information on our sex, age, socio-cultural roots, and personality. And, of course, our words can inform on how we want to present ourselves.

Information on the Speaker's Mind. While mentioning events of the external world, we also communicate why we want to talk of those events, what we think and feel about them, and how we plan to talk of them. We provide information on the beliefs we are mentioning, on our own goals concerning how to talk about them, and on the emotions we feel while talking (Poggi, 2002a). About beliefs, we inform about:
1. degree of certainty: words like 'perhaps' and 'certainly'; conditional or subjunctive verb modes; but also frowning, which means 'I am serious in stating this', or opening the hands, which means 'This is self-evident';
2. metacognitive information, that is, the source of the mentioned beliefs, whether they originate from memory, inference or communication (we look up when trying to make inferences, snap our fingers while trying to remember, ...).
We inform about the following goals:
1. the performative of the sentence (by performative verbs, intonation, facial expression);
2. the topic-comment distinction (by batons, eyebrow raising, voice intensity or pitch);
3. rhetorical relations: class-example (saying "first… second… third..."; counting on fingers), topic shift (expressed through a posture shift);
4. turn-taking and backchannel: we raise a hand to ask for the turn; we nod to tell the Interlocutor we are following what she says.
Finally, we inform on the emotions we feel while talking (by affective words, gestures, intonation, facial expression, gaze and posture).

5. COMMUNICATIVE REPERTOIRE

Let us now see what the structure of the communicative repertoire is (Table 1, n. 10): the set of innate and learned features and behaviors that we can use as signals in order to convey the meanings specified above. The Human Agent is endowed with different "mode-specific communicative systems", that is, sets of rules linking meanings to signals that work in different modalities. Several parts of the Human Agent's body produce communicative signals: head, face (with eyes and mouth), hands, trunk and legs can each be viewed as the depository of a communicative system. Communicative systems may be of at least
two kinds: "codified" and "creative". In a "codified" system, or "lexicon", the same signal-meaning link is shared and coded in the Agents' memory. Not only words or symbolic gestures, but also gaze, facial expression and posture form lexicons, where each signal represented in an Embodied Agent's mind corresponds to a specific meaning. In a "creative" system, instead (for instance the system of iconic gestures), what is coded in memory is only a small set of inference rules about how to create a new signal starting from a given meaning, or about how to retrieve a meaning from a given signal (Magno Caldognetto & Poggi, 1995).

In our hypothesis, in all communicative systems meanings may be represented in terms of 'mental images' or logical propositions. For example, let Ai denote the Sender and Aj the Addressee, a an action and g a goal that may be achieved by means of a. The performative of a 'Peremptory order' may be represented as follows:
1. Goal Ai (Do Aj a)
2. (Goal Ai g) ∧ (Bel Ai (Achieve a g))
3. Goal Ai (Bel Aj (Power-on Ai Aj a))
4. If (Not (Do Aj a)) then (Feel Ai Angry)
while the performative of an 'Imploration' may be represented as follows:
1. Goal Ai (Do Aj a)
2. (Goal Ai g) ∧ (Bel Ai (Achieve a g))
3. Goal Ai (Bel Aj (Power-on Aj Ai a))
4. If (Not (Do Aj a)) then (Feel Ai Sad)
In both ordering and imploring, the Sender wants the Addressee to do some action. However, the two performatives differ in the power relationship between the two interlocutors (Ai has power over Aj in the former, and the reverse in the latter), and in the potential emotion (anger vs. sadness) in case Aj does not perform the requested action (Poggi & Pelachaud, 2000).

A compositional representation can be adopted not only on the side of the meaning, but also for the signals: a signal is represented as a combination of behavioral parameters, each with a number of possible values. Gestures are decomposed into hand shape, arm and wrist position, and type of movement; gaze into direction of the eyes, eyelid aperture, movements of the eyebrows, and so on (Poggi, 2002b). This compositional representation on both sides makes the correspondence between signals and meanings highly flexible. It also opens the possibility that a given meaning is not always conveyed by the same signal, but may be expressed through different parameters of signals taken from different communication systems (Pelachaud & Poggi, 2002). The overall communicative repertoire, in fact, manages the combination of signals taken from different communicative systems, the distribution of meanings across them, and their temporal ordering and synchronization.
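To make this compositional representation concrete, the fragment below encodes the two performatives as simple data structures and compares them component by component. It is a hedged sketch, not the MagiCster implementation: the dictionary layout, the field names and the differs_from helper are choices made here purely for illustration.

```python
# Minimal sketch: performatives as goal/belief structures (hypothetical encoding).
# Ai = Sender, Aj = Addressee, a = requested action, g = goal served by a.

PEREMPTORY_ORDER = {
    "goal":               "Goal(Ai, Do(Aj, a))",                   # Ai wants Aj to do a
    "motivation":         "Goal(Ai, g) & Bel(Ai, Achieve(a, g))",  # a serves Ai's goal g
    "power_relation":     "Goal(Ai, Bel(Aj, PowerOn(Ai, Aj, a)))", # Ai claims power over Aj
    "emotion_if_refused": "Feel(Ai, Angry)",
}

IMPLORATION = {
    "goal":               "Goal(Ai, Do(Aj, a))",
    "motivation":         "Goal(Ai, g) & Bel(Ai, Achieve(a, g))",
    "power_relation":     "Goal(Ai, Bel(Aj, PowerOn(Aj, Ai, a)))", # Aj has power over Ai
    "emotion_if_refused": "Feel(Ai, Sad)",
}

def differs_from(p1: dict, p2: dict) -> dict:
    """Return the components in which two performatives differ."""
    return {k: (p1[k], p2[k]) for k in p1 if p1[k] != p2[k]}

if __name__ == "__main__":
    # The two performatives share goal and motivation; they differ only in the
    # assumed power relation and in the emotion triggered by a refusal.
    for component, (order, implore) in differs_from(PEREMPTORY_ORDER, IMPLORATION).items():
        print(f"{component}: order={order}  imploration={implore}")
```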
6. EMOTIONS

Before showing why and how emotions are an important determinant of an Agent's communication, let us define them according to a goal and belief model of action and social interaction (Conte & Castelfranchi, 1995).

6.1. Emotions and goals

Actions in our life are often part of a plan aiming at some goal. Take for example an action of Oetzi, the prehistoric man of Similaun (ca. 3300 B.C.), who chooses a stone apt to be sharpened well and makes a lance for hunting the wild pig successfully. The actions of looking for a good stone and sharpening it are just means for the complex action of hunting. But hunting itself is aimed at feeding himself and his group, which in turn serves the goal of survival. The goals of our everyday plans are not ends in themselves: they all aim at more general goals of biological import that are common to all humans, like the biological goals of survival and reproduction and some of their subgoals: physical well-being, safety, loving and being loved, self-realization, image and self-image. These are terminal goals, which are ends in themselves and to which we assign the highest weights - so much so that, if two of them are incompatible (for instance freedom vs. life itself), giving up one of them is a heavy renunciation. With respect to terminal goals, the goals of our everyday life are instrumental goals, in that they directly or indirectly serve our terminal goals. For instance, hunting the wild pig with a sharpened stone is instrumental to survival: if the lance is not sharp enough and does not hit the wild pig to death, the wild pig might attack and kill Oetzi. Instrumental goals are more or less important to us, depending on the strength of their link with terminal goals: to the extent that an instrumental goal is likely to be the only possible means to reach a terminal goal, that instrumental goal receives a high weight, just because it inherits its weight from the terminal goal it serves.

Emotions are a biological device aimed at monitoring the achievement or threatening of our most important goals, be they terminal or instrumental (see, for instance, Carbonell, 1980; Oatley & Johnson-Laird, 1987). Any time something happens (or the Agent believes it happens) that is likely to produce the achievement or threatening of a highly weighted goal, the biological device of emotion is triggered: from the Agent's interpretation of the situation, a complex subjective state originates, generally of short duration and with different degrees of intensity. This state includes physiological, expressive and motivational aspects. If Oetzi throws his lance but sees that it has not run into the wild pig's heart, fear is triggered since his goal of survival is challenged: physiological reactions are activated (blood flowing away from the face to the limbs), some of which may show in the perceivable state of his body (pale face, tremors); and the specific goal of escaping, which might serve the terminal goal of survival, is activated.

There is a strong relationship, then, between goals and emotions: goals both cause emotions and are caused by emotions. They cause emotions since, if an important goal is achieved or threatened, an emotion is triggered: emotions are therefore a feedback device that monitors the reaching or threatening of our
highly weighted goals. At the same time, emotions activate goals and plans that are functional to re-establishing or preserving the well-being of the individual, challenged by the events that produced the emotions. So, fear triggers flight, anger triggers aggression, and guilt triggers the goal of helping the harmed person or of escaping sanction (Castelfranchi, 2000).

6.2. Emotion triggering vs. emotion display

Emotions may be implied in communication in at least two ways:
1. they may be the very reason that triggers communication: we activate the goal of communicating just because we want to express our emotion;
2. they may intervene during our communication, as a reaction to what our interlocutor is saying, or to some thought suddenly coming to our mind, whether related to the ongoing dialogue or not.
In both cases, the triggering of an emotion does not necessarily imply that the Agent displays it. There are many reasons why we may refrain from expressing our emotion, and the final (aware or unaware) decision to display it may depend on a number of factors (Prendinger & Ishizuka, 2001; De Carolis et al., 2001). Some of them concern the very nature of the emotion felt (emotion nature), others the interaction of several contextual (scenario) factors.
1. Emotion nature
a. Intensity (a more intense emotion is more likely to be displayed);
b. Valence (it is not the same to display negative or positive emotions);
c. Social evaluation (some emotions, like envy or shame, are subject to social sanction, which makes them more difficult to express);
d. Addressee (it is different to express an emotion to the one who caused it or to a third person).
2. Scenario factors
a. Agent's display motive (displaying or not depends on whether you do it to be helped or consoled, or because you want to demonstrate or teach something);
b. Agent's personality (an impulsive person is generally more prone to displaying than a reflexive or a shy one);
c. Interlocutor's features (displaying depends on the other's personality, empathy, intelligence, ...);
d. Agent-Interlocutor role relationship (whether he has power over you or you over him);
e. Agent-Interlocutor personal relationship (you might not display your being worried to someone you love, if you want to protect him);
f. Type of social interaction (being in public makes a difference for emotion display).
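As a toy illustration of this goal-monitoring view of emotions, the sketch below triggers an emotion when an event hits a highly weighted goal and then filters its display through a couple of the scenario factors listed above (a personality trait could be modeled, as Section 7 suggests, by re-weighting the goals). All weights, thresholds and factor names are assumptions invented here for clarity; they do not come from the Greta system.

```python
# Hedged sketch: emotion triggering as goal monitoring, plus a display filter.
# All weights, thresholds and factor names are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class Event:
    affected_goal: str   # which goal the event touches
    impact: float        # -1.0 (goal fully threatened) .. +1.0 (goal fully achieved)

# Terminal/instrumental goals with the weights the Agent attaches to them.
# A personality trait could be modeled by adjusting these weights (see Section 7).
GOAL_WEIGHTS = {"survival": 1.0, "image": 0.6, "self_image": 0.7}

def trigger_emotion(event: Event):
    """Return (emotion, intensity) if the event hits an important goal, else None."""
    weight = GOAL_WEIGHTS.get(event.affected_goal, 0.0)
    intensity = weight * abs(event.impact)
    if intensity < 0.5:            # weakly weighted goals do not trigger emotions
        return None
    emotion = "joy" if event.impact > 0 else ("fear" if event.affected_goal == "survival" else "distress")
    return emotion, intensity

def should_display(intensity: float, in_public: bool, addressee_has_power: bool) -> bool:
    """Scenario factors can suppress the expression of a felt emotion."""
    threshold = 0.5
    if in_public:                  # public settings raise the display threshold
        threshold += 0.2
    if addressee_has_power:        # so does addressing someone with power over the Agent
        threshold += 0.2
    return intensity >= threshold

if __name__ == "__main__":
    felt = trigger_emotion(Event(affected_goal="survival", impact=-0.9))
    if felt:
        emotion, intensity = felt
        print(emotion, intensity, should_display(intensity, in_public=True, addressee_has_power=False))
```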
7. PERSONALITY

Personality is also linked to goals: it can be viewed in terms of the weights people attribute to terminal goals (Carbonell, 1980; Poggi and Pelachaud, 2000). For example, a sociable person gives more importance than others to knowing and staying with other people. A selfish person, when the goals of physical safety and caring for others are in conflict, chooses to pursue the former, while an altruist pursues the latter. A proud person attributes a high value to self-image and autonomy, while a dependent person cares more about his image in the eyes of others than about his self-image.

Since both emotions and personality have to do with the relative importance of goals, there is also a link between emotion and personality. Some personality traits may be viewed in terms of a general 'propensity to feel emotions' (Plutchick, 1980; Poggi and Pelachaud, 1998). Picard (1997) calls this subset of personality traits 'temperament', while other authors relate them directly to one of the factors of the 'Big Five' model: for instance, neuroticism (McCrae & John, 1992). These traits imply, in a sense, a lower threshold for feeling emotions (Ortony et al., 1988). For instance, a 'shy' person is more prone to feel 'shame', especially in front of unknown people. A 'proud' person, who attributes a high weight to his goals of self-esteem and autonomy, will feel particularly proud of himself every time one of these goals is achieved. Conversely, every time they are threatened (if, say, he is obliged to ask for help), the person will feel the opposite emotion, shame. Thus, a personality trait (proud) is related to attaching a higher weight to a particular goal (self-esteem, autonomy); and, since that goal is particularly important to that kind of person, the person will feel the corresponding emotions (pride or shame) with a higher frequency or intensity.

8. CULTURE

Culture may also be viewed in terms of different weights on goals. Both a Somali shepherd and an Italian housewife have the goal of feeding themselves and their family, but the sub-goal chosen to pursue it may be, for the former, to search the bush and, for the latter, to go shopping in a store.

8.1. Goals and culture

Humans pursue their goals by using external resources (presence of food, characteristics of the territory, climate conditions) and internal resources (physical strength, body agility, manual skill, beliefs and intelligence). Since different physical environments provide different resources, each population, given the environment in which it lives, comes to accumulate a set of beliefs about the instrumental goals that most easily and economically serve the biological terminal goals in that environment. To the extent that an instrumental goal is (or is likely to be) the only possible means to reach a terminal goal, given the external conditions available, that instrumental goal will receive a higher weight than other possible ones. Thus, the instrumental goal chosen becomes a strategy of survival typical of that culture; and culture, overall, may be defined as a set of beliefs on the
typical techniques to pursue goals. Of course, it also entails beliefs about the external world. And since language is both produced by beliefs and a vehicle for them, culture typically shows up in language. Language is made of the beliefs of a population and of a way to organize them, a set of rules on how to conceptualize and categorize information. Consequently, it also implies a set of settled communicative techniques, that is, of settled instrumental goals, stating how to convey information.

Moreover, culture entails values and norms. Values are evaluative beliefs about what is good and therefore has to be pursued as a goal (Miceli & Castelfranchi, 1989). But since particular ways to behave may be good or bad according to the environment, again depending on what the most useful techniques of survival are, different populations in different environments may hold different values. Where individualistic behavior has proved convenient, individualistic values will develop, while in environments where collectivistic behavior is more fit, values centered on the family or the group will hold. Norms are obligations that rule the relationships among people in a group (Conte & Castelfranchi, 1995). In a culture more centered on interdependency, a norm (that is, a highly weighted goal) may prescribe being cooperative, even when this implies intruding in the other's affairs; in a culture centered on the individual's autonomy, the goal of keeping one's privacy will be weighted more heavily, and a norm will hold of not intruding in others' affairs and of contrasting others' intrusions. Now, since values and norms generate goals in people (the goal to pursue that value or to respect that norm), if they are thwarted, they provoke emotions. Not living up to one's values may induce shame, while violating norms may cause guilt. Therefore, if two populations have different values and norms, they will also feel these emotions in response to different events.

To sum up, culture is a set of beliefs shared by a population about the environment in which the population lives and about the best techniques (the most highly weighted instrumental goals) to reach the biological terminal goals in that environment, given the means-end relations that hold in the given physical conditions and the set of beliefs accumulated. Culture also includes beliefs about how to gather and organize beliefs, and about the norms and values that are functional to the techniques of goal achievement that best fit the surrounding environment. According to this definition, let us try to figure out how the way people communicate changes with cultural differences, by trying to distinguish what is universal (biological) and what is culturally determined in the different aspects of communication. These differences may then be simulated in a Believable Embodied Agent, in words or discourse planning, in gesture, gaze, facial expression, body posture and proxemic behavior.

8.2. Semantic rules versus norms of use in communication systems

A communication system includes two kinds of rules: semantic rules and norms of use. Semantic rules concern the correspondence between meanings and signals, be they words, signs or sentences in a verbal or sign language, or gesture or gaze items. These are rules of the following type:
if you want to communicate the meaning "I greet you", say "Hello";
if you want to communicate the meaning "I greet you", raise your eyebrows;
if you want to communicate the meaning "I greet you", wave your hand.
Norms of use, instead (those studied by Pragmatics and Sociolinguistics), do not state how some meaning has to be conveyed, but whether some meaning can, should or should not be conveyed in a given situation. They are rules of this type:
if you meet a person you know, apply the rule for the meaning "I greet you";
if you meet an unknown person, do not apply the rule for the meaning "I greet you".
Now, in some communication systems (for example spoken languages) both semantic rules and norms of use vary across cultures. But in others (typically, the facial expression of primary emotions) the semantic rules might be everywhere the same (a grimace of anger is performed in the same way in all cultures), while cultural differences hold in the norms of use (in Japan expressing anger is much more sanctioned than in the USA).

8.3. What is universal and what is cultural in speech, gestures, and face communication?

- Words and sentences. Even if at a deep level the syntax of all languages might be universal (as argued by the Chomskian Universal Grammar approach), specific words and syntactic rules differ from one language to another. Some mechanisms, like iconicity, hold in all languages, but cultural variation across languages holds at various levels: in the strategies of discourse planning, in the importance of politeness or rhetoric, and in the rules defining what to speak about, how much to mention the self, whether to convey new information or just make reference to shared knowledge, and so on.
- Gestures. Some gestures are culturally codified (e.g. the gestures for 'OK' or 'Victory'), while others are biologically codified (raising the fists to show elation). If we want to simulate the former in a culture-sensitive Agent, they will have to be varied from one culture to another. Non-codified iconic gestures, instead, might be generated through the same set of inference rules (supposedly universal), whatever culture the Agent comes from; but if the referent represented is typical of a culture, or an action is performed in a way typical of it, then a creative gesture too may be culturally dependent.
- Gaze and Facial Expression. Facial expression and gaze are more likely to be universally shared than gesture. They can communicate information on the world (we point at things with the chin or gaze, squeeze our eyes to say that something is little or difficult), information on the Speaker's beliefs, goals and emotions (we raise our eyes while we remember or make
inferences; we frown to communicate anger, concentration, or an order) and information on the Speaker's identity (our face and gaze provide information on sex, age, ethnicity, personality, sometimes even social class). Focusing on the expression of emotions: even if the so-called basic emotions (happiness, sadness, anger, fear, surprise and disgust) are felt in all cultures, and everywhere they trigger an innate universal neural program for facial expression (Ekman, 1982), this does not imply that people in different cultures always show their emotions in the same way in the same situation. First, an emotion is triggered by the cognitive categorization of a situation on the part of the subject, and a situation that in one culture, because of its beliefs, norms and values, is categorized as a cause of sadness might in a different culture be seen as a cause of happiness (take the death of a martyr in Islamic culture). Second, the emotion display is filtered not only by the cognitive and personality traits of the Agent and of the Interlocutor, their relationship and the situation, but also by the cultural norms about the expression or non-expression of given emotions.

9. STYLE

Style is an internal feature of an Agent that affects its choice of communicative resources. We define style as the idiosyncratic, stable tendency of a specific agent to exhibit specific communicative or non-communicative behaviors. Communicative style is then the tendency one has to choose some signals, or arrangements or aspects of signals, instead of others. Agent X uses formal words even in private, informal situations; Y's discourse is always thoroughly explained; Z makes ample gestures; W avoids the Addressee's gaze. Being a stable tendency in communicative choice, style is the ultimate result of the combination and interaction of the permanent goals that rule an Agent's behavior, namely the goals stemming from its personality, attitudes, and culture. A particular combination of these goals, at the moment of the communicative output, affects the choice of a particular signal or set of signals, or simply the value of some parameters in the production of signals. High-level goals implied by a specific personality get instantiated into lower-level goals. For example, in an introverted person, the general goal of not being too visible may result in a goal of avoiding overly wide gestures. A tendency to visual thought may lead to the use of metaphors in discourse. The habit of explaining to students may induce one to be always didactic.

Style can manifest itself both in the choice of whole signals (I may prefer a metaphorical to a literal word) and in the choice of a particular way to produce a signal. For example, people may differ in their style as to various aspects of gesture use (Ruttkay et al., in press). Differences may concern:
1. whether to use a gesture or not, which depends on at least three factors:
a. width of repertoire: we cannot convey a meaning by a gesture if our culture or personal gestural competence does not include a gesture for that meaning;
b. threshold: the degree of formality, as well as other factors like the need for redundancy, may provide a threshold for the use of a gesture;
c. redundancy: I may use both a word and the corresponding gesture if I want to stress some meaning or if I want to be very clear.
2. ways of performing gestures. While some parameters and sub-parameters of nonverbal signals like gestures are subject to lexical and sociolinguistic variation, some sub-parameters of movement, tense or relaxed (tension), wide or narrow (amplitude), smooth or angular (manner), might be more affected by style. That is, they are directly determined by the current goal set of the Sender and thus by his/her emotional or physiological state.
In conclusion, each signal chosen, and the combination of particular values in each parameter or sub-parameter of this signal, is determined by the combination of goals that are activated at a certain moment in the Sender, all possibly stemming from the goal set described above.
10. GRETA'S EMOTIONS, PERSONALITY, AND COMMUNICATION
In this Section, we show how some of the principles introduced above have been applied in the construction of an Embodied Agent: Greta. Specifically, the device of emotion triggering and its influence on the Agent's behavior have already been implemented, along with some aspects of its personality, communicative planning, voice and face communication. Some aspects of gestural communication, such as gesture expressiveness (Chi et al., 2000), are still to be thoroughly implemented in Greta. But this architectural model may also support adaptation to some features of the target user population, such as culture and style (de Rosis et al., in press a; Ruttkay et al., in press; Hartmann et al., 2002). In order to achieve the high degree of flexibility necessary for this aim, we base the architecture of our Agent on the following principles:
- The Mind and the Body of the agent are separated. This separation, in our opinion, helps in achieving higher flexibility in the adaptation of the Agent's behavior and appearance. It offers the opportunity to vary its mental state and its reasoning capacity according to the context, and to establish the forms of communication to adopt by considering the technical resources available. Considering an interdependency between the body and the mind would mean including, in the planning process, context-dependent rules concerning the different aspects of manifesting behavioral differences during communication; this would make this step very complicated, since it would have to solve constraints at both the inner and the outward levels. In our approach (independence of Body and Mind), at the Mind level only constraints on the meanings associated with a communicative act are represented, leaving to the Body the task of deciding which signal to employ according to the context.
- At the Mind level, represented according to the BDI (belief, desire, intention) theory and architecture (Rao and Georgeff, 1991), the way in which the Agent's mental state is related to the context may be represented explicitly. This enables the Agent to vary the decision made (including the 'discourse' that achieves a given
'communicative goal') and the triggered and displayed emotions, according to the structure of its mental state;
- At the Body level, which can use different physical representations, a context-dependent meaning-signal translation is performed in order to render the selected meaning consistently with the chosen physical body and context rules.
- Relations between the various phases of the generation process have to be conceived in such a way as to facilitate a flexible adaptation to the different features that influence communication. For these reasons, the I/O interface between Mind and Body is constituted by an XML specification that is easily wrapped according to available resources and rendering rules.
10.1. The System Architecture
These principles have been applied, in the context of the MagiCster project, to develop an Embodied Conversational Agent that appropriately combines verbal and nonverbal signals when delivering information, so as to establish a natural communication with the User. Our Agent achieves quite a rich expressiveness in conversation, by displaying various types of information typically transmitted in human-human dialogs (syntactic, dialogic, meta-cognitive, performative, deictic, adjectival, and rhetorical relations). It is embodied in a 3D talking head whose name is 'Greta' (Pelachaud, 2002); it has a personality and a social role, as well as the capability of expressing emotions, consistently with the context in which the conversation takes place and with its own goals, and of adapting its behavior to some user features relevant for the selected application domain. At present, we simulate advice-giving dialogs, where the main function of Greta is to provide suggestions and useful information to the User in the application domain. In our 'mixed-initiative' system, the User can ask Greta some questions; this opens a question-answering sub-dialog after which, if needed, Greta revises her discourse plan according to the User request. At the moment, Greta provides information and advice in the medical domain, adapting her behavior to some user features such as age (children, teens, adults) and role (patient, nurse, parent of a patient), and to some cultural features that are relevant for providing more appropriate suggestions, both from the content and from the signal rendering points of view. The architecture of our Conversational Agent includes the following main components (Figure 1): i) the Agent's MIND, including i.1) the 'Emotional Mind' and i.2) the Dialog Manager; and ii) a Generator of the Agent's Body, including ii.1) a Signal Generator, ii.2) a Signal Wrapper and ii.3) the chosen body (Greta in this case) that renders the specified behavior. Even if we are working in the field of advisory dialogs, the system is domain-independent. A particular application domain is therefore selected as a first step of the interaction: a conversational goal in that domain is passed to the Dialog Manager (DM) and the dialog may start. From this goal, an overall discourse plan is produced for the Agent, by retrieving an appropriate 'recipe' from a plan library formalized according to DPML (de Rosis et al., in press b). This plan represents how the Agent
will try to achieve the specified communicative goal during the conversation, and is selected according to user features and context. The way the dialog goes on is a function not only of this plan but also of the User Moves and of the Social Context.

Figure 1. The Architecture of our Conversational Agent (the MIND, with its Emotional Mind and Dialog Manager; the MIDAS Plan Enricher; and the BODY, with its Signal Generator and Signal Wrapper, rendered by Greta with FDL and Festival).
This context includes variables concerning the topic of the dialog, the personality traits of the two interlocutors and their ‘attitudes’ towards the dialog itself. For instance, the Agent may be defined as more or less ‘empathic’ towards the User, more or less easily affected emotionally, more or less prone to feel specific emotions with a high intensity (for more details, see Carofiglio et al., in press). Once an Agent’s first move has been generated, the User replies with a move in his or her turn. The DM then asks the emotion simulation module (Emotional Mind) whether a particular affective state of the Agent is activated as a consequence of this move and with which intensity. Then, the DM selects the next move according to the goal activated by the user move, the Agent’s new affective state, and the Social Context. When the DM selects a particular Agent’s move, this has to be enriched by the plan enrichment module (MIDAS) that adds tags indicating the meanings to be conveyed. At this point, a further adaptation can be performed by adding meaning
according to the user features, culture and context. This enriched move is then passed to the Body Generator, which interprets it by converting meanings into signals and by rendering them as expressive behavior. The selection of signals, at this point, can be adapted to the available communication channels of the chosen body, to user features and to cultural features. We now describe the above mentioned modules in some more detail.
1. The Emotional Mind is responsible for updating the Agent's mental state, by deciding whether a particular affective state should be activated and with which intensity, and whether the felt emotion should be displayed and how, according to the context variables. The Mind is based on a dynamic belief network (DBN), which combines a belief network (BN) representing the agent's mental state at time T with a network representing its mental state at time T+1, and a network that monitors the triggering of emotions in the interval (T, T+1). Three kinds of nodes can be found in the Agent's mental state: 'belief' nodes, 'goal' nodes and 'goal-achievement' nodes. A weight is associated with goal-achievement nodes, as a function of the agent's personality. The belief network at time (T+1) is generated according to the network at time (T) and to the events that occurred in the interval (T, T+1). These events are modeled by belief networks as well.
2. The Dialogue Manager (DM) is built on top of TRINDIKIT (http://www.ling.gu.se/projekt/trindi/). It controls the dialogue flow by iterating the following steps:
a. after a 'dialog goal' has been specified, an appropriate discourse plan is selected from the library of plan recipes and the first move is generated according to the first step of this plan. The 'dialog goal' becomes the main topic of the conversation;
b. at the end of this first move, the initiative is passed to the User, who may ask the agent questions about any subject concerning the main topic under discussion;
c. the User move is translated into a symbolic communicative act (through a simplified interpretation process) and is passed to the DM;
d. the DM decides "what to say next" by selecting the sub-plans to execute. At the same time, the information state of Trindi is updated;
e. the DM keeps cycling over steps b. to d., until the user leaves the conversation.
3. The Plan Enrichment module, called MIDAS, has the role of translating the symbolic representation of a dialog move into an Agent's behavior specification. A dialogue move may be very simple (e.g. greet, ask) or it may correspond to a small discourse plan (for instance: describe an object with its properties). In both cases MIDAS generates a tagged output expressed according to a particular XML specification which is interpretable by the Body of any Animated Agent (Affective
Presentation Markup Language: APML; see De Carolis et al., 2002 for details). APML is used as a middle-ware level, to overcome integration problems between the mind and body components and, at the same time, to allow independence and modularity between them. The move is then represented as an APML string, in which the verbal part of the dialog act is enriched with the tags that are needed by the speech and body generation components of the Agent to produce the required expressions: rhetorical relations, deictic or adjectival information, certainty values, metacognitive or turn-taking expressions.
4. The Body Generator module interprets the APML-tagged dialog move and decides which signal to convey on which channel for each communicative act. We defined a lexicon of facial expressions and gaze: that is, a set of (meaning, signal) pairs. Each meaning in the taxonomy specified in Section 4 (certainty, metacognitive, comment etc.) corresponds to a particular configuration of parameters of gaze or facial expressions that Greta is able to exhibit. For example, within the class "certainty", the meaning "certain" corresponds to a "small frown" while the meaning "uncertain" corresponds to a "raised eyebrow". In order to adapt this correspondence between meanings and signals, we have started to introduce some translation conditions depending on the user and on the culture of the target user population.
5. The Body we use is a combination of a 3D face model compliant with the MPEG-4 standard (Pelachaud, 2002) and of the speech synthesizer Festival (Black et al., 2002). The facial model is capable of performing the face and gaze expressions foreseen for our conversational agent. Each signal may be expressed as a set of facial parameters, and we have developed a language to describe facial expressions easily (De Carolis et al., 2002). The text of each dialogue move with its tags is given as input to the animation module and to Festival, which provides the duration of the phonemes and a wav file (an audio file). Phonemes are the smallest temporal unit considered: knowing the phoneme durations enables us to retrieve the exact duration of any expression as defined by the tags in the dialogue move, thus ensuring synchrony between speech and the other visual activities. Tags get instantiated by their corresponding signals, which are in turn transformed into values of the facial parameters. When several tags occur in the same text span, a conflict may arise if their corresponding signals act on the same facial regions with different values (e.g. on the eyebrow region the agent should perform both a frown and a raised eyebrow, or, with the head, both a head nod and a head shake). In such cases a conflict solver module takes as input the co-occurring meanings and produces as output a complex expression made of a mix of signals from the different meanings (Pelachaud & Poggi, 2002).
The various components (Mind, MIDAS, Greta and the DM itself) are connected through a Graphical Interface which controls activation, termination and information exchange for the various processes involved in the dialogue management.
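To make the meaning-to-signal step more concrete, the following is a minimal sketch of how a Body Generator of this kind might look up signals for co-occurring APML meanings and flag conflicts on the same facial region. The tag names, signal names and facial regions are illustrative assumptions, not the actual Greta lexicon, and conflict handling is reduced to flagging the clash rather than blending expressions.

# Illustrative sketch only: tag names, signal names and facial regions are
# hypothetical examples, not the actual Greta lexicon.
MEANING_SIGNAL_LEXICON = {
    ("certainty", "certain"):   {"signal": "small_frown",    "region": "eyebrows"},
    ("certainty", "uncertain"): {"signal": "raised_eyebrow", "region": "eyebrows"},
    ("performative", "greet"):  {"signal": "smile",          "region": "mouth"},
}

def signals_for(meanings):
    """Translate co-occurring meanings into signals and flag conflicts
    (two signals competing for the same facial region)."""
    chosen, used_regions, conflicts = [], {}, []
    for m in meanings:
        entry = MEANING_SIGNAL_LEXICON.get(m)
        if entry is None:
            continue  # meaning not in this body's repertoire
        region = entry["region"]
        if region in used_regions:
            # In the real system the clash would be passed to the conflict solver.
            conflicts.append((used_regions[region], entry["signal"]))
        else:
            used_regions[region] = entry["signal"]
            chosen.append(entry["signal"])
    return chosen, conflicts

print(signals_for([("certainty", "certain"), ("certainty", "uncertain")]))
# (['small_frown'], [('small_frown', 'raised_eyebrow')])

In the actual system, the flagged pair would be handed to the conflict solver, which composes a complex expression mixing the signals of the co-occurring meanings.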
10.2. An Example
To give an idea of how the system works, we show a simple example of the output; in this dialog, Greta represents a doctor who explains a drug prescription to a patient (the User). After an initial 'greeting', the Agent informs the patient about her disease. Then, the patient can ask any question concerning the main topic of the conversation; for instance, she may ask for clarifications about the disease and its severity, or even about the related therapy to undergo. Greta answers the User's question by selecting the adequate plan steps and by showing the expression appropriate to the social context. Figure 2 shows a few annotated moves of the example dialogue. The 'a' moves are directed to a user who is a patient, while the 'b' moves are directed to a user who is a nurse. Although these APML moves have been generated from the same DPML plan (informing about Mr. Smith's disease), it is possible to notice that the verbal and nonverbal meanings to convey are slightly different. For instance, the affect attribute of the performative is not instantiated in move 1b: since the user is a nurse, an empathic attitude toward her has not been generated by the Mind, and it is therefore not displayed by the Body Generator. Annotations are translated into signals and then rendered as facial expressions and voice intonation. Figure 3 shows some examples of these expressions (neutral, sorry-for and surprise). Let us now suppose that we want to introduce a form of adaptation to the user's culture at this level: the Body Generator then has to look into a conditional table and select the appropriate signal for the chosen body (King et al., 2003; Ruttkay et al., in press). This translation file is also an XML Meaning-Signal description, in which the combination of signals by which every meaning should be rendered is established by rules associating each <meaning-tag_i> with a set of attributes att_j, where <meaning-tag_i> is an APML tag and att_j specifies an expression or speech feature that the body is able to perform under the specified conditions (Cond). Examples expressed with the Facial Description Language of Greta define one mapping for some cultures, such as French or Italian, and a different mapping for others, such as Greek or Turkish.
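As a purely illustrative sketch of what such a culture-conditioned translation might look like (the meanings, cultures and signal values below are hypothetical placeholders, not Greta's actual FDL entries):

# Hypothetical culture-conditioned meaning-signal table; the values are
# invented placeholders, not entries from Greta's actual FDL translation files.
TRANSLATION_TABLES = {
    "italian": {"affirmation": "head_nod", "negation": "head_shake"},
    "greek":   {"affirmation": "head_nod", "negation": "head_toss_back"},
}

def render(meaning, culture, default="neutral_expression"):
    """Select the signal rendering a meaning for the target culture."""
    return TRANSLATION_TABLES.get(culture, {}).get(meaning, default)

print(render("negation", "italian"))  # head_shake
print(render("negation", "greek"))    # head_toss_back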
Greta move 1a:
<APML> I'm sorry to tell you that you have been diagnosed as suffering from what we call angina pectoris, which appears to be mild.

Patient User: What is angina pectoris?

Greta move 2a:
<APML> This is a spasm of <deictic obj="chest"> chest resulting from overexertion when heart is diseased. ...

Greta move 1b:
<APML> Mr. Smith is suffering from angina pectoris, which appears to be mild.

Nurse User: Which is the therapy?

Greta move 2b:
<APML> To solve the problem, he has to take two drugs. The first one is Aspirin and the second one is Atenolol.

... and so on.
Figure 2. An example of dialog annotated with APML.
Meaning-signal translation files define a mapping between the semantics of discourse segments and the expressions the character can show with face, gesture, body and speech. Defining several meaning-signal translation files enables us to
render the same meanings with different combinations of signals. We may, for instance, implement a character with a 'fun' behavior for interacting with children and a more 'serious' one for adults. Or we may apply this method to build culture-dependent characters (de Rosis et al., in press a).
11. CONCLUSIONS
In this paper, we have overviewed some factors determining the communicative behavior of a Human Agent that should be simulated in a Believable Embodied Agent to make it 'natural' (or 'believable'). In particular, we have considered the contents to communicate, emotions, personality, culture, style and contextual variables. Then, we have briefly described the architecture of a conversational agent embodied in a 3D face, Greta, that tries to achieve a believable behavior while interacting with the User. Our Agent follows the seminal work of a number of other research groups involved in this domain (to mention only some of them: Arafa et al., 1999; Cassell et al., 1999; Rickel and Johnson, 1999; André et al., 2000; Lester et al., 2000). It builds upon research on how to model emotion activation and regulation (Frijda & Swagerman, 1987; Ortony et al., 1988; Elliott, 1992; Castelfranchi, 2000). The main characteristic of our Conversational Agent is its ability to dialog with the User in any application domain whose knowledge has been represented with the appropriate formalism required by TRINDI and by MIND. Another peculiar feature is the Mind-Body independence: although the examples we showed in this contribution are all about Greta, the APML tagging of dialog moves enables us to render them with any animated agent (2D or cartoon, for instance), according to the needs of the application domain and of the particular category of users to which the system is addressed.
Figure 3. Three expressions of Greta: neutral, sorry-for and surprise
In the future, we plan to study how the variables considered (in particular, affective factors) influence not only the external behavior of the Agent but also the inner part of it: typically, its reasoning style, its decision making ability and the argumentation process. The independence of the Mind and the Body representations
allows us to represent communicative information at a highly abstract level. The instantiation of a meaning into a set of signals is done by considering factors such as the technical context (on which device is the agent displayed; is it a 2D cartoon-like face or a 3D realistic face; does it have only the face or also a body...) or the socio-cultural context (what is the user's culture, and the agent's culture; what are the user's and the agent's age and gender; where is the conversation taking place...). Such an architecture provides great flexibility to create specific agents for each given set of factors.
12. NOTES
1. IST project MagiCster IST-1999-29078, with partners: University of Edinburgh, UK (coordination); DFKI, Germany; SICS, Sweden; Univ. of Bari, Italy; Univ. of Rome, Italy; AvartarME, UK.
13. REFERENCES
André, E., Rist, T., Van Mulken, S., Klesen, M., & Baldes, S. (2000). The automated design of believable dialogues for animated presentation teams. In J. Cassell, J. Sullivan, S. Prevost & E. Churchill (eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.
Arafa, J., Charlton, P., Mamdani, A., & Fehin, P. (1999). Designing and building Personal Service Assistants with personality. In S. Prevost & E. Churchill (eds.), Proceedings of the Workshop on Embodied Conversational Characters, Tahoe City.
Bates, J. (1994). The role of emotion in believable agents. Communications of the ACM, 37(7), 122-125.
Black, A.W., Taylor, P., Caley, R. & Clark, R. (2002). Festival. http://www.cstr.ed.ac.uk/projects/festival/
Canamero, L. & Aylett, R. (eds.) (in press). Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins.
Carbonell, J.G. (1980). Towards a process model of human personality traits. Artificial Intelligence, 15, 49-74.
Carofiglio, V., de Rosis, F. & Grassano, R. (in press). Dynamic models of mixed emotion activation. In L. Canamero & R. Aylett (eds.), Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins.
Cassell, J., Bickmore, J., Billinghurst, M., Campbell, L., Chang, K., Vilhjalmsson, H. & Yan, H. (1999). Embodiment in conversational interfaces: Rea. In Proceedings of CHI'99, Pittsburgh, PA, pp. 520-527.
Cassell, J., Sullivan, J., Prevost, S. & Churchill, E. (eds.) (2000). Embodied Conversational Agents. Cambridge, MA: MIT Press.
Castelfranchi, C. (2000). Affective Appraisal Versus Cognitive Evaluation in Social Emotions and Interactions. In A. Paiva (ed.), Affective Interactions. Springer LNAI 1814, Berlin, pp. 76-106.
Chi, D., Costa, M., Zhao, L. & Badler, N. (2000). The EMOTE model of effort and shape. In Proceedings of Siggraph, 173-182.
Conte, R. & Castelfranchi, C. (1995). Cognitive and Social Action. London: University College.
De Carolis, B., Carofiglio, V. & Pelachaud, C. (2002). From Discourse Plans to Believable Behaviour Generation. In Proceedings of the International Natural Language Generation Conference. New York.
De Carolis, B., Pelachaud, C., Poggi, I. & de Rosis, F. (2001). Behavior planning for a reflexive agent. In Proceedings of IJCAI'01. Seattle, August 2001.
de Rosis, F., Pelachaud, C. & Poggi, I. (in press a). Transcultural believability in embodied agents: a matter of consistent adaptation. In R. Trappl & S. Payr (eds.), Agent Culture: Designing virtual characters for a multi-cultural world. Dordrecht: Kluwer Academic Publishers.
de Rosis, F., Pelachaud, C., Poggi, I., Carofiglio, V. & De Carolis, B. (in press b). From Greta's mind to her face: modeling the dynamics of affective states in a Conversational Embodied Agent. International Journal of Human-Computer Studies.
Ekman, P. (1982). Emotion in the human face. Cambridge: Cambridge University Press.
Elliott, C. (1992). An Affective Reasoner: A process model of emotions in a multiagent system. Technical Report No. 32 of The Institute for the Learning Sciences, Northwestern University.
Frijda, N.H. & Swagerman, J. (1987). Can computers feel? Theory and design of an emotional system. Cognition and Emotion, 1(3), 235-257.
Hartmann, B., Mancini, M. & Pelachaud, C. (2002). Formational parameters and adaptive prototype instantiation for MPEG-4 compliant gesture synthesis. In Proceedings of Computer Animation '02, Geneva: IEEE Computer Society Press.
King, S., Knott, A. & McCane, B. (2003). Language-driven nonverbal communication in a bilingual conversational agent. In Computer Animation and Social Agents, New Brunswick.
Lester, J.C., Stuart, S.G., Callaway, C.B., Voerman, J.L., & Fitzgerald, P.J. (2000). Deictic and emotive communication in animated pedagogical agents. In J. Cassell, J. Sullivan, S. Prevost & E. Churchill (eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.
Loyall, A.B. & Bates, J. (1997). Personality-rich believable agents that use language. In Proceedings of Autonomous Agents 1997, pp. 106-113.
Magno Caldognetto, E. & Poggi, I. (1995). Creative Iconic Gestures: some Evidence from Aphasics. In R. Simone (ed.), Iconicity in Language. Amsterdam: John Benjamins.
McCrae, R. & John, O.P. (1992). An introduction to the Five-Factor model and its applications. Journal of Personality, 60, 175-215.
Miceli, M. & Castelfranchi, C. (1989). A cognitive approach to values. Journal for the Theory of Social Behavior, 19, 169-193.
Oatley, K. & Johnson-Laird, P.N. (1987). Towards a cognitive theory of emotions. Cognition and Emotion, 13, 29-50.
Ortony, A., Clore, G.L., & Collins, A. (1988). The Cognitive Structure of Emotions. Cambridge University Press.
Pearl, J. (2000). Probabilistic reasoning in intelligent systems: Networks of plausible inference. Morgan Kaufmann Publishers.
Pelachaud, C. (2002). Visual Text-to-Speech. In I.S. Pandzic & R. Forchheimer (eds.), MPEG-4 Facial Animation - The standard, implementations and applications. New York: John Wiley & Sons.
Pelachaud, C., Carofiglio, V., De Carolis, B., de Rosis, F. & Poggi, I. (2002). Embodied Contextual Agent in Information Delivering Application. In Proceedings of the First International Joint Conference on Autonomous Agents & Multi-Agent Systems. Bologna, Italy.
Pelachaud, C. & Poggi, I. (eds.) (2001). Multimodal Communication and Context in Embodied Agents. Workshop at the 5th International Conference on Autonomous Agents, Montreal.
Pelachaud, C. & Poggi, I. (2002). Subtleties of Facial Expressions in Embodied Agents. Journal of Visualization and Computer Animation, 13, 301-312.
Picard, R.W. (1997). Affective Computing. Cambridge, MA: MIT Press.
Plutchick, R. (1980). The Psychology and Biology of Emotions. New York: Harper & Collins.
Poggi, I. (2002a). Mind Markers. In M. Rector, I. Poggi & N. Trigo (eds.), Gestures. Meaning and Use. Oporto: Universidad Fernando Pessoa Press.
Poggi, I. (2002b). Towards the Lexicon and Alphabet of Gesture, Gaze and Touch. www.semioticon.com/virtual
Poggi, I. & Pelachaud, C. (1998). Performative Faces. Speech Communication, 26, 5-21.
Poggi, I. & Pelachaud, C. (2000). Performative Facial expression in Animated Faces. In J. Cassell, J. Sullivan, S. Prevost & E. Churchill (eds.), Embodied Conversational Agents. Cambridge, MA: MIT Press.
Prendinger, H. & Ishizuka, M. (2001). Smart brains in simplistic bodies. In C. Pelachaud & I. Poggi (eds.), Multimodal Communication and Context in Embodied Agents. Workshop at the 5th International Conference on Autonomous Agents, Montreal.
Rao, A.S. & Georgeff, M.P. (1991). Modeling Rational Agents within a BDI-Architecture. In J. Allen, R. Fikes & E. Sandewall (eds.), Proceedings of the 2nd International Conference on Principles of Knowledge Representation and Reasoning (KR'91). San Francisco, CA: Morgan Kaufmann Publishers.
Rickel, J. & Johnson, W.L. (1999). Animated agents for procedural training in virtual reality: Perception, cognition, and motor control. Applied Artificial Intelligence, 13, 343-382.
Ruttkay, Z., Pelachaud, C., Poggi, I. & Noot, H. (in press). Exercises of Style for Virtual Humans. In L. Canamero & R. Aylett (eds.), Animating Expressive Characters for Social Interactions. Amsterdam: John Benjamins.
Trappl, R. & Payr, S. (eds.) (in press). Agent Culture: Designing virtual characters for a multi-cultural world. Dordrecht: Kluwer Academic Publishers.
14. AFFILIATIONS
I. Poggi, Università di Roma 3, Roma, Italy
C. Pelachaud, IUT de Montreuil - Université de Paris 8, Paris, France
F. de Rosis, V. Carofiglio, B. De Carolis, University of Bari, Bari, Italy
P. PAGGIO AND B. JONGEJAN
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
1. INTRODUCTION
This paper describes research work carried out at the Center for Sprogteknologi (CST) in Copenhagen to develop the multimodal communication components of a 3D virtual world application. This work was part of an interdisciplinary Danish research project – the Staging project – the purpose of which has been to investigate the nature and application potential of 3D virtual worlds, and their impact in theatre and entertainment (see Qvortrup (2000) for an overview). One of the many research activities the project has dealt with was the design and development of a virtual farm scenario and the exploration of this scenario from a number of different perspectives. Relevant issues were, for example, the usefulness of theatrical metaphors in 3D applications, the interaction modes possible in them, and last but not least how the user can communicate with the virtual agents. The interaction patterns defined for the agents in the scenario – a farmer and a number of animals – are based on an interactive improvisation defined for the EU-funded Puppet project, which has had an ongoing cooperation with Staging (Klasen et al. 2000). In our work, we have focused on the interaction between the agents and the user, who can communicate his intentions to the farmer and in that way influence the events on the farm.
Two characteristics of the application have been challenging from the point of view of how communication between the user and the agents could be modelled. First of all, we wanted communication to be multimodal: the user should be allowed to talk, write and use hand gestures in as natural a way as possible. Secondly, interaction had to accommodate the fact that the virtual world would not be static, so that user and agents had to be able to relate to changes in the world while communicating with one another. Therefore, communication could not be strictly planned, and a balance had to be found between the agents' wish to engage in a dialogue and their autonomous behaviour. Finally, we wanted to experiment with an environment in which the interaction with the agents in the world had an entertainment value rather than being subordinate to the achievement of specific goals such as finding information or learning how to carry out a task. This has been accomplished by enabling surrealistic actions such as moving trees around, and slightly comical replies such as "The cow is not hungry, but dead". It is our impression from the feedback obtained so far that users appreciate the system's funny reactions and do not necessarily expect the interaction to have a practical goal.
The multimodal requirements have been met by implementing a multimodal parser which is able to accept input coming from different modality devices and combine it, where possible, into a merged representation. The parser is neutral as to which specific modalities are used. In the Staging application, it combines speech and gesture inputs: speech and gestures are paired during chart initialisation, and the grammar then merges the multimodal information into a unified semantic representation where possible. To comply with the flexibility necessary in the interaction with the virtual world, we have defined a dialogue model aimed at dealing with mixed-initiative dialogue. The two building blocks of the model are dialogue obligations, modelling valid speech act sequences, and dialogue goals, modelling the agents' domain-oriented initiatives in the dialogue.
The paper starts by presenting the Staging virtual farm in Section 2. In Section 3, the architecture supporting multimodal communication in the virtual farm is described. Section 4 deals with the issue of how to combine speech and hand gestures in computer interfaces in rather general terms, by considering various classifications of hand gestures, while Section 5 discusses the use of hand gestures in Staging and describes the methodology adopted for merging speech and gesture inputs. Section 6 deals with the component in charge of managing a humanoid agent's communication with the user, i.e. its Communication Manager; it describes our dialogue model and discusses a few dialogue examples. Finally, Section 7 contains the conclusion.
2. THE STAGING VIRTUAL FARM
Before discussing the treatment of multimodal communication in Staging, a brief introduction to the project's virtual farm is provided in this section. The virtual farm has been developed at the Laboratory for Computer Vision and Media Technology at the University of Aalborg (CVMT/AAU) in Denmark and is described in detail by Paggio et al (2000). At CST we have produced a mock-up version of it in order to experiment with the integration of the multimodal communication components. In this mock-up, which we will refer to as the CST virtual farm from here on, the graphical rendering of the farm is two-dimensional, although the underlying spatial model is 3D. The graphical style is also different from that of the original farm in that we have substituted a simpler, cartoon-like look for the initial 3D computer game style. At the moment, the farm is a square with a tree in the middle, in which the user can navigate by means of the keyboard arrows. The farm is populated by a farmer, a few cows and a sheep. Apples and carrots pop up in the grass for the animals to eat. The animals are mainly governed by their hunger level, and get more and more restless the hungrier they become. They are attracted by food if food is not too far from them, and eat it if they come close enough. If they don't find food, they eventually die. The farmer is capable of more complex actions such as moving things around on the farm, feeding the animals, and talking to the user. Finally the user, who can interact with the farm verbally and by means of gestures, is represented in the world by an invisible avatar. This means that the user sees the world from a certain perspective, and that this perspective changes if she
moves around the world – that is if she uses keyboard strokes to navigate in the world. Although the user is not represented physically in the world, and cannot manipulate any of the objects directly, the fact that she is positioned in the world by means of the invisible avatar makes it possible for the agents to relate to her in several ways. For example, when the farmer engages in a dialogue with the user, he comes closer to where the user is positioned, and when asked to tell the user where an object is, he will relate the position of the object in question to the position of the avatar.
Figure 1. The CST virtual farm
At the core of the CST virtual farm is the Virtual Environment (VE) server, which is in charge of simulating the world and to which an arbitrary number of autonomous agents, as well as the user avatar, are connected. The VE provides the agents with sensory information that enables them to move around in the world and react to the presence of other agents, and it processes requests from agents, for example a request to move an object, produce a sound or play an animation. The agents perceive the presence of the user avatar in the same way as they perceive the presence of other agents. Agents have two different levels of functionality. The low level, which all agents share, accounts for spatial memory and motion planning, and enables an agent to perform simple actions such as "follow object X" or "move close to object Y". The high level, on the other hand, is different from agent to agent, and is responsible for more complex behaviours and action planning. The high level of the farmer agent,
for example, will know that after a certain amount of time has elapsed from the last feeding time, the animals are hungry and must be fed. It is also at the high level that an agent may have the ability and the will to engage in a dialogue. To choose among competing behaviours (only some of which may be the result of a verbal exchange with the user avatar), agents are equipped with a set of Releasing Mechanisms that compute the relevance of specific behaviours at any point in time (Madsen and Granum 2000). Agents continuously attempt to carry out the most relevant kind of behaviour, i.e. the behaviour that displays the strongest activation level. If, for example, the user (via the avatar) asks the farmer to feed a certain animal, say cow$3, the farmer's releasing mechanisms will increase the relevance of the "feed object cow$3" behaviour. Whether the agent chooses to perform the action depends on the level of activation of the other behaviours at that moment in time – the farmer may be hungry, tired, or unwilling to cooperate with the user. In fact, other behaviours may be more relevant than "feed object cow$3". Engaging in a conversation with the user is a type of behaviour in itself: to win over others, it must have a high level of activation. In the current implementation of the system, the dialogue behaviour generally has a high activation level for the farmer agent (a minimal sketch of this selection mechanism is given below).
3. AN ARCHITECTURE FOR MULTIMODAL COMMUNICATION
In the architecture we have defined for the CST virtual farm, which is shown in Figure 2, each agent (but not the user avatar) may be equipped with its own multimodal analysis component. In the current implementation, only the farmer (agent B in the figure) is capable of reacting to multimodal messages. In the figure, A, B, C and D are individual agents. A is the user avatar, that is, the representation of the user in the world. The avatar mediates the user's communication with the virtual world. We have chosen not to represent the user avatar graphically because of the difficulties this would imply in terms of realism, credibility and so on, but the avatar is present in the 3D space so that, for example, spatial references are interpreted relative to the avatar's position. The user can send input to the farm via a microphone, a data glove and a touch screen. The touch screen is used to point, whilst hand gestures other than pointing are performed by means of the data glove. The keyboard can be used as an alternative to the microphone should the physical conditions make it preferable. It is also used to navigate in the world. In principle, all the agents that are close enough to hear the avatar speak or to see its movements, or both, can react to the input in different ways. In a fictive world (e.g. that of a children's game), it is easy to imagine a dog agent equipped with its own more or less primitive multimodal grammar. Depending on how anthropomorphic the agent is meant to be, it would either be able to react to very few linguistic utterances (Move, Sit), or to participate in a conversation. In the current implementation, however, only one agent has linguistic capabilities. This is the farmer, agent B in the figure, an example of a humanoid agent with more complex communicative skills than the other agents.
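Returning to the releasing mechanisms described in the previous section, the following is a minimal sketch of selecting the behaviour with the strongest activation level; the behaviour names and activation values are invented for illustration, and the actual computation of relevance in Staging is not modelled here.

def select_behaviour(activations):
    """Return the behaviour with the strongest activation level."""
    return max(activations, key=activations.get)

# Invented example: a user request has just raised the relevance of feeding cow$3.
activations = {
    "wander": 0.2,
    "eat": 0.4,
    "feed object cow$3": 0.6,   # raised by the user's request
    "dialogue": 0.7,            # talking to the user competes with other behaviours
}
print(select_behaviour(activations))  # dialogue

A request put forward in the dialogue thus only raises the activation of the corresponding behaviour; whether it is actually carried out still depends on the competing activations.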
Figure 2. System architecture (the user avatar A, the agents C and D, and the farmer agent B – with its Parser, releasing mechanisms, action planning, spatial memory and communication manager, including dialogue model, dialogue history and text generation – connected to the Virtual Environment, speech recognition, gesture recognition and speech generation).
The farmer’s parser can analyse speech input and gesture information. As will be explained in more detail in Section 5, the multimodal input is unified if possible in a feature structure merging syntactic and semantic interpretation. In the figure, the user has entered the speech input Take it to the well while pointing at a well (“well$1”). The result of multimodal parsing is indicated in the figure as the list of features “pred=move, arg2=it, arg3=well$1”, where pred stands for the semantic predicate, arg2 for the direct object, and arg3 for the oblique complement. This is a much simplified representation which, however, serves the purpose of illustrating the fact that the information provided by the gesture is inserted in the structure built by the parser in the appropriate place. In addition to the multimodal parser, the farmer is equipped with a communication manager, which is in charge of managing the dialogue with the user avatar by means of a dialogue model1. It is the job of this component to interpret the user's messages with respect to the ongoing dialogue, to select the appropriate
dialogue moves for the agent, to build a dialogue history, and to generate verbal responses. In addition, the communication manager tries to resolve ambiguous referential expressions by looking at the salience of discourse referents in the dialogue history (in the figure, the pronoun it is resolved to "cow$3").

Figure 5. Communication protocols in the Staging prototype (microphone, loudspeaker, data glove, touch screen and keyboard connect Dragon Naturally Speaking, the SR-wrapper, the Virtual Environment, the Communication Manager and the Multimodal Parser via a SAPI-compliant COM interface and CORBA interfaces).
However, referential expressions may also be resolved by means of the gestures they are paired with, or because the agent cannot "see" more than one object matching the expression. As we shall see, the dialogue model constructs dialogue trees based on speech act analysis, and selects appropriate dialogue moves that fit in with the dialogue tree. However, the response to a request put forward to the agent in the dialogue also depends on the activation level of the corresponding action in the agent's releasing mechanisms. This is shown in the figure in terms of an internal deliberation as to whether the agent feels like complying with the request. In reality, the agent does not actually ask the question "Am I willing to move cow$3 to well$1?". If the action can be released – for instance if the agent is not too tired – the request is sent to the agent's action planner.
The system's prototype consists of several programs that are chained together using CORBA (omniORB 3.0.4) and Microsoft's COM as communication protocols (see Figure 5). Dragon Naturally Speaking runs under the Windows 98 operating system, as does the Speech Recognition (SR) wrapper. The VE developed at CST runs under
MULTIMODAL COMMUNICATION IN VIRTUAL ENVIRONMENTS
33
Windows 98. The Communication Manager and the Multimodal Parser are platform independent and have been tested under Sun Solaris and Windows 98. Thanks to the CORBA interfaces, all these programs can be distributed over several computers. The SR-wrapper, the VE mock-up, the Communication Manager and the Multimodal Parser are all written in C++. A considerable part of the functionality of some of these modules is defined in external text files: the parser loads lexicon, grammar and semantic mapping rules at start-up, whereas the Communication Manager loads and reloads script files describing virtually all of its functionality. The ability to reload the script files while the prototype is running allows for a rapid development of this component.
4. SPEECH AND HAND GESTURES
Many sources argue that users generally prefer multimodal systems, and indicate a preference for one or the other modality in different applications (see e.g. Oviatt 1999). It has also been argued (Cohen et al 2000) that, at least in the case of map-based tasks where being able to manipulate diagrams directly by pointing and drawing is very important, multimodal interfaces are more efficient than graphical user interfaces, and make it easier and quicker for the user to recover from error situations. In dialogue systems, speech (or written text) replaces the mouse clicks we are so accustomed to from menu-driven interfaces. In human dialogue, however, speech is only one of the modalities used to convey meaning: gaze, expressions, head movements, hand gestures and bodily posture are also important channels by means of which we communicate. Thus, insights from studies of human communication are beginning to find their way into research on multimodal interfaces, so that multimodality is considered an interesting option both concerning system input and system output. This development is based on the assumption that a good understanding of human communication, which is fundamentally multimodal in its nature, is useful for the design of better human-computer interfaces:
"The design of cooperative dialog systems, new human-computer interfaces [...] and avatars in various types of virtual environments can all be improved by better knowledge of features of real human-human multimodal communication." (Allwood 1999, p.1)
In our work, we are especially concerned with multimodal input and, in particular, with the combined use of speech and hand gestures. Most researchers make a first basic distinction between conventionalised gestures like the OK sign, also called emblems, and conversational gestures (Kendon 1981, McNeill 1992). Whilst emblems have stereotypical shapes (although the shape differs in different cultures) and do not necessarily occur in dialogues, conversational gestures do not take a predictable shape and occur in juxtaposition with speech. Conversational gestures can be categorised in different ways. In Bavelas (1994), they are grouped into topical gestures on the one hand, and interactive gestures on the other. Topical gestures are those directly related to the topic under consideration, e.g. gestures depicting the shape of an object, the speed of
a movement or a time location. Interactive gestures are those used to attend to the dialogue as such, for example by marking the delivery of information (new vs. old information), involving the addressee by acknowledging their contribution or seeking a response, and finally managing the turn in the dialogue. Bavelas describes no less than 12 different specialised interactive gestures. In Allwood (2002), Peirce's basic semiotic categories of indexical, iconic and symbolic information are applied to gestures. This yields the following classification:
i. indexical gestures, which establish a causal relation (e.g. pointing) with the information being perceived (the indicated object);
ii. iconic gestures, which express information by similarity or homomorphism;
iii. symbolic gestures (emblems), where the relation between form and meaning is based on social convention.
In Staging, we have focused on three different gesture types which we call deictic, iconic and turn management gestures, following i.a. Cassell and Prevost (1996). We have further restricted deictic gestures to pointing gestures, which are indexical according to Allwood's classification, and iconic gestures to gestures expressing the size of the objects on the farm. Turn management gestures are used either to take or to give the turn. The three gesture types interact with speech in different ways, as shown by the following examples:
(1)
a) Please, feed the animals/them. (while pointing at a group of pigs)
b) Move the smaller cow. (with a gesture meaning small)
c) No, I meant the other animal. (while waving with a hand)
In example (1a), the pointing gesture is associated either with a full nominal phrase or a pronominal placeholder. It adds referential information that must be compatible with the semantics of the rest of the speech act. In (1b), the gesture adds a different kind of semantic information, which can usefully be represented as a feature contributing to the semantics of the noun cow. As we shall see, object references coming from pointing gestures, or size information coming from iconic gestures are input to the parser and, if syntactically and semantically compatible, merged with the speech input. Both pointing and iconic gestures are used by the system to resolve nominal phrase reference, and take precedence over reference resolution based on the dialogue history. Example (1c) is different, in that it does not directly contribute meaning to the utterance, and is rather to be interpreted as an independent speech act. A turn-taking
action is picked up by the agent's communication manager, which urges the agent to come closer to the user avatar and listen to the user's next utterance, maybe after having interrupted an ongoing action. Turn-giving gestures, on the other hand, cause the agent to take a dialogue initiative. In conclusion, Bavelas's distinction between topical and interactive gestures is reflected in the quite different treatments associated with pointing and iconic gestures on the one hand, and turn management gestures on the other.
5. MERGING MULTIMODAL INPUT
An important assumption in our approach to multimodality is the fact that a great deal of research and experience is still needed to determine which particular combination of modalities is the most appropriate for specific tasks and applications. Therefore, we are interested in generic methodologies that allow systems to merge inputs from different modality devices irrespective of the specific application, rather than in optimised, low-level implementations of concrete and domain-specific multimodal input combinations. We have chosen to parse multimodal inputs in an integrated fashion, thus taking advantage of unification to treat redundant, complementary as well as ambiguous inputs. Our approach is different from that used in frame-based systems (as in Vo and Wood 1996), where input from different modalities is integrated by adding information in semantic frames tailored to the domain. It is more similar to the framework adopted in Johnston et al (1997), where the integration of spoken and gestural information is driven by typed-feature structures and carried out via unification. In their approach, multimodal integration takes place after speech and gestures have been independently parsed into separate, partial feature structures. Whilst we also use feature structures to build an integrated semantic representation, in our approach, as already discussed, deictic and iconic gestures are merged with speech by the parser, and turn managing gestures are interpreted later by the communication manager.
Input to the system is provided via a microphone, a touch screen and a data glove. The speech input is sent to an off-the-shelf speech recogniser (Dragon NaturallySpeaking Professional) which provides word-based, continuous speech analysis trainable to different speakers' voices and accents. Currently, the speech recogniser's vocabulary is used as provided by the vendor. In other words, we have not limited Dragon to recognising only the words that are known to the parser. This strategy allows us to experiment with unknown words in the parsing algorithm, and has the advantage of not imposing vocabulary limitations on the user. In fact, adaptation of the product has been restricted to user-specific training of the speech files. We have not worked with the internal grammar of the speech recogniser, nor have we experimented with feeding the parser with multiple speech analysis hypotheses. In other words, in our current set-up Dragon sends one and only one analysis per speech input to subsequent processing. A different approach could certainly be adopted, but it was not in focus in the project.
Figure 3. Merging Speech and Hand Gestures (speech recognition and gesture recognition feed chart initialisation, parsing and semantic mapping; pointing and size gestures enter the chart, turn managing gestures go to communication management, which leads to action).
The gesture input is sent to a gesture recogniser developed by Karin Husballe Munk at CVMT (Aalborg University). This component recognises the following hand gestures:
- gesture indicating "small": closing thumb and forefinger into a circular or oval shape;
- gesture indicating "big": stretching thumb and forefinger far from each other, other fingers stretched;
- turn-taking gesture: lifting the hand with the palm towards the screen, forefinger and middle finger stretched upwards;
- turn-giving gesture: from a closed hand (all fingers bent), opening the hand with the palm upwards, all fingers stretched in the direction of the screen.
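As a rough illustration of how recognised gestures (including pointing from the touch screen) could be routed through the system, the following sketch sends size and pointing gestures towards the parser's chart and turn management gestures towards the communication manager; the event format, function name and routing targets are assumptions made for this sketch, not the actual Staging interfaces.

def route_gesture(gesture, timestamp, target_object=None):
    """Decide where a recognised gesture should be handled: size and pointing
    gestures go to the parser's chart, turn management gestures to the
    communication manager."""
    if gesture in ("small", "big"):
        return ("parser_chart", {"mod": gesture, "time": timestamp})
    if gesture == "pointing":
        return ("parser_chart", {"objectid": target_object, "time": timestamp})
    if gesture in ("turn_taking", "turn_giving"):
        return ("communication_manager", {"act": gesture, "time": timestamp})
    return ("ignored", {})

print(route_gesture("big", 12.4))
print(route_gesture("pointing", 12.6, target_object="cow$1"))
print(route_gesture("turn_giving", 15.0))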
In the preceding section we saw that conversational gestures (both iconic and turn management gestures in our small gesture repertoire) are characterised by not having conventionalised shape. This means for example that the concept of “big” in human conversation can be conveyed in various ways using either one or two hands. The gesture we have decided to implement in the gesture recogniser is iconic in the sense that it is an attempt at imitating the real size of the big animals on the farm – the cows as opposed to the sheep: they are only relatively big, so that the distance between thumb and forefinger when the two fingers are stretched apart from each other seems a reasonable approximation of what “big” is in our virtual world. However, it may be objected that by only allowing two specific gestures to convey the meaning of the two size concepts we are interested in, we turn iconic gestures into emblems. The same objection can in fact be made for turn management gestures: only two of a spectrum of possible gestures are possible. The objection is
certainly justified. Given the fact that automatic gesture recognition is an extremely young discipline, it may be expected that future interfaces will be able to react to a much broader range of gestures than those proposed here. The inherent non-conventionality of conversational hand gestures, however, remains a very hard problem, so that systems will also in the future have to strike a difficult balance between naturalness of interaction and implementation feasibility.
5.1. Multimodal parsing
Turning again to our implementation, the multimodal parser used in the project is an adaptation of Njal (Music and Povlsen 1993), a bottom-up left-corner chart parser that has been adapted to receive input from several modalities. Njal was chosen because it is fast, and can produce partial parses in case the input is ill-formed or outside the coverage of the grammar. It can also accommodate unknown words, and apply heuristics to prune multiple parses. Njal applies lexicon entries and grammar rules expressed in an augmented context-free formalism, where category symbols such as S (sentence), NP (nominal phrase) etc. are associated with a feature structure defining the semantics of the sentence or phrase under consideration. For Njal to be able to process speech and gestures together, the speech signal and the signal from the data glove are time-stamped and synchronised. As already mentioned, pointing and size gestures are entered into the initial chart and, if possible, paired with relevant words. In addition to the near synchronicity of the two, syntactic and semantic constraints have to be met for the word and the gesture to be paired (see also Paggio and Jongejan 2000 for a discussion). A pointing gesture can be paired with a noun or a pronoun. Furthermore, the semantic type of the noun and the object type indicated by the gesture must be compatible. Let us consider two concrete examples:
(2)
Move that cow$1|animal.
(3)
Feed that cow$1|apple.
In (2), the user asks the farmer to move that animal while pointing at a cow. The string cow$1 is the internal representation of the pointing gesture, where “cow” is the object type and “$1” an internal identifier. Since “cow” is a subtype of “animal” in the system's domain ontology, noun and pointing gesture are semantically compatible. The resulting semantic representation of the multimodal input is the following feature structure: (4)
{act=request, predicate=move, arg3={pred=animal, semtype=animal, objectid=cow$1}}
The interpretation of the gesture is an object identifier added to the recipient (arg3) of a “move” request. In example (3), on the contrary, the semantic type of apple (presumably the result of a speech recognition error) is not compatible with the semantic type associated in the lexicon with the word animal. Speech and gesture in this case cannot be unified. The parser is left with two distinct parse hypotheses, one in which apple is the recipient of the predicate feed, and one in which cow$1 is. Since the former is semantically impossible, the latter is yielded as the only possible parse. The resulting feature structure is shown below: (5)
{act=request, predicate=feed, arg3={semtype=animal, objectid=cow$1}}
Whilst pointing gestures are exploited to resolve the reference of nominal phrases directly, size gestures add semantic information which can be used for the same purpose in a more indirect manner. For instance, the gesture small or big paired with the phrase the animal can, in certain contexts, provide enough information to identify a specific animal. In the feature representation, the meaning of a size gesture is expressed as the value of the feature “mod(ifier)”, as shown below: (6)
{act=request, predicate=feed, arg3={semtype=animal, objectid=cow$1, mod={pred=small}}}
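To make the merging mechanism concrete, the following sketch (in Python, with invented names and a toy ontology; it does not reproduce the actual Njal implementation) shows how a pointing or size gesture could be combined with the feature structure of a parsed phrase, subject to the kind of semantic compatibility check illustrated in examples (2)–(6).

# Sketch of gesture/speech merging (hypothetical names, not the Njal code).
# A pointing gesture contributes an object identifier, a size gesture a
# modifier; a pointing gesture unifies with a phrase only if the ontology
# says the two semantic types are compatible.

ONTOLOGY = {"cow": "animal", "sheep": "animal", "apple": "food",
            "animal": None, "food": None}

def subsumes(general, specific):
    """True if 'specific' is the same type as 'general' or a subtype of it."""
    while specific is not None:
        if specific == general:
            return True
        specific = ONTOLOGY.get(specific)
    return False

def merge(phrase_fs, gesture):
    """Add the interpretation of a gesture to the feature structure of a
    phrase, or return None if the two cannot be unified."""
    if gesture["kind"] == "pointing":
        obj_type = gesture["object_id"].split("$")[0]   # e.g. "cow" from "cow$1"
        if not (subsumes(phrase_fs["semtype"], obj_type)
                or subsumes(obj_type, phrase_fs["semtype"])):
            return None                                 # e.g. apple vs. cow: no unification
        return {**phrase_fs, "objectid": gesture["object_id"]}
    if gesture["kind"] == "size":                       # "big" or "small"
        return {**phrase_fs, "mod": {"pred": gesture["value"]}}
    return phrase_fs

# Example (2): "Move that animal" accompanied by pointing at cow$1.
arg3 = merge({"pred": "animal", "semtype": "animal"},
             {"kind": "pointing", "object_id": "cow$1"})
print({"act": "request", "predicate": "move", "arg3": arg3})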
To conclude, our approach to the merging of multimodal inputs is general enough to accommodate complementary as well as ambiguous and contradictory information. The use of feature structures to represent the content of the resulting multimodal sign is, we believe, a powerful tool which can also be exploited for more complex examples than those implemented so far.
6. THE COMMUNICATION MANAGER
The overall goal of interaction in Staging, as already pointed out, is not carrying out specific tasks, but rather “playing around” with the world and making things happen in it. Since the user cannot manipulate objects directly, he or she has to resort to the farmer. Thus the user can ask the farmer questions about the situation in the world, e.g. where the animals are or whether they are hungry, and urge the farmer to perform simple actions that have to do with the animals on the farm. The
communication manager is in charge of managing the farmer's dialogue with the user in accordance with a dialogue model. This implies interpreting the user messages with respect to an ongoing dialogue, deciding on the agent's dialogue moves, building a dialogue history, and generating a verbal response. Our model aims at accommodating the flexibility required by the application by combining dialogue goals arising from the scenario with dialogue obligations expressed in terms of sequences of speech acts. Dialogue goals and dialogue obligations are modelled based on well-known approaches. However, we believe these have been combined in an interesting way to be applied in a world of autonomous agents in which communicating is one among a set of competing behaviours. Compared to the industrial models of dialogue management reviewed in Zinn et al. (this volume), our approach is more open in the sense that it is not based on explicit coding of all possible sub-dialogues, as is done for instance in systems based on finite state machines. The construction and interpretation of dialogue trees, which is the central mechanism in our approach, depends on the definition of general speech act sequences, or dialogue obligations, and the grammar used by the parser to interpret the individual dialogue moves is also very general. The domain knowledge consulted by the communication manager to give meaningful content to the speech acts generated by the agent is expressed in a library of action templates that are used to generate dialogue goals. This seems similar to what is done in dialogue systems built by computational linguists and speech researchers (e.g. Allen et al. 2000, Larsson and Traum 2000), where, still according to Zinn et al., dialogue modelling and domain knowledge are kept separate. Dialogue goals and dialogue obligations will now be described in more detail.
6.1. Dialogue goals
Examples of informally specified dialogue goals from the farm scenario – some very general, some more specific – are negotiate whether some animal should be moved, negotiate whether some animal should be fed, talk about the weather, and ask for object reference. Dialogue goals are action plans referring to a pre-defined and domain-relevant set of action templates in the spirit of Badler et al. (1999). Such templates specify, for a given action, which semantic objects are involved, the corresponding attribute name in the semantic representation output by the parser, relevant preconditions, if any, etc. Templates for feeding and moving, for example, may look as follows (somewhat simplified): (7)
FeedAction(Topic=Feed, Animal=<arg3>, Food=<arg2>, Tool=)
(8)
MoveAction(Topic=Move, Object=<arg2>, Modus=, Pace=(quickly|slowly|normal))
Dialogue goals are generated dynamically either to reflect changes in the VE as experienced by the agent, or to help the agent react properly to a request by the user. For example, if the farmer becomes aware of the fact that a pig is hungry, maybe because a certain amount of time has elapsed since he last fed it, his communication manager will create a dialogue goal urging him to ask the user whether the pig in question should be fed. Thus, the agent can take the initiative in the dialogue, and it will do so if explicitly given the turn by the user, or if the user is silent for too long. Dialogue goals also enable the agent to gather all the information it needs to satisfy a request by the user. For instance if the user (U) asks the agent (A) to feed the cows, the agent will ask which food they should be fed with as shown in the following transcript (where cow$2 is the system internal representation of the reference provided by a pointing gesture): (9)
U: Hi feed an animal please.
A: Which animal shall I take?
U: Feed cow$2|that cow.
A: Which food shall I take?
U: An apple.
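The exchange in (9) can be read as the agent filling the unfilled slots of a feed template one question at a time. A minimal sketch of this slot-filling behaviour is given below; the template and names are invented for illustration (the actual templates, cf. (7)–(8), are richer).

# Sketch of dialogue-goal generation from an action template (hypothetical names).

FEED_TEMPLATE = {"topic": "feed", "slots": ["animal", "food"]}

QUESTIONS = {"animal": "Which animal shall I take?",
             "food": "Which food shall I take?"}

def next_dialogue_goal(template, filled):
    """Return a question for the first slot the user has not yet specified,
    or None once the action is fully specified and can be attempted."""
    for slot in template["slots"]:
        if slot not in filled:
            return QUESTIONS[slot]
    return None

# Reproducing the flow of transcript (9):
filled = {}                                          # "Hi feed an animal please"
print(next_dialogue_goal(FEED_TEMPLATE, filled))     # -> Which animal shall I take?
filled["animal"] = "cow$2"                           # "Feed that cow" + pointing gesture
print(next_dialogue_goal(FEED_TEMPLATE, filled))     # -> Which food shall I take?
filled["food"] = "apple"                             # "An apple"
print(next_dialogue_goal(FEED_TEMPLATE, filled))     # -> None: feeding cow$2 can now be attempted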
Of course, dialogue goals are not the only goals an agent will have. The exchange above, for example, will yield the goal of actually feeding the cow. Whether the action is then carried out will depend on facts external to the agent's will, such as in this case the cow's hunger level. In fact, an action may be relevant when the agent sets out to perform it, and become impossible before the agent has managed to carry it out. For example, a cow may die of hunger before the farmer gets there to feed it. In that case, the farmer will simply not perform the action.
6.2. Dialogue obligations and dialogue trees
Dialogue obligations are stated as a set of condition/obligation pairs following the model described in Traum and Heeman (1996). For instance, a question is followed by an answer, an assertion by a confirmation or a contradiction, and so on. Dialogue obligations are used by the communication manager to produce a correct reaction to a speech act by the user as well as to interpret a dialogue move by the user either as closing an open dialogue segment, or as initiating a sub-dialogue. Thus, understanding a user move and selecting the appropriate type of verbal response is performed in terms of constructing and interpreting dialogue trees. An example of the analysis this mechanism yields for a short constructed dialogue fragment is shown in Figure 4. In addition to a string corresponding to the message, each terminal node in the dialogue tree contains a semantic and pragmatic interpretation of it, including its speech act type. In the figure, the speech act is shown in italics, together with a node index and the message. Following the DAMSL annotation scheme described in Core and Allen (1997), we make a distinction between backward-looking and forward-looking speech acts. A backward-looking speech act
constitutes a reaction to a preceding utterance, while a forward-looking one elicits a response (either a verbal response or an action). A given dialogue move may well have both a backward-looking and a forward-looking function. From the communication manager’s perspective, forward-looking speech acts specify rule conditions, while backward-looking speech acts are expected reactions to these conditions – our dialogue obligations. If a different speech act occurs instead of the expected obligation, a sub-dialogue is opened.
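The following sketch illustrates one way such condition/obligation pairs could be consulted when a new move comes in; the rule table and names are invented for illustration, and the real mechanism additionally records the moves in a dialogue tree of the kind shown in Figure 4.

# Sketch of dialogue obligations as condition/obligation pairs (toy rule table).

OBLIGATIONS = {            # forward-looking act -> acceptable backward-looking reactions
    "offer":    {"accept", "reject"},
    "request":  {"accept", "reject"},
    "whque":    {"answer"},
    "greeting": {"greeting"},
}

def interpret_move(open_initiatives, incoming_act):
    """Close the innermost open initiative if the incoming move fulfils its
    obligation; otherwise treat the move as opening a sub-dialogue."""
    expected = OBLIGATIONS.get(open_initiatives[-1], set()) if open_initiatives else set()
    if incoming_act in expected:
        open_initiatives.pop()                   # obligation fulfilled: the I/R pair is closed
    else:
        open_initiatives.append(incoming_act)    # unexpected act: a sub-dialogue is opened
    return open_initiatives

# The opening of the dialogue in Figure 4:
stack = ["offer"]                  # I1: "Shall I feed the brown cow?"
interpret_move(stack, "whque")     # user: "Where is it?"      -> opens I/R2
interpret_move(stack, "answer")    # agent: "In the cowshed."  -> closes I/R2
interpret_move(stack, "accept")    # user: "Feed it then."     -> closes I/R1
print(stack)                       # -> []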
[Figure 4: An agent-initiated dialogue. The figure shows a dialogue tree D with topic feed, consisting of three initiative-response pairs: in I/R1 the agent opens with the offer "Shall I feed the brown cow?" (I1); the user's wh-question "Where is it?" opens the embedded pair I/R2, which the agent closes with the answer "In the cowshed."; the user's "Feed it then." (accept/request) then closes I/R1 as R1; finally, the agent's offer "Shall I feed more animals?" (I3) opens I/R3, which the user closes with "Yes, that sheep." (accept/request).]
In the dialogue example under consideration, the first initiative (I1) is taken by the agent. It is a forward-looking speech act of type offer, which the communication manager interprets as opening the first initiative-response pair (I/R1) of dialogue D (Jönsson 1997). The topic of the dialogue is defined as feed. Instead of answering the agent's question directly by either accepting or rejecting the offer, the user asks for more information. The user's move is therefore interpreted as a forward-looking whque(stion) opening a new initiative (I2), to which the agent replies accordingly with an answer move (R2). This closes the sub-dialogue consisting of the second I/R pair. The user's next move is interpreted as a reply to the open I/R1, since it can be interpreted as a (backward-looking) accept(ance) of the agent's original offer. Note that this move also has a forward-looking function of type request. We shall return to this point below. Before satisfying the user's request to feed the brown cow, the agent asks whether there are other possible object candidates for the feeding action. This initiative, the offer in I3, is generated by a dialogue goal based on the
relevant action template. The subsequent response by the user closes I/R3 as well as the whole dialogue.
6.3. Implied and coerced speech acts
Even in simplified and domain-oriented dialogues of the type we are dealing with, we know that the same message can be assigned several speech act labels, and that responses do not always seem to obey our limited set of condition/obligation rules. Here, we would like to discuss two cases in which the rules can be relaxed in a systematic way to make the dialogue model more flexible. One way to relax the condition-obligation rules is by allowing responses to be implied. For example, let us consider the following dialogue: (10)
A: Hi.
U: Feed the animals.
Since the user does not answer the agent's greeting right away, the agent should not expect to be greeted back later. In other words, after the first utterance of the human participant U, the agent must consider the greeting procedure complete. If we stick to our model of initiative-response pairs, then we have to allow for implied responses. In the example, the Feed the animals request initiates a new initiative-response pair. If we do not want to regard this I/R pair as part of a sub-dialogue of the greeting exchange (which would seem unnatural), then we have to conclude the greeting I/R pair with an implied greeting response. Another case in which the condition/obligation rules cannot be applied in a straightforward way is the following apparently unproblematic exchange. (11)
U: Feed the animal.
A: Which animal shall I take?
U: Feed that cow$1|cow.
A: Which food shall I take?
The second request (feed that cow$1|cow) is not a new feeding request and must not be interpreted as an initiative opening a new sub-dialogue. On the contrary, it must be accepted as a valid response to the question (which animal shall I take?), and an elaboration of the action under discussion. In the current implementation a request can be coerced into an answer or an acceptance if there is an action on the action stack that can be enriched by the information provided by the request. More precisely, for the request to be assigned the appropriate backward-looking speech act type, there must be semantic equality between the verb of the request and that of the pending action (“feed” in this case), as well as semantic compatibility between their arguments (a cow is an animal).
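A rough sketch of this coercion test, again with invented names and a toy type hierarchy, could look as follows.

# Sketch of coercing a request into a backward-looking answer/acceptance
# (hypothetical names; the real test works on the parser's feature structures).

SUBTYPES = {"cow": "animal", "sheep": "animal", "apple": "food"}

def compatible(required, offered):
    """True if 'offered' is the required type or one of its subtypes."""
    return offered == required or SUBTYPES.get(offered) == required

def coerce(request, action_stack):
    """If the request elaborates a pending action, treat it as an answer or
    acceptance that enriches that action; otherwise it is a new initiative."""
    for action in reversed(action_stack):
        if (request["verb"] == action["verb"]
                and compatible(action["object_type"], request["object_type"])):
            action["object_id"] = request.get("object_id")
            return "answer"        # backward-looking: elaborates the pending action
    return "request"               # forward-looking: a genuinely new request

pending = [{"verb": "feed", "object_type": "animal"}]
# "Feed that cow" (with pointing gesture cow$1) while "feed the animal" is pending:
print(coerce({"verb": "feed", "object_type": "cow", "object_id": "cow$1"}, pending))  # -> answer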
7. CONCLUSION
To meet the requirements of natural, non-task-oriented communication in a virtual world, we have developed an experimental platform where the user can use speech and hand gestures to interact with a humanoid agent, and user and agent can engage in mixed-initiative dialogues dealing with work on a virtual farm. The repertoire of gestures the user can use in combination with speech includes pointing gestures that can be input via a touch screen, as well as iconic gestures indicating size and turn management gestures that can be entered by means of a data glove. In this paper we have presented the methodology used to combine speech and gestures, which is based on the unification of syntactic and semantic features in feature structures, and discussed the way in which the communication manager makes it possible for an agent to react to the user's multimodal communication verbally and by carrying out actions in the virtual world.
The Staging virtual farm has provided a nice opportunity for us to experiment with multimodal human-computer interaction in an application where emphasis was put on entertainment value rather than on the accomplishment of specific tasks. On the Staging farm, it is perfectly appropriate for the farmer, when the user points at a brown cow that is looking very restless and asks whether it is hungry, to answer “The brown cow is not hungry, but starving”. If the cow is lying down, the same question will be met with an absurd “The cow is not hungry, but dead”. The possibility of funny responses like these was not planned from the start: it presented itself as we started to realise what possibilities and limitations the scenario had to offer in terms of communication.
From the point of view of multimodal interaction, the farm has made us realise how difficult it is to work not only with speech-based dialogue, but also with the added complexity of gestural input. The coverage of the gesture recognition should be improved substantially with respect to what we have managed to do in Staging to enable users to use gestures in a flexible way, and it should be possible to perform all types of gestures by means of the same device – e.g. a data glove or, in the future, freely in front of cameras. Furthermore, the disambiguating potential of multimodal input can be taken further than has been done in Staging: in particular, it should be investigated how the pairing of speech input and gestures can be made to work with multiple speech analysis hypotheses.
Several aspects of the approach to multimodal dialogue presented here also deserve further study. First of all, a more refined treatment of failures ought to be designed and implemented. For instance, when an action requested by the user becomes impossible, we have seen that the agent's current reaction is simply to abandon his plan to carry out the action. It would be worth pursuing different strategies: for instance, the agent could explain to the user what has happened, or try an alternative path of action. Another way in which the scenario could be made more interesting from a theoretical as well as an application point of view is to have several conversational agents interact with the user at the same time. In principle, the general architecture of the system allows experimenting with multi-party dialogue, but of course this would introduce a great deal of complexity in dialogue management.
In conclusion, we believe that a great deal still has to be learnt about how to process gestural input, how to combine it with other modalities in computer interfaces, and how actual users would like to be able to use gestures when interacting with systems for different purposes. The Staging farm has provided an interesting framework for studying some of the aspects of multimodal interaction in virtual environments.
8. NOTES
1. In principle, the communication manager could handle communication between agents as well. Such an extension of its role, however, is by no means trivial, and it is not clear that such a complex apparatus is necessary. The project has chosen not to address this issue.
9. REFERENCES
Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L. and Stent, A. (2000) An architecture for a generic dialogue shell. Natural Language Engineering 6 (3&4), 213–228.
Allwood, J. (1999) Cooperation and Flexibility in Multimodal Communication. Gothenburg Papers in Theoretical Linguistics. December.
Allwood, J. (2002) Bodily Communication Dimensions of Expression and Content, in B. Granström and D. House (eds.) Multimodality in Language and Speech Systems. Kluwer, Dordrecht.
Badler, N.I., Bindiganavale, R., Allbeck, J., Schuler, W., Zhao, L. and Palmer, M. (2000) Parameterized action representation for virtual human agents, in J. Cassell, J. Sullivan, S. Prevost and E. Churchill (eds) Embodied Conversational Agents, MIT Press, 256–284.
Bavelas, J.B., Chovil, N. and Roe, L. (1995) Gestures Specialized for Dialogue. Personality and Social Psychology Bulletin, Vol. 21, No. 4, 394–405, April.
Cassell, J. and Prevost, S. (1996) Distribution of semantic features across speech & gesture by humans and machines, in Proceedings of the Integration of Gesture in Language and Speech.
Cohen, P., McGee, D. and Clow, J. (2000) The efficiency of multimodal interaction for a map-based task, in Proceedings of the 6th Applied Natural Language Conference and the 1st Meeting of the North American Chapter of the Association for Computational Linguistics, pp. 331–337, Seattle, Washington, USA.
Core, M.G. and Allen, J.F. (1997) Coding dialogues with the DAMSL annotation scheme. In Working Notes of the AAAI Fall 1997 Symposium on Communicative Action in Humans and Machines.
Johnston, M., Cohen, P., McGee, D., Oviatt, S.L., Pittman, J.A. and Smith, I. (1997) Unification-based multimodal integration. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, Madrid, Spain.
Jönsson, A. (1997) A model for habitable and efficient dialogue management for natural language interaction. Natural Language Engineering, Vol. 3, 103–122, September.
Kendon, A. (1981) Nonverbal Communication, Interaction and Gesture. Mouton, The Hague.
Klasen, M., Szatkowski, J. and Lehmann, N. (2000) The black sheep – interactive improvisation in a 3D virtual world, in Proceedings of the i3 Annual Conference, Jönköping, 13–15.
Larsson, S. and Traum, D. (2000) Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering 6 (3&4), 323–340.
Madsen, C.B. and Granum, E. (2001) Aspects of interactive autonomy and perception, in L. Qvortrup (ed.) Inhabited 3D Virtual Spaces, Springer Verlag, chapter 3, 182–208.
McNeill, D. (1992) Hand and Mind: What Gestures Reveal About Thought. University of Chicago Press, Chicago.
Oviatt, S.L. (1999) Mutual disambiguation of recognition errors in a multimodal architecture. In Proceedings of the Conference on Human Factors in Computing Systems, pp. 576–583. ACM Press, New York.
Paggio, P. and Jongejan, B. (2000) Representing multimodal input in a unification-based system: the Staging project, in Proceedings of the Workshop on Integrating Information from Different Channels in Multi-Media Contexts at ESSLLI 2000, 24–31.
Paggio, P., Jongejan, B. and Madsen, C.B. (2000) Unification-based multimodal analysis in a 3D virtual world: the Staging project, in Proceedings of the CELE-Twente Workshop on Language Technology: Interacting Agents, 71–82.
Qvortrup, L., ed. (2000) Inhabited 3D Virtual Spaces, Springer Verlag.
Traum, D.R. and Heeman, P.A. (1996) Utterance units and grounding in spoken dialogue, in Proceedings of ICSLP.
Vo, M.T. and Wood, C. (1996) Building an application framework for speech and pen input integration in multimodal learning interfaces. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA.
10. AFFILIATIONS
Patrizia Paggio and Bart Jongejan
Center for Sprogteknologi (CST), Copenhagen, Denmark
M. THEUNE, D. HEYLEN AND A. NIJHOLT
GENERATING EMBODIED INFORMATION PRESENTATIONS
1. INTRODUCTION
In conversations between human speakers, speech is the main carrier of information, but nonverbal signals such as gestures and facial expressions also play an important role, providing additional information about the content and the structure of the discourse. In order to successfully engage in a natural interaction with a human user, an embodied conversational agent should be able to interpret the user's speech and nonverbal signals, and to respond with appropriate verbal and nonverbal behaviours of its own. We know from our research on interaction with embodied conversational agents (Nijholt and Heylen, 2002) that extending the agent's repertoire of nonverbal behaviours can improve the quality of the interaction (Heylen et al., to appear). Embodied conversational agents should therefore not just communicate by words but also nonverbally. Just as in human-human communication, many channels can be used to send a whole range of signals. To give a few examples, facial expressions can be used to express interest or surprise and a whole range of emotional and conversational signals. Posture shifts may indicate, among other things, a readiness to speak. Gestures can be used for several functions, such as accentuating parts of utterances. Gaze provides information about the focus of attention and plays a role in turn taking. Nonverbal signals can thus be used to signal information about the mental and emotional state of the agent, its personality, its understanding of what is said in the conversation and many other things. In the past few years several groups have started to investigate different modes of nonverbal communication in more detail. This research deals with a whole range of questions: what signals are appropriate, what can we learn from human-human interaction in this respect, when do signals occur or when are they planned, how are they synchronized, how are they combined, how are they interpreted or evaluated by humans conversing with the agents, and so on. The variety of channels and kinds of signals that can be examined, together with the large number of parameters that enter into nonverbal communication, means that much of the work is still exploratory, focussing on just a few behaviours and functions for specific settings. In principle one would like embodied agents to be capable not only of producing the appropriate signals, but also of receiving and
interpreting the nonverbal signals of the human interlocutors. Although there are notable attempts at this (see e.g. Sowa et al., 2001; Breazeal, 2002; Wahlster, 2002), most work on nonverbal communication has been on the production side. We briefly discuss a few examples to illustrate the kinds of nonverbal signals that have been considered and the kind of research involved. In the tutor agent Steve (Rickel and Johnson, 2000), nonverbal modes of communication have been introduced to support the instructions. Steve uses a number of signalling modes to aid in specific functions that are prominent in teaching situations, like drawing the attention of students to certain objects or actions and giving clues on how they are doing with their tasks. Body orientation, gaze and deictic gestures are used to direct the students' attention to objects in the virtual world. Steve can provide feedback to the students by shaking his head when telling students that they made an error or by a simple nod of approval when students perform correct actions. Nodding can also be used for back-channelling to acknowledge that Steve has understood the student. In this kind of research, the main task dictates the kind of use that is made of nonverbal communication. Research in the Gestures and Narrative Language group of the MIT Media Lab, headed by Justine Cassell, has been concerned with studying aspects of nonverbal communication in human-human conversations and using these results to build computational models of that behaviour for conversational agents in specific dialogue settings. In these studies special attention is paid to the relation between the nonverbal acts on the one hand and linguistic aspects of the dialogue on the other hand. For instance, in their research on gaze, they looked at the role of gaze in relation to turn-taking and information structure. Other nonverbal behaviours they have investigated are posture shifts and beat gestures (Cassell et al., 2000a, 2000b, 2001a). In “face-to-face” conversations, the face is the most expressive part of the body. Ekman (1979) provides an overview of the kind of signals that involve eyebrows. They play a role in common emotional expressions of sadness, surprise, fear and anger or distress; they may help underline certain words or punctuate the discourse; they may function as question markers, indicators of word search, et cetera. People are particularly sensitive to the eyes of others as a source of information about another person's mental state. Baron-Cohen (1995) speaks about a special language of the eyes. Facial expressions can be classified as either functioning as a conversational signal or as emotional icons. The latter could be expressive of an emotional state experienced by the conversant, but in face-to-face conversation, affective displays mostly have a social, communicative function (Kraut and Johnston, 1979; Chovil, 1992). Despite this, much work on facial expressions is solely concerned with a few emotional expressions. A notable exception is the work by Takeuchi and Nagao (1993), who pay attention to conversational signals as well. A few other authors have looked in more detail at how to link facial expressions to the conversational actions and the intentional state of the agents (Pelachaud et al., 1996, 2002; Poggi et al., 2000). The diversity in nonverbal communication is huge and involves many parameters, while our models of emotion, personality, dialogue and social interaction are still rudimentary.
Given this, it is a daunting task to let embodied
conversational agents converse naturally with the full repertoire of verbal and nonverbal modes of communication that people use in face-to-face conversations. In this paper we will look at the relation between verbal and nonverbal communication for embodied conversational agents. We argue that because of their mutual dependence, the generation of verbal and nonverbal signals should be closely integrated. We will describe how a standard architecture for language generation can be extended to deal with the specification of nonverbal behaviours, and work out an example that shows how the resulting architecture can be used in a specific situation. First, however, we describe in more detail the role of human nonverbal modalities in conversation and discuss how taking these modalities into account can improve the interaction with conversational agents.
2. NONVERBAL SIGNALS IN HUMAN CONVERSATION
Humans engaged in conversation typically produce many different kinds of nonverbal signals, which include visual and audio signals. Gestures and facial expressions are the most noticeable visual signals; others are gaze and posture. Audio signals include non-speech sounds, like snorting and sighing, and prosody, which involves features of the speech signal like volume, tempo, and pitch. Both speakers and listeners produce such nonverbal signals, but with differences in frequency and type. Most nonverbal signals are produced unintentionally. In the case of gestures, it has been argued that their main function is to help the speaker with verbal formulation (Krauss et al., 1991; Rimé and Schiaratura, 1991). This point of view is supported by the fact that people still produce nonverbal signals when they cannot be seen, e.g., on the telephone. Still, even though many nonverbal signals may not be intended for the benefit of the hearer/viewer, it has been shown experimentally that people do make use of the information conveyed by such signals (see Kendon (1994) for an overview). In the following, we discuss three general classes of nonverbal signals, according to the kind of information they convey. Following Pelachaud et al. (1996) and Cassell et al. (2000a), we distinguish between the interpretation of a nonverbal signal (its function, in terms of Cassell et al., 2000a) and its form: nonverbal signals with different forms can have the same interpretation, and the same nonverbal behaviour can be interpreted differently, depending on its context. The next three sections provide an overview of nonverbal signals in interaction between humans, grouped according to the kind of information they express. The last section discusses their relevance for embodied agents.
2.1 Expressing message content
One of the functions of communication is transferring knowledge: information about the state of the world is sent from the speaker to the listener. The information being transferred can be expressed using language, but also using nonverbal signals. All nonverbal signals in this category are in principle ‘interchangeable’ with the corresponding verbal expressions. In normal circumstances, the main part of any message is expressed using language. In specific circumstances, ‘content-bearing’
signals can fully replace language. (Think of communicating with someone when there is too much noise to be heard!) Usually, however, they are used in combination with language, sometimes redundantly, and sometimes non-redundantly. With respect to timing, redundant signals are synchronized with the corresponding verbal expression. In the case of non-redundant signals, they co-occur with the verbal reference to the action, object or property that they provide additional information about. Nonverbal signals that are used to express message content can be divided into the following two classes. Deictic signals are used to identify objects being referred to in the message. These signals usually take the form of a pointing hand gesture, but gaze direction, head and body movements are used as well. The indicated objects do not necessarily need to be visible, or even concrete. Deictic signals are different from other signals that express message content in that they do not inherently represent any meaning; their interpretation is entirely determined by the situational context. In this respect they are like verbal deictic expressions such as ‘here’, ‘we’, ‘now’, et cetera. Representational signals are used to express concepts such as attributes, actions, and relationships between objects. They can be divided into different subclasses. Emblems are like words in that the relation between their shape and meaning is arbitrary and different across cultures. They can replace words or entire messages. Examples are nodding or head shaking to express agreement or disagreement, and the ‘thumbs up’ gesture for OK. Iconic and metaphoric signals illustrate properties of objects and actions. Their shape is not arbitrary, but reflects the meaning being expressed. In the case of iconic signals, there is a direct resemblance between the signal and the concept being depicted, for example hands forming a circular shape. They usually take the form of gestures, but facial expressions can also act as icons. For instance, squeezed eyes can symbolize small size (Poggi et al., 2000), or bad taste can be expressed by pulling a disgusted face. For pantomimes, the relation between shape and meaning is even more direct, whereas for metaphoric gestures some visual metaphor is used, such as depicting a physical container to represent a bearer of information, e.g., a film (McNeill, 1992). 2.2 Reflecting discourse structure
Nonverbal signals that reflect the structure of the ongoing discourse deal with the form rather than the content of the message. For instance, speakers mark focused discourse elements by intonation, quick hand movements (beat gestures), nods, eyebrow raises or combinations of these. Turn taking is associated with a number of nonverbal signals. Avoiding eye contact while speaking can indicate a reluctance to give away the turn whereas making eye contact can indicate a readiness to hand it over. Posture shifts, like leaning forward, are often used to indicate one wants to take the turn. Nonverbal signals like prosody and facial expressions can also mark the kind of speech act that is performed, like the typical prosody associated with questions, or back-channelling vocalisations to acknowledge what has been said. These signals have well-defined relations with the linguistic channel, since what they do is provide information about the verbal discourse. In many cases, the
information they convey can also be expressed verbally, but those expressions are often considered as marked. Examples are the use of cleft sentences (‘It was John who…’) and explicit turn taking phrases (‘Do you have anything to add?’).
2.3 Showing speaker/hearer characteristics
Some signals do not convey information about the message, but about the person who is sending or receiving it. Face, gestures and voice may convey information about static characteristics such as personality, age, and gender, and about dynamic characteristics, holding at the moment of speaking: emotion (pleased, irritated), mental state (nervous, confused, paying attention or not), physical state (sick, sleepy, full of energy). This kind of information can be inferred from all kinds of nonverbal signals. In a few cases, the signals ‘stand on their own’ (most notably, facial expressions), but in most cases the information is read from the way other signals are performed, or from their frequency. For instance, a happy or extravert speaker will use broader gestures and more ‘open’ body postures; and if a speaker uses few gestures this can indicate tiredness or lack of enthusiasm, but also age or personality. For most of these signals there is no direct temporal relation with the language being produced, although some dynamic states may be triggered by the content or form of the message (e.g., sadness at bad news, or irritation with long-winded wording). Speaker characteristics that are commonly expressed by nonverbal signals can be described verbally as well, e.g., the speaker can tell the listener about his current emotions. In this case, the information is part of the message content that is expressed intentionally (and which may be inaccurate). Usually, however, nonverbal signals of this type are produced unintentionally, and they are typically those that people will try to conceal or feign in face-to-face conversations (Ekman, 1992).
2.4 Relevance for Embodied Conversational Agents
We have seen that the production of nonverbal signals is an inherent part of human communication. Since nonverbal signals produced by human speakers can help the receiver to understand and remember the information being presented (cf. Kendon, 1994), this can be expected to hold for signals produced by embodied conversational agents as well. On the other hand, it has also been argued that nonverbal signals are usually not attended to, and that when they are, they are distracting and harm the processing of the message. Nevertheless, we believe that to allow for more fluent and natural interactions, embodied agents should be able to produce the same kinds of nonverbal signals as human speakers. Even if they are not always helpful, such nonverbal signals will make the embodied conversational agent more believable and lifelike, i.e., more like a human. This expectation is confirmed by our experiments on gaze (Heylen et al., to appear). For an overview of other experimental results on the effect of nonverbal signals (and agent embodiment in general) on human-agent interaction, see Dehn and van Mulken (2000).
3. AN ARCHITECTURE FOR EMBODIED LANGUAGE GENERATION
In this section we look at the generation of language in combination with nonverbal signals for the presentation of information by an embodied agent. Most existing embodied conversational agents or virtual presenters produce language using canned utterances. However, if the information that has to be presented changes over time, or is highly variable depending on the user's information needs or other contextual dynamics, this approach is not feasible, and some form of natural language generation is required. So far, research in natural language generation has been aimed primarily at unimodal information presentation, where some underlying message is expressed using natural language only, usually in the form of a written text. However, when the message is to be expressed by an embodied agent an additional modality becomes available in the form of nonverbal behaviour. This raises the question of how language generation and the production of nonverbal signals should be combined. A common approach is to specify the agent's verbal utterance first, and then add mark-up to indicate any accompanying nonverbal behaviour. In this approach, the language used by the agent is not adapted to its nonverbal actions. This may result in information presentations that are less concise than they could have been (since the nonverbal signals that are added can only reflect, but not complement or replace, language), and also less natural (for instance, when a full description of an object is made superfluous by pointing). In section 2, we saw that most nonverbal signals have a strong relation with language. Therefore, our approach is to integrate the specification of nonverbal signals with language generation. We focus on those nonverbal signals that are most closely related to language, i.e., deictic and representational signals, which mostly take the form of hand and arm gestures. The specification of other signals in connection to language generation is also discussed, but less extensively. In addition, we take a simplified view on information presentation as purely a ‘monologue’ task, and ignore nonverbal signals that are specific to dialogues, such as those signalling turn taking or dialogue acts. In the following, we first sketch a global architecture for generating embodied information presentations. Then, we discuss in some detail how the production of various kinds of nonverbal signals interacts with related language generation tasks.
3.1 General architecture
A standard architecture for language generation systems is described in Reiter and Dale (2000), who decompose the generation process into the following stages:
- Document planning. This involves determining the content and the global structure of the message to be presented. The outcome is an abstract message specification. In dialogue systems the dialogue planner generally carries out these tasks, which are largely domain specific and language independent.
- Microplanning (or sentence planning). At this stage, the message specification is fleshed out further. This involves the generation of
referring expressions, lexicalisation (word choice), and aggregation (grouping information into clauses and sentences). These tasks require both linguistic and domain knowledge.
- Realisation. Here, the abstract message specification is converted into real text, using knowledge about syntax, morphology, etc. In addition, mark-up may be added for use by external components.
We now discuss how the specification of nonverbal signals can be integrated in this general architecture, so that it can be used to generate texts that contain mark-up for nonverbal signals. In such an integrated architecture, the message specification created at the document planning (or rather, content determination) stage is the basis for specifying both verbal and nonverbal signals. The two language generation tasks that are most closely related to the production of nonverbal signals are lexicalisation and the generation of referring expressions, both of which are carried out at the microplanning level. The main aim of referring expression generation is building a description that distinguishes the intended object from its distractors, i.e., other objects that might be referred to. Typically, such descriptions can be much reduced if they are accompanied by a deictic gesture that rules out most, or even all, distractors. Lexicalisation is the task of choosing words for expressing message concepts, such as actions and object properties. Since many concepts can be expressed by a representational nonverbal signal, the specification of such nonverbal signals can be seen as part of the lexicalisation process. The third microplanning task, aggregation, is related to the generation of nonverbal signals in that it delimits the domain for gesture production: on average, human speakers produce one gesture per clause during information presentation (McNeill, 1992), and to create a natural impression, a virtual presenter should do the same.
In the field of language generation, there is no consensus about the order of the three tasks involved in microplanning. Here, we make the following assumptions on this ordering. Referring expression generation comes before lexicalisation, since it may enrich the message specification with additional concepts (i.e., object properties) that must be lexicalised. For instance, to distinguish an object from its distractors it may be necessary to mention a certain property, e.g., its shape. This property will then have to be lexicalised, either verbally or by making an appropriate iconic gesture, or both. Aggregation is sometimes done before and sometimes after lexicalisation. As argued by Reiter and Dale (2000), it should come at least before referring expression generation, to avoid generating a description more often than necessary. For instance, the two messages X is on R and Y is on R can be aggregated to X and Y are on R, so that a reference to R needs to be generated only once. Given that referring expression generation should precede lexicalisation, this means that aggregation is the first microplanning task.
Nonverbal signals that reflect discourse structure, including prosody, do not really interact with language generation; rather, they reflect the structure of the result. Such signals are added at the realisation level, after syntactic realisation has been finished.
The selection, both in form and frequency, of these and the other types of nonverbal signals should be influenced by different factors, such as the personality type or emotion the agent should convey (see sections 2.3 and 3.5).
Ideally, such factors should also influence the language used by the system, a matter that is ignored in most language generation work (some exceptions are Hovy, 1988, de Rosis and Grasso, 2000, and, for embodied agents, André and Rist, 2000). The output from embodied language generation is annotated text, specifying the nonverbal behaviour that has to be produced in parallel with certain words in the text. Several (XML-based) annotation languages have been proposed for this; e.g., Arafa et al., 2002, DeCarolis et al., 2002, Kranstedt et al., 2002. The annotated text is sent to external speech synthesis and animation modules. Speech synthesis converts the text into a speech signal, taking the prosodic markers into account. Animation takes care of producing the visual nonverbal signals, which have to be synchronised with respect to the pronunciation of the corresponding words, based on timing information from speech synthesis. For descriptions of how this can be done, see Pelachaud et al. (1996) and Cassell et al. (2001b). The animation component also has to determine the actual shape of the signals to be produced, taking relevant speaker characteristics into account (cf. Badler et al., 2002). Figure 1 shows the general architecture for embodied language generation proposed here. We assume that it is a pipeline, i.e., there is no backtracking between modules. This means that nonverbal signals, once they have been added, cannot be removed, and that subsequent generation stages can only add nonverbal signals that do not conflict with those already specified. In other words, deictic signals take precedence over representational ones, and the latter over signals that reflect discourse structure. (We are aware that this is a simplification.) Figure 1 also shows the knowledge sources needed: discourse, domain, speaker, and user models, as well as world knowledge. At least some of these models must be dynamically updated during generation. In the following sections we see how such knowledge is used during the intertwined specification of language and nonverbal signals. 3.2 Deictic signals and the generation of referring expressions
References to objects in the message specification are potential opportunities for deictic nonverbal signals by the agent. A first selection from these candidates will only retain references to objects that are visible to the user. 1Checking this condition is not always trivial; for instance, in applications where the user can move around in a 3D virtual world, as described in section 4, it will require some computation using data from the domain model (listing the positions of virtual objects) and the user model (listing at least the user’s current position and orientation). Since human speakers use deictic gestures and other nonverbal signals mainly to convey new information, another condition is that the object should be new, i.e., not previously mentioned, inferable, or otherwise familiar to the user. We also assume that deictic signals are appropriate for objects that are mentioned in a contrastive context (for ways to detect contrast using a discourse model, see Prevost, 1995, Theune, 2002). In short, first or contrastive references to visible objects will be accompanied by a deictic nonverbal signal. We now discuss how this influences the creation of a verbal reference to the object.
Figure 1. Architecture for combined generation of language and nonverbal signals.
Constructing a verbal description of an object, with the aim of allowing the user to uniquely identify it, involves selecting those properties of the object that distinguish it from other potential referents, the distractors. Having the agent point at a virtual object will generally rule out several of its distractors, namely those that are positioned outside the ‘range’ of the pointing finger. (We assume that pointing with the hand or finger is the preferred form of a deictic gesture.) When pointing is quite exact, for instance when the object is nearly touched, or when there are no distractors close to it, all distractors may be ruled out by the pointing gesture, so that no further verbal description is required. In such a situation, generating a syntactic placeholder such as that or this one may be sufficient. In most cases, however, a pointing gesture by the agent will not rule out all distractors. As the distance from the agent to the virtual object increases, pointing becomes less precise (more or less in the fashion of a flashlight) and will rule out fewer distractors. In the case of deictic head movements or even only gaze, which must be resorted to if the agent's hands are occupied by some task, pointing is always very inaccurate, indicating only a general direction. So, depending on the distance from the agent to the intended referent, and the kind of deictic signal being produced, the system must determine which distractors are ruled out. The remaining distractors will have to be ruled out by the verbal description. The result of referring expression generation is a specification of the content of the verbal expression and
of the deictic signal that should accompany it, if any. This information is added to the message specification, which is passed to lexicalisation.
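As a rough illustration of this interplay between pointing precision and verbal description, the following sketch uses invented names and geometry; it is not a description of an implemented module.

# Sketch of referring expression generation with a deictic gesture.
# Pointing rules out distractors outside its 'beam'; the beam widens
# (pointing gets less precise) as the target gets farther away.

from math import dist

def ruled_out_by_pointing(agent_pos, target, distractors, base_radius=0.5):
    """Return the distractors that the pointing gesture excludes."""
    radius = base_radius * max(1.0, dist(agent_pos, target["pos"]))
    return [d for d in distractors if dist(d["pos"], target["pos"]) > radius]

def describe(target, distractors, properties=("type", "colour", "size")):
    """Add properties to the description until all remaining distractors are excluded."""
    description, remaining = [], list(distractors)
    for prop in properties:
        if not remaining:
            break
        description.append(target[prop])
        remaining = [d for d in remaining if d[prop] == target[prop]]
    return description or ["that one"]       # pointing alone sufficed

agent = (0.0, 0.0)
cow1  = {"pos": (1.0, 1.0), "type": "cow",   "colour": "brown", "size": "big"}
cow2  = {"pos": (1.2, 1.1), "type": "cow",   "colour": "white", "size": "big"}
sheep = {"pos": (9.0, 9.0), "type": "sheep", "colour": "white", "size": "small"}

excluded = ruled_out_by_pointing(agent, cow1, [cow2, sheep])
survivors = [d for d in (cow2, sheep) if d not in excluded]
print(describe(cow1, survivors))   # -> ['cow', 'brown']: pointing only excluded the distant sheep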
3.3 Representational signals and lexicalisation
Not all concepts are equally suitable for expression using a representational nonverbal signal. Semantic features that can be easily visualized are the manner and direction of actions, and the shape, size and (relative) location of objects (Kendon, 1980, McNeill, 1992). When generating embodied information presentations, a simple way of checking whether a specific message concept can be expressed using a nonverbal signal is to use a ‘gesture dictionary’ or database linking concepts to nonverbal signals (Cassell et al., 2001b, Sowa et al., 2001). Such a dictionary may be based on a domain specific inventory of nonverbal signals that have been actually produced by human speakers. Nonverbal signals are only retrieved from the dictionary if they are compatible with nonverbal behaviours that were selected during the preceding generation task (i.e., deictic signals). For instance, the agent cannot produce an iconic gesture representing some property of an object if in the preceding stage it was decided that it should be pointing at that object. Similarly, representational head movements (e.g., a head wiggle to indicate dubiousness) are incompatible with deictic head movements. If multiple nonverbal signals are available to express a concept, a probabilistic choice can be made between them, based on their relative frequency in the corpus that was used to create the dictionary of representational signals. Having identified those concepts in the message specification that the agent can express using a representational nonverbal signal, some of these must be selected. Simply grasping all opportunities for generating a nonverbal signal is not an option, as it is likely to produce an unnatural effect, e.g., giving the impression that the agent is “talking to a foreigner” (Cassell and Prevost, 1996). Like human speakers, in a neutral context the agent should not produce more than one or two gestures per clause, a target that may be raised or lowered based on information from the speaker and user models. For instance, to convey an introvert personality, the agent should produce fewer nonverbal signals than average, whereas if the user actually is a foreigner, more nonverbal signals may be appropriate. As with deictic signals, the selection of representational signals may be based on newness and contrast. Additional selection criteria may reflect the human tendency to express surprising or uncommon object features using a gesture, as well as information that is ‘relevant to the primary communicative goal’ (Yan, 2000). For cases where too many candidates meet these criteria, we need a ranking in the form of a (domain specific) information hierarchy, where the highest level is occupied by information that is most likely to be expressed using a nonverbal signal, e.g., new and highly relevant information, and where the lowest level is occupied by, e.g., discourse-old information. Based on the current target number of nonverbal expressions to be generated (see above), a selection can be made from the candidates, starting with those on the highest levels.
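The lookup, compatibility filtering and probabilistic choice described above could be sketched along the following lines; the dictionary entries, channel assignment and frequencies are invented for illustration and would normally be derived from a corpus.

# Sketch of representational signal selection via a gesture dictionary.

import random

GESTURE_DICT = {                   # concept -> possible signals with relative frequency
    "big":   [("two_hands_apart", 0.7), ("eyebrow_raise", 0.3)],
    "round": [("hands_form_circle", 1.0)],
    "left":  [("point_left", 1.0)],
}

def channel_of(form):
    """Crude assignment of a signal form to a body channel."""
    return "hands" if "hand" in form or "point" in form else "face"

def select_signals(concepts, occupied_channels, max_per_clause=2):
    """Pick at most max_per_clause representational signals for the given
    concepts (assumed already ranked by newness and relevance), skipping
    signals whose channel is already claimed by a deictic signal."""
    chosen = []
    for concept in concepts:
        options = [(form, p) for form, p in GESTURE_DICT.get(concept, [])
                   if channel_of(form) not in occupied_channels]
        if options and len(chosen) < max_per_clause:
            forms, weights = zip(*options)
            chosen.append((concept, random.choices(forms, weights=weights)[0]))
    return chosen

# A clause mentioning a big, round object while the hands are busy pointing:
print(select_signals(["big", "round"], occupied_channels={"hands"}))
# -> [('big', 'eyebrow_raise')]: the hand gestures are blocked by the deictic gesture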
The next decision to be made is which concepts the agent should express only nonverbally, and which ones it should also express using speech, leading to a ‘redundant’ nonverbal signal (Cassell and Prevost, 1996). Since nonverbal signals run a higher risk of being missed (i.e., overlooked) by the user than speech, it seems that the choice between redundant and non-redundant signals should be based on the deemed importance of the information to be expressed. In terms of the information hierarchy mentioned above, only items on the lower levels should be expressed non-redundantly. Other important factors are economy and ease of expression. Some concepts are more easily expressed nonverbally than verbally. For instance, complex motions or shapes can be relatively difficult or inefficient to express using a verbal description. This makes the use of speech less attractive, especially when the presentation is bound to certain time limits. Regardless of the information hierarchy, a non-redundant visual signal will be preferred in such cases. In the choice between redundancy and non-redundancy, the kind of nonverbal signal being produced also plays a role. Emblems that are used non-redundantly seem to run little risk of being overlooked, because they actually replace the words they correspond with: speech is temporarily interrupted when a non-redundant emblem is produced, drawing the hearer's attention to the emblem. An example is saying She is really <emblem for clever>. Here, the choice for non-redundancy seems to be a matter of style rather than efficiency or effectiveness. On the other hand, other non-redundant representational signals are usually produced during the verbal description of the action or object of which they express an attribute. In this case, speech is not interrupted, so there is a risk of being overlooked, especially for subtle behaviours such as head or eye movements.
3.4 Nonverbal signals reflecting discourse structure
Nonverbal signals that reflect discourse structure take the form of both prosodic and visual nonverbal signals. Since these signals do not only mark informational but also syntactic structure, they are specified during the realisation stage, after syntactic realisation has taken place. Based on syntactic information, some phrase boundaries can be marked prosodically by specific intonation contours or pauses of different lengths (cf. Theune et al., 2001), and visually by head motions, gaze direction, blinks or eyebrow movements (cf. Pelachaud et al., 1996). Similar visual signals, as well as beat gestures and pitch accents, can be used to mark words expressing new or contrastive information, and posture shifts may be inserted to mark topic changes (cf. Cassell et al., 2001a). As before, appropriate signals can only be selected if they are not in conflict with signals that were previously specified. In other words, we assume that both deictic and representational signals take precedence over signals that reflect discourse structure. This is similar to the rule of gesture class priority used in the BEAT system (Cassell et al., 2001b). Since deictic and representational signals most often take the form of gestures, in our proposed architecture discourse structure is somewhat more likely to be reflected by prosody and facial actions than by beat gestures, which run a higher chance of being incompatible with nonverbal
signals specified during microplanning. Rules for coordinating facial motions with intonation have been proposed by Pelachaud et al. (1996). Finally, the influence of speaker characteristics must also be taken into account at this level. For instance, to create an impression of extraversion or enthusiasm, the agent should produce more beat gestures and pitch accents than average, and to convey introversion it should look away more and move less.
3.5 Factors influencing selection preferences and production rate
We have already discussed that the type and frequency of the nonverbal signals produced by the agent can be varied depending on which speaker characteristics should be conveyed, such as personality and emotion. These characteristics also influence the actual shape of the specified nonverbal signals as determined by the speech synthesis and animation modules. In addition, there are several other contextual factors that should be taken into account in the selection of nonverbal signals. Among these are user characteristics such as age, nationality, and level of expertise in the application domain. For instance, the use of more, and less subtle, nonverbal signals may be appropriate when the agent is speaking to a child as opposed to an adult, to a foreigner as opposed to a native speaker, and to an inexperienced user as opposed to an experienced one. In addition, there may be an influence of the application domain. In some domains, getting the message across correctly may be more important than in others; for example, in an educational setting production rates may be set higher than in social talk, and non-redundant signals may be avoided to ensure that the student does not miss any information. Within domains, message complexity may have to be taken into account. As shown by Cohen (1977) for the presentation of route descriptions, the gesture rate of human speakers increases with the complexity of the message being presented. Presumably, this also holds for other kinds of nonverbal signals and in other domains, and it may be sensible to have the agent copy such behaviour.
3.6 Discussion
In the preceding sections, we have taken a standard language generation architecture (Reiter and Dale, 2000) as the starting point for describing an ‘ideal’ system for the generation of embodied information presentations. However, several language generation systems do not adhere to this standard architecture, or do not perform all of the distinguished tasks (see Cahill et al., 1999). Also, in practice many generation tasks are carried out in an ad-hoc rather than a theoretically motivated manner. Examples are directly mapping concepts to standard lexicalisations instead of ‘real’ word choice, and the use of templates for syntactic realisation. Similarly, in the architecture sketched here all possible kinds of nonverbal signals are specified in a principled manner, but in practical systems, achieving this will be nearly impossible, and often unnecessary. Also, the model for embodied language generation presented here is necessarily simplified. Clearly, the interactions between language generation and the production of nonverbal signals are more intricate than
has been presented here. Finally, much of what has been presented rests on assumption: there are many things about the production of nonverbal signals (and of language, for that matter) that we do not know yet. To extend our knowledge, careful study of human speaker behaviour will be required, as well as evaluation of implemented models. User experiments should be conducted to test the naturalness and effectiveness of different types of generated information presentations, as well as their effect on the user’s perception of the agent’s personality and other speaker characteristics. 4. EXAMPLE: THE PRESENTATION OF ROUTE DESCRIPTIONS In this section, we discuss the presentation of route descriptions as an example application of embodied language generation. Human speakers presenting a route description make prominent use of both deictic and representational signals to indicate landmarks and directions. These signals mainly take the form of broad hand and arm gestures; facial expressions play a relatively minor role. The ANGELICA5 project aims at developing an embodied agent that presents route descriptions in the Virtual Music Centre (VMC, Figure 2). The VMC is a 3D virtual building with halls, corridors and different floors (Nijholt and Heylen, 2002). Since visitors to such environments often experience navigation problems (Nijholt et al., to appear), the VMC is a natural environment for the presentation of route descriptions. In previous research we developed a navigation agent that determines the user’s intended destination within the VMC by means of a spoken dialogue, and then computes the shortest route to this destination (van Luin et al., 2001). This agent is not yet embodied, and cannot give a verbal route description: it either presents the computed route on a map or moves the user through the VMC to the target location. In the ANGELICA project we will extend the navigation agent with a component for embodied language generation as sketched in the previous section. In the following, we describe how the embodied guide would generate a simple verbal and nonverbal route description, illustrating the proposed architecture.
Figure 2. Outside view of the VMC.
4.1 Content determination
The target location of the route description is determined in a spoken dialogue between the user and the embodied guide. This involves the generation of several dialogue utterances, accompanied by nonverbal signals. Several of these will be dialogue-specific, e.g., turn taking signals and signals that reflect different speech acts, such as confirmation and verification. Since our focus is on the presentation of the route description, we do not describe this stage of the interaction (but see Heylen et al. (to appear) for our work on gaze as a turn taking signal in spoken dialogues with an embodied agent). Here, we only assume that, after a number of turns, the target location is established as being the balcony of the VMC. The shortest route to this location is computed, and returned in the form of a vector of 3D coordinates.
ROUTE: [ action: [type: walk, direction: up, object: s2],
         action: [type: turn, direction: sharp_right],
         action: [type: walk, direction: through, object: d3] ]

Figure 3. Example route specification
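A message specification of this kind maps directly onto a small data structure. The following sketch is purely illustrative and uses hypothetical type names; it is not the representation used in the ANGELICA system itself.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Action:
    # One step of the route: walking or turning, with a direction and an optional landmark.
    type: str                       # 'walk' or 'turn'
    direction: str                  # 'up', 'through', 'sharp_right', ...
    object: Optional[str] = None    # landmark identifier such as 's2' (stairs) or 'd3' (door)

@dataclass
class Route:
    actions: List[Action]

# The example of Figure 3: up the stairs, sharp right turn, through the first door.
route = Route(actions=[
    Action(type='walk', direction='up', object='s2'),
    Action(type='turn', direction='sharp_right'),
    Action(type='walk', direction='through', object='d3'),
])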
This vector is given as input to content determination, which turns it into a message specification as shown in Figure 3, which describes the route in terms of the actions the user has to carry out to reach the intended destination, for example walking towards some landmark or turning in some direction. For brevity, ‘obvious’ locations and actions, such as the start and end point of the route, have been left out. Creating such a route specification requires a nontrivial amount of domain and world knowledge, which is used to map the vector of 3D coordinates to actions, landmarks and spatial concepts. We do not go into that here. 4.2. Microplanning
The first task performed during microplanning is aggregation. Here, it is determined that the actions specified in our example message can be expressed in separate clauses, which are combined into one sentence. Within this sentence, the performer of each action (the user, left implicit in the message) remains the same, and therefore only needs to be mentioned in the first clause. In the following, we discuss the generation of referring expressions (including deictic signals) and lexicalisation (including representational signals) for each clause/action in turn.
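Before walking through the example, the kind of preference logic applied below can be summarised in a small sketch. It only illustrates the priorities discussed in section 3 (deictic and representational signals take precedence, and information low in the hierarchy gets no nonverbal marking); the function name, categories and return values are assumptions made for the sake of the example.

def select_nonverbal_signal(importance, iconic_available, hands_free):
    """Decide how a single content element is expressed nonverbally.

    importance       -- 'new_and_essential', 'new' or 'given' (information hierarchy)
    iconic_available -- True if the gesture dictionary offers a representational gesture
    hands_free       -- False if a deictic or other gesture already occupies the hands
    """
    if importance == 'given':
        return 'none'                       # bottom of the information hierarchy: verbal only
    if iconic_available and hands_free:
        return 'representational_gesture'   # e.g. an upward movement for the direction 'up'
    # No room for an additional hand movement: leave it to prosody or facial
    # signals, which are added later during realisation.
    return 'verbal_only'

print(select_nonverbal_signal('new', True, False))   # verbal_only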
Expressing the first action involves referring to the performer (the user) and the object of the action (s2, one of the two stairs in the VMC). For referring to the user, a simple pronoun (‘you’) is sufficient. This expression is not accompanied by a nonverbal signal, as the information being expressed is at the very bottom of the information hierarchy from section 3.3. The second reference is to the stairway, s2. This entity is new to the discourse, essential for the route description, and in the user’s current line of sight. A deictic hand gesture is therefore selected. To make clear that the agent refers to the stairs as a whole, and not to one of its steps or the handrail, the referring expression algorithm specifies that the object’s type property must be added to the verbal part of the expression. In combination with the gesture, this is sufficient to uniquely identify object s2. (If pointing had not been possible, for instance because the user could not see s2, then a more elaborate verbal description distinguishing it from the other stairway in the VMC would have been required.) Now that the content of the referring expressions is known, lexicalisation starts. One candidate for nonverbal expression is s2’s property of being a stairway, which can be easily expressed using, for instance, an iconic gesture representing the shape of the steps (assuming such a gesture is present in the ‘gesture dictionary’). However, a deictic gesture has already been selected for referring to the stairs, leaving no room for any additional hand movements by the agent. Therefore, the type property can only be lexicalised verbally. Another candidate for nonverbal expression is the upward direction of the action. This information is both new and essential for the route to be taken, so an upward moving gesture is selected. The action type, walking, might be expressed by a ‘walking finger’ gesture (cf. Cassell et al., 2000a). However, this information is low in the information hierarchy, since there is no other way to move around in the VMC. Therefore no nonverbal expression is selected to express this aspect of the action, and on the verbal side the general verb ‘go’ is preferred over the more specific, but superfluous, ‘walk’. All in all, two gestures are selected for the first clause: pointing upwards and at the stairs. Describing the second action does not involve references to objects, and thus no deictic signals. During lexicalisation, it is found that the concept of turning in a specific direction can be expressed nonverbally. Since this concept is new and important, a gesture is selected where the hand and arm demonstrate a sharp right turn. On the verbal side, the phrase ‘turn sharply to the right’ is selected. Here, the adverbial ‘sharply’ might have been left out, as it is grammatically optional and the sharpness of the turn is already illustrated by the gesture. However, this aspect is essential for the route, since the user can also make a non-sharp right turn at the top of the stairs. Only using a non-redundant gesture carries the risk of this information being overlooked and the user taking a wrong turn, so an additional verbal expression is preferred. The third action involves d3, one of the doors in the VMC. Because d3 is invisible from the user’s current location a deictic signal is deemed inappropriate, and a distinguishing verbal description must be created. The VMC has several doors, but only one of them (d4) is near to d3 and therefore counts as a distractor. 
To distinguish d3 from d4, describing it as ‘the first door’ suffices. We assume that the door’s attribute of being first, or nearest, cannot be expressed nonverbally.6 The concept of (walking through) a door may be expressed by a nonverbal signal, e.g., a
pantomimic gesture. However, walking through is an obvious action in connection with a door, and so we assume that no such gesture will be selected and that this information is only expressed verbally. 4.3. Realisation
The final generation stage is realisation. For the first clause, syntactic realisation produces the sentence ‘You go up the stairs’. Then, nonverbal signals reflecting discourse structure are added. The clause is too short for pauses, but the end of the clause is marked prosodically by rising intonation, indicating that the sentence is not yet at its end. The words ‘up’ and ‘stairs’ convey new information, so these receive a pitch accent. They cannot be accompanied by beat gestures, since the agent’s hands are already involved with gestures specified during microplanning: pointing upwards and at the stairs. Available alternatives to mark the focused information are blinking and raising the eyebrows (cf. Pelachaud et al., 1996). The second clause is syntactically realised as ‘turn sharply to the right’, where ‘sharply’ and ‘right’ are accented and accompanied by blinks and eyebrow movements. Again, there is no room left for beat gestures. The final clause is realised as ‘and go through the first door’. Here, ‘first’ is accented and accompanied by a beat gesture, since this word expresses information that is both new and contrastive (distinguishing d3 from d4). The word ‘door’, on the other hand, is not accented, since this property is shared by d3 and d4 and thus not contrastive. The end of the clause is marked by a falling intonation contour, since it is sentence-final. The resulting route description, with simplified mark-up, looks as follows:

you go UP the STAIRS // turn SHARPLY to the RIGHT / and go through the FIRST door

Here, accented words are given in small capital letters, and phrase boundaries of different strengths are indicated by a number of slashes. Of the visual signals, only gestures are indicated using greatly simplified XML style mark-up. The embodied presentation of the first clause is illustrated in Figure 4.

4.4. Remaining issues
It is clear that generating an embodied route description does not only require rules for the generation of verbal and nonverbal signals, but also involves a good deal of reasoning on the basis of (spatial) domain knowledge and general knowledge. The questions that need to be resolved include which information is more or less essential for describing the route, how ‘virtual objects’ and their properties can be linked to coordinates in the 3D world, which objects are currently visible to the user (or will be from certain points along the route), and which objects are within the range of a deictic signal. In addition, it is not clear how to react to user
interruptions or movements during information presentation. For instance, what if the user moves away from the virtual guide, perhaps to look where she is pointing?
Figure 4. ‘You go UP the STAIRS’
Other remaining problems are at the level of speech synthesis and, in particular, animation. One of these is timing: the generated nonverbal signals have to be synchronized with the pronunciation of the corresponding words. For signals that correspond to phrases of varying lengths, this might require speeding up or slowing down their execution, depending on speech times. If we assume that the guide does not have a fixed location, this means that deictic signals must be animated in real time, based on the current position of the guide relative to the object being indicated. This involves selecting from all possible joint rotations and translations those that will result in a natural looking movement. Another issue is that of realistically blending similar signals (e.g., different gestures or facial expressions) that are performed in close sequence. An example is combining the upwards and deictic gestures from the first clause of our route description. Nonverbal facial signals will also have to be combined with facial movements that are directly related to physical speech production (visemes). Finally, in determining the actual shape of the nonverbal signals, both in animation and in speech synthesis, any relevant speaker characteristics should be taken into account. For many of the research issues listed above, practical solutions are still lacking, so that at the implementation level several of these problems will have to be sidestepped or handled in an ad-hoc manner. 5. RELATED WORK So far, there has been relatively little work on combining language generation with the production of nonverbal signals by embodied conversational agents. Notable exceptions are the work by Cassell et al. (2000ab) and Pelachaud et al. (2002). Cassell et al. focus on the division of labour between speech and
representational gestures, guided by information about the discourse context and the communicative function of the utterance to be generated. Their work is applied in REA, a virtual real estate agent that presents embodied descriptions of houses. In addition to representational signals, REA also generates nonverbal signals that are related to discourse structure. These include posture shifts (Cassell et al., 2001a), gaze, and beat gestures. Pelachaud et al. (2002) developed the Greta agent, a talking face that presents medical diagnoses to patients. Greta can display nonverbal signals that reflect discourse structure and the agent’s cognitive and emotional state, as well as signals that express message content. There is no interaction between language generation and nonverbal signal production, however. Other research in this area is that of André and Rist (2000), who have worked on language generation for virtual presenters, focusing on the problem of projecting different agent personalities. The NECA project (Krenn et al., 2002) builds further on this work, developing components for reasoning about agents’ emotions and generating combined (emotional) speech and nonverbal signals. Work on discourse-related prosody combined with language generation includes that of Williams (1999) (in the route description domain), McKeown and Pan (2000) and Theune et al. (2001). Schmauks (1987), Claassen (1992), Reithinger (1992), and Lester et al. (1999) address the generation of multimodal referring expressions. Their approaches are not aimed at animated agents, with the exception of Lester et al., who combine language generation with the production of pointing gestures by a pedagogical agent. All mentioned approaches are limited in that pointing is assumed to be exact, and hardly influences property selection for the verbal description. Krahmer and van der Sluis (2003) propose a new model for generating multimodal references that improves on this by using various degrees of pointing precision, and relating the selection of distinguishing properties to the kind of pointing gesture being generated. The presentation of route descriptions by an embodied agent is addressed in the REAL project, where an embodied agent shows users the way through a virtual environment (Baus et al., 2000). However, this agent has no language generation capabilities except for limited object references. In the MACK system (Cassell et al., 2002), route descriptions are presented by an embodied conversational agent that is projected in a real-world environment. This agent does not have language generation capabilities, but uses canned text. It produces nonverbal signals using a probabilistic strategy, based on data from route descriptions by human speakers. 6. CONCLUSION Over the years there has been a steady increase of our knowledge of human verbal and nonverbal communication, and how it can be simulated by embodied conversational agents. Nevertheless, the actual development of applications in which all kinds of contextually appropriate nonverbal signals are generated in a principled fashion is still some steps away. In this chapter we have sketched a general architecture for combining language generation with the specification of nonverbal signals for information presentation by embodied conversational agents, and illustrated how it can be used in a specific application. The model for embodied
language generation we have presented here is necessarily simplified, and much of it rests on assumptions. To extend our knowledge, further studies of spontaneous human speaker behaviour are required, as well as controlled experiments and evaluations of implemented systems. Our ultimate goal is to allow for human-agent interactions that are as natural, effective, and entertaining as face-to-face interactions between humans.

7. NOTES

1 Actually, this constraint is too strong, because human speakers also use deictic references to objects that are not visible, but occupy an imaginary position projected in the space before the speaker (McNeill, 1992). However, for simplicity we will ignore this here.
2 Most representational signals take the form of gestures; we have no figures available on the average number of other nonverbal behaviours produced by human speakers.
3 Theune et al. (2001) determine prosody on the basis of syntactic structure. On the other hand, Steedman (2000) points out that syntactic and prosodic structure may diverge.
4 We ignore the possibility of overlaying beats over other gestures, as observed by McNeill (1992).
5 A Natural-language Generator for Embodied, Lifelike Conversational Agents. This project is partially funded by NWO, the Dutch Organization for Scientific Research.
8. REFERENCES André E. and Rist T. (2000). Presenting through performing: on the use of multiple lifelike characters in knowledge-based presentation systems. Proceedings of the Second International Conference on Intelligent User Interfaces, New Orleans, USA, 1-8. Arafa Y., Kamyab K., Mamdani E., S. Kshirsagar, Magnenat-Thalmann N., Guye-Vuilleme A. and Thalmann D. (2002). Two approaches to scripting character animation. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002. Badler N., Allbeck J., Zhao L. and Byun M. (2002). Representing and parameterizing agent behaviors. Proceedings of Computer Animation, IEEE Computer Society, Geneva, Switzerland, 133-143. Baron-Cohen S. (1995). Mindblindness. MIT Press. Baus J., Butz A. and Krüger A. (2000). Incorporating a virtual presenter in a resource adaptive navigational help system. Proceedings of the Workshop on Guiding Users through Interactive Experiences, Paderborn, Germany. Breazeal C.L. (2002). Designing Sociable Robots. MIT Press. Cahill L., Doran C., Evans R., Mellish C., Paiva D., Reape M., Scott D. and Tipper N. (1999). In search of a reference architecture for NLG systems. Proceedings of the 7th European Workshop on Natural Language Generation (EWNLG'99), Toulouse, France, 77-85. Cassell J. and Prevost S. (1996). Distribution of semantic features across speech and gesture by humans and computers. Proceedings of the Workshop on the Integration of Gesture in Language and Speech, Newark, USA, 253-270. Cassell J., Bickmore T., Campbell L., Vilhjálmsson H. and H. Yan (2000a). Conversation as a system framework: Designing embodied conversational agents. In J. Cassell, S. Prevost, E. Churchill, and J. Sullivan (eds.), Embodied Conversational Agents. MIT Press, 29-63. Cassell J., Stone M. and Yan H. (2000b). Coordination and context-dependence in the generation of embodied conversation. Proceedings of the First International Conference on Natural Language Generation, Mitzpe Ramon, Israel, 171-178. Cassell J., Nakano Y., Bickmore T., Sidner C. and Rich C. (2001a). Non-verbal cues for discourse structure. Proceedings of the 41st Annual Meeting of the Association of Computational Linguistics (ACL 2001), Toulouse, France, 106-115.
Cassell J., Vilhjálmsson H. and Bickmore T. (2001b). BEAT: the Behavior Expression Animation Toolkit. Proceedings of the 28th Int. Conference on Computer Graphics and Interactive Techniques (SIGGRAPH 2001), Los Angeles, USA, 477-486.
Cassell J., Stocky T., Bickmore T., Gao Y., Nakano Y., Ryokai K., Tversky D., Vaucelle C. and Vilhjálmsson H. (2002). MACK: Media lab Autonomous Conversational Kiosk. Proceedings of Imagina '02, Monte Carlo, Monaco.
Chovil N. (1991/1992). Discourse-oriented facial displays in conversation. Research on Language and Social Interaction 25, 163-194.
Claassen W. (1992). Generating referring expressions in a multimodal environment. In R. Dale, E. Hovy, D. Rösner and O. Stock (eds.), Aspects of Automated Natural Language Generation. Springer Verlag, Berlin, 247-262.
Cohen A. (1977). The communicative functions of hand illustrators. Journal of Communication 27, 54-63.
Cohen A. (1980). The use of hand illustrators in direction-giving situations. In W. von Raffler-Engel (ed.), Aspects of Nonverbal Communication, Swets and Zeitlinger BV, Lisse, 265-273.
DeCarolis B., Carofiglio V., Bilvi M. and Pelachaud C. (2002). APML, a mark-up language for believable behavior generation. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002, Bologna, Italy.
Dehn D. and van Mulken S. (2000). The impact of animated interface agents: a review of empirical research. International Journal of Human-Computer Studies 52(1), 1-22.
Ekman P. (1979). About brows. In M. von Cranach, K. Foppa, W. Lepenies and D. Ploog (eds.), Human Ethology, Cambridge University Press, 169-202.
Ekman P. (1992). Telling Lies. W.W. Norton and Company.
Heylen D., van Es I., Nijholt A. and van Dijk B. (to appear). Controlling the gaze of a conversational agent. In J. van Kuppevelt, L. Dybkjaer and O. Bernsen (eds.), Natural, Intelligent and Effective Interaction in Multimodal Dialogue Systems. Kluwer Academic Publishers.
Hovy E. (1988). Generating Natural Language Under Pragmatic Constraints. Lawrence Erlbaum, Hillsdale, New Jersey.
Kendon A. (1980). Gesticulation and speech. In M.R. Key (ed.), The Relationship of Verbal and Nonverbal Communication. Mouton, Den Haag, 207-227.
Kendon A. (1994). Do gestures communicate? A review. Research on Language and Social Interaction 27(3), 175-200.
Krahmer E. and van der Sluis I. (2003). A new model for the generation of multimodal referring expressions. Proceedings of the 9th European Workshop on Natural Language Generation, held in conjunction with EACL 2003, Budapest, Hungary, 47-54.
Kranstedt A., Kopp S. and Wachsmuth I. (2002). MURML: a Multimodal Utterance Representation Markup Language for conversational agents. Proceedings of ECA's: Let's specify and evaluate them! Workshop held in conjunction with AAMAS 2002, Bologna, Italy.
Krauss R.M., Morrel-Samuels P. and Colasante C. (1991). Do conversational hand gestures communicate? Journal of Personality and Social Psychology 28, 389-450.
Kraut R.E. and Johnston R.E. (1979). Social and emotional messages of smiling: an ethological approach. Journal of Personality and Social Psychology 37, 1539-1553.
Krenn B., Pirker H., Grice M., Baumann S., Piwek P., van Deemter K., Schroeder M., Klesen M. and Gstrein E. (2002). Generation of multimodal dialogue for net environments. In S. Busemann (ed.), Konferenz zur Verarbeitung natürlicher Sprache (KONVENS 2002), Saarbrücken, Germany, 91-98.
Lester J., Voerman J., Towns S. and Callaway C. (1999). Deictic believability: coordinating gesture, locomotion, and speech in lifelike pedagogical agents. Applied Artificial Intelligence 13(4-5), 383-414.
van Luin J., Nijholt A. and op den Akker R. (2001). Natural language navigation support in virtual reality. Proceedings of the International Conference on Augmented, Virtual Environments and Three-dimensional Imaging (ICAV3D), Mykonos, Greece, 263-266.
McKeown K. and Pan S. (2000). Prosody modeling in concept-to-speech generation: methodological issues. Philosophical Transactions of the Royal Society 358, 1419-1431.
McNeill D. (1992). Hand and Mind: What Gestures Reveal about Thought. University of Chicago Press, Chicago.
Nijholt A. and Heylen D. (2002). Multimodal communication in inhabited virtual environments. International Journal of Speech Technology 5, 343-354.
Nijholt A., Zwiers J. and van Dijk B. (to appear). Maps, agents and dialogue for exploring a virtual world. In J. Aguilar, N. Callaos and E.L. Leiss (eds.), Web Computing. International Institute of Informatics and Systemics (IIIS).
Pelachaud C., Badler N. and Steedman M. (1996). Generating facial expressions for speech. Cognitive Science 20, 1-46.
Pelachaud C., Carofiglio V., De Carolis B., de Rosis F. and Poggi I. (2002). Embodied contextual agent in information delivering application. Proceedings of the First International Joint Conference on Autonomous Agents & Multi-Agent Systems (AAMAS'02), Bologna, Italy, 758-765.
Poggi I., Pelachaud C. and de Rosis F. (2000). Eye communication in a conversational 3D synthetic agent. AI Communications 13(3), 169-182.
Prevost S. (1995). A Semantics of Contrast and Information Structure for Specifying Intonation in Spoken Language Generation. PhD thesis, University of Pennsylvania.
Reiter E. and Dale R. (2000). Building Natural Language Generation Systems. Cambridge University Press, Cambridge.
Reithinger N. (1992). The performance of an incremental generation component for multimodal dialog contributions. In R. Dale, E. Hovy, D. Rösner and O. Stock (eds.), Aspects of Automated Natural Language Generation. Springer Verlag, Berlin, 263-276.
Rickel J. and Johnson W.L. (2000). Task-oriented collaboration with embodied agents in virtual worlds. In J. Cassell, S. Prevost, E. Churchill and J. Sullivan (eds.), Embodied Conversational Agents. MIT Press, 95-122.
Rimé B. and Schiaratura L. (1991). Gesture and speech. In R. Feldman and B. Rimé (eds.), Fundamentals of Nonverbal Behavior. Cambridge University Press, Cambridge, 239-281.
de Rosis F. and Grasso F. (2000). Affective natural language generation. In A.M. Paiva (ed.), Affective Interactions. Springer Lecture Notes in AI 1814, 204-218.
Schmauks D. (1987). Natural and simulated pointing. Proceedings of the 3rd Conference of the European Chapter of the Association for Computational Linguistics (EACL'87), 179-185.
Sowa T., Kopp S. and Latoschik M.E. (2001). A communicative mediator in a virtual environment: processing of multimodal input and output. Proceedings of the International Workshop on Information Presentation and Natural Multimodal Dialogue (IPNMD-2001), Verona, Italy, 71-74.
Steedman M. (2000). Information structure and the syntax-phonology interface. Linguistic Inquiry 31(4), 649-689.
Takeuchi A. and Nagao K. (1993). Communicative facial displays as a new conversational modality. InterCHI '93, Amsterdam, The Netherlands, 187-193.
Theune M. (2002). Contrast in concept-to-speech generation. Computer Speech and Language 16(3/4), 491-531.
Theune M., Klabbers E., de Pijper J.R., Odijk J. and Krahmer E. (2001). From data to speech: a general approach. Natural Language Engineering 7(1), 47-86.
Wahlster W. (2002). SmartKom: fusion and fission of speech, gestures, and facial expressions. Proceedings of the First International Workshop on Man-Machine Symbiotic Systems, Kyoto, Japan, 213-225.
Williams S. and Watson C. (1999). A profile of the discourse and intonational structures of route descriptions. Sixth European Conference on Speech Communication and Technology (Eurospeech'99), Budapest, Hungary, 1659-1662.
Yan H. (2000). Paired Speech and Gesture Generation in Embodied Conversational Agents. Master's thesis, Media Lab, MIT.
9. AFFILIATIONS

M. Theune
D. Heylen
A. Nijholt
University of Twente
Twente, The Netherlands
MOBILE PRESENTATIONS

The last few years have seen an increasing interest in the development of applications for mobile systems. The mobile user setting poses new challenges for information presentation, especially because of the constraints imposed by the small display dimensions and by the conditions of usage (while walking or driving, in bad light conditions and so on). Technology on the market nowadays, in particular third-generation cellular phones and the new hybrid devices that mix cellular phones and PDAs, allows users to connect to the Internet and to access a wide array of services. Yet there is still a long way to go before these appliances may be effectively used in our everyday activities. Usability experts are working on guidelines and interface principles to build interfaces specifically appropriate for the small dimensions of the display or, for instance, that improve the efficiency of browsing the web while walking. The approach taken by the authors of this cluster is different, although not in contrast with the effort toward usability. The underlying principle is that intelligent adaptation of content may improve the user experience with mobile devices. The first chapter of this cluster, by Baus, Krüger and Stahl, argues that resource-adaptiveness is a crucial topic for increasing the efficacy of a mobile navigation system. They propose that adaptation has to be achieved on three different levels of the navigation task: the wayfinding level, the presentation planning level and the presentation rendering level. Stock, Zancanaro and Not discuss the opportunities and challenges offered by the area of cultural heritage appreciation for innovative, natural-language-centred applications. They introduce the PEACH project, aimed at exploring various technologies for enhancing the visitors' experience during their actual visit to a museum. Finally, Rist discusses the problem of the overwhelming variety of new portable computing and communication devices. He argues that this adds a new level of complexity to the design of usable information services, since different users may not be equipped equally in terms of output and input capabilities. With a special focus on graphical representations, his contribution sketches several application scenarios taken from projects conducted at DFKI and discusses the technical approaches to accomplishing adaptation tasks that result from device restrictions.
J. BAUS, A. KRÜGER AND C. STAHL
RESOURCE-ADAPTIVE PERSONAL NAVIGATION
1. INTRODUCTION

While car navigation systems are already widely commercialized today, in the near future not only motorists will be guided by a navigation assistance system. Recent progress in mobile computing has provided us with the means to realize pedestrian navigation systems. Personal navigation systems will provide location sensitive information, which can be accessed at any place and any time. Since several different means of transportation typically have to be combined in order to reach a destination, it must be ensured that the user interface reacts to the user's changing situation. Changing the means of transportation implies that a personal ubiquitous navigation service must adapt smoothly to the associated situational context. The necessary change between different positioning technologies, such as GPS or GSM/UMTS radio-cell based technologies in an outdoor scenario and infrared and Bluetooth technologies in buildings, should remain unnoticed by the user of such a navigation system. Furthermore, the system should adapt the presentation to different output devices and modalities and to the varying accuracy of positional information according to the technical resources available in the current situation. In addition, cognitive limitations of the user put constraints on the system's presentation and interaction possibilities. In this article we will discuss the major concepts that help to design and implement navigation systems that are sensitive to technical limitations on the one hand and to cognitive constraints of the user on the other hand. We will start by explaining our view on resource-adapted navigation and discussing the various technical and cognitive resources that have to be taken into account. An overview of the relevant literature will provide the basis for the introduction of our navigation system REAL and the adaptation to different limited resources. We will see how REAL adapts graphical way descriptions on three different levels to resource limitations imposed by the situational context. The article finishes with an example in which REAL provides the framework for a practical implementation of a fair guide navigation system for CeBIT 2002.
2. RESOURCE-ADAPTIVE NAVIGATION
Personal navigation systems that extend beyond today's use in cars will play a major role in the future. Especially pedestrians will profit from the ability to get localized information on mobile devices anywhere and at any time. The generated graphical presentations, ranging from complete route descriptions at a stationary information kiosk to incremental route descriptions given by the mobile devices, are used to communicate content (navigational instructions) between a sender and a receiver (in our case the system and the user). Whenever graphics are presented via a certain medium to a human viewer, two different types of resources play an important role: on the one hand all the technical resources of the electronic navigational aid, and on the other the cognitive resources available to the user.

2.1 Technical Resources
Technical resources cover all types of limitations of the presentation platform. Here, three different subtypes can be identified:
- Quality and precision of the positional information
- Restrictions of the output medium
- Restrictions of the graphics generation process
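To make these resource types concrete before they are discussed in turn, one could describe a presentation platform with a small profile record such as the one below; the field names and example values are illustrative assumptions rather than part of the REAL implementation.

from dataclasses import dataclass

@dataclass
class TechnicalResources:
    # Quality and precision of the positional information
    position_accuracy_m: float   # e.g. roughly 10 m for GPS outdoors, 1 m for infrared cells
    orientation_known: bool      # is the user's heading available?
    speed_known: bool            # is the walking speed available?
    # Restrictions of the output medium (outer and inner scale, cf. Hake, 1982)
    display_width_px: int
    display_height_px: int
    colour_capable: bool
    # Restrictions of the graphics generation process
    cpu_time_budget_ms: int      # time available for generating one presentation
    memory_budget_kb: int

pda = TechnicalResources(10.0, True, False, 240, 320, True, 200, 4096)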
Since several different means of transportation typically have to be combined in order to reach a destination, it must be ensured that the user interface reacts to the user's changing situation, no matter what kind of transportation is used. When a change of the means of transportation is detected, the system has to adapt its interface to the new situational constraints. The essential switch between different positioning technologies, e.g. GPS or GSM/UMTS cells (outdoor) or infrared and Bluetooth (indoor), should ideally remain unnoticed by the user. It is also very important that a mobile pedestrian navigation system takes into account the precision of the location measurements, because the sizes of radio cells may vary, GPS satellite visibility depends on the actual location, and the distribution of indoor infrared senders is often not uniform. Our notion of positional information includes not only the absolute or relative location of the user in the actual environment, but also the user's orientation and speed. The knowledge about the quality and precision of positional information can be used to adapt the graphical navigational instructions, as described later in the article. In our case the output medium is the system's visual interface. Typical output media are printed paper and computer displays, but also 3D displays, such as stereo glasses. All the different media are restricted by an outer and an inner scale (Hake, 1982). These terms from cartography describe the maximum size of the graphics displayed, their resolution and thus the amount of detail that will be visible. Both factors influence the graphical presentations. For example, it is not possible to reduce the size of a graphic without losing details, so if the outer scale of the display is limited, it might be better to enlarge some small, but important details, while omitting others which are not so relevant for the navigational task at hand. In our case this implies that we
have to adapt our graphical presentations to many devices with different display sizes or resolutions. Another resource which might be limited is the capability of the devices to represent graphical navigational instructions in colour. Colours often provide important information about the depicted objects of the environment. In addition colours could help to draw the observers’ attention to important objects for the navigational task or to focus on certain details of the map. And last but not least, the graphics generation process itself needs resources, especially cpu-time and memory. These restricted resources are often critical in real-time applications. Even more difficult than the adaptation to technical constraints is the adaptation to the cognitive limitations of the user for a given situation. 2.2 Cognitive Resources
One of the most important cognitive resources of the user is the working memory. Unfortunately, there exists no comprehensive theory of how tasks are solved in our brains, and different models have been proposed for the architecture or structure of working memory (cf. Baddeley, 1968; Logie, 1995). In our work we simplify the limitations connected to the users' working memory to three classes:
- Time restrictions
- Domain knowledge
- Familiarity with the presentation type
Time restrictions can be further divided into two different types. The viewing time is the time the addressee is given to look at the graphical presentation. In contrast to that, the decoding time describes the amount of time needed by the user to decode the presentation and to understand its meaning. Both times may be very restricted, especially when the user is heavily stressed or the navigational task is not the primary goal of the user. Parallel tasks such as carrying heavy luggage or answering a phone call while walking put additional pressure on the user, which has to be compensated for by the navigational aid. In this case complicated presentations that must be decoded under time pressure often lead to vicious circles, where the sense of not understanding the presentation results in stress, which in turn reduces the ability to decode the presentation, and so on. The domain knowledge of the user about the environment is represented in a user model, which stores information about known and preferred routes. This enables the system to use different strategies to optimise routes, e.g., it minimises the number of turning points, thereby reducing the number of possibilities to take wrong turns. And last but not least, the user might be familiar with some types of graphical presentations used in navigation tasks, e.g., ordinary street maps use certain coding conventions to convey their information and rely on the ability of the user to follow these conventions and to read maps.
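Purely as an illustration of how such cognitive limitations could be taken into account, the sketch below picks the richest presentation variant whose estimated decoding time still fits the viewing time the user can spare; the variants, time estimates and the familiarity discount are invented for the example and are not values used by the system.

# Hypothetical presentation variants, ordered from most to least detailed,
# with assumed decoding times in seconds.
VARIANTS = [('3d_walkthrough', 12.0), ('annotated_map', 6.0), ('arrow_and_label', 2.0)]

def choose_presentation(viewing_time_s, familiar_with_maps):
    """Return the most detailed variant the user can be expected to decode in time."""
    for name, decoding_time in VARIANTS:
        # Experienced map readers are assumed to decode map-like presentations faster.
        effective = decoding_time * (0.8 if familiar_with_maps else 1.0)
        if effective <= viewing_time_s:
            return name
    return 'arrow_and_label'   # under severe time pressure, fall back to the simplest form

print(choose_presentation(viewing_time_s=5.0, familiar_with_maps=True))   # annotated_map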
2.3 The navigational task
In order to generate graphical presentations a navigational aid has to perform at least the following steps: determination of the information to present, determination of the presentation structure, and the graphical realization of the actual presentation. In addition, the information should be presented in a way that takes the maximum benefit of particular strengths of each presentation medium, and takes into account the actual context of the user as well as the current environmental context. We define navigation as the coordinated, goal-oriented following of a route through a spatial environment, where the navigational task consists of two parts: wayfinding and movement. In our notion the term wayfinding includes the planning and decision making processes involved in the determination of a route the navigator has to follow from a start to goal location. Such a route will be divided into segments and each segment consists of at least four parts, belonging to different categories: starting point, reorientation/orientation, path progression and ending point. Since routes are continuous the ending point of a segment serves as starting point for the next. The segmentation of a route is based on changes in direction at decision points. Here we have to communicate changes in direction clearly. Such a segment of a route represents a partial view of the environment by integrating different kinds of knowledge, e.g. knowledge about landmarks. In navigation tasks landmarks are used to identify decision points, destination points or provide orientation cues. They are the memorable cues that are selected along a route and also enable the navigator to encode spatial relations between objects along the route and the route itself, enhancing the development of a cognitive map of a region. A landmark may be any element in the environment that is external to the observer and that serves to define the location of other objects, locations or the actual position of the navigator in the environment (cf. Sorrows & Hirtle, 1999). In our case we use landmarks in graphical navigational instructions to help the user to locate herself in the current environment. In the following section we will give a brief overview about some existing mobile navigation systems. 3. MOBILE NAVIGATION SYSTEMS One of the first systems that used location aware information to help tourists was the CYBERGUIDE-system (cf. Long, Kooper, Abowd & Atkeson, 1996). Both an indoor and an outdoor navigation component were designed to assist tourists during their visits. The indoor component relied on infrared beacons broadcasting a unique ID that was used to display an arrow on a map whenever the user entered a new room. Additionally the user's orientation was estimated from her actual walking direction and the topology of the building. The outdoor system instead used GPS to determine the user's position and to display it on a map. Both systems operated independently from each other and could not be combined. The MARS system is an augmented reality system that provides information for the buildings on the Columbia University Campus (cf. Höllerer, Feiner, Terauchi, Rashid & Hallaway, 1999). More recently an additional indoor component has been developed that
assists the user also in indoor navigational tasks (cf. Höllerer, Hallaway, Tienna & Feiner, 2001). The GUIDE-system is a location aware multimedia tourist guide for the City of Lancaster (cf. Cheverst, Davies, Mitchell, Friday & Efstratiou, 2000). Based on a radio cell infrastructure information is provided that is tailored to the personal preferences of the user, but also to her actual contextual situation. The DEEP MAP project (carried out at the European Media Lab) focuses on a mobile tourist information guide that brings together results from natural language and intelligent graphics generation (cf. Malaka & Zipf, 2000). This allows a multimodal user interface to offer the user a variety of information on the city of Heidelberg. Since GPS is used to determine the position DEEP MAP is at the moment restricted to outdoor scenarios. The MOBIS system is an electronic guide based on a PDA that provides information on the exhibits to a visitor of a museum (cf. Encarnacao & Kirste, 2000). The PDA receives its position from infrared beacons distributed in the environment and uses this position as a pointer to a specific content that is stored in a database on the PDA. A similar approach stands behind the HIPS system (cf. Not, Petrelli, Sarini, Stock, Strapparava & Zancanaro, 1998). In contrast to MOBIS a radio back channel is used for downloading multimedia content. HIPS uses more powerful mobile devices (sub-notebooks) which are heavier but allow the play back of much better animations and sound files. HIPS does not only take into account the absolute position of the user but also the relative distances to objects in the exhibition. The NEXUS system aims to provide a general framework for mobile and location aware computing (cf. Hohl, Kubach, Leornardi, Rothermel & Schwehm, 1999). The concept of an augmented world is used to store information that may be relevant to a user at a certain location. This model forms the basis for so called virtual information towers that connect information objects (i.e. from the WWW) to real places. Nearly all the discussed systems are designed for certain computational and communicational platforms. Either enough computational power has to be available on the mobile device (DEEP MAP) or the information has to be stored locally in a huge database (MOBIS) or can be accessed over an network infrastructure (NEXUS, HIPS). This prevents the distribution of computational tasks into the environment even if the environment is instrumented with enough computational power. Another drawback of the aforementioned systems is their inability to adapt to the restricted cognitive resources of the user. Especially the use of mobile devices implies the need for adaptive presentation generation and interaction, given the fact that the user might be distracted or under time pressure. In addition, none of the discussed systems allows a seamless switch from indoor to outdoor scenarios and vice versa.
4. AN ARCHITECTURE FOR RESOURCE-ADAPTED NAVIGATION
4.1. Resource and location sensitivity
We distinguish three different classes of resource sensitive processes: (a) resource adapted processes, (b) resource adaptive processes and (c) resource adapting processes (cf. Wahlster & Tack, 1997). Resource adapted processes have been optimized in advance for well-known and stable restricted resources. The quality of their results remains constant for a given input. In contrast, resource adaptive and resource adapting processes can handle varying resource restrictions. Therefore their results depend on the resources available during runtime. Resource adaptive processes use exactly one strategy that can react to varying input resources, whereas resource adapting processes switch their strategies on a meta-cognitive level to comply with different resource situations. This means that under certain conditions several strategies are used to solve a given problem. We distinguish two different kinds of location sensitivity for mobile devices:
1. Active location sensitivity: the mobile device actively detects the actual location on its own. Based on that information the presentation is generated automatically to comply with the user's requests (e.g. a GPS-based car navigation system).
2. Passive location sensitivity: the mobile device just passively presents the information that it receives from senders in its local environment (e.g. location-based information on traffic status via Radio Data Service).
Active location sensitivity puts most of the computational burden on the mobile device. Given the fact that its memory and CPU performance are limited, the adaptation process is even more difficult. In contrast to that, passive location sensitivity shifts most of the computations to the instrumented environment, and the mobile device reacts only to the changing situations by presenting the localized information. In other words, one could say that the intelligence needed for a process that uses active location sensitivity has to be in the mobile device, whereas processes that rely on passive location sensitivity use computational resources of the environment. Therefore it is reasonable to use passive location sensitivity in scenarios with high user frequencies and complex adaptation services, and active location sensitivity in scenarios with fewer users and limited adaptation services.

4.2. The resource sensitive navigation system REAL
REAL uses both resource adaptive and resource adapting processes to generate graphical way descriptions. At the same time REAL combines active and passive location sensitivity. The system has been designed in a way that the switch between both adaptation paradigms is barely noticeable by the user. REAL integrates two subsystems: the system IRREAL (Infrared REAL) for indoor navigation tasks based on infrared senders and the system ARREAL (Augmented Reality REAL) for outdoor navigation based on GPS-satellites. IRREAL uses passive and ARREAL
uses active location sensitivity. When comparing the IRREAL system with the MOBIS system (Encarnacao & Kirste, 2000) it is worth highlighting that the same underlying technology (both systems use strong infrared senders in the environment and infrared receivers on the mobile devices) can be used for the same task (e.g. a fair navigation guide) with active location sensitivity (MOBIS) and passive location sensitivity (IRREAL). The overall system architecture is presented in figure 1. The input level transforms a specific request for a way description under consideration of the cognitive and technical limited resources into a user specific request. This request is then passed to the resource adaptive way finding module. With the help of domain knowledge and a user model the optimal route is determined and passed on to the presentation planning module. This module optimizes the presentation of the way description not only according to the resolution, screen size and colour capabilities of the output device but also to quality of the given sensor information (i.e. precision of location, orientation and speed of the user). The planning process consists of a media independent and a media dependent part. The media independent component relies on a 3D-model of the environment and domain knowledge to first choose appropriate parts of the 3D-model for visualization according to the range of the infrared senders (cf. Butz, Baus, Krüger & Lohse, 2001). The component then looks at spatial relations that can be used to express the way description in the specific situation based on information gained from experimental studies (cf. Zimmer, Speiser & Baus, 2001). This information is passed to the media dependent planning component, which uses design presentation strategies to adapt the way description to different output media. This approach allows the generation for different output medias by a single centralized server. At the moment the system supports the generation for a large screen (e.g. a Smartboard), for a mobile device (PDA) and for a small monitor that can be clipped to standard glasses. This classifies the graphics generation module of REAL as resource adaptive process. A great part of the cognitive resource limitations of users can be covered since REAL informs the user not only during the navigation task, but also in advance at a stationary information kiosk. This kiosk is able to provide sophisticated adaptation services (e.g. by the use of 3D-graphics, automatically planned camera paths and ego-motion). Both the stationary and the mobile part of the REAL system form a hybrid resource adaptive navigational aid, that provides consistent way descriptions over different output media (cf. Krüger, Baus & Butz 2001).
Figure 1: The system architecture of the resource sensitive navigation system REAL
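Read as a processing pipeline, the architecture of Figure 1 could be wired together roughly as in the following sketch. The function names and the toy data they pass around are assumptions made for illustration; the actual REAL modules are considerably richer.

def contextualise(request, sensors, device, user):
    """Input level: attach situational context and resource limits to the request."""
    return dict(request, sensors=sensors, device=device, user=user)

def find_route(user_request, way_net):
    """Resource-adaptive wayfinding (section 5.1); here only a placeholder path."""
    return [user_request['start'], user_request['goal']]

def plan_presentation(route, sensors):
    """Media-independent planning: decide what to show for each route segment."""
    return [{'segment': (a, b), 'landmarks': []} for a, b in zip(route, route[1:])]

def render(plan, device):
    """Media-dependent planning and rendering: 3D animation, sketch or map."""
    style = '3d_animation' if device.get('large_display') else 'sketch'
    return [(step['segment'], style) for step in plan]

def generate_way_description(request, sensors, device, user, way_net):
    user_request = contextualise(request, sensors, device, user)
    route = find_route(user_request, way_net)
    plan = plan_presentation(route, sensors)
    return render(plan, device)

print(generate_way_description({'start': 'entrance', 'goal': 'gate_b42'},
                               sensors={}, device={'large_display': False},
                               user={}, way_net={}))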
5. PERSONALISED NAVIGATION

This section will explain in more detail the major components of the proposed resource adaptive and adapting architecture for a pedestrian navigation system. The architecture in figure 1 consists of three important levels that have to be sensitive to the available resources: wayfinding, planning the presentation and finally rendering the presentation.
Figure 2: The thicker line represents the shortest path between two points of one of our working scenarios: Frankfurt Airport’s Terminal 1.
5.1. Wayfinding
The first step that a navigation system has to perform is the task of finding an appropriate path in the environment. Current realisations of navigation systems (e.g. for in-car use) usually try to find the shortest path. Certain flexibility is provided by most systems by taking into account the travelling speed for different types of roads and vehicles. The latest generation of car navigation systems is even able to avoid traffic jams by including real-time traffic information in the computation of the optimal route. Navigation systems for pedestrians have to go one step further. The context of a pedestrian is subject to high variation. Pedestrians can change their walking direction suddenly, enter buildings and climb stairs, cross open spaces and use public means of transportation. All this has to be taken into account when computing the shortest path from the current position to a target location.
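A minimal way-net representation and the standard shortest-path search that the following subsections then modify might look as follows; the graph encoding and node names are illustrative assumptions.

import heapq

def shortest_path(way_net, start, goal):
    """Plain Dijkstra search over a way net given as {node: [(neighbour, metres), ...]}."""
    queue = [(0.0, start, [start])]
    visited = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for neighbour, distance in way_net.get(node, []):
            if neighbour not in visited:
                heapq.heappush(queue, (cost + distance, neighbour, path + [neighbour]))
    return float('inf'), []

# Toy way net: from junction 'a' to gate 'd', either via 'b' or via 'c'.
way_net = {'a': [('b', 30.0), ('c', 50.0)], 'b': [('d', 40.0)], 'c': [('d', 15.0)], 'd': []}
print(shortest_path(way_net, 'a', 'd'))   # (65.0, ['a', 'c', 'd'])

The cognitive criteria of sections 5.1.1 to 5.1.3 can then be expressed as modified edge costs fed into exactly this kind of search.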
For this purpose our approach is based on the well-known Dijkstra-Algorithm with some significant changes that make the algorithm resource-adaptive (in the sense of the definition in section 4.1). Figure 2 displays the results of the usual shortest-path Dijkstra-Algorithm for the path network of one of our scenarios (Frankfurt Airport). Three additional constraints were found that help to enhance this algorithm to take into account the cognitive resources of the user: the minimisation of turning points, the minimisation of decision points and the maximisation of the familiarity with the environment. There is a trade-off between the length of the path and the other optimisation criteria. Section 5.1.4 will provide a brief empirical study that investigates this trade-off more closely.

5.1.1. Minimisation of turning points
It is well known from the relevant psychological literature that users remember way directions best if they contain only a few decision points. A decision point in this sense is a node in the way net where the user has to change her walking direction in order to follow the right path. In figure 2 the path computed by the algorithm contains five turning points. The user has to memorize all five turning points in order to be successful. If the goal is to minimize the cognitive load, this cannot be considered the optimal solution. When looking at figure 3 we can see that there exists a path that has fewer turning points (namely three instead of five). This is achieved by adding additional costs to nodes that require a turning point: the Dijkstra-Algorithm will find this solution if additional costs are added during the system run for each turning point.

5.1.2. Minimisation of complex decision points
Another way of reducing the cognitive load is to keep the overall complexity of decision points small. The complexity of a decision point can simply be defined by the number of crossing roads. For a more elaborate definition see Raubal (2001). Here the hypothesis is that the fewer the possible choices, the easier it is to remember and then to take the correct decision. Again, the modifications to the Dijkstra-Algorithm consist of additional costs for nodes with higher complexity.
Figure 3: The thicker line represents a path that has the same length as the one in figure 2 while requiring fewer turns. This makes it easier for the user to remember the way description.
5.1.3. Maximisation of spatial familiarity
The third way of adapting way directions to the cognitive resources of the user is to maximise the number of path segments the user is familiar with for a given navigation task. This is only feasible if the user has at least some experience with the location. For example, let us assume that the user has already visited Frankfurt Airport a couple of times before and therefore already knows how to move from the check-in area to the area with the airport shops. When determining the way directions from the check-in area to a new location (e.g. the departure gate) the system can make use of the spatial familiarity by guiding the user via the known shopping area. Of course, this might result in an overall path that is longer than the direct connection between check-in and gate. However, the length of the unknown path (i.e. the path from the shopping area to the gate) might be much shorter and therefore significantly reduce the cognitive effort of remembering the right path. Graph nodes that represent places known to the user are rewarded with a cost reduction, increasing the chance that the Dijkstra-Algorithm includes this area in the final way direction.
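A sketch of how the three criteria could be folded into the costs used by such a search is given below; the weights and the familiarity discount are placeholders for the empirically calibrated values discussed in section 5.1.4.

def edge_cost(distance_m, is_turn, junction_degree, node_is_familiar,
              w_turn=20.0, w_complexity=5.0, familiarity_discount=0.5):
    """Cost of one way-net segment under the three cognitive criteria.

    distance_m       -- metric length of the segment
    is_turn          -- True if following the segment requires a change of direction
    junction_degree  -- number of crossing paths at the decision point
    node_is_familiar -- True if the user model marks the decision point as known
    """
    cost = distance_m
    if is_turn:
        cost += w_turn                                     # penalise turning points
    cost += w_complexity * max(junction_degree - 2, 0)     # penalise complex decision points
    if node_is_familiar:
        cost *= familiarity_discount                       # reward familiar parts of the way net
    return cost

print(edge_cost(distance_m=25.0, is_turn=True, junction_degree=3, node_is_familiar=True))  # 25.0

Note that deciding whether a segment constitutes a turn requires comparing the incoming and outgoing direction, so in practice the search state has to include the previously traversed segment; this detail is omitted from the sketch.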
Figure 4: Trade-off between path length and reduction of turning points for the Frankfurt Airport scenario, Terminal 1
5.1.4. Empirical investigation of the trade-off between path length and cognitive criteria
As described in the foregoing sections, the integration of all three cognitive optimization criteria is realised with a modification of the costs that are used by the Dijkstra algorithm to determine the path with the lowest overall cost. Theoretically, a clever choice of the costs allows all three criteria to be considered simultaneously. Unfortunately, from a practical perspective it is not clear which costs to choose in a given situation. In order to get an idea of the trade-offs between the cognitive criteria and the path length, empirical investigations were carried out. For this purpose we used the way net of figure 2 to calculate more than 1000 paths for randomly chosen start and target locations. Afterwards, for each of those paths we looked at 10 different weights that controlled a certain cognitive criterion. The diagram in figure 4 shows the mean effect of the decrease in turning points on the increase in overall path length. It is observable that there is only a slight chance of pushing the reduction of turning points for this scenario over 20 percent. As a consequence, increasing the weight further would not make much sense. Similar investigations were made for the other criteria, providing us with good heuristics for the appropriate choice of weights in a given situation. Although this information is domain dependent and therefore requires a separate analysis for each new domain, it can be obtained offline and only has to be recalculated when the way net changes.
5.2. Planning way presentations
In order to generate structured graphical descriptions for the way directions we extend an efficient hierarchical planning approach presented in Butz (1997) for the generation of 3D animations. The main assumption here is that all generated graphical presentations can be structured in the form of a tree describing parts and subparts of the graphics. Each part or subpart corresponds to a node in the tree. Nodes are either terminal nodes, in which case they describe portions of the graphics that will be realized by one of the graphics realization techniques described in the section about graphical techniques, or they are non-terminal nodes. In this case they specify a set of sub-nodes and a logical, spatial or temporal interrelationship between them.

Temporal interrelationships only apply to temporal media and include the concepts parallel, sequential and incremental. An example of temporally parallel subparts of a graphics is a camera motion and an object motion taking place over the same time span of a 3D animation. A sequential interrelationship describes a sequence of subparts taking place in a temporal order, i.e. one after the other. Specifying the subparts as having an incremental relationship means that after a sub-tree is fully expanded, this sub-tree can be forwarded to the graphics realization component, which is not the case with every sub-tree in a sequential list. The specification of incrementally ordered subsequences allows the graphics realization process to start its work before the structure of the graphics is fully generated, and thus greatly reduces the perceived delay from the start of the whole graphics generation process to the moment the first graphical element is shown.

Logical interrelationships include the concepts of alternative, conditional and additional sub-trees in the structure of a graphical presentation. Both alternative and conditional sub-trees specify a list of possibilities for the realization of a certain part of the graphics. Conditional expansion selects one alternative from this list at planning time. In this way we can, for example, specify that a certain type of graphical presentation can be generated either as a line drawing or as a 3D image of the scene, depending on media restrictions. Specifying these sub-trees as alternative postpones the decision until presentation time. The strategy here is to first expand the structurally simplest or computationally cheapest part of the tree and then – if time permits – to proceed with more complex alternatives that might be visually more appealing or clearer in the communicative sense. The resulting structure graph contains all of the various alternatives (unless planning was stopped early due to temporal restrictions) and leaves the decision of which alternative is chosen to either the presentation process or even the user. An example of an additional relationship between sub-trees of the structure tree is the labeling of a line drawing or the creation of additional viewports for an already running 3D animation. This is based on the assumption that the first sub-tree already yields a 'working' graphical presentation, while the following sub-trees contain presentation elements that will enhance the overall quality of the graphics.

Obviously, the different kinds of interrelationships within the structure tree of a graphical presentation leave room for various strategies for adapting the generation process to limited resources, either in the output medium or in the generation process.
Figure 5: A simplified view of a plan tree for adaptive way descriptions
Figure 5 shows a simplified part of a structure tree for a graphical way description. At the root of the tree we see that the way description can be presented in the form of a sketch or, alternatively, a 3D walkthrough. One of these alternatives can be chosen at presentation time based on circumstances, media restrictions or user preferences. In the case of a 3D walkthrough, the ground plan and the actual path have to be shown. Specifying these two actions as having an incremental interrelationship implies that the ground plan can already be drawn by the realization component while planning for the path visualization continues. Had we specified the two actions as being parallel or sequential, planning would have to be finished before the realization starts. The visualization of the path itself consists of showing its starting and ending point and the trajectory in between. Showing the trajectory in turn is nothing other than an incremental loop over all of the trajectory's segments (see the section about route descriptions), drawing an arrow for each of them. Optionally, after drawing each segment of the trajectory, the corresponding decision point can be shown. Depending on media restrictions, this visualization can consist of a thumbnail image from the 3D world at this particular point, a text label, or nothing at all.
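As an illustration, the Java sketch below shows one possible encoding of such a structure tree and builds a simplified version of the tree in Figure 5; the class and relation names are our own and are not taken from Butz (1997).

```java
import java.util.*;

/** Sketch of a presentation structure tree as described in section 5.2 (names are illustrative). */
public class PlanTree {

    /** Interrelationships between the sub-parts of a non-terminal node. */
    enum Relation { PARALLEL, SEQUENTIAL, INCREMENTAL, ALTERNATIVE, CONDITIONAL, ADDITIONAL }

    static class PlanNode {
        final String label;            // what this (sub)graphics shows
        final Relation relation;       // null for terminal nodes
        final List<PlanNode> children = new ArrayList<>();

        PlanNode(String label) { this(label, null); }                 // terminal node
        PlanNode(String label, Relation relation) { this.label = label; this.relation = relation; }
        PlanNode add(PlanNode child) { children.add(child); return this; }
        boolean isTerminal() { return relation == null; }
    }

    /** Builds a simplified version of the way-description tree of Figure 5. */
    static PlanNode wayDescription(List<String> segments) {
        PlanNode trajectory = new PlanNode("show trajectory", Relation.INCREMENTAL);
        for (String seg : segments) {
            trajectory.add(new PlanNode("draw arrow for " + seg));
            trajectory.add(new PlanNode("optionally show decision point after " + seg, Relation.ADDITIONAL)
                    .add(new PlanNode("thumbnail, text label or nothing")));
        }
        PlanNode path = new PlanNode("show path", Relation.SEQUENTIAL)
                .add(new PlanNode("show starting point"))
                .add(trajectory)
                .add(new PlanNode("show destination point"));
        PlanNode walkthrough = new PlanNode("3D walkthrough", Relation.INCREMENTAL)
                .add(new PlanNode("draw ground plan"))   // can be realized while path planning continues
                .add(path);
        return new PlanNode("way description", Relation.ALTERNATIVE)  // decided at presentation time
                .add(new PlanNode("2D sketch"))
                .add(walkthrough);
    }

    public static void main(String[] args) {
        print(wayDescription(List.of("segment 1", "segment 2")), 0);
    }

    static void print(PlanNode n, int depth) {
        System.out.println("  ".repeat(depth) + n.label + (n.isTerminal() ? "" : " [" + n.relation + "]"));
        n.children.forEach(c -> print(c, depth + 1));
    }
}
```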
5.3. Automated rendering of way descriptions for mobile devices

The hybrid navigation system REAL combines several static and mobile components to provide adaptive way instructions. As described in section 4, this includes an information kiosk with a large, high-resolution display as well as a mobile handheld device. The planning formalism discussed in section 5.2 ensures that the generated graphics for both device classes are coherent and contain cross references. The graphics generation process relies on a 3D model of the environment. If the user is standing at the information kiosk, this approach allows for highly sophisticated rendering of 3D ego-perspective views, including a virtual presenter who plays the role of a tour guide explaining the environment (for more details see Baus, Krüger & Wahlster, 2002). Unfortunately, those graphical techniques cannot be put to use on the mobile device. Limited computational power and small displays with low resolution call for more abstract graphics and information. Therefore we describe in the following the steps necessary to generate images of way directions in a building that are adapted to the PDA display capabilities in terms of size and resolution, using the original 3D model from the information booth knowledge base as the source. Since we rely on infrared beacons to determine the actual position of the user, it is necessary to generate an image for each beacon in the environment. This graphics generation process is divided into separate background and foreground generation steps.

5.3.1. Background generation
We derive the background drawings directly from the REAL 3D knowledge base (see section 4), which consists of a polygon model of the building. In addition, all beacon positions and orientations are represented. By modifying the camera parameters we are able to achieve graphical results that are similar to floor plans and therefore very suitable for our purpose.
Figure 6: The original model (a) is viewed as a parallel projection (b). The sketch renderer generates a floor plan in black and white (c). Setting the near and far clipping planes uncovers doors (d).
In the first step, the clipping area and orientation of the image are derived from the position and direction of the beacon. Instead of the usual perspective projection, we use a parallel projection. The camera is placed in the centre of the map rectangle, where it is aligned along the Y-axis to show a top-down view and then turned around the Y-axis to match the orientation of the beacon. This leads to an image that is oriented along the user's walking direction. Since in a parallel projection the distance of the camera to the model has no effect on the clipped area, the model is scaled to fit the desired extent. As a rule of thumb we always choose a scale that allows the user to overlook the way until she reaches the next beacon, which triggers a new position.
Figure 7: Generation of (foreground) route depictions: the background (a) is overlaid with the foreground annotations (b) and displayed on the mobile device. Image (c) gives an example of the stationary booth visualisation at the same location.
The background drawing has to show doors and should hide small steps, which could otherwise be erroneously interpreted as walls. A cut through the model at a height somewhere between the knee and the head of typical users yields the desired result. This can be achieved by simply setting the near and far clipping planes of the projection accordingly. A simple sketch renderer (see Butz, Baus, Krüger & Lohse, 2001) is applied to the geometry model to obtain a sketched image of the background. Additionally, the lighting model is changed to flat shading and all materials are set to plain black and white. The image is rendered by the graphics subsystem (here Java 3D) into a hidden frame buffer with the exact PDA display resolution and then exported as a bitmap. The result is a low-resolution sketch of the ground plan in black and white.
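A minimal sketch of how the camera parameters for the background image could be derived from a beacon pose is given below, in plain Java arithmetic; the actual system sets the corresponding values on its Java 3D view, and the heights, scale convention and method names here are assumptions for illustration only.

```java
/** Sketch of deriving parallel-projection camera parameters from a beacon pose (illustrative only). */
public class BackgroundCamera {

    record Beacon(double x, double z, double headingDeg) {}   // position on the floor and walking direction

    record CameraSetup(double centreX, double centreZ, double yawDeg,
                       double scale, double nearClip, double farClip) {}

    /**
     * @param mapExtentMeters size of the area the map should cover, chosen so the user can
     *                        overlook the way up to the next beacon
     * @param viewportMeters  extent covered by the parallel projection at scale 1.0 (assumed)
     * @param cameraHeight    height of the top-down camera above the floor, in metres
     */
    static CameraSetup forBeacon(Beacon b, double mapExtentMeters, double viewportMeters,
                                 double cameraHeight) {
        double yaw = b.headingDeg();                      // turn around the Y-axis to match the beacon
        double scale = viewportMeters / mapExtentMeters;  // camera distance is irrelevant in a parallel projection
        // Cut the model at a height between head and knee so that doors stay open in the drawing
        // while small steps on the floor are not rendered as walls.
        double nearClip = cameraHeight - 1.8;             // just below typical head height
        double farClip = cameraHeight - 0.5;              // just above typical knee height
        return new CameraSetup(b.x(), b.z(), yaw, scale, nearClip, farClip);
    }

    public static void main(String[] args) {
        System.out.println(forBeacon(new Beacon(12.0, 3.5, 90.0), 20.0, 10.0, 30.0));
    }
}
```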
5.3.2. Foreground Generation
Figure 8: The image sequence A to F contains route depictions, which the user receives via IR on her PDA. The images G to I are shown if the user deviates from the route; they display the message "Bitte kehren Sie auf den Pfad zurueck" ("Please return to the path") and lead her back onto the right way.
Annotations of the foreground are generated and stored as 2D vector elements and superimposed on the background bitmap at runtime. The annotation consists of two parts: arrows that indicate directions, and important landmarks. If located inside the clipping region, the starting point of the route is depicted by a circle and the destination point by a cross. Directional changes and the points where the path segments enter and leave the image are depicted by green arrows. The beacon that corresponds to the current map is indicated by a blue dot for better orientation. The generation of the arrows as vector elements is not trivial, since the path can arbitrarily start or end in the map region and enter, leave or even cross it many times. This problem is resolved by clipping each path segment against the map boundaries. The last step adds landmark information to the map. Displaying text labels on a low-resolution display is hard to accomplish: large fonts may obscure important parts of the picture and small fonts are too difficult to read. Instead, we represent landmarks graphically by an information icon, making use of the PDA touch screen: when tapping on the icon with a stylus or finger, a text label naming the landmark is displayed for a few seconds. For this purpose, all route and landmark coordinates have to be transformed from 3D model space into 2D screen space, so that the vector elements fit exactly into the background bitmap.
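The two geometric operations mentioned above can be sketched as follows: a standard Liang–Barsky clip of a route segment against the map rectangle, and a transformation of map-aligned model coordinates into pixel coordinates of the background bitmap. The coordinate conventions and the scale parameter are assumptions for illustration, not the actual REAL code.

```java
import java.util.Arrays;

/** Sketch of the foreground generation geometry: clip route segments against the map
 *  rectangle and transform the surviving parts into PDA screen coordinates (assumed conventions). */
public class ForegroundGeometry {

    record Point(double x, double y) {}
    record Segment(Point a, Point b) {}

    /** Liang–Barsky clipping of a segment against the axis-aligned map rectangle; null if outside. */
    static Segment clip(Segment s, double xMin, double yMin, double xMax, double yMax) {
        double dx = s.b().x() - s.a().x(), dy = s.b().y() - s.a().y();
        double[] p = { -dx, dx, -dy, dy };
        double[] q = { s.a().x() - xMin, xMax - s.a().x(), s.a().y() - yMin, yMax - s.a().y() };
        double t0 = 0, t1 = 1;
        for (int i = 0; i < 4; i++) {
            if (p[i] == 0) { if (q[i] < 0) return null; }  // parallel to and outside this boundary
            else {
                double t = q[i] / p[i];
                if (p[i] < 0) t0 = Math.max(t0, t); else t1 = Math.min(t1, t);
            }
        }
        if (t0 > t1) return null;                          // segment misses the map rectangle
        return new Segment(new Point(s.a().x() + t0 * dx, s.a().y() + t0 * dy),
                           new Point(s.a().x() + t1 * dx, s.a().y() + t1 * dy));
    }

    /** Map model coordinates (metres, map-aligned) to pixels of the background bitmap. */
    static int[] toScreen(Point p, double xMin, double yMin, double metersPerPixel, int screenHeight) {
        int px = (int) Math.round((p.x() - xMin) / metersPerPixel);
        int py = screenHeight - 1 - (int) Math.round((p.y() - yMin) / metersPerPixel); // y grows upwards in the model
        return new int[] { px, py };
    }

    public static void main(String[] args) {
        Segment s = clip(new Segment(new Point(-5, 2), new Point(25, 8)), 0, 0, 20, 10);
        System.out.println(s + " -> " + Arrays.toString(toScreen(s.a(), 0, 0, 0.1, 100)));
    }
}
```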
5.3.3. A mobile indoor navigation example

The example shown in figure 8 explains how the user is led to a desired destination. In this setup the rooms of our lab have been associated with typical meanings taken from the Frankfurt Airport domain. In the example, the user starts in the VIP lounge. There she interacts with the information booth and chooses the luggage claim as the destination. We assume that the route is presented by an agent in a complete walkthrough animation beforehand. After the presentation, the user starts to move, carrying the PDA. The first map (A) shows a rather large area, since the next IR beacon is relatively far away. The user has to leave the room and turn right, following the arrows. The next beacon is located in the hall, separated by a door. The user will first receive a plain straightforward arrow (B) and then, after a moment, a more detailed route description (C). The next map (D) includes a landmark represented by an "I"-shaped icon. Tapping on the icon will display the name of that landmark. At the destination, the user receives a map (F) telling her that she has reached the baggage claim. Images (G)-(I) are used to indicate wrong turns and to show the user how to find the right path again.

6. THE PERSONAL NAVIGATION SYSTEM REAL AT CEBIT 2002

The hybrid navigation system REAL was tested in several different scenarios, one of which was the world's largest computer fair, CeBIT, in 2002. For this purpose we integrated the interior 3D views of several exhibition halls into the 3D model of the fairground. This allowed users of the REAL system to obtain information about their location not only outdoors, while navigating from hall to hall (see figure 9), but also indoors when looking for a certain booth (see figure 10). Besides the geometrical representation of the fairground, information on the various companies and their booth locations was also modelled. This CeBIT version of REAL is the first system that integrates the indoor and outdoor environment into a common 3D polygon model, containing buildings and their interior. The integrated representation allows for sophisticated camera paths, e.g. when the user enters or leaves the halls. Smooth transitions from the large-scale outdoor view to the smaller-scale indoor view prevent irritation of the user and enhance her spatial awareness.
Figure 9: Outside view of the CeBIT fair ground
The fair guide combines two distinct localization techniques. Outdoors, a GPS receiver module is used to determine the user's position. Indoors, where no satellite data is available, infrared beacons are used. Once the user's position has been determined by one of those beacons, a suitable view of the map is generated using the 3D polygon model and OpenGL rendering. For outdoor views the system relies on a more realistic 3D view of the environment, whereas for indoor navigation a more sketch-like map is generated by means of the techniques described in section 5. The user's position is indicated by a red dot. Target locations are marked with elevated 3D boxes. A green line represents the proposed path. Outdoors the user's movement is shown continuously and in real time; indoors the position is updated in discrete steps depending on the distribution of beacons. Nevertheless, in order to make the visualisation more pleasant, the camera smoothly interpolates between those discrete user positions. Currently the visualization supports two adaptive features. First, the display adapts to the user's speed. When the user stops moving, the camera zooms in and the halls and booths are completely labelled. When the user is walking, the camera starts to zoom out of the scene and only halls and booths close to the user are labelled. The second feature is the visualization of the GPS signal accuracy by the radius of the user's standpoint marker. A small marker indicates a precise position, a larger marker an imprecise position. The CeBIT fair guide was implemented on a Xybernaut wearable computer running the Linux operating system.
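The two adaptive features can be summarised in a few lines of Java; the thresholds and formulas below are illustrative guesses, not the values used in the CeBIT guide.

```java
/** Sketch of the two adaptive map features of the fair guide (thresholds are illustrative). */
public class AdaptiveView {

    /** Zoom level: zoom in and show all labels when the user stands still, zoom out while walking. */
    static double zoomFor(double speedMetersPerSec) {
        return speedMetersPerSec < 0.2 ? 1.0 : Math.min(3.0, 1.0 + speedMetersPerSec);
    }

    /** The radius of the standpoint marker grows with the estimated GPS error. */
    static double markerRadiusPixels(double gpsErrorMeters, double pixelsPerMeter) {
        return Math.max(4.0, gpsErrorMeters * pixelsPerMeter);
    }

    public static void main(String[] args) {
        System.out.println("zoom: " + zoomFor(1.2) + ", marker radius: " + markerRadiusPixels(8, 2));
    }
}
```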
Figure 10: Indoor view of an exhibition hall at CeBIT 2002
7. SUMMARY AND OUTLOOK

In this article we have focussed on the major concepts that help to design and implement a resource-sensitive navigation system that adapts not only to various technical restrictions of the output media (e.g. a small screen or poor sensor quality), but also to the cognitive limitations of the user (e.g. time pressure). We have shown that resource-adaptivity has to be included on three levels of the navigation system architecture: the wayfinding, presentation planning and rendering levels. Our navigation system REAL has proven to be useful in a range of scenarios, one of which (the CeBIT fair guide) was introduced in this paper. From this point we would like to continue in two directions. On the one hand, we are working on multimodal extensions to REAL, allowing the user to use natural language and gestures to interact with the system. We are preparing an extension of the architecture that allows for speech recognition both on a thin client (i.e. a PDA) and on a server, in order to combine different speech recognition procedures in parallel.
On the other hand, we want to develop personal travel files that contain all relevant information leading to personalised way descriptions. This includes not only known places and routes, but also personal annotations (e.g. voice memos). Appropriate visualisation and editing techniques will bring us one step further towards a more usable personal and spatial episodic memory.

8. REFERENCES

Baddeley, A. D. (1986). Working Memory. Oxford University Press.
Baus, J., Krüger, A. & Wahlster, W. (2002). A resource-adaptive mobile navigation system. In Proceedings of Intelligent User Interfaces 2002. ACM Press.
Butz, A. (1997). Anymation with CATHI. In Proceedings of AAAI/IAAI '97 (p. 957–962). AAAI Press.
Butz, A., Baus, J., Krüger, A. & Lohse, M. (2001). A Hybrid Indoor Navigation System. In IUI2001: International Conference on Intelligent User Interfaces (p. 25–33). ACM Press.
Cheverst, K., Davies, N., Mitchell, K., Friday, A. & Efstratiou, C. (2000). Developing a Context-aware Electronic Tourist Guide: Some Issues and Experiences. In Proceedings of the 2000 Conference on Human Factors in Computing Systems (CHI-00) (p. 17-24). ACM Press.
Encarnacao, J.L. & Kirste, T. (2000). Beyond the desktop: Natural interaction and intelligent assistance for everyday life. In Alltag der Zukunft – Informationstechnik verändert unser Leben (p. 39-57). Schöningh.
Hake, G. (1982). Kartographie 1. Walter de Gruyter.
Hohl, F., Kubach, U., Leonhardi, A., Rothermel, K. & Schwehm, M. (1999). Next century challenges: NEXUS – a global infrastructure for spatial-aware applications. In International Conference on Mobile Computing and Networking (MobiCom '99) (p. 249-255). ACM Press.
Höllerer, T., Feiner, S., Terauchi, T., Rashid, G. & Hallaway, D. (1999). Exploring MARS: Developing indoor and outdoor user interfaces to a mobile augmented reality system. Computers & Graphics, 23(6), 779-785.
Höllerer, T., Hallaway, D., Tinna, N. & Feiner, S. (2001). Steps towards accommodating variable position tracking accuracy in a mobile augmented reality system. In IJCAI-2001 Working Notes: AI in Mobile Systems (AIMS) (p. 31-57). AAAI Press.
Krüger, A., Baus, J. & Butz, A. (2002). Smart Graphics in Adaptive Way Descriptions. In Proceedings of the Working Conference on Advanced Visual Interfaces (AVI 2000) (p. 92-97). ACM Press.
Long, S., Kooper, K., Abowd, G.D. & Atkinson, C. (1996). Rapid Prototyping of Mobile Context-Aware Applications: The Cyberguide Case Study. In Proceedings of the 2nd ACM International Conference on Mobile Computing and Networking (p. 97-107). ACM Press.
Malaka, R. & Zipf, A. (2000). Deep Map – challenging IT research in the framework of a tourist information system. In Information and Communication Technologies in Tourism (p. 15-27). Springer.
Not, E., Petrelli, D., Sarini, M., Stock, O., Strapparava, C. & Zancanaro, M. (1998). Hypernavigation in the physical space: Adapting presentations to the user and situational context. The New Review of Hypermedia and Multimedia, 4, 33-45.
Raubal, M. (2001). Human wayfinding in unfamiliar buildings: a simulation with a cognizing agent. Cognitive Processing, 2(2-3), 363-388.
Sorrows, M.E. & Hirtle, S. (1999). The Nature of Landmarks for Real and Electronic Spaces. In COSIT 99: Spatial Information Theory: Cognitive and Computational Foundations of Geographic Information Science (p. 37-59). Springer.
Wahlster, W. & Tack, W. (1997). SFB 378: Ressourcenadaptive Kognitive Prozesse. In Informatik '97: Informatik als Innovationsmotor (p. 51–57). Springer.
Zimmer, H.D., Speiser, H. & Baus, J. (2001). Die Selektion dimensionaler Präpositionen: automatisch und nicht ressourcenadaptierend. Kognitionswissenschaft, 9(3), 115-131.
9. AFFILIATIONS

J. Baus
A. Krüger
DFKI, Saarbrücken, Germany

C. Stahl
Saarland University, Saarbrücken, Germany
O. STOCK, M. ZANCANARO AND E. NOT
INTELLIGENT INTERACTIVE INFORMATION PRESENTATION FOR CULTURAL TOURISM
1. INTRODUCTION

Since the second half of the Eighties, we have considered cultural heritage appreciation a privileged area of application for innovative, natural-language centred applications. From the application point of view, we believe this is an area of high interest, for the following principal reasons: a) the "users" of cultural heritage increase in number at a fast pace; b) there is a natural request for a quality shift: from the presentation of cultural heritage as a standard mass product, similar to supermarket goods, to a way of providing the single person with the possibility of acquiring information and understanding about the things that interest him most, and of assisting his cultural development; c) the way in which the cultural experience is carried on has not changed much for centuries, and especially the young seem to require novel modes of being exposed to the cultural material, modes that engage and entertain them; d) for Italy and the Mediterranean countries cultural heritage can be a natural resource that fuels the economy (Minghetti et al, 2002); e) human-computer interface technology can have a decisive role in providing solutions for the individual.

From the research point of view, in a first phase we considered this an opportunity for exploring ideas related to multimodal interfaces. The AlFresco system integrated language and pointing in input, and language and images in output (Stock et al, 1997). But the main aspect is that it integrated in a coherent way different interaction attitudes: the goal-oriented, language-based modality and the navigation-oriented hypermedia modality. Well before the web era, the AlFresco generalized communication act management approach was perhaps anticipating some of the present challenges of web interaction. Subsequently we began working on information presentation in the physical environment. This brought in a number of new issues and some constraints (see Stock, 2001). Ideas were tested in two projects, HyperAudio (Not et al, 1998) and the European project HIPS (Benelli et al, 1999). We shall present here some new lines of research that we are now carrying on.
2. THE PEACH PROJECT
The objective of the PEACH (Personal Experience with Active Cultural Heritage) project is to study and experiment with various advanced technologies that can enhance cultural heritage appreciation. The project, sponsored by the Trento Autonomous Province, is mainly based on IRST research, with important contributions by the other two partners: DFKI and Giunti Multimedia. The research activity focuses on two technology mainstreams, natural interactivity (encompassing natural language processing, perception, image understanding, intelligent systems etc.) and micro-sensory systems. Throughout the project, synergy and integration of different research sectors will be emphasized. Two general areas of research are highlighted:
- The study of techniques for individual-oriented information presentation: (i) use of formalisms and technologies derived from the field of natural language generation in order to build contextual presentations; (ii) use of speech and gestures as input and audio and animated characters as output; (iii) use of multi-agent architectures to provide suggestions and propose new topics.
- The study of techniques for multi-sensorial analysis and modeling of physical spaces—that is, the use of visual sensors such as video cameras, laser telemetry, infrared sensors, and audio sensors such as an array of microphones and ultrasonic signals for monitoring a dynamic environment, and collecting information about objects and about the environment for accurate virtual reconstruction.

The aim of the project is to significantly increase the quality of cultural heritage appreciation, in such a way as to transform passive objects into active ones that can be manipulated by the observer, thus helping to bridge the gap between our past, which they represent, and our future, of which they are the seeds. Extended Appreciation and (Inter)active Objects are facets of an underlying unifying vision we term "Active Cultural Heritage".

2.1 Extended Appreciation
The traditional modes of cultural heritage appreciation impose numerous limitations that are not always obvious. For instance, in observing a large statue, notwithstanding physical proximity, the observer most likely will be unable to capture details from every angle, as these may be too far from his/her viewpoint. In these cases, direct observation creates limitations that can be overcome with augmented reality, for example by using a palm computer to observe the details of the statue, taken from cameras or reconstructed in a virtual environment. Moreover, access to some objects can be difficult or even impossible for some visitors, such as disabled or elderly people. Creating an accurate virtual representation of the objects would extend enjoyment of the exhibit to these visitors as well. In general, remote appreciation opens interesting possibilities, also for the study of an artefact that, due to its fragile nature, must be kept under restricted conditions and is thus not accessible to everyone. The possibility of interacting with an accurate virtual representation allows non-invasive access to a work of art in the manner, time, and place most appropriate for the visitor. Objects can be manipulated in an innovative, didactic, and fun way, such as by modifying a work of art, partially or in its entirety.

2.2 (Inter)Active Objects
It is particularly important for the individual to be able to "navigate" an independent information course based on individually and dynamically created presentations. One of the aims of the project is to transcend a museum's restrictive environment by transforming a passive object observed by the visitor into an active subject capable of providing new information in a context-sensitive manner, a kind of hyperlink for accessing additional situation-specific information to be presented coherently. Much of the technology for accessing information on the Internet today (for example, adaptive user profiling, information promotion, database browsing, query by example) has a natural place of application in this environment.

3. THE MUSEUM AS A SMART ENVIRONMENT

A system that generates presentations of artworks in a museum must mould itself to the behaviour of a person visiting the museum. On the one hand, the system must facilitate movements within the space by (i) aiding the orientation of the user with appropriate linguistic support such as "to your right you will see…"; (ii) proposing suggestions about the best route for continuing the visit, such as "…along the same lines, the next room contains an interesting…". On the other hand, the system must be able to interpret the implicit intentions of the person's movements. For example, the prolonged observation of one object may be interpreted as a sign of interest. A system of this type will be able to take into consideration the constraints posed by the environment in accessing information (e.g. objects in an adjacent room may be far away if the two rooms are not connected), while emphasising the emotional impact of seeing the "real" work of art. Such a system will also be able to affect the visitor's perception of the environment by attracting his/her attention to a particular work or detail; for instance, by taking advantage of new technology such as the ability to superimpose computer-generated images on the real scene (via special transparent visors), or by generating verbal presentations based on rhetorical and persuasion-oriented strategies. In this way, the museum visit is a full-fledged interaction between the visitor and the museum itself. In order to make this interaction possible, it is necessary that the museum - in fact the underlying information system - (i) knows the physical position of the visitor (and, as much as possible, his focus of visual attention); (ii) communicates individual information on the objects under exhibition—for instance through a portable device, 3D audio, or using a special wearable device that
automatically superimposes generated images on the real scene; and (iii) receives requests from the visitor—verbally and/or through gestures. A museum of this type will not be simply reactive, limiting itself to satisfying the questions of visitors, but will also be proactive, explicitly providing unasked-for information; for instance, suggesting a visit to particularly interesting or famous objects, or allowing access to a "window" (e.g. a flat screen on the wall) that can deepen the study of the object under observation. Such suggestions can be based on observations of the person's behaviour (for example, the route chosen by a visitor or how much time is spent in front of a work), on information noted about the user, such as age and culture, or on considerations relative to the environment, like rooms that are too crowded or temporarily closed. The system should be able to overhear the visitor's interaction (Busetta et al, 2001) and provide further suggestions on the basis of an internal model of priorities (for example, satisfying the visitor's interests, fulfilling educational goals, or, perhaps, increasing the museum bookshop's sales). Another important dimension is that of attracting the young and keeping them hooked on the cultural experience. With children a playful attitude is essential. We are conceiving new technology-based environments, with spoken interaction (see also the NICE project with a similar theme1), where as a side effect children will be motivated to look with attention and learn about the cultural heritage. One of the central aspects is the communication attitude. A humorous interaction is a key resource with children. The role of humour in keeping attention, enhancing the memorization of names and helping creative thinking is well known. We are now beginning to see some concrete results in modeling some processes of humour production. To this end, our initial work in computational humour will find a useful terrain of experimentation here (see Stock and Strapparava, 2002).

4. THE ROLE OF INFORMATION PRESENTATION

According to Bordegoni et al (1997), a medium is a physical space in which perceptible entities are realized. Indeed, in a museum (as well as in a cultural city, an archaeological site, etc.) the most prominent medium is the environment itself. The main requirement for the information presentation task is to integrate the "physical" experience without competing with the original exhibit items for the visitor's attention. From a multimedia point of view, this means that additional uses of the visual channel have to be carefully weighed. In this context, the audio channel should play the major role, in particular for language-based presentations, although the role of non-speech audio (e.g., music or ambient sounds) should also be investigated. Yet when a visual display is available (for example a PDA or a wall-size flat screen), images can be used to support the visitor in the orientation task (3D or 2D images can be used to support linguistic reference to physical objects). In this latter case, the visual channel is shared between the display and the environment, but the goal is still to provide support to environment-related tasks.
From a multimodal point of view, different modalities can be employed to focus the visitor's attention on specific objects or to stimulate interest in other exhibits. For example, the linguistic part of the presentation (through speech audio) can make large use of deictic and cross-modal expressions both with respect to space (such as "here", "in front of you", "on the other side of the wall", etc.) and to time ("as you have seen before", etc.) (Not and Zancanaro, 2000). The peculiarity of the environment as a medium is its staticity: the system cannot directly intervene on this medium (i.e. the system cannot move or hide exhibits nor change the architecture of a room, as in virtual settings, at least without considering technology-based futuristic extensions). Therefore, it may appear that a main limitation of the presentation system is the need to adapt the other media in service of the environment. Yet a multimodal approach can get round the staticity constraint in at least two ways:

a) Dynamically changing the user's perception of the environment: by exploiting augmented reality techniques (for example as described in Feiner, 1997) it is possible to overlay labels or other images on what the visitor is actually seeing. In this way, for example, the system can plan to highlight some relevant exhibits in the environment or shadow other, less relevant ones. 3D audio effects or the selection of characteristic voices or sounds for audio messages can stimulate the user's curiosity and attention (Marti et al., 2001). Yet a similar effect can be obtained by exploiting the power of language, as we did: language-based presentations can be carefully planned to attract the visitor's attention to more important exhibits and shadow less relevant ones. The simplest example: when a visitor enters a room for the first time, she usually receives a general room presentation followed by one that directs her attention to the exhibit the system hypothesizes to be most interesting for her.

b) Changing the user's physical position: the system can induce the user to change her physical position either by a direct suggestion (e.g. "go to the other side of the room, the big fresco you'll see on the wall is La Maestà") or indirectly, for instance by introducing a new topic (e.g., "La Maestà, one of the absolute masterpieces of European Gothic painting, is located on the wall behind you").

Ultimately, the goal of such a system is to support visitors in making their visiting experience meet their own interests; but in some cases a visitor should be encouraged not to miss some particular exhibits (for example, you cannot visit the Louvre for the first time and miss the Mona Lisa). Sometimes this task can be accomplished by direction giving, but there are other ways to promote exhibits: for example, by providing at the beginning of the visit a list of hotspots, or by planning a presentation that, in a coherent way, links the exhibit in sight to other ones through reference to the visitor's interests. More generally, further research is needed towards implementing pedagogically motivated systems with meta-goals to pursue, educational strategies to follow and intentions to satisfy. In this respect, the interaction between the visitor and the system must evolve from simple interaction to full-fledged collaboration (for a discussion of this topic applied to cultural tourism see Stock, 2001).
5. BUILDING ADAPTIVE PRESENTATION FROM EXISTING REPOSITORY

A fundamental prerequisite for using existing information in different discourse and interaction contexts, and for different users, is that the original material is properly organized in a repository of information units that can be flexibly accessed and composed. In the HIPS [2] and HyperAudio [7] projects we experimented with an annotation formalism, called the MacroNode formalism, that allows the dynamic composition of an adaptive hypermedia [8]. In this framework, the available material is fragmented into content units which are annotated with essential semantic and linguistic data, i.e. a description of the main topic of the content unit, the semantic relations existing with other units (in terms of RST rhetorical relations [6]), and the different ways in which the unit can be presented to the user. A content unit is called a macronode and, for textual information, it typically corresponds to a paragraph. More generally, a macronode can correspond to data in different media, for example text, audio, images, or a combination of these.
Figure 1: A sample portion of macronode network displaying the rhetorical links.
The rhetorical links, used to connect macronodes to each other, express which information units can be composed together, and how to compose them to produce a text in the given genre. A rhetorical link specifies the coherence relation existing between two portions of discourse, e.g. whether text span B describes the event that caused state A to occur (CAUSE rhetorical relation), or whether text span B provides background information useful for interpreting the assertion in text span A (BACKGROUND relation). As a starting point for the identification of a set of relevant rhetorical relations, we have taken as a reference the work by Mann and Thompson ([6]). In HIPS, we exploit a set of basic relations considered the most relevant for our application task:
- elaboration: text span B provides more details for what is discussed in text span A; an additional domain-dependent refinement of the elaboration relation (elaboration.authorship, elaboration.direct, elaboration.indirect, elaboration.represented) was introduced to capture different ways of adding details to an artwork description;
- background: text span B provides information useful for interpreting the assertion in text span A;
- comparison: text span B compares the object/concept described in A with another relevant object/concept;
- navigation: text span B provides suggestions about where to go/what to look at next.

Figure 1 shows a sample fragment of a macronode net, where rhetorical links signal the coherence relations between different macronodes. To compose an adaptive hypermedia out of the content units, the macronode approach exploits text planning techniques and linguistic manipulation rules derived from the field of NLG.

5.1 Assuring Coherence
At generation time, the annotated data is processed by a Presentation Composer that uses NLG text planning strategies to determine the most appropriate discourse organization for the macronodes relevant to the current discourse context. The module explicitly maintains a representation of the discourse history and is able to exploit a model of the user's interests and knowledge to best tailor the output. The schema-based text planning phase, which relies on the rhetorical organization of the macronodes in the repository, guarantees that the composed message displays (i) a coherent rhetorical structure, (ii) a topic flow compatible with the discourse and interaction context, and (iii) cross-references to material already presented and to user interests. For example, discourse strategies are encoded to avoid presenting already known information, to choose the kind of information for which the user's interest is highest, and to present new information when the user returns to a previous topic. In addition, rules checking the rhetorical links between macronodes provide control over the length of an elaboration chain, the inclusion of background information to clarify a topic, and so on. All this reasoning, familiar from deep NLG systems, is possible on fragments of existing text thanks to the shallow semantic representation of the macronodes' content and to the macronode rhetorical network.
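The following Java sketch gives the flavour of this kind of content selection over a macronode repository; the class names, the interest threshold and the depth-first expansion strategy are our own simplifications and not the actual Presentation Composer.

```java
import java.util.*;

/** Sketch of macronode selection along rhetorical links (names and thresholds are illustrative). */
public class MacronodePlanner {

    enum Rhetorical { ELABORATION, BACKGROUND, COMPARISON, NAVIGATION }

    record Link(Rhetorical relation, Macronode target) {}

    static class Macronode {
        final String topic;
        final List<Link> links = new ArrayList<>();
        Macronode(String topic) { this.topic = topic; }
    }

    /** Discourse and user state consulted while planning. */
    static class Context {
        Set<String> alreadyPresented = new HashSet<>();
        Map<String, Double> interest = new HashMap<>();  // user model: interest per topic
        double interestThreshold = 0.3;
        int maxElaborationChain = 2;                     // keeps elaboration chains short
    }

    /** Depth-first expansion of a presentation, skipping known content and weak interests. */
    static List<Macronode> plan(Macronode start, Context ctx) {
        List<Macronode> presentation = new ArrayList<>();
        expand(start, ctx, 0, presentation);
        return presentation;
    }

    private static void expand(Macronode m, Context ctx, int elaborationDepth, List<Macronode> out) {
        if (!ctx.alreadyPresented.add(m.topic)) return;           // never repeat known information
        out.add(m);
        for (Link l : m.links) {
            if (ctx.interest.getOrDefault(l.target().topic, 0.0) < ctx.interestThreshold) continue;
            int depth = l.relation() == Rhetorical.ELABORATION ? elaborationDepth + 1 : 0;
            if (depth > ctx.maxElaborationChain) continue;        // control the length of elaboration chains
            expand(l.target(), ctx, depth, out);
        }
    }

    public static void main(String[] args) {
        Macronode fresco = new Macronode("Guidoriccio da Fogliano");
        Macronode author = new Macronode("Simone Martini");
        Macronode school = new Macronode("Sienese school");
        fresco.links.add(new Link(Rhetorical.ELABORATION, author));
        author.links.add(new Link(Rhetorical.BACKGROUND, school));
        Context ctx = new Context();
        ctx.interest.put("Simone Martini", 0.8);
        ctx.interest.put("Sienese school", 0.1);   // too low: this background unit is skipped
        plan(fresco, ctx).forEach(m -> System.out.println(m.topic));
    }
}
```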
5.2 Assuring Cohesion

In order to assure cohesion, each macronode encodes different realizations of the same content. These alternative realizations are stored as a graph. Each node of the graph represents a minimal unit of content, whereas arcs contain the conditions under which the target node can be realized (given that the source node has been realized). Each path in the graph is a possible realization of the content of the macronode.
For example, for a macronode that encodes the content that "The Guidoriccio da Fogliano is a fresco by Simone Martini that dates back to 1328", a possible realization graph is depicted in Figure 2. In this case, a pronoun is chosen if the concept has just been mentioned in the previous discourse. In addition, if the main topic of the overall current presentation is the fresco itself, a picture is also included. The use of a realization graph with linguistically motivated options does not allow the same flexibility that a deep NLG component would provide; yet in many contexts this solution is sufficient to assure cohesion in dynamically composed presentations.
Figure 2: Internal representation of a macronode
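A possible encoding of such a realization graph is sketched below in Java, reproducing the Guidoriccio example; the condition names and the traversal strategy are illustrative assumptions, not the actual MacroNode implementation.

```java
import java.util.*;
import java.util.function.Predicate;

/** Sketch of a macronode realization graph as in Figure 2 (condition names are illustrative). */
public class RealizationGraph {

    /** Discourse context consulted when choosing among alternative realizations. */
    record Context(boolean justMentioned, boolean isMainTopic) {}

    static class RNode {
        final String content;                       // minimal unit of content (text fragment, image, ...)
        final List<Arc> out = new ArrayList<>();
        RNode(String content) { this.content = content; }
    }

    record Arc(Predicate<Context> condition, RNode target) {}

    /** Walks one admissible path through the graph, concatenating the chosen units. */
    static String realize(RNode start, Context ctx) {
        StringBuilder sb = new StringBuilder(start.content);
        RNode current = start;
        while (true) {
            Optional<Arc> next = current.out.stream().filter(a -> a.condition().test(ctx)).findFirst();
            if (next.isEmpty()) return sb.toString().trim();
            current = next.get().target();
            sb.append(" ").append(current.content);
        }
    }

    public static void main(String[] args) {
        RNode rest = new RNode("is a fresco by Simone Martini that dates back to 1328.");
        RNode pronoun = new RNode("It");
        RNode fullName = new RNode("The Guidoriccio da Fogliano");
        RNode picture = new RNode("[include picture of the fresco]");
        pronoun.out.add(new Arc(c -> true, rest));
        fullName.out.add(new Arc(c -> true, rest));
        rest.out.add(new Arc(Context::isMainTopic, picture));        // add the image only for the main topic
        RNode entry = new RNode("");                                  // choose pronoun vs full noun phrase
        entry.out.add(new Arc(Context::justMentioned, pronoun));
        entry.out.add(new Arc(c -> !c.justMentioned(), fullName));
        System.out.println(realize(entry, new Context(false, true)));
    }
}
```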
5.3 Integrating existing and automatically generated text
Although reusing existing repositories of information is definitely valuable in many domains, the ability to intermix existing text and deeply generated text could often significantly improve the effectiveness of an adaptive system. There already exist some systems where this ability has been explored (e.g., ILEX [10]). In many cases, the system provides sophisticated NLG machinery to support generation from scratch (i.e., generation from a knowledge base representing entities and facts related by semantic relations) as well as mechanisms to support the inclusion of flexible templates and stories (paragraph-length pieces of canned text). However, stories usually play a marginal role: no particular support is designed to help authors systematically edit canned texts, nor is any reasoning done on the text content (though some adjustments are possible on the linguistic surface form, for example to increase cohesion by properly introducing anaphora). This might result in limitations for those application domains where much interesting information is already available in the form of narrative text, anecdotes, or other complex text (but also images and videoclips) which would be too difficult or expensive to generate from scratch. Figure 3 shows some sample texts written by a human author describing an archaeological exhibit for different users.
The figure highlights in light color the portions of text that would be difficult or worthless to generate from scratch, but that still require some flexibility in content selection and linguistic realization to get the desired personalization in the final presentation.
Figure 3. A dynamic page generated by M-PIRO
To overcome these limitations, within the M-PIRO project [1] at ITC-irst we have been exploring the possibility of integrating a deep NLG system, like ILEX, with a flexible approach to the composition of preexisting text such as the one supported by the MacroNode architecture, thus achieving a trade-off between information reuse and flexibility. The NLG machinery allows the generation of the most dynamic parts of the presentation - i.e., the selection of the most appropriate content and realization with respect to the current discourse context and user model, the introduction of sentences comparing different objects the user has seen, and the suggestion of other interesting related objects. The adaptive hypermedia machinery allows the content and the phrasing to be enriched, while still allowing for personalization.
6. CINEMATIC TECHNIQUES AT THE INTERFACE
Usually, multimedia guides use static images or pre-recorded short video clips about museum exhibits. The advent of more powerful devices has enabled experimentation with new forms of multimedia, in particular time-based media such as animations. In our recent work we introduce the idea of enhancing the audio presentation of a complex fresco by using the PDA screen to travel throughout the fresco itself and identify details. At presentation time, a sequence of pictures is synchronized with the audio commentary and the transitions among them are planned according to cinematic techniques. Our hypothesis is that the use of this type of animation to present the description of a painting allows the visitor to better identify the details introduced by the audio part of the presentation. In this manner, both the efficiency and the satisfaction dimensions of system usability are increased (Nielsen, 1994), while the visitor is also provided with an enhanced learning experience.
Figure 4. The Infrared positioning at Torre Aquila
We have experimented with this idea in a prototype of a multimedia guide for Torre Aquila, a tower at the Buonconsiglio Castle in Trento where a fresco called "The Cycle of the Months", a masterpiece of the gothic period, can be found. This fresco, painted in the Fifteenth Century, illustrates the activities of aristocrats and peasants throughout the year. The fresco is composed of eleven panels, each representing one month (the month of March, painted on wood, was destroyed by a fire), and occupies the four walls of the room (see figure 4). Our multimedia guide, implemented on a PDA, detects the position of the visitor by means of infrared emitters placed in front of each panel. The initiative of the interaction is shared between the system and the user. When the system detects that the visitor is in front of a wall, a picture of that wall is displayed on the PDA and, after a few seconds, if the user has not changed her position the panel she is facing is
highlighted (see figure 5). Now the visitor can click on the panel and get a multimedia presentation. The multimedia presentation is composed of an audio commentary accompanied by a sequence of images, moved on the PDA display, that helps the visitor quickly identify the fresco's details mentioned in the commentary.
Figure 5. Two snapshots of the mobile multimedia guide in Torre Aquila
The language of cinematography (Metz, 1974), including shot segmentation, camera movements and transition effects, is the basic resource for planning the animation and for synchronizing the visual and verbal parts of the presentation. In building the animations, a set of strategies similar to those used in documentaries was thus employed. Two broad classes of strategies have been identified. The first class encompasses constraints imposed by the grammar of cinematography, while the second deals with conventions normally used in guiding camera movements in the production of documentaries. For instance, a strategy in the first class would discourage a zoom-in immediately followed by a zoom-out, while a strategy in the second class would recommend the use of sequential scene cuts, rather than a fade-out effect, to visually enumerate different characters in a scene. It is worth noting that in the latter strategy it is often necessary to make reference to the discourse structure of the audio part of the presentation, such as enumeration of properties, background knowledge, and elaboration of related information. In order to use discourse structure formally, we employ the Rhetorical Structure Theory (Mann and Thompson, 1987) mentioned earlier.
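A small sketch of how such strategies could be checked over a planned shot sequence is given below; the rule set is reduced to the two examples mentioned in the text and is not the full strategy library used for the Torre Aquila guide.

```java
import java.util.*;

/** Sketch of a cinematic-grammar check on a planned shot sequence (illustrative rule set). */
public class CinematicCheck {

    enum Move { CUT, FADE, ZOOM_IN, ZOOM_OUT, PAN }

    /** Returns the constraint violations found in the planned sequence of camera moves. */
    static List<String> violations(List<Move> moves) {
        List<String> problems = new ArrayList<>();
        for (int i = 1; i < moves.size(); i++) {
            // Grammar constraint: a zoom-in immediately followed by a zoom-out is discouraged.
            if (moves.get(i - 1) == Move.ZOOM_IN && moves.get(i) == Move.ZOOM_OUT) {
                problems.add("zoom-in directly followed by zoom-out at position " + i);
            }
        }
        return problems;
    }

    /** Documentary convention: enumerations of characters in a scene are presented with cuts, not fades. */
    static Move transitionFor(boolean rhetoricalEnumeration) {
        return rhetoricalEnumeration ? Move.CUT : Move.FADE;
    }

    public static void main(String[] args) {
        System.out.println(violations(List.of(Move.PAN, Move.ZOOM_IN, Move.ZOOM_OUT, Move.CUT)));
        System.out.println(transitionFor(true));
    }
}
```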
Figure 6. A videoclip on the mobile guide.
7. SEVEN CHALLENGES

The prototype we have illustrated is a step in the direction we have tried to describe at the beginning of this paper. We think the theme discussed constitutes a terrain where several areas of research can yield important contributions. In the following we briefly review some challenges.

Visitor Tracking. In our own experience, after various investigations we have ended up sticking to our initial choice - infrared emitters at fixed positions and sensors on the mobile devices. This choice was also combined with a compass, but we are sure that more interesting solutions (e.g. ultrasound) will shortly become available. For the outdoor scenario, we need a combination of GPS and finer localization devices. Other techniques can be envisioned and should be further investigated; for example, beyond the physical position, it would be useful to know the visitor's direction of sight. For the moment this requires head-mounted displays and complex vision recognition hardware, but one can foresee a future where gaze detection may be possible with less obtrusive hardware, at least in structured and internally represented domains. Representations of, and reasoning about, what is in sight obviously need to develop substantially.
Novel devices. Acoustic output has been shown to be preferable here over written language, which can instead be usefully exploited for highlights and follow-ups. Yet improvements in high-quality graphics on a small device would be highly appreciated, since pictures have been shown to be very helpful in signalling references to objects in the physical space. Wearable devices and head-mounted displays can play a role in specific settings. In particular, head-mounted displays can be very useful if coupled with a technology that can overlay computer-generated graphics on real scenes (see Feiner et al, 1997). This technique is particularly interesting for archaeological sites, where the visitor would be able to "see" the buildings as they originally were. But often the best device for these kinds of applications is no device at all. Speech recognition in the environment, coupled with "spatialised" audio, would allow visitors to experience multisensory and unobtrusive interaction with the environment. The "narration" must develop with individual-oriented characteristics and at personal times, so it cannot just be produced once and for all by sound sources placed in the environment.

Expressing space and time reference. We need our systems to be able to reason about where things are, what kind of spatial entity they are, what they look like from a given position, how best the visitor can reach them, and when they will appear. For example, the system should be able to instruct the visitor to "reach the room at the end of this corridor" rather than "go forward 10 meters and then turn left". There is a substantial tradition in AI dealing with qualitative temporal reasoning and a somewhat less extensive one dealing with spatial reasoning (Stock, 1997). Representations must provide us with material at the right level of detail so that we can properly express it in words. Of course, we also need the language we produce to be sophisticated in the proper use of word characteristics: for time, taking into account concepts like aspect and tense; for space, for example, being able to choose specific spatial prepositions. A newer important element in research is the qualitative modelling of movement (see Galton, 1997), particularly relevant here, as we have seen that movement is the most relevant input modality, strictly coupled with our suggested medium - the environment.

Beyond descriptive texts. Perhaps the biggest challenge is concerned with keeping the attention of the user high and granting a long-term memory effect. We need to be able to devise techniques of material presentation that hook the visitor, that continuously build the necessary anticipation and release tension. The "story" (we mean the multimodal story that includes language, graphics and the visible physical environment) must be entertaining, and it should include mechanisms of surprise. The expectation sometimes must be contradicted, and this contrast will help in keeping attention and memorizing the situation. A typical mechanism of this kind is at the basis of various forms of computational humour (see Stock, 1996). Especially with children, humour (and play) can be a powerful means for keeping them interested. Another aspect where much more research is needed is concerned with mechanisms of persuasion: i.e. how we can build rhetorical mechanisms aimed at the goal that the hearer/experiencer adopts desired beliefs and goals.
It is not only a matter of rational argumentation, a somewhat more developed field, but an integration of various aspects, including some modelling of affect. In the end our philosophy is that the user is responsible for what she does and hence for the material that is
presented to her; yet through the presentation some specific goals of the museum curator can be put forward for adoption.

New visit modalities. The advent of technology opens the way to new modalities of visit, particularly important with children. A treasure hunt is an obvious example, where the external goals cause the innocent visitor to look for details and come across many different exhibits with "artificially" induced attention. An easier development is that at the end of a visit, a report of the visit is produced electronically, available for subsequent elaboration. For instance, it will allow the user to retrace in a virtual environment what she has seen and to explore related material at a deeper level through added hyperlinks.

Support for group visits. A considerable percentage of visitors come to the museum together with other people. For natural science museums, the typical case is a parent with children; for art museums it is a group of friends. The group dimension is largely unexplored: how best can a family (or other group) be exposed in individually different manners to the material in the environment, so that they discuss what they have seen and have a conversation that adds to their individual experience, bringing in new interests and curiosity? Only limited research is devoted to group visits (see for example Woodruff et al, 2001) and most issues are still open. Of course, we can envisage a big difference between the parent-child case and the friends scenario. Another interesting issue is the study of dynamic grouping, for example when grouping extends over time (see for example Rahlff, 1999) or is dynamically created during the visit.

Experimental evaluation. HIPS has been evaluated with real visitors in the Museo Civico in Siena. The results of the evaluation can be found in (Marti, 1999) and (Marti and Lanzi, 2001). The most enthusiastic comments regarded the possibility of moving freely during the visit while being assisted by the dynamic guide. The visitors felt comfortable listening to descriptions without interacting too much with the PDA interface, which was mainly used in case of poor performance by the system (delay in loading a presentation, lack of information, etc.). A feature that was especially appreciated was how information came tailored to the context. The visitors recognized the capability of the tourist guide to follow their movements, offering appropriate and overall coherent information at the right moment. However, our community has not become sophisticated enough in evaluating mobile systems for a cultural task. What we really need are techniques as powerful as the Wizard of Oz (simulation by hidden humans of systems that at least in part do not exist yet, and observation of user behaviour with the new means), so that the results will really help decide on specific design choices. Equally important, as in any educational environment, is to evaluate retention of concepts and vividness of memory after time (hours, weeks, years).
These themes impact on the environment seen as a medium: a) finer tracking permits building better user models and dynamically integrating appropriate linguistic material to be combined with the current perception of the environment; b) innovative devices will impact on coordination with the environment medium; c) better reference to space (and time too) will provide, on one side, more accurate means to change position and focus of attention and, on the other side, a better-integrated multimedia presentation, with language and environment more tightly coupled; d) texts that are able to bring about context-dependent story
mechanisms, for instance building expectations and releasing tension, may involve the surprise that an unpredicted scene, or an unpredicted interpretation, will bring: the environment can have a substantial role; e) visit modalities conceived for specific categories of visitors, such as children, will typically involve specific skills, so that attention will not be drawn to the environment medium according to standard, descriptive rhetoric, but will cause new pragmatic interplay among the various modalities; f) the social dimension in visits opens the way to a more developed subjective exploitation of the environment medium, to be reconciled through subsequent interaction among human agents; g) evaluation of human-system behaviour in multimodal, physical environment-based situations will have an obviously fundamental role in understanding and helping the design of better systems. Some of these challenges will be taken up in the PEACH project, which will deliver its results over a three-year period, with experimentation at the Castello del Buonconsiglio in Trento. DFKI, a partner in the project, will experiment with the prototypes at the Voelklinger Huette, a world cultural heritage site dedicated to the iron and steel industry near Saarbruecken.
8. CONCLUSIONS
There is a strong need for intelligent information presentation in the domain of cultural heritage appreciation and tourism. The way in which the cultural experience is carried on has not changed much for centuries, and the young, especially, seem to require novel modes of exposure to cultural material, modes that engage and entertain them. In this paper we have mainly presented the PEACH project, aimed exactly at this. The project builds on our previous experience in building location-aware systems able to present information in an adaptive way, on one side, and in building multimodal interfaces, on the other. In PEACH, one of the salient elements is the emphasis on multimodality in the dynamic presentation. The system is intended to combine dynamically adapted language-based output with dynamically produced visual documentaries. While the first part is based on macronodes, a technique we have been working on for years and continue to improve, the second is based on cinematic techniques. Underlying the whole approach is the adoption of rhetorical structure theory and the use of repositories of material to be adapted for the specific presentation case. The input to the system comes from the location of the visitor and from observations the system itself has made about her behaviour, her presumed interests and what she has been exposed to and presented so far; the material is presented coherently throughout the visit. At the time of this writing the first phase has been concluded, and a prototype for the Castello del Buonconsiglio, the most important museum of Trento, is available as a proof of concept. In the last part of this chapter we have discussed a number of issues that are critical for the further development of the field. We also believe advancement in technology for cultural tourism will lead to advancement in the world of education,
of mobile services and, in general, to human interfaces. The ultimate goal, in fact, is to have systems that are unobtrusive, that allow us to perceive the natural environment and integrate it, if useful, with a number of appropriately coordinated and coherent resources, with a result that is cognitively congenial and - why not? - pleasurable.
9. NOTES
1 http://www.niceproject.com
10. REFERENCES
Stock, O. (2001) Language-Based Interfaces and Their Application for Cultural Tourism. AI Magazine, Vol. 22, n. 1, 2001, pp. 85-97, American Association for Artificial Intelligence, Menlo Park, CA.
Minghetti, V., Moretti, A., Micelli, S. (2002) Reengineering the Museum’s Role in the Tourism Value Chain: Towards an IT Business Model. In Werthner, H. (Ed.), Information Technology & Tourism. New York: Cognizant Communication Corporation. 4(2), 131-143.
Stock, O., Strapparava, C., Zancanaro, M. (1997) Explorations in an Environment for Natural Language Multimodal Information Access. In M. Maybury (ed.) Intelligent Multimodal Information Retrieval. AAAI Press, Menlo Park, CA / MIT Press, Cambridge, MA.
Not, E., Petrelli, D., Sarini, M., Stock, O., Strapparava, C., Zancanaro, M. (1998) Hypernavigation in the Physical Space: Adapting Presentations to the User and the Situational Context. In New Review of Hypermedia and Multimedia, vol. 4.
Benelli, G., Bianchi, A., Marti, P., Not, E., Sennati, D. (1999) HIPS: Hyper-Interaction within the Physical Space. In Proceedings of IEEE Multimedia Systems ’99, International Conference on Multimedia Computing and Systems, Firenze.
Busetta, P., Serafini, L., Singh, D., Zini, F. (2001) Extending Multi-Agent Cooperation by Overhearing. In Proceedings of the 9th International Conference on Cooperative Information Systems - CoopIS2001. Lecture Notes in Computer Science vol. 2172, Trento, September.
Stock, O. and Strapparava, C. (2002) Humorous Agent for Humorous Acronyms: The HAHAcronym Project. Proceedings of the Fools' Day Workshop on Computational Humor, TWLT-20, Trento.
Bordegoni, M., Faconti, G., Maybury, M.T., Rist, T., Ruggeri, S., Trahanias, P., Wilson, M. (1997) A Standard Reference Model for Intelligent Multimedia Presentation Systems. Computer Standards and Interfaces, 18, pp. 477-496, 1997.
Not, E. and Zancanaro, M. (2000) The MacroNode Approach: Mediating between Adaptive and Dynamic Hypermedia. In Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-based Systems, AH'2000, Trento, August.
Feiner, S., MacIntyre, B., Hollerer, T., Webster, A. (1997) A Touring Machine: Prototyping 3D Mobile Augmented Reality Systems for Exploring the Urban Environment. In Proceedings of ISWC 97 (Int. Symp. on Wearable Computing), Cambridge, MA, October.
Marti, P., Gabrielli, L., Pucci, F. (2001) Situated Interaction in Art. Personal Technologies, 5:71-74.
Baus, J., Krüger, A. and Wahlster, W. (2002) A Resource-Adaptive Mobile Navigation System. Proceedings of IUI2002: International Conference on Intelligent User Interfaces 2002, ACM Press.
Stock, O. (ed.) (1997) Spatial and Temporal Reasoning. Kluwer Academic Publishers, Dordrecht.
Galton, A. (1997) Space, Time and Movement. In O. Stock (ed.) Spatial and Temporal Reasoning. Kluwer Academic Publishers, Dordrecht.
Rahlff, O.-W. (1999) Tracing Father. In Proceedings of i3 Annual Conference, Siena, October.
Marti, P. and Lanzi, P. (2001) I Enjoyed That This Much! A Technique for Measuring Usability in Leisure-Oriented Applications. In Joanna Bawa & Pat Dorazio, The Usability Business: Making the Web Work.
Stock, O. (1996) Password Swordfish: Humour in the Interface. In Proceedings of the International Workshop on Computational Humour, TWLT-12, Enschede.
Woodruff, A., Aoki, P.M., Hurst, A., Szymanski, M. (2001) The Guidebook, the Friend, and the Room: Visitor Experience in a Historic House. Extended Abstract, ACM SIGCHI Conf. on Human Factors in Computing Systems, Seattle, WA, March.
Malaka, R. and Zipf, A. (2000) Deep Map – Challenging IT Research in the Framework of a Tourist Information System. In D. R. Fesenmaier, S. Klein, D. Buhalis (eds): Information and Communication Technologies in Tourism 2000, pp. 15-27, Springer-Verlag, Wien, New York, 2000.
Mann, W.C. and Thompson, S. (1987) Rhetorical Structure Theory: A Theory of Text Organization. In L. Polanyi (ed.), The Structure of Discourse. Ablex Publishing Corporation.
Spohrer, J.C. (1999) Information in Places. IBM Systems Journal, Vol. 38, n. 4, pp. 602-628.
11. AFFILIATIONS
Oliviero Stock, Elena Not, Massimo Zancanaro
ITC-irst, Trento, Italy
T. RIST
SUPPORTING MOBILE USERS THROUGH ADAPTIVE INFORMATION PRESENTATION
1. INTRODUCTION
Recently, the coalescence of mobile telecommunication and portable computing has brought about a broad variety of new mobile appliances. Cellular phones with built-in micro computers are no longer just mobile telecommunication terminals but also serve as organizers and mobile Internet terminals. In turn, pocket computers, such as PDAs, benefit from built-in telecommunication technology since they can now become mobile nodes of computer networks, and some of them work as mobile phones as well. In addition, there is a third strand of technology development motivated by the aim to empower all kinds of portable everyday gadgets, such as wrist watches, clothing, and also jewelry, with computing and wireless communication functions. To provide a flavor of the ongoing convergence of technology strands, Fig. 1 illustrates a pedigree of some popular mobile devices. By the end of the last decade, the era of the “mobile internet / internet everywhere” had been announced, and the first generation of mobile phones appeared with built-in browser software for the display of data retrieved wirelessly from remote servers. In Europe the de facto standard WAP1 has been the technology of choice for wireless data connection, whereas Japan and other Asian countries introduced proprietary technologies, such as the i-mode system by NTT-DoCoMo, EZ-Web of the KDDI Corp., and J-Sky by J-Phone. In Europe, however, the predicted market penetration of WAP services did not take place, for a number of reasons. Firstly, the display capabilities of early WAP devices as well as their browser functionality were rather restricted. For instance, in principle it was possible to convert an arbitrary HTML2 web page that contained text into a pile of so-called WML3 pages ready for wireless download and display in the WAP browser of a mobile device. Due to limited screen real estate, however, only a small fragment of the text could be displayed at once. Reading a larger text, say of A4 page size, required the user to perform extensive scrolling and paging operations. More importantly, many users experienced such reading conditions as too awkward and therefore hardly used this option. To attract more users, several attempts were made by service providers and portal owners to prepare customized content for WAP clients. Unfortunately, such extra efforts caused additional costs, which could not easily be charged on top of the relatively high costs WAP users already had to pay to their telecom providers for the “air-time” when downloading the pages.
Since the launch of the first WAP services, some progress has been made concerning both the wireless data transmission systems and the capabilities of new devices. With the introduction of techniques like HSCSD4 and GPRS5, more data can be transferred at significantly reduced download times. In addition, GPRS paved the way for a billing structure that is based on data volume rather than on air-time.
Figure 1: Pedigree of some popular mobile devices
Therefore, users can stay connected to the internet as long as they want and are basically charged only for the amount of downloaded data. As to devices, display screens now tend to be slightly larger, have a much higher resolution and support color. This so-called 2.5th generation (2.5G) of mobile phones certainly goes well beyond its predecessor generation – at least in Europe. In fact, the Japanese approach to the wireless Internet seemed to skip some of the intermediate stages we had in Europe: with the introduction of i-mode in 1999, their era of 2.5G mobile internet began at least two years earlier than in Europe. By the end of 2003 the third-generation messengers will already be in use. The transition from 2G/2.5G to the yet pending introduction of 3G wireless devices and broadband communication infrastructure, such as UMTS6 in Europe, will boost the development of adaptive services for mobile users for at least the following reasons:

Availability and fast transfer of rich media formats. Multimedia messaging is currently a buzzword heavily used by marketing people to whet the user’s appetite for the upcoming era of mobile telecommunication. In fact, many of the new devices will be equipped with high-resolution color screens, some with cameras
and some others will even support the simultaneous use of different input modalities, such as voice, touch screen, and keyboard.

Improved location awareness. A number of applications – mainly navigation systems – are already able to take into account a user’s current position, e.g., by taking advantage of GPS (Global Positioning System) or of information about the phone cell with which a mobile phone is currently registered. However, GPS only works outdoors and requires either a special device or an extension board. On the other hand, with a resolution between approx. 30 meters under optimal conditions and some hundred meters in worse cases, phone cell information is often too imprecise for many location-based services. In contrast, the upcoming 3G devices and infrastructure promise a localisation resolution in the range of less than 10 meters. This will still be worse than GPS, but the important point is that it may work both indoors and outdoors without requiring hardware extensions.

Download of executable applications. Perhaps the most important aspect concerns the ability to download and execute software on 3G devices. That is, 3G devices are no longer restricted to built-in firmware but provide open platforms for all kinds of software applications – restricted only by storage and processor performance capabilities. For instance, similar to the introduction of so-called Java applets to enrich web pages in the second half of the 1990s, Java programmers have now started to develop so-called Midlets7, which are small Java programs that can be executed on mobile devices running a mobile edition of a Java virtual machine. Looking 4 to 5 years further into the future, that is, moving downwards in the pedigree depicted in Figure 1, we can already anticipate the next but one generation of mobile devices (4G and beyond), which are likely to take advantage of various input sensors, e.g., to measure orientation, acceleration, temperature and so forth. No doubt, this will again open another strand for new services to benefit mobile users. While there is no doubt that upcoming new mobile appliances will provide a high potential for a broad range of new information services that can accommodate specific needs of mobile users, a particular challenge for service providers will remain. It is a matter of fact that any new generation of devices and software tools will hardly replace the previous generation immediately. This phenomenon is commonplace in the IT industries, and therefore many software manufacturers and operators of web portals feel obliged to support some downward compatibility of their products in order not to exclude potential customers. Unfortunately, compared to the “PC world” the situation is more challenging, since the spectrum of mobile devices is already much more diverse in terms of hardware and software capabilities, and it is likely that this diversity will continue to increase. As to hardware, differences are due to broadly varying display capabilities given by different screen sizes, resolutions, and available colors, if color display is supported at all. As to software, one has to consider that mobile editions of browser and display programs are usually less powerful than their counterparts running on a PC, and mobile editions of application software may struggle with dramatic performance differences from one mobile device to another.
In the sequel we first refer to some related work and then sketch several different application scenarios to illustrate and motivate the need for techniques that aim to customize information presentations for mobile users.
2. RELATED WORK
Depending on the specific methods to solve adaptation tasks that will be introduced in the following sections, similarities to work done by others exist, especially in the areas of (i) adaptive information delivery and presentation, (ii) navigation services for mobile users, and (iii) shared information representations in distributed systems. Concerning adaptive information delivery and presentation, most of the existing approaches are based on proxy solutions. In essence, an adaptation proxy sits in between an ordinary web server (or any other remote information source) and a client. The proxy performs specific tasks that aim to meet the adaptation goals of the client. This may comprise, for instance, filtering, aggregation, transformation and (re)formatting of fetched web content. Some adaptation proxies have been developed to compensate for bandwidth bottlenecks in networks and especially in mobile networks (Brewer et al. 1998). One of the most popular proxies that aim to compensate for device limitations is the Top Gun Wingman, a graphical, split Web browser for the Palm Pilot PDA, developed by Fox et al. (1998). Recently, more complex proxy architectures have been proposed that, besides device adaptations, also address personalization issues. For instance, the APRO proxy server by Kylänpää & Heino (2000) enables a user to configure her own preference profile with an ordinary web interface (e.g., using HTML forms). If this user accesses the proxy with a mobile device, the specific profile will guide the adaptation process at the proxy and thus enables a more personalized adaptation. The iMobile proxy even foresees a personal agent that maintains a user profile, accesses and processes Internet resources on behalf of the user, keeps track of user interaction and performs content transformations according to the user and a device profile (Rao et al. 2001). The DailyLearner (Billsus et al. 2000) is another personalized agent running on a proxy. The system creates a statistical interest profile of the user to determine which information shall be sent to the mobile device. Personalization in this case is intended to save both data transmission costs and the user’s effort to read news items that do not fit her interest profile. The introduction of a standardized version of so-called Composite Capability/Preference Profiles (CC/PP) as well as the use of more fine-grained user models (Marucci & Paternò 2002) will further support more personalised adaptations and scalability of mobile access to web content. The adaptation process itself is often driven by meta information that is already coded in the representation of the information source. For instance, Gaedke et al. (1998) proposed a special-purpose markup language called WCML (Web Composition Markup Language) for an object-oriented approach to compose presentations from a repository of device-specific style components. In contrast, the approach by Hori et al. (2000) relies on a set of special-purpose annotations that provide a proxy with information about the relative importance of content in a web page. Their proxy entails a so-called document clipping plug-in to portion content into smaller pages that match the available screen size of the mobile target device. To avoid extra efforts in the content authoring phase, Gonzalez-Castaño et al. (2002) propose an automated mechanism that superimposes a symbolic abstraction
hierarchy on the original document. Rather than accessing the original document with a PDA, the mobile user receives a hyper-document and uses the abstraction hierarchy to navigate through its individual parts. Han et al. (1998) propose a transcoding policy that a proxy may apply to images. In their work, image transcoding means applying compression mechanisms to reduce the data volume, with the ultimate aim of reducing download time. However, since time is also needed for the compression, they try to estimate the trade-off between adaptation costs and the actual gain in terms of reduced download time. Mohan and colleagues (1999) describe a transcoding framework with the aim of enabling universal access to multimedia web content. They propose a set of content transcoders that transform high-resolution source media objects into target media objects that typically have a lower resolution. In the case of images, they perform a semantic classification of the source image to decide on the transcoding mechanism to apply. While somewhat similar to the ImageTrans system that will be presented in Section 3.3.3, the assignment of image classes and transcoders is done once manually and does not involve learning. Most of the scenarios that will be presented later on in this paper also assume a proxy where adaptation is done. For instance, the IMAGEN system described in Section 3.1 is essentially a proxy server, and image transformation for WAP clients is also a typical service to be carried out at a proxy. The popularity of proxy solutions for mobile content access has two major reasons. Firstly, so far the communication bandwidth of a wireless link has been either quite restricted or quite expensive. Secondly, most of today’s mobile devices are not yet powerful enough to perform complex adaptation tasks themselves. In the near future, however, this may change and part of the adaptation will be done by the devices themselves. This may lead to heterogeneous approaches of application-aware adaptation as proposed in the Odyssey project (Noble 2000). To cope with adaptation issues which arise due to variations in the available bandwidth of a wireless connection, the authors argue that neither the operating system (which might include a proxy) nor an application alone should decide on the adaptation strategies to overcome bandwidth problems. Addressing a basic need of many mobile users, various navigation services have already become available. Most web-based routing services support mobile access and provide presentation variants tailored to different categories of mobile devices (e.g., for WAP phones and PDAs). Companies such as Fontus Inc.8 even aim at users of 3G wireless devices and have begun to produce sophisticated 3D maps and display software that runs on 3G devices. As to the design of graphical navigation aids for mobile devices with sparse display capabilities, Chincholle et al. (2002) note that the download of traditional miniaturised maps to a WAP-enabled cell phone contributes little to the mobile user's demands, whereas route directions were considered valuable. A comparison of different presentation formats for route instructions on mobile devices can be found in Kray et al. (2003). An interesting approach for the generation of route maps has been presented by Agrawala and Stolte (2001). They have developed a set of generalization techniques to improve route map usability.
Taking into account aspects of human cognition, their LineDrive system generates route sketches that show distortions in road lengths, angles and shape quite similar to the distortions found in sketches produced by humans when asked
to sketch how to drive from one location to another. Compared to conventional maps, sketches produced by LineDrive have the advantage that they need less screen real estate for display and thus are especially suitable for PDAs. The Shopping Companion that we will sketch in Section 4.2 shares some similarities with other special-purpose navigation aids, such as mobile museum guides (Not et al. 1998) and other shopping guides. The potential of such guides has been studied, for instance, by Bohnenberger et al. (2002). They conducted an empirical study to compare performance and user appreciation under two shopping conditions, one in which subjects were provided with a PDA-based shopping navigator versus one in which subjects had only a printed map (i.e., a 2D floor plan) at their disposal. They found that the format of assistance did not significantly impact the performance of concrete shopping tasks (i.e., the time needed to localise products); however, most of the subjects preferred the PDA version over the classical map. Finally, our work on multi-user systems, such as the shared map application that will be presented in Section 5, relates to work done in the area of CSCW (Computer Supported Cooperative Work). In the context of a CSCW system with shared graphical output, Stefik et al. (1987) coined the term “relaxed WYSIWIS (What You See Is What I See)”. In contrast to “strict WYSIWIS", where all collaboration partners obtain exactly the same graphical displays, relaxed WYSIWIS allows one to be more selective about what is to be shared by other collaboration partners and how shared information gets individually presented. The so-called abstraction-link-view concept by Hill (1992) links different views to a shared data abstraction of the application semantics via constraints. During a multi-user session different views are dynamically updated in such a way that the specified constraints remain satisfied. The CSCW toolkit GroupKit (Roseman & Greenberg 1996) relies on an open protocol approach that allows application programmers to implement and coordinate tailored views on shared data abstractions. The toolkit Weasel (Graham & Urnes 1992) comprises a so-called Relational View Language for specifying customized views and their dynamic behaviors during a session. The notion of so-called tailorable user display agents has been introduced by Bentley et al. (1994). Sitting between the shared data representation and a user’s display, the agent is responsible for the (possibly) user-specific style of information display and also for the continuous update of the display during a session based on a filed mechanism for event-handling. While the purpose of these approaches is view adaptation, their primary intention was not directed towards the support of mobile users. And apart from a few exceptions, the involvement of mobile and heterogeneous communication devices has rarely been addressed in CSCW research. Myers et al. (1998) describe a collaborative environment in which PDAs are used as remote control devices for the manipulation of a display that is shared by all users. A combination of multiple PDAs with a shared large-scale whiteboard is also used by Greenberg and Boyle (1999). In their SharedNotes system, the PDAs are used as input devices for private notes which can, however, also be made public by sending them to the shared whiteboard.
Nevertheless, the primary focus of this work has been on maintaining private and public information in a multi-user application rather than on view adaptation to account for specific device limitations.
Figure 2: Core Components of the IMAGEN Platform
3. ACCESSING THE WORLD-WIDE WEB WITH MOBILE DEVICES
While it is debatable whether web browsing belongs to the most useful applications for the users of mobile devices, there is no doubt that these appliances provide a high potential for a broad range of new information services that can accommodate the specific needs of mobile users. Let us consider a first example that is taken from the EU-funded IMAGEN9 project.
3.1 Access to a Personalized Art Portal
The objective of IMAGEN is the development of a platform for the set-up and operation of portal services that aim at customized publication and distribution of multimedia content. A first pilot application of IMAGEN is the MyArt online service, whose major business objective is to establish a communication channel between a large Italian publisher in the art domain and the community of English-speaking Internet users interested in Renaissance art. IMAGEN can be conceived as a mediator between content authors (1st-level users) and content customers (3rd-level users). In between we have to consider a further user group, usually consisting of publishers, content syndicators/owners and administrators of Web portals. Such 2nd-level users may purchase media assets from a number of different 1st-level users and compile customized content packages for their 3rd-level users. The platform and its core components are roughly sketched in Figure 2. The basic contents, so-called content units, that are to be distributed by IMAGEN are
stored in a number of possibly dispersed content repositories. A content unit consists of a small composition of multimedia assets, such as texts, graphics, images, sound and video clips, and programmed interactive units, e.g., Java applets or Flash animations. The composition is assumed to be self-contained and is specified in an XML format. A graphical editor, such as the eXact Packager10 may be used by authors for the creation of content units.
Figure 3: Delivery of customized art collections to different PC users
In IMAGEN the idea is to provide 3rd-level users with content packages, which are created on the fly based on content that is of interest to them. Therefore, in response to 3rd-level user requests, IMAGEN creates and delivers customized collections of content units. As illustrated in Figure 3, delivered packages usually differ with regard to content and layout style, depending on profile information about the users. For the representation of content packages IMAGEN draws on XML-based standards developed by the IMS consortium (see http://www.imsproject.org/) for representing tutorials and multimedia content packages. The process of delivering content from the content provider to the content user is divided into three major steps: Content Management, Layout Management, and Transaction Management – which are implemented by separate components. Details about the individual components and processes can be found in Kraus et al. (2002) for personalised content selection, and in Kröner et al. (2002) for constraint-based layout. For the purpose of this paper, we restrict ourselves to layout customization for mobile users.
3.2 Content Delivery to Mobile Devices with Small Screens
There are a number of mobile devices, such as pocket computers and PDAs, that feature medium-size screens, typically in the range of [200 to 550] x [200 to 550] pixels. However, there is no standard screen shape and size apart from the fact that all screens are rectangular and that PDA screens tend to have a more balanced length
and width ratio than screens of pocket PCs. On the other hand, an important common feature of these devices is that they usually run at least one mobile edition of a PC-style Web browser. Therefore, layout customization for the purpose of the MyArt application can be achieved by providing the Web browsers with suitable settings of style attributes, such as font families and sizes as well as dimension parameters for embedded images.
Figure 4: A MyArt content package customized for the Nokia 9210 Communicator
In IMAGEN, a constraint-based layout manager can take into account parameters such as screen size and resolution, and the availability of colors, to compute an optimal assignment of style parameters for a content package. Once a suitable assignment of style attributes has been determined, syntactic transformations can be applied to make further device-specific adaptations concerning factors such as hardware limitations, the design of physical controls, and limitations of the installed web browser. For instance, the design of some devices invites their users to scroll, whereas leafing through a presentation will be more convenient on other devices. Since in IMAGEN both content units and content packages are represented as XML documents, syntactic transformations can be performed easily by applying XSLT11 transformation rules that transform XML documents into other XML documents. In our current experiments, we use a number of different mobile devices with relatively small screens - among others, a Nokia 9210 Communicator. In contrast to a typical PDA, this device has a 640 x 200 pixel color display and a built-in web browser that supports basic HTML features. Indeed, our test cases confirm that suitable presentations can be produced with IMAGEN’s layout manager. A showcase of accessing MyArt with the Nokia Communicator is shown in Figure 4. To account for the small screen of this device, the package layout has been adapted in several ways. For instance, a constraint has been added that gives priority to larger fonts since, due to a higher screen resolution, fonts appear smaller on the Nokia 9210 than on an ordinary PC screen. To reduce the need for extensive scrolling, content is portioned into several pages that are compliant with the available display space of the screen. Finally, an XSLT transformation is applied to refine the page layout and to connect these pages with a navigation bar that appears in the upper left corner of a page. Similarly, the layout of MyArt content packages can be customized for display on a PDA. Customization trials included the Sharp Zaurus SL5500 device and an HP Jornada handheld.
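The following is a minimal sketch, in Python with the lxml library, of the kind of XSLT-driven page splitting and navigation hooks described above. The element names, the stylesheet and the sample content are invented for illustration and are not the actual IMAGEN package format or transformation rules.

```python
# Minimal sketch (not the IMAGEN implementation): apply an XSLT stylesheet to an
# XML content package so that every content unit becomes its own small page with
# a simple prev/next marker, sparing a small-screen browser long scrolling.
from lxml import etree

PACKAGE = b"""<package>
  <unit id="u1"><title>Primavera</title><body>Tempera on panel, c. 1482.</body></unit>
  <unit id="u2"><title>Venus</title><body>The Birth of Venus, c. 1485.</body></unit>
</package>"""

# Illustrative stylesheet: one <div class="page"> per <unit>; a real deployment
# would emit separate documents and turn [prev]/[next] into actual links.
STYLESHEET = b"""<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
  <xsl:output method="html"/>
  <xsl:template match="/package">
    <xsl:for-each select="unit">
      <div class="page">
        <p class="navbar">
          <xsl:if test="preceding-sibling::unit">[prev]</xsl:if>
          <xsl:text> </xsl:text>
          <xsl:if test="following-sibling::unit">[next]</xsl:if>
        </p>
        <h3><xsl:value-of select="title"/></h3>
        <p><xsl:value-of select="body"/></p>
      </div>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>"""

transform = etree.XSLT(etree.XML(STYLESHEET))
pages = transform(etree.XML(PACKAGE))
print(str(pages))
```

Because both the package and the result are XML, swapping in a different stylesheet (for a different screen size or browser) changes the presentation without touching the content units, which is the point of keeping adaptation at the syntactic level.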
3.3 Mobile Devices with Tiny Screens
Today, the delivery of content to small mobile devices, such as contemporary cell phones, remains constrained in terms of bandwidth, memory, and processing power. In the near future, however, rapidly improving hardware and infrastructure will make these limitations disappear. Nevertheless, one problem will not be solved in this way – a great many mobile devices will still come with tiny displays, and this will impose strong constraints on how content can be presented and accessed.
3.3.1 Transcoding Content
A popular approach to delivering web content to mobile devices with tiny displays is called web content transcoding, e.g., see Britton et al. (2001). This technology aims to adapt content according to constraints and preferences associated with a specific device. In the case of text, a number of approaches exist to transform text represented in an XML format into so-called stacks of WML (Wireless Markup Language) pages, or into lean versions of HTML / XHTML, such as cHTML (compact HTML), HTML-4-Mobile, and XHTML Basic12. Commercial products (including, for example, IBM’s WebSphere™, or xCA.suite by xCA.com) are available to achieve this task. However, preliminary tests conducted at DFKI with the aim of transcoding MyArt content units into WML pages showed that a mere syntactical transcoding resulted in huge piles of WML pages. Such presentations are cumbersome to handle with a tiny display and weak navigation facilities, and thus hardly live up to the idea of providing 3rd-level users with easy-to-use content access mechanisms. Even worse is the situation for non-text media types, such as images and graphics. Even the transformation of static graphics still remains a challenge. Neglecting graphical representations completely may be acceptable if these representations serve as decorations only. Unfortunately, however, there are many examples of services which can hardly do without graphical representations to encode information. In the case of the MyArt service, it is certainly doubtful whether one should try to adapt images of artworks so that they can be displayed on tiny screens. But there are many other services that can benefit mobile users. For instance, at DFKI we have developed a personal travel assistance service (André & Rist 2001) and a service called Personal Picture Finder (Endres et al. 1999). In their original versions both services were designed for PC users who are assumed to interface with the services via an ordinary WWW browser. In response to a user query the travel assistance service usually returns a mixture of text and graphics, e.g. including images of hotels and resorts as well as maps. Moreover, the Personal Picture Finder frequently returns only pictures as a search result.
3.3.2 Image Transformations
To enable access to graphical web services by means of mobile WAP phones with tiny displays, we have investigated several approaches to the transformation of graphical representations. By a transformation we understand an operation that can
be applied to a source image in order to obtain a target image that meets the particular requirements of the target display environment. Transformations can be complex in the sense that they can be described as sequences or configurations of several less complex operators. Targeted towards an adaptation of graphics for tiny displays, a graphics transformation usually comprises a down-scaling operation and a reduction of colours. Also, there may be transformations that take into account a number of additional parameters (e.g., height and width of the target graphics) to adjust the transformation process. Unfortunately, a certain transformation T may lead to reasonable results for some source images while it produces unacceptable results for others. The problem is illustrated in Figure 5. Both source images have been converted with one and the same transformation, in this case using the Ginco “Pic2WBMP”13 image converter. However, similar problems arise when using other available products for image manipulation with built-in facilities that automatically convert images into formats that fit onto the display of mobile phones. This observation motivates the configuration of different transformations for different image classes. For a new source image, we need to determine an appropriate image class and apply the transformation which has been associated with that class. This approach raises two issues: (i) how to classify images, and (ii) how to configure appropriate transformations for the specific image classes. To address the first issue, we compute for each source image a characteristic feature vector that can be used later on to classify the image. An extreme case would be to take each pixel of the image as a single feature. However, this would be impractical as images often have hundreds of thousands, if not millions, of pixels. To serve their purpose, a feature vector should be a condensed representation of an image, and the features should relate to an image’s color and luminance distribution in one way or another. In fact, the exact computation of a feature vector is a secondary issue. It is more important that an image vector covers a sufficient number of features (e.g., several hundred features) to ensure that different source images get characterised by distinct vectors.
Figure 5: Left: Applying a transformation T to a source image to obtain a small black and white target image that can be displayed on a mobile device. Right: Applying the same transformation to another source image produces a less satisfactory result
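To make the notion of a parametric transformation more concrete, the sketch below (Python with the Pillow imaging library) combines the two typical steps, down-scaling and colour reduction. The target dimensions and the dithering switch are illustrative parameters, not the settings of the converter mentioned above.

```python
# Minimal sketch of a parametric transformation T: down-scale a source image and
# reduce it to a small 1-bit black-and-white target that fits a tiny display. The
# dithering flag is the kind of parameter that makes one fixed T work well for
# some image classes and poorly for others.
from PIL import Image

def transform(src_path: str, dst_path: str,
              max_w: int = 96, max_h: int = 65, dither: bool = True) -> None:
    img = Image.open(src_path).convert("L")      # drop colour, keep luminance
    img.thumbnail((max_w, max_h))                # down-scale, preserving aspect ratio
    mode = Image.Dither.FLOYDSTEINBERG if dither else Image.Dither.NONE
    img = img.convert("1", dither=mode)          # reduce to 1-bit black and white
    img.save(dst_path)                           # e.g. PNG/BMP; WBMP would need a converter

# Two variants of T that differ only in the dithering parameter:
# transform("portrait.jpg", "portrait_dithered.png", dither=True)
# transform("landscape.jpg", "landscape_plain.png", dither=False)
```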
3.3.3 A Trainable Image Transformation System
Our idea was to deploy machine learning techniques to establish an assignment between recognized features of an image, on the one hand, and available transformations, on the other. That is, in a training phase, a graphics design expert manually assigns transformations to source images and thereby allows the system to recognize and generalize correspondences between image features and transformation parameters. A screenshot of our trainable image transformation system is shown in Figure 6. When a source image is loaded into the system, the image is displayed in the left part of the frame and an analysis is carried out to construct a feature vector that can be used to classify images. One extreme case is to consider each pixel as a feature. Unfortunately, such long vectors would be impractical as input for a learning system. In our current test setting we use about 430 features that result from regional color distributions of an image. To calculate regional distributions we overlay a source image with a grid and determine the histograms for the cells. The size of the current feature vector has been determined experimentally using smaller and larger vector sizes. However, more extensive testing with other feature vectors would be a reasonable strategy to optimize the approach. Our repertoire of transformations comprises eight different transformations and is as yet quite small. The result of applying each of the eight available transformations to a source image is displayed on the right-hand side of the screenshot in Figure 6. In the learning or training phase, a graphics design expert tells the system which of the eight transformations yields the best result for a given source image. In the test phase, the system is asked to recommend one of the eight transformations on its own for a new image that was not part of the training set. To transform the sample source image shown in Figure 6 into an image that can be displayed on a mobile WAP phone, the system recommends T2. In cases where the system remains undecided about which of the eight transformations to recommend, it may not recommend any of them, but tick the “nil” option T9 (see Figure 6). We trained the system with a set of 130 images including portraits of persons, images of landscapes, indoor and outdoor images of buildings, images of everyday objects, and photographs of artworks. While we have experimented with several different learning methods provided by a software package for machine learning (Witten & Frank 2000), a standard C4.5 decision tree algorithm has proven most useful for this task. To assess the performance of the trainable image transformation system from a user’s point of view, some feedback was gathered during the Comdex computer fair in Basel in September 2001. Stand visitors were invited to participate in an informal evaluation. A web cam was used at the stand to take portrait pictures (size 320 x 240 pixels, 256 colours) of stand visitors. These portraits were then loaded into the image transformation system in order to obtain small (60 x 45 pixel) black and white portraits that could be loaded and displayed in the WAP browser of a mobile phone. A total of 61 visitors participated in the exercise and rated the system's average performance as a reasonably good 3.3 points on a 5-point scale (1 = not acceptable, 2 = poor, 3 = acceptable, 4 = good, 5 = very good). However, this
ranking was relative to the eight available transformations that the system could choose from. For further details of the evaluation and the ImageTrans system we refer to (Rist & Brandmeier 2002).
Figure 6: Screenshot of the trainable image transformation system ImageTrans. After the training phase the system recommends one of several available transformations for yet unseen source images (left). The transformed image can be displayed in a WAP phone (right).
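As a concrete illustration of the two ingredients just described, a grid-based feature vector and a learned mapping from features to transformations, the following Python sketch uses NumPy, Pillow and scikit-learn. Note the assumptions: scikit-learn offers a CART-style decision tree rather than C4.5, and the grid size, bin count and label set are illustrative stand-ins for the roughly 430 features and eight transformations of ImageTrans.

```python
# Minimal sketch: regional luminance histograms as a feature vector, plus a
# decision tree that learns which transformation an expert would pick.
import numpy as np
from PIL import Image
from sklearn.tree import DecisionTreeClassifier

GRID = 6     # 6x6 grid cells ...
BINS = 12    # ... each with a 12-bin luminance histogram -> 432 features

def feature_vector(path: str) -> np.ndarray:
    img = np.asarray(Image.open(path).convert("L").resize((120, 120)), dtype=np.float32)
    step = img.shape[0] // GRID
    cells = []
    for r in range(GRID):
        for c in range(GRID):
            cell = img[r * step:(r + 1) * step, c * step:(c + 1) * step]
            hist, _ = np.histogram(cell, bins=BINS, range=(0, 255), density=True)
            cells.append(hist)
    return np.concatenate(cells)          # condensed description of the image

# Training phase: an expert labels each image with the best transformation
# (1..8, or 9 for the "nil" option).
def train(paths: list[str], expert_labels: list[int]) -> DecisionTreeClassifier:
    X = np.stack([feature_vector(p) for p in paths])
    return DecisionTreeClassifier().fit(X, expert_labels)

# Test phase: recommend a transformation for an unseen image.
# clf = train(training_paths, training_labels)
# recommended = clf.predict(feature_vector("new_image.jpg").reshape(1, -1))[0]
```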
3.3.4 Semantic-based Translation of Graphics
The ImageTrans system introduced in the previous section aimed to modify a source graphic at the picture level by means of an image transformation. Another approach is inspired by the semantic-based translation of language. That is, a text given in one language is read and analysed to capture its meaning. Once the meaning is understood, a corresponding text can be generated in the target language. Similarly, the translation of graphics does not aim at a modification of the source image. Rather, the idea is to first perform a deeper semantic analysis of the source image to obtain a content description, and then generate a new target image that still preserves the essential semantic content but at the same time satisfies particular style constraints. We illustrate this approach by a translation of a route description as it may be obtained from a routing service available on the web (cf. left frame of
Figure 7) into stylistic variants that can be displayed on a WAP phone (cf. the set of frames on the right-hand side of Figure 7). In this particular example the analysis of the map display is relatively easy to perform, since a special colour coding scheme is used in the map generation process. For instance, the colour blue is used exclusively to highlight the requested route from its start to its end point, and the colours used to depict streets are not used for other purposes elsewhere in the map. Therefore, it is possible to apply a scan-line algorithm that identifies the highlighted route segments in the map. It is then also possible to construct a semantic representation of the route that contains all crossings between the start and the end point. Such a semantic description can be represented, for instance, in an XML-compliant format for further processing. Once a semantic representation of the route description has been extracted from the source map, various stylistic variants can be generated that meet the resource limitations of the output device and/or a user’s personal style preferences. Since in the example the task was to display the route description on the small display of a WAP-enabled mobile phone, a reasonable approach is to segment the route into single manoeuvre points and to generate a schematic visualization for each corresponding manoeuvre point. However, as illustrated on the right-hand side of Figure 7, a manoeuvre like “turn left at the next crossing” can be visualised in many different ways.
Figure 7: Translation of a map showing a route description into stylistic variants for display on a mobile WAP phone (here a Nokia 7110).
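A rough sketch of the colour-keyed extraction step follows (Python with NumPy and Pillow). It relies on the strong assumption stated above, namely that pure blue ink marks only the route; the thresholds, the crude waypoint ordering and sub-sampling, and the tiny XML vocabulary in the usage comment are invented for illustration, and the actual scan-line analysis and manoeuvre segmentation are more elaborate than this.

```python
# Minimal sketch: find the blue route pixels, thin them to an ordered sequence of
# waypoints, and turn large heading changes into "maneuver" records that could be
# rendered as one schematic screen per crossing.
import math
import numpy as np
from PIL import Image

def route_pixels(path: str) -> np.ndarray:
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.int16)
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    mask = (b > 180) & (r < 100) & (g < 100)      # "blue is used only for the route"
    ys, xs = np.nonzero(mask)
    return np.column_stack([xs, ys])

def maneuvers(points: np.ndarray, turn_deg: float = 30.0) -> list[dict]:
    # Crude ordering: sort along the x axis, then look for heading changes.
    pts = points[np.argsort(points[:, 0])]
    pts = pts[::max(1, len(pts) // 50)]           # sub-sample to ~50 waypoints
    result = []
    for a, b, c in zip(pts, pts[1:], pts[2:]):
        h1 = math.degrees(math.atan2(b[1] - a[1], b[0] - a[0]))
        h2 = math.degrees(math.atan2(c[1] - b[1], c[0] - b[0]))
        delta = (h2 - h1 + 180) % 360 - 180
        if abs(delta) > turn_deg:
            result.append({"x": int(b[0]), "y": int(b[1]),
                           "turn": "left" if delta < 0 else "right"})
    return result

# A possible XML-compliant rendering of the extracted semantics:
# for m in maneuvers(route_pixels("route_map.png")):
#     print(f'<maneuver x="{m["x"]}" y="{m["y"]}" turn="{m["turn"]}"/>')
```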
Unfortunately, the presented translation approach will only work for the particular map service used in the experiment. However, the development of
information extraction algorithms for visual media is an active field of research that may lead to usable results, at least for restricted domains.
4. EMERGING NEW SERVICES FOR MOBILE USERS
In the previous section, we focused on the aim of providing mobile users with access to web services that are primarily designed for classical use with a stationary PC. However, mobile users do not necessarily have the same information needs and Internet usage behaviour as if they were accessing the web with a PC from home or from the office. Rather, to increase the benefit of services provided to mobile users, it is important to explore the actual information needs of this user group. In our search for dedicated services for mobile users, we observed that added value for mobile users can emerge when already existing services are combined. In this section, we will present two examples – the first one will again be taken from the IMAGEN project.
4.1 Informing About Local Art Exhibitions
The MyArt portal introduced in Section 3.1 is a typical web service that invites interested users to subscribe. For the purpose of illustration, let us consider a mobile user visiting Florence. When in Florence, the user might become interested in accessing MyArt to receive information about artworks that are on display in the Uffizi museum, in order to decide whether or not to visit the museum. Since she has only a WAP phone with her for accessing the MyArt portal, it would not be a good idea to compile an ordinary content package, even if it contained only descriptions of pieces that are displayed in that particular museum. As an alternative, we suggest compiling a package that comprises only a few selected artworks and presenting it to the user in the form of a virtual walk through the museum. In this case it is not only the amount of delivered information but also the form (i.e., an exploratory walk-through) that may attract a mobile user to actually browse through the provided presentation. Figure 8 illustrates this application. A floor plan of the exhibition and a set of content units, which correspond to the exhibited artworks, serve as input to presentation generation. The user may browse through the exhibition by following a predefined path. Waypoints of this path are presented as small 3D graphics representing rooms and floors of the exhibition building (see the sequentially aligned screens in Figure 8). Some of the graphics include icons which indicate points of interest, e.g., artworks exhibited at this location. By selecting the icon, the corresponding content units can be accessed (see the 4th waypoint). In this scenario the user herself initiated a request for information about a certain museum. In a further MyArt version, however, a registered user may also wish to be actively informed about upcoming and ongoing art exhibitions that seem of interest according to an available profile and her current location. Such an active “push” service is already possible by means of a so-called LBS (Location-Based Services) server, provided the user agrees to being localized (via the cell phone
carrier) and also agrees to receive unsolicited notifications, e.g., in the form of an SMS message.
Figure 8: A virtual walk through a museum performed on a mobile WAP phone
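The sketch below (Python) shows one way such a predefined path could be rendered as a deck of linked WML cards, one card per waypoint, with links to the artworks on display at that spot. The waypoint data, image names and link structure are made up, and the real walk-through uses small 3D room graphics and the actual MyArt content units rather than plain text.

```python
# Minimal sketch: turn an ordered list of waypoints into a WML deck for a WAP phone.
WAYPOINTS = [
    {"id": "wp1", "label": "Entrance hall", "artworks": []},
    {"id": "wp2", "label": "Room 10",       "artworks": [("u1", "Primavera")]},
    {"id": "wp3", "label": "Room 12",       "artworks": [("u2", "Birth of Venus")]},
]

def walkthrough_deck(waypoints) -> str:
    cards = []
    for i, wp in enumerate(waypoints):
        # Links to content-unit cards, assumed to be appended to the deck elsewhere.
        links = [f'<a href="#unit_{uid}">{title}</a><br/>' for uid, title in wp["artworks"]]
        if i + 1 < len(waypoints):
            links.append(f'<a href="#{waypoints[i + 1]["id"]}">Continue walk</a>')
        cards.append(
            f'<card id="{wp["id"]}" title="{wp["label"]}"><p>'
            f'<img src="{wp["id"]}.wbmp" alt="{wp["label"]}"/><br/>'
            + "".join(links) + "</p></card>")
    return ('<?xml version="1.0"?>\n'
            '<!DOCTYPE wml PUBLIC "-//WAPFORUM//DTD WML 1.1//EN" '
            '"http://www.wapforum.org/DTD/wml_1.1.xml">\n'
            "<wml>" + "".join(cards) + "</wml>")

# print(walkthrough_deck(WAYPOINTS))
```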
A recent survey undertaken by HPI Research in June 2001 deserves mentioning in this context. A total of 3300 people in eleven different global markets were asked what they think about direct marketing concepts based on mobile phones. Interestingly, 88% of respondents stated that they would be open to receiving ads, while 31% would welcome marketing of this type. A major conclusion of the study was that the success of location-based direct marketing concepts depends on a number of key factors. For instance, even those who appreciated the service envisaged situations in which they would want to turn it off. Therefore, it seems important to leave the choice of whether or not to receive ads with the user. Also, users must not be forced to read the messages completely. Rather, they should be given means to bypass and delete messages easily and quickly. Customization is another important aspect. Most users sought a means to customize the notification service so that they would only receive messages that match their interest profiles. Lastly, users expect a mutual benefit. This could be, for instance, an electronic coupon that entitles the receiver to a discount when buying products from the advertising party. Also, 86 percent of the respondents stated that mobile marketing would be acceptable if it helped keep the cost of service down, that is, they expect a reduction in the cost of mobile service in exchange for receiving ads.
4.2 Enriching the Traditional Shopping Tour in the City
In Figure 8 we sketched a service for mobile users of WAP phones. This service can be conceived as a combination of a navigation application and an image access service. The idea of combining existing services is currently put forward in the project ORBA, which has been initiated by the federal state of Saarland as an accompanying measure to the regional set-up of the UMTS network, with the purpose of stimulating the development of UMTS services by local companies. The ORBA project aims at an enrichment of a classical shopping visit in the city. This goal is motivated by the observation that traditional forms of business, such as real shopping tours in the city, are unlikely to be replaced by eCommerce services in the near and mid-term future, but could be enhanced in many ways by 3G devices and wireless net infrastructure. In contrast to the aforementioned virtual museum tour that was designed for 2.5G WAP phones, the upcoming new generation of devices will allow for multitasking to a certain extent. The practical value of this feature for interlocking applications can be nicely demonstrated on a pocket computer or a PDA. In the context of the ORBA project, we are currently developing a shopping companion that is composed of an interactive map application and a web browser. As shown in Figure 9, both applications run concurrently and appear side by side on the screen. An interesting feature of the shopping companion is that both applications communicate with each other to enable a flexible assignment of control and a mutual adaptation of information content and presentation style between the two applications. For instance, if the user selects a certain location on the map display, the selection can cause the web-browser application to perform a search for more information about this location. Note that in this context the word “home page” may take on a literal meaning.
Figure 9: A navigation tool combined with a web browser running on a Nokia 9210 (left) or likewise on a Sharp PDA (right). Activities in one application can influence content and presentation style in the other application.
Vice versa, if the user initiates a content-oriented search for a certain product or shop, the web browser not only provides a list of search results but also provides the map application with the list of selected target locations so that they can be shown and highlighted in the map display. A further example relates to the focus and level of detail of the displayed information in both applications. In the map application the user can perform either a zoom-in or a zoom-out operation in order to shift focus and increase or decrease the level of detail. For instance, if the highest level of detail is chosen, the user will be able to tell apart single buildings; on the other hand, only a small region can be shown. When zooming out, fewer details are recognizable but a larger region comes into focus. Interestingly, it is possible to mirror a rough counterpart of zoom-in and zoom-out operations in the web browser application. That is, if the map is shown at a low level of detail and thus a larger region is in focus, the web browser will provide a simple list of all business units that are located in the displayed map area. Zooming into the region not only updates the map display but also causes the web browser to switch (at a certain zoom level) to a more focused display of the locations that are now in focus. In our current implementation this intermediate display takes the form of a list of entries as found in the yellow pages of a phone book. Zooming further into the map will eventually cause the web browser to display the homepage of the shop or building in focus. In addition, movements of the user can serve as input that controls both applications. When in “tracking mode” the user’s position will be continuously updated on the map display. Therefore, walking along a path will change the focus of the map display. Again, this information can be used to change the focus of the information display in the web browser as well. For instance, suppose the user walks along a shopping street that is displayed in the map view together with a blinker that indicates the user’s current position. Aligned with this, the web browser display shows a two-column table where each column lists entries for shops that are located on the left-hand and right-hand sides of the street, respectively (assuming the same orientation as the map display). Since not all entries for the shops in this street fit on a single screen, only shops in the current vicinity of the user are displayed. When moving along the street, the map display updates itself to show the user’s new position, whereas the web browser updates its display list as if a scroll operation had been performed. Since this is work in progress, it is as yet an open question to what extent town visitors will actually take advantage of such a Shopping Companion and its suggested operation modes. While our developments demonstrate the technical feasibility of the service ideas, there are more factors that will eventually have an impact on user appreciation. For instance, the examples illustrated above presuppose an extensive and up-to-date content repository that maintains interesting information about the business units and sightseeing attractions located in the shopping region under consideration. And, as taught by the WAP experience, the pricing policies of the service providers will play a decisive role, too.
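A minimal sketch of this map-browser coupling follows (Python). The zoom thresholds, shop records and view methods are invented, and the real Shopping Companion runs the two views as separate, communicating applications on the handheld rather than as objects in one process.

```python
# Minimal sketch: a map view and a browser view that influence each other. Changing
# the map focus (by zooming or by walking in tracking mode) re-selects what the
# browser shows: overview list, yellow-pages style directory, or a single homepage.
from dataclasses import dataclass

@dataclass
class Shop:
    name: str
    x: float
    y: float
    homepage: str

class BrowserView:
    def show_overview(self, shops):      # low zoom: plain list of business units
        print("All shops in view:", ", ".join(s.name for s in shops))
    def show_directory(self, shops):     # medium zoom: yellow-pages style entries
        for s in shops:
            print(f"{s.name} - {s.homepage}")
    def show_homepage(self, shop):       # high zoom: single shop in focus
        print("Loading", shop.homepage)

class MapView:
    def __init__(self, shops, browser):
        self.shops, self.browser, self.zoom = shops, browser, 1
    def visible(self, cx, cy, radius):
        return [s for s in self.shops
                if abs(s.x - cx) <= radius and abs(s.y - cy) <= radius]
    def set_focus(self, cx, cy, zoom):
        # Called on zoom operations and on position updates in tracking mode.
        self.zoom = zoom
        shops = self.visible(cx, cy, radius=100 / zoom)
        if zoom <= 2:
            self.browser.show_overview(shops)
        elif zoom <= 5 or len(shops) > 1:
            self.browser.show_directory(shops)
        elif shops:
            self.browser.show_homepage(shops[0])

# shops = [Shop("Bookshop", 10, 12, "http://example.org/bookshop")]
# MapView(shops, BrowserView()).set_focus(cx=10, cy=12, zoom=6)
```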
5. MULTI-USER COLLABORATIONS WITH HETEROGENEOUS DEVICES
While the examples in the preceding sections aimed at single users, we now turn to the area of mobile multi-user applications. After all, most mobile appliances are telecommunication devices and therefore satisfy the multi-user criterion per se. However, the combination of portable computers with telecommunication capabilities goes well beyond the classical view of audio-based user-to-user telecommunication, and at the same time it raises a number of new questions. For instance, how will groups composed of mobile and possibly stationary users communicate with each other using multiple media to express themselves? How will people use combinations of text and speech if both forms are available? How will the style of communication change if people use communication terminals that are not equally rooted in terms of communication media and bandwidth? In the case of a user who missed a previous meeting, could it be useful/possible to provide a structured view of previous exchanges of text and audio contributions made by the participants? In the context of the EU-funded project Magic Lounge* (Bernsen et al. 1998, Rist et al. 2000) we were especially interested in the question of how to overcome communication barriers imposed by the use of heterogeneous communication terminals, e.g., a PC connected with mobile devices, such as a PDA and a 2.5G mobile phone (cf. Figure 10). In the following subsections we present two scenarios that have been investigated within the Magic Lounge project.
Figure 10: A group of geographically dispersed people communicate through the Magic Lounge server using heterogeneous communication devices
5.1 Text and Audio Chat Using Heterogeneous Communication Terminals
As suggested by its name, the Magic Lounge system has been designed as a virtual meeting space that aims to provide its users with a set of communication support tools. These tools include a text-based conversation tool, a multi-party audio tool, and a meeting memory tool that provides the user with various views on recorded conversations. At first glance, the text conversation tool is similar to a text chat system. However, in a conventional chat system messages sent by users are basically strings
of text characters that are not interpreted further by the system. Therefore, such systems can provide only limited support for structuring conversations and collaborations. In contrast, text messages exchanged through the Magic Lounge conversation tool are labelled with meta-information referring to the speech acts associated with the messages and to the conversation thread to which a certain message belongs. This meta-information forms essential input for the memory tool since it can be used to structure recorded messages according to different criteria and thus enables the generation of purpose-specific views on the memory entries. For instance, the views supported by the meeting memory tool comprise sort functions to list messages in a certain order, a view that displays the messages in a tree-like structure to emphasise topic boundaries in a conversation, a “musical score” view that emphasises the temporal order of exchanged messages, and finally a replay view that deploys animated conversational characters to represent conversation partners for the replay of recorded conversations. The audio tool was built on the pre-existing conferencing tool VAT, which was developed by the Network Research Group of the Information and Computing Sciences Division at Lawrence Berkeley National Laboratory (LBNL) in Berkeley. Quite similar to a modern telephone with conferencing facilities, the audio tool enables its users to send verbal messages to selected communication partners or to broadcast them to all connected partners. All audio messages can be recorded at the communication server and thus can be accessed and replayed after a meeting. A labelling of audio contributions analogous to that of the textual messages would have been desirable too, especially for the purpose of structuring the messages in a meeting record. However, after some initial testing it was concluded that such a labelling effort would disrupt the fluency of any spoken conversation and thus should be hidden from the users. As a consequence, Magic Lounge only produces a log file of all verbal messages that indicates who has sent a certain message, when it was sent, how long it was and who the recipients were.
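As an illustration of how such labelled contributions might be represented and then re-structured by the memory tool, consider the following sketch. It is not the Magic Lounge data model; the field names, speech-act labels and view functions are assumptions chosen for the example.

    # Illustrative sketch only: a possible representation of meeting-memory entries
    # and two simple views over them. Field names and labels are assumptions.
    from collections import defaultdict
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class TextMessage:
        sender: str
        recipients: List[str]
        timestamp: float            # seconds since start of the meeting
        speech_act: str             # e.g. "inform", "ask", "answer"
        thread: str                 # conversation thread chosen by the sender
        text: str

    @dataclass
    class AudioMessage:             # audio contributions are only logged, not labelled
        sender: str
        recipients: List[str]
        timestamp: float
        duration: float             # seconds

    def thread_view(messages: List[TextMessage]) -> dict:
        """Group text messages by thread, each thread in temporal order."""
        threads = defaultdict(list)
        for m in sorted(messages, key=lambda m: m.timestamp):
            threads[m.thread].append(m)
        return dict(threads)

    def score_view(messages: List[TextMessage]) -> List[TextMessage]:
        """'Musical score' view: all messages strictly in temporal order."""
        return sorted(messages, key=lambda m: m.timestamp)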
Figure 11: Magic Lounge WAP interface for a mobile phone (here a Nokia 7110).
Within the project, mobile access to Magic Lounge-mediated meetings has been based partly on WAP, which in turn employs the Wireless Markup Language (WML), and partly on a mobile telephone connection. Figure 11 illustrates an interaction sequence on the WAP browser of a mobile phone. When entering the Magic Lounge, available services, such as querying the meeting memory or writing
and receiving chat messages, are listed and can be started by selecting them (first and second snapshot of Figure 11). Memory contents can be presented as text (third snapshot of Figure 11) or via the audio channel. Also, chat messages can be submitted in textual form (fourth snapshot of Figure 11). Conducting field tests for the mobile client was not possible within the project, mainly for financial reasons. Among other things, the project was not able to purchase devices for a broader user group and to sign the required contracts with the telecom companies for the WAP service provision. Furthermore, we wanted to exploit the log files of the WAP gateway in the evaluation process. If we had used a telecom provider’s WAP gateway, these log files would not have been accessible to us. On the other hand, the research license of our WAP gateway was restricted to only three concurrent clients. Therefore, the tests were done on a smaller scale at DFKI with a group of nine subjects. The small number of test candidates does not allow us to draw conclusions of any statistical significance. However, the observations made during testing helped to identify those parts of the tasks which were most difficult to perform. To sum up, our major observations were the following.

Using the mobile Magic Lounge client seems to be a relatively easy task. We expected that inexperienced users would have severe difficulties with the handling of the mobile devices, and this turned out to be the case. However, we were indeed surprised that only little training seems necessary to drastically improve task performance concerning required time, need for assistance, and deviations from the minimal interaction sequence. In addition, the test persons experienced the exercise as being a bit laborious but relatively easy to perform. Our initial concern that users would blame the mobile client for inconveniences concerning the overall handling of the mobile device was not at all supported. On the contrary, most test candidates indicated interest in using such a system, e.g., in order to brief themselves before joining a meeting.

The current interaction design of the mobile client is not optimal. The records of the test sessions as well as the subjective responses from the test candidates revealed the most problematic parts of the current interaction structure. Unfortunately, most of the encountered problems are due to the particular mobile phone implementations. For instance, the quality of the displays will certainly improve very fast and download times will decrease as well. Another serious problem is that we are currently very much restricted in overriding the assignment of function keys. As a consequence, any interaction design for an application will be constrained by already assigned functions. This can be very confusing as it is not always obvious whether a certain option is part of the specific application or whether it is related to some other phone functionality, such as the phone’s address book. Fortunately, some of these problems will certainly become obsolete when switching to the next generation of mobile devices.

If text input is required, it must become more comfortable to use. As expected, text editing on a tiny numeric keypad requires some training and is experienced as being quite inconvenient. In addition, most test candidates experienced the completion of text editing as unnecessarily complicated.
Unfortunately, the implementations of the current WAP browsers do not allow the introduction of a shortcut so that text entries could be sent off by pressing a single button right after
editing has been completed. Though only three chat sessions were observed, it seems quite obvious that the implemented version of the WAP chat client was inconvenient to use. Among other aspects, users did not like the additional effort for providing meta-information, such as the speech act associated with a message. Users also wanted to receive a notification when new messages arrive. In the provided system they had to inspect the message list periodically to find out whether new messages had been received. Also, replying to a message was found inconvenient since, after switching to the text editing mode, users could no longer see the message to which they wanted to reply, and switching back and forth between the reading mode and the editing mode was found too awkward.

5.2 Using Heterogeneous Graphical Representations in Collaboration Tasks
With a special interest in graphical information presentation, we also conducted a case study in which we used a simulation environment to investigate how a small group of geographically dispersed users can jointly solve localization and route planning tasks while being equipped with different communication terminals. For the purpose of illustration, consider the situation in which a group of three users U1, U2, and U3 meet in the Magic Lounge to discuss the details of an impending trip to downtown Saarbrücken. U1 accesses the virtual meeting space via his PC, U2 via a PDA, and U3 via a cell phone with a tiny LCD display. To clarify how to get from a certain location to another, the participants want to consult a map. U1 is in an advantageous position as his PC can easily display even highly colored and detailed maps. As far as usability is concerned, it doesn’t make much sense to output complex graphics on a small PDA screen. Therefore, U2 receives a more abstract graphical presentation, in this case a network-style map that essentially encodes topological information about the objects in focus. The phone user U3 is certainly in the weakest position as there is only a 90×60 pixel display available on his phone for graphics output. U3 receives on his display a simplified graphical representation of the street which is in the current focus of interest. Other details, such as the street name or attributes which cannot be shown, are provided verbally using a text-to-speech synthesizer on the server side. Our approach to providing the users with appropriate information displays is to generate presentation variants (i.e. different views) from a shared data representation of the information to be presented (Rist & Hüther 1999). We use a data model in which geographic information is described in hierarchically ordered layers. Of course the details of this hierarchical representation depend on the data which have to be represented. We assume that this representation will be located on a centralized server which is accessible by all participants. The data hierarchy for the “shared map” application is shown in the center of Figure 12. The uppermost level of this hierarchy is just a list of meaningful domain objects, such as the names of streets, buildings, places, bridges and the like. In contrast, the lowest level of the hierarchy comprises a detailed bitmap, but also all the information of the superior layers. The intermediate layers represent abstractions relative to the information of the layers below. In the map example the intermediate
layers correspond to representations that successively abstract from geometric aspects such as spatial extent, shape, and distance between objects. In addition to the generation of different views on geographical data, the data hierarchy also plays a major role in mirroring object selections and markings from a source view to other target views. Figure 12 illustrates a scenario where a certain path to a particular location has been marked on the PDA display. The marked path is then mirrored on the PC’s map display as well as on the display of a cell phone (CP). The mapping from the PDA to the PC involves the following steps: First, the system determines the interaction type that is performed on the source view. In the example of Figure 12 the user has performed a Mark-action on his PDA. Next, the system determines the object(s) which are affected by the interaction (in the example, the graphical representation of the path leading to LOC#1 has been marked). The system also determines which action on the PC side corresponds best to the action that has been executed on the PDA display. In the example shown in Figure 12, the appropriate action is named Mark_PC. Finally, the system must determine the graphical representation of LOC#1 on the PC display and apply the action Mark_PC to this representation. The mapping from the PDA (or the PC) display to that of the cellular phone is more challenging, especially if complex objects are marked, such as a whole path. As already illustrated by Figure 7, one possible approach is to transform the spatial display of the marking into a sequence of phone displays, each showing a small portion of the marked path. To construct such a view for the phone display, the server needs to determine the involved path segments and to generate a sequence of displays. Using the scroll buttons on the phone, the user can navigate back and forth through the sequence of single images. While this approach does not allow the user to grasp the marked path at a glance, it may still help the user to build up a mental representation of the path (e.g. how many turns to make, order of turns etc.). Vice versa, a mapping from a marked/selected item on the phone display does not cause severe problems apart from having to provide a convenient means that enables the user to make such selections. To a large extent this is a device-specific issue. However, a fall-back solution is to assign numbers to the selectable items shown on the phone display and use the phone’s keypad to perform selections. To what extent collaborations over heterogeneous communication channels are useful and usable is an issue that is investigated best by means of concrete application scenarios. For the purpose of such a study we have chosen the domain of collaborative travel planning and implemented a demonstrator system. This demonstrator relied on simulations of the user interfaces for both the PDA and the mobile phone. For example, the phone interface was represented in the simulation through a realistic image. The user could press the depicted number and control buttons. Vice versa, information was presented through combinations of spoken messages, written text and minimalist graphics which were displayed in an area that simulates the tiny LCD display of a real phone. The device simulations were run on separate but networked PCs. In an informal evaluation conducted with colleagues and students at DFKI, different settings have been considered:
– collaboration between three PC users;
– collaboration between PC, PDA and mobile phone;
– collaboration between PC and a mobile user with a NOKIA 7110 WAP phone.

In all settings the participating users had to solve some concrete localization and route description tasks, such as showing the others where a particular object is located, or how to get from one place to another. The trial runs were made with and without audio/text communication available. Based on our observations, we got a clearer idea of the sort of appropriate information displays and the type of collaboration modes which are likely to support the users in solving their tasks. For example, it turned out that collaborations without any verbal communication channel do not work at all. Some verbal communication (audio or at least text-based) is necessary in order to clarify the purpose of the graphical representations and the markings in relation to the given task and the proposed steps to solve it. On the other hand, the availability of graphical representations – though being different from each other – helps to facilitate collaborations on localization and route planning tasks.
Figure 12: An object marking on the PDA display is mirrored on the PC display and on the Cell Phone display. References to objects are resolved via the hierarchical data representation at the centralized server component.
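The following sketch illustrates the mirroring mechanism described above in simplified form. It is not the actual implementation: class and action names are invented for the example, and the hierarchical data representation is reduced to a shared object identifier that each device view resolves to its own graphical representation.

    # Minimal sketch (assumed names, not the project code): mirroring a marking
    # from a source view to target views via a shared, device-independent
    # object identifier resolved by each device view in its own way.

    class DeviceView:
        def __init__(self, device):
            self.device = device              # "PC", "PDA" or "CP"
            self.marked = set()

        def mark(self, object_id):
            """Device-specific realisation of a Mark action (Mark_PC, Mark_PDA, ...)."""
            self.marked.add(object_id)
            print(f"{self.device}: mark local representation of {object_id}")

    class SharedMapServer:
        def __init__(self, views):
            self.views = views

        def mirror_mark(self, source_view, object_id):
            """Propagate the marked object reference to all other device views."""
            for view in self.views:
                if view is not source_view:
                    view.mark(object_id)

    # Example: marking the path to LOC#1 on the PDA is mirrored on PC and phone.
    pc, pda, phone = DeviceView("PC"), DeviceView("PDA"), DeviceView("CP")
    server = SharedMapServer([pc, pda, phone])
    pda.mark("path_to_LOC#1")
    server.mirror_mark(pda, "path_to_LOC#1")

For the cell phone, the marked object would additionally be translated into a sequence of small displays, as described above, rather than into a single marked graphic.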
Another observation concerns the way in which users exchange markings with each other. It seems that one should prefer a collaboration principle which leaves it to the individual users to decide when and from where they want to “import” other views. Also, it seems not advisable to import markings from several other users at the same time since this can result in confusing displays.
6. CONCLUSIONS

In this paper, we presented a number of different scenarios to motivate the need for adaptive information presentation that meets the requirements of mobile users. So far, our focus has been on requirements that are due to the limited capabilities of mobile devices for graphical information display. We presented several different adaptation strategies, including image transcoding as well as graphics translation and re-generation from a semantic representation. While new generations of mobile devices will overcome many of the current performance limitations, other restrictions, such as reduced screen real estate, will remain since they are naturally constrained by the overall size of a mobile device. Diversity in devices is also likely to increase further and thus will call for smart adaptation mechanisms. Also, to fully exploit the new technical features of upcoming devices, new research questions arise as well. For instance, new devices will enable multimodal interaction and multimedia information presentation including text, audio/speech, graphics, animation, video, and active content (such as MIDlets). However, interaction design and presentation composition have to be tailored to the target device, the user’s information needs and interaction preferences as well as to the particular mobile usage context.

7. ACKNOWLEDGEMENTS

The work presented in this paper comprises valuable contributions from colleagues and students who were involved in the corresponding projects. Special thanks to Patrick Brandmeier, who did most of the development work on the image transformation system and the Shopping Companion, and who was responsible for setting up the WAP infrastructure for the Magic Lounge project. Thanks also to Alexander Kröner for his work on the constraint-based Layout Manager used in the MyArt portal, Uwe Sengewald for his work on several map visualization components for tiny displays, and Elisabeth André and a reviewer for comments and corrections on a draft version of this paper.

8. NOTES

1 WAP (Wireless Application Protocol), see http://www.wapforum.org
2 HTML (HyperText Markup Language), see http://www.w3.org/MarkUp/
3 WML (Wireless Markup Language), see http://www.wapforum.org/DTD/wml_1.1.xml
4 HSCSD (High Speed Circuit Switched Data), see http://www.gsmworld.com/technology/hscsd/
5 GPRS (General Packet Radio Service), see http://www.gsmworld.com/technology/gprs/
6 UMTS (Universal Mobile Telecommunications System)
7 See http://wireless.java.sun.com
8 See http://www.fontus.com/
9 EU project IMAGEN IST-1999-13123. IMAGEN project partners are: Bar-Ilan University, Ramat Gan, Israel; CBE Group International, Amsterdam, Netherlands; DFKI, Saarbrücken, Germany (coordination); GIUNTI Interactive Labs, Genoa, Italy; RADICAL MEDIA, Lisbon, Portugal; Vrije Universiteit Brussel - STAR Lab, Brussels, Belgium.
10 The eXact Packager® produces assessments (IMS QTI) and multi-device style sheets, and is interoperable with any third-party Learning Management System (LMS) or information system conforming to international standards. See http://www.learnexact.com/.
11 XSLT (XSL Transformations), see http://www.w3.org/TR/xslt
12 The Magic Lounge project (1997-2000) was funded under the Future Emerging Technologies (FET) pro-active initiative i3. Project partners were DFKI, Saarbrücken, Germany (coordination); NIS, Odense University, Denmark; LIMSI-CNRS, Paris, France; Siemens AG, München, Germany; and The Danish Isles - User Community, Denmark. For details on mobile HTML derivatives see http://www.w3.org/Mobile/Activity.
13 Pic2WBMP stands for a conversion that assumes a source image in GIF or JPEG format and produces as output a black and white image in the so-called “wireless bitmap” format. The Ginco graphics converter was made available as shareware by Ginco New Media (www.gingco.de).
9. REFERENCES

André, E. and Rist, T. (2001) Controlling the Behavior of Animated Presentation Agents in the Interface: Scripting vs. Instructing. AI Magazine 22(4), pp. 53-66.
Agrawala, M. and Stolte, C. (2001) Rendering Effective Route Maps: Improving Usability Through Generalization. In Proc. of ACM SIGGRAPH 2001, pp. 241-250.
Bentley, R., Rodden, T., Sawyer, P., and Sommerville, I. (1994) Architectural support for cooperative multiuser interfaces. In IEEE Computer, 37-46.
Bernsen, N.O., Rist, T., Martin, J.C., Hauck, C., Boullier, D., Briffault, X., Dybkjaer, L., Henry, C., Masoodian, M., Néel, F., Profitlich, H.J., André, E., Schweitzer, J., Vapillon, J. (1998) Magic Lounge: A Thematic Inhabited Information Space with "Intelligent" Communication Services. In: La Lettre de l'Intelligence Artificielle following the International Conference on Complex Systems, Intelligent Systems, & Interfaces (NIMES'98).
Billsus, D., Pazzani, M.J., and Chen, J. (2000) A learning agent for wireless news access. Intelligent User Interfaces, pp. 33-36.
Bohnenberger, T., Jameson, A., Krüger, A., and Butz, A. (2002) Location-Aware Shopping Assistance: Evaluation of a Decision-Theoretic Approach. In: Mobile Human-Computer Interaction. LNCS Vol. 2411, Springer Verlag.
Brewer, E., Katz, R., Amir, E., Balakrishnan, H., Chawathe, Y., Fox, A., Gribble, S., Hodes, T., Nguyen, G., Padmanabhan, V., Stemm, M., Seshan, S., and Henderson, T. (1998) A Network Architecture for Heterogeneous Mobile Computing. In: IEEE Personal Communications Magazine 5(5), pp. 8-24.
Britton, K.H., Case, R., Citron, A., Floyd, R., Li, Y., Seekamp, C., Topol, B., and Tracey, K. (2001) Transcoding: Extending e-Business to New Environments. In IBM Systems Journal, Vol. 40, No. 1.
Chincholle, D., Goldstein, M., Nyberg, M., and Eriksson, M. (2002) Lost or Found? A Usability Evaluation of a Mobile Navigation and Location-Based Service. In: Mobile Human-Computer Interaction, LNCS Vol. 2411, Springer Verlag.
Endres, C., Meyer, M., and Wahlster, W. (1999) Personal Picture Finder: Ein Internet-Agent zur wissensbasierten Suche nach Personenphotos, Online'99 (in German). Service temporarily available: http://finder.dfki.de:7000/
Fox, A., Goldberg, I., Gribble, S.D., Lee, D.C., Polito, A., and Brewer, E.A. (1998) Experience With Top Gun Wingman: A Proxy-Based Graphical Web Browser for the 3Com PalmPilot. In Proc. of Middleware '98.
Gaedke, M., Beigl, M., Gellersen, H.-W., and Segor, C. (1998) Web Content Delivery to Heterogeneous Mobile Platforms. In: Advances in Database Technologies. LNCS Vol. 1552, Springer Verlag.
Gonzalez-Castaño, F.J., Rifón, L.A., and Costa-Montenegro, E. (2002) A New Transcoding Technique for PDA Browsers Based on Content Hierarchy. In: Human-Computer Interaction, LNCS Vol. 2411, pp. 69-80.
Graham, T.C.N. and Urnes, T. (1992) Relational views as a model for automatic distributed implementation of multi-user applications. In Proc. of CSCW'92, pp. 59-66.
Greenberg, S., Boyle, M., and LaBerge, J. (1999) PDAs and Shared Public Displays: Making Personal Information Public, and Public Information Personal. In: Personal Technologies 3(1&2).
Han, R., Bhagwat, P., LaMaire, R., Mummert, T., Perret, V., and Rubas, J. (1998) Dynamic adaptation in an Image Transcoding Proxy for Mobile Web Browsing. In: IEEE Personal Communications, pp. 8-17.
Hill, R.D. (1992) The Abstraction-Link View Paradigm: Using Constraints to Connect User Interfaces to Applications. In Proc. of CHI '92, pp. 335-342.
Hori, M., Kondoh, G., Ono, K., Hirose, S., and Singhal, S. (2000) Annotation-Based Web Content Transcoding. In Proc. of the 9th International World Wide Web Conference (WWW9).
Kraus, S., Kröner, A., and Tsaban, L. (2002) IMAP - Intelligent Multimedia Authoring Tools for Electronic Publishing. In Proc. of the Second International Conference on Adaptive Hypermedia and Adaptive Web-Based Systems AH 2002, pp. 527-530.
Kray, C., Laakso, K., Elting, C., and Coors, V. (2003) Presenting Route Instructions on Mobile Devices. In Proc. of IUI'03, Miami, FL.
Kröner, A., Bandmeier, P., and Rist, T. (2002) Managing Layout Constraints in a Platform for Customized Multimedia Content Packaging. In Proc. of the Working Conference on Advanced Visual Interfaces AVI 2002, pp. 89-93.
Kylänpää, M. and Heino, I. (2000) Personalized Transcoding Proxy - an Approach to Mobile Web Access. In Proc. of the 9th International World Wide Web Conference (WWW9). Article available under: http://www9.org/final-posters/poster10.html
Marucci, L. and Paternò, F. (2002) Supporting Adaptivity to Heterogeneous Platforms through User Models. In: Mobile Human-Computer Interaction. LNCS Vol. 2411, Springer Verlag.
Mohan, R., Smith, J., and Li, C.-S. (1999) Adapting Multimedia Internet Content For Universal Access. In: IEEE Transactions on Multimedia, pp. 104-114.
Myers, B.A., Stiel, H., and Gargiulo, R. (1998) Collaboration Using Multiple PDAs Connected to a PC. In Proc. of CSCW'98, pp. 285-295.
Noble, B. (2000) System Support for Mobile, Adaptive Applications. In: IEEE Personal Communications Vol. 7, No. 1.
Not, E., Sarini, M., Stock, O., Strapparava, C., Zancanaro, M. (1998) Information adaptation for physical hypernavigation. In Proc. of the i3 Annual Conference, Nyborg.
Rao, H., Chen, Y., Chang, D., and Chen, M. (2001) iMobile: A Proxy-based Platform for Mobile Services. In Proc. of the First ACM Workshop on Wireless Mobile Internet.
Reynolds, F., Hjelm, J., Dawkins, S., and Singhal, S. (1999) Composite Capability/Preference Profiles (CC/PP): A user side framework for content negotiation. W3C Note. http://www.w3.org/TR/NOTE-CCPP/
Rist, T. and Hüther, M. (1999) Adaptive Generation of Graphical Information Presentations in a Heterogeneous Telecooperation Environment. In Proc. of HCI'99 Vol. 1, pp. 331-335.
Rist, T. and the Magic Lounge Team (2000) Magic Lounge: A virtual communication space with intelligent communication services. In Proc. of the i3 Annual Conference, Jönköping, Sweden.
Rist, T. and Bandmeier, P. (2002) Customising Graphics for Tiny Displays of Mobile Devices. In: Personal and Ubiquitous Computing, No. 6, pp. 260-268.
Roseman, M. and Greenberg, S. (1996) Building Real-Time Groupware with GroupKit, A Groupware Toolkit. ACM Trans. on Human-Computer Interaction, pp. 66-106.
Stefik, M., Bobrow, D.G., Foster, G., Lanning, S., and Tatar, D. (1987) WYSIWIS Revisited: Early Experiences With Multiuser Interfaces. ACM Trans. on Office Information Systems 5(2), 147-167.
Witten, I.H. and Frank, E. (2000) Data mining: Practical machine learning tools and techniques with Java implementations. Morgan Kaufmann, San Francisco.
10. AFFILIATIONS

T. Rist
DFKI, Saarbruecken, Germany
NATURAL LANGUAGE GENERATION

Natural Language Generation is concerned with the automatic production of texts. For many years this branch of natural language processing attracted much less interest than language understanding, and only in the Eighties did a real community of researchers form and different schools appear. Normally the field is described as investigating communicative goals, the dynamic choice of what to say, the planning of the overall rhetorical structure of the text (sometimes called strategic planning), the actual realization of sentences on the basis of grammar and lexicon (sometimes called tactical planning), and so on. Among the different approaches, a classical partition distinguishes a class of approaches based on schemata or templates from a class of approaches that combine finer-grained rhetorical elements. Hybrid approaches of various sorts have also been developed and, in recent years, following a trend that is prevailing in language analysis, statistical approaches have been attempted as well. A very important aspect is concerned with the integration between language and other communication media. Many groups are now active in one way or another in this direction.

The cluster on Natural Language Generation presents four contributions. First, the chapter by André and colleagues describes a system called AutoBriefer which automatically generates multimedia briefings from high-level outlines. The author of the briefing interacts with the system to specify some of the briefing content. The system uses declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats. The narration employs synthesized audio as well as, optionally, an agent embodying the narrator. Calder and colleagues address a similar issue in their chapter. They introduce the MPIRO project, which targets the concept of personalized information objects. MPIRO's technology allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database and existing free-text descriptions. The resulting descriptions, produced in three different languages (English, Greek, and Italian), are tailored according to the user's interests, background knowledge, and language skills. Carenini and Conati describe a framework that helps students learn from examples by generating example problem solutions whose level of detail is tailored to the students' domain knowledge. When presenting a new example to a student, the framework uses natural language generation techniques and a probabilistic student model to selectively introduce gaps in the example solution, so that the student can practice applying rules learned from previous examples in problem-solving episodes of difficulty adequate to her knowledge. Filling in solution gaps is part of the metacognitive skill known as self-explanation (generating explanations to oneself to clarify an example solution), which is crucial to effectively learn from examples. Finally, Piwek, Power, Scott and van Deemter explore NLG applications in which extra media are brought into play, from the point of view of planning the layout of the document. The main point of the chapter is that the extra media are not simply added to plain text, but integrated with it: thus the use of formatting, or pictures, or dialogue, may require radical rewording of the text itself.
E. ANDRÉ, K. CONCEPCION, I. MANI AND L. VAN GUILDER
AUTOBRIEFER: A SYSTEM FOR AUTHORING NARRATED BRIEFINGS
1. INTRODUCTION

Document production is an important function in many organizations. In addition to products such as technical papers, instruction manuals, system documentation, courseware, etc., briefings are a very common type of multimedia document product. Since so much time is spent by so many people in producing briefings, often under serious time constraints, any method to reduce the amount of time spent on briefing production could yield great gains in productivity.

Our work forms part of a larger DARPA-funded project aimed at improving analysis and decision-making in crisis situations by providing tools that allow analysts to collaborate to develop structured arguments in support of particular conclusions, and to help predict likely future scenarios. These arguments, along with background evidence, are packaged together as briefings to high-level decision-makers. In leveraging automatic methods to generate briefings, our approach needs to allow the analyst to take on as much of the briefing authoring as she wants to (e.g., it may take time for her to adapt to or trust the machine, or she may want the machine to present just part of the briefing). The analyst's organisation usually will instantiate one of several templates dictating the high-level structure of a briefing; for example, a briefing may always have to begin with an executive summary. The methods also need to be relatively easily adapted to different domains, given that the subject matter of crises is somewhat unpredictable.

These task requirements force us to focus on a flexible approach to briefing generation. The system must allow the user to take control as much as she wants in the specification of the content and format, as well as in editing the final product. It must also be able to help the user as needed by exploiting any redundancy in the briefing content, including stereotypical structure (dictated in part by the business rules of the organization) as well as material from previous briefings, databases, etc. In both these cases, the system should be able to populate an outline of the briefing. In cases where the user wants to be freed from specifying the low-level details of the animation, narration, and multimedia synchronization of the briefing, the system should be able to plan and synthesize the actual presentation, which the user can then post-edit.

As an example, consider a user putting together a briefing about a particular individual, say Victor Polay, a leader of Peru's Tupac Amaru movement. A person preparing this briefing by hand (using, say, Microsoft PowerPoint) would need to
plan the outline, and assemble the various textual and multimedia elements that constitute the core content. This latter activity could involve searching, doing background research, collecting existing material, drafting new text, etc. The person must also design or select a template for the overall briefing, as well as the layout of individual slides, and then produce each slide individually, realizing the content of each slide by embedding or linking the media elements on it, specifying the animation strategies, and then rehearsing and refining the briefing. Finally, the person has to present the briefing in front of the intended audience(s).

In this paper, we report on a system called AutoBriefer where the author of the briefing interacts with the system via a graphical user interface (GUI) to specify some of the briefing content. As before, the user can select or create an outline. However, the user doesn't have to produce each slide, realizing the media elements on each. Instead, the AutoBriefer GUI allows the user to select and associate the media elements directly with the outline. In doing so, the user may specify the temporal layout of the content. Moreover, instead of using pre-existing media elements, the GUI also allows the user to direct personalized information agents to assemble media elements from existing (possibly distributed) databases or text collections. The user can then leave the rest of the briefing preparation to the system, by asking AutoBriefer to generate a PowerPoint briefing. In doing so, the user allows the system to plan how many slides are needed, and to plan and realize the media elements on each slide, synchronizing them spatially and temporally. Further, the user doesn't need to present the briefing; the system itself can generate a spoken narrative for the briefing, synchronizing the narrative elements it introduces with the other media elements, producing an animated briefing as a result.

One challenge in automating this process involves automatically selecting and assembling media elements; here AutoBriefer exploits new technologies for automated summarization. In order to generate a narrative automatically, the system needs knowledge about the domain objects; here, AutoBriefer provides scalability across domains by allowing the user to specify or edit domain knowledge in the form of meta-information attributes associated with media elements. This ability to specify and edit domain knowledge in turn addresses an important bottleneck in current multimedia presentation strategies, which are difficult to adapt across domains. AutoBriefer leverages the ability to edit domain knowledge along with declarative presentation planning strategies to synthesize a narrated multimedia briefing in various presentation formats.

2. APPROACH

The design of AutoBriefer is different from most multimedia presentation planners in the way the problem is decomposed. An easy-to-use GUI is used to specify high-level structural and temporal constraints for arranging domain objects in a presentation. Information associated with these domain objects is used to generate a narrative outline. This narrative outline is then fed as data along with a high-level communicative goal to a presentation planning component (PPP). The planner chooses a sequence of plan operators to realize the goals associated with the
outline. The plan operators constitute a declarative representation of presentation strategies to decide on narrator animation, and spatial and temporal synchronization. The system architecture is shown in Figure 1. The user interacts with the system via a GUI using a simple point-and-click interface. She may load in a domain model of existing domain objects, for possible editing, or she may create new domain objects using information agents.
Figure 1. System Architecture (components: user, Briefing GUI, information agents, domain knowledge, XML narrative outline, XML2Prop converter, propositional narrative outline, Briefing Generator with Narrative Planner and Layout Generator, SMIL or PowerPoint output, multimedia player or PowerPoint user interface)
As shown in Figure 2, the user arranges the objects along two dimensions. The first, structural, dimension allows objects to be nested under others to create a hierarchical structure. This is shown in the left frame, which uses a directory browser interaction metaphor. The user here has chosen to specify the goal of retrieving an existing media object, namely a picture of “Polay”. The user has also specified goals for the creation of a summary of events and a biography; the execution of these goals, of type ‘create’, involves the use of information agents. Finally, the user has also included a particular audio file as a coda to end the briefing. The second, temporal, dimension allows objects to be ordered temporally. This is shown in the right frame: objects from the left frame can be dropped into numbered temporal slots in the right frame (the system lays them out, by default, based on the outline). Each object can be edited by filling in meta-information in the form of values for slots (Figure 3). Here the user has typed in a caption, and has required that the system generate the running text. Once a subset of objects in the domain has been selected and arranged appropriately, the user may choose a style sheet for the current briefing, and then ask the system to create the briefing, in either PowerPoint (preferred, since it has a wide
user base) or SMIL (Synchronized Multimedia Integration Language), a W3C-developed extension (SMIL 2002) of HTML that can be played by standard multimedia players, such as Real (REAL 2002) and Grins (GRINS 2002).
Figure 2. Briefing GUI
The narrative outline is represented in XML as a tree whose structure represents the rhetorical structure of the briefing. An example is shown in Figure 4, which was generated automatically as a result of GUI interaction. (Here, for readability, we show a schematic representation of the narrative outline rather than the raw XML.) Each node has a label, which offers a brief textual description of the node. Each leaf node has an associated goal, which, when realized, provides content for that node. Content-level goals are of two kinds: retrieve goals, which retrieve pre-existing media objects of a particular type (text, audio, image, video) satisfying some description, and create goals, which create new media objects of these types using personalized information agents for search, summarization, etc. In addition to content-level goals, the narrative outline contains narrative-level goals, which serve to introduce descriptions of content at other nodes, including segues to content nodes, and running text and captions describing media objects at content nodes. Ordering relations reflecting temporal and spatial layout are defined on nodes in the tree. Two coarse-grained relations, seq for precedence and par for simultaneity, are used to specify a temporal ordering on the nodes in the tree.
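The following sketch approximates this structure as a small data model. It is meant only to make the description above concrete; it is not AutoBriefer's actual XML schema, and the attribute names are assumptions.

    # Schematic sketch of the narrative outline described above (not the project's
    # actual XML schema): each node carries a label, leaves carry a retrieve or
    # create goal, and seq/par relations order the nodes temporally.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class OutlineNode:
        label: str                                   # short textual description
        goal: Optional[str] = None                   # e.g. 'retrieve("...")' or 'create("...")'
        children: List["OutlineNode"] = field(default_factory=list)

    @dataclass
    class Ordering:
        relation: str                                # "seq" or "par"
        members: List[object]                        # node labels or nested Orderings

    # A fragment corresponding to the biography branch of Figure 4:
    photo = OutlineNode("A File Photo of Victor Polay",
                        goal='retrieve("D:/rawdata/polay.jpg")')
    profile = OutlineNode("Profile of Victor Polay",
                          goal='create("summarize -bio -length 350 ...")')
    biographies = OutlineNode("Biographies",
                              children=[OutlineNode("Profile of Victor Polay",
                                                    children=[photo, profile])])
    layout = Ordering("seq", [Ordering("par", [photo.label, profile.label])])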
Figure 3. Editing an Object
As shown in Figure 1, once the user of the GUI asks the system to create the briefing, the GUI passes off the narrative outline to a converter which first applies an XML parser to the outline to check for syntactic correctness. The validated outline is then transformed into propositions that can be processed by the Narrative Planner and the Layout Generator. The Narrative Planner takes the input narrative outline and works out a coherent presentation by realizing the corresponding retrieve and create goals. In particular, it fills out some content by invoking specific components for the acquisition and creation of multimedia material. It also makes decisions when needed about output media type. To improve the coherence of the briefing, the narrative planner may introduce narrative-level goals which refer to meta information on the content of a briefing. In the case of create goals, meta information may be easily provided by the narrative planner. For retrieve goals, this information is usually supplied by the user. In either case, narrative elements are generated from the meta-information by using shallow text generation methods (canned text). Finally, the Layout Generator generates the temporal and spatial layout of the presentation. Both the narrative planner and the layout generator are based on the PPP multimedia presentation planning tool (André and Rist 1996). The PPP tool combines two components: an HTN-planner that realizes the Narrative Planner and a component for spatial and temporal reasoning that is used in the Layout Generator. While in the original approach of André and Rist a presentation is planned on the basis of a domain knowledge base, the starting point for the approach presented here
is the narrative outline as provided by the user, which is then refined and synchronized.

Peru Action Brief
1 Preamble
2 Situation Assessment
  2.1 Chronology of Events
    2.1.2 Latest Document Summary
      create(“summarize-generic compression .10/peru/p32.txt”)
  2.2 Biographies
    2.2.1 Profile of Victor Polay
      2.2.1.1 A File Photo of Victor Polay
        retrieve(“D:\rawdata/polay.jpg”)
      2.2.1.2 Profile of Victor Polay
        create(“summarize –bio –length 350 –span multi –person Victor Polay -out table/peru/”*”)
3 Coda
  “This briefing has assessed aspects of the situation in Peru. Overall, crisis appears to be worsening.”
<seq>
  <par>1
  <par>2.1.2
  <par>2.2.1.1 2.2.1.2
  <par>3

Figure 4. Narrative Outline
3. INTELLIGENT MULTIMEDIA PRESENTATION PLANNING

The author of a briefing may choose to flesh out as much of the narrative as desired. For instance, she may exactly indicate how a certain image should be labeled, but is not required to do so. Further, she may specify certain narrative goals, though she is not required to do so; she may also leave it up to the Narrative Planner component to introduce all the narrative goals. In case domain objects should appear in a certain order in the final presentation, she may also provide the desired temporal ordering relations. Note, however, that the complete temporal layout need not be specified by the user. In addition to the narrative outline, the user may input a number of presentation constraints, such as the degree of detail, the style of presentation (e.g. agent-based versus audio presentation) or the output format, e.g.,
SMIL or PowerPoint. The task of the presentation planner is then to work out a presentation based on the user's input. This includes the realization of the create, retrieve and narrative goals as well as the determination of the spatial and temporal layout. The result of the presentation planning process is a document in the required format.

3.1 Basic Technology
To accomplish the tasks sketched above, we rely on our earlier work on presentation planning. The essential idea of this approach is to formalize action sequences for composing multimedia material and designing scripts for presenting this material to the user as operators of a planning system. The header (i.e., effect) of a planning operator refers to a complex communicative goal (e.g. to provide a biography of a certain person) whereas the expressions of the body of the operator indicate which acts have to be executed in order to achieve this goal (e.g. to show a picture of that person). The input goal to the presentation planner is a complex presentation goal, e.g. to provide a briefing. To accomplish this goal, the planner looks for operators whose headers subsume it. If such an operator is found, all expressions in the body of the operator will be set up as new subgoals. During the decomposition process, a presentation is built up by collecting and expanding the nodes of the user's narrative outline. To accomplish this task, we start from an empty output file that is continuously augmented by SMIL expressions or expressions in an XML-based markup language that serves as an intermediate format for the PowerPoint slides to be generated. The planning process terminates if all subgoals have been expanded to elementary execution acts or have already been worked out by the user. Elementary execution acts include retrieve or create goals as well as commands, such as SAddString, that augment the output file by expressions in the required language.
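The decomposition process can be illustrated with a toy example. The sketch below is a deliberately simplified stand-in for the planner described above: the operator table, goal names and output expressions are invented, and real operators carry constraints and variables that are omitted here.

    # Toy sketch of the goal-decomposition loop (illustrative, not the PPP planner).
    # An operator maps a goal to a body of subgoals; elementary acts such as
    # SAddString directly append an expression to the output document.
    operators = {
        "ProvideBriefing": ["PresentPreamble", "PresentSituation", "PresentCoda"],
        "PresentPreamble": [("SAddString", "<segue>Welcome to this briefing.</segue>")],
        "PresentSituation": [("SAddString", "<section>Situation assessment</section>")],
        "PresentCoda": [("SAddString", "<segue>This concludes the briefing.</segue>")],
    }

    def expand(goal, output):
        """Expand a goal depth-first until only elementary acts remain."""
        if isinstance(goal, tuple):              # elementary act: augment the output file
            act, expression = goal
            if act == "SAddString":
                output.append(expression)
            return
        for subgoal in operators[goal]:          # decompose via a matching operator
            expand(subgoal, output)

    document = []
    expand("ProvideBriefing", document)
    print("\n".join(document))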
3.2 Narrative Planning

As mentioned above, the system can construct a narrative to accompany the briefing. Narrative goals are generated to cover captions, running text, and segues. The captions and running text are realized from information provided in the narrative outline. In the case of create goals, the running text and captions may be easily generated by the system, as a by-product of creating a media object. In the case of retrieve goals, the objects retrieved may not have any meta-information, in which case a default caption and running text is generated. For example, if a retrieve goal associated with a picture lacked a user-supplied caption, the Narrative Planner would use the narrative outline as the basis for constructing a caption. This flexibility allows the system to fill in what the user left unspecified. Clearly, a system's explanatory narrative will be enhanced by the availability of rich meta-information. The segues are realized by the system. For example, an object with a label slot “A biography of Victor Polay” could result in a generated segue “Here is a biography of Victor Polay”. The Narrative Planner, when providing content for
narrative nodes, uses a variety of different canned text patterns. For the above example, the pattern would be “Here is ?label”, where ?label is a variable whose value is the label of the current node. Segue goals are generated automatically by the system and integrated into the narrative outline. We always introduce a segue node at the beginning of the presentation (called a preamble node), which provides a segue covering the root of the narrative outline tree. A segue node is also produced at the end (called a coda). (Both preamble and coda can of course be specified by the author if desired.) For introducing intervening segue nodes, we use the following algorithm based on the distance between nodes and the height in the narrative outline tree. We traverse the non-narrative leaves of the tree in their temporal order, evaluating each pair of adjacent nodes A and B where A precedes B temporally. A segue is introduced between nodes A and B if either (a) the maximum of the two distances from A and B to their least common ancestor is greater than 3 nodes or (b) the sum of the two distances from A and B to the least common ancestor is greater than 4 nodes. This is less intrusive than introducing segues at random or between every pair of successive nodes, and appears to perform better than introducing a segue at each depth of the tree.

In case an animated presenter is used, the narrative planner also decides on the non-verbal communicative means to be employed. In particular, we make use of facial expressions and body gestures to convey the discourse structure of the presentation and the character's emotions, see also (Poggi et al., this volume) or (Theune et al., this volume). Since a deliberate use of communicative means requires detailed knowledge about the content of the presentation, it is hard to determine appropriate non-verbal gestures for the presentation of content associated with retrieve goals and create goals (that heavily rely on pre-authored multimedia material). The use of non-verbal gestures for the realization of narrative goals essentially depends on their type. For instance, segues that relate to a change in topic are accompanied by a shift in posture, see also (Cassell et al. 2001).

Strategies for expanding the narrative content are coded by means of plan operators. In the following, we present a simple plan operator which may be used to work out a narrative goal, namely to present a new PowerPoint slide. The body includes a communicative gesture as well as a canned text pattern which is instantiated with the value for ?label and uttered by the character. It should be obvious how this operator could be used to realize the segues for several of the nodes in the narrative outline in Figure 4.

(define-plan-operator
  :header (A0 (PresentPersona ?id ?SlideNumber))
  :constraints (Label ?id ?label)
  :inferiors (
    (A1 (SAddPPCode ("PLAY OneHand;;")))
    (A2 (SAddPPCode ("SAY Here is a " ?label ";;")))))
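Returning to the segue-insertion heuristic described earlier in this section, the following sketch shows one way the distance test could be implemented. It is an illustrative re-implementation, not the system's code; the tree is represented by an assumed parent map, and distances are counted in edges.

    # Sketch of the segue-insertion heuristic (illustrative; the tree representation
    # and the edge-based distance count are assumptions).

    def path_to_root(node, parent):
        """Return the list of ancestors of node, starting with node itself."""
        path = [node]
        while node in parent:
            node = parent[node]
            path.append(node)
        return path

    def distances_to_lca(a, b, parent):
        """Distances (in edges) from a and b to their least common ancestor."""
        positions_a = {n: i for i, n in enumerate(path_to_root(a, parent))}
        for j, n in enumerate(path_to_root(b, parent)):
            if n in positions_a:
                return positions_a[n], j
        raise ValueError("nodes are not in the same tree")

    def needs_segue(a, b, parent):
        """Introduce a segue between temporally adjacent content nodes a and b?"""
        d_a, d_b = distances_to_lca(a, b, parent)
        return max(d_a, d_b) > 3 or d_a + d_b > 4

    # parent maps each node to its parent, here for part of the outline in Figure 4:
    parent = {"2.1.2": "2.1", "2.1": "2", "2.2.1.1": "2.2.1",
              "2.2.1": "2.2", "2.2": "2"}
    print(needs_segue("2.1.2", "2.2.1.1", parent))
    # distances are 2 and 3; their sum exceeds 4, so a segue is introduced (True)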
The presentation operators also specify which media type is used to realize a certain goal. By default, we attempt to realize narrative goals as speech media type, using rules on text length and truncatability to less than 250 bytes to decide when to use text-to-speech. Captions are always realized as text (i.e., they have a text realization and a possible audio realization).

3.3 Layout Generation
As described above, we are able to generate both SMIL and PowerPoint presentations. The generation of PowerPoint presentations starts from an XML-based markup language that represents both the contents of a PowerPoint slide as well as the accompanying comments and gestures by the presenter. The XML representation is then converted into PowerPoint presentations making use of Bridge2Alpha from IBM alphaWorks, a tool that allows Java programs to communicate with ActiveX objects. While the material of a presentation appears on the slide pages, suggestions regarding its presentation are included in the note pages. Essentially, these suggestions are related to AutoBriefer's narrative goals. The spatial layout of the slides is given by the corresponding PowerPoint style sheet and can be directly inferred from the hierarchical structure as indicated by the human author. When generating SMIL presentations, we also allow for dynamic media, such as video and audio, all of which have to be displayed in a spatially and temporally coordinated manner and need to be synchronized with the agents' communicative gestures. For instance, a character may point to a certain object in a video while it is being played. In order to relieve the user of the burden of completely working out a spatial and temporal layout, the following extensions of the presentation planner have become necessary (André and Rist 1996):

The formulation of temporal and spatial constraints in the plan operators. We distinguish between metric and qualitative constraints. Metric temporal and spatial constraints appear as metric (in-)equalities, e.g. (5 <= Bottom VideoRegion – Top VideoRegion). Qualitative temporal constraints are represented in an “Allen-style” fashion which allows for the specification of thirteen temporal relationships between two named intervals, e.g. (SpeakInterval (During) PointInterval). Qualitative spatial constraints are represented by a set of topological relations, such as LeftOf, CenterHor or TopAlign. Note that the plan operators include general temporal constraints that should be satisfied for a certain document type, while the temporal constraints the user specifies with the GUI refer to a particular document instance.

The development of a mechanism for designing a spatial and temporal layout. We collect all spatial and temporal constraints during the presentation planning process, combine them with the temporal constraints specified by the user for the document to be generated, and use the incremental constraint solving toolkit Cassowary (Borning et al. 1997) to determine a consistent spatial and temporal layout which is then represented as a SMIL document. In case of a conflict between the constraints specified by the GUI user and the constraints specified by the author of plan operators, the user's constraints are given preference, but a warning is produced.
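To illustrate how a consistent temporal schedule can be derived from such metric constraints, the following sketch uses a simple difference-constraint solver (Bellman-Ford) as a stand-in for the Cassowary toolkit mentioned above. The variables correspond to the begin and end points of the acts A1-A3 in the plan operator presented below; the durations assumed for the speech and the video are invented for the example.

    # Sketch only: solving metric temporal constraints of the form x - y <= c with
    # Bellman-Ford relaxation. A simplified stand-in for Cassowary; durations assumed.

    def solve(constraints, variables):
        """Each constraint (x, y, c) means x - y <= c. Returns a feasible schedule."""
        dist = {v: 0 for v in variables}            # implicit source node at time 0
        for _ in range(len(variables)):
            for x, y, c in constraints:             # relax edge y -> x with weight c
                if dist[y] + c < dist[x]:
                    dist[x] = dist[y] + c
        for x, y, c in constraints:                 # detect inconsistent constraints
            if dist[y] + c < dist[x]:
                raise ValueError("temporal constraints are inconsistent")
        shift = -min(dist.values())                 # shift so the schedule starts at 0
        return {v: t + shift for v, t in dist.items()}

    V = ["b1", "e1", "b2", "e2", "b3", "e3"]        # begin/end of A1 (title), A2 (speech), A3 (video)
    C = [("b2", "b1", 1), ("b1", "b2", -1),         # 1 <= Begin A2 - Begin A1 <= 1
         ("e2", "b3", 0),                           # A2 (Before) A3: speech ends before video starts
         ("e1", "e3", 1), ("e3", "e1", -1),         # 1 <= End A1 - End A3 <= 1
         ("b2", "e2", -5), ("b3", "e3", -8)]        # assumed durations: speech 5, video 8
    print(solve(C, V))

A real schedule would of course also incorporate the qualitative Allen-style relations and the spatial constraints, which Cassowary handles together with the metric ones.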
In the following, we present a plan operator which may be used to synchronize text, spoken voice and video. The body of the plan operator contains a complex action (PresentVideo) that requires further expansion and two elementary actions (SAddSpeech and SAddSmilCode). SAddSmilCode will extend the SMIL output file with an animation that visualizes the talking. SAddSpeech will cause the Festival speech synthesizer (Taylor et al. 1998) to generate an audio file for ?comment which will be included into the SMIL output file as well. The title should be visible while the comment is spoken and the video is run. In particular, the verbal comment starts one time unit after the display of the title, which precedes the video. After the video is played, the title will be visible for one more time unit. The spatial constraints specify where the regions for the title A1 and the video sequence A3 should be located with respect to the region corresponding to the superior presentation segment A0.

(define-plan-operator
  :header (A0 (PresentSegment ?segment))
  :constraints (*and* (BELP (Title ?segment ?title))
                      (BELP (Comment ?segment ?comment))
                      (BELP (Video ?segment ?video)))
  :inferiors (A1 (SAddSmilCode (?title)))
             (A2 (SAddSpeech Festival ?comment))
             (A3 (PresentVideo (?video)))
  :temporal (1 <= Begin A2 – Begin A1 <= 1)
            (A2 (Before) A3)
            (1 <= End A1 – End A3 <= 1)
  :spatial (CenterHor A1)
           (AlignTop A1)
           (CenterHor A3)
           (1 <= Bottom A1 – Top A3 <= 1))

Note that the plan operators do not necessarily represent a complete specification of the temporal and spatial layout of the resulting presentation, but only include specific design constraints. For instance, the plan operator above does not specify how long the title should be displayed, but only that it should start one time unit earlier and end one time unit later than the video. The advantage of our approach is that a consistent schedule is generated automatically on the basis of the constraints specified by the author of the plan operators and the user of the GUI shown in Figure 3.

4. EXAMPLE OUTPUT

In the following, we will present two different multimedia briefings that may be generated (among many other variations) from one and the same narrative outline: a PowerPoint slide show that makes use of an animated character and a SMIL-based briefing that just employs synthesized audio.
<Slide>
  <Title>Peru Action Brief
  <S>Preamble
  <S>Situation Assessment
  <S>
  <S>Chronology of Events
  <S>Biographies
  <S>Coda
SHOW kaufmann, Kaufmann4.31.acs, 1,69;;
PLAY Tie;;
PLAY Normal;;
PLAY OneHand;;
SAY In this briefing I will go over the situation assessment.;;
PLAY Normal;;
SAY This will cover an overview of the chronology of events and a biography of Victor Polay.;;
PLAY SpeakGiveTurn;;
SLIDE 2=SAY Next slide please.
…

Figure 5. An Introductory Slide Represented by the Narrative Planner in XML
We start from the narrative outline discussed in Section 2. It includes two create goals, involving agents that carry out summarization functions: a multi-document summarizer (Mani and Bloedorn 1999) and a biographical summarizer (Schiffman et al. 2001). Figure 5 shows the XML-representation of the first slide of the PowerPoint presentation, which was created automatically by the Narrative Planner. The narrative goals include both natural language utterances as well as the corresponding gestures to illustrate the contents on the corresponding slide. The final PowerPoint presentation generated is shown in Figure 6. Here we show screen dumps of the six PowerPoint segments produced which also include a virtual speaker. To enable this style of presentation, we make use of Microsoft Narrator, a tool that allows for the enhancement of PowerPoint slides by Microsoft agents that narrate and automatically advance through each slide.
Figure 6. PowerPoint Presentation
The same narrative outline can be used to plan a presentation in SMIL. The corresponding presentation generated in SMIL is shown in Figure 7.
Figure 7. SMIL Presentation
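The SMIL version of the briefing is an XML document. As a rough indication of the kind of markup such a presentation corresponds to (this is not AutoBriefer's code; the file names, region names and concrete timings are invented, chosen to satisfy the plan operator's constraints from Section 3), it could be assembled with standard-library tools as follows:

   import xml.etree.ElementTree as ET

   smil = ET.Element("smil")
   head = ET.SubElement(smil, "head")
   layout = ET.SubElement(head, "layout")
   ET.SubElement(layout, "region", id="title", top="0", height="15%")
   ET.SubElement(layout, "region", id="video", top="20%", height="80%")

   body = ET.SubElement(smil, "body")
   par = ET.SubElement(body, "par")   # children are played in parallel
   # Title visible from one unit before the speech until one unit after the video
   ET.SubElement(par, "text",  src="title.html",  region="title", begin="0s", dur="18s")
   ET.SubElement(par, "audio", src="comment.wav", begin="1s")
   ET.SubElement(par, "video", src="clip.mpg",    region="video", begin="7s", dur="10s")

   ET.ElementTree(smil).write("briefing.smil", xml_declaration=True)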
5. RELATED WORK

There is a fair amount of work on automatic authoring of multimedia presentations. The systems described in (Wahlster et al. 1993), (Dalal et al. 1996) and (André and Rist 1996) aim at the fully automated generation of documents. While these approaches make use of their own display formats, more recent work has adopted SMIL as a useful basis for multimedia presentations, e.g., see (Geurts et al. 2001) and (Bes et al. 2001). Our work is distinguished in several ways from the approaches listed above. First of all, our work is not restricted to a single presentation format. Second, AutoBriefer allows for flexibly shifting the responsibility between the presentation system and the human author. For instance, it is up to the user to determine to what extent to flesh out the narrative outline. Third, our approach explicitly distinguishes between content-level and narrative-level goals. Our captions, which are very short, rely on canned text based on node labels in the initial narrative outline, or are based on shallow meta-information generated by the summarization agents along with the created media object. (Mittal et al. 1995) describe a variety of strategies for the generation of longer, more explanatory captions, some of which may be exploited in our work by deepening the level of meta-information. The distinction between content-level and narrative-level goals has proven especially useful with respect to the generation of presentations that make use of synthesized agents to convey this meta-information. Finally, the approaches listed above concentrate on
the presentation planning task and do not exploit methods for multimedia document analysis. (Piwek et al., this volume) investigate how the same content may be expressed by different kinds of multimodal presentation, ranging from plain texts and illustrated documents to dialogues between two animated agents. However, these functions are not yet included in a single system, as in AutoBriefer, which is able to produce a large variety of presentation variants from one and the same outline. In our ability to leverage automatic summarization agents, our work should be clearly distinguished from work which attempts to format a summary (from an XML representation) into something akin to a PowerPoint briefing, e.g., (Nagao and Hasida 1998). Our work, by contrast, is focused on using summarization in generating briefings from an abstract outline.

6. CONCLUSION

Overall, the advantage of our approach lies in the fact that we are able to generate a large variety of briefings starting from the same narrative outline generated as a result of GUI interaction. In addition, the approach allows for a flexible adaptation of the degree of automation. The generated PowerPoint document may be used in two different ways. In the simplest case, a human speaker presents the slides herself to an audience, treating the information in the notes as suggestions on how to deliver the material. Another possibility is the integration of one or several synthesized agents as virtual presenters (André and Rist 2000). As mentioned at the outset, one of the goals of AutoBriefer is to save time in briefing production. In the future, we hope to evaluate the time savings by means of an experiment that would compare the amount of time naïve and expert users need to produce a variety of briefings using AutoBriefer instead of authoring directly in PowerPoint or SMIL. We expect the savings to be greater relative to authoring directly in SMIL than relative to authoring in PowerPoint, since PowerPoint already provides a convenient interface to the user. Nevertheless, a benefit for the authoring of PowerPoint slides is expected, especially in situations where users have to produce a larger variety of PowerPoint documents for different settings of generation parameters. In addition, PowerPoint cannot of course be used to generate a presentation based on a rough outline provided by the user. AutoBriefer has only scratched the surface of what might be possible in terms of automated presentation of briefings. In the future, more use of natural language generation would allow for a greater degree of flexibility in the generation of narrated briefings. To provide for more persuasive briefings, it would also be useful to allow the human speaker to specify in the GUI interaction the emotional and rhetorical gestures a character should convey, since this information is hard to determine automatically.
7. ACKNOWLEDGMENTS

The work was funded by the BMBF under contract number 01 IW 001 and DARPA's GENOA research program, under contract number DAA-B07-99-CC201. We would like to thank Peter Rist for designing and animating the virtual presenter. Part of this work has been conducted by the first author within the DFKI project MIAU. Thanks to Hamish Cunningham from Sheffield University for providing us with the GATE system, which made it possible to install a complete version of AutoBriefer at DFKI without having to rely on the commercial information extraction software used within GENOA.

8. REFERENCES

André, E. and Rist, T. (1996) Coping with temporal constraints in multimedia presentation planning. In Proceedings of AAAI '96. Menlo Park, Cambridge, London: AAAI Press/The MIT Press, Vol. 1: 142-147.
André, E., Rist, T., van Mulken, S., Klesen, M. and Baldes, S. (2000) The Automated Design of Believable Dialogues for Animated Presentation Teams. In: Cassell, J., Sullivan, J., Prevost, S. and Churchill, E. (eds.): Embodied Conversational Agents, 220-255. Cambridge, MA: MIT Press.
André, E. and Rist, T. (2001) Controlling the Behavior of Animated Presentation Agents in the Interface: Scripting versus Instructing. AI Magazine 22(4): 53-66.
Borning, A., Marriott, K., Stuckey, P. and Xiao, Y. (1997) Linear Arithmetic Constraints for User Interface Applications. Proceedings of the 1997 ACM Symposium on User Interface Software and Technology, 87-96.
Cassell, J., Nakano, Y., Bickmore, T., Sidner, C. and Rich, C. (2001) Non-verbal Cues for Discourse Structure. Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics, 106-115.
Dalal, M., Feiner, S., McKeown, K., Pan, S., Zhou, M., Hollerer, T., Shaw, J., Feng, Y. and Fromer, J. (1996) Negotiation for Automated Generation of Temporal Multimedia Presentations. Proceedings of ACM Multimedia '96, 55-64.
Bes, F., Jourdan, M. and Khantache, F. (2001) A Generic Architecture for Automated Construction of Multimedia Presentations. The 8th International Conference on Multimedia Modeling, Amsterdam, 229-246.
Geurts, J., van Ossenbruggen, J. and Hardman, L. (2001) Application-Specific Constraints for Multimedia Presentation Generation. The 8th International Conference on Multimedia Modeling, Amsterdam, 247-266.
Mani, I. and Bloedorn, E. (1999) Summarizing Similarities and Differences Among Related Documents. Information Retrieval 1(1): 35-67.
Mani, I., Gates, B. and Bloedorn, E. (1999) Improving Summaries by Revising Them. Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics, College Park, MD, 558-565.
Mittal, V., Roth, S., Moore, J., Mattis, J. and Carenini, G. (1995) Generating Explanatory Captions for Information Graphics. Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI '95), 1276-1283.
Nagao, K. and Hasida, K. (1998) Automatic Text Summarization Based on the Global Document Annotation. Proceedings of COLING '98, Montreal, 917-921.
ORATRIX (2002). www.oratrix.com
Piwek, P., Power, R., Scott, D. and van Deemter, K. (2003) Generating Multimedia Presentations: From Plain Text to Screenplay. This volume.
Poggi, I., Pelachaud, C., de Rosis, F., De Carolis, B. and Carofiglio, V. (2003) Greta. A Believable Intelligent Embodied Conversational Agent. This volume.
REAL (2002). www.real.com
Schiffman, B., Mani, I. and Concepcion, K. (2001) Producing Biographical Summaries: Combining Linguistic Knowledge with Corpus Statistics. In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics (ACL 2001), 450-457.
SMIL (2002). Synchronized Multimedia Integration Language. www.w3.org/AudioVideo/
Taylor, P., Black, A. and Caley, R. (1998) The Architecture of the Festival Speech Synthesis System. Proceedings of the Third ESCA Workshop on Speech Synthesis, Jenolan Caves, Australia, 147-151.
Theune, M., Heylen, D. and Nijholt, A. (2003) Generating Embodied Information Presentations. This volume.
Wahlster, W., André, E., Finkler, W., Profitlich, H.-J. and Rist, T. (1993) Plan-Based Integration of Natural Language and Graphics Generation. Artificial Intelligence 63: 387-427.
9. AFFILIATIONS E. André K. Concepcion I. Mani L. Van Guilder Universität Augsburg Augsburg, Germany
G. CARENINI AND C. CONATI
GENERATING TAILORED WORKED-OUT PROBLEM SOLUTIONS TO HELP STUDENTS LEARN FROM EXAMPLES
1. INTRODUCTION

Studying examples is one of the most natural ways of learning a new skill. When studying a worked-out solution to a problem, students can focus all their cognitive resources on understanding one solution step at a time, without being overwhelmed by the possibly too demanding task of solving the whole problem from scratch. Given the critical role played by worked-out examples in learning, it is not surprising that substantial research in the field of Intelligent Tutoring Systems (ITS) has been devoted to understanding how to use examples effectively. Most of this research has focused on how to select examples that can help a student during problem solving. ELM-PE [Burrow and Weber 1996] and ELM-ART [Weber and Specht 1997] are two tutoring systems that allow the student to access relevant examples while solving LISP programming problems and provide explanations on how each example is relevant for the problem solution. SHERLOCK [Gott, Lesgold et al. 1996] provides expert solutions to troubleshooting problems, and helps students compare these solutions with their own solutions at the end of each problem solving task. CATO [Aleven and Ashley 1997] helps students build legal arguments by generating relevant example cases and by reifying the connection between the content of the cases and their use in the arguments. In contrast with previous work, the research presented in this chapter does not investigate how to select examples that can help a student during problem solving. Rather, our focus is on how to describe an example solution so that a student can learn the most by studying it prior to problem solving. In particular, we address the issue of how to vary the level of detail of the presented example solution, so that the same example can be equally stimulating for learners with different degrees of domain knowledge. This problem is novel in ITS, as it requires sophisticated natural language generation (NLG) techniques. While the NLG field has extensively studied the process of producing text tailored to a model of the user's inferential capabilities [Horacek 1997; Korb, McConachy et al. 1997; Young 1999], applications of NLG techniques in ITS are few and mainly focused on managing and structuring the tutorial dialogue [Moore 1996; Freedman 2000], rather than on tailoring the presentation of instructional material to a detailed student model.
The rationale behind varying the level of detail of an example solution lies in cognitive science findings about self-explanation (i.e., the explanations students generate to themselves to clarify an example solution) and cognitive load (i.e., the load that performing a particular task imposes on the learner's cognitive system). Several studies indicate that students who self-explain examples learn better than students who read the examples without elaborating on them [Chi 2000]. One kind of self-explanation that these studies showed to be correlated with learning involves filling in the gaps commonly found in textbook example solutions (gap-filling self-explanation). However, the same studies also showed that most students tend not to self-explain spontaneously. In the case of gap filling, this phenomenon could be due to the fact that gap filling virtually requires performing problem solving steps while studying an example. And, because problem solving can be highly cognitively and motivationally demanding [Sweller 1988], if the gaps in an example solution are too many or too difficult for a given student, they may hinder self-explanations aimed at filling them. Conversely, if the gaps are too few and too easy, the learner's cognitive system can be overloaded by unnecessary information. We argue that, by monitoring how a student's knowledge changes when studying a sequence of examples, it is possible to introduce appropriate solution gaps into the examples, thus facilitating gap-filling self-explanation and providing a smooth transition from example study to problem solving. We are exploring this hypothesis by extending the SE-Coach, a framework to support self-explanation of physics examples [Conati and Vanlehn 2000]. The SE-Coach already effectively guides two other kinds of self-explanation that have been shown to trigger learning [Chi 2000]: (i) justifying a solution step in terms of the domain theory (step correctness); (ii) mapping a solution step onto the high-level plan underlying the example solution (step utility). The internal representation of an example solution used by the SE-Coach to monitor students' self-explanation is generated automatically. However, because the SE-Coach does not include any NLG capability, the example description presented to the student and the mapping between this description and the internal representation are produced by hand. Thus, each example has a fixed description, containing virtually no solution gaps. In this chapter, we describe how we extended the SE-Coach with NLG techniques to (i) automatically generate the example presentation from the example's internal representation, and (ii) selectively insert gaps in the example presentation, tailored to a student's domain knowledge. Several NLG computational models proposed in the literature generate concise text by taking into account the inferential capabilities of the user. [Young 1999] presents a system that generates concise plan descriptions tailored to the hearer's plan reasoning capabilities. In instructing the user how to accomplish a certain task, for instance initializing a PDA, the system leaves out details if the user is assumed to be able to fill them in, given a model of the user's planning algorithm and preferences. [Horacek 1997] describes a system that takes into account the hearer's logical inference capabilities. The system generates explanations for planning office space. Facts are omitted from an explanation if the user is expected to be able to infer them from previous information and context. The model of the user's inferential capabilities consists of a set of deterministic logical
rules associated with a stereotypical model of the user's domain expertise. Finally, [Korb, McConachy et al. 1997] propose a system that relies on a model of the user's probabilistic inferences to generate sufficiently persuasive arguments. In order to assess whether the user's degree of belief in the argument conclusion will be within a target range once an argument is presented, the system represents the argument as a Bayesian network which encodes a probabilistic model of the user's beliefs. This model can also incorporate common errors in human reasoning under uncertainty. In contrast to all of these approaches, our generation system tailors the content and organisation of an example to a probabilistic model of the user's logical inferences, namely, a model of the system's uncertainty about the user's knowledge of a set of deterministic rules. This allows us to explicitly represent the inherent uncertainty involved in assessing a learner's knowledge and reasoning processes. Furthermore, our system maintains information on what example parts are not initially presented (i.e., solution gaps), which is critical to support gap-filling self-explanations for those students who tend not to self-explain autonomously. In the following sections, we first briefly present Cognitive Load Theory (CLT), a theory of instructional design rooted in cognitive science that provides the background and motivation for our work. After that, we illustrate our general framework for example generation and presentation. We start by describing in detail the NLG techniques used and an example of the tailored presentations they generate. Then, we show how the output of the NLG process supports an interface to guide gap-filling self-explanation. We conclude with a discussion of future work.

2. COGNITIVE LOAD THEORY

Cognitive Load Theory (CLT) is an instructional theory that takes into account the student's cognitive limitations as primary factors in learning processes [Sweller 1994], [Cooper 1998]. At the core of the theory is the recognition that, in order to learn a new skill, relevant information must first be attended to and then processed in working memory, before it is eventually stored in long-term memory. However, working memory is limited in capacity and can store information only for a limited time. Therefore, according to CLT, effective learning can occur only if the learner's cognitive system (working memory in particular) is not overloaded with information. On these grounds, CLT strongly criticizes conventional instructional strategies in which problem solving is excessively emphasized. Traditionally, when a new topic is taught, only a few worked-out examples are shown, and learning how to apply newly introduced principles and rules is supposed to occur when students practice them in solving problems. However, novices who have not had enough practice with sufficiently varied worked-out examples cannot learn effectively when they perform problem solving. The reason is that, when novices solve a problem, they typically waste a considerable amount of their limited cognitive resources on processes (e.g., means-ends analysis) that are not directly relevant to attending to and processing the information intended to be learned (i.e., how to apply and combine relevant rules and principles in the target domain).
CLT proposes several remedies to improve traditional instructional strategies by taking into account the learner's cognitive limitations. These include: goal-free problem solving (i.e., instead of asking students to find a specific quantity, students are asked to find what they can); having learners study many worked-out examples and then complete partially solved problems; avoiding splitting the student's attention across multiple sources of information (e.g., graphics and text); and avoiding presenting information redundantly (where what is redundant may depend on the learner's level of expertise [Kalyuga, Chandler et al. 1997]). Notice that all these suggestions from CLT have been empirically shown to be beneficial to learners in several domains, ranging from geometry to biology (e.g., [Paas and Merrienboer 1994]). The key aspect of CLT that inspires our work is the importance placed on example studying in learning. CLT claims that, by studying a worked-out example, the learner needs only to attend to and process one solution step at a time. As a result, learning how to apply the rule used to derive that step is unlikely to be hampered by cognitive overload. However, it is clear that, as the student gains expertise in the domain, fully specified examples may become less and less effective, because the student already masters how to derive some of the required steps. Ideally, what the student needs are examples that leave out the steps she is already familiar with, but still spell out the steps she needs to learn how to derive.
Figure 1. Framework for example generation. (The diagram shows, before run-time, a problem solver that combines the problem definition (givens and goals) with domain and planning rules to produce the solution graph; at run time, the example generator draws on the solution graph, templates and explanation strategies, together with the long-term user model and a Bayesian network, to produce the text plan, the gap-filling menu and the user interface. The numbered links 1a-7 referred to in the text connect these components.)
In the following sections, we describe a framework that does exactly that. It generates examples whose level of detail is tailored to the learner’s degree of domain knowledge. According to CLT, on one hand, if the example says too much (i.e., too few and/or too small gaps in the solution), it will generate unnecessary cognitive load because the learner will have to pay attention to steps that she already knows how to derive. On the other hand, if the example says too little (i.e., too many
and/or too large gaps in the solution), it will also generate unnecessary cognitive load, because the learner will have to fill in the missing steps through problem solving that may be too cognitively demanding. Essentially, the challenge for our system is to strike a difficult balance between saying too much and saying too little, in order to avoid overloading the learner and therefore promote effective learning.

3. THE FRAMEWORK FOR EXAMPLE GENERATION

Figure 1 shows the architecture of our framework for generating tailored example presentations. The part of the framework labelled "before run-time" is responsible for generating the internal representation of an example solution from (i) a knowledge base (KB) of domain and planning rules (for physics in this particular application); (ii) a formal description of the example's initial situation, given quantities and sought quantities [Conati and Vanlehn 2000]. A problem solver uses these two knowledge sources to generate the example solution, represented as a dependency network known as the solution graph. The solution graph encodes how each intermediate result in the example solution is derived from a domain or planning rule and from previous results matching that rule's preconditions. Consider, for instance, the physics example in Figure 2 (Example1). Figure 3 shows the part of the solution graph that derives the first three steps mentioned in the Example1 solution:

- establish the goal to apply Newton's 2nd Law (the corresponding text in Example1 is: "Because we want to find a force, we apply Newton's 2nd law to solve this problem");
- select the body to which to apply the law (the corresponding text in Example1 is: "We choose Jake as the body");
- identify the existence of a tension force on the body (the corresponding text in Example1 is: "The helicopter's rope exerts a tension force T on Jake").

In the solution graph, intermediate solution facts and goals (F- and G- nodes in Figure 3) are connected to the rules (R- nodes) used to derive them and to previous facts and goals matching these rules' enabling conditions. The connection goes through rule-application nodes (RA- nodes in Figure 3), explicitly representing the application of each rule in the context of a specific example. Thus, the segment of network in Figure 3 encodes that the rule R-try-Newton-2law establishes the goal to apply Newton's 2nd Law (node G-try-Newton-2law) to solve the goal to find the force on Jake (node G-force-on-Jake).
Figure 2: Sample Newtonian physics example.
The rule R-goal-choose-body sets the subgoal to find a body to which to apply Newton's 2nd Law (node G-goal-choose-body), while the rule R-find-forces sets the subgoal to find all the forces on the body (node G-find-forces). The rule R-body-by-force dictates that, if one has the goals to find the force on an object and to select a body to which to apply Newton's 2nd Law, that object should be selected as the body. Thus, in Figure 3 this rule selects Jake as the body for Example1 (node F-Jake-is-the-body). The rule R-tension-exists says that if an object is tied to a taut string, then there is a tension force exerted by the string on the object. When applied to Example1, this rule generates the fact that there is a tension force on Jake (node F-tension-on-Jake in Figure 3). The solution graph can be seen as a model of correct self-explanation for the example solution, because for each solution fact it encodes the various types of self-explanation relevant to understanding it: step correctness (what domain rule generated that fact), step utility (what goal that fact fulfils in the high-level plan underlying the example solution) and gap filling (how the fact derives from previous solution steps). In the SE-Coach, every time a student is shown an example, the corresponding solution graph provides the structure for a Bayesian network (see the bottom right
of Figure 1, link 1a) that uses information about how the student reads and self-explains that example to generate a probabilistic assessment of how well the student understands the example and the related rules [Conati and Vanlehn 2000]. The prior probabilities used to initialise the rule nodes in the Bayesian network come from the long-term student model (see Figure 1, link 3b), which contains a probabilistic assessment of the student's current knowledge of each rule in the KB.
Figure 3: Segment of solution graph for Example1. Node types: R = rule, F/G = fact/goal, RA = rule application.
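For concreteness, the following is a minimal sketch, not the SE-Coach's implementation, of the solution-graph fragment in Figure 3 represented as a dependency network. The node names follow the figure; the premise lists are inferred from the textual description above and should be read as assumptions.

   from dataclasses import dataclass

   @dataclass
   class RuleApplication:
       rule: str           # R-  node: domain or planning rule being applied
       premises: list      # F-/G- nodes matching the rule's enabling conditions
       conclusion: str     # F-/G- node derived by this rule application

   solution_graph = [
       RuleApplication("R-try-Newton-2law", ["G-force-on-Jake"], "G-try-Newton-2law"),
       RuleApplication("R-goal-choose-body", ["G-try-Newton-2law"], "G-goal-choose-body"),
       RuleApplication("R-find-forces", ["G-try-Newton-2law"], "G-find-forces"),
       RuleApplication("R-body-by-force",
                       ["G-force-on-Jake", "G-goal-choose-body"], "F-Jake-is-the-body"),
       RuleApplication("R-tension-exists",
                       ["F-hangs-from-rope", "G-find-forces"], "F-tension-on-Jake"),
   ]

   def rule_for(node):
       """Step correctness: which rule derived this fact or goal?"""
       return next(ra.rule for ra in solution_graph if ra.conclusion == node)

   def direct_premises(node):
       """Gap filling: which earlier facts/goals does this node directly depend on?"""
       return next(ra.premises for ra in solution_graph if ra.conclusion == node)

   print(rule_for("F-tension-on-Jake"))         # R-tension-exists
   print(direct_premises("F-Jake-is-the-body")) # ['G-force-on-Jake', 'G-goal-choose-body']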
The long-term student model is created when the user starts using the SE-Coach. At first, the model is initialized either with existing data on the student's relevant physics knowledge, or with uniform priors when these data are not available. Then, the model is updated every time the student finishes studying an example, with the new rule probabilities computed by the corresponding Bayesian network (see Figure 1, link 7). In the SE-Coach, the solution graph and Bayesian network described above are used to support students in generating self-explanations for step correctness and step utility only. No explicit monitoring of or support for gap-filling self-explanation is provided. This is because in the SE-Coach, the descriptions of the example solutions presented to the student and the mappings between these descriptions and the corresponding solution graphs are done by hand. This makes it impossible to tailor an example description to the dynamically changing student model by inserting gaps
at the appropriate difficulty level for a given student. We have overcome this limitation by adding to the SE-Coach the example generator (see Figure 1), an NLG system that can automatically tailor the detail level of an example description to the student's knowledge, in order to avoid overloading the student's cognitive resources and thus stimulate and support gap-filling self-explanation.

4. THE EXAMPLE GENERATOR (EG)

EG is designed as a standard pipelined NLG system [Reiter and Dale 2000]. A text planner [Young and Moore 1994] selects and organizes the example content, then a microplanner and a sentence generator realize this content into language. In generating an example, EG relies on two key communicative knowledge sources (Figure 1, links 2a and 2b): (i) a set of explanation strategies that allow the text planner to determine the example's content, organization and rhetorical structure; (ii) a set of templates that specify how the selected content can be phrased in English. The design of these sources involved a complex acquisition process. We obtained an abstract model of an example's content and organisation from a detailed analysis of the rules used to generate the solution graph. This was combined with an extensive examination of several physics textbook examples, which also allowed us to model the examples' rhetorical structure and the syntactic and semantic structure of their clauses. To analyse the rhetorical structure of the examples, we followed Relational Discourse Analysis (RDA) [Moser, Moore et al. 1996], a coding scheme devised to analyse tutorial explanations. The semantic and syntactic structure of the examples' clauses was used to design the set of templates that map content into English. We now provide the details of the selection and organisation of the example content. In EG, this process relies on the solution graph and on the probabilistic long-term student model (Figure 1, links 3a and 3b). It consists of two phases, text planning and revision, to reduce the complexity of the plan operators and increase the efficiency of the planning process. Text planning selects from the solution graph a knowledge pool of all the propositions (i.e., goals and facts) necessary to solve a given example, and it organizes them according to ordering constraints also extracted from the solution graph. The output of this phase, if realized, would generate a fully detailed example solution. After text planning, a revision process uses the assessment in the student's long-term model to decide whether further content selection can be performed to insert appropriate solution gaps. Text planning and revision are described in the following sub-sections.

4.1 Text Planning Process
The input to the text planner consists of (i) the abstract communicative action of describing an example solution; (ii) the example solution graph; (iii) the explanation strategies. The planning process selects and organizes the content of the example solution by iterating through a loop of communicative action decomposition1.
Abstract actions are decomposed until primitive communicative actions (executable as speech acts) are reached. In performing this task, the text planner relies on the set of explanation strategies that specify possible decompositions for each communicative action and the constraints dictating when they may be applied. These constraints are checked against the solution graph and when they are satisfied the decomposition is selected and appropriate content is also extracted from the solution graph. For illustration, Figure 4 (a) shows a simplified explanation strategy that decomposes the communicative action describe-solution-method. Possible arguments for this action are, for instance, the Newton’s-2nd-Law and the Conservation-of-Energy methods. Looking at the details of the strategy, the function find-steps (:constraints field) checks in the solution graph whether the method has any steps. If this is the case, the steps are retrieved from the solution graph and the describe-solution-method action is decomposed in an inform-about primitive action and in a describe-method-steps abstract action. The output of the planning process is a text plan, a data structure that specifies what propositions the example should convey, a partial order over those propositions and the example rhetorical structure. A portion of the text plan generated by EG for Example1 is shown in Figure 4(b). The propositions that the example should convey are specified as arguments of the primitive actions in the text plan. In Figure 4(b) all primitive actions are of type inform. For instance, the primitive action (Inform-about (act-on Jake weight)) specifies the proposition (act-on Jake weight), which is realized in the example description as “the other force acting on Jake is his weight”. In the text plan, the communicative actions are partially ordered. This ordering is not shown in the figure for clarity’s sake; the reader can assume that the actions are ordered starting at the top. The example rhetorical structure consists of the action decomposition tree and the informational/intentional relations among the communicative actions. For instance, in
Figure 4(b), the rhetorical structure associated with the action describe-solution-method specifies that, to describe the solution method, the system has to perform two actions: (i) inform the user about the method adopted; (ii) describe all the steps of the method. Between these two actions the Enable intentional relation and the Goal:Act informational relation hold. All the informational/intentional relations used in EG are discussed in [Moser, Moore et al. 1996]. We clarify here only the meaning of the Enable relation, because this relation is critical in supporting gap-filling self-explanations. An intentional Enable relation holds between two communicative actions if one provides information intended to increase either the hearer's understanding of the material presented by the other, or her ability to perform the domain action presented by the other.
(a) Sample explanation strategy:

   (Describe-solution-method ?method
     :constraints (find-steps ?method ?steps)
     :sub-actions ((a1 (Inform-about ?method))
                   (a2 (describe-method-steps ?steps)))
     :relations   ((r1 (Enable a2 a1))
                   (r2 (Goal:Act a1 a2))))

(b) Portion of the text plan for Example1: a communicative action decomposition tree rooted in (Describe example1), with actions such as (Inform-about-problem find-force), (Describe-solution-method Newton's-2nd-Law), (Inform-about-method Newton's-2nd-Law), (Describe-step choose-body), (Inform-about (choose-simple-body Jake)), (Show free-body-diagram), (Inform-about (act-on Jake tension)), (Inform-about (act-on Jake weight)), (Describe-step all-forces-on-body), (Describe-step choose-coordinate-axes), (Describe-step specify-component-equations) and (Describe-step write-component-equations), linked by intentional/informational relations such as Enable, Goal:Act, Preparation:Act, Joint and Step1:Step2.

Figure 4: (a) Sample explanation strategy. (b) Portion of the text plan.
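The decomposition loop itself can be pictured with a minimal sketch, assuming a much-simplified representation of the explanation strategies (the strategy table, the toy solution graph and the helper function below are illustrative, not EG's code):

   def find_steps(method, solution_graph):
       # Constraint check: retrieve the method's steps from the solution graph.
       return solution_graph.get(method, [])

   STRATEGIES = {
       "Describe-solution-method": lambda method, sg: (
           [("Inform-about", method)] +
           [("Describe-step", step) for step in find_steps(method, sg)]
           if find_steps(method, sg) else None),   # constraint failed: no decomposition
   }

   # In EG, Describe-step is itself further decomposed; it is treated as primitive here.
   PRIMITIVE = {"Inform-about", "Describe-step"}   # executable as speech acts

   def plan(action, args, sg):
       """Expand an abstract communicative action until only primitives remain."""
       if action in PRIMITIVE:
           return [(action, args)]
       sub_actions = STRATEGIES[action](args, sg)  # constraints checked against the graph
       result = []
       for sub_action, sub_args in sub_actions:
           result.extend(plan(sub_action, sub_args, sg))
       return result

   toy_graph = {"Newtons-2nd-Law": ["choose-body", "all-forces-on-body",
                                    "choose-coordinate-axes", "write-component-equations"]}
   print(plan("Describe-solution-method", "Newtons-2nd-Law", toy_graph))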
4.2 The Revision Process
Once the text planner has generated a text plan for the complete example, the revision process revises the plan to possibly insert solution gaps that can make the example more stimulating for a specific student. As described in the section on CLT, the key idea is to insert solution gaps of adequate difficulty. In this way, the student can practice applying newly acquired knowledge without incurring the excessive cognitive load due either to too demanding problem solving (i.e., when gaps are too many/large), or to unnecessary information (i.e., when gaps are too few/small). In most NLG systems, the revision process typically involves reorganizing the text plan. In contrast, revision in EG does not change the structure of the text plan. Rather, the EG revision process only performs further content selection by consulting the probabilistic long-term student model that estimates the student's current domain knowledge (see Figure 1, link 3b). More specifically, the revision process examines each proposition specified by a primitive communicative action in the text plan and, if according to the student model there is a high probability (above a given threshold T) that the student knows the rule necessary to infer that
proposition, the action is de-activated. De-activated actions are kept in the text plan but are not realized in the text, thus creating solution gaps. However, as we will see in the next section, de-activated actions may be realized in follow-up interactions. As an illustration of the effects of the revision process on content selection, compare the example solutions shown in Figure 5 and Figure 6. Figure 5 displays the worked-out solution for Example2 which, similarly to Example1, does not contain any solution gaps. In contrast, the same portion of the Example2 solution shown in Figure 6 is much shorter, including several solution gaps. As previously described, EG determines what information to leave out by consulting the long-term probabilistic student model. In particular, the concise solution in Figure 6 is generated by EG if the student had previously studied Example1 with the SE-Coach and generated self-explanations of correctness and utility providing sufficient evidence that she understands the rules used to derive the Example1 solution. When selecting the content for Example2, EG leaves out all the propositions derived from the rules that the student has learned from Example1. Notice, for instance, that the concise solution in Figure 6 does not mention the solution method used and the weight force. Also, the choice of the body and of the coordinate system is only conveyed indirectly. Even if a student has sufficient knowledge to fill in the solution gaps inserted by the revision process, she may not actually perform the required inferences when studying the example. As a matter of fact, cognitive science studies show that most students tend not to self-explain spontaneously [Chi 2000]. Thus, once the text plan is revised and realized, the system presents the concise example with tools designed to stimulate gap-filling self-explanation. These tools help a student to detect gaps in an example solution and to fill them.

5. SUPPORT FOR GAP FILLING SELF-EXPLANATION

To support gap-filling self-explanation, we have extended the interface that the SE-Coach uses to support self-explanations for step correctness and utility. In this interface, described in [Conati and Vanlehn 2000], each example's graphical elements and solution steps presented to the student are covered with gray boxes. Figure 7 shows a segment of the example solution in Figure 6 as presented with the masking interface. To view an example part, the student must move the mouse over the box that covers it, thus allowing the interface to track what the student is reading. When the student uncovers an example part, if the SE-Coach determines that the student needs to self-explain that step at that time, a "self-explain" button appears next to it (see Figure 8 (a1)). Clicking on this button generates more specific prompts that suggest one or more of the self-explanations for correctness, utility or gap filling, depending upon which of them are needed by the current student to fully understand the uncovered step.
Figure 5. Portion of Example2 without solution gaps.
In particular, the text plan produced by EG is the key element in determining whether a prompt for gap filling is generated (Figure 1, link 4). A prompt for gap filling is generated whenever some of the primitive communicative actions that were de-activated during the revision process are related through an Enable intentional relation to the communicative action expressing the uncovered example part. The rationale behind this condition is that a solution gap with respect to an example part comprises all the solution steps that were left out, but whose understanding is a direct precondition to deriving that example part. For instance, given the example part uncovered in Figure 8(a1), there is only one solution gap preceding it, namely the one corresponding to the communicative action Inform-about (choose-simple-body wagon)2.
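The two decisions just described can be summarized in a minimal sketch, under assumed data structures rather than the SE-Coach's code: the revision step de-activates primitive actions whose underlying rule the student model considers known (probability above T), and a gap-filling prompt for an uncovered example part is triggered by de-activated actions that are Enable-related to it. All action, rule and probability values below are invented.

   T = 0.8   # illustrative probability threshold

   # Primitive communicative actions of the text plan: (action id, underlying rule)
   primitive_actions = [
       ("inform-choose-body",   "R-body-by-force"),
       ("inform-tension-force", "R-tension-exists"),
       ("inform-weight-force",  "R-weight-exists"),
   ]

   # Long-term student model: P(student knows rule)
   student_model = {"R-body-by-force": 0.9, "R-tension-exists": 0.3, "R-weight-exists": 0.95}

   # Enable relations in the text plan: action -> actions whose content it enables
   enables = {"inform-choose-body": ["inform-tension-force", "inform-weight-force"]}

   # Revision: de-activate (leave out) actions whose content the student can infer herself
   deactivated = {a for a, rule in primitive_actions if student_model[rule] > T}

   def gap_prompt_needed(uncovered_action):
       """Prompt for gap filling if a de-activated action Enables the uncovered one."""
       return [a for a in deactivated if uncovered_action in enables.get(a, [])]

   print(deactivated)                                # inform-choose-body, inform-weight-force
   print(gap_prompt_needed("inform-tension-force"))  # ['inform-choose-body']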
Figure 6. Portion of Example2 with solution gap.
Figure 7. Masking interface to track what the student is attending to.
As shown in Figure 8 (a1), the prompt for gap filling is generated by adding the item “filling in missing steps” to the self-explain menu. If the student clicks on this item, the interface inserts in the solution text an appropriate number of masking boxes, representing the missing steps (see Figure 8 (a2), first box from top). The interface also activates a dialogue box that allows the student to fill in the missing steps. Since the interface currently does not process natural language input, the student needs to fill each missing step by selecting a sentence describing the step from a set of possible alternatives. Each alternative is either generated by EG by applying the realisation component to unrealised communicative actions in the text plan (see Figure 1, link 5), or it is randomly chosen from a pool of steps unrelated to the current problem.
Figure 8. Interface techniques for gap-filling self-explanation. (a1-2) Techniques to prompt the user for gap filling and to indicate the location of missing steps. (b1-2) Two versions of the dialogue box that supports the student in filling in the missing steps: a pull-down menu interface and a radio-button interface, offering options such as "We apply Newton's 1st law", "We choose the wagon as the body" and "The wagon's weight W is directed downward".
We have implemented two versions of the dialogue box that support the student in filling in the missing steps. As shown in Figure 8 (b1-2), one version is based on pull-down menus, while the other on radio-buttons. A key difference between the two is that, when more than one step is missing, radio buttons allow the student to
inspect all the alternatives for all the missing steps at the same time. In contrast, with pull-down menus the student has to focus on one missing step at a time. We have performed a preliminary user study involving 26 students to determine which of the two versions students prefer. Unfortunately, the result of the study was inconclusive, with roughly half of the students preferring one version and the other half preferring the other. Several students mentioned that radio buttons were confusing when more than one missing step was present, because too much text was displayed. As for pull-down menus, one complaint was the need for a mouse click to see what options are available. Additional studies are clearly needed, especially to identify user differences that can explain the differences in their preferences, and also to explore whether hybrid combinations of pull-down menus and radio buttons might be more effective and more acceptable to users. Going back to the student's interaction with the gap-filling interface, let us assume now that the student, regardless of which of the two dialogue boxes had been provided, makes her selection for filling in the missing step. When this happens, the student receives immediate feedback on the correctness of her selection, which is also sent to the Bayesian network built for the current example (see Figure 1, link 6). The network fact node that corresponds to the missing step is clamped to either true or false, depending on the correctness of the student's selection, and the network updates the probability of the corresponding rule accordingly. As we mentioned in Section 3, at the end of the interaction with the current example, the long-term student model is updated with the new rule probabilities from the example's Bayesian network. In particular, if the student's actions have shown that she is not ready to apply a given rule to fill a solution gap, this rule's probability will decrease in the long-term student model (see Figure 1, link 7). When this happens, if the resulting rule probability is too low (i.e., below the threshold T), the next presented example involving this rule will include the solution steps the rule generates, giving the student another opportunity to see how the rule is applied.

6. CONCLUSIONS AND FUTURE WORK

We have presented a tutoring framework that integrates principles and techniques from ITS and NLG to improve the effectiveness of example studying for learning. Our framework uses an NLG module and a probabilistic student model to introduce solution gaps in the example solutions presented to a student. Gaps are introduced when the student model assesses that the student has gained from previous examples sufficient knowledge of the rules necessary to derive the eliminated steps. The goal is to allow the student to practice applying these rules in problem solving episodes of difficulty adequate for her knowledge. Our framework is innovative in two ways. First, it extends ITS research on supporting the acquisition of the learning skill known as self-explanation, by providing tailored guidance for gap-filling self-explanation. Second, it extends NLG
techniques for producing user-tailored text by relying on a dynamically updated probabilistic model of the user's logical inferences. As future work, in the short term, we plan to investigate how the prediction capabilities of the probabilistic student model can be used to refine the strategies that EG uses to decide when to leave out a solution step. Currently the model is used to diagnose the student's understanding of the relevant physics rules given the student's interaction with the system. The results of this diagnosis, i.e., the updated rule probabilities, are then used directly to decide if the student can apply each rule to derive a corresponding solution step, and if so to deactivate that step in the presented solution. This approach is reasonable if we assume that we always insert gaps that include only one missing solution step. However, if we want to insert gaps that consist of chains of two or more steps, the fact that the student has enough knowledge to infer each of these steps in isolation does not imply that the student can actually make the chain of inferences, because of the increased cognitive load that this process entails. Thus, we are working on using the student model to predict, given one or more chains of deactivated steps, whether the student can actually derive them given the current rule probabilities in the model. A longer-term step in our research will be to test the effectiveness of our framework through empirical studies. These studies are crucial to refine the probability threshold T currently used to decide when to leave out a solution step, and possibly to identify additional principles to inform the text plan revision. Additional future work involves NLG research. First, we plan to investigate how the example text plan can be used to maintain the coherence of the other example portions when the student fills a solution gap. Second, we will start working on how to integrate the generation of the textual example solution and of the graphical elements of the example (e.g., the free-body diagram). Third, we aim to verify whether a template-based approach to realization is sufficiently powerful to cover our target domain of physics examples.

7. ACKNOWLEDGEMENTS

We thank Andrea Bunt for help in reviewing the literature on Cognitive Load Theory and its applications in Intelligent Tutoring Systems.

8. NOTES

1. Communicative actions satisfy communicative goals. So, text planning actually involves two intertwined processes of goal and action decomposition. To simplify our presentation, we only refer to communicative actions and their decomposition.
2. Since the text plans for Example1 and Example2 are structurally the same, this can be verified in Figure 4(b), in which the missing step would have been Inform-about (choose-simple-body Jake).
9. REFERENCES

[Aleven and Ashley 1997] V. Aleven and K. D. Ashley. Teaching case-based argumentation through a model and examples: Empirical evaluation of an intelligent learning environment. Proceedings of Artificial Intelligence in Education, Kobe, Japan, 1997.
[Burrow and Weber 1996] R. Burrow and G. Weber. Example explanation in learning environments. Proceedings of the Third International Conference on Intelligent Tutoring Systems, ITS '96, Springer: 457-465, 1996.
[Chi 2000] M. T. H. Chi. Self-Explaining Expository Texts: The Dual Processes of Generating Inferences and Repairing Mental Models. In Advances in Instructional Psychology, R. Glaser (ed.). Mahwah, NJ, Lawrence Erlbaum Associates: 161-238, 2000.
[Conati and Carenini 2001] C. Conati and G. Carenini. Generating Tailored Examples to Support Learning via Self-Explanation. Proceedings of the 17th International Joint Conference on Artificial Intelligence (IJCAI 2001), Seattle, USA, 2001.
[Conati and Vanlehn 2000] C. Conati and K. Vanlehn. Toward Computer-Based Support of Meta-Cognitive Skills: a Computational Framework to Coach Self-Explanation. International Journal of Artificial Intelligence in Education 11: 398-415, 2000.
[Cooper 1998] G. Cooper. Research into Cognitive Load Theory and Instructional Design at UNSW. University of New South Wales, Australia, 1998.
[Freedman 2000] R. Freedman. Plan-based dialogue management in a physics tutor. Proceedings of the Sixth Applied Natural Language Processing Conference (ANLP '00), Seattle, 2000.
[Gott, Lesgold et al. 1996] S. P. Gott, A. Lesgold and R. S. Kane. Tutoring for transfer of technical competence. In Constructivist Learning Environments, B. G. Wilson (ed.). Englewood Cliffs, NJ, Educational Technology Publications: 33-48, 1996.
[Horacek 1997] H. Horacek. A Model for Adapting Explanations to the User's Likely Inferences. UMUAI 7(1): 1-55, 1997.
[Kalyuga, Chandler et al. 1997] S. Kalyuga, P. Chandler and J. Sweller. Levels of Expertise and User-Adapted Formats of Instructional Presentations: A Cognitive Load Approach. Proceedings of the Sixth International Conference on User Modeling, UM97, 1997.
[Korb, McConachy et al. 1997] K. B. Korb, R. McConachy and I. Zukerman. A Cognitive Model of Argumentation. Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society: 400-405, 1997.
[Moore 1996] J. D. Moore. Discourse generation for instructional applications: Making computer-based tutors more like humans. Journal of Artificial Intelligence in Education 7(2): 181-124, 1996.
[Moser, Moore et al. 1996] M. G. Moser, J. D. Moore and E. Glendening. Instructions for Coding Explanations: Identifying Segments, Relations and Minimal Units. University of Pittsburgh, Department of Computer Science, 1996.
[Paas and Merrienboer 1994] F. Paas and J. Merrienboer. Variability of Worked Examples and Transfer of Geometrical Problem-Solving Skills: A Cognitive-Load Approach. Journal of Educational Psychology 86(1): 122-133, 1994.
[Reiter and Dale 2000] E. Reiter and R. Dale. Building Natural Language Generation Systems. Cambridge University Press, 2000.
[Sweller 1988] J. Sweller. Cognitive load during problem solving: effects on learning. Cognitive Science 12: 257-285, 1988.
[Sweller 1994] J. Sweller. Cognitive Load Theory, Learning Difficulty, and Instructional Design. Learning and Instruction 4: 295-312, 1994.
[Weber and Specht 1997] G. Weber and M. Specht. User modeling and adaptive navigation support in WWW-based tutoring systems. Proceedings of User Modeling '97, 1997.
[Young 1999] M. R. Young. Using Grice's maxim of Quantity to select the content of plan descriptions. Artificial Intelligence 115(2): 215-256, 1999.
[Young and Moore 1994] R. M. Young and J. D. Moore. DPOCL: A Principled Approach to Discourse Planning. Proceedings of the 7th International Workshop on Text Generation, Montreal, Canada, 1994.
10. AFFILIATIONS

Giuseppe Carenini and Cristina Conati
University of British Columbia
Vancouver, Canada
J. CALDER, A. C. MELENGOGLOU, C. CALLAWAY, E. NOT, F. PIANESI, I. ANDROUTSOPOULOS, C. D. SPYROPOULOS, G. XYDAS, G. KOUROUPETROGLOU AND M. ROUSSOU
MULTILINGUAL PERSONALIZED INFORMATION OBJECTS
1. INTRODUCTION

In a world of burgeoning information resources we need ever more responsive and intelligent ways of locating and delivering them. The M-PIRO project (Multilingual Personalized Information Objects) aims at bringing together leading research work on natural language generation (NLG) and speech synthesis with multimedia database, information management and virtual reality technologies developed for public access to museum and gallery resources. M-PIRO targets the concept of personalized information objects, that is, entities capable of responding to requests for information by taking into account what the requester already knows, what they are most interested in, and how the related information is to be made available. The chosen context is cultural heritage: the project works closely with museums, galleries, and other "memory institutions" to develop technology responsive to their special needs. Museum missions, in fact, are changing: from the traditional, curatorial concern for scholarly understanding of objects and their conservation, the focus is shifting towards the visitors' interpretation of the objects in their historical and cultural context and towards making them widely accessible. The M-PIRO project provides tools and techniques to help this interpretative process by facilitating information delivery in a number of different languages and in a personalized way. It allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database, and from existing free-text descriptions. The resulting descriptions, generated in three different languages (English, Greek, and Italian), are tailored according to the user's interests, background knowledge, and language skills. For example, simpler vocabularies and syntactic structures may be used with children than with adults and/or experts. A prototype has been implemented that is capable of generating descriptions of 50 archaeological exhibits of different kinds, which shows the potential of the developed architecture for the delivery of information presentations at public access kiosks in museums, or through the web (Figure 1). The interaction scenario has also been extended to an immersive virtual reality environment, where another prototype delivers descriptions in spoken form (Figure 2).
2. DYNAMIC GENERATION OF MULTILINGUAL PERSONALIZED DESCRIPTIONS
One starting point of M-PIRO is the text generation facilities for English developed within the ILEX (Intelligent Label Explorer) project [Oberlander et al. 1998; O’Donnell et al. 2001]. ILEX was capable of generating web-based descriptions of diverse exhibits of modern jewellery located in the Royal Museum of Scotland. ILEX is particularly well-adapted at generating facts about collections of objects, such as museum catalogues. In addition to novel features such as clause aggregation, ILEX marks individual facts with importance and interest scores, allowing the text planner to create useful exhibit descriptions despite variations in the underlying data for different exhibits. Within M-PIRO, these facilities have been extended along a number of dimensions:
Figure 1: An example of the M-PIRO web interface customized for an English adult
Multilinguality. M-PIRO supports multiple languages: English, Greek and Italian. The production of linguistic resources for the latter two has been pursued with the aim of reusing work done on a systemic grammar for English [O’Donnell 1996] as much as possible. The ultimate goal was to experiment with an approach to
multilingual generation that leverages grammar commonalities across different languages, allowing for faster resource development and easier maintenance [Bateman et al. 1991; Callaway et al. 1999]. Many languages share the same basic systems, when seen functionally, but differ in their realization, or in the way specific predicates or relations are phrased. A core set of grammatical systems has been identified starting from ILEX's English resources. Then, language-specific Italian and Greek [Dimitromanolaki et al. 2001a] resources have been defined as complementary grammatical subsystems, lexicons and micro-planning expressions.

Integrating existing text and automatically generated text. M-PIRO's interaction scenario and domain (the presentation of information to visitors of virtual archaeological museums) require a flexible integration of deep generation with shallow approaches. Indeed, much interesting information about relevant objects is already available to museum curators in the form of narrative text, anecdotes, or other complex texts that would be too difficult or expensive to generate from scratch. M-PIRO investigated this issue along various alternative directions. The implemented systems allow the flexible incorporation of existing pieces of text by associating (tri-lingual) strings to relevant domain entities and annotating them with information about their appropriateness with respect to user and discourse context. A template-based mechanism then allows the combination of strings with dynamically generated referring expressions or other strings, while the text planner guarantees that the resulting text is appropriately placed. In parallel, a more substantial integration of NLG and Adaptive Hypermedia (AH) technologies has been investigated to make the best of the strengths of each, e.g., by resorting to AH functionalities to convey information that could not be easily deployed by the deep generation component [Not and Zancanaro 2001]. The adopted AH technology is based on the Macronode approach [Not and Zancanaro 2000], which allows for the segmentation and annotation of pre-existing texts (as well as images and audio messages) into small content units. In this approach, the units can be reused in different discourse contexts after adjusting them dynamically to merge with the rest of the presentation.

Personalizing text with user modelling. Advanced user modelling facilities have been developed to support the personalization task by keeping track of what visitors see and the information that has already been presented to them. By refining previous models of users' interests and knowledge, we can produce descriptions that better match their needs. We can also arrange our descriptions so that they draw connections between previously seen exhibits, convey general information about styles or periods, and offer pointers to other related exhibits, thus helping the visitors obtain a more cohesive view of the presented information. M-PIRO's user models are more elaborate than ILEX's in that they contain information that allows the system to tailor not only the content but also the surface form of the resulting texts. It is thus possible to generate, for example, shorter sentences with simpler words when producing texts for children. We store the user modelling information in the personalization server (developed at NCSR "Demokritos"), a database that is interrogated by the system during generation and dynamically updated after each text is produced.
The personalization server stores both static parameters associated with user types (e.g., the appropriateness of language expressions per user type) and
dynamic information pertaining to individual users (e.g., the degree to which they have assimilated various pieces of information).

Authoring of domain and linguistic resources. To facilitate the specification of the domain-dependent knowledge and linguistic resources, which typically represents one of the major bottlenecks in developing and porting a generation system to different domains, an authoring tool (developed at NCSR “Demokritos”) was created that enables museum curators to update the information source from which the texts are generated, and control the language and content of the resulting descriptions [Androutsopoulos et al. 2001]. Unlike systems such as KPML [Bateman 1997], the authoring tool is not intended to assist language technology experts in creating and maintaining domain-independent linguistic resources, such as large-scale grammars. In that sense, M-PIRO's authoring tool is closer to the symbolic authoring facilities of DRAFTER [Paris et al. 1996] and GIST [Power and Cavallotto 1996]. M-PIRO allows the author to modify not only the assertional part of the database – i.e., its entries – but also the schema of the database. This is a consequence of the fact that M-PIRO targets a broader range of domains: any domain where descriptions of physical objects are needed. Besides museum collections, this extends, for example, to sales catalogues. Additionally, M-PIRO allows the authors to manipulate not only the information in the database, but also the domain-dependent resources that control how this information is rendered in natural language. This allows them to control, for example, the vocabulary and form of the generated sentences, as well as the structure of the resulting texts.

Integrating text generation with speech synthesis. Finally, M-PIRO has developed a state-of-the-art multilingual Concept-to-Speech system. Novel techniques are used to increase the naturalness of speech by tightly integrating speech synthesis with the NLG module. Synthesizers for English and Italian are constructed using Festival [Taylor et al. 1998], a common environment allowing the incorporation of newly developed resources; in our case for Italian [Cosi et al. 2001]. Greek speech synthesis is hosted by the DEMOSTHeNES Speech Composer [Xydas and Kouroupetroglou 2001a], a research tool for speech synthesis. M-PIRO aims at identifying the linguistic and discourse phenomena that significantly affect the intonational and prosodic contour of utterances, in order to allow the text generator to pass on this information to the speech synthesizers. The relevant information is delivered by the generation module through an XML-based annotation scheme. The quality of the audio output is of the utmost importance in immersive virtual reality environments, an option which M-PIRO actively explored in a prototype where users directly manipulated 3D objects to get contextualized and personalized audio descriptions (Figure 2).

3. TEXT GENERATION

Natural Language Generation (NLG) aims to develop computer systems that can effectively apply linguistic knowledge such as semantics, morphology and syntax, as well as knowledge of the application domain, to produce meaningful texts in human
languages, based on non-linguistic data [Reiter and Dale 2000]. The generation process in NLG systems typically consists of the following four main stages:

- Content Selection, in which the system selects those pieces of information from its database that are most appropriate for a particular communicative purpose, user model, etc.;
- Text Planning, which specifies the structure of the text by ordering the facts to be conveyed and by establishing the relations that hold between them;
- Microplanning, which involves the processes of lexicalization (the choice of the particular words and word types, e.g. choosing the right verb, tense, voice, person, etc.), aggregation (combining facts to achieve a more fluent and concise text), and referring expression generation (selecting appropriate phrases to refer to particular entities in the domain); and
- Surface Realization, which performs the mapping of the abstract specifications into natural text, including morphology, linear precedence, and formatting for display in web browsers.
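To make the division of labour concrete, the following Python sketch chains the four stages over a toy fact base. It is only an illustration under our own assumptions: the function names, data structures and the deliberately trivial stage implementations are not Exprimo's actual interfaces.

    def select_content(entity, facts, user, max_facts=3):
        # Content Selection: keep the facts about this entity the user is most interested in.
        relevant = [f for f in facts if f[0] == entity]
        relevant.sort(key=lambda f: user["interest"].get(f[1], 0), reverse=True)
        return relevant[:max_facts]

    def plan_text(selected):
        # Text Planning: here we simply preserve the selection order; a real planner
        # would also assign discourse relations between the facts.
        return list(selected)

    def microplan(plan, lexicon):
        # Microplanning: choose a phrasing for each fact.
        return [lexicon[pred].format(obj=obj) for (_, pred, obj) in plan]

    def realize(clauses):
        # Surface Realization: turn abstract clause specifications into formatted text.
        return " ".join(c[0].upper() + c[1:] + "." for c in clauses)

    facts = [("exhibit1", "type", "rhyton"),
             ("exhibit1", "painted-by", "the painter of Sotades"),
             ("exhibit1", "current-location", "the Musee du Petit Palais")]
    lexicon = {"type": "this exhibit is a {obj}",
               "painted-by": "it was painted by {obj}",
               "current-location": "today it is located in {obj}"}
    user = {"interest": {"type": 3, "painted-by": 2, "current-location": 1}}

    print(realize(microplan(plan_text(select_content("exhibit1", facts, user)), lexicon)))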
Figure 2: The 3D collection of the M-PIRO virtual reality prototype: spoken descriptions are generated for each selected object.
3.1 The Exprimo text generation system
The generation system used in M-PIRO is Exprimo, a modular NLG system that adheres to the fairly standard pipeline architecture [Reiter 1994] of Figure 3. In general, Exprimo contains a number of important advances over previous end-to-end generation systems. Among these are:

- The ability to generate in three languages using shared language resources
- The use of domain-specific linguistic input from an authoring tool
- Integration with user modelling
- The ability to change the text, both lexically and syntactically, depending on user type (personalization)
The domain model consists of the domain database and the domain semantics. The domain database provides specifications for the classes of the entities present in the domain (like ‘exhibit’, ‘god’, ‘historical-period’, and so on), as well as the entities themselves (such as ‘exhibit1’, ‘Apollo’, ‘Hellenistic-period’).
Figure 3: A generalized pipeline NLG architecture, to which Exprimo adheres
Each of the entities is described by means of a number of predicates, which are either attributes (e.g., ‘exhibit-height’ takes values like ‘2.12m’) or relations that can exist between the entity and members of other classes in the domain (e.g., the ‘original-location’ of an exhibit will be defined by another entity in the domain, such as ‘Pergamos’). Typically, each class defines the set of predicates that are relevant to its members. The domain semantics allows the system to interpret the domain database by providing predicate definitions and a class taxonomy. The predicate definitions specify domain restrictions and the way in which the predicates should be expressed. For example, the definition of the predicate ‘current-location’ will specify that this predicate can only be used to relate an instance of the basic class ‘exhibit’ to an
instance of the class ‘museum’. The taxonomy defines the hierarchy of classes in the domain.

The Content Potential is an internal representation of the domain model in the form of a graph, reflecting the relations that exist between the entities in the database. Each instance of a relation or attribute in the domain (a fact) is a node of the graph (a fact-node), linked to the target domain objects (entity-nodes) by means of two arcs corresponding to the arguments of the predicate. The Content Potential represents the information that the system can communicate to the user, and the way in which the individual pieces of information are connected to each other.

When the system receives a user request for an entity description, the system makes the particular entity the ‘global focus’ of the text to be generated [Mellish et al. 1998]. The facts that are connected to the entity in global focus (i.e. those satisfying the criterion of structural relevance) become candidates for selection by the system. The facts that will eventually be selected by the system are most likely to be those best matching the criterion of inherent relevance as well, which is evaluated on the grounds of the assumed interest of the content to the user, the importance of the fact in question relative to the pedagogical goal of the system, and the user’s level of assimilation of the information [Not et al. 2001].

The Text Planning module builds the structure of the text by ordering the facts and defining the dependencies that exist between them. The task of the text planner is to build the text-structure, which is a tree whose leaves are the selected facts (in other words, each leaf is a mapping of a fact onto a text-node). The internal nodes of the text structure specify the exact order in which the facts should appear in the output text, and the discourse relations that hold between those facts. The microplanner takes this “rough” document plan and produces a detailed specification for the text, which is used by the realization module to produce the surface document. Microplanning includes the language-dependent task of lexicalization that involves not only the choice of the words to use, but also of the syntactic structures that will be used to express the content of the propositions.

Exprimo performs surface realization using its computational grammar for the specific target language: a systemic grammar [Halliday 1985] that consists of choice systems, each offering a set of alternatives representing grammatical features. Realization is carried out by means of a series of choices (traversal of the systemic network), where each choice is made at a different level in the hierarchy of ranks, starting from the clause-complex rank and ending at the word-rank, which specifies the exact grammatical and syntactic features at the lexical level.
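The sketch below shows, in Python, one simplified way a content potential and its relevance-based selection could be organised; the graph encoding, the scoring formula and all identifiers are our own illustrative assumptions, not Exprimo's internal representation.

    # Each fact node links two entity nodes via a predicate.
    facts = [
        ("f1", "exhibit24", "creation-period", "classical-period"),
        ("f2", "exhibit24", "current-location", "musee-du-petit-palais"),
        ("f3", "exhibit24", "painted-by", "painter-of-sotades"),
        ("f4", "exhibit9", "creation-period", "archaic-period"),
    ]

    # Assumed per-user parameters (interest, importance, assimilation in [0, 1]).
    user = {
        "interest":     {"creation-period": 0.9, "current-location": 0.4, "painted-by": 0.7},
        "importance":   {"creation-period": 0.8, "current-location": 0.5, "painted-by": 0.6},
        "assimilation": {"f1": 0.0, "f2": 0.0, "f3": 0.5, "f4": 1.0},
    }

    def candidate_facts(global_focus):
        # Structural relevance: facts directly connected to the entity in global focus.
        return [f for f in facts if global_focus in (f[1], f[3])]

    def inherent_relevance(fact):
        fact_id, _, predicate, _ = fact
        score = user["interest"][predicate] * user["importance"][predicate]
        return score * (1.0 - user["assimilation"][fact_id])

    selected = sorted(candidate_facts("exhibit24"), key=inherent_relevance, reverse=True)[:2]
    print([f[0] for f in selected])   # ['f1', 'f3'] given the parameters above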
3.2 Integration with the Authoring Tool

Exprimo is tightly integrated with a domain authoring tool. There are two main aspects to this. First, the domain descriptions produced using the tool can be stored as XML files which Exprimo can then use to generate text. Second, the authoring tool can use Exprimo to preview texts, giving immediate feedback to authors on the accuracy and appropriateness of new additions. The authoring tool supports user modelling, allowing information on the user types and specifications such as the desired number of facts (and thus page length) for each user type to be set. It is also
used to set the interest, importance, and repetitions for each user type. The repetitions parameter defines how many times the information must be conveyed before we can consider that the user has assimilated it. In general, the domain author can specify the interest, importance and repetitions of a field at the entity type (class) that introduces it. In the current domain, for example, the interest of the ‘creation-period’ field for adults is set to 3 at the ‘exhibit’ type, ensuring that all ‘exhibit’ entities inherit that value. The inherited interest, importance, and repetition scores can then be overridden at individual entities.

The authoring tool also allows the author to specify microplans. These are expressions controlling the lexicalization of each field (e.g., which verb to use, in what tense and voice, etc.) and encoding preferences that affect aggregation and the generation of referring expressions (e.g., whether or not the resulting clause may be aggregated, the desired case of the referring expressions, etc.). Multiple microplans can be specified per field, and each microplan is assigned a numerical score that indicates its appropriateness for each user type. Preference is given to microplans with the highest appropriateness scores that have been used fewer times in documents generated for the particular visitor. This allows the author to control the language expressions per user type (e.g., use simple verbs in active voice when generating for children), while leading at the same time to greater variation in the surface form.
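A minimal sketch of how class-level defaults, per-entity overrides and scored microplans could interact is given below. The values, field names and selection rule are invented for illustration and do not reproduce the authoring tool's actual formats.

    # Interest/importance/repetitions are defined at the class that introduces the field,
    # inherited by entities, and optionally overridden per entity.
    class_params = {("exhibit", "creation-period"):
                        {"adult": {"interest": 3, "importance": 2, "repetitions": 2}}}
    entity_overrides = {("exhibit7", "creation-period"): {"adult": {"interest": 1}}}

    def field_params(entity, entity_type, field, user_type):
        params = dict(class_params[(entity_type, field)][user_type])   # inherited defaults
        params.update(entity_overrides.get((entity, field), {}).get(user_type, {}))
        return params

    # Several microplans per field, each scored per user type; prefer high scores,
    # breaking ties in favour of microplans used less often for this visitor.
    microplans = [
        {"id": "mp1", "verb": "create", "voice": "passive", "score": {"adult": 3, "child": 1}},
        {"id": "mp2", "verb": "make",   "voice": "active",  "score": {"adult": 2, "child": 3}},
    ]

    def choose_microplan(user_type, usage_counts):
        return max(microplans,
                   key=lambda mp: (mp["score"][user_type], -usage_counts.get(mp["id"], 0)))

    print(field_params("exhibit7", "exhibit", "creation-period", "adult"))  # interest overridden to 1
    print(choose_microplan("child", {"mp2": 1})["id"])                      # 'mp2' still wins for children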
3.3 Aggregation

The Content Selection module packages information as “verb-based, clause-sized propositions, each of which is realised as a single sentence” [Shaw and McKeown 1997]. Texts composed exclusively of such single-fact sentences are very likely to contain repetitions and redundancies, and are almost certain to be considered boring and unnatural by human readers. To overcome this problem, NLG systems make use of aggregation algorithms, which combine semantically related propositions in order to produce a more concise and more coherent text.

Exprimo is designed to generate personalised descriptions, based on its model of potential user types. In this context, it is desirable that the aggregation module also possess the flexibility to adjust the syntactic complexity of the output text to the requirements of the target user type. The authoring facilities of M-PIRO allow domain authors to specify the maximum number of facts that the system should convey to a particular type of user in each sentence. At the heart of Exprimo's aggregation module is a sophisticated algorithm that allows the system to examine the complexity and the quality of the constructed sentences in parallel. During each iteration of this algorithm, the system examines the effect of adding one more fact to the sentence being constructed on the quality of the sentence (provided that the sentence has not already reached the maximum size allowed by the user modelling parameter), as well as on the quality of the remaining (that is, of the still unaggregated) text. Thus, the system never sacrifices text quality in order to increase the length of a sentence.
The module is equipped with a set of standard aggregation operations (such as simple conjunction and syntactic embedding), each of which is able to express a certain type of rhetorical relation holding between two neighbouring facts in the text plan. Even though the module is capable of producing a significant variety of syntactic forms based on this set, as the length of a sentence increases, the module is more likely to employ one of its non-standard aggregation operations to resolve conflicts between sequences of incompatible standard operations. This may result in complex sentences using semicolons, qualifiers, etc. [Melengoglou 2002].

The following example illustrates the effect of different values of the user modelling parameter on a series of four facts generated by the system:

- maximum facts per sentence = 1 (no aggregation)

This exhibit is a rhyton. It was painted by the painter of Sotades. It was painted with the red figure technique. Today it is located in the Musee du Petit Palais.

- maximum facts per sentence = 2

This exhibit is a rhyton, painted by the painter of Sotades. It was painted with the red figure technique and today it is located in the Musee du Petit Palais.

or

This exhibit is a rhyton. It was painted by the painter of Sotades with the red figure technique. Today it is located in the Musee du Petit Palais.

- maximum facts per sentence = 3

This exhibit is a rhyton, painted by the painter of Sotades with the red figure technique. Today it is located in the Musee du Petit Palais.

- maximum facts per sentence = 4

This exhibit is a rhyton; it was painted by a painter called the painter of Sotades with the red figure technique and today it is located in the Musee du Petit Palais.
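The following Python fragment gives a deliberately simplified impression of the iteration described above: facts are added to the sentence under construction as long as the user-type limit is not exceeded and a quality test is passed. The quality test here is a trivial placeholder; it, and everything else in the sketch, is an assumption rather than the algorithm actually used in Exprimo.

    def aggregate(facts, max_facts_per_sentence, acceptable=lambda sentence, remaining: True):
        sentences, current = [], []
        for fact in facts:
            # Tentatively add one more fact, checking both the sentence limit and a quality test.
            if len(current) < max_facts_per_sentence and acceptable(current + [fact], facts):
                current.append(fact)
            else:
                sentences.append(current)
                current = [fact]
        if current:
            sentences.append(current)
        return sentences

    facts = ["type=rhyton", "painted-by=Sotades", "technique=red-figure", "location=Petit Palais"]
    for limit in (1, 2, 4):
        print(limit, aggregate(facts, limit))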
3.4 Comparisons
As already discussed, the system stores the information that it can convey to the users about the entities in its database in the form of fillers (which are either values or pointers to other entities) of a number of predicates that are common to the members of a certain class. For instance, members of the class exhibit can have predicates such as creation-period and original-location, while members of the class person can have different ones like person-country, etc. Thus, each entity can be described by means of a finite set of predicates, and it is possible to group the
members of a class in a number of ways, based on the fillers of their common predicates. For example, in the domain of museum exhibits, we may classify exhibits according to their origin (e.g. several exhibits originate from Attica, others from Argos, etc.), according to the historical period during which they were created (Archaic, Classical, Hellenistic, etc.), and so on.

Exprimo generates comparisons between the current and previously examined entities, thus helping the users of the system build a more complete picture of the concepts that are presented, make associations and reach valuable conclusions. Because such comparisons arise from the user’s own navigation, they result in texts that are more personalised and, therefore, more interesting, attractive and intelligent. Two types of comparisons are performed, based on entity similarities and differences (in terms of the resulting discourse relations: RST-SIMILARITY and RST-CONTRAST). ‘Contrast’ comparisons always take place between the entity in current focus and a group of previously visited entities (never an individual entity). This group is always directly related to the current entity; it belongs either to the same class as the entity, or to one of the entity’s super-classes.

Central to the design of this module is the aim to avoid making individual, full-clause references to previously seen entities, such as: “Like the amphora that was decorated by the painter of Kleofrades, this lekythos was created during the archaic period”. Such past references both distract the users from their focus of attention (the current exhibit) and tend to lead to confusion, as users are unlikely to remember which of the (possibly several) exhibits they have already seen is the one that actually matches a description. Thus, such comparisons tend to be boring, if not irritating, while their educational value is questionable. The comparison module used in Exprimo attempts to overcome this problem by using the type hierarchy to group previously examined exhibits into broader categories (classes) and make either group references (references to the class, if a comparison holds for all the previously mentioned members of the particular class), or short (name-only) individual references, when the system knows that the exhibit’s name is sufficient to make a unique reference. In order to ensure that the generated comparisons are meaningful, the module performs comparisons only between entities belonging to the same sub-type of the class ‘exhibit’ (i.e. between vessels, statues, coins, and so on). For instance, the system will attempt to compare a panathenaic amphora with a hydria or a black kantharos (as they are all vessels), but never, say, with a drachma or with a suit of armour.

The following descriptions of seven vessels illustrate the use of comparisons in Exprimo:

This exhibit is a white lekythos; it was created during the classical period and it dates from circa 440 B.C. […] This white lekythos was decorated by the Painter of Achilles. Scenes related to war were quite frequent in Athenian art during this period.
This exhibit is a panathenaic amphora. Like the white lekythos, it was created during the classical period. […] This panathenaic amphora dates from 490 B.C. and it was painted with the black figure technique. It originates from Attica.

This exhibit is a marriage cauldron; it was created during the classical period and it dates from between 420 and 410 B.C. […] This marriage cauldron was decorated with the red figure technique. Like the previous vessels, this marriage cauldron was originally from Attica.

This exhibit is a hadra ware hydria, created during the hellenistic period, which ranged between 323 and 31 B.C. […] Unlike the previous vessels, which were originally from Attica, this hadra ware hydria was originally from Rhodes. Today it is located in the Archaeological Museum of Rhodes.

This exhibit is a panathenaic amphora; it was created during the hellenistic period and it dates from 320 B.C. Like the previous panathenaic amphora, it was painted with the black figure technique. It is now exhibited in The John Paul Getty museum. […]

This exhibit is another amphora, created during the archaic period. It dates from the early 5th century B.C. […] This amphora was painted by the painter of Kleofrades. Unlike the previous amphoras, which were painted with the black figure technique, this amphora was painted with the red figure technique.

This exhibit is an aryballos; it was created during the archaic period and it belongs to the corinthian type. It is spherical in shape. Like the panathenaic amphoras, this aryballos was decorated with the black figure technique. […]
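As an illustration of the grouping idea behind these comparisons, the short Python sketch below collects previously visited exhibits of the same sub-type and checks whether a shared or differing attribute licenses a similarity or a contrast statement. The representation and the wording of the output are our own simplifications.

    def compare(current, history):
        # Only compare with previously seen exhibits of the same sub-type (e.g. vessels with vessels).
        peers = [e for e in history if e["subtype"] == current["subtype"]]
        if not peers:
            return None
        origins = {e["origin"] for e in peers}
        if len(origins) == 1:
            shared = origins.pop()
            if current["origin"] == shared:
                # RST-SIMILARITY
                return f"Like the previous {current['subtype']}s, this one is from {shared}."
            # RST-CONTRAST
            return (f"Unlike the previous {current['subtype']}s, which were from {shared}, "
                    f"this one is from {current['origin']}.")
        return None

    history = [{"subtype": "vessel", "origin": "Attica"}, {"subtype": "vessel", "origin": "Attica"}]
    print(compare({"subtype": "vessel", "origin": "Rhodes"}, history))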
3.5 Forward pointers
An important aspect of the web-based presentation of objects is the generation of forward pointers, which suggest a number of alternative paths that the users can follow to examine other objects that are in some way related to the one in the current description. Exprimo generates lists of forward pointers in all three languages dynamically, using the generation engine.

The Forward Pointers module locates those pieces of information (attributes of the current exhibit) that are assumed to be most relevant to the user (according to the user modelling assumptions made by the
domain author), and builds lists of hypertext references to exhibits which have not yet been visited by the user, and which also possess the attribute in question. The ‘relevance’ of each fact in this context is the value of the inherent relevance parameter (see section 3.1), and is thus not synonymous with interest. Apart from the user’s assumed interest in the fact, this parameter also considers the importance of this information to the communicative objectives of the system, and the degree to which a fact is already known to the user (assimilation of the fact). This allows the Forward Pointers module to be more flexible whenever the user requests subsequent descriptions of the same object, as facts which are introduced to the user for the first time (i.e. unassimilated facts) have a higher probability of forming a list of forward pointers.

The following example presents a description of an ancient Greek coin, together with the generated lists of forward pointers, which are associated with particular facts in the description:

This exhibit is another tetradrachm, created during the classical period. It dates from between 440 and 420 B.C. It has an image of Athena crowned with a branch of olive, her tree, on its obverse. On the reverse there is a picture of her owl. Like the previous tetradrachm, this tetradrachm is made of silver. It was originally from Attica.

Other exhibits created during the classical period:
- A hydria painted by the painter of Meidias
- A lekythos decorated with the red figure technique

Other exhibits made of silver:
- A stater that originates from Croton
- A drachma originally from Attica
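A possible rendering of this selection logic is sketched below in Python; the relevance weighting and the data layout are invented for the example and are not Exprimo's actual structures.

    def forward_pointers(current, exhibits, visited, relevance, top_n=2):
        pointers = {}
        # Pick the most relevant attributes of the current exhibit...
        for attr in sorted(current["attrs"], key=lambda a: relevance.get(a, 0), reverse=True)[:top_n]:
            value = current["attrs"][attr]
            # ...and link to unvisited exhibits that share the same attribute value.
            targets = [e["name"] for e in exhibits
                       if e["name"] not in visited and e is not current
                       and e["attrs"].get(attr) == value]
            if targets:
                pointers[f"Other exhibits with {attr} = {value}"] = targets
        return pointers

    exhibits = [
        {"name": "tetradrachm-2", "attrs": {"period": "classical", "material": "silver"}},
        {"name": "stater-1",      "attrs": {"period": "archaic",   "material": "silver"}},
        {"name": "hydria-3",      "attrs": {"period": "classical", "material": "clay"}},
    ]
    print(forward_pointers(exhibits[0], exhibits, visited={"tetradrachm-2"},
                           relevance={"period": 0.9, "material": 0.8}))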
3.6 Generating Text with Adaptive Hypermedia Techniques: Macronodes
In many application domains, full natural language generation is not the most adequate solution due to inefficiency, high development costs, and the difficulty of formalizing knowledge ontologies. Often, pre-existing texts (and related images or
audio commentaries) prepared by domain experts are a valuable knowledge source for assembling information presentations. Hybrid systems that combine both NLG and template techniques represent in these cases a promising solution [Reiter 1995]. Exprimo is a good candidate in this direction: it provides sophisticated NLG machinery to support generation from scratch (i.e., generation from a knowledge base representing entities and facts related by semantic relations) as well as mechanisms to support the integration of flexible templates. However, when large amounts of material must be reused, techniques from the field of Adaptive Hypermedia (AH) seem to be more effective in shaping and composing the existing information.

Within M-PIRO we have investigated the benefits of complementing the Exprimo NLG system with a flexible approach to the composition of canned text, such as the one supported by the MacroNode adaptive hypermedia architecture, thus achieving a trade-off between information reuse and flexibility. In this enlarged scenario, Exprimo’s NLG machinery allows for the generation of the most dynamic parts of the presentation, whereas the AH machinery allows for the enrichment of content and phrasing, while still allowing for personalization.

As a starting point for AH, we have considered the MacroNode formalism [Not and Zancanaro 2000], which has been designed with the purpose of building a new kind of flexible hypermedia system that can overcome the limitations of current adaptive hypermedia and the difficulties and costs of dynamic hypermedia. In this framework, available multimedia material is fragmented into content units which are annotated with essential semantic and linguistic data (i.e. a description of the main topic of the content unit), the semantic relations existing with other units (in terms of RST rhetorical relations [Mann and Thompson 1987]), and the different ways in which the unit can be presented to the user. A content unit is called a macronode and, for textual information, it typically corresponds to a paragraph. More generally, a macronode can correspond to data in different media, for example text, audio, images, or a combination of these.

The macronode approach exploits text planning techniques and linguistic manipulation rules derived from the field of NLG when composing from content units. At generation time, the annotated data is processed by a Presentation Composer that uses NLG text planning strategies to determine the most appropriate discourse organization for the macronodes relevant to the current discourse context. The module explicitly maintains a representation of the discourse history and is able to exploit a model of the user's interests and knowledge to best tailor the output. The schema-based text planning phase, which relies on the rhetorical organization of the macronodes in the repository, guarantees that the composed message displays (i) a coherent rhetorical structure, (ii) a topic flow compatible with the discourse and interaction context, and (iii) cross-references to material already presented as well as to the user's interests. For example, discourse strategies are encoded to avoid presenting already known information, to choose the kind of information for which the user's interest is higher, and to present new information when the user goes back to a previous topic.

In order to assure cohesion, each macronode also encodes different realizations of the same content. These alternative realizations are stored as a graph.
Each node of the graph represents a minimal unit of content, whereas arcs contain the conditions under which the target node can be
realized (given that the source node has been realized). Each path in the graph is a possible realization of the content of the macronode. Macronodes can be seen as an implementation of portions of canned text whose composition and linguistic form can be adjusted according to the discourse and interaction context. The advantage of splitting a text into small content units is that it allows reusing the same material in different discourse contexts, and permits a more flexible integration with the rest of the presentation. By integrating the MacroNode approach into Exprimo, the final system would be able to (i) reuse existing texts that could otherwise not be easily generated by Exprimo, (ii) still guarantee that they are integrated coherently in the overall presentation and partially adapted to the user needs, and (iii) opportunistically select the most appropriate generation machinery according to the type of information to be conveyed. Although, from the implementation point of view, the macronode approach has not been completely integrated into Exprimo, substantial headway has been made using two separate methods: constructing a new discourse planner that controls both Exprimo and the Macronode system as surface realizers, and adding discourse planning rules to Exprimo that would allow it to use macronodes as an alternate surface realizer [Not and Zancanaro 2001].

4. AUTHORING TOOLS FOR DOMAIN AND LINGUISTIC KNOWLEDGE

One of M-PIRO's most ambitious goals has been to produce technology that is not only portable to new domains where object descriptions are needed, but portable to such domains by domain experts alone. By "domain experts" we mean people with expertise on the objects to be described, but with no background in language technology. Curators from the Foundation of the Hellenic World in Athens were involved from the beginning of the project in the formative evaluation, which helped define usability specifications for the authoring software.

The authoring tool is the main instrument for attaining portability. It allows domain experts, such as museum curators, to configure the system for a new collection of exhibits, or to modify the system's knowledge of an existing collection, by allowing them to manipulate the system's domain-dependent information. The resources of M-PIRO that are most important for the authoring tool include:
- The Database Schema: A structural description of the database, specifying:
  o The types of entities that are modelled by the database, such as artist, coin, statue, and vessel.
  o The sub-types and super-types of each entity type, in the form of a hierarchy. For example, coin, statue, and vessel are all sub-types of the entity type exhibit; vessel has sub-types like amphora and hydria, and artist has sub-types like sculptor and painter.
  o The types of relationships that entities of the various types may participate in. In the current domain, for example, there is a sculpted-by relationship between statue and sculptor entities, in that entities of the former type were sculpted by entities of the latter.
  o The attributes that entities of each type possess, such as the name and height of a statue entity.
  The current implementation of the authoring tool provides full control over all the parts of the database schema.
- The Database Entries: These provide information about particular entities. For example, there may be a statue entity whose name is “Doryphorus”, whose height is 2.12m, and which was ‘sculpted-by’ a sculptor entity called “Polyklitus”. The current implementation allows domain experts to insert, and later modify, database entries. Future work will also consider ways to import entries from existing databases, especially databases complying with museum standards.
- Domain-dependent Linguistic Resources: Many of M-PIRO's linguistic resources, such as the grammars of the supported languages and the lexicons that contain closed-class words, like articles and prepositions, are domain-independent. Some linguistic resources, however, are domain-dependent, and need to be configured when moving to a new application domain:
  o The domain-dependent lexicon. This provides information about nouns and verbs that are relevant to the domain, including semantic information, such as the types of entities that a noun can refer to. The authoring tool provides facilities to manipulate the domain-dependent lexicon, which hide from the domain experts parts of the lexicon that require linguistic expertise. For example, there are mechanisms that infer automatically morphological features and generate all the forms of inflected words from their base forms.
  o Micro-planning expressions. These specify how relationships and attributes can be conveyed in the supported languages; for example, which verb to use to express a relationship, how the participants of the relationship relate to the subject and object of the resulting phrase, etc. The current implementation of the authoring tool provides full control over micro-planning expressions.
- Partly Pre-generated Texts: These are half-ready textual descriptions of objects that are used when the information to be conveyed is too complex or too object-specific to express via full natural language generation. M-PIRO addresses two kinds of partly pre-generated texts: parts of templates, which are almost completely canned pieces of text, and the macronodes described above. Facilities for authoring macronodes have been developed and tested separately.

M-PIRO's authoring tool (Figure 4) is intended to be used at two stages: domain authoring and exhibit authoring. Domain authoring defines the database schema and the domain-specific linguistic resources, while exhibit authoring provides the
database entries and the partly pre-generated texts. Although the authoring tool is designed to be used by domain experts with no language technology expertise, some general training on the use of the tool is still required at both stages. It is possible to train only a few domain experts as domain authors, who will be responsible for the initial configuration of the system and “advanced” modifications, such as the addition of new types of exhibits, or the tailoring of the domain-dependent linguistic mechanisms. The task of entering and maintaining information about particular exhibits can then be assigned to a larger number of exhibit authors, who may have received briefer training on the use of the authoring tool.

M-PIRO's authoring tool also provides a preview mechanism, which helps the authors detect anomalies in the content or realization of the generated texts, directing them to related parts of the database and domain-specific language resources. This can be seen as an attempt to link symbolic authoring to the WYSIWYM approach [Power and Scott 1998], where authors interact with the system entirely via generated texts that reflect both the content of the database and the options that are available to update it.

4.1 Domain authoring
The current implementation of the authoring tool provides full control over all the parts of the database schema, namely the hierarchy of the available entity types, relationships, and attributes. The available entity types and their hierarchy are defined by manipulating the tree structure in the left pane of the database tab. In Figure 5, the tree structure shows that both vessel and statue are sub-types of the entity type exhibit, while amphora, hydria and rhyton are some of the sub-types of vessel. Types like exhibit, historical-period and place in Figure 5, that have no super-types, are called basic entity types; they are shown as children of the pseudo-node basic entity types in the tree structure of the authoring tool. Although M-PIRO's generation engine supports multiple inheritance (i.e., entity types with more than one super-type), the tool only allows single inheritance, to make it easier for domain experts to use.

The tree structure initially contains only the basic entity types node and the built-in data-types. Basic entity types have to be linked to types of the Upper Model [Bateman 1992], a built-in domain-independent hierarchy of entity types. Linking a basic type to a type of the Upper Model in effect allows the generation engine to treat the basic type as a sub-type of the Upper Model type. A basic type can be linked to more than one type of the Upper Model; this is the only case where the authoring tool in effect allows multiple inheritance.
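The fragment below sketches, purely for illustration, the kind of single-inheritance hierarchy the tool maintains, with basic entity types linked to Upper Model types. The type names and the data model are assumptions, not the authoring tool's internal format.

    # Single inheritance among domain types; only basic types link to the Upper Model.
    parent = {"vessel": "exhibit", "statue": "exhibit", "amphora": "vessel", "hydria": "vessel"}
    upper_model_links = {"exhibit": ["object"], "historical-period": ["time-interval"], "place": ["space-point"]}

    def ancestors(entity_type):
        chain = []
        while entity_type in parent:
            entity_type = parent[entity_type]
            chain.append(entity_type)
        return chain

    def upper_model_types(entity_type):
        basic = ([entity_type] + ancestors(entity_type))[-1]   # the basic type has no super-type
        return upper_model_links.get(basic, [])

    print(ancestors("amphora"))          # ['vessel', 'exhibit']
    print(upper_model_types("amphora"))  # ['object']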
Figure 4: The M-PIRO authoring tool
4.2 Domain-dependent linguistic resources
Each entity type can be associated with one or more nouns; for example, the entity type statue can be associated with the noun whose lexicon identifier is statue-noun, and whose English form is “statue”. This allows the generation engine to use the nouns to refer to the corresponding entity types (e.g., “This is a statue.”), and to generate appropriate noun phrases when referring to entities (e.g., “This statue was sculpted by…”). The associations between entity types and nouns are specified in the right pane of the database tab, as shown in Figure 5. The area below the table shows the nouns that are associated with the entity type that has been selected in the left pane, as well as nouns that are associated with super-types of the selected type.

New nouns can be added to the domain-dependent lexicon through the lexicon tab of the authoring tool. The tool encourages the domain authors to keep the lexicons of the supported languages aligned, by treating each lexicon entry as a triplet that contains information about equivalent nouns in the three languages. The internal form of the entries is more complicated: for example, the Greek part of the entry contains several features pertaining to the inflection of the noun, the
position of the stress, etc. The authoring tool incorporates mechanisms that hide features of this kind that require linguistic expertise [Dimitromanolaki et al. 2001b].
Figure 5: Domain hierarchy (left) and linguistic editing (right)
M-PIRO’s generation engine also incorporates morphology rules that allow the system to produce automatically all the forms of inflected words from the base forms that the authors enter. Although the morphology rules cover the most common irregular forms, they may occasionally fail to generate the correct forms of words whose inflection is irregular, and this will be reflected in the previews of the generated texts. Another popup menu displays a table that allows the domain authors to correct those word forms by hand.

In a similar manner, new verbs can be added to the domain-specific lexicon, which can then be used with micro-planning expressions. As with nouns, verb entries are treated as triplets containing information about equivalent verbs for each language, and there are mechanisms to hide from the author features that require linguistic expertise and to generate automatically all verb forms. Morphology rules are particularly important in Greek and Italian, where verbs are highly inflected.

4.3 Micro-planning expressions
Micro-planning expressions (Figure 5, bottom right) specify how relationships and attributes can be conveyed in the supported languages; for example, which verb to use to express a relationship, how the participants of the relationship relate to the
subject and object of the resulting phrase, etc. As discussed previously, relationships and attributes are represented in the authoring tool by fields. There are two types of micro-planning expressions, clause plans and templates, and the domain author can select between the two [Androutsopoulos et al. 2001]. In clause plans, the domain author specifies the verb to be used (from those available in the domain-specific lexicon), the voice and tense of the resulting clause, the preposition (if any) to be included between the verb and the object of the clause, any adverb (e.g., “frequently”) to be included, and strings to be concatenated as adjuncts at the beginning or end of the clause (e.g. “In case of fire, ring the bell.”).

Templates provide much stricter control over the surface form of the expression than do clause plans. They contain sequences of elements, which, after being instantiated, are simply concatenated to produce the final expression. Among other things, they are able to produce forms of expression that are not clauses. Each element (slot) can be filled either by a particular string, by an expression that refers to the owner of the field, or by an expression that refers to the filler of the field. In the last two cases, it is also possible to specify that a particular type of referring expression (e.g., pronoun) should be generated.
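The two kinds of micro-planning expression might be represented roughly as follows; the field names, the example template and the rendering function are hypothetical stand-ins for the tool's internal format.

    # A clause plan: parameters from which the grammar assembles a clause.
    clause_plan = {
        "verb": "paint", "voice": "passive", "tense": "past",
        "preposition": "by", "adverb": None,
        "adjunct_front": None, "adjunct_end": None,
    }

    # A template: a sequence of slots that are instantiated and simply concatenated.
    template = ["owner", " was found in ", "filler", "."]

    def fill_template(slots, owner_ref, filler_ref):
        refs = {"owner": owner_ref, "filler": filler_ref}
        return "".join(refs.get(slot, slot) for slot in slots)

    print(clause_plan["verb"], clause_plan["voice"])
    print(fill_template(template, "This lekythos", "Attica"))
    # -> "This lekythos was found in Attica."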
4.4 Macronode authoring

To assist in developing the MacroNode aspect of generation, which is significantly more difficult than the wholesale inclusion of completely canned text, a separate authoring tool has been created to allow for macronode creation. Each small piece of text must contain a conditional statement specifying when it will be included in the final text, along with its placement in the macronode's message graph. This entails entering information on rhetorical type, pronominalization, user modelling parameters, etc. The MacroNode authoring tool allows these structures to be created by museum staff using a GUI instead of editing the XML structures that the system itself uses.

Because the text fragments of the content units associated with each macronode are written as fixed strings, museum staff are able to create complete texts more quickly and easily than with the full-scale authoring system. This has benefits such as rapid prototyping, which allows exhibit authors to decide whether particular exhibit descriptions are worth the cost of being more deeply developed. A previewing capability is also included, allowing the author to change different parameters such as user type, language and level of expertise, and then see the regenerated text to ensure it has been properly structured.
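To make the graph of alternative realizations more concrete, here is a small Python sketch in which nodes are content fragments and each arc carries a condition on the discourse context; any path from the start node is one possible realization. The condition vocabulary and the fragments are invented for the example.

    # Arcs: (source, target, condition); a condition is a function of the discourse context.
    nodes = {
        "n0": "This vase shows a wedding procession.",
        "n1": "It was found, like the previous exhibit, in Attica.",
        "n2": "It was found in Attica.",
    }
    arcs = [
        ("n0", "n1", lambda ctx: ctx["previous_from_attica"]),
        ("n0", "n2", lambda ctx: not ctx["previous_from_attica"]),
    ]

    def realize_macronode(start, context):
        text, current = [nodes[start]], start
        while True:
            options = [t for (s, t, cond) in arcs if s == current and cond(context)]
            if not options:
                return " ".join(text)
            current = options[0]
            text.append(nodes[current])

    print(realize_macronode("n0", {"previous_from_attica": True}))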
4.5 Reusing existing language resources

M-PIRO reuses a large-scale systemic grammar of English, which derives from ILEX and WAG [O’Donnell 1996], and the Italian and Greek grammars of M-PIRO reuse large parts of that grammar. The Upper Model, which has been used in several natural language generation applications, is another existing resource employed in M-PIRO. Monolingual large-scale lexicons in electronic form, such as WordNet and LDOCE [Fellbaum 1998; Wilks et al. 1996], constitute an alternative type of
semantic/linguistic resource to M-PIRO’s domain-dependent lexicon. As in most natural language generation applications, however, we expect M-PIRO’s domain-dependent lexicon to be rather small. In most domains, it will typically contain a few dozen lexical entries, many of which will be for specialized terms (e.g., “kouros”, “stater” in the current domain) that are not included in general-purpose lexicons. Hence, the potential gain from using existing monolingual large-scale lexicons is rather small.

Multilingual domain-specific glossaries are of greater value, as they often provide exact mappings of terms across languages. They would be particularly useful when the speakers of the other languages that the domain author has access to are not domain experts, and hence may not be certain about the appropriate terms to be used in their languages. Even in those cases, however, using the glossaries is more a matter of offering easy access to them, perhaps by installing them on the same workstations as the authoring tool, rather than a matter of integrating them into M-PIRO’s technology. Overall, then, the potential gain from attempting to incorporate existing lexical resources into M-PIRO’s technology is rather small.

5. SPEECH SYNTHESIS

Another significant aspect of the M-PIRO project is the simultaneous generation of speech coordinated with the textual, web-based presentation. Adding audio allows M-PIRO to be used in auditory-only interfaces (e.g. mobile environments) or in applications where the text cannot be shown (e.g. a 3D virtual reality environment). Traditionally, synthetic speech does not sound very natural, since generic models are used in order to cover documents from almost any domain (e.g. news, law, or culture). In M-PIRO we exploited the constraints of this specific domain in order to deliver more natural synthetic speech by accentuating focus information and rendering more realistic intonational curves. To achieve this, we set up an enriched pipeline between the text generator and the text-to-speech system that transmits meta-information about the actual spoken text, thus allowing us to exploit more linguistic phenomena during prosody generation. This pipeline is based on the SOLE markup notation [Hitzeman et al. 1999] and has been extended according to the domain requirements. We have adapted the Festival speech synthesis system to improve on existing English speech and add support for Italian. Furthermore, we developed the DEMOSTHeNES Speech Composer in order to efficiently accommodate Greek resources and host Greek speech synthesis.

Traditional text-to-speech conversion fails to vocalize texts in a rhetorical way because of the generalizations of the underlying prosodic models and the lack of a reliable pragmatic analysis to detect the context of the document. Moreover, when going from text to speech, every stage of the process introduces some error: for example, part-of-speech identification fails in 3% of the cases for Greek when statistical taggers are used. In addition, intonation events are often quantized to a set of 4-5 marks in order to avoid training from large data sets and thus reduce the probability of getting exceptional values in the F0 curve (e.g. ToBI annotation offers more than 20 discrete mark combinations, while only 4-5 of them are used in generic
text-to-speech systems). In M-PIRO, such problems were addressed by taking into account the constraints of its domain and the ability of the text generator to provide the speech synthesizers with information that either enhances the performance of the above models or replaces them with highly accurate data.

5.1 Integration of text generator and speech synthesizers
The SOLE component lies between the text generator and the speech synthesizer. The SOLE markup language itself provides enumerated word lists and syntactic tree structures. On the syntactic tree, information exists at the phrase level about the phrase type (sentence, noun phrase, prepositional phrase, relative clause, etc.) as well as at the word level about the part of speech (determiner, noun, verb, preposition, etc.) (Figure 6). This structure carries error-free phrasing and POS information. Most of the sentences generated by the text generator in M-PIRO can be annotated with such detailed meta-information. Pieces of canned text integrated in the presentation are marked as “CANNED-TEXT” without any POS or phrasing information. Such cases are dealt with by the standard text-to-speech path, using instead models trained on the specific domain, thus leading to more natural speech.

The linguistic information carried by SOLE is used to define better prosody. However, since the main aspect of prosody is to accentuate parts of emphasis in the text, we concluded that the elements of the original SOLE specification on their own could not always provide certain intonational focus points, even for the limited number of sentence structures generated for this domain. Taking into account the capabilities of Exprimo, we extended the SOLE specification to accommodate elements that could directly or indirectly imply emphatic events for the specific domain. These elements stand for noun phrases and are:

o Newness information (to the particular user): newness [new/old]
o Number of times mentioned before (to the particular user): mentioned-count [integer]
o Whether they are a second argument to the verb: arg2 [yes/no]
o Whether there is deixis: genitive-deixis, accusative-deixis [yes/no]

Furthermore, for the Greek language we also added to DEMOSTHeNES the ability to support rhetorical questions and explicit-emphasis. Some of these features are handled by trained models, while the others are sent to an e-Text to Speech and Audio (e-TSA) component [Xydas and Kouroupetroglou, 2001b] to identify prosodic contours. The combination of the above elements provides the synthesizers with clues of stress and intonational focus.
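For illustration only, the sketch below combines such attributes into a simple hand-written accenting decision for a noun phrase. The actual mapping used in M-PIRO is learned from data (see section 5.2); the thresholds and weights here are assumptions.

    def accent_noun_phrase(np):
        # A hand-written stand-in: accent new, rarely mentioned phrases,
        # especially when they are a second argument of the verb or deictic.
        score = 0
        score += 2 if np.get("newness") == "new" else 0
        score += 1 if np.get("mentioned-count", 0) == 0 else 0
        score += 1 if np.get("arg2") == "yes" else 0
        score += 1 if np.get("genitive-deixis") == "yes" or np.get("accusative-deixis") == "yes" else 0
        return "pitch-accent" if score >= 3 else "no-accent"

    print(accent_noun_phrase({"newness": "new", "mentioned-count": 0, "arg2": "yes"}))  # pitch-accent
    print(accent_noun_phrase({"newness": "old", "mentioned-count": 3, "arg2": "no"}))   # no-accent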
<wordlist>
  …
  <w id="w7">που
  <w id="w8">δημιουργήθηκε
  <w id="w9">κατά
  <w id="w10">τη
  <w id="w11">διάρκεια
  <w id="w12">της
  <w id="w13">αρχαϊκής
  <w id="w14" punct=".">περιόδου
  …
…
<elem phrase-type="S">
  <elem lex-cat="PRP" href="words.xml#id(w7)"/>
  <elem lex-cat="V" href="words.xml#id(w8)"/>
  <elem phrase-type="PP">
    <elem lex-cat="IN" href="words.xml#id(w9)..id(w11)"/>
    <elem phrase-type="NP" newness="new" arg2="true" proper-group="true" genitive-deixis="true">
      <elem lex-cat="DT" href="words.xml#id(w12)"/>
      <elem lex-cat="N" href="words.xml#id(w13)..id(w14)"/>
…
Figure 6: A SOLE example
5.2 Trained prosodic models
An important goal of M-PIRO’s speech synthesizers is the delivery of more natural speech through improved prosody that approximates the prosody of human guides. To achieve that, professional speakers were used to record corpora that capture the spoken expressions of a guided museum tour. These corpora were created, one per language, by reading museum exhibit descriptions. In the case of Greek, for example, a 30,000-word corpus of exhibit descriptions was created. The text corpus was first annotated by DEMOSTHeNES and then rendered in a readable format so that the speakers could read it out following the annotation directions. The resulting corpora were further segmented and annotated using ToBI marks. These marks were trained against the linguistic phenomena present in the SOLE documents to produce the trained prosodic models, using Classification and Regression Trees [Breiman et al. 1984]. Training was split into three parts:

o Break prediction, where break indices were assigned to each word
o Accent prediction, where accent tones were assigned to stressed syllables
o Endtone prediction, where ending tones were assigned to phrase boundaries

In most cases we used the Linear Regression model [Black and Hunt, 1996] for F0 rendering. These LR models were also trained on the above corpora and thus led to the production of smooth intonational curves for the specific domain.
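As a rough indication of what training such models involves, the sketch below fits a decision tree (a stand-in for CART) that predicts a break index from a few word-level features. It assumes scikit-learn is available, and the features and labels are toy data rather than the M-PIRO corpora.

    from sklearn.tree import DecisionTreeClassifier

    # Toy features per word: [is_phrase_final, is_function_word, words_since_last_break]
    X = [[1, 0, 5], [0, 1, 1], [0, 0, 3], [1, 0, 7], [0, 1, 2], [0, 0, 4]]
    y = ["break-3", "break-0", "break-1", "break-4", "break-0", "break-1"]   # toy break indices

    model = DecisionTreeClassifier(max_depth=3).fit(X, y)
    print(model.predict([[1, 0, 6]]))   # likely a major break at a phrase-final content word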
5.3 Speech Synthesizers

The Festival language-independent speech synthesis system in M-PIRO accommodates Italian and English, while the DEMOSTHeNES Speech Composer hosts Greek synthesis. Both systems implement Heterogeneous Relation Graphs [Taylor et al. 2001], in order to easily exploit XML documents (SOLE, VoiceXML, etc.). Running in server mode, the systems are connected to the Speech Controller component, which, depending on the user’s language preference, routes the text to the appropriate synthesizer. In all languages, diphone synthesis was used. Diphones were extracted from a dedicated voice database of isolated nonsense words.

The speech synthesis output quality is critical, especially in the Virtual Reality application, where it is the only form of exhibit description since the use of text is not considered an appropriate metaphor for a 3D environment. The speech synthesis systems developed in M-PIRO were demonstrated at the Foundation of the Hellenic World’s immersive projection-based system “Magic Screen” [Roussou 2000], where users, wearing special 3D shutter glasses, could select a virtual object and listen to its description and associations.

6. CONCLUSION

The M-PIRO project provides tools and techniques to enhance a museum visitor’s experience by facilitating information delivery in a number of different languages and in a personalized way. It allows textual and spoken descriptions of exhibits to be generated automatically from an underlying language-neutral database and existing free-text descriptions. Each text is tailored according to the user's interests, background knowledge, and language skills. A prototype that generates personalized descriptions of 50 archaeological exhibits in the form of web pages has been developed, which shows the potential of M-PIRO’s technology in information presentations over the web and at public access kiosks. A second prototype delivers personalized audio descriptions of exhibits in an immersive virtual reality environment, using a tight integration between the text generator and the speech synthesizer to achieve more natural prosody. Authoring facilities allow domain experts to configure M-PIRO’s technology for new exhibit collections, and to control both the information that is delivered to the visitors and the language expressions that are used to convey the information.

The M-PIRO consortium consists of: the University of Edinburgh (UK, coordinator), ITC-irst (Italy), NCSR "Demokritos" (Greece), the University of Athens (Greece), the Foundation of the Hellenic World (Greece), and System Simulation Ltd (UK). For more information, consult: http://www.ltg.ed.ac.uk/mpiro/.
7. REFERENCES
Androutsopoulos, I., V. Kokkinaki and A. Dimitromanolaki. “Dynamic Multilingual Object Descriptions: The M-PIRO Project”, presented at the 29th Conference on Computer Applications and Quantitative Methods in Archaeology, Gotland, Sweden, April 25-29, 2001
Bateman, J.A. “Enabling technology for multilingual natural language generation”, in Natural Language Engineering, 3(1):15–55, 1997
Bateman, J.A. “The theoretical status of ontologies in natural language processing”, in S. Preuss and B. Schmitz, editors, Text Representation and Domain Modelling – Ideas from Linguistics and AI, pages 50–99. Technische Universitat Berlin, KIT-Report 97. 1992
Bateman, J., C. Matthiessen, C. Nanri and L. Zeng. “The re-use of linguistic resources across languages in multilingual generation components”, in Proceedings of IJCAI'91, Sydney, Australia, 1991
Black, A. and Hunt, A. “Generating F0 contours from ToBI labels using linear regression”, ICSLP96, vol. 3, pp 1385-1388, Philadelphia, PA. 1996
Breiman, L., Friedman, J., Olshen, R. and Stone, C. Classification and regression trees, Wadsworth and Brooks, Pacific Grove, CA. 1984
Callaway, C., Daniel, B. and Lester, J. “Multilingual Natural Language Generation for 3D Learning Environments”, in Proceedings of the 1999 Argentine Symposium on Artificial Intelligence. Buenos Aires, Argentina, 177-190, 1999
Cosi, P., F. Tesser, R. Gretter and C. Avesani. “Festival Speaks Italian!”, in Proceedings of the 7th European Conference on Speech Communication and Technology, EUROSPEECH 2001, Aalborg, Denmark, September 2001
Dimitromanolaki, A., I. Androutsopoulos, V. Karkaletsis, and C.D. Spyropoulos. “The Greek generation component”. Deliverable 1.3, M-PIRO, 2001
Dimitromanolaki, A., I. Androutsopoulos and V. Karkaletsis. “A Large-Scale Systemic Functional Grammar of Greek”, presented at the Int. Conference on Greek Linguistics, Sorbonne, France, 2001
Fellbaum, C. (editor). WordNet – An Electronic Lexical Database. MIT Press, 1998
Halliday, Michael A. K. An Introduction to Functional Grammar. London: Edward Arnold. 1985
Hitzeman, J., A. Black, C. Mellish, J. Oberlander, M. Poesio and P. Taylor. “An annotation scheme for concept-to-speech synthesis”, in Proceedings of the European Workshop on Natural Language Generation, pp. 59-66, 1999
Isard, A., J. Oberlander, I. Androutsopoulos and C. Matheson. “Speaking the Users’ Languages”. IEEE Intelligent Systems, vol. 18, no. 1, pp. 40-45, 2003
Mann, William C. and Sandra A. Thompson. Rhetorical structure theory: A theory of text organization. Technical Report ISI/RS-87-190, USC Information Sciences Institute. 1987
Melengoglou, Alexander. Multilingual Aggregation in the M-PIRO System. Master’s thesis, University of Edinburgh. 2002
Mellish, C., M. O’Donnell, J. Oberlander, and A. Knott. “An architecture for opportunistic text generation”, in Proceedings of the 9th INLG Workshop. Niagara-on-the-Lake, Ontario. 1998
Not, E., and M. Zancanaro. “Building Adaptive Information Presentations from Existing Information Repositories”, in Proceedings of the International Workshop on Information Presentation and Multimodal Dialogue, Verona, Italy, 14-15 December, 2001
Not, E., and M. Zancanaro. “The MacroNode Approach: mediating between adaptive and dynamic hypermedia”, in Proceedings of the International Conference on Adaptive Hypermedia and Adaptive Web-based Systems, AH'2000, Trento, August 2000
Not, Elena, Ion Androutsopoulos, Jo Calder, Aggeliki Dimitromanolaki, Vangelis Karkaletsis, and George Paliouras. “Generation architecture and description of first M-PIRO prototype”. M-PIRO Deliverable 1.2, University of Edinburgh. 2001
Oberlander, J., M. O'Donnell, A. Knott and C. Mellish. “Conversation in the Museum: Experiments in Dynamic Hypermedia with the Intelligent Labelling Explorer”, in New Review of Hypermedia and Multimedia. 4:11-32, 1998
O'Donnell, M., C. Mellish, J. Oberlander, A. Knott. “ILEX: An Architecture for a Dynamic Hypertext Generation System”, in Natural Language Engineering, 7(3): 225-250, 2001
O'Donnell, M. “Input specification in the WAG sentence generation system”, in Proceedings of the 8th International Workshop on Natural Language Generation, Herstmonceux Castle, UK, June 1996
Paris, C., K. Vander Linden, M. Fischer, A. Hartley, L. Pemberton, R. Power, and D. Scott. “A support tool for writing multilingual instructions”, in Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1398–1404, Montreal, Canada, 1996
Power, R., and N. Cavallotto. “Multilingual generation of administrative forms”, in Proceedings of the 8th International Workshop on NLG, pages 17-19, Herstmonceux Castle, UK, 1996
Power, R., and D. Scott. “Multilingual authoring using feedback texts”, in Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and the 17th International Conference on Computational Linguistics (COLING-ACL 98), pages 1053-1059, Montreal, Canada, 1998
Reiter, Ehud. “Has a Consensus NL Generation Architecture Appeared, and is it Psycholinguistically Plausible?”, in Proceedings of the Seventh International Workshop on Natural Language Generation. Kennebunkport, ME, 163-170, 1994
Reiter, Ehud. “NLG vs. Templates”, in Proceedings of the Fifth European Workshop on Natural Language Generation, The Netherlands, 1995
Reiter, Ehud and Robert Dale. Building Natural Language Generation Systems. Cambridge, U.K.: Cambridge University Press. 2000
Roussou, M. “Virtual Heritage: From the Research Lab to the Broad Public”. In Franco Niccolucci (ed.), Virtual Archaeology: Proceedings of the VAST2000 Euroconference, Arezzo, Italy. Oxford, Archaeopress, 93-100, 2002
Shaw, J. and K. McKeown. “An architecture for aggregation in text generation”, in Proceedings of the 15th International Joint Conference on Artificial Intelligence (IJCAI'97). Nagoya, Japan, 1997
Taylor, P., Black, A., and Caley, R. “Heterogeneous Relation Graphs as a Mechanism for Representing Linguistic Information”, Speech Communication 33, pp 153-174, 2001
Taylor, P., A.W. Black and R. Caley. “The Architecture of the Festival Speech Synthesis System”, in Proceedings of ESCA 98 Workshop, 1998
Wilks, Y., B.M. Slator, and L.M. Guthrie. Electric Words – Dictionaries, Computers, and Meanings. MIT Press, 1996
Xydas, G. and Kouroupetroglou, G. “DEMOSTHeNES Speech Composer”, in Proceedings of the 4th ISCA Tutorial and Workshop on Speech Synthesis, SSW4, pp. 167-172, Perthshire, Scotland, 2001
Xydas, G., and G. Kouroupetroglou. “Text-to-Speech Scripting Interface for Appropriate Vocalization of e-Texts”, in Proceedings of the 7th European Conference on Speech Communication and Technology, EUROSPEECH 2001, Aalborg, Denmark, September 2001
8. AFFILIATIONS

Jo Calder, Alexander C. Melengoglou
University of Edinburgh, UK

Charles Callaway, Elena Not, Fabio Pianesi
ITC-irst, Trento, Italy

Ion Androutsopoulos, Constantine D. Spyropoulos
NCSR “Demokritos”, Athens, Greece

Gerasimos Xydas, Georgios Kouroupetroglou
University of Athens, Greece

Maria Roussou
Foundation of the Hellenic World, Athens, Greece
P. PIWEK, R. POWER, D. SCOTT AND K. VAN DEEMTER
GENERATING MULTIMEDIA PRESENTATIONS FROM PLAIN TEXT TO SCREEN PLAY
1. INTRODUCTION

Most research on Natural Language Generation (NLG) has focussed on the production of plain text, the typical output being one or two paragraphs of continuous prose. In a survey of NLG research pre-1998, Paiva (1998) describes 19 applications of which 12 generate plain text, 5 generate text with some simple layout (e.g., enumerated or bulleted lists), and 2 generate text that includes diagrams. This restriction may have been sensible during the early development of the field, but to address the typical requirements for commercial applications we need to widen the scope of generation to include more complex layout (e.g., lists, boxes, footnotes, emphasis), diagrams or pictures, and the new media (such as animations) that can now be incorporated into documents on the web.

In several projects during the last five years, we have focussed on the commercial requirements of technical documentation, such as instruction manuals (Paris et al., 1995), patient information leaflets (Bouayad-Agha et al., 2002), and product information in the car industry (Krenn et al., 2002). During the course of these projects, we have analysed corpora of existing documents, and interviewed the technical writers and translators who produce them. Several general points have emerged. First, the documents make rich use of layout in order to enhance clarity and navigability. Secondly, most documents of this type include diagrams, which may even be more important than the text (e.g., in instructions on how to assemble equipment or furniture). Finally, the companies issuing the documents are extremely sensitive to questions of style: a manual or instruction leaflet is seen partly as a portrayal of the company’s character and its attitude to its clients. The lesson to be drawn is clear: for commercial applications of NLG, we must develop generators which integrate a range of media and which allow control over style.

How can the scope of a generator be widened so that it can plan layout and the inclusion of illustrations? The simplest solution would be to add these graphical features to the text – in other words, we might hope to take the output of an existing generator, and to apply specialised modules for formatting and illustration. In most cases this method will not yield good results; in our interviews with document designers, we have also confirmed that it is not the method they use. Reformatting, or addition of pictures, means at the very least that the text must be revised; we have even encountered cases in which layout and pictures are drafted before any text is written at all (Bouayad-Agha et al., 2000). In short, text, layout, pictures, etc. have
to be co-adapted, which means that they must be planned together. What is needed is a generator that can plan not simply a text but a multimedia document, while preferring patterns in all the various media that portray the desired personality or style.

Our aim in this paper is to survey our experiences in developing generators that are enhanced in this way. We contrast four presentations of the same semantic content. We begin by considering a presentation that is limited to plain text. We then consider presentations in which graphical formatting methods like vertical lists and emphasis are added to the text. Next, we discuss the implications of adding pictures. Finally, we jump to a much more advanced and recent medium: a computer animation of a situated dialogue, based on an automatically generated screenplay or ‘dialogue script’. The semantic content in each case will be a section adapted from a patient information leaflet, which provides instruction to patients on how to take their tablets. In surveying presentations in the various media, we focus on three issues:

Meaning. Graphical devices are at least partly a means of expressing semantic or rhetorical content. Just as a conventional generator needs rules linking semantic patterns to syntactic ones, an enhanced generator needs some way of representing the meanings of layout patterns and pictures, so that it can make appropriate choices. We have had to address questions like ‘What are the semantic or rhetorical implications of a vertical list?’ and ‘How can the meaning of a picture be specified in an NLG application?’

Style. Languages and related media provide an enormous stock of patterns for expressing the same semantic content; this allows authors the opportunity of signalling a desired personality or style by their preferences among these options. Some linguistic indicators of style are already well-known (e.g., sentence length, informal vs formal terminology); similar indicators can be found within the options for layout and illustration. In the case of scripted dialogue the situation is complicated by the presence of two characters with potentially different conversational styles.

Wording. Since the presence of text is the common factor among the four presentations that we discuss, it is important to see how the nature of this text changes when combined with different media. We show, for example, that criteria of linguistic correctness change when a document uses graphical layout, so that stripping away the layout would yield an ill-formed plain text. Similarly, we will show that pictures and dialogue can have a profound effect on wording.
As might be expected in such a wide-ranging survey, we are more interested in raising issues than in demonstrating specific results. Although our observations are backed up by study of corpora, they rely more on intuition than on firm empirical evidence. In fact, it is difficult to say what kind of empirical basis would be appropriate for a system that generates documents. Do we want to emulate average human behaviour? Presumably not, since many people express themselves ineffectively. We might prefer to emulate professional document designers – yet can we be sure their judgement is always reliable? Perhaps the best guide would be a study of the effect on readers, although this is not an easy thing to measure. Luckily, most of the points we make are relatively elementary, so we can reasonably hope that our intuitions coincide both with expert opinion and objective effectiveness.
2. GENERATING TEXT

Since the generation of text is the primary capability of NLG systems, and because all the multimodal systems that will be discussed later in this paper build on this capability, it will be useful to sketch informally what ‘generation of text’ amounts to in practice. Space will not allow a detailed discussion, and we refer to Reiter and Dale (2000) for elaboration.

2.1 Representing meaning
One usually speaks of NLG when an abstract representation of meaning is put into the words of a natural language. The representations that form the input to the system can be of various kinds. What is crucial is that they are not already the expressions of a natural language (in which case we speak of Machine Translation, not NLG). These representations may be created specifically for processing by an NLG system, or they may have led an independently useful life, for example as the output of a Question-Answering system, or as the facts in a database. In this article, we will focus on a logical representation language, of the kind that is created by the so-called WYSIWYM knowledge editing interface (Power and Scott, 1998). In addition to the usual logical operators, such as negations and conditionals, the logical language contains modal operators such as ‘(it is) obligatory (that)’ and special constructs of the kind that are familiar from Artificial Intelligence planning. For concreteness, we will assume that the input to the system is equivalent to the conjunction of the following six formulas, which contain variables as ‘handles’ for expressing coreference.

(1) obligatory(s:suggest(y:doctor(z:patient),d:dose))
(2) obligatory(follow(z,s))
(3) [unsure-about(z,d) OR unsure-about(z,timing(d))] -> obligatory(ask(z,y))
(4) procedure(take(z,t:tablet), [remove(z,t,foil,finger,back(t)) & swallow(z,t,water)])
(5) [w:take(z,overdose)] -> obligatory([tell(z,y,w) OR visit(z,casualty(hospital(z)))])
(6) store(a:person,m:medicine) -> obligatory(storeawayfrom(a,m,children))
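By way of illustration only (this is not part of the WYSIWYM implementation, and the class and attribute names below are our own), input of this kind can be held as nested terms, so that a generator can inspect functors, arguments and typed variables. A minimal Python sketch, encoding formula (6):

```python
from dataclasses import dataclass
from typing import Tuple, Union

@dataclass(frozen=True)
class Var:
    name: str            # e.g. 'a'
    sort: str = ""       # e.g. 'person'; empty if untyped

@dataclass(frozen=True)
class Pred:
    functor: str                      # e.g. 'storeawayfrom'
    args: Tuple["Term", ...] = ()

Term = Union[Var, Pred]

# Formula (6): store(a:person, m:medicine) -> obligatory(storeawayfrom(a, m, children))
a, m = Var("a", "person"), Var("m", "medicine")
formula_6 = Pred("implies", (
    Pred("store", (a, m)),
    Pred("obligatory", (Pred("storeawayfrom", (a, m, Pred("children"))),)),
))
```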
The first formula says that it is obligatory that a person who is the doctor of the patient (henceforth identified through the variables y and z respectively) suggests what dose d to take. The second formula uses the same variables to say that the patient z should follow this suggestion (identified as s). Consistent with conventions in Discourse Representation Theory (Kamp and Reyle, 1993), most variables (i.e., the ones that do not occur in special constructs such as the antecedent of a conditional) are interpreted as existentially quantified. Because they do occur in the antecedent of a conditional, the variables a and m in (6) are universally quantified: For every person a and medicine m, if a stores m then a has to store m
away from children. Formulas of this kind leave a lot of information implicit – witness the complex relation storeawayfrom, for example – but they are precise enough to serve as a basis for an NLG system that is designed with these formulas in mind. Some NLG systems use inputs that distinguish between different kinds of conjunctions, using the relations proposed by Rhetorical Structure Theory (RST, Mann and Thompson, 1987). In this way, it is possible to encode the fact that the information in (1) is more important than that in (2), for example, by saying that (2) is an Elaboration of (1). We will refer to rhetorical relations between propositions when and where this is appropriate, assuming that they are available in the input to the NLG system.

2.2 Planning document structure
In most NLG systems, the assignment of semantic material to large-scale textual units like sentences, paragraphs, and lists, is performed by relatively simple ‘microplanning’ or ‘aggregation’ rules (Reiter & Dale, 2000). Typically, the rhetorical structure in a text plan is realised by more or less ‘reading it out’ in a left to right fashion, making each leaf into a (grammatical) clause, and producing the text by simply realising these clauses as grammatical sentences, occasionally making more complex sentences with the help of discourse connectives or aggregation. The limitations of this approach for our purposes include the following:
o The structure of the resulting text will be isomorphic with the rhetorical structure. In other words, an RST analysis of the generated text will be exactly the same as the input RST structure. But rewriting even a short piece of text without removing any content can result in a range of different RST structures: e.g., clauses, sentences and whole paragraphs can be reordered; clauses and sentences can be made into adjectives or adverbs, or vice versa; material can be pulled apart and distributed between different sentences or paragraphs; horizontal lists can be made into indented bulleted lists etc.
o The generated text will only include information that is present in the input rhetorical structure. However, texts that are more graphical than those considered by ordinary NLG systems – for example, patient information leaflets – often include additional information. For example, headings (“How to take your medicine”), references to other parts of the document (“See figure 2”), and figure captions.
To overcome these and other related limitations, we have introduced an additional level of representation – document structure – which is a realisation of rhetorical structure that is not necessarily isomorphic to it (Power et al., 2003a). Document representations encode information about the graphical presentation of a document, such as textual level (paragraph, orthographic sentence etc.), layout (indentation, bulleted lists etc.) and their relative positions. Document structure can be seen as an extension to Nunberg’s ‘text-grammar’ (Nunberg, 1990); it is also closely related to ‘logical markup languages’ like LaTeX and HTML. We are not able
to describe document structure in much detail here, and refer the reader instead to (Power et al., 2003a). However, we think the reader will benefit from a small example, showing the transition from rhetorical structure to document structure and the eventual generated text. Consider the following text, of the kind that can be generated by the ICONOCLAST system (Power et al., 2003a):

Your doctor should suggest a dose; follow your doctor’s advice and do not change the dose. If you are unsure about the dose or if you are unsure of when to take it, ask your doctor. To take a tablet, remove it from the foil and swallow it with water. If you take an overdose, tell your doctor immediately or visit your hospital’s casualty department.
Figure 1: One way of expressing the input of section 2.1
Consider the first sentence of this text, which is based on the first two formulas of the input introduced in section 2.1:

(1) obligatory(s:suggest(y:doctor(z:patient),d:dose))
(2) obligatory(follow(z,s))

(The text uses two different formulations to express the information in (2).) The rhetorical structure for this sentence is shown informally in Figure 2. Note that although the nucleus and satellite are identified along with the rhetorical relation BACKGROUND their ordering is not yet determined, since this is a task for the document structurer.

[Figure 2 shows a BACKGROUND relation whose nucleus (n) is proposition (2) and whose satellite (s) is proposition (1).]

Figure 2: Example rhetorical representation (informal)
An example of a simple document representation is shown in Figure 3. Here there is a single document structure containing a paragraph at the top level, which comprises a sequence of text-sentences. Shown in more detail is the part of the document structure representing the first sentence in Figure 1. This text-sentence is composed of two text-clauses, the second being further decomposed into a pair of text-phrases. Each node has the associated features of TEXT-LEVEL, INDENTATION, POSITION and MARKER, with appropriate values. Note that the message of the text, as described in Figure 2, has been expanded at the level of document structure to include a repetition of (2), and that the propositions are now ordered and their associated discourse markers have been assigned.
[Figure 3, reproduced schematically as a tree of feature bundles:]

TEXT-LEVEL: paragraph       INDENTATION: 0   POSITION: ?   MARKER: NIL
  TEXT-LEVEL: text-sentence   INDENTATION: 0   POSITION: 1   MARKER: NIL
    TEXT-LEVEL: text-clause     INDENTATION: 0   POSITION: 1   MARKER: NIL   SEM: 1
    TEXT-LEVEL: text-clause     INDENTATION: 0   POSITION: 2   MARKER: NIL
      TEXT-LEVEL: text-phrase     INDENTATION: 0   POSITION: 1   MARKER: NIL   SEM: 2
      TEXT-LEVEL: text-phrase     INDENTATION: 0   POSITION: 2   MARKER: and   SEM: 2
  … (remaining text-sentences omitted)

Figure 3: Document representation for first sentence of Fig. 1
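The feature bundles of Figure 3 can be pictured as a simple tree of nodes. The following Python sketch is ours (it is not the ICONOCLAST code, and the attribute names merely mirror the figure); it encodes the subtree for the first sentence of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class DocNode:
    text_level: str                 # 'paragraph', 'text-sentence', 'text-clause', 'text-phrase'
    indentation: int = 0
    position: Optional[int] = None  # position among siblings
    marker: Optional[str] = None    # discourse marker such as 'and', or None
    sem: Optional[int] = None       # index of the input proposition realised here
    children: List["DocNode"] = field(default_factory=list)

first_sentence = DocNode("text-sentence", position=1, children=[
    DocNode("text-clause", position=1, sem=1),
    DocNode("text-clause", position=2, children=[
        DocNode("text-phrase", position=1, sem=2),
        DocNode("text-phrase", position=2, marker="and", sem=2),
    ]),
])
paragraph = DocNode("paragraph", children=[first_sentence])  # further sentences omitted
```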
2.3. Controlling linguistic style
Patient information leaflets have been a rich source of material for us. They present information that is safety critical, and for this reason their content is strictly legislated by regulatory bodies (e.g., the EMEA in Europe). Nevertheless, they employ a range of different linguistic devices to convey the required information. This is because the same generic drug tends to be marketed by many companies; given this fierce competition, and given the constraints on what they can say, pharmaceutical companies try to make their mark by adopting an individual, often distinctive, style. Even the most cursory perusal through a corpus of these leaflets (e.g., ABPI 1997) is revealing. At one end of the spectrum, there are those that favour a ‘tabloid’ style, with simple language (e.g., throat instead of larynx; ask instead of consult), liberal use of formatting devices (e.g., font and face alternations, indented bulleted lists), pictures, and containing many single sentence paragraphs; they also tend to be rather ‘personal’ with references to “your doctor” and “your medicine”. At the other end,
there are those that favour a more serious ‘broadsheet’ style, sometimes at first glance looking more like an article in a medical journal than a leaflet for patients – with more technical language, fewer and longer paragraphs, in smaller font size and fewer pictures. The following examples show how the semantic content defined in section 2.1, and shown in Figure 1, may be expressed linguistically in a range of styles. The problem of generating the same content using different styles has been studied by various authors, including Hovy (1988), Walker et al. (1996), and Power et al. (2003b).

Version 1
Take your tablet by removing it from the foil and swallow it with water. Your doctor will tell you your correct dose; follow your doctor's advice and do not change the dose. Ask your doctor if you are unsure of the dosage or its timing. If you miss a dose, take another as soon as you remember or wait until it is time to take the next dose. Then proceed as before. However, if you take an overdose, you should immediately tell your doctor or go to the casualty department (emergency ward) at your hospital. Keep your medicine in a place where children cannot reach it.

Version 2
The method of delivery for tablets is the removal of the tablet from its foil, followed by its ingestion with water. Doctors must instruct their patient about the correct dosage, and patients must follow this instruction strictly. In the event that a patient forgets the required dosage or its timing, the patient must ask their doctor. In the event of an overdose, the doctor must be informed immediately by their patient, or else the patient must report immediately to the casualty department of their hospital. The medicine must be kept away from the reach of children.

Version 3
To take a tablet, you should first remove it from the foil and then swallow it with water. Your doctor will tell you the dosage. Follow his advice and do not change it. If you are unsure of your dosage or its timing, you should ask your doctor. If you take an overdose, you should inform your doctor immediately or go straight to your hospital’s emergency ward. Store your tablets out of the reach of children.
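One way to make such stylistic variation controllable in a generator is to gather the relevant choices into an explicit profile that the system consults at choice points. The sketch below is purely illustrative: the profile and parameter names are our own, not those of any of the systems cited above.

```python
# Illustrative style profiles consulted when choosing words, sentence
# length and formatting devices.
STYLE_PROFILES = {
    "tabloid": {
        "prefer_informal_terms": True,    # e.g. 'throat' rather than 'larynx'
        "max_clauses_per_sentence": 2,
        "use_bulleted_lists": True,
        "address_reader_directly": True,  # 'your doctor', 'your medicine'
    },
    "broadsheet": {
        "prefer_informal_terms": False,
        "max_clauses_per_sentence": 4,
        "use_bulleted_lists": False,
        "address_reader_directly": False,
    },
}

def choose_term(concept: str, profile: dict) -> str:
    """Pick a lexical item for a concept according to the active profile."""
    lexicon = {"larynx": ("throat", "larynx"), "consult": ("ask", "consult")}
    informal, formal = lexicon.get(concept, (concept, concept))
    return informal if profile["prefer_informal_terms"] else formal
```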
3. LAYOUT

Layout enhances plain text by introducing various graphical devices like indented lists, tables, boxes, footnotes, along with extra character formatting (italics, bold face, small type, etc.). There is not a sharp distinction between ‘plain text’ and ‘text with layout’, unless by ‘plain text’ we mean literally a string of words, with no punctuation at all. Devices like semi-colons, full stops, and parentheses serve as graphical aids as much as bulleted lists or bold face; the only difference is that they happen to be realised as characters, and can thus be specified in a plain text editor; owing to this contingency, ‘layout’ is typically used as a term for devices that cannot be expressed through a string of ASCII characters. In languages like LaTeX and HTML, this essentially means anything for which a tag is needed. Here is a version of our sample text in which the features indented lists and emphasis are added, followed by its LaTeX source:

Your doctor should suggest a dose. Follow your doctor’s advice, and DO NOT CHANGE THE DOSE. Ask your doctor if
o you are unsure about the dose
o you are unsure when to take the dose
To take a tablet:
(1) Remove the tablet from the foil.
(2) Swallow it with water.
If you have taken an overdose, tell your doctor immediately or visit your hospital's casualty department.

Your doctor should suggest a dose. Follow your
doctor's advice, and {\sc do not change the dose}.
Ask your doctor if
\begin{itemize}
\item you are unsure about the dose
\item you are unsure when to take the dose
\end{itemize}

To take a tablet:
\begin{enumerate}
\item Remove the tablet from the foil
\item Swallow it with water
\end{enumerate}

If you have taken an overdose, tell your doctor
{\em immediately} or visit your hospital's casualty
department.
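The sense in which the list is specified abstractly and only then realised as tags can be made concrete with a toy renderer that targets either LaTeX or HTML. This sketch is ours, not part of any of the systems discussed; only the tag names (itemize, enumerate, ul, ol, li) are standard.

```python
# A toy renderer from an abstract item list to LaTeX or HTML markup.
def render_list(items, ordered=False, target="latex"):
    if target == "latex":
        env = "enumerate" if ordered else "itemize"
        body = "\n".join(r"\item " + item for item in items)
        return "\\begin{%s}\n%s\n\\end{%s}" % (env, body, env)
    else:  # HTML
        tag = "ol" if ordered else "ul"
        body = "\n".join("<li>%s</li>" % item for item in items)
        return "<%s>\n%s\n</%s>" % (tag, body, tag)

print(render_list(["Remove the tablet from the foil", "Swallow it with water"],
                  ordered=True, target="latex"))
```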
In HTML, essentially the same specifications would be given, except that the tag names are different and paragraphs have to be tagged explicitly rather than separated by an empty line.

3.1. Layout and Meaning
Some recent research in NLG has explored the relationship between the rhetorical structure of an argument, and the document structure that expresses this argument in a text (Bateman et al., 2001; Power et al., 2003a). At a rhetorical level, our example includes a procedure for taking the medicine; this procedure comprises a goal (taking a tablet) and a method, which consists in turn of a sequence of steps. These are all rhetorical concepts, which are part of the message no matter how it is expressed. In the example, these rhetorical units have been mapped to units of abstract document structure: the procedure is realised by a text-sentence; the goal is realised by a text-clause; the method by an enumerated list; and the steps by list items, each also a text-sentence. In instructional texts, this way of presenting a procedure has become almost standard – for instance, it is used consistently in software manuals for any product by Microsoft or Apple. Why is it considered a good format? Note that one cannot
answer this question usefully by subjective reports like ‘the list format is clearer’ or ‘it is easier to read’ – in this case, why do we not use a list format for everything? Are there any general principles that would allow an NLG system to decide when an indented list is appropriate? Here are some relevant factors:
o Space: Indented lists introduce larger separations between segments of the text. Using an indented list therefore incurs a cost: either you must use more paper, or some other part of your message must be left out.
o Navigability: Readers often refer to instructions while performing the task (e.g., preparing a recipe, or assembling a bookcase), rather than memorising all the steps beforehand. This means that it should be easy to re-enter the text after looking away, in order to retrieve the next step.
o Uniformity: Indented lists carry the implication that the items play similar roles in the argument (e.g., in a procedure, they are all steps).
o Size: The value of an indented list is enhanced if the items are relatively large – up to a point. (If the list overflows several pages, confusion obviously results, since the eye can no longer scan across the bullets or numbers, or align them if the list has multiple levels.)
o Length: The value of an indented list is also enhanced if it has more items – again up to a point. With just one or two steps in a procedure, navigation is less of a problem, so the list formatting might be regarded as overkill.
In terms of Rhetorical Structure Theory (Mann and Thompson, 1987), the uniformity constraint might be stated as follows: use indented lists only for multinuclear relations, not for nucleus-satellite relations. By definition, a nucleus-satellite relation is asymmetric – the nucleus is more important than the satellite – and therefore does not have uniform constituents. It would be odd, for example, to present a whole procedure through a list in which the first item was the goal and the second item was the method:
o To take a tablet:
o Remove the tablet from the foil and swallow it with water.
Turning to the size constraint, the following presentation of the sentence ‘the medicine contains terbinafine and gestodene’, while not anomalous, is also strange, because any gain in clarity seems poor compensation for the waste of space. This objection would no longer apply if the list were longer, or if each ingredient in the list was described more fully.
The medicine contains:
  terbinafine
  gestodene
Ubiquitous use of indented lists diminishes their value. If there are bullets dotted around all over the page, it is no longer simple to look away from the instructions and then re-enter at the right point. We have reviewed at some length the motives for using indented lists; we now consider, more briefly, the use of emphasis (shown, for example, by italics or small
capitals). Here there are various possible rhetorical correlates, which have in common the notion that the word or phrase is important. In plain text, crucial warnings can be marked by words like ‘vital’, ‘must’, ‘absolutely’, and sentences can be arranged so that new (as opposed to given) information appears at the end. One sign of careless writing is the repeated use of emphasis in order to recover from poor sentence organisation:

Consult your doctor if you have any problems concerning your treatment, or any questions about your treatment.

Consult your doctor about your treatment if you have any problems or questions.
In the first sentence, ‘questions’ is italicised because the next phrase ‘about your treatment’ presents old information; in the second sentence the new information comes at the end, so emphasis through formatting is unnecessary. In written texts, emphasis is not only the counterpart of stress in speech; it also serves as a navigational aid. For instance, if important warnings are presented in bold face, they can be found at a glance, even if they appear in an inaccessible location such as the middle of a paragraph. As in the case of indented lists, this benefit depends on using the device sparingly: if each page has dozens of emphasized phrases, the warning becomes a needle in a haystack. At an absurd extreme one might imagine an author emphasizing the whole text on the grounds that every word is of vital importance.

3.2. Layout and Style
So far, we have discussed formatting devices as a means of conveying ideas and improving usability; however, authors also have to take account of another aspect, namely style. Different formatting styles will be interpreted as indicating different personalities or attitudes to the reader. Consider the eccentric author just mentioned, who emphasizes an entire leaflet by formatting it (let us say) in capital letters. The drawback is not only that the reader cannot distinguish degrees of importance; there is also an unpleasant tone, a feeling of being shouted at. The different textual styles that were discussed in section 2.3 tend to go together with systematic variations in formatting. In what we called the ‘tabloid’ approach, one sees frequent use of bullet points and emphasized words; in ‘broadsheet’ style, one sees long dense paragraphs and little use of indented lists. The reason, presumably, is that an indented list represents an exchange of clarity for depth. The crucial points can be found more easily, but since space is wasted, there is less room for giving additional explanation. Some readers will thank the author for easing their task; others will perceive the leaflet as an insult to their intelligence. Similar differences are probably found between academic fields: while common in scientific articles, bulleted lists are rare in humanistic fields like philosophy, literary criticism, and history, where the dignity of the material seems to demand long paragraphs of continuous prose and to preclude anything so vulgar as a list.
3.3. Layout and Wording
A potential misconception about formatting is that it can be added to a text as a kind of adornment, like a cherry on top of a cake. In terms of an NLG system, this would be equivalent to an architecture in which the formatting module is invoked last, after the syntax and wording have already been fixed. In reality, layout and wording interact, especially when the layout includes indented structures like lists and quotes. As a simple example of such an interaction, consider the following sentence from our example:

Ask your doctor if:
o you are unsure about the dose
o you are unsure when to take the dose
Suppose that, owing to a change of policy, this sentence had to be revised to take out the bulleted list. Obviously the author could not simply remove all the LaTeX tags, leaving the following non-sentence: Ask your doctor if: you are unsure about the dose you are unsure when to take the dose.
Two changes are needed. First, correct syntax now requires the introduction of a connective (i.e., ‘or’). Second, the punctuation must be corrected – the colon after ‘if’ is no longer appropriate, and a full stop must be added at the end: Ask your doctor if you are unsure about the dose or you are unsure when to take the dose.
What seems to be happening here is that the extra separation and organisation provided by an indented list makes the full stop and the connective unnecessary (provided that the context makes it clear whether the relation is a conjunction or a disjunction), so licensing a small departure from normal syntax. However, the departure can be much more radical, as the following example shows:

In rare cases, the treatment can be prolonged for another week; however, this is risky since:
o The side-effects are likely to get worse. Some patients have reported severe headache and nausea.
o Permanent damage to the liver may result.
Here the list items are paragraphs (the first has two full text-sentences), yet the whole list serves syntactically as the complement of the subordinating conjunction ‘since’. As a result, removal of the indented list would require more far-reaching changes, as in the following version: In rare cases, the treatment can be prolonged for another week; however, this is risky for two reasons. First, the side-effects are likely to get worse; some patients have reported severe headache and nausea. Secondly, permanent damage to the liver may result.
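The factors discussed in section 3.1 and the layout-wording interaction just illustrated can be summed up in a small heuristic. The function names and thresholds below are invented for this sketch; they are not the decision procedure of ICONOCLAST.

```python
def use_indented_list(items, multinuclear=True, page_budget_lines=40):
    """Illustrative heuristic, loosely following section 3.1: lists suit
    multinuclear relations, pay off with more and larger items, and cost
    vertical space."""
    if not multinuclear or len(items) < 2:
        return False                      # nucleus-satellite relation, or trivial
    avg_words = sum(len(i.split()) for i in items) / len(items)
    if avg_words < 3 and len(items) < 3:
        return False                      # e.g. 'terbinafine', 'gestodene'
    return len(items) + 2 <= page_budget_lines  # only if the space is available

def realise_disjunction(items, as_list):
    if as_list:
        return "Ask your doctor if:\n" + "\n".join("  - " + i for i in items)
    # running prose needs the connective and the final full stop restored
    return "Ask your doctor if " + " or ".join(items) + "."
```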
4. PICTURES
Many documents, whether they are meant to be seen on paper or on a computer screen, contain more than just formatted text. In addition to formatting, they may contain such graphical elements as formulas, diagrams, and pictures. A considerable amount of research has been done on the meaning and use of diagrams (e.g., Kerpediev and Roth, 2000); here we will focus on pictures. Pictures are sometimes distinguished from other graphics by the fact that they are ‘iconic’: the meaning of a picture arises mainly by its similarity to what it depicts (Hartshorne and Weiss, 1958). Photographs are pictures, and so are the more stylised sketches that are often found in Patient Information Leaflets. In the present subsection we will ask (a) how pictures contribute to the meaning of a document, (b) how they affect the style of the document, and (c) how they can affect the wording of the document. In addition, we will discuss how documents may be generated that contain pictures as well as formatted text.

4.1. Pictures and Meaning
It is not easy to say in general terms what the meaning of a picture is. An interesting exploration of this question can be found in Levesque (1986), who claims that pictures tend to convey ‘vivid’ information: information that contains no logical structure beyond predication and conjunction. A picture might say, for example, that Carter shook hands with Castro: it cannot say that they either shook hands or beat each other up. The notion of vivid information is an important concept in artificial intelligence, and pictures are sometimes seen as a prime example. Here, we will adopt a variant of this view, viewing pictures as basically expressing existential information: a picture of two unrecognisable men shaking hands can be argued to mean that at some point in time, two men with certain outward appearances shook hands in a particular way. (If one of them is recognisable, then we can think of this as depicting the property of being equal to Carter.) ‘Photographic’ pictures of this kind convey a wealth of information, and would be difficult to generate automatically (unlike the more schematic ones generated in McKeown et al., 1992; Wahlster et al., 1993). The picture just discussed, for example, shows in detail how the handshake took place. This ‘how’ would be difficult to capture in words or mathematical symbols: any symbolic representation would tend to leave out something that the picture depicts. This is different with the kinds of pictures that are used in instructional texts like Patient Information Leaflets, where pictures are employed to convey ‘discrete’ information of the kind that might also have been conveyed by text. Consider the following illustrated version of the document, for example (version 3 in section 2.3).
To take a tablet, you should first remove it from the foil and then swallow it with water. Your doctor will tell you the dosage. Follow his advice and do not change it. If you are unsure of your dosage or when to take it, you should ask your doctor. If you take an overdose, you should inform your doctor immediately or go straight to your hospital's emergency ward. Store the medicine out of the reach of children.

[In the original leaflet this text is accompanied by two pictures: one showing the tablet being pressed out of the foil, the other showing the medicine being stored away.]
Consider the picture showing someone storing away the medicine. The person is shown as a woman with longish hair; the cabinet has a specific size and a medicinal cross on each of its doors. Little of this has anything to do with the meaning of the picture in its current setting. In other respects, the picture is rather poor in information. For example, it does not show what the woman is doing; there is no medicine in sight! Clearly, the picture denotes both more and less than what is depicted: it denotes a person (whose gender and appearance are irrelevant) storing away a medicine in a place where children cannot reach it. This is the kind of information that is conveniently represented using a representation language that allows us to represent atomic propositions (involving one or more arguments), existential quantification, and conjunction. As before, we leave existential quantification implicit: x:person & y:medicine & storeawayfrom(x,y,children)
(‘There is a person x and a medicine y such that x stores away y away from children.’) Something similar is true for the other picture, which shows how to obtain the tablet. The only novel element here is the use of the term back(y). x:person & y:tablet & remove(x,y,foil,finger,back(y))
(‘There is a person x and a tablet y such that x removes y from the foil by pushing a finger to the back of y.’) Over the last few years at ITRI, we have explored the usefulness of an approach that uses logic to express the meaning of pictures, analogous to and inspired by the way in which logic is routinely used for capturing the meaning of texts. This has been done in the following way (van Deemter, 1999; Cahill et al., 2001; van Deemter and Power, to appear):
o We have built an interface that makes it easy to create formulas of the kind exemplified here, and to stipulate that a given picture is associated with a given meaning representation. In the case of the first picture, for example, we assume that the annotator knows that the picture will be used in the patient information leaflets, allowing her to associate it with the formula shown above. Crucially, the interface offers the annotator only a limited range of options; for example, it will not allow the annotator to specify the gender of the person depicted, since the system ‘knows’ that this is irrelevant. The picture annotation interface is a variant of existing WYSIWYM interfaces. As a result, the same interface can be used for creating the A-Box of section 2.1 and for annotating pictures.
o In earlier work, NLG tools have been constructed that allow WYSIWYM-created formulas to be expressed in a number of natural languages (Power and Scott, 1998). These tools allow a user to specify the content of a text at a high level of abstraction, leaving textual realisation to the system.
o Given an A-Box and an annotated library of pictures, a search algorithm finds those pictures in the library that match the information in the A-Box best. The best matching picture is included in the generated document, at a location that corresponds with its contribution to the text (Cahill et al., 2001).
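To fix ideas, the following Python sketch shows one way a library of annotated pictures could be matched against a set of A-Box facts. The file names, the flat tuple encoding and the toy scoring function are ours; this is not the search algorithm of Cahill et al. (2001).

```python
# Illustrative retrieval of the best-matching picture for a set of A-Box facts.
PICTURE_LIBRARY = {
    "store_cabinet.png": {("person", "x"), ("medicine", "y"),
                          ("storeawayfrom", "x", "y", "children")},
    "push_tablet.png":   {("person", "x"), ("tablet", "y"),
                          ("remove", "x", "y", "foil", "finger", "back(y)")},
}

def best_picture(abox_facts):
    def score(annotation):
        matched = len(annotation & abox_facts)
        spurious = len(annotation - abox_facts)   # depicted but not asserted
        return matched - 0.5 * spurious
    return max(PICTURE_LIBRARY, key=lambda name: score(PICTURE_LIBRARY[name]))

facts = {("person", "x"), ("medicine", "y"),
         ("storeawayfrom", "x", "y", "children")}
print(best_picture(facts))    # -> store_cabinet.png
```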
This retrieval-based treatment of pictures is especially useful in applications like the Patient Information Leaflets, where there exists great variety between the types of pictures that are used, and where pictures tend to be reused heavily. Our work in this area has also been integrated into an interactive document authoring system, called ILLUSTRATE, which allows users to specify the semantic content of a document, leaving all details of graphical as well as textual expression to the system (van Deemter and Power, to appear).

4.2. Pictures and Style
Even though pictures and text can express similar kinds of information, they do not have the same strengths and weaknesses. One strong point of pictures, for example, is the immediacy with which they tend to be understood (Pineda, 2000). Take a quick look at the leaflet under discussion, for example, in its illustrated version. Certain aspects stand out with much more clarity and immediacy than others: headers, for example, have a high perceptual salience, and the same is true for pictures. A related strength of pictures is that they are language-independent, making them especially suitable for conveying information to linguistic minorities. Another strong point of pictures, relating to their iconicity (see above), is their suitability for indicating information relating to the relative locations of objects. Consider the first of the two pictures in the leaflet, for example. It can be expressed textually, but to express everything that the picture conveys would tend to be cumbersome: Take your tablet by removing it from the foil by pressing your finger against the back of the tablet.
An informal study of the leaflets in ABPI (1997) has shown that pictures are used in about 60% of the leaflets, and that they are used heavily to depict:
o complex pieces of equipment (anti-asthmatic inhalers, inoculators, etc.) whose spatial layout it is important for the reader of the document to understand;
o actions, such as the steps that need to be taken to clean an inhaler. Often, entire sequences of actions are depicted;
o continuous quantities – e.g., when creams and ointments are used, one frequently sees depictions of the required quantity, sometimes positioned on a finger or juxtaposed to a coin to show the relative size of the blob;
o parts of the human anatomy.
Some companies do not use pictures at all. Those companies that avoid pictures appear to also favour a more formal textual style (section 2.3).
4.3. Pictures and Wording
Just like layout (see section 3), pictures are more than the cherry on the cake: when pictures are used well, they are an integral part of a document, which affects wording (and, obviously, layout). Although neither the RICHES system (Cahill et al., 2001) nor the ILLUSTRATE system (van Deemter and Power, to appear) has taken this fully into account, let us explore some of the ways in which pictures and wording can influence each other. First, we will discuss references to illustrations. Secondly, we will explain how texts may be reduced because of information expressed in the picture.

4.3.1. References to illustrations
References to illustrations can help to improve the readability of the text, by highlighting a connection between two or more of its parts. A sentence like ‘Compare Fig.7 of the previous subsection’, for example, makes crucial use of such a reference. Our semantic approach to pictures allows us to generate such references, whether they are based on the location of the picture (as in ‘See Fig.7’) or on its pictorial content (as in ‘See the picture of the inhaler’). Conversely, the fact that a picture depicts a certain domain object can also be exploited for identifying that object. An example would be the noun phrase ‘the red vial depicted in Fig.7’, which refers to a domain object via a document part. Descriptions of this kind can be generated using a variant of standard generation algorithms (e.g., Dale and Reiter, 1995) provided document-related properties like ‘being described by Fig.7’ are treated on a par with other properties (such as being red, being a vial, etc.) This is only possible in a system that ‘knows’, of every picture in the document, which domain object it refers to. Algorithms for automatically generating ‘document deictic’ references of this kind are discussed in Paraboni and van Deemter (2002) and implemented in a small stand-alone generation program.
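The idea of treating document-deictic properties on a par with ordinary ones can be sketched as follows. The strategy is only loosely modelled on the incremental algorithm of Dale and Reiter (1995), and the domain, entity names and attributes are invented for this example.

```python
# Referring-expression generation with a document-deictic property
# ('depicted in Fig. 7') treated like any other property.
DOMAIN = {
    "vial1": {"type": "vial", "colour": "red",  "depicted_in": "Fig. 7"},
    "vial2": {"type": "vial", "colour": "red",  "depicted_in": None},
    "vial3": {"type": "vial", "colour": "blue", "depicted_in": None},
    "box1":  {"type": "box",  "colour": "red",  "depicted_in": None},
}

def describe(target, preferred=("type", "colour", "depicted_in")):
    distractors = {e for e in DOMAIN if e != target}
    chosen = {}
    for attr in preferred:
        value = DOMAIN[target][attr]
        ruled_out = {e for e in distractors if DOMAIN[e][attr] != value}
        if value is not None and ruled_out:
            chosen[attr] = value
            distractors -= ruled_out
        if not distractors:
            break
    np = "the " + " ".join(filter(None, (chosen.get("colour"), chosen.get("type", "object"))))
    if "depicted_in" in chosen:
        np += " depicted in " + chosen["depicted_in"]
    return np

print(describe("vial1"))   # -> 'the red vial depicted in Fig. 7'
```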
4.3.2. Reducing text
Something that is expressed through a picture may no longer have to be expressed textually. For example, let us return to the earlier-discussed removal of a tablet. If the picture is used then there is no obvious need for the text generator to explain in full detail how the tablet is to be removed, since this is already conveyed by the picture. Instead of (1) below, it is now sufficient to write (2), (3) or even (4):

1. Take your tablet by removing it from the foil by pressing your finger against the back of the tablet.
2. Take your tablet by removing it from the foil (by pressing it through).
3. Take your tablet by removing it from the foil.
4. Take your tablet.
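A toy illustration of the underlying idea (trimming the text plan once a picture has been chosen) is given below. The flat fact tuples and the keep_overlap flag are our own devices; this is not the mechanism of van Deemter and Power (to appear).

```python
# Facts already conveyed by the chosen picture need not be verbalised in full.
text_plan = [
    ("take", "patient", "tablet"),
    ("remove", "patient", "tablet", "foil", "finger", "back-of-tablet"),
    ("swallow", "patient", "tablet", "water"),
]
picture_covers = {("remove", "patient", "tablet", "foil", "finger", "back-of-tablet")}

def facts_to_verbalise(plan, covered, keep_overlap=False):
    # keep_overlap=True retains some redundancy between text and picture,
    # which may be desirable (cf. the choice between versions (2)-(4)).
    return [f for f in plan if keep_overlap or f not in covered]

print(facts_to_verbalise(text_plan, picture_covers))
```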
Mechanisms for achieving the kinds of reductions exemplified in (2)–(4) are described in van Deemter and Power (to appear). Note that the choice between these four possibilities is not trivial, since some degree of overlap between the information conveyed by text and pictures may be desirable.

5. FROM DOCUMENTS TO DIALOGUE VIA SCRIPTS

In this section, we move from documents to dialogue. The step from documents to dialogue might at first sight seem like a big one. Documents are objects, whereas dialogues are events. In order to bridge this gap, our initial concern will be not so much with dialogues as events, but rather with the written records of such events. In fact, our focus is first and foremost on records of fictitious, as opposed to real, dialogues. We use the term dialogue script to refer to a record of a dialogue, regardless of whether it is a real or fictitious dialogue.

5.1. Dialogue Scripts
We have defined dialogue scripts in terms of their origin, i.e., as records of dialogue events. Let us now move to a functional perspective on dialogue scripts. We want to discuss how they can be put to use. We focus on one specific type of exploitation: dialogue scripts used as discourse. From this perspective, dialogue scripts are on a par with, for instance, argumentative discourse and narrative: they are used for communicating with an audience. For instance, if a dialogue script is written entirely by an author, this author is comparable to the author of any other type of discourse: she utilizes the discourse to communicate with the readers of that discourse. Note that even if the dialogue script is based on a real dialogue, it can still be used as a device for communication: the person(s) who selected the real dialogue and subsequently transcribed and adapted it might have done so with a particular audience and message in mind. Philosophers of language have been occupied for a long time with the question what counts as a communicative act. Arguably, the most influential account is due to Grice (1957). Roughly speaking, Grice proposes that a speaker tries to communicate the message m by signal s to an audience if and only if the speaker intends that by
presenting s the audience will recognize the message m partly by recognizing this very intention of the speaker. It is characteristic of scripted dialogue that it involves communication on two layers. The dialogue script is not only a means for communication with an audience, but also a report of a communicative event – the dialogue – which has its own participants. These participants are usually different from the author and audience of the dialogue script. The presence of this second layer of reported communication provides some new opportunities for communicating information on the first layer. These opportunities are, however, orthogonal to the possibilities which text, layout and pictures provide. For instance, we have seen that, in text, rhetorical structure can be expressed by means of discourse markers. Since a dialogue script is a text, this device is still available, as witnessed by the use of ‘although’ in the following fragment of a dialogue script.

Although the patient asked when he should take the medicine, the pharmacist could only reply to him that his doctor would be able to tell him.
Layout can be used to enhance dialogue scripts in various ways. In the example above, the underlying dialogue structure is expressed through the wording (‘the patient asked’, ‘the pharmacist could only reply’). Roughly the same meaning can be conveyed through layout:

Patient:     When should I take the medicine?
Pharmacist:  Your doctor will be able to tell you.
The choice between (in)direct speech and layout to convey dialogue is influenced by the same considerations which apply to the choice between plain text, enumerations and lists. For example, we pointed out that requirements concerning the navigability of a document can suggest a particular type of layout (e.g., a bulleted list). Now, consider a dialogue script which is used as a direction to actors. During rehearsals, actors will often refer to individual turns while reciting them. Hence, as in instructional text, it becomes important that it is easy to re-enter the text after looking away, in order to retrieve the next turn. Character names are salient re-entry sites. A presentation of a dialogue script in the form given below can basically be seen as a labelled list, where labels occupy the place that is taken by bullets in a bulleted list.

Speaker A:   Text of turn.
Speaker B:   Text of turn.
…            …
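Seen as data, a dialogue script of this kind is just a sequence of (speaker, text) pairs, and the labelled-list presentation is one way of rendering it. The following Python sketch is illustrative only; it is not the RRL format used in NECA (discussed in section 5.2).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Turn:
    speaker: str
    text: str

script: List[Turn] = [
    Turn("Patient", "When should I take the medicine?"),
    Turn("Pharmacist", "Your doctor will be able to tell you."),
]

def render_labelled_list(turns: List[Turn]) -> str:
    width = max(len(t.speaker) for t in turns) + 1
    return "\n".join(f"{t.speaker + ':':<{width + 1}} {t.text}" for t in turns)

print(render_labelled_list(script))
```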
As in previous sections, formatting is not just an embellishment of text. The differences between the pronouns in the two examples (‘he’ vs ‘I’, ‘his’ vs ‘your’ and ‘him’ vs ‘you’) show that wording interacts with layout. Finally, a scripted dialogue can also include pictorial illustrations. Again, reasons for adding pictures to (formatted) text can also be reasons for adding pictures to the (formatted) text of a dialogue script. This is not just a theoretical possibility. For instance, the website of
the UK Department of Health (http://www.doh.gov.uk/adguid.htm) presents some of its information in the form of question-answer sequences. Additionally, these are elaborated with illustrations. A characteristic of scripted dialogue is that it introduces multiple voices into a text. Every text has an authorial voice. Through the text the author expresses ideas, emotions and opinions. Authors have, however, various devices at their disposal to introduce further voices that can articulate different or incompatible ideas, emotions, opinions, etc. In a scripted dialogue the voices are not only expressing further points of view, but also interacting with each other. In plain text, different voices can be introduced through direct and indirect speech. We have seen that in dialogue scripts the same can also be achieved through layout. The move from text to layout has an interesting consequence. We move from an explicit report (‘X said …’) to one where the authorial voice is implicit in the layout, thus creating more distance between the reported dialogue and the reporter/author of the dialogue. Thus, the reported dialogue is dissociated from the reporter and associated more strongly with the characters involved in the dialogue. The reader is lured into viewing the reported utterances as genuine performances of these characters rather than creations or interpretations of the reporter/author for which the reporter/author can be held responsible. A dialogue script with believable characters creates new opportunities for influencing its audience. The characters can be given certain traits that affect the interpretation by the audience of what they are saying. For instance, consider an author who is not a medical expert yet wants to communicate some piece of medical advice. One way to do this is to directly address the audience. Alternatively, s/he could report a (real or fictitious) utterance by a medical authority – say a pharmacist. The latter strategy would allow the author to imbue the communicated information with the authority of the character. Scripted dialogue also provides new means for conveying rhetorical structure. Some researchers (e.g., van Kuppevelt, 1995) have suggested that the underlying rhetorical structure of discourse can be understood in terms of the (implicit) questions which the sentences in the discourse address. Let us go back to the meaning representation in section 2.1. The following dialogue script could be used to express the information it contains:

 (1) Pharmacist:  Here is your medicine.
 (2)              Store it away from children.
 (3)              Your doctor should suggest a dose.
 (4) Patient:     Can I change the dose?
 (5) Pharmacist:  No, you should follow your doctor’s advice.
 (6) Patient:     When should I take the medicine?
 (7) Pharmacist:  Ask your doctor.
 (8) Patient:     How should I take the tablet?
 (9) Pharmacist:  Remove it from the foil by pressing your finger
(10)              against the back of the tablet.
(11)              Then swallow it with water.
(12) Patient:     What if I take an overdose?
(13) Pharmacist:  Tell your doctor or visit your
(14)              hospital's casualty department.
Here, (4) expresses a possible question which (5) answers. Thus (4) legitimizes (5) and reveals the purpose of (5). Complex rhetorical structures, such as those expressed by the following formula (number 4 of section 2.1) can also be expressed in a transparent way in a dialogue script. procedure(take(z,t:tablet), [remove(z,t,foil,finger,back(t)) & swallow(z,t,water)])
This is achieved by distributing information amongst the interlocutors. In our script, the patient expresses the goal of the procedure (8) whereas it is the pharmacist who explains the steps (9–11). Piwek and van Deemter (2002) provide examples of human-authored dialogue scripts with subdialogues which at first sight contribute no new information, for instance, subdialogues consisting of a clarification question and an answer. From the perspective of the interlocutors in the dialogue such subdialogues play an important communicative function. It is, however, less clear what function they play in the communication from the author of a dialogue script to the readers/audience. Why did the author include these subdialogues? On the basis of the aforementioned examples Piwek and van Deemter (2002) hypothesize that subdialogues can serve the purpose of emphasising information. The author of our dialogue script between the pharmacist and the patient might have employed such a strategy if, for some reason, s/he thought that the information conveyed in (13) and (14) should be brought to the reader’s attention with extra force. The idea would be to insert, for instance:

(15) Patient:     Do I really have to?
(16) Pharmacist:  Absolutely, tell your doctor or go to the casualty
(17)              department.
This example illustrates that dialogue provides a natural medium for repeating information to achieve emphasis. Dialogue thus provides us with a new device for emphasizing information in addition to the methods discussed earlier on, such as the use of typography.

5.2. Scripted Dialogue
We started this section by pointing out the gap between documents and dialogue. We have suggested that dialogue scripts might function as a stepping stone between the two. In the remainder of this section, we want to take the final step to dialogue. We already pointed out that dialogue scripts are written records. From here it is, however, a small step to scripted dialogue, i.e., the performance of a dialogue script by two or more actors. Thus, whereas a dialogue script is an object, a scripted dialogue is an event. The automated generation of scripted dialogue has been pioneered by André et al (2000).
Here we want to briefly describe an architecture for generating scripted dialogue which has been implemented in the NECA project (Krenn et al., 2002), automating both the generation of the dialogue script and the subsequent performance of that script. The input to the system consists of (a) a database or conjunction of logical formulae (as described in section 2.1), possibly annotated with further pragmatic information (e.g., which information is important) and (b) information about the characters (personality traits, role and interests). A pipeline architecture is employed: the system takes the input and puts it through the following modules:
1. A Dialogue Planner, which produces an abstract description of the dialogue (the dialogue plan).
2. A multi-modal Generator, which specifies linguistic and non-linguistic realizations for the dialogue acts in the dialogue plan.
3. A Speech Synthesis Module, which adds information for speech.
4. A Gesture Assignment Module, which controls the temporal coordination of gestures and speech.
5. A player, which plays the animated characters and the corresponding speech sound files.
GENERATING PRESENTATIONS FROM PLAIN TEXT TO SCREEN PLAY
223
like ingredients mixed in a bowl. Wording must be adapted to take account of layout and pictures; and in a dialogue, speech must be coordinated with gestures and other behaviour. What is not at all obvious is how this integration can be achieved in a multimedia generation system. Our approach to multimedia generation in ICONOCLAST and NECA has been based on the notion of constraint satisfaction. The generated presentations in each medium have to conform to several kinds of constraint. First, they must be consistent with the semantic content (so ensuring that they ‘sing from the same hymn sheet’ – i.e., do not contradict one another). Secondly, they must respect any general requirements on style, so that for example a sober scientific article is not combined with frivolous pictures or layout. Thirdly, they must respect constraints on crossmedia realisation, so that each is designed to take account of the others. This third category has been the main focus of our examples. Recently, we have started to apply this approach to the generation of dialogues (Piwek and van Deemter, 2003). Here we concentrate on the generation of dialogues as presentations for an audience (i.e., dialogue in drama, comedy, feature films, commercials, infomercials, etc.). The focus is on the effect of the dialogue on the audience. This in contrast to traditional work on dialogue, which concentrates on the influence of dialogue (acts) on the participants of the dialogue. The integrated approach contrasts with one in which the various media are considered in sequence. Imagine, for example, a system that first produces a plain text (perhaps a section for a patient information leaflet), then adds layout, then adds pictures, through three modules organised in a ‘pipeline’. Such a system can achieve some degree of integration, because the layout and picture modules have access to the generated wording. However, the options open to the later modules will be restricted by decisions taken unilaterally by the text generator, so that the final result, even though well-formed, is likely to fall below the quality obtainable through integrated planning. To state constraints across different media, it is useful to have a unified description language for the multimedia document. Our first step towards such a representation has been the notion of ‘document structure’, introduced in the ICONOCLAST and RAGS projects (Power, 2000; Power et al., 2003a; RAGS, 1999), which extends Nunberg’s ‘text-grammar’ (Nunberg, 1990) to allow the inclusion of illustrations and indented layout patterns. A similar move towards a unified representation for animations is the Rich Representation Language (RRL; Piwek et al., 2002) for speech and gesture in NECA.
7. REFERENCES

ABPI (1997). Compendium of Patient Information Leaflets. Association of the British Pharmaceutical Industry.
André, E., Rist, T., van Mulken, S., Klesen, M., and Baldes, S. (2000). The automated design of believable dialogues for animated presentation teams. In Cassell, J., Sullivan, J., Prevost, S., and Churchill, E., editors, Embodied Conversational Agents, pages 220-255. MIT Press.
Bateman, J., Kamps, T., Kleinz, J., and Reichenberger, K. (2001). Towards Constructive Text, Diagram, and Layout Generation for Information Presentation. Computational Linguistics, 27(3):409-449.
Bouayad-Agha, N., Power, R., Scott, D., and Belz, A. (2002). PILLS: Multilingual generation of medical information documents with overlapping content. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC 2002).
Bouayad-Agha, N., Scott, D., and Power, R. (2000). Integrating content and style in documents: a case study of patient information leaflets. Information Design Journal, 9(2-3):161-176.
Cahill, L., Carroll, J., Evans, R., Paiva, D., Power, R., Scott, D., and van Deemter, K. (2001). From RAGS to RICHES: exploiting the potential of a flexible generation architecture. In Proceedings of ACL/EACL-01, pages 98-105, Toulouse.
Dale, R. and Reiter, E. (1995). Computational interpretations of the Gricean maxims in the generation of referring expressions. Cognitive Science, 19:233-263.
Grice, P. (1957). Meaning. Philosophical Review, 66:377-388.
Hartshorne, C. and Weiss, P. (1958). The Collected Papers of Charles Sanders Peirce. Harvard University Press, Cambridge, Mass.
Hovy, E. H. (1988). Generating Natural Language under Pragmatic Constraints. Lawrence Erlbaum, Hillsdale, NJ.
Jameson, A. (1987). How to appear to be conforming to the ‘maxims’ even if you prefer to violate them. In Kempen, G., editor, Natural Language Generation: New Results in Artificial Intelligence, Psychology and Linguistics. Martinus Nijhoff Publishers, Dordrecht.
Kamp, H. and Reyle, U. (1993). From Discourse to Logic: Introduction to Model-theoretic Semantics of Natural Language, Formal Logic and Discourse Representation Theory. Kluwer, Dordrecht.
Kerpedjiev, S. and Roth, S. (2000). Mapping communicative goals into conceptual tasks to generate graphics in discourse. In Proceedings of Intelligent User Interfaces, New Orleans.
Krenn, B., Pirker, H., Grice, M., Baumann, S., Piwek, P., van Deemter, K., Schröder, M., Klesen, M., and Gstrein, E. (2002). Generation of multi-modal dialogue for net environments. In Busemann, S., editor, KONVENS 2002, pages 91-98, Saarbruecken, Germany.
Levesque, H. (1986). Making believers out of computers. Artificial Intelligence, 30:81-108.
Mann, W. and Thompson, S. (1987). Rhetorical structure theory: a theory of text organisation. ISI Technical Report ISI/RS-87-190.
McKeown, K., Feiner, S., Robin, J., Seligmann, D., and Tanenblatt, M. (1992). Generating cross-references for multimedia explanation. In Proceedings of the 10th National Conference on Artificial Intelligence, pages 9-16.
Nunberg, G. (1990). The Linguistics of Punctuation. CSLI, Stanford, USA.
Paiva, D. (1998). A survey of applied natural language generation systems. Information Technology Research Institute Technical Report ITRI-98-03.
Paraboni, I. and van Deemter, K. (2002). Towards the generation of document-deictic references. In K. van Deemter and R. Kibble (Eds.), Information Sharing: Reference and Presupposition in Language Generation and Interpretation. CSLI Publications, Stanford, CA.
Paris, C., Vander Linden, K., Fischer, M., Hartley, A., Pemberton, L., Power, R., and Scott, D. (1995). A support tool for writing multilingual instructions. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 1398-1404, Montreal, Canada.
Pineda, L. (2000). Spatial language, deixis and illustration. In Proceedings of the Workshop on Coherence in Generated Multimedia, INLG 2000, Mitzpe Ramon, Israel.
Piwek, P., Krenn, B., Schröder, M., Grice, M., Baumann, S., and Pirker, H. (2002). RRL: A rich representation language for the description of agent behaviour in NECA. In Proceedings of the AAMAS Workshop ‘Embodied conversational agents – let’s specify and evaluate them!’, Bologna, Italy.
Piwek, P. and van Deemter, K. (2002). Towards automated generation of scripted dialogue. In Proceedings of the 6th Workshop on the Semantics and Pragmatics of Dialogue, Edinburgh.
Piwek, P. and van Deemter, K. (2003). Dialogue as Discourse: Controlling Properties of Scripted Dialogue. In Proceedings of the AAAI Spring Symposium on Natural Language Generation in Spoken & Written Dialogue, Palo Alto, CA.
Power, R. (2000). Planning texts by constraint satisfaction. In Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000).
Power, R. and Scott, D. (1998). Multilingual authoring using feedback texts. In Proceedings of the 17th International Conference on Computational Linguistics and 36th Annual Meeting of the Association for Computational Linguistics, pages 1053-1059, Montreal, Canada.
Power, R., Scott, D., and Bouayad-Agha, N. (2003a). Document Structure. Computational Linguistics, 29(2).
Power, R., Scott, D., and Bouayad-Agha, N. (2003b). Generating texts with style. In Proceedings of the 4th International Conference on Intelligent Text Processing and Computational Linguistics, Mexico City.
Cahill, L., Doran, C., Evans, R., Mellish, C., Paiva, D., Reape, M., Scott, D., and Tipper, N. (1999). Towards a Reference Architecture for Natural Language Generation Systems. ITRI Technical Report ITRI-99-14.
Reiter, E. and Dale, R. (2000). Building Applied Natural Language Generation Systems. Cambridge University Press, Cambridge.
Rubinstein, A. (2000). Economics and Language. Cambridge University Press, Cambridge.
Van Deemter, K. (1999). Document generation and picture retrieval. In Proceedings of the Third International Conference on Visual Information Systems, Amsterdam. Springer Verlag.
Van Deemter, K. and Power, R. (in press). High-level authoring of illustrated documents. To appear in Natural Language Engineering, 9(2).
Van Kuppevelt, J. (1995). Discourse structure, topicality and questioning. Journal of Linguistics, 31:109-147.
Wahlster, W., André, E., Finkler, W., Profitlich, H.-J., and Rist, T. (1993). Plan-based Integration of Natural Language and Graphics Generation. Artificial Intelligence, 63:387-427.
Walker, M. A., Cahn, J. E., and Whittaker, S. J. (1996). Linguistic Style Improvisation for Life-like Computer Characters. In Proceedings of the AAAI Workshop on AI, A-life, and Entertainment, Portland.
8. AFFILIATIONS

Paul Piwek, Richard Power, Donia Scott, Kees van Deemter
ITRI – University of Brighton, Brighton, UK
C. ZINN, J. D. MOORE AND M. G. CORE
INTELLIGENT INFORMATION PRESENTATION FOR TUTORING SYSTEMS
1. INTRODUCTION

Much of the work on intelligent information presentation has focused on systems that support information-seeking (e.g., providing flight times and fares), assist decision-making (e.g., comparison shopping, logistics planning), or describe objects and artefacts (e.g., museum guides). In such applications, while we expect that the system has more information than the user, we also assume that the user understands the domain of discourse and is able to use the information the system provides in order to choose an option, or make a decision, or assimilate the information they receive. In this paper, we focus on intelligent information presentation in the domain of tutoring, where the system is trying to teach the user new concepts and correct user misconceptions. To motivate our approach, we describe the unique characteristics of human tutorial interaction, and present an architecture for intelligent information presentation for tutorial applications.

1.1 Intelligent Information Presentation for Tutoring
Studies show that one-to-one human tutoring is more effective than other modes of instruction. A meta-analysis of the findings from 65 independent evaluations of school tutoring programs found that tutoring raised students' performance by 0.40 standard deviations (Cohen, Kulik, & Kulik, 1982). Results with good tutors are even more promising. For example, the average student who received one-to-one tutoring with a good tutor scored 2.0 standard deviations above the average student who received standard classroom instruction, and 1.0 standard deviation above students in a mastery learning condition (Bloom, 1984). From its inception, the goal of research in computer-based tutoring environments has been to model the effective behaviours of good human tutors, and in so doing to create an optimal educational tool.

There is mounting evidence from cognitive psychology that important (or what many call “deep”) learning is most likely to occur when students encounter obstacles and work around them, explain to themselves what worked and what did not, and work out how new information fits in with what they already know (Chi, Bassok, Lewis, Reimann, & Glaser, 1989; Chi, de Leeuw, Chiu, & LaVancher, 1994; Ohlsson & Rees, 1991; VanLehn, 1990). This is consistent with the constructivist movement in education, which argues that students
learn best when they are active participants in the learning process and construct knowledge for themselves. Debates about what makes human tutoring effective, and how this might be captured in a computer-based learning environment, led to several detailed studies of human tutoring (Lepper & Chabay, 1988; McArthur, Stasz, & Zmuidzinas, 1990; Merrill, Reiser, & Landes, 1992; Fox, 1993; Graesser & Person, 1994). The consensus from these studies is that experienced human tutors maintain a delicate balance, allowing students to do as much of the work as possible and to maintain a feeling of control, while providing students with enough guidance to keep them from becoming frustrated or confused.

Maintaining this delicate balance requires that a tutor be flexible. Our and others' analyses of human tutorial interactions show that human tutors use a variety of strategies, including hinting (Hume, Michael, Rovick, & Evens, 1996), drawing students' attention to an error (often indirectly) and providing students an opportunity for repair (Fox, 1993; Lepper & Chabay, 1988), pointing out features of the solution that are incorrect, scaffolding (Chi, Siler, Jeong, Yamauchi, & Hausmann, 2001), and so on. In addition, human tutors strategically moderate their feedback. They sometimes intervene immediately after an error has occurred, but at other times allow the student to proceed with the solution, returning to the error later (Littman, Pinto, & Soloway, 1990). Merrill, Reiser, Ranney, and Trafton (1992) argue that human tutorial guidance appears to be structured around impasses, and that the content and timing of feedback are dependent on the error or impasse encountered.

Human tutoring is a collaborative process, in which tutor and student work together to repair errors. It is a highly interactive process, with the tutor providing constant feedback to support students' problem solving. Merrill, Reiser, Ranney, and Trafton (1992) argue that regardless of the timing or content of the intervention, human tutors carefully design their feedback to allow students to do as much of the work as possible, while still preventing floundering. Fox (1993, p. 122) observes that “the tutor and student both make use of strategies which maximize the student's opportunity to correct his/her own mistake.” In addition, tutors avoid directly telling the student that they are wrong or precisely how a step is incorrect. Instead, tutors indirectly guide students through the process of error detection and correction.

Human tutors interact with students via natural language dialogue, sometimes including equations or references to diagrams or simulated models of the domain. They prompt students to construct knowledge, give explanations, assess the student's understanding, and so on, all via natural language. They get and give linguistic cues about how the dialogue is progressing. These cues give tutors information about the student's understanding of the material, and allow the tutor to determine when a strategy is working or when another tactic is needed. Natural language dialogue is an ideal medium for this type of interaction because it offers many indirect techniques for notifying students that a step in the solution requires repair. Fox (1993) found that tutors provide frequent feedback indicating that a step is okay. A short hesitation in responding “okay” typically led the student to assume that something was amiss with the current step, and frequently led students to repair their own errors.
When more explicit help was required, the tutor focused the student's attention on the part of the solution that required modification
or on information that was useful for repairing the error. Although students sometimes explicitly request guidance or affirmation that their step is correct, this usually is not necessary because the tutor provides such information through hints, leading questions, verbal agreement, and other indirect methods.

1.2 Managing Tutorial Dialogue
Section 1.1 described the benefits of natural language dialogue as a presentation modality for tutoring. In this section, we describe the requirements that must be met to build such a system for intelligent information presentation:
• Presentations must unfold over many conversational turns, even when it would be possible to present all of the information in a single contribution. This is crucial, because the system must give the student opportunities to contribute to the solutions and must not ignore the student's signs of confusion.
• The tutor system must have the ability to ask students questions that it “knows” the answer to, either to prompt the student to provide the information to facilitate knowledge construction, or to diagnose the level of the student's knowledge.
• The tutor must understand student utterances well enough to respond appropriately.
• The tutor system must have the ability to react to unexpected events. By evaluating the current dialogue situation, it must be able to revise its current plan or postpone the refinement of a sketchy plan until the situation provides the necessary information. In particular, the tutor is required
  1. not to ignore student confusion,
  2. to encourage the student to recognise and correct their own errors,
  3. to abandon questions that are no longer relevant,
  4. to handle multiple student actions in a single turn, and
  5. to deal with student-initiated topic changes.
These five sub-requirements stress the need for a tutorial agent to monitor the execution of its dialogue strategies. In the case of failure, the agent needs to adapt its plan to the new situation: inserting a plan for a sub-dialogue to handle student confusion or a student misconception (1+2), deleting parts of a dialogue plan because their effects are now irrelevant or already achieved (3+4), or reorganising sub-plans to handle topic changes (5). Consequently, there is no need for tutorial dialogue managers to generate elaborate discourse plans in advance. Given the dynamics of tutorial dialogue – the large number of potential student actions at any point and the limited ability of the tutor to predict them – it is a more viable approach to enter a tutorial conversation with a sketchy high-level dialogue plan. As the dialogue progresses, the dialogue manager then refines the high-level plan into low-level dialogue activities by considering the incrementally constructed dialogue context. The dialogue manager therefore interleaves high-level tutorial planning with on-the-fly, situation-adaptive plan refinement and execution.
In addition to these tutoring-specific requirements, the fact that the computer is participating in a conversation with a human means that it must perform the following dialogue management tasks, as described by Lewin, Rupp, Hieronymus, Milward, Larsson, and Berman (2000):
• turn-taking management: determining who can speak next, when, and for how long;
• topic management: determining what can be spoken about next;
• utterance understanding: understanding the content of an utterance in the context of previous dialogue;
• intention understanding: understanding the point or aim behind an utterance in the context of previous dialogue;
• context maintenance: maintaining a dialogue context;
• intention generation: generating a system objective given a current dialogue context; and
• utterance generation: generating a suitable form to express an intention in the current dialogue context.
Although we focus on intention generation here, the other topics are discussed briefly as they are all interrelated, and are sometimes conflated in the literature. In the next section, we discuss the state of the art in dialogue management and why it is not sufficient to meet the requirements for managing tutorial dialogue.

2. STATE-OF-THE-ART DIALOGUE MANAGEMENT

We review three industrial-strength models of dialogue processing, namely, finite state machines (FSMs), form-filling, and VoiceXML, as well as an interesting cross-breed of FSMs and planning.

In the FSM approach, dialogue management is performed by a set of hierarchical FSMs that represent all possible dialogues. The top-level FSM is typically based on the structure of the task to be performed (e.g., flight information, phone banking transactions). For each state of the FSM, the dialogue engineer needs to specify its transitions to successor states, many of which handle numerous exceptions (e.g., user help and cancel requests, timeouts). The dialogue engineer thus manually defines all possible dialogue flows; all alternative routes are drawn in full. Consequently, the construction of FSMs is not only domain specific but also labour intensive and prone to error. On the positive side, FSMs run in real time, and a well-designed FSM can produce help messages and issue re-prompts that are sensitive to the task context. On the negative side, the dialogues are system-driven: turn-taking as well as system feedback are hardwired, and there is only a limited and well-defined amount of dialogue context stored in each state of the network. This makes it hard to produce responses sensitive to unexpected input or to the linguistic context, and to provide personalised or customised advice or feedback. It also makes it hard to port an FSM-based dialogue system to a new domain. The size of an FSM is limited practically rather than theoretically; a typical industrial dialogue system in the area of banking has circa 1500 states (personal communication, Arturo Trujillo, Vocalis plc). FSM construction is supported by various toolkits, e.g., AT&T's FSM library (Mohri, Pereira, & Riley, http://www.research.att.com/sw/tools/fsm).

Form-filling is a less rigid approach to dialogue management. Instead of anticipating and encoding all plausible dialogues, the dialogue engineer specifies the information the dialogue system must obtain from the user as a set of forms composed of slots. The structural complexity of possible dialogues is limited only by the form design and the intelligence of the form interpretation and filling
algorithm. This algorithm may be able to fill more than one slot at a time, maintain several active forms simultaneously, and switch among them. In contrast to an FSM-based dialogue system, the user of a form-filling dialogue system can therefore supply more information than the system requested (the system performs question accommodation), or start a task before the system has offered to perform it (the system performs task accommodation).

VoiceXML (W3C, 2002) augments the form-filling approach with an XML-based specification language and support for speech input and output. It is an evolving industry standard that is designed for modeling audio dialogues including synthesised speech, digitised audio, spoken and DTMF key (“touch-tone”) input, and mixed-initiative conversations. A VoiceXML document or set of documents defines a space of possible human-computer dialogues. Two types of dialogues are supported: forms that collect values for variables, and menus that present the user with a choice of options. Each dialogue specifies the next dialogue in the sequence; if there is no successor, execution stops. Subdialogues provide a mechanism for modularising common tasks such as confirmation sequences. Like a subroutine in a computer program, once the subdialogue is complete, the form interpreter returns to the place in the document where it was invoked. Subdialogues can be used to build libraries of short interactions shared among documents comprising a variety of applications. The acquisition and processing of normal input is complemented by an event handler that uses application-specific XML code to cope with user help and cancel requests as well as with no-input or no-match situations. VoiceXML-based dialogues can be less rigid than our description suggests. The order in which the machine collects the information from the user is not entirely pre-determined. Mixed-initiative dialogues allow the user to provide information in a flexible order or to provide multiple pieces of information in succession without the interruption of intermediary prompts.

A major advantage of these three approaches is the robustness of their language understanding capabilities. Each approach has strong expectations about user input based on the state of the system. In the FSM approach, the nodes of the network can be associated with special grammars and language models; in the form-filling approach, one can associate actions with slot-filling events, for example, controlling the activation and combination of scoped grammars; in the VoiceXML approach, user input, provided in response to a system-generated utterance produced by the interpretation of the contents of the <prompt> tag, is recognised using the grammar supplied by the associated <grammar> tag. The form elements <noinput> and <nomatch> are used to cover cases where the user fails to supply input or where the user input is not covered by the associated grammar.

All three approaches discussed so far face the problems of choosing among multiple refinements of a task, identifying which task the user is currently trying to perform, and detecting user-initiated switches among tasks or abandonment of tasks. These problems remain difficult and open. Moreover, FSM and form-filling approaches do not facilitate per se the maintenance of a dialogue history, which can then be exploited to support the generation of natural and effective feedback.
The specification of VoiceXML, however, defines five distinct levels of variable scope (session, application, document, dialogue, and anonymous), allowing, in principle,
the maintenance of a dialogue context. Used together with VoiceXML's conditional statements, they enable a dialogue engineer to implement a proper handling of, say, meta-dialogues and other linguistic phenomena (e.g., generation of anaphora and ellipses).

In each of the three approaches, all plausible dialogues (FSMs) or their content (form-filling, VoiceXML) have to be specified in advance. None of the approaches involves a deliberative component that can generate dialogue plans to achieve underlying goals, although (complex) forms can be seen as instantiated plans. A deliberative component also proves useful in cases where the execution of the first planned dialogue strategy fails and re-planning is needed.

Dialogue management in the AUTOROUTE system combines the power of deliberative planning with the benefits of FSMs in a two-tier approach (Lewin, 1998). The bottom tier consists of dialogue games, which are generic FSMs. Dialogue games encode typical adjacency pairs (e.g., a question is followed by an answer, which may be followed by a confirmation and an acknowledgement) but do not specify the content of the dialogue moves they contain. In the top tier, a deliberative agent treats the I/O behaviour of a dialogue game as a primitive action. Given a goal, it generates a plan that has instantiated dialogue games as its primitive steps (e.g., to pay a bill, obtain the date, amount, and bill type; to obtain a date, play a question/answer game). Thus far, the AUTOROUTE approach has only been applied to the genre of information-seeking dialogues. Its planning architecture has not been fully exploited, and its re-planning capabilities allow only trivial cases of question and task accommodation. Moreover, turn-taking is hardwired into the FSMs, which makes it hard to cope with situations where the user grabs the turn. Thus, the AUTOROUTE approach is a move in the right direction, but more flexibility is required to handle tutorial dialogue.
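The two-tier idea can be illustrated with a small sketch: a generic question/answer dialogue game implemented as a tiny FSM, and a top-tier plan that strings together instantiated games. The class names, states, and the bill-payment plan below are invented for exposition and are not AUTOROUTE's actual code.

    # Illustrative sketch (not AUTOROUTE's code): a generic question/answer dialogue
    # game as a small FSM, and a top-tier plan whose primitive steps are such games.
    class QAGame:
        """Adjacency pair: the system asks, the user answers, a slot is filled."""
        def __init__(self, question, slot):
            self.question, self.slot, self.state = question, slot, "ASK"

        def step(self, user_input=None):
            if self.state == "ASK":
                self.state = "WAIT_ANSWER"
                return ("system", self.question)
            if self.state == "WAIT_ANSWER" and user_input is not None:
                self.state = "DONE"
                return ("fill", (self.slot, user_input))
            return None

    def pay_bill_plan():
        # Top tier: a deliberative plan instantiated as a sequence of dialogue games
        # (to pay a bill, obtain the bill type, the date, and the amount).
        return [QAGame("Which bill would you like to pay?", "bill_type"),
                QAGame("On what date?", "date"),
                QAGame("For what amount?", "amount")]

    def run(plan, user_answers):
        filled = {}
        for game, answer in zip(plan, user_answers):
            print(game.step()[1])                  # system asks its question
            _, (slot, value) = game.step(answer)   # user answers; the slot is filled
            filled[slot] = value
        return filled

    print(run(pay_bill_plan(), ["electricity", "tomorrow", "30 pounds"]))

Turn-taking is fixed inside the game, which mirrors the limitation noted above: if the user grabs the turn or answers a different question, this simple driver has no way to react.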
3. A 3-LAYER ARCHITECTURE FOR MANAGING TUTORIAL DIALOGUE

Our three-layer architecture for managing tutorial dialogue has been inspired by studying planning architectures used in robotics.

3.1. Planning in Robotics

As we have seen, tutorial dialogue requires an architecture which can support the planning of feedback presentations that may consist of a sequence of dialogue actions to be delivered over several dialogue turns. During the multi-turn presentation, the system must monitor the success of the presentation, and respond effectively to events that indicate that extensions or modifications to the plan are necessary.

In the mid-1990s, a consensus architecture emerged in robotics. This architecture grew out of the realisation that reactive systems and hierarchical planning were not at odds with one another, but rather that, in order to yield robust, flexible, and generalisable behaviour, the strengths of these two methods needed to be combined. In purely reactive systems, robustness is gained at the expense of flexibility and
adaptability. Action and perception are addressed, but cognition is ignored, limiting these robots to mimicking low-level life forms (Arkin, 1989). Extending the behaviours of such systems to more meaningful problem domains requires memory and dynamic representations of the environment, as well as a representation of the goals the system is pursuing. This realisation led to the development of three-layer architectures that have become prevalent in robotics, and which we have adapted to the domain of dialogue.

Bonasso, Firby, Gat, Kortenkamp, Miller, and Slack (1997) describe a 3-layer planning architecture consisting of a controller (reactive component), a sequencer (hybrid component), and a deliberator (deliberative component). The controller is the component that interacts with the physical world. It possesses a library of primitive behaviours that it can execute. The controller's policy is to detect failure rather than to avoid it. The sequencer is a reactive plan execution and monitoring unit. In order to achieve tasks, it selects and sequences primitive behaviours (for execution by the controller) by taking the current situation or context into account. The sequencer has the capability to respond to contingencies (e.g., a failed primitive behaviour can be replaced by another primitive behaviour or by a sequence of behaviours) and also to manage parallel interacting tasks. The deliberator is the planning component of the 3-layer architecture. It produces plans for the sequencer to refine and execute. It also looks for optimisations in the sequence maintained by the sequencer and can repair plans in response to exceptions.

To see this architecture in action, consider an office robot that is given the goal of taking a worker's coffee mug to the kitchen sink. The robot's 3-layer planning architecture has the following division of labour. First, the deliberator decomposes the original goal and generates a high-level plan, say: (i) search the office for a coffee mug; (ii) find the door; (iii) exit safely; (iv) go to the kitchen; and (v) put the mug in the sink. This sequence of high-level steps is then passed to the sequencer, which performs a situation-dependent plan refinement. For example, the sequencer may refine step (iii), exit safely, to: (iii.a) move to the centre of the office; (iii.b) point toward the door; (iii.c) follow the current heading with obstacle avoidance activated. Once the current step in the sequence is refined, its expansion is passed to the controller. The controller then executes primitive behaviours like obstacle avoidance, wall finding, wall alignment, following headings, and wandering.

Now reconsider Section 1.2, which contains our discussion of the many different tasks of managing dialogue in general, and tutorial dialogue in particular. Given the proposed planning architecture, we need to identify the appropriate boundaries for the subdivision of functionality between the three planning layers. Also, we need to determine how we can achieve an effective coordination between the three layers. In order to address these questions and to validate and propagate our answers, we have built BEETLE, the Basic Electricity and Electronics Tutorial Learning Environment. In the next section, we give an introduction to BEETLE's planning architecture. Section 4 then presents BEETLE's intelligent information presentation strategies from the perspective of its users.
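A compact sketch of this division of labour, using the coffee-mug scenario above, might look as follows; the function names and the refinement table are invented for exposition and do not come from Bonasso et al.'s system.

    # Illustrative sketch of the 3-layer division of labour (names invented).
    def deliberator(goal):
        # Top layer: decompose the goal into a high-level plan, leaving details open.
        if goal == "mug to sink":
            return ["find mug", "find door", "exit safely", "go to kitchen", "put mug in sink"]
        return []

    REFINEMENTS = {  # Middle layer: situation-dependent refinements of high-level steps.
        "exit safely": ["move to centre of office", "point toward door",
                        "follow heading with obstacle avoidance"],
    }

    def sequencer(task):
        return REFINEMENTS.get(task, [task])   # unrefined tasks are treated as primitive

    def controller(behaviour):
        # Bottom layer: execute a primitive behaviour and report success or failure.
        print("executing:", behaviour)
        return True

    def run(goal):
        for task in deliberator(goal):
            for behaviour in sequencer(task):
                if not controller(behaviour):
                    # In the full architecture the sequencer would try another method or
                    # ask the deliberator for a plan repair; this sketch simply stops.
                    raise RuntimeError("behaviour failed: " + behaviour)

    run("mug to sink")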
3.2. A 3-layer Architecture for Managing Tutorial Dialogue.
Figure 1. BEETLE's Dialogue Management Architecture. [Figure not reproduced: it shows the external knowledge sources (Domain Knowledge, Curriculum, Student Model, Tutorial Strategies), the Information State, the Update module, and the Deliberator–Sequencer–Controller planning stack with its Interpret and Generate components for Input and Output; the links between the layers are labelled “request repair”, “resume execution”, “monitor results”, and “dialogue moves”.]
Figure 1 displays BEETLE's underlying generic and modular architecture for the management of tutorial dialogue. It is divided into four major parts (from left to right): external knowledge sources, the information state, the update engine, and the three-layered planning and execution architecture. BEETLE's architecture emphasises the importance of clearly separating the knowledge sources involved in tutorial dialogue. This aims to minimise the re-representation of knowledge in different parts of a tutor system. In addition, each knowledge source can be transparently accessed from other modules of the tutoring system via the information state (see below). The information state establishes an interlingua among modules with varying representations. The deliberator, for instance, can thus easily access domain knowledge, student modelling expertise, learning goals, and pedagogical strategies. Moreover, this design emphasises and encourages the reusability of components.

3.2.1. The Information State and the Update Module.
The information state (IS) captures the overall dialogue context and interfaces with external knowledge sources as well as with the planning modules. In particular, the IS contains a dialogue history that records all prior dialogue moves and the
common ground. Dialogue moves refer to the conversational actions performed by an utterance. For example, the statement “I didn't understand you” requests information and signals non-understanding of the previous speaker's utterance. The common ground refers to the set of propositions that both dialogue participants have agreed upon in prior discourse. The IS also contains a list of salient objects to facilitate the treatment of linguistic phenomena such as anaphora. Moreover, the IS maintains a stack of any pending discourse obligations, allowing the system to deduce the issues that it needs to, or intends to, address in future dialogue continuations.

The IS is maintained by the update module. This module is activated after each tutor and student turn, which usually consists of more than one dialogue move. Its update rules encode conversational expertise and define how to update the current context given a new dialogue move. Two example update rules are given below:
doDiagQuery:
  precond: latest turn contains move of type DIAG_QUERY
  effects: add DIAG_QUERY move to CDU
           add obligation for hearer to address move to CDU

doAssert:
  precond: latest turn contains move of type ASSERT
           asserted content addresses a previous DIAG_QUERY or INFO_REQ, say Q1
  effects: add ASSERT move to CDU
           remove speaker's obligation to address Q1
           add hearer's obligation to address assertion
           add that speaker is committed to propositional content of assertion
           add that if hearer accepts assertion, then add assertion to common ground

The update rule engine fires the rule doDiagQuery if the latest turn contains a diagnostic query dialogue move. If so, that move would be entered into the current discourse unit (CDU), and an obligation for the hearer to address this move would be created. The rule doAssert fires if the latest turn contains an assertion and if its asserted content addresses a previous diagnostic query or information request. If this is the case, then the assertion is added to the CDU, the speaker's obligation to address the question is deleted, the hearer is now obliged to address the assertion, and the propositional content of the assertion is a candidate for entering the common ground.
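A minimal sketch of such an update engine, with the two rules above rendered as Python functions, is given below. The information-state fields and the move encoding are simplified inventions for this exposition; BEETLE's actual rules are implemented in TRINDIKIT (see Section 3.3).

    # Illustrative sketch of an information-state update engine; the field names and
    # move encoding are simplified inventions, not BEETLE's TRINDIKIT rules.
    info_state = {"cdu": [], "obligations": [], "commitments": []}

    def do_diag_query(is_, move):
        if move["type"] != "DIAG_QUERY":
            return False
        is_["cdu"].append(move)                                       # add move to CDU
        is_["obligations"].append((move["hearer"], "address", move))  # hearer must address it
        return True

    def do_assert(is_, move):
        # Fires only if the asserted content addresses a pending DIAG_QUERY or INFO_REQ
        # (the "addresses" test is simplified to: any pending query directed at the speaker).
        pending = [o for o in is_["obligations"]
                   if o[0] == move["speaker"] and o[2]["type"] in ("DIAG_QUERY", "INFO_REQ")]
        if move["type"] != "ASSERT" or not pending:
            return False
        q1 = pending[0]
        is_["cdu"].append(move)
        is_["obligations"].remove(q1)                                 # speaker has addressed Q1
        is_["obligations"].append((move["hearer"], "address", move))  # hearer must react
        is_["commitments"].append((move["speaker"], move["content"]))
        # If the hearer later accepts the assertion, its content enters the common ground.
        return True

    def update(is_, turn):
        # A turn usually contains more than one move; try every rule on every move.
        for move in turn:
            for rule in (do_diag_query, do_assert):
                rule(is_, move)

    update(info_state, [{"type": "DIAG_QUERY", "speaker": "tutor", "hearer": "student",
                         "content": "what is the goal of this lab?"}])
    print(info_state["obligations"])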
3.2.2 Response Generation Using the 3-layer Architecture

The response generation module computes appropriate tutorial moves and synthesises tutorial feedback as text or in other modalities. It uses the 3-layer planning architecture: a deliberative planner that projects into the future and anticipates and solves problems (top layer); a sequencer or plan execution and monitoring system that performs adaptive on-the-fly refinement (middle layer); and a controller or
perception/action system that interprets student actions and performs primitive tutor actions (bottom layer).

Top-layer: Deliberative Planning. The deliberation component synthesises plans from action descriptions, or operators, at the highest possible level of abstraction. This abstraction minimises wasted effort by allowing the middle layer to perform a maximum of situation-adaptive plan refinement. High-level planning results in a structured sequence of tasks that are passed to the task agenda of the middle layer; for each task, there is a reactive action package (see below) that achieves it, if successfully executed. The deliberator is explicitly activated in two cases: when the tutor begins a tutorial dialogue, and during the dialogue when the middle layer encounters a failure during plan execution and requests that the top layer perform a plan repair. However, the deliberative component also has a permanent background activity. It regularly inspects the agenda of the middle layer for two reasons: verifying whether pending discourse obligations (as recorded in the IS) are covered by the contents of the agenda, and anticipating problems or optimising the agenda's content. In both cases, the top layer can add or delete items from the task agenda as well as reorganise or aggregate them. The top and middle layers therefore need to synchronise their access to the agenda.

Figure 2 depicts the operator do_instruct_procedure that can be used to instruct a student on how to perform a procedure. Its application leads to a high-level discourse plan with three main parts: first, it plans to open the discourse by communicating some initial thoughts about the procedure (open_lab); second, for each step in the domain plan, a discourse plan step instruct_step is generated; and third, after each step of the procedure has been instructed, close_lab evaluates the student's performance and terminates the lab with closing remarks.
schema do_instruct_procedure;
  vars ?lab = ?{type learning_goal},
       ?args = ?{satisfies listp},
       ?domain_plan = ?{satisfies listp};
  expands {instruct_procedure ?lab ?args};
  conditions
    only_use_if {teaching_mode} = SOCRATIC,
    compute {beer (has_planp ?lab ?args)} = t,
    compute {beer (activate_plan ?lab ?args)} = success,
    compute {beer (active_plan)} = ?domain_plan;
  nodes
    1 action {open_lab ?lab},
    2 iterate action {instruct_step ?step} for ?step over ?domain_plan,
    3 action {close_lab ?lab};
  orderings 1 ---> 2, 2 ---> 3;
  effects {student_knows (lab ?lab)} = true at end_of self;
end_schema;

Figure 2. A High-Level Plan Operator for Instructing a Procedure (in OPLAN syntax)
The operator do_instruct_procedure is written in the task description language of OPLAN, the Open Planning Architecture (Currie & Tate, 1991). Consider its four preconditions. The first condition, {teaching_mode} = SOCRATIC, ensures that the operator is only applied if the tutor is using a Socratic tutoring style. The other conditions interface with an external domain reasoner for Basic Electricity & Electronics, called BEER, which is queried three times. The first query checks whether BEER has a domain plan for the procedure the student is to perform; the second query checks whether BEER is able to activate the plan (NB: if a plan is activated, then BEER can be used to track the student's progress through its steps); and the third query asks BEER to return a sequential representation of the domain plan. This plan is used to instantiate the instruct_step operator.

Figure 3 displays a simplified version of the operator instruct_step. It consists of six steps: in the first four steps, it is planned that the tutor takes the turn, makes an assertion (namely, that the next step is ?step), issues a directive telling the student to perform that step, and finally gives away the turn. It is then planned that the tutor waits for the student to react (get_student_input), and subsequently supplies feedback (supply_feedback) addressing the student's answer(s) or action(s).

schema do_instruct_step;
  vars ?step = ?{satisfies listp},
       ?s_input = ?{satisfies listp};
  expands {instruct_step ?step};
  nodes
    1 action {take_turn},
    2 action {assert (next_step ?step)},
    3 action {direct (do ?step)},
    4 action {give_away_turn},
    5 action {get_student_input ?s_input},
    6 action {supply_feedback ?s_input};
  orderings 1 ---> 2, 2 ---> 3, 3 ---> 4, 4 ---> 5, 5 ---> 6;
  effects {student_knows (next_step ?step)} = true at end_of 2,
          {student_performs (do ?step)} = true at end_of 3;
end_schema;

Figure 3. Instructing a Step
Currently, there is only one realisation of instruct_procedure, but there are several other realisations of instruct_step. An example of another decomposition asks the student whether he knows the next substep rather than simply asserting what the next substep is.

Middle-layer: Context-Driven Plan Refinement. Reactive Action Packages, introduced by Firby (1989), are the basic building blocks of a situation-driven plan refinement system. A reactive action package (RAP) groups together and describes
all ways to carry out a specific task in different situations. In BEETLE, we represent a RAP as a set of possible OPLAN action decompositions. Our sequencer, the OPLAN execution and monitoring environment, executes the contents of the agenda, as filled by the deliberator. First, it selects the next task to be performed from the agenda. Then, it checks the selected task against the information state to see whether its effects have already been achieved. If this is the case, the task is deleted. Otherwise, the sequencer identifies the RAP that can achieve the task. The methods of the identified RAP are checked, and the most appropriate of them is selected. If the chosen method results in a primitive action or a sequence thereof, then it is delegated to the bottom layer for execution; if the method is a complex task, then each of its subtasks is put on the agenda, and a new interpretation cycle starts. The execution of a RAP can fail for three reasons: its preconditions are not met, none of its methods are applicable, or the execution of one of its primitive methods fails. The sequencer can cope with some failures, e.g., it can try another applicable method. In the other cases, it has to call the top layer to cope with the failure.

schema do_supply_positive_feedback;
  vars ?step = ?{satisfies listp};
  expands {supply_feedback ?step};
  conditions
    only_use_if {answer_correctness ?step} = CORRECT at begin_of self;
  nodes
    1 action {play_positive_feedback_game ?step};
end_schema;

schema do_supply_feedback_question_certainty;
  vars ?step = ?{satisfies listp},
       ?aut, ?app = ?{satisfies number};
  expands {supply_feedback ?step};
  conditions
    only_use_if {answer_correctness ?step} = INCORRECT at begin_of self,
    compute {beesm (get_aut_app)} = {?aut, ?app},
    compute {?aut ~ 0.45} = t,
    compute {?app ~ 0.4} = t;
  nodes
    1 action {take_turn},
    2 action {diag_query (question_certainty ?step)},
    3 action {give_away_turn},
    4 action {get_student_input};
end_schema;

Figure 4. Two Decompositions for Supplying Feedback.
Figure 4 shows a RAP consisting of two OPLAN operators for the provision of tutorial feedback. The first operator is used to provide positive feedback to a correct student answer or GUI action; the second operator is applicable if the student’s answer is incorrect. There are several other supply_feedback operators that cover
cases where the student is partially correct, incorrect, or stuck, and that take other tutorial parameters (e.g., student performance, tutor verbosity) into account as well.

Bottom-layer: Perception/Action System. The bottom layer is responsible for the interpretation of student input and for the execution of primitive dialogue actions. The interpretation module allows the student to interact with the system via text and graphical means. It identifies the meaning and intention behind a student utterance. This includes its syntactic analysis, the construction of a representation of its propositional content, the recognition of its speech act (e.g., statement), the recognition of the intent behind the utterance (i.e., assertion and answer), and an evaluation of the student's utterance or action for correctness. The generation system gets a sequence of elementary dialogue moves and microplans the generation of multi-modal feedback (natural language utterances and GUI actions). The bottom layer is supported by a sentence planner and a media planner, both of which have access to the full dialogue context. In particular, these components consult the list of salient objects and the contents of the previous and current discourse units to generate natural feedback that makes use of elliptical constructions and anaphoric expressions. Action execution fails if the sentence or media planner fails.
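The sequencer's interpretation cycle described above can be summarised in a short sketch. The task and method encodings below are invented for exposition; the real middle layer is built on OPLAN's execution and monitoring environment.

    # Illustrative sketch of the sequencer's interpretation cycle (encodings invented).
    from collections import deque

    def sequencer_cycle(agenda, raps, is_achieved, execute, request_repair):
        """agenda: deque of task names; raps: task name -> list of candidate methods.
        A method is either ('primitive', action) or ('complex', [subtask, ...])."""
        while agenda:
            task = agenda.popleft()
            if is_achieved(task):                      # effects already hold: drop the task
                continue
            methods = raps.get(task, [])
            if not methods:
                request_repair(task)                   # no contingency: ask the deliberator
                continue
            kind, body = methods[0]                    # "most appropriate" method, simplified
            if kind == "primitive":
                if not execute(body):
                    request_repair(task)               # execution of a primitive action failed
            else:
                agenda.extendleft(reversed(body))      # refine: push subtasks, start a new cycle

    # Toy run: instructing a step refines into four primitive dialogue actions.
    raps = {"instruct_step": [("complex", ["take_turn", "assert_next_step",
                                           "direct_do_step", "give_away_turn"])],
            "take_turn": [("primitive", "take_turn")],
            "assert_next_step": [("primitive", "assert")],
            "direct_do_step": [("primitive", "direct")],
            "give_away_turn": [("primitive", "give_away_turn")]}
    sequencer_cycle(deque(["instruct_step"]), raps,
                    is_achieved=lambda t: False,
                    execute=lambda a: print("controller executes:", a) or True,
                    request_repair=lambda t: print("plan repair requested for", t))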
3.2.3. Turn-Taking Management.

The tutoring agent releases the turn after it asks a question or requests that the student perform an action. In all other cases, the sequencer “cycles” until a question or action request is generated. If the student takes the initiative and grabs the turn, then this dialogue move will be recorded in the IS, generating an obligation for the tutor to address the last student utterance. The top layer then deliberates over the new situation and may change the contents of the agenda accordingly. Similarly, if the student fails to react within a certain time limit, then the interpretation module generates an appropriate dialogue act, which the update engine processes, generating an obligation for the tutor to address the student's silence. The deliberative planner can then decide either to give the student more time, or to take the turn and supply help.
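This turn-taking policy can be caricatured in a few lines; the move types and information-state fields below are again invented placeholders rather than BEETLE's actual encoding.

    # Illustrative sketch of the turn-taking policy (move types and fields are invented).
    RELEASING_MOVES = {"diag_query", "info_req", "direct"}   # questions and action requests

    def tutor_turn(generate_next_move):
        # Keep cycling until a move is produced that hands the turn to the student.
        moves = []
        while True:
            move = generate_next_move()
            moves.append(move)
            if move["type"] in RELEASING_MOVES:
                return moves                                  # release the turn

    def on_student_event(info_state, event):
        # Student initiative and student silence both become obligations for the tutor,
        # which the deliberator then takes into account when revising the agenda.
        if event == "timeout":
            info_state["obligations"].append(("tutor", "address", "student_silence"))
        else:
            info_state["obligations"].append(("tutor", "address", event))

    is_ = {"obligations": []}
    on_student_event(is_, "timeout")
    print(is_["obligations"])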
3.3 Implementation Status

We have built BEETLE, a prototype implementation of our computational framework for managing tutorial dialogue that serves to both validate and propagate our ideas. It is primarily based on two technologies, the TRINDIKIT dialogue system shell (Larsson & Traum, 2000) and the Open Agent Architecture (OAA) (Martin, Cheyer, & Moran, 1999). BEETLE's information state is entirely maintained and updated using TRINDIKIT. We were able to re-use the conversational expertise that was implemented in an information-seeking TRINDIKIT-based dialogue system, namely EDIS (Matheson, Poesio, & Traum, 2000). Many update rules were used unchanged; some were adapted for the purpose of tutoring, and new ones were written that capture dialogue moves that are only present in the tutorial dialogue
genre (e.g., hinting moves). TRINDIKIT also helps to glue together the different components of the system. Its language allows us to define the flow of control, and the information state serves as a representation of the dialogue context that is shared among the various system components. Moreover, all major system components have been agentified with OAA wrapper code. OAA enables the dialogue system engineer to use components that are written in a variety of programming languages and run on different platforms.

In line with our goal to provide a flexible, modular, and thus reusable architecture, domain reasoning is performed by a single agent. The BEER agent encodes BEETLE's knowledge about basic electricity and electronics. It has a rich LOOM (see http://www.isi.edu/isd/LOOM/LOOM-HOME.html) representation of BEE concepts, can perform basic inferences in this domain, and has explicit representations of domain plans. BEETLE's deliberative planner, OPLAN, creates instantiated dialogue plans from its general dialogue strategies by accessing the relevant domain knowledge represented in BEER. As we described, OPLAN is used on both the top layer and the middle layer. Its deliberative capabilities have been extended by a plan execution and monitoring environment that implements a good part of the functionality of the sequencer: selecting the next element in the sequence, passing it down to the bottom layer, and monitoring its success. The sequencer can handle failures by asking the deliberative component to perform a plan repair. However, we need to extend the sequencer with a situation-adaptive plan refinement capability. At the time of writing, fully-fledged discourse plans are executed by the sequencer, frequently fail, and are therefore equally frequently repaired.

We have implemented an agent (BEESM) that determines the values for two variables that are used in constructing the system's response: autonomy and approval. Following Brown and Levinson (1987), autonomy captures the freedom of action and freedom from imposition by other agents; approval captures a positive, consistent self-image that is appreciated and approved of by other agents. Applied to tutoring, for example, the tutor can ask the student whether he knows what to do next (high autonomy) or tell the student what to do next (low autonomy). Furthermore, a high approval response would be to praise a student answer, whereas a low approval response would be to simply go on to the next topic. Autonomy and approval are determined by a Bayesian network that combines evidence from several situational factors derived from a study with teachers (e.g., aptitude, correctness, material left, time left) (Porayska-Pomsta, Mellish, & Pain, 2000). BEEGLE, the user interface agent, sends BEESM messages when relevant GUI actions occur (e.g., button presses, lab actions) to allow BEESM to infer evidence of the presence of the situational factors.

BEETLE's perception and action system includes NUBEE, a natural language understanding module based on the CARMEL toolkit (Rosé, 2000), and BEETLEGEN, an XSLT-based (W3C, 1999) sentence realiser. The CARMEL toolkit comes with a spelling corrector, a robust parser, and a wide-coverage English grammar and lexicon (COMLEX). The output of CARMEL is disambiguated and translated into BEETLE's logical form format, and references to objects are resolved. Reference resolution works by querying BEER for suitable objects matching the input description. In case of ambiguity, NUBEE can query the
information state to determine which objects are most salient. Although CARMEL has a wide-coverage lexicon, it only contains syntactic and not semantic information. Thus, NUBEE's vocabulary is limited to the words for which we have provided semantics. As a start, we have covered common expressions for the elements in BEER and are now adding vocabulary not directly related to BEE concepts, such as “need”, “think”, and “want”. In addition, we use WordNet (Miller, 1990) to look for known synonyms of unknown words to help expand the coverage.

The natural language generation component, BEETLEGEN, performs pipelined XML transformations for the different stages of the generation process. The basic idea is to use XSLT stylesheets to transform a semantic input tree provided by OPLAN into a syntactic output tree that includes lexical items. The first step of the generation pipeline is to match the semantic input tree and build up an initial syntactic structure, using a hybrid template-based and rule-based approach. The following stages perform NP-realization and make decisions about pronominalization as well as forming ellipses. The next step then performs lexical lookup, including inflection, for the non-canned parts of the generated tree structure. The last stage reads off the lexical items from the tree and produces polished English text (including proper punctuation and capitalization). BEETLEGEN's context-dependent pronominalization, its lexical variation, and its capability to omit redundant speech acts are supported by two agents, namely, the information state agent and the rudimentary student modelling agent BEESM.
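The staged organisation of BEETLEGEN can be mimicked with ordinary functions instead of XSLT stylesheets. The following sketch is invented for exposition: the stage boundaries follow the description above, but the semantic input, the rules, and the example output are ours, not BEETLEGEN's.

    # Illustrative sketch of a staged generation pipeline in the spirit of BEETLEGEN.
    # The real system chains XSLT stylesheets over XML trees; here each stage is a plain
    # Python function, and the semantic input and rules are invented for exposition.
    def build_syntax(sem):
        # Stage 1: map the semantic input to an initial syntactic skeleton
        # (a hybrid of templates and rules in the real system).
        return {"verb": sem["action"], "obj": sem["theme"], "pp": "to the circuit"}

    def realise_nps(syn, salient):
        # Stage 2: NP realisation and pronominalisation for already-salient objects.
        syn["obj"] = "them" if syn["obj"] in salient else "the " + syn["obj"]
        return syn

    def lexicalise(syn):
        # Stage 3: lexical lookup and inflection (trivial here: imperative mood).
        return [syn["verb"], syn["obj"], syn["pp"]]

    def linearise(words):
        # Stage 4: read off the words, add punctuation and capitalisation.
        text = " ".join(words)
        return text[0].upper() + text[1:] + "."

    sem = {"action": "connect", "theme": "multimeter leads"}
    print(linearise(lexicalise(realise_nps(build_syntax(sem), salient={"multimeter leads"}))))
    # -> "Connect them to the circuit."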
4. THE BE&E TUTORIAL LEARNING ENVIRONMENT (BEETLE)

4.1 The Graphical User Interface

Figure 5. A BEE Hypertext Lesson Page on Measuring Current
The starting point for BEETLE's graphical user interface was a course on basic electricity and electronics, which was written in HTML and the VIVIDS authoring environment (Munro, 1994) at the Navy Personnel Research and Development Center. This course included nine multiple-choice quizzes and four labs requiring students to make basic measurements using a multimeter in a simulated environment. Two more complex labs involving building simple circuits and solving equations were added later (Rosé, Moore, VanLehn, & Allbritton, 2000). To integrate this curriculum with the rest of our system, it was necessary to re-implement the quizzes and labs and build an HTML viewer. This graphical interface, which also includes a chat window allowing typed dialogue with the system, is written in Tcl/Tk and uses add-on packages for the HTML display.

In interacting with BEETLE, students read textbook-style lessons written in HTML and then perform labs using the graphical user interface (GUI). In this section, we discuss in more detail the HTML pages, the quizzes, and the labs. BEETLE features a set of BEE lessons, each of which consists of hypertext pages that contain instructive text as well as static illustrations and animated images. Figure 5 displays part of the hypertext lesson “characteristics of current”. It features an animated image that illustrates the directed flow of current in a circuit. The illustration is supplemented by a paragraph of instructive text. The student can browse through the hypertext pages at his own pace, using the left and right arrows. Many pages, such as the one in Figure 5, are animated, and some hypertext pages contain links that allow the student to explore the material in greater detail. The material is structured into lessons, each of which has several sub-lessons ending in a progress check, a multiple-choice quiz designed to assess the student's understanding of the sub-lesson. Figure 6 shows an example.
Figure 6. A Multiple-Choice Question
Note that BEETLE currently has no tutorial strategies for helping students with the quizzes, but in future work these will be used to help students when they get a wrong answer, or click “Don't know” or “Help on Question”. Currently, the student is simply
told to try again if he gets a wrong answer, and given the correct answer if he clicks “Don't know”. The current tutorial strategies focus on the labs. After the student successfully completes the theoretical part of a lesson (that is, reading a sequence of sub-lessons and answering the associated multiple-choice questions), he must complete a practical exercise, a so-called lab. Figure 7 displays a screenshot of BEETLE while going through the “measure current” lab with a student. The screen is divided into four large areas: (top left) the lessons done so far, (top right) the measuring current lab, (middle) the dialogue box, and (bottom) the student input entry window. The student can interact with the BEETLE interface in two ways. He can perform actions in the lab (e.g., open the switch of the circuit, remove a wire, set the function switch of the multimeter) or type his questions and answers in English in the student input window. For example, the student can address a tutor question of the form “do you know what to do next” by typing his answer into the student input window, or by performing a GUI action that he thinks is the next step in the current procedure. BEETLE then supplies tutorial feedback in English by displaying it in the dialogue window.
Figure 7. BEETLE Engaging in Tutorial Dialogue on Measuring Current.
4.2. Example Dialogue
Figure 8 depicts an excerpt from a human-human tutorial dialogue, as it occurs in our BEE corpus. Note that this is a dialogue with a human tutor set within the
VIVIDS learning environment.1 The dialogue's topic is the multi-step procedure for measuring current. Earlier, the student read the relevant BEE lesson and answered a number of associated multiple-choice questions. The excerpt is part of a larger dialogue context. Before this dialogue fragment began, the student interacted with the VIVIDS software. The student received a sequence of instructions and simple positive and negative multi-modal feedback. The student encountered no problems in successfully carrying out the first three instructions of the procedure measure current (de-energise the circuit, set the multimeter switch to CDC, and remove a wire), but then had problems connecting the leads of the multimeter to the appropriate places in the circuit. A human tutor then initiated the dialogue in Figure 8. We now explain how the 3-layer architecture of the BEETLE system can play the role of the tutor in the dialogue.

[1] T: Now connect the leads of the multimeter to the circuit!
[2] S: <student floundering>
[3a] T: OK,
[3b] can you tell me what the goal of this lab is?
[4] S: where do i connect the red lead?
[5] T: Before I answer that, answer the question I just asked you.
[6] S: to connect the multimeter in series
[7a] T: Good,
[7b] and what is the purpose of doing that?
[8] S: to measure current
[9] T: And do you remember what it means to hook something up in series?
[10a] S: yes,
[10b] i had to think for a minute.
[10c] let me try again
[11a] T: OK,
[11b] go ahead. ... Figure 8. An Example Dialogue.
When the tutoring session begins, the deliberative planner, OPLAN, is given the initial learning goal of teaching the student the procedure for measuring current. The planner, informed by its knowledge sources (e.g., the BEER domain reasoner that provides a domain plan for measuring current), generates a high-level discourse plan:

T1: open_lab(“measuring current”)
T2: instruct_step(“de-energise circuit”)
T3: instruct_step(“set multimeter switch to CDC”)
T4: instruct_step(“remove a wire”)
T5: instruct_step(“connect multimeter leads”)
T6: instruct_step(“energise the circuit”)
T7: instruct_step(“take meter reading”)
T8: close_lab(“measuring current”)

While the high-level step open_lab can be fleshed out by the deliberative planner, the step close_lab has the effect of summarising the student's performance, and therefore cannot be planned in advance. In a similar vein, the deliberator does not need to expand any of the instruct_step tasks since the instruction of a step may depend on the student's prior performance. Once a high-level plan has been constructed, the deliberator passes the sequence of tasks to the agenda of the sequencer, which then starts processing the agenda's content.

Assume that the sequencer has successfully processed the tasks T1–T4, and that it is now choosing to process T5. The task instruct_step is complex, that is, not a primitive behaviour that can be executed by the controller. It must therefore be refined. Given the current dialogue situation, an applicable method for its decomposition is identified, and the sequencer produces the agenda:2

T51: sequence assert(next_step(“connect multimeter lead”))
              direct(do(“connect multimeter lead”))
T52: supply_feedback(did(“connect multimeter lead”))
T5: instruct_step(“connect multimeter lead”)
T6: instruct_step(“energise the circuit”)
T7: instruct_step(“take meter reading”)
T8: close_lab(“measuring current”)

The sequencer then starts another processing cycle; assume it chooses to process T51, which is an elementary task and is therefore passed down to the bottom layer. The controller's feedback generator takes the assert and direct dialogue moves and generates a single English sentence to achieve their intended effects, in turn [1] of Figure 8. Here, the assert move has been realised as an indirect effect of the directive. T51 is deleted from the agenda, and the update module adds the tutor's turn to the information state. Update rules, triggered by the turn, update the previous and current discourse units and add an obligation for the student to address the tutor's action directive. The tutor releases the turn.

Assume now that the student performs a number of GUI actions, none of which is connecting the multimeter leads to the circuit in a systematic and correct manner, and assume that the controller's interpret module posts the dialogue act “student floundering”. The update module updates the information state by deleting the student's obligation to address the tutor's directive (the student did address the utterance, but was unable to perform it), and by adding an obligation for the tutor to address the student's floundering move. If the task agenda contains a RAP designed to handle this event, then the sequencer will simply execute one of its applicable methods. Otherwise, the sequencer informs the top layer that it has no contingency for the situation and asks for a plan repair. Let us assume that the RAP supply_feedback can indeed handle the student's floundering, and that one of its applicable methods is to apply the
Let us assume that the RAP supply_feedback can indeed handle the student's floundering, and that one of its applicable methods is to apply the subgoal_reification_strategy. The sequencer expands T52 into the tasks T521, T522, and T523, and we obtain:
T521: make_salient_goal_structure("measure current")
T522: make_salient(howto("connect multimeter leads"))
T523: direct("connect multimeter leads")

Task T521 is selected; it is decided that the goal structure is not yet, or no longer, salient, and T521 is expanded into:

T5211: diag_query("can you tell me what the goal of this lab is")
T5212: supply_feedback(answered("can you tell me what the goal of this lab is"))

Task T5211 is primitive, and therefore it is passed to the bottom layer. The feedback generator now generates the utterances [3a] and [3b] of Figure 7. Note that the sentence planner generates "OK" to resolve the tutor's pending low-level dialogue obligation to acknowledge the student's last move. It is now the student's turn, and he utters [4] instead of resolving his obligation to address the tutor's question. The update module therefore creates an obligation for the tutor to address this question, and obviously, T5212 is not applicable, nor are any other pending tasks in the agenda. The sequencer therefore calls the deliberator to cope with the situation. The deliberator inspects the contents of the agenda as well as the IS and decides to resolve its obligation to address the student's question by postponing its answer. Since the student did not address the tutor's question, the deliberator also decides to remind the student to answer its question [3b]. These two tasks are inserted into the task agenda, and since the sequencer identifies them as primitive, it passes them to the feedback generator, yielding [5].

The controller's interpret module identifies the student's utterance in [6] as a partially correct answer to [3b]. The update module deletes the student's obligation to address [3b] and creates an obligation for the tutor to address the student's answer. The sequencer is able to address the new obligation by executing T5212. The RAP supply_feedback is expanded by a multi-turn tutoring strategy that is successfully carried out in [7a,7b,8]. The task T522 is to ask the student if he knows how to connect the leads; the sub-tasks T5221 (diag_query) and T5222 (supply_feedback) are generated and executed to accomplish T522. Note that the sentence planner realises T5221 as [9] using the prior context, especially the student's utterance [6]. The student answers [9] with [10a], [10b], and [10c], the first two of which can be handled by T5222, producing [11a]. In [10c], the student asked the tutor for permission to try again, and the update module generated an obligation to address the student's request. Here, the planning engine falls back into deliberation, since this obligation was not anticipated. The task T523 is replaced by the task T524, grant_permission("connect multimeter leads"). The bottom layer produces [11b] to accomplish T524, and the resulting agenda is then:
T52: supply_feedback(did("connect multimeter leads"))
T5:  instruct_step("connect multimeter leads")
T6:  instruct_step("energise the circuit")
T7:  instruct_step("take meter reading")
T8:  close_lab("measuring current")

The tutor then waits for the student to connect the leads. Note that the student could have reacted differently in each of his turns [2,4,6,8,10]. A flavour of BEETLE's dialogue flexibility and robustness can be gained from the interaction depicted in Figure 7. Here, the student defied the system's expectations most of the time. BEETLE was able to handle each of the cases by a discourse plan repair in the planning engine or by a low-level communication management action (the generation of a request_rephrase move) in the update module.

5. RELATED WORK IN TUTORIAL DIALOGUE SYSTEMS

In Section 2, we focused on previous work on dialogue management with sound theoretical underpinnings. This work was done in relatively simple domains such as information-seeking dialogues (e.g., getting travel information) and action-seeking dialogues (e.g., phone banking transactions). The work on tutorial dialogue management is more ad hoc; we discuss it here to show that a more principled approach is necessary. We review the dialogue management in the tutoring systems AUTOTUTOR (domain: computer literacy) (Graesser, Wiemer-Hastings, Wiemer-Hastings, Kreuz, & Tutoring Research Group, University of Memphis, 1999), ATLAS-ANDES (Newtonian mechanics) (Schulze, Shelby, Treacy, Wintersgill, VanLehn, & Gertner, 2000), and CIRCSIM/APE (circulatory system) (Khuwaja, Evens, Michael, & Rovick, 1994; Freedman, 2000), as well as in the EDGE explanation system (electrical devices) (Cawsey, 1989).

AUTOTUTOR's dialogue management can be regarded as an adaptation of the form-filling approach to tutorial dialogue; to solve the feedback generation problem, it adds feedback moves to slots. AUTOTUTOR's dialogue management relies on a curriculum script, a sequence of topic formats, each of which contains a main focal question and an ideal complete answer. The ideal complete answer consists of several sub-answers, called aspects. Each aspect includes: a good answer for that aspect; a list of anticipated bad answers corresponding to misconceptions and bugs, along with corrections for those misconceptions and bugs; lists of prompts and hints that can be used to get the learner to contribute more information; and elaboration and summary moves that can be used to provide the learner with additional or summarising information. All of the moves are hard-coded in English. Using latent semantic analysis, AUTOTUTOR evaluates the student's answer to the main focal question against all the good answers of its ideal complete answer, and the anticipated bad answers. AUTOTUTOR gives immediate feedback based on the student's answer, and then executes dialogue moves that get the learner to contribute more information until all answer aspects are sufficiently covered.
The category and content of tutor dialogue moves are computed by a set of 20 fuzzy production rules and an algorithm that selects the next answer aspect to focus on. While AUTOTUTOR's dialogue management performs well in the descriptive domain of computer literacy, it is unclear how well this approach will work in problem-solving domains such as algebra or circuit trouble-shooting. In these domains, student answers will often require the tutor to engage the student in a multi-turn remediation sub-dialogue. Curriculum scripts are not nested and do not allow the representation of multi-turn dialogues.

The dialogue manager of ATLAS-ANDES operates in the domain of Newtonian mechanics and thus must address AUTOTUTOR's aforementioned limitations. It uses a combination of knowledge construction dialogues (KCDs), which are recursive FSMs (Jordan, Rosé, & VanLehn, 2001), and a generative planner (Freedman, 2001). The grammar of a KCD definition bears many similarities to an AUTOTUTOR curriculum script. A student answer to a tutor question can be divided into correct and incorrect sub-answers with associated tutorial remediations. Unlike AUTOTUTOR, this feedback may extend over multiple turns through the use of recursive KCDs. While AUTOTUTOR requires a pre-defined and hand-crafted curriculum script, the ATLAS-ANDES approach allows on-the-fly generation of nested KCDs, using the APE discourse planner. The ATLAS-ANDES architecture is therefore similar to the 2-tier AUTOROUTE architecture: its KCDs are dialogue games that are more complex but domain-specific. The simple generic question-answer pair is replaced by a recursive automaton that can deal with a specific question and many anticipated possible correct and incorrect sub-answers. A compiler maps KCDs into plan operators, and APE is then used to combine KCDs into larger recursive FSMs. The developers of ATLAS-ANDES propose a solution to the rigidity that is typically associated with FSM-based systems: the reactive component of APE can skip around in the recursive KCD; for example, it can pop sub-networks that ATLAS-ANDES believes contain intentions that were already dealt with in prior dialogue.

In contrast, dialogue management in EDGE and CIRCSIM/APE is purely plan-based. EDGE provides two types of (STRIPS-like) operators: discourse and content operators. Discourse operators model Sinclair and Coulthard's four levels of discourse, namely, transaction, exchange, move, and act (Sinclair & Coulthard, 1975). Content operators specify how to construct explanations; for example, "to describe a device: explain its function, its structure, and its behaviour". EDGE's content operators are quite general; their bodies contain abstract domain references that interface with a knowledge representation module. EDGE incrementally builds and executes plans. Before each tutor turn, the deliberative planner expands the current unfinished step with either a complex sub-plan or an elementary plan step. Elementary plan steps are then executed using simple template-driven generation. Thus, planning is delayed as much as possible so that the most current student model can be consulted. CIRCSIM/APE also incrementally constructs and executes plans, and uses simple template-driven generation for realising elementary plan steps. However, a major drawback of CIRCSIM/APE (for others, see Freedman, 2001) is that it embeds control in operators, unlike traditional planners, where control is separated
from action descriptions. This makes writing operators difficult and puts an additional burden on the planner.

Although previous and ongoing work in tutorial dialogue systems has striven to support unconstrained natural language input and multi-turn tutorial strategies, there remain limitations that must be overcome: teaching strategies, encoded as curriculum scripts, KCDs, or plan operators, are domain-specific; the purely plan-based systems embed control in plan operators or, necessarily, conflate planning with student modeling and maintenance of the dialogue context; and all current tutorial dialogue systems except EDGE mix high-level tutorial planning with low-level communication management. These limitations can make systems difficult to maintain, extend, or reuse. Looking at dialogue systems built by computational linguists and speech researchers, we see the opposite problem (Lewin, 1998; Allen, Byron, Dzikovska, Ferguson, Galescu, & Stent, 2000; Pieraccini, Levin, & Eckert, 1997; Larsson, Ljungloef, Cooper, Engdahl, & Ericsson, 2000; Rudnicky & Xu, 1999; Chu-Carroll, 1999). These systems do not allow for conversational moves extending over multiple turns and the resulting need to abandon, suspend, or modify these moves. However, these systems aim for dialogue strategies that are independent of dialogue context management and communication management concerns. These strategies contain no domain knowledge; they query domain reasoners to fill in necessary details. Furthermore, in systems explicitly performing dialogue planning, control is never embedded in plan operators. In BEETLE, we have combined these beneficial features (modularity and re-usability) with the flexibility and robustness seen in tutorial systems.

6. CONCLUSION

In the introduction we argued that natural language dialogue is a critical modality for intelligent information presentation in the tutoring domain. This modality allows for multi-turn remediation subdialogues where students are encouraged to solve problems and to detect and correct errors on their own. However, these subdialogues, while making this modality effective, also make it difficult to implement. Some current tutorial dialogue systems are able to support multi-turn remediation subdialogues; however, they do so by using carefully hand-crafted tutorial strategies that encode an initial question to ask, anticipated student responses, and associated tutorial goals. The architecture for BEETLE splits the tasks necessary for supporting multi-turn remediation subdialogues among a set of independent modules that coordinate through a shared knowledge source (the information state). All domain knowledge is stored in the domain reasoner; all communication-management knowledge is encoded in the update module. We are currently populating a database of domain-independent tutorial strategies; we will test their effectiveness and portability against current domain-dependent approaches.
7. NOTES
1. Rosé, Moore, VanLehn and Allbritton (2000) describe an experiment where students go through BE&E lessons and labs with the guidance of a human tutor. The video signal from the student's computer was split so that the tutor, who was hidden behind a partition, could watch the student's progress. The tutor and student were allowed to type messages to each other through a chat interface. We are using the logs of this chat interface, the BE&E dialogues, to identify teaching tactics to be used by our tutor, and we plan to use them to train our system. The BE&E dialogues are publicly available for download from our website (http://www.cogsci.ed.ac.uk/~jmoore/tutoring/dialogue-corpus.html) and also from the CIRCLE archives (http://www.pitt.edu/~circle/Archive.htm).
2. The task T5 will be popped from the agenda only when the tasks T51 and T52 are finished.
8. REFERENCES

Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A. (2000). An architecture for a generic dialogue shell. Natural Language Engineering 6.
Arkin, R.C. (1989). Towards the Unification of Navigational Planning and Reactive Control. Working Notes of the AAAI Spring Symposium on Robot Navigation, Stanford University, March 20–28.
Bloom, B.S. (1984). The 2 Sigma problem: The search for methods of group instruction as effective as one-to-one tutoring. Educational Researcher, 13, 4–16.
Bonasso, R.P., Firby, R.J., Gat, E., Kortenkamp, D., Miller, D.P., Slack, M.G. (1997). Experiences with an Architecture for Intelligent, Reactive Agents. Journal of Experimental & Theoretical Artificial Intelligence.
Brown, P., Levinson, S.C. (1987). Politeness: Some Universals in Language Usage. Cambridge University Press.
Cawsey, A. (1989). Explanatory dialogues. Interacting with Computers 1, 69–92.
Chi, M.T.H., Bassok, M., Lewis, M.W., Reimann, P., Glaser, R. (1989). Self-Explanations: How Students Study and Use Examples in Learning to Solve Problems. Cognitive Science, 13(2), 145–182.
Chi, M.T.H., de Leeuw, N., Chiu, M., LaVancher, C. (1994). Eliciting Self-Explanations Improves Understanding. Cognitive Science 18(3), 439–477.
Chi, M.T.H., Siler, S.A., Jeong, H., Yamauchi, T., Hausmann, R.G. (2001). Learning from human tutoring. Cognitive Science 25, 471–533.
Chu-Carroll, J. (1999). Form-based reasoning for mixed-initiative dialogue management in information-query systems. In: Proceedings of the 7th European Conference on Speech Communication and Technology (Eurospeech-99), 1519–1522.
Cohen, P.A., Kulik, J.A., Kulik, C. (1982). Educational Outcomes of Tutoring: A meta-analysis of findings. American Educational Research Journal 19, 237–248.
Currie, K., Tate, A. (1991). O-Plan: the open planning architecture. Artificial Intelligence 52, 49–86.
Firby, R.J. (1989). Adaptive Execution in Complex Dynamic Domains. PhD thesis, Yale University, Technical Report YALEU/CSD/RR #672.
Fox, B. (1993). The Human Tutorial Dialogue Project: Issues in the design of instructional systems. Lawrence Erlbaum Associates, Hillsdale, New Jersey.
Freedman, R. (2000). Using a Reactive Planner as the Basis for a Dialogue Agent. Proceedings of the Thirteenth Florida Artificial Intelligence Research Symposium (FLAIRS 2000), Orlando.
Freedman, R. (2001). An approach to increasing programming efficiency in plan-based dialogue systems. In Moore, J.D., Redfield, C.L., Johnson, W.L., eds.: 10th International Conference on Artificial Intelligence in Education, IOS Press.
Graesser, A.C., Person, N. (1994). Question Asking During Tutoring. American Educational Research Journal 31(1), 104–137.
Graesser, A.C., Wiemer-Hastings, K., Wiemer-Hastings, P., Kreuz, R., Tutoring Research Group, University of Memphis (1999). AutoTutor: A simulation of a human tutor. Cognitive Systems Research 1, 35–51.
Hume, G., Michael, J., Rovick, A., Evens, M. (1996). Hinting as a Tactic in One-on-One Tutoring. Journal of the Learning Sciences, 5(1), 23–47.
Jordan, P.W., Rosé, C., VanLehn, K. (2001). Tools for authoring tutorial dialogue knowledge. In Moore, J.D., Redfield, C.L., Johnson, W.L., eds.: 10th International Conference on Artificial Intelligence in Education, IOS Press, 222–233.
Khuwaja, R.A., Evens, M.W., Michael, J.A., Rovick, A.A. (1994). Architecture of the CIRCSIM-tutor (v.3). In Proceedings of the 7th Annual IEEE Computer-Based Medical Systems Symposium, IEEE Computer Society Press, 158–163.
Larsson, S., Traum, D. (2000). Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering 6, 323–340.
Larsson, S., Ljungloef, P., Cooper, R., Engdahl, E., Ericsson, S. (2000). GoDiS - an accommodating dialogue system. In: Proceedings of ANLP/NAACL-2000 Workshop on Conversational Systems.
Lepper, M.R., Chabay, R.W. (1988). Socializing the intelligent tutor: Bringing empathy to computer tutors. In Learning Issues for Intelligent Tutoring Systems, Mandl, H. & Lesgold, A. (editors). Springer, 114–137.
Lewin, I. (1998). Autoroute dialogue demonstrator. Technical Report CRC-073, SRI Cambridge.
Lewin, I., Rupp, C.J., Hieronymus, J., Milward, D., Larsson, S., & Berman, A. (2000). Siridus system architecture and interface report. Technical report, Siridus D6.1.
Littman, D., Pinto, J., Soloway, E. (1990). The knowledge required for tutorial planning: An empirical analysis. Interactive Learning Environments, 1, 124–151.
Matheson, C., Poesio, M., Traum, D. (2000). Modelling Grounding and Discourse Obligations Using Update Rules. In Proceedings of NAACL 2000, Seattle.
Martin, D.L., Cheyer, A.J., Moran, D.B. (1999). The open agent architecture: A framework for building distributed software systems. Applied Artificial Intelligence: An International Journal 13, 91–128.
McArthur, D., Stasz, C., Zmuidzinas, M. (1990). Tutoring Techniques in Algebra. Cognition and Instruction 7, 197–244.
Merrill, D.C., Reiser, B.J., Landes, S. (1992). Human tutoring: Pedagogical strategies and learning outcomes. Paper presented at the annual meeting of the American Educational Research Association, San Francisco.
Merrill, D.C., Reiser, B.J., Ranney, M., Trafton, G. (1992). Effective Tutoring Techniques: Comparison of human tutors and intelligent tutoring systems. Journal of the Learning Sciences 2(3), 277–305.
Miller, G. (1990). WordNet: An on-line lexical database. International Journal of Lexicography 3(4).
Munro, A. (1994). Authoring interactive graphical models. In: The Use of Computer Models for Explication, Analysis and Experimental Learning, edited by deJong, T., Towne, D.M., Spada, H., Springer.
Ohlsson, S., Rees, E. (1991). The function of conceptual understanding in the learning of arithmetic procedure. Cognition and Instruction, 8, 103–179.
Pieraccini, R., Levin, E., Eckert, W. (1997). AMICA: The AT&T mixed initiative conversational architecture. In: Proceedings of the 5th European Conference on Speech Communication and Technology (Eurospeech-97).
Porayska-Pomsta, K., Mellish, C., Pain, H. (2000). Pragmatic Analysis of Teachers' Language: Towards an Empirically Based Approach. In Proceedings of the AAAI Fall Symposium on Building Dialogue Systems for Tutorial Applications.
Rosé, C. (2000). A framework for robust semantic interpretation. In Proc. of the 1st Annual Meeting of the North American Chapter of the ACL (NAACL-00), Seattle.
Rosé, C.P., Moore, J.D., VanLehn, K., Allbritton, D. (2000). A Comparative Evaluation of Socratic versus Didactic Tutoring. University of Pittsburgh, LRDC-BEE-1.
Rudnicky, A., Xu, W. (1999). An agenda-based dialog management architecture for spoken language systems. In: Intl. Workshop on Automatic Speech Recognition and Understanding.
Schulze, K.G., Shelby, R.N., Treacy, D., Wintersgill, M.C., VanLehn, K., Gertner, A. (2000). Andes: A coached learning environment for classical Newtonian physics. The Journal of Electronic Publishing 6.
Sinclair, J.M., Coulthard, R.M. (1975). Towards an analysis of discourse: the English used by teachers and pupils. Oxford University Press.
VanLehn, K. (1990). Mind bugs: the origins of procedural misconceptions. MIT Press, Cambridge, Massachusetts.
W3C (1999). XSL Transformations (XSLT), Version 1.0. W3C Recommendation 16 November 1999, http://www.w3.org/TR/xslt
W3C (2002). Voice Extensible Markup Language (VoiceXML) Version 2.0. Working Draft, 24 April 2002, W3C Voice Browser Working Group. http://www.w3.org/TR/2002/WD-voicexml20-20020424/. Work in progress.
Young, R.M., Pollack, M.E., Moore, J.D. (1994). Decomposition and causality in partial order planning. In Proceedings of the Second International Conference on Artificial Intelligence and Planning Systems, pp. 188–193. Morgan Kaufmann.
Zinn, C., Moore, J.D., Core, M.G. (2002). A 3-tier Planning Architecture for Managing Tutorial Dialogue. Intelligent Tutoring Systems, 6th Intl. Conference (ITS-2002), Springer LNCS 2363, 574–584, Biarritz, France.
9. AFFILIATIONS

Claus Zinn, Johanna D. Moore, Mark G. Core
The University of Edinburgh, UK
VIRTUAL AND AUGMENTED REALITY

Nowadays virtual reality is a popular term designating a very active area of research. The origins of this area can be traced back to the long history of flight (and other) simulators, but the term Virtual Reality (VR) was originally coined by Jaron Lanier in 1983. The term was mainly linked to the development of specific artifacts such as the dataglove or head-mounted displays, and it has become very fertile with the extraordinary developments in 3D graphics. The idea of virtual reality lies in the relation between the perceptual experience and the possibility of intervening coherently in the created world. The experiencer feels immersed in the virtual reality, rather than external to some communicative artifact; he has the sense of being in the environment specified by the displays and perceives the objects of the virtual world as equally present. The subject can navigate and act in the environment, with position and motion tracking and realistic feedback on the part of the system. Augmented reality (AR) systems, on the other hand, generate a composite view for the user: a combination of the real scene perceived by the user and a virtual scene generated by the computer that augments the scene with additional information. While virtual reality is based on a totally immersive environment, an augmented reality system augments the real-world scene, and the virtual images are merged with the real view to create the augmented display. In augmented reality, registration of virtual objects with the real scene is essential for the user's correct perception of an integrated scene, and it becomes a particularly complex task as the user moves in the real environment.

In this cluster, Bell, Feiner and Höllerer describe a view-management mechanism that addresses dynamically changing visibility relationships among moving objects, and show how to apply it to 3D user interfaces. Examples from an implemented view-management testbed are used to demonstrate some of the different ways in which the system and its users can exert control over what is seen in augmented reality. These examples include automatically laid out annotations that provide desired information about the surrounding environment, and a hand-held tracked physical display and head-tracked virtual situation-awareness aid that respond to user control and interact with the annotated environment.

The work of Swartout and colleagues presents the Mission Rehearsal Exercise Project, which confronts a soldier trainee with the kinds of dilemmas he might reasonably encounter in a peacekeeping operation. The trainee is immersed in a synthetic world and interacts with virtual humans: artificially intelligent and graphically embodied conversational agents that understand and generate natural language, reason about world events, and respond appropriately to the trainee's actions or commands.

Finally, Weiss's chapter focuses on a variety of issues related to increasing accessibility to information via alternate forms of presentation, so that wider segments of the population, and especially people with disabilities, can share the benefits that modern presentation technologies can provide. In the chapter, examples of virtual reality in the treatment of patients with neurological impairments illustrate this important application of adapted presentation technology.
B. BELL, S. FEINER AND T. HÖLLERER
MAINTAINING VISIBILITY CONSTRAINTS FOR VIEW MANAGEMENT IN 3D USER INTERFACES
1. INTRODUCTION

A key part of user interface design involves determining the positions and sizes of the objects that the user sees, a task often known as layout. In 2D user interfaces, the geometry of 2D objects directly determines their geometry on 2D displays, and their assignment to layers directly determines which of a set of overlapping objects obscures the others. In contrast, in 3D user interfaces, 3D objects are viewed from one or more viewpoints, and projected onto 2D view planes that are mapped to 2D displays. Even in a static 3D scene, changing the viewpoint from which objects are seen can change the positions, sizes, and visibility relationships of their projections. Thus, the positions and sizes of the objects' 2D projections and their visibility relationships depend on both the objects and the viewpoints. We use the term view management to refer to the decisions that determine the spatial relationships among the projections to obey some desired set of constraints. For example, some objects may be sufficiently important to the tasks being performed that their visibility should be guaranteed, while others may need to be rendered at some minimum size or preferred aspect ratio to improve their legibility. Furthermore, there are many situations in which the viewpoint or the geometry of certain objects may be dynamic, yet externally determined, and relatively unpredictable. For example, the viewpoint may be constrained to that of a head-tracked display, and some of the objects may be controlled by a simulation or may be part of a changing physical world viewed through the see-through display of an augmented reality system.

In this chapter, we address how view management can be performed in interactive 3D user interfaces. We present our implementation of a view-management module that we have developed to explore real-time specification and maintenance of constraints on the projections of 3D objects on the 2D view plane. Our view-management module provides the ability to add user interface components that can interact with these constraints, to give developers the ability to add interactive controls that use view management. The interface for the view-management module is used to specify constraints on objects that are tagged to indicate their properties, such as position, scale, transparency, and inter-object spatial relationships that should be maintained. To solve 3D visibility constraints, we use interactive visible-surface determination algorithms to compute efficient approximations of the objects'
projections, based on upright 2D rectangular extents. To compute placements that avoid occlusion, we apply the same algorithms to determine those portions of the view plane where objects are not projected (i.e., the empty space). We use these representations to develop a variety of view-management strategies, including ones
Figure 1. View management in an augmented reality meeting scenario (imaged through a tracked, optical see-through, head-worn display).
that take into account decisions made during previous frames to minimize frame-to-frame visual discontinuities. Figure 1 demonstrates how we have applied our view-management module to a collaborative augmented reality environment (Arthur et al., 1998; Billinghurst et al., 1998; Butz et al., 1999; Szalavari et al., 1998). It is photographed through an optical see-through head-worn display, in which the user directly views the real world through optics, which also reflect graphics rendered on a miniature LCD. One user is sitting across from another, discussing a virtual model of the Columbia campus located between them. In response to a request from the user whose view is displayed, all virtual buildings have been annotated with their names. Each label is scaled within a user-selectable range and positioned automatically. A label is preferentially placed within a visible portion of its building's projection. If there is no space to accommodate a label legibly within that projection, it is instead placed near the projection, but not overlapping other buildings or annotations, and is
connected to its building by an arrow. Additional annotations in the scene include a meeting agenda (shown at the left), a virtual copy of a building that one of the users has created, and popup information related to that building (both shown at the right). The projections of these annotations avoid other objects, including the visible user’s head, to maintain a clear line of sight between users.
Figure 2. Screen shot of a 3D campus information visualization system.
Figure 2 shows a simple interactive 3D information visualization system that allows users to inquire about buildings on our campus. Visible buildings are initially labeled to show their identity, and to convey what is selectable. When a user selects a building, more information about it is displayed and placed close to the building’s visual projection with a leader line back to the building to convey the information’s relationship to it. A virtual avatar is shown in the model with a comic-strip speech balloon annotation, constrained at a fixed position relative to the projection of the avatar’s mouth. All annotations are laid out so they do not overlap each other or any of the buildings’ projections. Automated placement of these annotations by our view-management module ensures that important objects in the 3D scene remain unobscured when providing the user with useful information. Placement of the annotations in these examples is personalized for each individual view. The personalized annotations appear as local variations of each
view in the environment, similar to the approach used in other distributed graphics packages (Hesina et al., 2000; MacIntyre & Feiner, 1998).

In the remainder of this chapter, we describe related work in Section 2. Section 3 provides a detailed description of how we perform automated view management by collectively controlling the visibility and spatial layout of objects on the view plane, instead of indirectly modifying the properties of individual objects, as in typical 3D applications. Then, in Section 4, we extend this approach to address user-driven view management, showing how user-controlled physical objects, including a hand-held tracked physical display and a head-tracked virtual model, can participate in our automated view management system. Thus, we show how view management can be accomplished jointly by the system and user working together to determine what is seen. Finally, we present our conclusions and future work in Section 5.

2. RELATED WORK

Researchers have often labeled and annotated objects in 3D virtual and augmented environments without taking into account visibility constraints (Fairchild et al., 1988; Feiner, MacIntyre, Haupt et al., 1993; Mori et al., 1999; Starner et al., 1997; Sutherland, 1968; You et al., 1999). Static 3D illustrations have been annotated using a generate-and-test approach with some visibility calculations in non-real-time applications, with automatic control of placement coupled with manual corrections (Rist et al., 1994). 2D graph layout algorithms (Battista et al., 1999) and 2D label placement algorithms for point, line, and area features on maps and graphs (Christensen et al., 1995) are closely related to the view-management tasks in which we are interested. Approaches to labeling point features typically explore a set of candidate positions arranged in a circle around each feature. An objective function rates a solution's goodness, based on factors such as the overlap of labels with features and other labels, and the location of labels relative to their features. Christensen, Marks, and Shieber show that an exhaustive search approach to labeling point features is NP-hard. As described in Section 4, our work uses greedy, non-optimal algorithms for positioning objects (e.g., labels or other annotations) near points or areas. Our algorithms can efficiently determine the set of areas in which an object can be placed, while avoiding overlapping other objects and maintaining desired spatial relationships with selected objects.

It is also possible to avoid some view-management decisions by limiting the number of objects being displayed. For example, dynamic queries can allow users to filter information in a 2D database display (Ahlberg et al., 1992), while information about a user's objectives and proximity to other objects has been used to filter the 3D information displayed in augmented reality systems (Julier et al., 2000). While these techniques are quite useful, they do not address the need to display related objects near each other, to avoid occluding objects, or to ensure visibility for specified important objects.

A number of researchers have developed approaches for automatically avoiding 3D occlusion of selected objects. Some of these rely on controlling the viewpoint
(He et al., 1996; Phillips et al., 1992), which is not possible in head-tracked environments in which the viewpoint is attached to a real user's head. Others utilize illustrative effects, such as cut-away views and transparency (Feiner & Seligmann, 1992) or line style (Feiner, MacIntyre, & Seligmann, 1993; Kamada & Kawai, 1987), without altering object geometry. Visual access distortion (Carpendale et al., 1997) moves objects away from the user's line of sight to a selected focus point by displacing each object perpendicular to the line of sight by a function of the magnitude of the object's distance from the line of sight. However, this approach does not actually guarantee the visibility of an object at the focus point: the size of an object's projection is not taken into account when determining how far to move it (e.g., a large object may still occlude the focus point after the move), and the summed displacements for an object that lies between lines of sight to multiple focus points may not move the object in a useful way (e.g., the projection of an object that lies along the vector average of two lines of sight may not move). In contrast, our view-management approach can select positions for objects that guarantee that desired occlusion relationships with other objects are maintained.

3. OBJECT PROPERTIES AND CONSTRAINTS

All the objects in our scenes have properties, some of which may be marked as controllable or constrained by the user, the view-management module, a tracker, or other components (e.g., a simulation or user interface). This means that there can be multiple sources of change, and these sources can control or constrain different properties of the same object. The view-management module uses the controllable properties to help satisfy the constraints. We support a set of constraints that control the view-plane layout (currently visibility, position, size, and priority) and transparency of objects in a 3D scene. These constraints can be set explicitly by user interface components or by the user, or they can be defined to change based on other properties or relationships.

Visibility constraints specify occlusion relationships on the view plane: those objects that a given object should not occlude, and those objects that it is allowed to occlude. Examples of the use of visibility constraints illustrated in this chapter include:
- Pavement and grass patches in the campus scene can always be occluded by other objects.
- Labeled objects can be occluded by certain other objects at certain times. Objects can be overlaid by their own labels, but not by labels for other objects.
- The collaborating user's face may not be occluded by any other objects.

Position constraints specify the minimum and maximum distance to be maintained from:
- Other objects: For example, the "speech balloon" of Figure 2 is constrained to be above and near an area feature (the avatar's mouth).
- A point, area, or volume in a coordinate system, which may be screen-stabilized (relative to the user's head position/orientation) or world-stabilized (relative to the world).
- The viewpoint: For stereo viewing, this distance will control how much the user's eyes need to converge to fuse the two separate images together.

Size constraints specify a range of possible sizes. For example, labels are associated with a font size range, with preference towards the high end of the range when possible.

Priority constraints specify priorities for objects. This allows the system to determine the order in which objects are included in the image, so that less important objects can be elided if adding them will violate other constraints.

Transparency constraints specify a range of object transparency values. For example, transparency can be modified to minimize the consequences of occluding other objects.
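The constraint types listed above can be pictured as per-object annotations consumed by the view-management module. The following is a minimal sketch under assumed names; none of the identifiers come from the chapter.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class ViewConstraints:
    """Per-object constraints for a hypothetical view-management module (sketch)."""
    # Visibility: view-plane occlusion relationships.
    must_not_occlude: List[str] = field(default_factory=list)  # object ids this object may never cover
    may_occlude: List[str] = field(default_factory=list)       # object ids it is allowed to cover
    # Position: min/max view-plane distance from an anchor object, a region, or the viewpoint.
    near: Optional[str] = None            # anchor object id (e.g. the avatar's mouth)
    min_distance: float = 0.0
    max_distance: float = float("inf")
    # Size: allowed range, with a preference for the upper end (e.g. label font sizes).
    min_size: Tuple[int, int] = (0, 0)
    max_size: Tuple[int, int] = (10_000, 10_000)
    # Priority: lower-priority objects are elided first when constraints conflict.
    priority: int = 0
    # Transparency: allowed alpha range, e.g. to soften unavoidable occlusions.
    transparency_range: Tuple[float, float] = (0.0, 0.0)
```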
4. AUTOMATIC SYSTEM-DRIVEN VIEW MANAGEMENT
Figure 3. The view-management pipeline: (1) perform computations not dependent on visibility; (2) determine the viewing specification; (3) perform visibility and occlusion computations; (4) perform view-dependent computations; (5) perform animation/temporal continuity computations; (6) set properties controlled by view management.
Computing the visible portions of the projections of objects on the view plane is essential for satisfying most of the constraints described. Figure 3 shows a pipeline that specifies the order in which computations are done for view management. To precisely maintain the specified visual constraints, this pipeline should get executed for each frame rendered. However, since some of these computations can take a considerable amount of time to compute, asynchronous execution of the pipeline for real-time applications,
as in the Decoupled Simulation Model (Shaw et al., 1993), can preserve interactive rendering frame rates at the expense of view management accuracy. (These errors might be preferable depending on the user's preference or task.) In this section, we discuss each pipeline step and give examples of how it is performed in our implementation.
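Read as pseudocode, one synchronous pass over the pipeline of Figure 3 might look like the sketch below; the step functions and object names are placeholders, not the authors' API.

```python
def view_management_frame(scene, tracker, view_manager):
    """One synchronous pass over the Figure 3 pipeline (step numbers in comments)."""
    scene.update_application_state()                              # 1. computations not dependent on visibility
    view = tracker.read_viewing_specification()                   # 2. determine the viewing specification
    visibility = view_manager.compute_visibility(scene, view)     # 3. visibility and occlusion computations
    layout = view_manager.compute_layout(scene, visibility)       # 4. view-dependent computations
    layout = view_manager.apply_temporal_continuity(layout)       # 5. animation/temporal continuity
    view_manager.set_object_properties(scene, layout)             # 6. set controlled properties
    return layout
```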
4.1. Computations not Dependent on Visibility

Computations that are not dependent on the viewing specification can happen either at the beginning of the pipeline or asynchronously. Some applications, such as ones that use head-worn displays, must render graphics based on real-time tracker input. To minimize the system's response time (or error caused by asynchronous computation latency), we want to minimize the time between reading the tracker data (i.e., to determine the viewing specification in step 2) and setting object properties (step 6). To do this, we execute all computations that are not dependent on the viewing specification outside of these steps. These computations can include most types of logic in a graphics system, such as layout, content management, and interaction. Also, data structures that are needed farther down the pipeline can be initialized at this time.

4.2. Viewing Specification Determination
For each 3D object in the scene, we need to determine projection parameters that include viewpoint (two for stereo), orientation, and other viewing parameters (e.g., field of view and clipping planes). For a physical simulation, all objects in the same coordinate system will typically have the same relative viewing specification. Depending on the type of graphical application, determining the viewing specification for each frame can be different. We want to support applications in which the viewing specification changes interactively and unpredictably, as well as applications in which the view follows a predetermined path. Some real-time applications determine the user’s viewpoint and orientation by using input from 3D trackers, along with transformations based on the position and orientation of the physical trackers and the display relative to the user. Other real-time 3D applications are controlled using 2D direct manipulation devices, such as a mouse. Predetermined applications, such as video and animation playback, can be manually planned or use automated assistance (He et al., 1996; Phillips et al., 1992). Determining the viewing specification may involve screen-space computations to automatically choose camera placement. The situation-awareness aid we describe in Section 5 determines a viewing specification for a world in miniature that depends on the user’s head pitch, which is computed from the viewing specification for the physical world. In this augmented reality example, two viewing specifications, one for the physical world and one for the world in miniature, are computed from tracker input for rendering.
4.3. Visibility and Occlusion Computations
After determining the viewing specification, the constrained objects’ visibility, occlusion and depth values are computed to help satisfy the constraints discussed in Section 3. To satisfy these constraints efficiently, we use a 2D space representation (Bell & Feiner, 2000) to encode the visible portions of the projections of 3D objects on the view plane and a second instantiation of this 2D space representation to encode the complementary portions of the view plane on which no projections lie (i.e., the empty space). Each such representation of an arbitrary, possibly multifaceted, spatial region consists of a list of rectangles that are axis-aligned, are possibly overlapping, and are the largest such rectangles that lie wholly inside that region. For example, Figure 4(a) shows four images of a view plane populated by a single (dark grey) rectangle, which represents the projection of a single 3D object. Each image shows one of the four (light grey) rectangles that together comprise the representation of the remaining empty space. These four light grey rectangles are the largest axis-aligned rectangles that fit wholly within the portion of the view plane that does not include the dark grey rectangle; that is, none of these largest space rectangles can get any larger and still remain completely inside the region that it is supposed to represent. Suppose we wish to place an object within either region. If that object can be represented by an axis-aligned rectangle, it will fit within a region only if its rectangle fits wholly within one of the rectangles that make up that region’s representation. Furthermore, if the object can be translated, the union of all positions it can occupy, while wholly inside at least one of the region’s rectangles, represents all valid positions it can occupy within the entire region. Thus, by searching through the rectangles representing a region, it is possible to find placements inside that region that match specified containment constraints (Section 4.4). To satisfy some constraints, such as avoiding occlusion, we can search for a position for an object within the empty-space representation. For other constraints, we need to compute the visible-space representation of one object, or of a combination of objects. We have used two alternative visible-surface-determination algorithms to generate these representations: one based on a Binary Space Partitioning Tree, and one based on a hardware-accelerated z-buffer. We can use the Binary Space Partitioning (BSP) Tree algorithm (Fuchs et al., 1980) to determine the visibility order efficiently for an arbitrary projection of a scene. A BSP tree is a binary tree whose nodes typically represent actual polygons (or polygon fragments) in the scene. Because we need to determine the visibility order for objects, rather than for the polygons of which they are composed, our BSP tree nodes are defined by planes that separate objects, rather than planes that embed object polygons. We choose these planes using the heuristics of Thibault and Naylor (1987). Although BSP trees are often used for purely static scenes, dynamic objects can be handled efficiently by adding these objects to the tree last and removing and adding them each time they move (Chrysanthou & Slater, 1992). We traverse the BSP tree in front-to-back order relative to the viewpoint to find the visible portions of the scene’s objects.
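As a concrete illustration of the containment test described above, the following minimal sketch checks whether an axis-aligned rectangle of a given size fits inside a region represented by its list of largest rectangles, and returns one admissible placement. The Rect type and function name are assumptions for illustration, not the authors' implementation.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class Rect:
    x: float; y: float; w: float; h: float   # axis-aligned, origin at lower-left corner

def place_in_region(largest_rects: List[Rect], w: float, h: float) -> Optional[Rect]:
    """Return one position of a w x h rectangle lying wholly inside the region,
    or None if it fits inside none of the region's largest rectangles."""
    for r in largest_rects:
        if w <= r.w and h <= r.h:
            # Any offset that keeps the rectangle inside r is admissible; pick r's corner.
            return Rect(r.x, r.y, w, h)
    return None
```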
Figure 4. Adding a new object to the view-plane representation. (a) Original object (dark grey), shown separately with each of the four original largest empty spaces (light grey). (b) New object extent (medium grey) is added. (c) Resulting visible largest spaces of the new object extent (medium grey).
4.3.1. Visibility Determination using a Binary Space Partitioning (BSP) Tree
For each node obtained from the BSP tree in front-to-back order, the new node's upright extent is intersected with the members of the current list of largest empty-space rectangles. This can be accomplished efficiently because we maintain the largest empty-space rectangles in a 2D interval tree (Samet, 1990), allowing an efficient window query to determine the members that actually intersect the extent. (The intersection operation itself is trivial because all rectangles are axis-aligned.) The intersection yields a set of rectangles (not necessarily contiguous), some of which may be wholly contained within others; these subset rectangles are eliminated. The result is a set of largest rectangles whose union is the visible region of the new node. For example, Figure 4(b) shows the intersection of a medium grey rectangle (the upright extent of the second BSP tree node to be processed in this case) with the four light grey largest empty-space rectangles. The resulting three medium grey rectangles, shown in Figure 4(c), represent the visible portion of the original medium grey rectangle. Some objects are represented by a single BSP tree node, while others are split across nodes. (Objects may be split during BSP tree construction, or prior to BSP tree construction to better approximate a large object by the upright extents of a set of smaller objects.) For example, if a single object were split in two by a BSP tree plane, it would be represented by two sets of largest visible-space rectangles, one per node. Another rectangle being compared with the largest visible-space rectangles in both sets might fit in none of them, even though it might fit in one of the largest visible-space rectangles representing the original unsplit object, resulting in a false rejection. Therefore, the visible rectangles of all of an object's nodes must be coalesced to create the correct list of largest rectangles for the object (Bell et al., 2001).
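The per-node step just described can be sketched as follows: the node's upright extent is intersected with the current largest empty-space rectangles, and pieces subsumed by other pieces are discarded. The code is an illustrative approximation only; it replaces the interval-tree window query with a plain scan and redeclares a simple Rect type.

```python
from dataclasses import dataclass
from typing import List

@dataclass(frozen=True)
class Rect:
    x: float; y: float; w: float; h: float

def visible_rects_of_extent(extent: Rect, empty_rects: List[Rect]) -> List[Rect]:
    """Largest rectangles covering the visible part of a node's upright extent."""
    pieces: List[Rect] = []
    for e in empty_rects:
        # Axis-aligned intersection of the node's extent with one empty-space rectangle.
        x0, y0 = max(extent.x, e.x), max(extent.y, e.y)
        x1 = min(extent.x + extent.w, e.x + e.w)
        y1 = min(extent.y + extent.h, e.y + e.h)
        if x1 > x0 and y1 > y0:
            pieces.append(Rect(x0, y0, x1 - x0, y1 - y0))
    pieces = list(dict.fromkeys(pieces))   # drop exact duplicates before subsumption test

    def contains(a: Rect, b: Rect) -> bool:
        return (a.x <= b.x and a.y <= b.y and
                a.x + a.w >= b.x + b.w and a.y + a.h >= b.y + b.h)

    # Eliminate pieces wholly contained within another piece.
    return [p for p in pieces
            if not any(q is not p and contains(q, p) for q in pieces)]
```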
4.3.2. Visibility Determination using a Z-buffer
Alternatively, we can compute the 2D spatial representation of visible regions of objects in a 3D scene using a z-buffer algorithm (Catmull, 1974). By rendering each object in a distinct color with the current viewing specification, we can use the frame buffer as an object buffer (Atherton, 1981), in which each pixel's color encodes the object visible at that pixel. We use a sweep algorithm to convert the resulting pixel representation to the spatial representation needed to satisfy our constraints. Since it is likely that adjacent pixels will have the same value, we evaluate each pixel, starting at one corner of the buffer, across the largest dimension of the image, keeping track of the last position at which the pixel value has changed and the current object value. Once a change is detected or the end of the scan line is reached, a rectangle is created from the last to the current position (with unit height) and combined into the current object's visible-space representation. The current object and last position are then reset, and the algorithm continues. Since the only rectangles that can be adjacent to a rectangle on the current scan line are those combined from the previous scan line, a 2D interval tree of all rectangles from the previous scan line is kept for fast lookup. Any rectangle that is not a result of a combine operation from the last scan line is considered part of the final result and does not need further processing. The set of largest visible-space rectangles for each visible region consists of all resulting rectangles, including the rectangles that have been combined on the last scan line. Note that we must be careful to avoid the problems that can occur when two adjacent polygons in the same object fail to share a vertex on a shared edge, potentially causing missed pixels (Foley et al., 1996). This can result in a large area being improperly split into a set of smaller ones. To prevent this, we add the necessary extra vertices to polygons in our model.
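The scan-line conversion can be sketched as follows. The code walks a nested-list object buffer row by row and emits unit-height runs per object; combining runs from the previous row into larger rectangles (which the chapter does with a 2D interval tree) is omitted for brevity, and the names and input format are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Run = Tuple[int, int, int, int]  # (x, y, width, height), with height == 1

def object_buffer_to_runs(buffer: List[List[int]]) -> Dict[int, List[Run]]:
    """Convert an object buffer (one object id per pixel, rows assumed non-empty)
    into per-object unit-height rectangles, one per maximal run of equal ids."""
    runs: Dict[int, List[Run]] = defaultdict(list)
    for y, row in enumerate(buffer):
        start, current = 0, row[0]
        for x in range(1, len(row) + 1):
            value = row[x] if x < len(row) else None   # sentinel closes the last run
            if value != current:
                runs[current].append((start, y, x - start, 1))
                start, current = x, value
    return runs
```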
4.3.3. Performance and Accuracy

Since most current graphics hardware implements the z-buffer algorithm, the time needed to compute the pixel representation is low. However, the conversion to our representation is more computationally intensive. Generating the object buffer at a lower resolution than used for rendering the user interface accelerates the algorithm at the expense of accuracy. In the BSP tree approach, accuracy can also be traded for performance. Since objects are approximated by the upright rectangular extents of their BSP tree nodes, better approximations can be produced if more nodes are generated at the cost of increased computation.

4.4. View-Dependent Computations
The resulting visibility representations, consisting of the largest rectangles of single objects, multiple objects, and empty spaces (i.e., background), can support a variety of layout strategies. Queries on these visible regions can be designed for choosing appropriate positions and sizes for new objects within these areas.
Figure 5. Browsing an augmented reality environment (imaged through a tracked video see-through display).
For placing text, this is equivalent to map area feature labeling (Roessel, 1989) in a 3D environment in which objects can partially occlude each other; in this situation, we can place the label relative to the visible portion of the related object. The visibility representations also allow proximity queries that can constrain the resulting position of an object to be inside a region and closest to a point or another projected object. These queries are satisfied by searching the region’s list of largest visible rectangles to find a resulting rectangle that best fits the criteria needed, similar to the queries of (Bell & Feiner, 2000). We describe here some examples of strategies used for satisfying specific constraints for application tasks such as labeling, annotating, and placing other information. The results from these computations can be used to place objects directly or as input to animation computations (Section 4.5) that produce the final placement. There are many cases in 3D user interfaces in which the application needs to communicate information to the user. For example, Figure 5 shows an augmented reality environmental browsing mode in which the system tries to label a number of objects that have been determined to be of interest to the user, making sure that the
labels are not ambiguous. In this case, we use the two-tiered approach that was applied to Figures 1–2, creating both internal and external labels, and ensuring that no label occludes any part of an object except the object it labels. In contrast to Figure 1, Figure 5 is imaged using the video see-through display, in which video mixing hardware is used to combine a camera view of the real world with synthesized graphics.

4.4.1. Internal Label Placement
For each object to be labeled, we determine the largest visible rectangle internal to the object's projection that can contain the largest copy of the object's label, given a user-settable font and size range, and taking into account an additional buffer around the label to keep adjacent labels from appearing to merge. This buffer can also help control the amount of the related object's projection that is blocked by the label if the object has a high priority. For efficiency, we approximate the dynamic size of a label using its aspect ratio at a certain font size, texture map it to a polygon, and scale the polygon based on the result of the query.

4.4.2. External Label Placement
If no internal rectangle for an object is large enough, the object will not be labeled internally, but could instead be labeled externally. External labels can be processed in any desired order. We currently use the front-to-back order; however, since the allocation algorithm is greedy, if an importance metric were available for labels, we could sort on it instead. To lay out an external label, the system queries the empty-space representation for a rectangle that can contain the label within a user-settable range of sizes, and within a user-settable distance from the object. If the label is allocated within the empty space, it will neither be occluded by nor occlude any other object. If no such rectangle can be found, then the label is not allocated. Since external labels are potentially ambiguous, we also generate a leader-line arrow from the label to the interior of its object. Note that many classical map labeling algorithms base their label-placement quality estimate on whether the label can be visually identified easily with only a single feature (Imhof, 1975). Providing the leader line helps mitigate possible misidentifications. Because internal labels occlude parts of the object that they label, we also support tagging an object to indicate whether or not certain parts should be internally labeled (e.g., internal labels may be suppressed for faces). Each external label is added to the view-plane representation as it is created, so that objects added later do not occlude it.
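The two-tier decision just described (an internal label if one fits, otherwise an external label in nearby empty space, otherwise no label) can be sketched as follows, reusing the Rect class and place_in_region function from the earlier sketches. The helper names and the result format are hypothetical stand-ins for the chapter's representations.

```python
from typing import List, Optional

def place_label(obj_id: str, label_w: float, label_h: float,
                object_visible_rects: List["Rect"],
                empty_space_rects: List["Rect"]) -> Optional[dict]:
    """Greedy two-tier label placement for one object (illustrative sketch)."""
    # 1. Internal: try the largest visible rectangles of the object's own projection.
    spot = place_in_region(object_visible_rects, label_w, label_h)
    if spot is not None:
        return {"object": obj_id, "rect": spot, "internal": True}
    # 2. External: fall back to nearby empty space and draw a leader-line arrow.
    #    (A real query would also enforce a maximum distance from the object;
    #    only the fallback order is illustrated here.)
    spot = place_in_region(empty_space_rects, label_w, label_h)
    if spot is not None:
        return {"object": obj_id, "rect": spot, "internal": False}
    # 3. No room at an acceptable size: the label is not allocated this frame.
    return None
```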
Figure 6. Image taken directly through an optical see-through head-worn display of restaurant and related popup information. Note: Since the construction vehicle and cable spool on the right are not included in our environment model, the system does not constrain the popup to avoid overlapping them. The brightly lit spool appears to obstruct the virtual popup because it is much brighter than the image on the optical see-through display
4.4.3. Other Placements
Applications may need to show additional information about objects in the scene (e.g., Figures 1–2 and 5). This information can be placed external to related objects, much like external labels, possibly with different size and aspect ratio constraints. Some information, such as the agenda located on the left side of Figure 1, does not refer to any objects that are visible in the scene. To place the agenda on the screen so that it doesn’t overlap any important objects, the empty-space region is queried after all other annotations are placed. There are also situations, such as the augmented reality neighborhood restaurant guide of Figure 6, where there is not enough room to place the information. In this case, we detect the limited space by the negative results of external placement. Another query is used to place the popup information so that it avoids overlapping the restaurant, but could overlap other objects, such as the building above it.
4.5. Animation/Temporal Continuity Computations
Thus far, we have described our system without taking into account temporal continuity. However, if the layout of objects in sequential frames is computed independently, discontinuous changes in object position and size occur, which may be annoying in some applications. Therefore, in some real-time applications, we need to take into account decisions made during previous frames when determining current placement. First, we introduce hysteresis in the discrete state changes that can take place. We also use a technique that is similar to traditional keyframe animation (Lasseter, 1987): comparisons between the results of the discrete queries (Section 4.4) and additional queries specific to animation (Section 4.5.2) automatically determine the keyframes in real time for determining placement destinations, while interpolation computes the animation between the keyframe placements (the "inbetweening"). Instead of introducing movement, this technique results in the annotations being continuously rendered at the keyframes for positional stability, while inbetween frames are rendered minimally (only when movement between keyframes is necessary). Based on informal user feedback, when the viewing specification is controlled by a mouse, animation is typically not preferred: when the user stops controlling the view, the viewing specification stops changing, but the annotations would keep moving if animation were used. In contrast, when the viewing specification is controlled by a head tracker, it is always changing, if only slightly, potentially resulting in drastic changes in the results of our layout queries between frames. Thus, to avoid jitter and excessive movement of annotations, it is necessary to provide animation and temporal continuity.

4.5.1. State Hysteresis
There are several situations in which objects may change state, resulting in a discrete visual change, such as from an internal label to an external one, or from being displayed to not being displayed. Some user interfaces use three constraints on the size of an object being displayed: minimum size, maximum size, and preferred size (e.g., Java AWT (Gosling et al., 1996)). As shown in the state diagram of Figure 7, we borrow this approach, splitting the minimum size into an absolute minimum size (absMin) and a sustained minimum size (min) to enable state hysteresis and avoid oscillation between states at size boundary conditions. An object's visual size is guaranteed to reach at least the sustained minimum size at least once within a settable time interval n. The system displays the object only when there is enough space for the sustained minimum size, and removes it when it is below the absolute minimum size. Furthermore, if the object is already being displayed and there is only enough space for a size between the absolute and sustained minimum sizes for time > n, the object is removed. Otherwise, the object is displayed at the largest possible size no greater than its maximum size. The state diagram of Figure 7 employs similar tests to make an additional distinction between drawing an object inside (preferred) vs. outside another object, which we use for
displaying area feature labels. The values selected for n and the difference between the absolute minimum and sustained minimum sizes help avoid visual discontinuities.
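As a rough illustration only, the following sketch tracks the display state of a single object under these tests; the class, its fields, and the use of wall-clock time are invented for exposition and are not the system's implementation.

    # Rough sketch of the absMin/min display hysteresis; names are illustrative.
    import time

    class DisplayHysteresis:
        def __init__(self, abs_min, min_size, max_size, n):
            self.abs_min, self.min_size, self.max_size, self.n = abs_min, min_size, max_size, n
            self.visible = False
            self.below_min_since = None          # timer for the sustained-minimum test

        def update(self, available):
            """available: largest size the layout query can currently offer."""
            now = time.time()
            if not self.visible:
                if available >= self.min_size:   # promote only at the sustained minimum
                    self.visible = True
                    self.below_min_since = None
            elif available < self.abs_min:       # demote immediately below absMin
                self.visible = False
            elif available < self.min_size:      # between absMin and min: start the clock
                self.below_min_since = self.below_min_since or now
                if now - self.below_min_since > self.n:
                    self.visible = False
            else:
                self.below_min_since = None
            return self.visible, (min(available, self.max_size) if self.visible else 0.0)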
4.5.2. Positional Stability
To compute whether the current frame is a "keyframe" for a particular object, it is necessary to find out whether the object's position from Section 4.4, which is considered the best placement for the current frame, has moved from the object's position in the last computation. To do this, an additional operation, computed in pipeline step 4, needs to query for the closest layout to the previous layout. The same visible region is used for this query. For example, if the label was an inside label in the previous frame, the visible region of the object would be queried. However, since the annotation's related object could have moved because of view specification change (e.g., due to head movement), the last placement is not valid for finding the closest layout. Instead, we compute the change relative to the projection of the related object by transforming the last frame's placement from the 2D coordinate system of the projected object in the last frame, to the 2D coordinate system of the projected object in the current frame, where the centroid of each rectangle is its own origin.
Figure 7: State hysteresis in area feature labeling. Area features are preferentially labeled inside rather than outside. Labels are promoted from left to right through comparison with min and demoted from right to left through comparison with absMin
The result, transformed back into pixel coordinates, is used as the parameter to query for the closest relative placement from the previous frame inside the current frame. This result is compared with the best layout and is used if both results are the
same within a small threshold; if they are different, we start a timer and, if the best and closest fail to coincide after a set amount of time, we use the best position as a keyframe and reset the timer.
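One plausible reading of this test, sketched below with invented names and a wall-clock timer, compares the best placement with the placement closest to the previous frame and promotes the best placement to a keyframe only when the two agree or the disagreement persists.

    # Sketch of the keyframe test from Section 4.5.2; names, the timer, and the
    # behavior during the waiting period are a plausible reading, not the
    # authors' actual code.
    import time

    class KeyframeChooser:
        def __init__(self, threshold_px, timeout_s):
            self.threshold, self.timeout = threshold_px, timeout_s
            self.mismatch_since = None

        def choose(self, best, closest):
            """best/closest are (x, y) placements in pixel coordinates."""
            dx, dy = best[0] - closest[0], best[1] - closest[1]
            if (dx * dx + dy * dy) ** 0.5 <= self.threshold:
                self.mismatch_since = None
                return best                       # best and closest agree: keyframe at best
            now = time.time()
            self.mismatch_since = self.mismatch_since or now
            if now - self.mismatch_since > self.timeout:
                self.mismatch_since = None
                return best                       # disagreement persisted: jump to best
            return closest                        # otherwise stay near the previous placement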
4.5.3. Interpolation
Once a keyframe is chosen, the current frame is determined by interpolating the layout parameters from the previous frame to the keyframe. We use linear interpolation for both position and scale, based on the time difference since the last frame. When a label changes between internal and external, we also grow or shrink its leader-line arrow.
4.6. Set Controlled Object Properties
When all view management pipeline computations have completed, the properties of the rendered objects must be updated. Although the pipeline can run asynchronously relative to rendering, as described at the beginning of Section 4, inconsistent results can occur if the rendering properties for an object are not set before it is rendered. Therefore, all controlled object properties must be set before rendering begins, assuming that any object can affect another object's projection.
5. USER-DRIVEN VIEW MANAGEMENT FOR AUGMENTED REALITY
In many augmented reality applications, we would like to assist the user in interacting with and obtaining information about the physical environment.
Figure 8: Tracked hand-held display takes priority over labels.
In our testbed, users can control annotations and other user interface components by indirect manipulation; a mouse or other wireless device controls a menu to select filtering modes or manipulate object properties and constraints. Selecting a label reveals more information, which can include buttons or other widgets that offer further options for obtaining information.
5.1. Physical Manipulation
Figure 8 shows a screen shot of a user interactively controlling a six-degree-of-freedom-tracked, hand-held computer, whose screen we would not like to have obscured by labels. To accomplish this, the screen is defined as having a higher priority than any of the objects in Figure 5, and no associated label. Consequently, since much of "Table Davinci" is obscured by the hand-held computer in Figure 8 relative to Figure 5, the "Table Davinci" label changes from a large internal label to a smaller external one; the "File Cabinet" and "Ascension Transmitter" labels and arrows move up and down, respectively, out of the hand-held computer's way; and the "Thin Table 1" label disappears.
Figure 9. Annotated situation-awareness aid. (a) Looking slightly upward. (b) Looking slightly downward. (c) Looking nearly straight down.
Imagine replacing the tracked hand-held computer with a transparent tracked frame. Instead of having the system show labels for all objects that are not obscured by the hand-held computer, the system could act like a 3D magic lens (Viega et al., 1996), showing labels only for objects that fall within the frame.
5.2. Controlling a Situation Awareness Aid
We have developed a situation-awareness aid, based on a world in miniature (Stoakley et al., 1995), that provides the user with an overview of the surrounding environment and the ability to discover, select, and inquire about objects that may not be directly visible within that environment. The user's head pitch controls the aid's position, scale, and orientation. This makes user-initiated view-management control possible: by using head movement, users can easily change their focus of attention between the real world and the virtual aid.
Figure 10. Outdoor version of the annotated situation-awareness aid.
The aid’s yaw (y-axis orientation) is constrained to be aligned with the environment, as in Darken & Sibert (1993). Figure 9 shows screen shots of the aid in our laboratory environment. In Figure 9(a), the user looks slightly upward and the unannotated aid stays close to the bottom of the viewport, pitched towards the view plane by a default angle (22.5˚) about the user’s position in the aid. In Figure 9 (b), the user looks slightly downward and the aid is scaled to be slightly larger, angled slightly closer to parallel to the view plane, and translated up slightly higher
in the viewport. In Figure 9(c), the user looks nearly straight down, and in response receives a zoomed-in, annotated view that is nearly parallel to the view plane, with the user's position in the aid located at the center of the viewport. Figure 10 shows a version of the aid in our outdoor augmented reality system that uses both texture-mapped aerial photography and 3D models of individual buildings. The roll component of the aid's orientation is kept at zero, so that its ground plane is always parallel to the bottom of the viewport, but not necessarily parallel to the ground plane in the physical world. Note that the only effect of the user's position on the aid is to control the "you are here" marker, rendered as a red sphere, about which the aid orients (in yaw and pitch). This marker is positioned at the center of the viewport's width and at a head-pitch–determined location relative to the viewport's height. To obtain the desired functionality, the user's head pitch is used to determine three parameters of the aid: pitch, scale, and y position in the viewport. Each parameter has a set of four values: min and max values and two head pitch trigger values. As described by Bell et al. (2002), we linearly interpolate each dependent variable from min to max between the head pitch trigger values. No parameter is influenced above zero degree head pitch. That means that as long as the user looks straight ahead or up, the aid occupies the smallest possible amount of screen space allowed by the parameters and stays near the bottom of the viewport. Both the scale and the vertical position of the aid are set to reach their maximum values when the user looks down at a 45˚ angle.
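For illustration, the mapping from head pitch to the aid's parameters might be organized as follows; only the zero-degree and 45-degree triggers come from the text, and all other values and names are invented.

    # Sketch of the head-pitch mapping for the situation-awareness aid.
    # Negative pitch means looking down; parameter values are illustrative.
    def lerp_param(head_pitch_deg, trigger_lo, trigger_hi, val_min, val_max):
        """Linearly interpolate a parameter between two head-pitch trigger values."""
        if head_pitch_deg >= trigger_lo:          # looking level or up: minimum value
            return val_min
        if head_pitch_deg <= trigger_hi:          # at or past the lower trigger: maximum
            return val_max
        t = (trigger_lo - head_pitch_deg) / (trigger_lo - trigger_hi)
        return val_min + t * (val_max - val_min)

    def aid_parameters(head_pitch_deg):
        # pitch of the aid, scale, and y position in the viewport
        pitch = lerp_param(head_pitch_deg, 0.0, -45.0, 22.5, 90.0)
        scale = lerp_param(head_pitch_deg, 0.0, -45.0, 0.3, 1.0)
        y_pos = lerp_param(head_pitch_deg, 0.0, -45.0, 0.1, 0.5)
        return pitch, scale, y_pos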
5.3. Annotation Assistance
The aid offers another set of capabilities for giving the user more information and interaction. In Figure 9(c), labels are placed relative to the projections of the virtual objects in the aid in the same way we annotate the physical objects. A threshold value (currently set at looking down 30°) determines when the labels change between annotating the real world and labeling the aid. Popup annotations behave in exactly the same way. Figure 11(a) shows an annotation reporting on a computer's CPU and memory usage, placed relative to the projection of its associated computer's physical monitor, while avoiding other important objects, including other physical objects, labels, and the aid. As the user looks down at the aid, the annotated physical monitor is no longer visible, but the popup remains displayed. As the aid scales up, its labels appear, and the popup's arrow points to the monitor's virtual representation in the aid, shown in Figure 11(b). As the user looks up, the popup is dissociated from the aid and recaptured by the physical monitor in the full-scale physical environment. Thus, popup annotations can move freely between the physical world and its scaled virtual representation.
Figure 11. Transfer of annotations between the physical world and virtual aid. (a) Popup annotation provides information about the computer monitor to which its arrow points at the center. (b) As user looks down, annotations are enabled in the aid, and the popup’s arrow now points at the monitor’s representation in the aid.
Furthermore, virtual objects in the aid can be selected to view information about physical objects that might not be visible in the real world from where the user is currently positioned.
6. CONCLUSIONS
We have presented a framework for view management in virtual and augmented reality applications. While many of the underlying algorithms we described may
change with further development and refinement as we improve performance, add possible constraints, and increase functionality, we believe that the basic concept of view management is fundamental to creating interactive 3D applications: placing and automatically maintaining visual constraints on the visibility and spatial layout of the projections of 3D objects. Many current applications can benefit from using view management techniques, ranging from scientific visualization to entertainment; for example, view management would be particularly applicable to labeling characters in 3D multiplayer games. Although augmented reality has attracted increasing interest from academic and industrial researchers over the past decade, head-tracked, head-worn systems are at present still research prototypes. For augmented reality applications to flourish, advances will be required in hardware technology (displays and tracking), user interfaces, and societal acceptance (Azuma et al., 2001). The techniques presented in this chapter contribute to the user interface, addressing both information presentation and user interaction. We are extending our work in three major directions. First, we have experienced some situations in which the current layout strategies induce excessive movement of annotations, such as when the user's head moves quickly, or when the system annotates small objects. To cope with this problem, we are developing strategies in which annotations are placed in 3D but dynamically take into account visibility relationships, though less aggressively than our current techniques. One variant, similar to "billboarding," would always be oriented facing the user. Second, we plan to do usability studies to compare the effectiveness of these different techniques in a variety of tasks. Using eye tracking to determine where the users' eyes move relative to the 3D scene could help measure the effectiveness of these techniques (Tanriverdi & Jacob, 2000). Finally, we note that all our constraints are currently placed "by hand," which is quite tedious. To address this limitation, we are developing a knowledge-based infrastructure to allow us to control programmatically how constraints are used. This will make it easier to handle more complex situations, including ones in which the constraints are inferred by rules.
7. ACKNOWLEDGEMENTS
This work was supported by ONR Contracts N00014-99-1-0249, N00014-99-1-0394, and N00014-99-1-0683, NSF Grants IIS-00-82961 and 0121239, NLM Contract R01 LM06593-01, and gifts from Intel, Microsoft, and Mitsubishi Electric Research Labs. Blaine Bell was supported by an IBM Graduate Research Fellowship. Ryuji Yamamoto created the campus model shown in Figures 1–2.
8. REFERENCES
Ahlberg, C., Williamson, C., & Shneiderman, B. (1992). Dynamic Queries for Information Exploration: An Implementation and Evaluation. Proc. ACM CHI '92.
Arthur, K., Preston, T., Taylor, R., II, Brooks, F., Jr., Whitton, M., & Wright, W. (1998). Designing and Building the PIT: A Head Tracked Stereo Workspace for Two Users. In Second Int. Immersive Projection Technology Workshop. Ames, IA.
Atherton, P. R. (1981). A Method of Interactive Visualization of CAD Surface Models on a Color Video Display. Proc. SIGGRAPH '81.
Azuma, R., Baillot, Y., Behringer, R., Feiner, S., Julier, S., & MacIntyre, B. (2001). Recent Advances in Augmented Reality. IEEE Computer Graphics and Applications, 21, 34-47.
Battista, G. D., Eades, P., Tamassia, R., & Tollis, I. (1999). Graph Drawing: Algorithms for the Visualization of Graphs. USA: Prentice Hall.
Bell, B., & Feiner, S. (2000, November 5-8). Dynamic Space Management for User Interfaces. Proc. UIST '00, San Diego, CA.
Bell, B., Feiner, S., & Höllerer, T. (2001, November 11-14). View Management for Virtual and Augmented Reality. Proc. UIST '01, Orlando, FL.
Bell, B., Höllerer, T., & Feiner, S. (2002, October 27-30). An Annotated Situation-Awareness Aid for Augmented Reality. Proc. UIST '02, Paris, France.
Billinghurst, M., Weghorst, S., & Furness, T., III. (1998). Shared Space: An Augmented Reality Approach for Computer Supported Collaborative Work. Virtual Reality, 3(1), 25-36.
Butz, A., Höllerer, T., Feiner, S., MacIntyre, B., & Beshers, C. (1999). Enveloping Users and Computers in a Collaborative 3D Augmented Reality. In Proc. IWAR '99 (Int. Workshop on Augmented Reality) (pp. 35-44). San Francisco, CA.
Carpendale, M. S. T., Cowperthwaite, D. J., & Fracchia, F. D. (1997). Extending Distortion Viewing from 2D to 3D. IEEE Computer Graphics and Applications, 17(4), 42-51.
Catmull, E. (1974). A Subdivision Algorithm for Computer Display of Curved Surfaces. Ph.D. Thesis, Report UTEC-CSc-74-133, Computer Science Department, University of Utah, Salt Lake City, UT.
Christensen, J., Marks, J., & Shieber, S. (1995). An Empirical Study of Algorithms for Point-Feature Label Placement. ACM Transactions on Graphics, 14(3), 203-232.
Chrysanthou, Y., & Slater, M. (1992). Computing dynamic changes to BSP trees. In Computer Graphics Forum (Proc. Eurographics '92) (Vol. 11, pp. 321-332).
Darken, R. P., & Sibert, J. L. (1993, November). A Toolset for Navigation in Virtual Environments. Proc. UIST '93, New York, NY.
Fairchild, K. M., Poltrock, S. E., & Furnas, G. W. (1988, May). SemNet: Three-Dimensional Graphic Representation of Large Knowledge Bases. Cognitive Science and its Applications for Human-Computer Interaction, Hillsdale, New Jersey, U.S.A.
Feiner, S., MacIntyre, B., Haupt, M., & Solomon, E. (1993). Windows on the World: 2D Windows for 3D Augmented Reality. In Proc. UIST '93 (ACM Symp. on User Interface Software and Technology) (pp. 145-155). Atlanta, GA.
Feiner, S., MacIntyre, B., & Seligmann, D. (1993). Knowledge-based augmented reality. Communications of the ACM, 36(7), 52-62.
Feiner, S., & Seligmann, D. (1992). Cutaways and Ghosting: Satisfying Visibility Constraints in Dynamic 3D Illustrations. The Visual Computer, 8(5-6), 292-302.
Foley, J., van Dam, A., Feiner, S., & Hughes, J. (1996). Computer Graphics: Principles and Practice, 2nd Ed. in C. Reading, MA: Addison-Wesley.
Fuchs, H., Kedem, Z., & Naylor, B. (1980, July 14-18). On visible surface generation by a priori tree structures. SIGGRAPH '80, Seattle, WA.
Gosling, J., Joy, B., & Steele, G. (1996). The Java Language Specification (1st ed.). Reading, Massachusetts: Addison-Wesley.
He, L., Cohen, M., & Salesin, D. (1996). The virtual cinematographer: A paradigm for automatic real-time camera control and direction. In Proc. SIGGRAPH '96 (pp. 217-224). New Orleans, LA.
Hesina, G., Schmalstieg, D., Fuhrmann, A., & Purgathofer, W. (2000, December 20-22). Distributed Open Inventor: A Practical Approach to Distributed 3D Graphics. Proc. VRST '99, N.Y.
Imhof, E. (1975). Positioning names on maps. The American Cartographer, 2(2), 128-144.
Julier, S., Lanzagorta, M., Baillot, Y., Rosenblum, L., Feiner, S., Höllerer, T., et al. (2000). Information Filtering for Mobile Augmented Reality. In Proc. ISAR '00 (Int. Symposium on Augmented Reality) (pp. 3-11). Munich, Germany.
Kamada, T., & Kawai, S. (1987). An enhanced treatment of hidden lines. ACM Transactions on Graphics, 6(4), 308-323.
Lasseter, J. (1987). Principles of traditional animation applied to 3D computer animation. Proc. SIGGRAPH '87.
MacIntyre, B., & Feiner, S. (1998). A Distributed 3D Graphics Library. In Computer Graphics (Proc. ACM SIGGRAPH '98) (pp. 361-370). Orlando, FL.
Mori, T., Koiso, K., & Tanaka, K. (1999). Spatial Data Presentation by LOD Control Based on Distance, Orientation, and Differentiation. In Proc. of Int'l Workshop on Urban Multi-Media/3D Mapping (UM3 '99) (pp. 49-56). Tokyo, Japan.
Phillips, C. B., Badler, N. I., & Granieri, J. (1992, March-April). Automatic Viewing Control for 3D Direct Manipulation. Proc. SIGGRAPH '92, Cambridge, MA.
Rist, T., Krüger, A., Schneider, G., & Zimmermann, D. (1994). AWI: A workbench for semi-automated illustration design. Proceedings of the Workshop on Advanced Visual Interfaces, Bari, Italy.
Roessel, J. (1989). An Algorithm for Locating Candidate Labeling Boxes within a Polygon. The American Cartographer, 16(3), 201-209.
Samet, H. (1990). The Design and Analysis of Spatial Data Structures. Reading, MA: Addison-Wesley.
Shaw, C., Green, M., Liang, J., & Sun, Y. (1993). Decoupled simulation in virtual reality with the MR toolkit. ACM Transactions on Information Systems (TOIS), 11(3), 287-317.
Shneiderman, B. (1983). Direct manipulation: A step beyond programming languages. IEEE Computer, 16(8), 57-69.
Starner, T., Mann, S., Rhodes, B., Levine, J., Healey, J., Kirsch, D., et al. (1997). Augmented Reality through Wearable Computing. Presence, 6(4), 386-398.
Stoakley, R., Conway, M., & Pausch, R. (1995, May 7-11). Virtual Reality on a WIM: Interactive Worlds in Miniature. Proc. CHI '95, Denver, CO.
Sutherland, I. (1968). A Head-Mounted Three Dimensional Display. Proc. FJCC 1968, Washington, DC.
Szalavari, Z., Schmalstieg, D., Fuhrmann, A., & Gervautz, M. (1998). Studierstube: An Environment for Collaboration in Augmented Reality. Virtual Reality, 3(1), 37-48.
Tanriverdi, V., & Jacob, R. (2000). Interacting with Eye Movements in Virtual Environments. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, The Hague, The Netherlands.
Thibault, W. C., & Naylor, B. F. (1987). Set operations on polyhedra using binary space partitioning trees. Proc. SIGGRAPH '87.
Viega, J., Conway, M. J., Williams, G., & Pausch, R. (1996). 3D Magic Lenses. UIST '96, Seattle, WA.
You, S., Neumann, U., & Azuma, R. (1999). Hybrid Inertial and Vision Tracking for Augmented Reality Registration. In Proc. IEEE Virtual Reality '99 (pp. 260-267). Houston, TX.
9. AFFILIATIONS
B. Bell, Lawrence Berkeley National Laboratory, Berkeley, USA
S. Feiner, Columbia University, New York, USA
T. Höllerer, Universität Trier, Germany
W. SWARTOUT, J. GRATCH, R. HILL, E. HOVY, R. LINDHEIM, S. MARSELLA, J. RICKEL AND D. TRAUM
SIMULATION MEETS HOLLYWOOD: Integrating Graphics, Sound, Story and Character for Immersive Simulation
1. INTRODUCTION
For many researchers, software integration is often regarded as a kind of necessary evil – something that must be done to make sure that all the research components of a large system fit together and interoperate properly – but not as something that is likely to contribute new research insights or suggest new solutions. Our work on constructing virtual humans to interact with people in virtual environments has involved large-scale integration of a number of software technologies that support the simulation of human behaviors, ranging from speech recognition and dialogue management through task reasoning, gesture generation and emotion modeling. In addition, because we use virtual humans in training simulations, the characters behave in the context of a scenario, so another aspect of integration has been to bring together story content with virtual human behavior. In integrating these various components and content, we have been surprised to find that the conventional wisdom about integration does not hold: the integration process has raised new research issues and at the same time has suggested new approaches to long-standing issues. This paper describes how that has taken place and our discoveries. We begin with a brief description of the background behind our work in training and the approach we have taken to improving training. We then describe the technology components we have developed, the system architecture we use, and we conclude with the insights we have gained from the integration process.
1.1. Background
How can training simulations be made more effective? An important insight in answering that question is to recognize that effective training depends both on the technology that is used to present the material and the content of the material itself. The Institute for Creative Technologies (ICT) was created at the University of Southern California with the goal of bringing together researchers in simulation technology to collaborate with people from the entertainment industry. The idea was that if those who understood how to create high resolution graphics, immersive
sound, and believable virtual humans worked together with those who understood how to create compelling stories and characters, a synergy would emerge that would allow them to create much more compelling simulation experiences. Although the ICT has only been in existence for a short time, we are already beginning to see some of the results of this collaboration. These are reflected both in the kinds of projects that the ICT takes on and the approach that we take to implementing systems. While most military simulations involve simulating a vehicle such as a tank, an airplane or a helicopter, ICT's simulations put trainees into a human-oriented simulation, where they interact with real and virtual (computer-generated) humans. While scenarios in most military simulations tend to proceed in a straightforward fashion, our scenarios engage the trainee with plot twists, turns and surprises, much like one might find in a good Hollywood script. In constructing our simulations we have used a hybrid approach, mixing different techniques and technologies to produce the best overall effect. In that way, we are following Hollywood film production techniques, where what appears as a single seamless scene in film may actually be the result of integrating a large number of disparate elements produced using filmed live action, computer-generated imagery, and models.
Figure 1: The Mission Rehearsal Exercise System, showing from the left, the platoon sergeant, the injured boy and his mother, a medic, and a crowd.
One of the ICT's projects that illustrates these ideas well is the Mission Rehearsal Exercise (MRE) project. Since the end of the Cold War, the kinds of operations that the US military is involved with have expanded greatly. The need for peacekeeping and nation-building operations has grown, and humanitarian efforts such as disaster relief are common. One of the hallmarks of these operations is that they frequently involve close interactions between the military and the local civilian populace. To function effectively and avoid misunderstandings that could have unintended consequences, it is important that soldiers understand the customs,
norms, habits and taboos of the local population and they need to be exposed to the thorny dilemmas and decisions that may await them. The Mission Rehearsal Exercise system, shown in Figure 1, is designed to provide that kind of experience in simulation, before trainees encounter it in reality. Presented on a 30-foot by 8-foot curved screen, the MRE system places the trainee in a location. The trainee interacts with life-sized virtual humans that can play the role of local civilians, friendly forces and hostile forces. A 10.2 sound system (10 channels of audio, 2 subwoofer channels) enhances the immersive effect. The scenario we are currently using is situated in a small town in Bosnia. It opens with a lieutenant (the trainee) in his Humvee. Over the radio, he gets orders to proceed to a rendezvous point to meet up with his soldiers to plan a mission to assist in quelling a civil disturbance. When he arrives at the rendezvous point, he discovers a surprise. One of his platoon's Humvees has been involved in an accident with a civilian car. There's a small boy on the ground with serious injuries, a frantic mother, and a crowd is starting to form. A TV camera crew shows up and starts taping. What should the lieutenant do? Should he stop and render aid? Or should he continue on with his mission? Depending on the decisions he makes, different outcomes will occur. The initial version of the Mission Rehearsal Exercise system was first shown in September 2000. Since then, the MRE project has been actively engaged in research to improve the MRE system and make it more interactive.
2. MRE ARCHITECTURE
Figure 2: MRE System Architecture
The MRE architecture, illustrated in Figure 2, supports the flexible integration of a number of components, including visualization components (such as graphics and audio processing), interface components (such as voice input) and behavioral components (such as virtual humans and the scenario manager).
Components are linked through a messaging and notification service (the communication bus). Here we consider the communication services, graphics and animation, audio processing, and some of the behavior modeling. The details of the virtual human architecture are discussed in the following section.
2.1. Communication Services
Components in the MRE system communicate primarily through a common communications bus, implemented through a notification and messaging service called Elvin that enables efficient inter-process and cross-platform communication (elvin.dstc.edu.au). Elvin uses a client-server architecture and allows messages to be filtered based on their content to reduce network load. Components send all their messages to the server, and messages are routed to individual components if they have registered interest in the specific message type. Message content is formatted as simple text strings or XML, facilitating the easy creation of new message types or formats. Two communication pathways bypass Elvin for efficiency purposes. There is a dedicated communication link between the Animation System and the audio system to mitigate latencies and, similarly, a dedicated link between the text-to-speech engine and the character gesture manager (BEAT).
2.2. Graphics and Animation
The graphics and animation system, DIMR, provides a set of core services for visualizing activities in the virtual world. DIMR uses two commercial products, VegaTM and PeopleShopTM, to animate the virtual world. VegaTM renders the environment and the special effects. The environment includes the buildings, roads, trees, vehicles, and so on, while the special effects include explosions and the dynamic motion of objects like cars and helicopters. The PeopleShopTM Embedded Runtime System (PSERT) is integrated with VegaTM and provides the animation of the characters' bodies. A 3D model of a Balkan village was developed to fit the types of scenarios we had in mind. Texture mapped surfaces were applied to the buildings, vehicles, and characters to give them a more authentic look and feel. Boston Dynamics Incorporated (BDI), the developers of PeopleShopTM, extended their virtual character bodies in several ways to suit our needs. First, they integrated expressive faces (developed by Haptek Incorporated) to support lip synchronization and facial expressions. Second, while the basic PeopleShopTM software primarily supports dynamic sequencing of primitive motion fragments, BDI combined their motion-capture approach with procedural animation to provide more flexibility, primarily in the areas of gaze and arm gestures (Marsella, Gratch & Rickel, 2003). Finally, we wanted more variety when it came to the character bodies, so BDI created a suite of new character bodies and behaviors. The new character bodies included a Balkan woman (to play the mother), a child, a man holding a TV news camera, civilian characters for the crowd, an Army medic, and an Army sergeant.
2.3. Audio Processing
In our current scenario, the scene begins with the lieutenant driving up to the village in an Army vehicle known as a Humvee. As the vehicle drives into town and turns a corner, our view out the front windshield and side windows allows us to see the road, buildings, and trees. We perceive the bumps in the road as a jiggle in the scene, and the vehicle appears to change velocity as the gears are shifted. While the visual aspects of the scene give the viewer a sense of being there in that village, the audio system provides a critical dimension to the experience. The distinctive roar of the Humvee's diesel engine, the creaks, rattles, and bumps from the bouncy ride, and the post-ignition knock when the engine shuts off are all synchronized with visual effects. When the lieutenant steps out of the Humvee, one can immediately hear the murmur of a crowd of people speaking in Serbo-Croatian, gathered near the accident site. When the medevac helicopter flies overhead, the room literally vibrates with the sound of a Blackhawk helicopter. To address the problem of matching picture with sound spatially, a novel multichannel audio system was developed (Kyriakakis, 1998). This system uses 10 channels of audio and 2 subwoofer channels. Speakers are not only arranged in a radial pattern around the participant, similar to conventional surround sound, but in addition, there is a vertical displacement between speakers. This allows sound to be spatialized in both horizontal and vertical dimensions, creating, in effect, a hemisphere of sound around the audience. This means that the sound of a helicopter flyover will be perceived as coming from overhead, making the sonic experience much more convincing.
2.4. Behavior Modeling
Some of the behaviors in the MRE system are autonomous, but others are specified in advance based on the story that a writer develops. Generally, the behaviors of the major characters in the simulation are autonomous, while physical events (e.g. explosions) and minor characters are scripted. The scenario manager component triggers scripted behaviors to shape the experience for the trainee and create the dilemmas he must solve. Scripted behaviors can be triggered either autonomously, through a set of simple production rules, or by a human exercise controller. There are currently two classes of agents playing the character roles in the MRE system: scripted characters and virtual humans controlled by AI. The scripted (minor) characters come packaged with PeopleShopTM. They can be scripted to perform specific actions, such as running along a pre-specified path or securing a perimeter, and this behavior can be triggered by the scenario manager or a virtual human. The scripted characters do not perceive anything in the world — their behaviors are generated by playing motion capture sequences. Virtual humans, the major characters, are autonomous and their structure is described next.
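Before turning to the virtual humans, the flavor of such a trigger can be suggested with a toy sketch; the rule conditions and behavior names below are invented and do not reflect the scenario manager's actual rules.

    # Toy sketch of scenario-manager trigger rules for scripted behaviors;
    # conditions and behavior names are invented for illustration.
    RULES = [
        # (condition over the observed world state, scripted behavior to trigger)
        (lambda w: w.get("lieutenant_arrived") and not w.get("crowd_formed"),
         "crowd_gathers_at_accident"),
        (lambda w: w.get("medevac_requested") and not w.get("lz_marked"),
         "soldier_runs_to_mark_lz"),
    ]

    def scenario_manager_step(world_state, trigger):
        for condition, behavior in RULES:
            if condition(world_state):
                trigger(behavior)          # e.g. start a pre-specified motion path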
3. VIRTUAL HUMANS
Our virtual humans are built on prior work in the areas of embodied conversational agents (Cassell, Sullivan, Prevost, & Churchill, 2000) and animated pedagogical agents (Johnson, Rickel, & Lester, 2000), but they integrate a broader set of capabilities than any prior work. For the types of training scenarios we are targeting, the virtual humans must integrate three broad influences on their behavior: they must perceive and act in a 3D virtual world, they must engage in face-to-face spoken dialogues with people and other virtual humans in such worlds, and they must exhibit human-like emotions. Classic work on virtual humans in the computer graphics community focused on perception and action in 3D worlds (Badler, Phillips, & Webber, 1993; Thalmann, 1993), but largely ignored dialogue and emotions. Several systems have carefully modeled the interplay between speech and nonverbal behavior in face-to-face dialogue (Cassell, Bickmore, Campbell, Vilhjálmsson, & Yan, 2000; Cassell et al., 1994; Pelachaud, Badler, & Steedman, 1996) but these virtual humans did not include emotions and could not participate in physical tasks in 3D worlds. Some work has begun to explore the integration of conversational capabilities with emotions (Lester, Towns, Callaway, Voerman, & FitzGerald, 2000; Marsella, Johnson, & LaBore, 2000; Poggi & Pelachaud, 2000), but still does not address physical tasks in 3D worlds. Likewise, our prior work on Steve addressed the issues of integrating face-to-face dialogue with collaboration on physical tasks in a 3D virtual world (Rickel & Johnson, 1999a, 1999b, 2000), but Steve did not include emotions and had far less sophisticated dialogue capabilities than our current virtual humans. The tight integration of all these capabilities is one of the most novel aspects of our current work. The virtual humans, which include the sergeant, medic, and mother in the scenario, are implemented in Soar, a general architecture for building intelligent agents (Newell, 1990) and build on the earlier Steve system. As such, their behavior is not scripted; rather, it is driven by a set of general, domain-independent capabilities discussed below. The virtual humans perceive events in the simulation, reason about the tasks they are performing, and they control the bodies and faces of the PeopleShopTM characters to which they have been assigned. They send messages to one another, to the character bodies, and to the audio system via the Communications Bus shown in Figure 3.
3.1. Virtual Human Architecture
In order for virtual humans to collaborate with people and each other in scenarios like the peacekeeping mission with a sufficient illusion of human-like behavior to keep human users engaged, they must include a wide variety of capabilities, such as perception, planning, spoken dialogue, and emotions. Our research objectives are to advance the state of the art in each of these areas, but also to explore their integration into a single agent architecture. Thus, we desired a flexible architecture for our virtual humans that would allow us to easily experiment with the connections between the individual components.
A blackboard architecture, in which individual components have access to the intermediate and final results of other components, provides such flexibility. The alternative, in which each module would explicitly pass specific information to other components, would require constant revision as we made progress understanding the interdependencies among components. In contrast, a blackboard architecture would make all intermediate and final results of individual components available by default, so the designers of each component could make use of such results as they proved useful.
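The pattern can be suggested with a toy blackboard: components post their intermediate results to a shared store and read whatever other components have posted. This is only a schematic illustration of the pattern, not Soar's actual working-memory interface.

    # Toy illustration of the blackboard pattern; not Soar's working-memory API.
    class Blackboard:
        def __init__(self):
            self.memory = {}

        def post(self, key, value):           # a component publishes a result
            self.memory[key] = value

        def read(self, key, default=None):    # any other component may use it
            return self.memory.get(key, default)

    bb = Blackboard()
    bb.post("speech.hypothesis", "secure the area")
    bb.post("dialogue.addressee", "sergeant")
    # the task reasoner can consult both results without a dedicated interface
    utterance = bb.read("speech.hypothesis")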
Figure 3. Virtual Human Architecture
For our integrated architecture, we chose Soar, because it allows each component to be implemented with production rules that read from and write to a common working memory, which acts as the desired blackboard. Soar further breaks computation into a sequence of intermediate operators that are proposed in parallel but selected sequentially via an arbitration mechanism. This allows for tight interleaving of operators from individual components and flexible control over their priority. All components of the virtual humans are implemented in Soar, with several exceptions: speech recognition, natural language understanding (syntactic and semantic analysis), synchronization of verbal and nonverbal components of output utterances, and speech synthesis. It was less practical to implement these four
components in Soar because each was built on top of existing software that would have been difficult to reimplement.
3.2. Task Representation and Reasoning
To collaborate with humans and other synthetic teammates, virtual humans need to understand how past events, present circumstances, and future possibilities impact team tasks and goals. For example, the platoon sergeant agent must be able to brief the trainee on past events that led to the accident and must reason about how the victim's current injuries impact the platoon's mission. More generally, agents must understand task goals and how to assess whether they are currently satisfied, the actions that can achieve them, how the team must coordinate the selection and execution of those actions, and how to adapt execution to unexpected events. To provide this understanding, our agents use domain-independent reasoning algorithms operating over a general, declarative representation of team tasks, and this representation is used to encode their domain-specific task knowledge for a given training scenario (or class of scenarios). The agent maintains an explicit representation of past, present and future task-related information in Soar's working memory. This representation extends our earlier work on virtual humans for team training (Rickel & Johnson, 2002) and includes three components: the task description, a causal history, and the current world description. The task description consists of a set of steps, each of which is either a primitive action (e.g., a physical or sensing action in the virtual world) or an abstract action which must itself be further decomposed. Abstract actions give tasks a hierarchical structure. There may be ordering constraints among the steps, which define a partial order. Interdependencies among steps in the task description or causal history are represented as a set of causal links and threat relations (McAllester & Rosenblitt, 1991). Each causal link specifies that an effect of a step in the task could achieve a particular goal that is a precondition for another step in the task (or for termination of the task). For example, in our military domain there is an action of marking a landing zone with smoke, which achieves the goal of allowing a helicopter pilot to visually identify the landing zone, which in turn is a precondition for landing it. Threat relations specify that an effect of a step could threaten a causal link by unachieving the goal before it is needed. For example, extinguishing the smoke before the helicopter arrives threatens the helicopter's ability to land. The causal history maintains a sequence of executed steps (including unexpected and non-task events), interdependencies between past steps (e.g., causal links), as well as interdependencies between past steps and future steps in the task description. In addition to understanding the structure of tasks, agents must understand the roles of each team member. Each task step is associated with the team member that is responsible for performing it (Rickel & Johnson, 2002). We have also extended our representation to include an optional association of each task step with the teammate who has authority over its execution; that is, the teammate responsible for a task step cannot perform it until authorization is given by the specified teammate
with authority (Traum et al., 2003). This extension to the representation was required to model the hierarchical organizational structure of some teams, such as in the military. Given a top-level abstract task for the team to accomplish, each agent independently uses its task knowledge to construct a complete task model. Starting with the task description for the top-level task, the agent recursively expands any abstract step with its task description, until the agent has a fully decomposed, hierarchical task model. Agents may or may not be given identical task knowledge, and so may or may not construct identical task models; this can be used to model teammates with partial or erroneous knowledge. An agent's task model represents its understanding of the task in general, independent of the current scenario conditions. To guide execution of the task and robustly handle unexpected events that require adaptive execution or replanning, agents use a partial-order planning algorithm over the task model; the algorithm is described in detail in (Rickel & Johnson, 1999a), and its application to reasoning about team tasks is detailed in (Rickel & Johnson, 2002). The task model specifies all the steps that might be required to complete the task; it can be viewed as a worst-case plan. Agents continually monitor the state of the virtual world via messages from the simulator (Rickel & Johnson, 1999a) that are filtered to reflect perceptual limitations (Rickel et al., 2002). These perceptions allow the agents to update their representations of the status of goals in the task model as satisfied, unsatisfied, or unknown (if they cannot currently perceive the state of the goal). The planning algorithm works backwards through the causal links in the task model to identify goals that are currently desired and task steps that are currently intended to establish those desired goals. Just as the status of a goal can be satisfied, unsatisfied, or unknown, the planning algorithm marks the "desired" property of goals and the "intended" property of steps as true, false, or unknown. The result of this planning algorithm specifies how the agent privately believes that the team can collectively complete the task, with some causal links specifying the interdependencies among team members' actions. Agents continually revise this private plan as the scenario unfolds. A key aspect of collaborative planning is negotiating about alternative ways to achieve team goals (Traum et al., 2003). To support such negotiation, we have extended our earlier representation so that task models support reasoning about alternative, mutually exclusive courses of action (recipes) for achieving tasks, and we have added mechanisms for evaluating the relative strengths and weaknesses of different alternatives. These courses of action are self-contained hierarchical tasks in the sense defined above, and subject to the same dynamic task reasoning. For example, one might evacuate someone to a hospital by using either a medevac helicopter or an ambulance. Depending on the circumstances, only one option might be possible (e.g., the medevac may be unavailable or the injuries may be too severe for an ambulance), but if both are valid options, they must be ranked through some reasoned analysis of their relative costs and benefits.
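To make the representation and the backward pass more concrete, the following sketch uses invented field names and a deliberately simplified marking routine (unknown goal status is treated as unsatisfied); it is an illustration of the ideas above, not the MRE system's code.

    # Data-structure sketch of steps and causal links, plus a simplified backward
    # pass that marks desired goals and intended steps. Names are illustrative.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Step:
        name: str
        responsible: str                     # teammate who performs the step
        authority: Optional[str] = None      # teammate who must authorize it
        preconditions: List[str] = field(default_factory=list)
        substeps: List["Step"] = field(default_factory=list)   # decomposition

    @dataclass
    class CausalLink:
        producer: str                        # step whose effect achieves the goal
        goal: str                            # condition it establishes
        consumer: str                        # step (or task end) that needs it

    SAT, UNSAT, UNKNOWN = "satisfied", "unsatisfied", "unknown"

    def mark_intentions(steps, links, goal_status, end_goals):
        """Work backwards through causal links from the goals needed to finish."""
        by_name = {s.name: s for s in steps}
        desired, intended, changed = set(end_goals), set(), True
        while changed:
            changed = False
            for link in links:
                needed = link.goal in desired and goal_status.get(link.goal, UNKNOWN) != SAT
                if needed and link.producer not in intended:
                    intended.add(link.producer)
                    desired.update(by_name[link.producer].preconditions)
                    changed = True
        return desired, intended

    # e.g. marking smoke is intended only while the LZ still needs to be identified
    mark = Step("mark LZ with smoke", responsible="sergeant", authority="lieutenant")
    land = Step("land medevac", responsible="pilot",
                preconditions=["pilot can identify LZ"])
    links = [CausalLink("mark LZ with smoke", "pilot can identify LZ", "land medevac"),
             CausalLink("land medevac", "victim evacuated", "end")]
    print(mark_intentions([mark, land], links,
                          {"pilot can identify LZ": UNSAT, "victim evacuated": UNSAT},
                          {"victim evacuated"}))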
3.3. Natural Language Dialogue
In many ways, our natural language processing components and architecture mirror fairly traditional dialogue systems. There is a speech recognizer, semantic parser, dialogue manager, NL generator, and speech synthesizer. However, the challenges of the MRE project, including integration within an immersive story environment as well as with the other virtual human components, required innovations in most areas. Here we briefly describe the natural language processing components and capabilities; we will return later to some of the specific innovations motivated by this integration. The speech recognizer was built using the Hidden Markov Model Toolkit (http://htk.eng.cam.ac.uk/), currently employing a limited-domain finite-state language model with a vocabulary of several hundred words and about 70 phrases, together with locally trained acoustic models (Wang & Narayanan, 2002). Output is currently the single best interpretation, sent as Elvin messages, as well as indications of when the user starts and stops speaking, to manage gaze control and turn-taking behavior of agents. Speech recognition output is processed by the semantic parser module, which produces a semantic representation of the utterances. The parser uses a hybrid between finite-state transducers and statistical processing to produce a best guess at the semantic information from the input word stream (Feng, 2003). In cases in which imperfect input is given, it will robustly produce representations which may possibly be incomplete or partially incorrect. The module will provide addressee information (if vocatives were present), sentence mood, and semantic information corresponding to states and actions related to the task model. See (Traum, 2003) for more details about the semantic representation. The Soar module for each agent receives the output of the speech recognizer and semantic parser. This information is then matched against the agent's internal representation of the context, including the actions and states in the task model, current expectations, and focus to determine a set of candidate interpretations. Some of these interpretations may be underspecified, due to impoverished input, or overspecified in cases of incorrect input (either an out-of-domain utterance by the user, or an error in the speech recognizer or semantic parser). In some cases, underspecified elements can be filled in with reference to the agent's knowledge; if not, the representation is left underspecified and processing continues. The dialogue component of the Soar agent also produces a set of dialogue act interpretations of the utterance. Some of these are traditional speech acts (e.g., assert, request, inforequest) with content being the semantic interpretation, while others represent other levels of action that have been performed, such as turn-taking, grounding, and negotiation. See Traum & Rickel (2002) for details on the levels of dialogue acts. Dialogue management follows the approach of the TRINDI Project (Larsson & Traum, 2000), and specifically the EDIS system (Matheson, Poesio, & Traum, 2000). Dialogue acts are used to update an Information State that is also used as context for other aspects of agent reasoning. Soar is actually very similar to the TrindiKit software used by EDIS, so it was straightforward to adapt the prior dialogue update rules into the Soar agent. More on the aspects of information state
can be found in Traum & Rickel (2002). Decisions about how to act in dialogue are tightly coupled with other action selection decisions in the agent. The agent can choose to speak, choose to listen, choose to act related to a task, etc. Aspects of the information state provide motivations to speak, including answering questions, negotiating with respect to a request or order, giving feedback of understanding (acknowledgements, repairs, and repair requests), and making suggestions and issuing orders, when appropriate according to the task model. Once a decision is made to speak, there are several phases involved in the language production process. First is the content selection phase, in which the agent reasons about how best to achieve the output goal. Examples are which assertion to make to answer a pending question, or how to respond to a negotiation proposal. Once the content has been selected, a sentence planning phase decides the best way to convey it. The output of this phase is a case frame structure that specifies the content and some aspects of the form of each utterance. Next, realization proceeds in two passes. In the first pass, each noun phrase unit is realized in a variety of alternative forms. As described later, units with the most appropriate emotional connotations are selected. In the second pass, variations of the sentence itself are realized, using the selected noun phrases, and then similarly ranked for connotations. Finally, the sentence that maximizes the inclusion of semantic content and the expression of desired emotional connotations is selected.
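The final selection step can be illustrated with a small sketch; the scoring function and candidate format are invented, and the actual model is described by Fleischman & Hovy (2002).

    # Sketch of the final ranking step: pick the candidate sentence that best
    # covers the required semantic content and best matches the desired
    # emotional connotation. Scoring is invented for illustration.
    def select_sentence(candidates, required_content, target_connotation):
        """candidates: dicts with 'text', 'content' (a set), 'connotation' (a float)."""
        def score(c):
            coverage = len(required_content & c["content"]) / max(len(required_content), 1)
            affect = 1.0 - abs(c["connotation"] - target_connotation)
            return coverage + affect
        return max(candidates, key=score)["text"]

    best = select_sentence(
        [{"text": "There has been an accident.", "content": {"accident"}, "connotation": 0.0},
         {"text": "A boy has been critically hurt!", "content": {"accident", "victim"},
          "connotation": -0.8}],
        required_content={"accident", "victim"}, target_connotation=-0.7)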
Figure 4: Sample task model and dialogue interaction
This final sentence is then augmented with communicative gestures and sent to the synthesizer and rendering modules to produce the speech. Meanwhile, messages are sent to other agents, letting them know what the agent is saying. More details on
the generation component are described by Fleischman & Hovy (2002) and Traum, Fleischman & Hovy (2003). The speech synthesizer uses Festival and Festvox, with locally developed unit-selection limited-domain voices to provide the emotional expressiveness needed to maintain immersiveness (Johnson et al., 2002). Figure 4 shows a brief example of how dialogue behavior is integrated with task reasoning. The left side of the figure shows a small fragment of the task model: part of the "Render aid" task involves securing the assembly area, which requires that the squads are in the area; it has a decomposition involving actions of various squads, and has the effect that the area is secure. The figure also shows which agents are responsible (R) for seeing that an action is performed (either doing it themselves or acting as team leader making sure the subtasks are carried out), and which agents have authority (A) to have the action performed. With reference to this piece of the task model, consider the dialogue fragment on the right. Initially the focus is on the render aid task. When the lieutenant issues the command to secure the area (utterance U11), the sergeant recognizes the command as referring to a subaction of Render Aid in the current task model (Task 2). As a direct effect of the lieutenant issuing a command to perform this task, the lieutenant becomes committed to the task, the sergeant has an obligation to perform the task, and the task becomes authorized. Because the sergeant already agrees that this is an appropriate next step, he is able to accept it with utterance U12, which also commits him to perform the action. The sergeant then pushes this task into his task model focus and begins execution. In this case, because it is a team task requiring actions of other teammates, the sergeant, as team leader, must announce the task to the other team members. Thus, the system forms a communicative goal to make this announcement. Before the sergeant can issue this announcement, he must make sure he has the squad leaders' attention and has them engaged in conversation. He forms a goal to open a new conversation so that he can produce the announcement. Then his focus can turn to the individual tasks for each squad leader. As each one enters the sergeant's focus, he issues the command that commits the sergeant and authorizes the troops to carry it out. When the sergeant observes the troops move into action, he can infer that they have understood his order and adopted his plan. When the task completes, the conversation between sergeant and squad leaders finishes and the sergeant turns his attention to other matters.
3.4. Emotion
Our work on modeling emotion is motivated by the Cognitive Appraisal theory of emotion. Cognitive Appraisal is a psychological theory of emotion that emphasizes the relationship between emotion and cognition (Lazarus, 1991). The theory posits two basic processes: appraisal and coping. Appraisal generates emotion by assessing the person-environment relationship (did an event facilitate or inhibit the agent’s goals; who deserves blame or credit). Coping is the process of dealing with emotion, either by acting externally on the world (problem-focused coping), or by acting internally to change beliefs or attention (emotion-focused coping). Coping and appraisal interact and unfold over time, modeling the temporal
character of emotion noted by several emotion researchers (Lazarus, 1991; Scherer, 1984): an agent may "feel" distress for an event (appraisal), which motivates the shifting of blame (coping), which leads to anger (re-appraisal). In re-casting this theory as a computational model, we have tied appraisals and coping to the explicit representation of past, present, and future task-related information in Soar's working memory, discussed above. This representation has several advantages for modeling emotion. It cleanly separates domain-specific knowledge (e.g., specific action definitions, probabilities and utilities) from the domain-independent mechanisms that operate on these representations. It acts as a blackboard architecture, simplifying communication between appraisal, coping, and other mechanisms (like planning) that operate on the interpretation. It facilitates reasoning about blame and indirect consequences of action (e.g., a threat to a sub-goal might be distressing, not because the sub-goal is intrinsically important, but because it facilitates a larger goal). It provides a uniform representation of past and future actions (this action caused an effect which I can use to achieve that goal). Finally, it facilitates reasoning about different agents' perspectives (I think this outcome is good but I believe you think it is bad). Our approach to appraisal assesses the agent-environment relationship via features of this explicit task representation (Gratch, 2000). Speaking loosely, we treat appraisal as a set of feature detectors that map features of this representation into appraisal variables that characterize the consequences of an event from the agent's perspective. These variables include the desirability of those consequences, the likelihood of them occurring, who deserves credit or blame and a measure of the agent's ability to alter those consequences. The result of this feature detection is one or more data structures, called appraisal frames, which characterize the agent's emotional reactions to an event. Thus, the belief that another agent has caused an undesirable outcome leads to distress and possibly anger. Our computational model of coping, as described in Marsella & Gratch (2002), similarly exploits the task representation to uncover which features led to the appraised emotion, and what potential there may be for altering these features. In essence, coping is the inverse of appraisal. To discharge a strong emotion about some situation, one obvious strategy is to change one or more of the factors that contributed to the emotion. Coping operates on the same representations as the appraisals, the agent's beliefs, goals and plans, but in reverse, seeking to make a change, directly or indirectly, that would have the desired impact on appraisal. Coping could impact the agent's beliefs about the situation, such as the importance of a threatened goal, the likelihood of the threat, responsibility for the threat, etc. Further, the agent might form intentions to change external factors, for example, by performing some action that removes the threat. Indeed, our coping strategies can involve a combination of such approaches. This mirrors how coping processes are understood to operate in human behavior, whereby people may employ a mix of problem-focused coping and emotion-focused coping to deal with stress.
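As a simplified illustration of an appraisal frame and the feature-detection view of appraisal, consider the following sketch; the helper calls on the task state are hypothetical, and the mapping is far cruder than the model described in Gratch (2000).

    # Sketch of appraisal as feature detection over the task representation,
    # yielding an appraisal frame; goal_utility() and can_counter() are
    # hypothetical stand-ins for queries against the task representation.
    from dataclasses import dataclass

    @dataclass
    class AppraisalFrame:
        event: str
        desirability: float       # negative if the event inhibits a goal
        likelihood: float         # how likely the consequence is
        blame: str                # agent judged responsible
        changeability: float      # perceived ability to alter the consequence

    def appraise(event, task_state):
        utility = task_state.goal_utility(event.affected_goal)
        return AppraisalFrame(
            event=event.name,
            desirability=-utility if event.threatens else utility,
            likelihood=event.probability,
            blame=event.causer,
            changeability=task_state.can_counter(event),
        )
    # e.g. a likely, undesirable consequence blamed on another agent maps to
    # distress and, once blame is assigned, possibly anger.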
Coping behavior is focused by the Soar operators that update the task representation; this focusing helps reveal the emotional significance of those operations and can inform their prioritization. At any point in time, the virtual humans have many different emotions corresponding to multiple features of the task
representation. To perform in the virtual environment, an agent must understand and generate speech, generate and repair plans, and direct its sensors to perceive activities in the environment. All of these operations reference or modify the agent’s interpretation of past, present or future task-related information. For example, perception updates beliefs. Each time one of these operations accesses an element of the task representation, it activates any emotional appraisals associated with that element. These emotions associated with the element are made available as “concerns” for the coping process. Whereas there has been prior work in computational models of appraisal, there has been little prior work in modeling the myriad ways that people cope with emotions. And yet coping behavior is a key aspect of human behavior. People employ a rich set of coping strategies, and different individuals tend to adopt stable and characteristic “coping styles” that are correlated with personality type. Our work builds a library of these strategies and uses personality-inspired preference rules to model consistent differences in style across different agents. For example, our virtual humans may take preemptive action to circumvent a stressful factor, shift blame to another agent, or behaviorally disengage from attempts to achieve a goal that is being thwarted or threatened. 3.5. Body Movements
Internally, the virtual humans are continually perceiving the events surrounding them, understanding utterances, updating their beliefs, formulating and revising plans, generating emotional appraisals, and choosing actions. Our goal is to manifest the rich dynamics of this cognitive and emotional inner state through each character's external behavior using the same verbal and nonverbal cues that people use to understand one another. The key challenge is the range of behaviors that must be seamlessly integrated: each character's body movements must reflect its awareness of events in the virtual world, its physical actions, the myriad of nonverbal signals that accompany speech during social interactions (e.g., gaze shifts, head movements, and gestures), and its emotional reactions. Since gaze indicates a character's focus of attention, it is a key element in any model of outward behavior, and must be closely synchronized to the character's inner thoughts. Prior work on gaze in virtual humans has considered either task-related gaze (Chopra-Khullar & Badler, 2001) or social gaze (Cassell et al., 1994) but has not produced an integrated model of the two. Our gaze model is driven by our cognitive model, which interleaves task-related behaviors, social behaviors, and attention capture. Task-related behaviors (e.g., checking the status of a goal or monitoring for an expected effect or action) trigger a corresponding gaze shift, as does attention capture (e.g., hearing a new sound in the environment). Gaze during social interactions is driven by the dialogue state and the state of the virtual human's own processing, including gaze at an interlocutor who is speaking, gaze aversion during utterance planning (to claim or hold the turn), gaze at an addressee when speaking, and gaze when expecting someone to speak. This tight integration of gaze
behaviors with our underlying cognitive model ensures that the outward attention of the virtual humans is synchronized with their inner thoughts. Body movements are also critical for conveying emotional changes, including facial expressions, gestures, posture, gaze and head movements (Marsella, Gratch, & Rickel, 2001; Marsella, Gratch & Rickel 2003). In humans, these behaviors are signals and, as such, they can be used intentionally by an individual to inform or deceive, but can also unintentionally reveal information about the individual's internal emotional state. Thus a person's behavior may express anger because they feel it, because they want others to think they feel it, or for both reasons. Prior work on emotional expression in virtual humans has focused either on intentional emotional expression or on expression as a window on internal emotional state (Neal Reilly, 1996). Our work attempts to integrate these aspects by tying expressive behavior to coping behavior. As noted earlier, emotional changes in the virtual human unfold as a consequence of Soar operators updating the task representation. These operators provide a focus for emotional processes, invoking coping strategies to address the resulting emotions, which in turn leads to expressive behaviors. This focus on operators both centers emotional expression on the agent's current internal cognitive processing and allows coping to alter the relation of the expression to those internal cognitive processes. Thus, when making amends, our virtual humans might freely express their true appraisal-based feelings of guilt and concern, for example through facial expressions, gestures, posture, gaze and head movements. However, when shifting responsibility, they might suppress an initial expression of guilt and instead express anger at the character they are blaming, to reflect a more calculated attempt to persuade others. Finally, a wide range of body movements are typically closely linked to speech, movements that emphasize, augment and even supplant components of the spoken linguistic information. Consistent with this close relation, this nonverbal behavior, which can include hand-arm gestures, head movements and postural shifts, is typically synchronized in time with the speech. Realizing this synchronization faces the challenge that we do not have an incremental model of speech production. Such a model would allow us to tie nonverbal behaviors to speech production operations much as the gaze and coping behaviors are tied to cognitive operations. Instead, our approach is to plan out the utterance and annotate it with nonverbal behavior. The annotated utterance is then passed to a text-to-speech generation system that schedules both the verbal and nonverbal behavior, using the BEAT system (Cassell, Vilhjálmsson, & Bickmore, 2001). This approach is similar to that of Cassell et al. (1994). Our work differs in the structure passed to the gesture annotation process, in order to capture the myriad ways that the nonverbal behavior can relate to the spoken dialog and the internal state of the virtual human. Specifically, while both systems pass the syntactic, semantic and pragmatic structure of the utterance, we additionally pass the emotional appraisal and coping information associated with the components of the utterance. The gesture annotation process uses this information to annotate the utterance with gestures, head movements, eyebrow lifts and eyeflashes.
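As an illustration of this pipeline, the sketch below shows one plausible shape for an utterance span annotated with linguistic structure plus appraisal and coping information before it is handed to a behavior scheduler. The field names, gesture labels and the annotation rules are hypothetical; they are not the actual MRE or BEAT interfaces.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class AnnotatedSpan:
    """One chunk of the planned utterance plus the nonverbal behavior tied to it."""
    text: str
    syntactic_role: str            # e.g. "rheme" or "theme"
    appraisal: Dict[str, float]    # e.g. {"anger": 0.7} from the emotion model
    coping_strategy: str           # e.g. "shift-blame", "make-amends"
    gestures: List[str] = field(default_factory=list)

def annotate(spans: List[AnnotatedSpan]) -> List[AnnotatedSpan]:
    """Toy annotation pass: derive gestures, head moves and brow raises
    from the emotional and coping information attached to each span."""
    for span in spans:
        if span.appraisal.get("anger", 0.0) > 0.5 and span.coping_strategy == "shift-blame":
            span.gestures += ["deictic_point_at_blamed", "brow_frown"]
        elif span.appraisal.get("guilt", 0.0) > 0.5 and span.coping_strategy == "make-amends":
            span.gestures += ["gaze_down", "open_palm"]
        if span.syntactic_role == "rheme":   # emphasize the new information
            span.gestures.append("beat_gesture")
    return spans

utterance = [
    AnnotatedSpan("They rammed into us,", "rheme", {"anger": 0.8}, "shift-blame"),
    AnnotatedSpan("sir.", "theme", {}, "none"),
]
for span in annotate(utterance):
    print(span.text, "->", span.gestures)
```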
4. THE ROLE OF STORY The AI-driven virtual humans in MRE are autonomous, as of course is the human trainee. However, there is an overall scenario or story that sets the context and shapes the experience for the trainee. Since there are certain pedagogical goals we want to achieve for the trainee, we feel that it is necessary to provide structure and guidance to the experience he has. If he is allowed to wander aimlessly through the simulation, he may never encounter the decision-making dilemmas we want him to experience. Thus story is critical to the training experience, and the quality of the story can determine the degree of engagement the trainee feels. The function and value of story is often misunderstood and misrepresented. Story is not simply a progression of events. Passive (or traditional) storytelling is an integration of elements, harmoniously interacting to create a seamless, involving experience. Interactive storytelling adds further complexity to the task. In its best form, passive storytelling is predictably unpredictable. It is predictable because the ultimate outcome of the story is usually ordained, and the audience expects that outcome. Most good stories contain elements of parable and morality. Some typical predictable outcomes are: good triumphs over evil; justice prevails; hard work is rewarded; immoral acts are punished. Stories use the element of verisimilitude to create a coherent world, where this message can be delivered to the reader/listener/viewer. While stories are usually relegated to entertainment, the pedagogical goals of learning can be interwoven into the fabric of a story. In this form, story becomes an essential element of simulation training. The unpredictable element of a story is critical but infrequently understood. It is said that there are only five to seven basic stories. Yet, there have been hundreds of thousands of stories told, if not millions. And we, the reader/listener/viewer, feel that the experience is new with each story. How is that so? It is the application of unpredictability to the equation. While the outcome of the story is pre-ordained, the way in which the story is told and the actual events that occur within the story can vary almost infinitely. The story need not even be told linearly (consider the film “Memento”). It is these sudden shifts in events, character behaviors, and environment that engender the reader/listener/viewer's involvement. Creating these unpredictable elements is the essential function of the writer, and the brilliance by which they accomplish this end separates the hack from the award winner. Interactive storytelling compounds the task of the writer. The predictable element of the story remains much the same. The pedagogic/morality aspect is relatively unchanged. And the story still progresses forward in time from beginning to end. But interactive storytelling vastly increases the complexity of the unpredictable element. The participant in the interactive experience gains a measure of control over events, and the writer must use a different construct to keep the values of the story secure and still provide a compelling experience. This is accomplished by devising a support structure for the story, much in the way that a road surface enables a car to move easily forward.
The support structure also provides the triggers for the unpredictable elements that allow the writer to maintain control of the immersive experience.
This story support structure is also critical to the integration of story with other interactive elements in the simulation. As we will discuss in the next section, the integration of story with the technical aspects of the simulation can synergistically enhance the simulation experience. 5. PUTTING IT ALL TOGETHER: THE VALUE OF INTEGRATION We have described the major technical components of the Mission Rehearsal Exercise system and the story content that the MRE scenario is based on. As we pointed out in the introduction, software integration is necessary to make sure that all the various pieces in a system work together properly, but one usually expects that the real research takes place in building the individual components. One doesn’t expect to learn much from integration (except perhaps to find that some components don’t interface properly). However, in integrating the Mission Rehearsal Exercise system, we have been surprised: we have uncovered new research issues, and new approaches to existing problems have been suggested. In this section we outline some of the things we learned as we brought all the pieces together. 5.1. Dialogue in rich social environments with multiple characters
As we have argued above, a good story involves compelling characters with intriguing interrelationships. Supporting such an environment imposes new demands on natural language processing. In particular, the Bosnian scenario of MRE, with a cast of many characters occupying various roles in a rich social fabric, is quite different from the usual case of natural language dialogue with a single human and single computer system interacting. While some aspects of dialogue as social interaction had already been addressed in previous work (e.g., discourse obligations in Traum & Allen, 1994), many new issues needed to be addressed to allow agents to understand and engage in this sort of dialogue. These issues include:
- Is the intended addressee paying attention?
- Is he already engaged in conversation?
- How will hearers recognize who is the addressee?
- How are vocatives and gaze, as well as context reasoning, used to help this process?
- How are multiple, interleaved conversations managed (e.g., talking face to face with one character while on the radio to another)?
These issues have implications for agents in both understanding and producing communications, and for representing the dialogue state. Furthermore, there are differences depending on whether the conversation is between virtual humans or between the human trainee and a virtual human, because more limited information is available in the second case. We have begun to address these issues in several ways. First, the dialogue model has been extended so that who is being addressed is captured as well as the content to be conveyed. Second, we have introduced conventions for marking the start and termination of a conversation with an agent. A conversation begins by
addressing the character either by name or by his role. For example, the lieutenant might give the sergeant an order by saying: “Sergeant, send first squad to Celic!” Once a conversation has been started, it is assumed to continue until it is terminated, either by the purpose having been fulfilled (for a short task-specific conversation like securing the area), or by an explicit closing (e.g., "out" on the radio). For conversations between the human trainee and the virtual humans, we rely on these conventions to determine who is addressing whom. For conversations between virtual humans, the problem of determining who is being addressed is easier, because it is all represented internally. However, the virtual humans use the same reasoning methods when talking among themselves as they use for interacting with the trainee, so their behavior is consistent. We feel this is an important constraint to achieve consistency in interface behavior. See Traum and Rickel (2002) for additional details on our work in this area. In the future, we would like to make use of head-tracking data to determine who the trainee is looking at when he speaks. This is an additional source of information that could be used to determine whom he is addressing. We are just beginning to scratch the surface in this area, and we hope to see more sophisticated techniques emerge as research progresses. But the surprising thing to us is that this area of inquiry has received so little attention from the computational linguistics community, yet it is clearly basic to multi-person interactions. It points to the value of large-scale integration, which has forced us to confront this new research issue.
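These addressing and conversation-management conventions can be made concrete with a small sketch. The following is a minimal, hypothetical illustration, not the MRE dialogue module, of resolving the addressee of an utterance from an explicit vocative or from the currently open conversation, and of opening and closing conversations; all names and data are illustrative.

```python
from typing import Optional

# Minimal sketch of multi-party conversation bookkeeping (illustrative only).
ROLES = {"sergeant": "sergeant", "medic": "medic", "mom": "mother"}
CLOSINGS = {"out", "over and out"}            # e.g. radio closings

open_conversations = []                        # list of (speaker, addressee) pairs

def resolve_addressee(speaker: str, utterance: str) -> Optional[str]:
    """Prefer an explicit vocative ("Sergeant, ..."); otherwise fall back to
    the addressee of the most recently opened, still-open conversation."""
    first_word = utterance.split(",")[0].strip().lower()
    if first_word in ROLES:
        return ROLES[first_word]
    for s, addressee in reversed(open_conversations):
        if s == speaker:
            return addressee
    return None                                # addressee unknown; ask or ignore

def update_conversations(speaker: str, addressee: str, utterance: str) -> None:
    """Open a conversation on first address; close it on an explicit closing."""
    pair = (speaker, addressee)
    if utterance.strip().lower() in CLOSINGS:
        if pair in open_conversations:
            open_conversations.remove(pair)
    elif pair not in open_conversations:
        open_conversations.append(pair)

who = resolve_addressee("lieutenant", "Sergeant, send first squad to Celic!")
update_conversations("lieutenant", who, "Sergeant, send first squad to Celic!")
print(who, open_conversations)
```

5.2. The Pervasive Effect of Emotion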
In humans, emotion has a broad effect on behavior. It affects how we speak, how we gesture, our posture, and even how we reason. And, of course, emotion is indispensable for creating good story and compelling characters. In integrating emotion into our virtual humans, we have found that we need to deal with a similarly broad range of issues. Models of emotion can both affect the behavior of other components of the virtual human, and they can provide additional knowledge that the system can use in reasoning. Below we give an example of each. 5.3. Emotionally Appropriate Natural Language Generation
A big challenge for Natural Language Generation in MRE is the generation of emotionally appropriate language, which expresses both the desired information and the desired emotional attitude towards that information. Each expressive variant casts an emotional shade on each representational item it contains (for example, the phrase governed by the verb “ram” as in “They rammed into us, sir” casts the subject in a negative and the object in a positive light). Prior work on the generation of variant expressions, such as Bateman and Paris (1989) and Hovy (1990), uses quite simplistic emotional models of the speaker and hearer. In general, these systems simply had to choose among a small set of phrases and, within the phrase, from a small set of lexical fillers for certain positions of the phrase, where each alternative
phrase and lexical item was pre-annotated with an affective value such as good or bad. The presence in MRE of an emotion model provides a considerably finer-grained level of control, enabling principled realization decisions over a far more nuanced set of expressive alternatives. Given many representational items, a rich set of emotional values potentially holding for them, and numerous phrases, each with its own combination of positive and negative fields, the problem was to design a system that can reliably and quickly find the optimal phrasing without dropping content. (Of course, in some cases no perfect solution may exist. The best way to say “we crashed into them” may be “they were bumped”, but it omits part of the material to be conveyed.) Emotion-based realization involves a potentially expensive process of casting representation items into phrase positions with appropriate connotations, where different positions may have different strengths, and making sure that the phrases themselves cover the material to be conveyed. To compute shades of connotation more accurately and quickly, we created a vector space in which we can represent the desired attitudes of the speaker (as specified by the emotion model) as well as the overall emotional value of each candidate expression (whether noun phrase or whole sentence). Using a standard Euclidean distance measure, we can then determine which variant expression most closely matches the desired effect. After realization has produced all variants for a given input, and determined their distances from the emotion model’s desired value, the ranking algorithm then combines the distance scores with a score reflecting how much of the input content was in fact realized in the output. The overall winner is selected and passed along for speech synthesis. More details on the NLG Module can be found in Fleischman & Hovy (2002).
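The ranking step just described can be illustrated with a small sketch. This is not the MRE NLG module; it simply shows, under assumed names and toy data, how candidate realizations might be scored by Euclidean distance from a desired emotional vector and combined with a content-coverage score, with the overall winner selected.

```python
import math

def emotional_distance(desired: dict, candidate: dict) -> float:
    """Euclidean distance between the desired attitude vector and a candidate's
    emotional value, taken over the union of their dimensions."""
    dims = set(desired) | set(candidate)
    return math.sqrt(sum((desired.get(d, 0.0) - candidate.get(d, 0.0)) ** 2 for d in dims))

def rank(desired: dict, candidates: list, coverage_weight: float = 0.5) -> list:
    """Combine content coverage with (negated) emotional distance and return
    the candidates sorted best-first."""
    def score(c):
        return (coverage_weight * c["coverage"]
                - (1 - coverage_weight) * emotional_distance(desired, c["emotion"]))
    return sorted(candidates, key=score, reverse=True)

# Toy candidates for reporting the accident, with entirely hypothetical scores.
desired_attitude = {"blame_other": 0.8, "self_blame": 0.0}
candidates = [
    {"text": "They rammed into us, sir.", "emotion": {"blame_other": 0.9, "self_blame": 0.0}, "coverage": 1.0},
    {"text": "We crashed into them.",     "emotion": {"blame_other": 0.0, "self_blame": 0.8}, "coverage": 1.0},
    {"text": "They were bumped.",         "emotion": {"blame_other": 0.1, "self_blame": 0.1}, "coverage": 0.6},
]
best = rank(desired_attitude, candidates)[0]
print(best["text"])   # "They rammed into us, sir."
```

5.4. Using Emotion to Determine Linguistic Focus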
In natural language, we often refer to things in imprecise ways. To correctly interpret such referents in a natural language utterance, one needs to understand what is in linguistic focus. Loosely speaking, one needs to understand what is the main subject of discussion. For example, when the lieutenant trainee arrives at the accident scene in the MRE scenario, he might ask the sergeant, “What happened here?” In principle many things have happened: the lieutenant just drove up, the soldiers assembled at the meeting point, an accident occurred, a crowd formed, and so forth. The sergeant could talk about any one of these and be factually correct, but he would sound quite silly if he responded: “Well, you just drove up, sir.” The expected response is for the sergeant to talk about the accident. To produce an appropriate response the sergeant needs to understand that the accident is in linguistic focus. A number of heuristics have been developed to model linguistic focus. One such heuristic is based on the idea of recency. It holds that the entity that is in linguistic focus is whatever was most recently discussed, or occurred most recently. In this case, recency doesn’t work, since the lieutenant opens the conversation with his question and several things have happened subsequent to the accident.
However, people are often focused most strongly on the things that upset them emotionally, which suggests an emotion-based heuristic for determining linguistic focus. Because we have modeled the sergeant’s emotions in MRE, the linguistic routines have access to the fact that he is upset about the accident and can use that information in determining linguistic focus, allowing the sergeant to give the most appropriate answer and describe the accident and how it occurred.
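A toy comparison of the two heuristics, with entirely hypothetical event data, is sketched below: a recency-based choice picks the most recent event, while an emotion-weighted choice picks the event the speaker is most upset about.

```python
# Hypothetical recent events known to the sergeant, with appraisal intensities.
events = [
    {"name": "accident occurred",   "time": 10, "distress": 0.9},
    {"name": "crowd formed",        "time": 11, "distress": 0.4},
    {"name": "lieutenant drove up", "time": 12, "distress": 0.0},
]

def focus_by_recency(evts):
    return max(evts, key=lambda e: e["time"])

def focus_by_emotion(evts):
    return max(evts, key=lambda e: e["distress"])

print(focus_by_recency(events)["name"])   # "lieutenant drove up", the silly answer
print(focus_by_emotion(events)["name"])   # "accident occurred", the expected answer
```

5.5. Integration of Story with Virtual (and Real) Humans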
A well-constructed story can play a critical role in enhancing a trainee’s experience. We have identified several ways in which this can occur. First, because the technology is still immature, interactivity with virtual humans driven by artificial intelligence is constrained. AI characters have limited intelligence and range of activities. The story support structure must be aware of these limitations and other factors within the simulation environment and respond accordingly. For example, because the BDI characters do not support collision detection, characters cannot touch each other or be in close proximity; a small error in positioning would make them simply move through one another, destroying the illusion. For similar reasons, it is very difficult to animate the soldiers climbing into the Humvee. The writer must build such constraints into the story support system. In the case of the Humvee, the easy solution was to have the AI characters enter from the far side of the vehicle, so that the viewer would not be aware that they did not actually climb into it. Second, the story support structure can use the element of unpredictability to distract the viewing audience from the flaws in the system. Motion picture and television people do this often. If they do not want the viewer to focus attention on a particular portion of the screen, they use distraction as a tool. For example, where budget constraints have made them use a painted backdrop instead of an actual location, the creative people create activity to draw the eye to the desired portion of the screen. And, because the viewer is willing to suspend disbelief, he or she does not focus upon the backdrop and accepts the patently false image as real. This use of story distraction is a powerful tool for the MRE simulation. Returning again to the difficulty of animating the soldiers getting into their vehicles, we have found that sometimes the solution suggested above of having the soldiers enter on the side of the vehicle that is out of sight does not work, due to constraints in the layout of the scene. However, if the problematic action occurs in the background, and a distracting action takes place in the foreground, the problem will probably not be noticed. For example, at one point in the scenario, a number of soldiers leave in their Humvees. Although the animation of the soldiers entering the vehicles is awkward, it occurs in the background. Furthermore, at the same time, the mother character’s emotion model causes her to become very upset and start gesturing and shouting excitedly in the foreground, because she feels the soldiers are abandoning her and her son. This foreground action advances the story and at the same time distracts from the awkward animation, so viewers tend not to notice the flaws.
Third, when we consider the range of technologies that we are trying to integrate in MRE, we realize that making it all work is an ambitious goal. Indeed, we believe that if we were to try to construct virtual humans that could function in the real world and provide a wide range of capabilities such as speech recognition, natural language understanding and generation, emotion modeling and body animation, the task would be too hard for the current state of technology: the range of situations that the system would have to deal with would be too great. But we are not trying to build virtual humans that operate in the real world. Instead, we are building an artificial world that we control via a story line and introducing real people into it. The story provides a very strong context, both from a rational and an emotional perspective, that limits the possible responses that the human trainee will make. This works because people are predictable in their responses. If a hundred people from the same culture are put into the same situation, they won’t respond in a hundred different ways. Instead, a handful of responses will cover the range. A story, by providing that strong context, very much limits the range of responses the system must handle, which in turn limits the breadth and range of knowledge that must be programmed into the virtual characters. The limited testing we have performed so far has confirmed this hypothesis, although additional testing is needed. This integration of the predictable and unpredictable elements of storytelling in MRE demonstrates the critical role they play in creating an immersive training simulation. 6. STATUS An initial version of the MRE system described in this paper has been implemented and applied to the peacekeeping training scenario described earlier. The system allows the trainee, playing the role of the lieutenant, to interact freely (through speech) with the three virtual humans (sergeant, medic, and mother). The trainee's primary interaction is with the sergeant, who is the main source of information about what happened and advice about how to proceed. The trainee takes action in the virtual world through commands to the sergeant, who in turn commands the squads. Ultimately, the experience terminates with one of four possible endings, depending on the trainee's actions. However, unlike interactive narrative models based on an explicit branching structure, the system does not force the trainee through a predetermined sequence of decision points, each with a limited set of options; the trainee's interactions with the characters are unconstrained, limited only by the characters' understanding and capabilities. The understanding and capabilities of the virtual humans are limited by the coverage of their spoken dialogue models and their models of the domain tasks. The sergeant's speech recognizer currently has a vocabulary of a few hundred words, with a grammar allowing recognition of 16,000 distinct utterances. His natural language understanding module can currently produce semantic representation frames for all of these sentences, as well as providing (sometimes partial) results for different or ill-formed input. His natural language generation module currently expresses all communicative goals formed by the dialog module, modulating some
of them for affective appropriateness. His speech synthesis module currently has a vocabulary of over 1000 words. The sergeant's domain task knowledge, which is the most complex among all the virtual humans in the scenario, includes about 40 tasks and about 150 properties of the world. While the tasks represent the full range of actions that the sergeant can understand and carry out, his ability to talk about these tasks and properties (e.g., answer questions and give advice) is broad, limited only by the coverage of the spoken dialogue modules as described above. Despite its complexity, real-time performance of the system is good, although we are continuing to improve latencies. Given an utterance by the user, a virtual human typically responds within 3 seconds, including speech recognition, natural language understanding, updating dialogue and emotional states, choosing how to respond, natural language generation, planning the voice output and accompanying gestures and visemes, and finally producing the speech. As is typical of humans, the virtual humans produce communicative behaviors throughout this time delay, including averting gaze from the user during the utterance planning phases to indicate that they are formulating a response (Kendon 1967). We have tested the system with a variety of users acting as trainees. Early sessions were useful for system debugging, but since these trainees lacked the military background required to understand the appropriate actions in situations such as our peacekeeping scenario, the sessions were not useful for formal evaluations. In general, trainees with some knowledge of the scenario were often successful in using the system but were undoubtedly biased by their knowledge, and those without such knowledge often failed because they had little idea of how to proceed in such situations. We have just begun testing the system with trainees who have more appropriate military backgrounds, and we expect to report our results in a forthcoming paper. 7. SUMMARY Integration is a kind of two-edged sword. Making a large number of components work together requires a significant effort in developing a system architecture and the interfaces between the components. But as we have tried to illustrate in this paper, integration can also open up new vistas for research and it can enable new solutions to difficult problems. To us, this suggests that integration needs to be thought of as an integral part of the research process, rather than something that is done once all the research is complete. 8. ACKNOWLEDGEMENTS The authors would like to thank A. Crane, W. Crane, J. Deweese, J. Douglas, D. Feng, M. Fleischman, W.L. Johnson, Y. J. Kim, S. Kwak, C. Kyriakakis, C. LaBore, A. Marshall, D. Miraglia, B. Moore, J. Morie, M. Murguia, S. Narayanan, P. O’Neal, D. Ravichandran, M. Raibert, M. Thiébaux, L. Tuch, M. Veal, and R. Whitney for their hard work and enthusiasm that contributed greatly to this research.
This paper was developed with funds of the United States Department of the Army under contract number DAAD 19-99-D-0046. Any opinions, findings and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the United States Department of the Army. 9. REFERENCES Badler, N. I., Phillips, C. B., & Webber, B. L. (1993). Simulating Humans. New York: Oxford University Press. Bateman, J. A., & Paris, C. L. (1989). Phrasing a Text in Terms the User can Understand. Paper presented at the 11th International Joint Conference on Artificial Intelligence, Detroit, MI. Cassell, J., Bickmore, T., Campbell, L., Vilhjálmsson, H., & Yan, H. (2000). Human conversation as a system framework: Designing embodied conversational agents. In J. Cassell, J. Sullivan, S. Prevost & E. Churchill (Eds.), Embodied Conversational Agents (pp. 29-63). Boston: MIT Press. Cassell, J., Pelachaud, C., Badler, N., Steedman, M., Achorn, B., Becket, T., et al. (1994). Animated Conversation: Rule-Based Generation of Facial Expression, Gesture and Spoken Intonation for Multiple Conversational Agents. Paper presented at the ACM SIGGRAPH, Reading, MA. Cassell, J., Sullivan, J., Prevost, S., & Churchill, E. (Eds.). (2000). Embodied Conversational Agents. Cambridge, MA: MIT Press. Cassell, J., Vilhjálmsson, H., & Bickmore, T. (2001). BEAT: The Behavior Expressive Animation Toolkit. Paper presented at the SIGGRAPH, Los Angeles, CA. Chopra-Khullar, S., & Badler, N. (2001). Where to Look? Automating Attending Behaviors of Virtual Human Characters. Autonomous Agents and Multi-Agent Systems, 4(1-2), 9-23. Feng, D. (2003). Cooperative Model Based Language Understanding in Dialogue. To be presented at the NAACL/HLT Student Research Workshop, Edmonton. Fleischman, M., & Hovy, E. (2002). Emotional variation in speech-based natural language generation. Paper presented at the International Natural Language Generation Conference, Arden House, NY. Gratch, J. (2000). Émile: Marshalling Passions in Training and Education. Paper presented at the Fourth International Conference on Intelligent Agents, Barcelona, Spain. Hovy, E. H. (1990). Pragmatics and Natural Language Generation. Artificial Intelligence, 43(2), 153-198. Johnson, W. L., Narayanan, S., Whitney, R., Das, R., Bulut, M., & LaBore, C. (2002). Limited Domain Synthesis of Expressive Military Speech for Animated Characters. Paper presented at the IEEE Workshop on Speech Synthesis, Santa Monica, CA. Johnson, W. L., Rickel, J., & Lester, J. C. (2000). Animated Pedagogical Agents: Face-to-Face Interaction in Interactive Learning Environments. International Journal of AI in Education, 11, 47-78. Kendon, A. (1967). Some Functions of Gaze Direction in Two-Person Conversation. Acta Psychologica, 26, 1–47. Kyriakakis, C. (1998). Fundamental and Technological Limitations of Immersive Audio Systems. Proceedings of the IEEE, 86(5), 941-951. Larsson, S., & Traum, D. (2000). Information state and dialogue management in the TRINDI Dialogue Move Engine Toolkit. Natural Language Engineering, 6, 323-340. Lazarus, R. (1991). Emotion and Adaptation. NY: Oxford University Press. Lester, J. C., Towns, S. G., Callaway, C. B., Voerman, J. L., & FitzGerald, P. J. (2000). Deictic and Emotive Communication in Animated Pedagogical Agents. In J. Cassell, S. Prevost, J. Sullivan & E. Churchill (Eds.), Embodied Conversational Agents (pp. 123-154). Cambridge: MIT Press. Marsella, S., & Gratch, J. (2002).
A Step Toward Irrationality: Using Emotion to Change Belief. Paper presented at the First International Joint Conference on Autonomous Agents and Multiagent Systems, Bologna, Italy. Marsella, S., Gratch, J., & Rickel, J. (2001). The Effect of Affect: Modeling the Impact of Emotional State on the Behavior of Interactive Virtual Humans. Paper presented at the Agents 2001 Workshop
on Representing, Annotating, and Evaluating Non-Verbal and Verbal Communicative Acts to Achieve Contextual Embodied Agents, Montreal, Canada. Marsella, S., Gratch, J., & Rickel, J. (2003). Expressive Behaviors for Virtual Worlds. In Life-like Characters: Tools, Affective Functions and Applications, Helmut Prendinger and Mitsuru Ishizuka (eds.), Springer-Verlag (in press). Marsella, S., Johnson, W. L., & LaBore, C. (2000). Interactive Pedagogical Drama. Paper presented at the Fourth International Conference on Autonomous Agents, Montreal, Canada. Matheson, C., Poesio, M., & Traum, D. (2000). Modeling Grounding and Discourse Obligations Using Update Rules. Paper presented at the First Conference of the North American Chapter of the Association for Computational Linguistics. McAllester, D., & Rosenblitt, D. (1991). Systematic Nonlinear Planning. Paper presented at the Ninth National Conference on Artificial Intelligence, Menlo Park, CA. Neal Reilly, W. S. (1996). Believable Social and Emotional Agents. Ph.D. Thesis No. CMU-CS-96-138. Pittsburgh, PA: Carnegie Mellon University. Newell, A. (1990). Unified Theories of Cognition. Cambridge, MA: Harvard University Press. Pelachaud, C., Badler, N. I., & Steedman, M. (1996). Generating Facial Expressions for Speech. Cognitive Science, 20(1). Poggi, I., & Pelachaud, C. (2000). Emotional Meaning and Expression in Performative Faces. In A. Paiva (Ed.), Affective Interactions: Towards a New Generation of Computer Interfaces. Berlin: Springer-Verlag. Rickel, J., & Johnson, W. L. (1999a). Animated Agents for Procedural Training in Virtual Reality: Perception, Cognition, and Motor Control. Applied Artificial Intelligence, 13, 343-382. Rickel, J., & Johnson, W. L. (1999b). Virtual Humans for Team Training in Virtual Reality. Paper presented at the Ninth International Conference on Artificial Intelligence in Education. Rickel, J., & Johnson, W. L. (2000). Task-Oriented Collaboration with Embodied Agents in Virtual Worlds. In J. Cassell, J. Sullivan, S. Prevost & E. Churchill (Eds.), Embodied Conversational Agents. Boston: MIT Press. Rickel, J., & Johnson, W. L. (2002). Extending Virtual Humans to Support Team Training. In G. Lakemayer & B. Nebel (Eds.), Exploring Artificial Intelligence in the New Millenium (pp. 217-238). San Francisco: Morgan Kaufmann. Rickel, J., Marsella, S., Gratch, J., Hill, R., Traum, D., & Swartout, W. (2002). Toward a New Generation of Virtual Humans for Interactive Experiences. IEEE Intelligent Systems, July/August, 32-38. Scherer, K. (1984). On the Nature and Function of Emotion: A Component Process Approach. In K. R. Scherer & P. Ekman (Eds.), Approaches to emotion (pp. 293-317). Thalmann, D. (1993). Human Modeling and Animation. In Eurographics '93 State-of-the-Art Reports. Traum, D. (2003). Semantics and Pragmatics of Questions and Answers for Dialogue Agents. Paper presented at the Fifth International Workshop on Computational Semantics, Tilburg. Traum, D., & Allen, J. F. (1994). Discourse Obligations in Dialogue Processing. Paper presented at the 32nd Annual Meeting of the Association for Computational Linguistics. Traum, D., Fleischman, M., & Hovy, E. (2003). NL Generation for Virtual Humans in a Complex Social Environment. Paper presented at the AAAI Spring Symposium on Natural Language Generation in Spoken and Written Dialogue. Traum, D., & Rickel, J. (2002). Embodied Agents for Multi-party Dialogue in Immersive Virtual Worlds.
Paper presented at the First International Conference on Autonomous Agents and Multi-agent Systems, Bologna, Italy. Traum, D., Rickel, J., Gratch, J., & Marsella, S. (2003). Negotiation over Tasks in Hybrid Human-Agent Teams for Simulation-Based Training. To appear in Proceedings of the Autonomous Agents and Multi-Agent Systems Conference, Sydney, Australia. Wang, D., & Narayanan, S. (2002). A confidence-score based unsupervised MAP adaptation for speech recognition. Paper presented at the 36th Asilomar Conference on Signals, Systems and Computers.
10. AFFILIATIONS W. Swartout, J. Gratch, R. Hill, E. Hovy, R. Lindheim, S. Marsella, J. Rickel and D. Traum, University of Southern California, Marina del Rey, USA
P.L. WEISS
PRESENTATION TECHNOLOGIES FOR PEOPLE WITH DISABILITIES
1. INTRODUCTION 1.1. Implications of decreased mortality and increased morbidity
There exists a disturbing paradox regarding the health and welfare of modern man. On the one hand, the ready availability of nourishment and medical care has increased the mean longevity in many countries, including both the West and the East. Today, due to advances in health-care technologies, disease and trauma cause fewer deaths but lead to significantly greater disability; it is the chronic and disabling nature of these conditions that has such a deleterious impact on the quality of human life (Kaye, 1997). Infants born as early as the twenty-fourth week of gestation are now able to survive, even though many develop serious motor, perceptual and cognitive disabilities (Horbar, Badger, Carpenter, Fanaroff, Kilpatrick, LaCorte, Phibbs, & Soll, 2002). Similarly, more victims of cerebral vascular accidents now survive the initial stroke but remain impaired with significant functional limitations (Kolominsky-Rabas, Weber, Gefeller, Neundoerfer, & Heuschmann, 2001). Moreover, environmental risk factors such as air pollution and noxious materials and environments, an overabundance of unhealthy food, and an increasingly sedentary life style have combined to create a situation in which increased longevity co-exists with elevated morbidity (Samet, Zeger, Dominici, Dockery, & Schwartz, 2000). According to the U.S. Census Bureau, one out of five Americans has some form of physical or cognitive disability, while one out of ten is considered to have a severe disability (www.census.gov/prod/2003pubs/c2kbr-17.pdf). Estimates indicate that the aging of the baby boom population means that by the year 2030 individuals 65 years of age and over will constitute some 20% of the population (http://www.census.gov/hhes/www/disable/sipp/disable97.html). This increasing proportion of elderly residents leads to a need for greater health care support services due to their considerable decline in physical, sensory and cognitive abilities. Whereas only 13% of persons aged 35 to 44 years have significant functional limitations, this percentage increases to 72% for those aged more than 74 years; worse, the functional limitations of 41% of this latter age group are severe (Vanderheiden, 1997).
Thus, a significant health care challenge is to provide individuals with more than just increased longevity. The goal of health care must also be to help people achieve and maintain an enhanced quality of life which is sustained by a balance between work, leisure and play. 1.2 Beyond the visual presentation of information
The present era is often referred to as the age of electronic information. Few will dispute that much of the information available to the average consumer is abundant, ubiquitous, and highly instructive. Unfortunately this plethora of information has not always had the expected corresponding functional impact on the consumer's everyday life. On the contrary, a significant number of users suffer rather than gain by the current dependence on visually-based information retrieval in almost every sphere of their daily lives. One of the key sources of difficulty is due to the fact that, in a large majority of applications, visual display is the only sensory channel through which users can interact with and manipulate the available information. Which users are disadvantaged by the sole or primary reliance on visual display? One must immediately consider the more than 150 million people worldwide who have moderate to severe visual impairment including loss of visual acuity, visual field deficits, and color blindness (Thylefors, Negrel, Pararajasegaram, & Dadzie, 1995). The large numbers of children and adults who have learning disabilities also have considerable difficulty in interpreting information when it is presented in only a visual format (Vanderheiden, 1997). Finally, as indicated above, the number and severity of disabilities increase as people age, and often include deterioration in both visual acuity (particularly with regard to close-up vision such as reading) and visual field (Vanderheiden, 1997). However, people with sensory or learning disabilities are by no means the only segment of the population who are better able to relate to information displayed in other than a visual form. The current technological revolution in the realm of mobile computing is one of the strongest driving forces behind the provision of additional feedback channels since such devices typically have extremely small display areas (Sorid & Moore, 2000). For example, a user who accesses the Internet via a cellular phone will have difficulty in relying only on a visual display as will users of information appliances who are unable to visually attend to the display. Or, a surgeon may be carrying out a surgical procedure while obtaining essential information from a remote colleague in one of the numerous telemedicine applications now available. Finally, the need for improved data analysis techniques supports the drive to further develop auditory interfaces. Due to increases in computational power over the past two decades, enormous amounts of data can be generated from even simple scientific experiments. Existing scientific visualization techniques are used to interpret and understand these data but many claim that we are quickly approaching the limits of users' abilities to interpret and comprehend visual information (Kramer, Walker, Bonebright, Cook, Flowers, Miner, & Neuhoff, 1997). There is increasing evidence that access to multi-dimensional data via the auditory channel, due to its
superior integrative properties, may be less taxing to the user than comparable visual displays (Kramer et al., 1997). Moreover, data sets that include the need to monitor large numbers of rapidly changing variables appear to be best suited to auditory displays. 1.3 Chapter overview
This chapter focuses on a variety of issues related to increasing accessibility to information via alternate forms of presentation in order that wider segments of the population, and especially those with disabilities, can share the benefits that modern presentation technologies can provide. Users with disabilities need to have access to presentation technologies for two main reasons. First, they, similar to all users of technology, must be able to perceive the information presented in order to be able to interact with it. Sonification, the use of the auditory channel to convey information normally presented visually, is provided as an example of how alternate modes of presentation can make information accessible when vision is limited. This will be followed by a discussion of multimodal feedback with an emphasis on the use of haptic devices to increase the amount of information provided to people who have severe visual impairment. The second reason why adapted presentation technologies are important is that they provide intervention tools which can be used in therapy to achieve results superior to those possible via conventional treatment techniques. Several examples of virtual reality in the treatment of patients with neurological impairment will be used to illustrate this application of adapted presentation technology. The chapter will conclude with some examples of how intelligent presentation technologies may be applied to specifically address issues of concern for people with disabilities. Although such applications are still few in number, it is clear from the work presented in the other chapters of this book that continued effort will eventually provide greater independence for users with disabilities. 2. SONIFICATION A rapidly expanding alternative to visually based information retrieval is sonification which refers to the use of the auditory channel to convey information. Even prior to the modern era of pervasive applications of technology, there were a number of well-known examples of successful sonification applications. These included the Geiger counter, the auditory thermometer, and a variety of medical and cockpit auditory displays (Kramer, et al., 1997). A more recent application is a program that helps chemists who are blind use an auditory display to examine infrared spectrographic data (Lunney & Morrison, 1990). The concept of providing audio formatting to speech in order to represent content and structure has a long history, including techniques used in radio broadcasting and children's story books on cassette (James, 1997). Research into the development of audio presentations for blind users has explored the use of both natural sounds (Gaver, 1986) and musical sounds (Brewster et al., 1993). In
addition, work has been done on the audio formatting of text for blind computer users in the area of access to mathematics and TeX documents (Ramen, 1993). The field of sonification is inherently interdisciplinary since it is based on research from human perception, acoustics, musicology and engineering (Kramer et al., 1997). The primary disciplines include auditory perception and cognition (e.g., dynamic sound perception, auditory scene analysis, multimodal interaction, and the role of memory and attention in extracting information from sound), and physics, engineering and computer science (e.g., providing the user with flexible control over data dimensions and sound parameters, facilitating data exchange to and from a variety of formats and display systems, and integrating a perceptual testing and evaluation framework). 2.1 Auditory Perception
Auditory information is conveyed to a human listener by associating relational data with an acoustic signal. The construction of any auditory interface must therefore be based upon an understanding of users' perception of “sonified” data. Initially, research in auditory perception focused on studies of specific auditory abilities including pitch, tempo, loudness, and localization (e.g., Hartmann, 1997). More recently, investigations into other, more complex phenomena (e.g., auditory streaming, dynamic sound perception, auditory attention, multimodal displays) have been emphasized (Kramer et al., 1997). One area of particular interest is identifying tasks for which an auditory interface appears to perform as well as or better than a visual interface (for example, when a user's vision is overloaded or inadequate). 2.2 Sonification Tools
Sonification tools consist of hardware (e.g., audio recording, signal processing, and playback equipment) and software (e.g., sound synthesis, editing, and analysis tools). It is now common to find sound hardware on most desktop and personal computers, and these are supported by numerous free or low cost software sound tools. The availability of these off-the-shelf tools, together with customized and domain-specific components, means that most users can take advantage of products resulting from the further development of sonification tools. Nevertheless, despite the considerable progress in sound production hardware and software, and its widespread availability to consumers with multimedia applications, many of the current tools require skills beyond the abilities of many users. It is essential that the tools be consistent, reliable, and portable across various computer platforms. Moreover, they should include the possibility of letting the user switch easily between on-line and off-line sound production. Sonification tools must also be flexible enough to enable the user to benefit from new data from the field of auditory perception (e.g., how users respond to complex sounds and multimodal displays) (Kramer, et al., 1997). Finally, the tools should be readily
compatible with traditional visualization programs (such as spreadsheets, equipment displays).
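As a concrete illustration of what such tools do at their core, the sketch below shows a minimal parameter-mapping sonification: a data series is mapped onto pitch and rendered as a sequence of sine tones written to a WAV file. It uses only the Python standard library; the mapping choices (frequency range, tone duration) are arbitrary assumptions, not recommendations.

```python
import math, struct, wave

def sonify(values, filename="sonified.wav", low_hz=220.0, high_hz=880.0,
           tone_sec=0.25, rate=44100):
    """Map each data value linearly onto a pitch between low_hz and high_hz
    and write the resulting tone sequence to a WAV file."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    frames = bytearray()
    for v in values:
        freq = low_hz + (v - lo) / span * (high_hz - low_hz)
        for i in range(int(rate * tone_sec)):
            sample = int(32767 * 0.5 * math.sin(2 * math.pi * freq * i / rate))
            frames += struct.pack("<h", sample)   # 16-bit mono sample
    with wave.open(filename, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(bytes(frames))

# Example: listen to a rising-then-falling data series.
sonify([1, 3, 7, 12, 9, 5, 2])
```

2.3 Optimal characteristics of an auditory interface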
It is unlikely that a single auditory interface will be appropriate for all users in all applications (James, 1999). However, general usage supported by some (but insufficient) empirical evidence has helped to identify certain parameters that appear to optimize users' abilities to benefit from auditory feedback. Some of the considerations involved in selecting sonification characteristics are illustrated with respect to three key parameters (speech versus non-speech audio, metaphoric versus non-metaphoric sounds, and number of voices). 2.3.1 Speech versus non-speech audio
Speech is an essential part of auditory interfaces for the reason that it is impossible or impractical to sonify some types of information in any other way. While a specific sound effect may well communicate the presence of, for example, a check box, the purpose of that check box cannot be readily communicated by sound effects. Check boxes come with a label that indicates what that check box is for. It is impractical to communicate that label other than via speech. In addition, the spectrum of qualities that objects can have can lead to a large number of sound effects with which to communicate them. Too many sound effects lead to memory demands that cannot be handled by any user. Instead, such qualities would better be presented via speech. As a concrete example, consider text. Text can have a number of attributes such as color, font family, font style, etc. The differences between fonts and colors can be very subtle, and it would require an almost infinite number of sounds to represent them. Stating “ten point” or “times roman” or “bold” would appear to be a better way to communicate this kind of information. Speech, however, does have its limitations, notably its length or duration. Users do not want to spend time listening to a lecture on a check box. Hence, whenever the information can be communicated using a non-verbal sound, that technique is preferred. For example, non-verbal sounds are best suited to indicate the type or role of component (e.g., check box, push button, slider), whether the component is enabled or disabled, its current state (e.g., checked, pushed, slider value), that one has navigated to the component, as opposed to changing its state, and that one is activating or manipulating the component (e.g., checking/unchecking the check box, pushing the button, or sliding the slider). 2.3.2 Metaphoric versus non-metaphoric sounds
The sound effects for auditory user interfaces are typically a combination of metaphoric and non-metaphoric sounds (Ramen, 1997). Metaphoric sounds are based on real world sounds that accompany well-known activities. This is done in the hope that the use of intuitive metaphors will make the component and concepts
associated with them easier to remember and understand. For example, the action of cutting a selection from a document may be associated with the sound of scissors cutting, or the sound effect for opening a menu or tabbed pane may be the sound of a drawer opening. Like each of these components, a drawer may be one of a group of drawers each of which can be opened. Metaphoric sounds are appealing for two reasons. First, because they are designed for commonly encountered tasks, they are heard repeatedly by users as they interact with the interface. This means that the metaphors must be distinctive, easily recognizable, and bring a degree of consistency of experience. Second, many metaphors (e.g., the cutting and drawer opening examples given above) are immediately comprehensible to people from widely different cultures. Non-metaphoric sounds include tones such as beeps, buzzes, melodies, and clicks that are arbitrarily assigned to represent components and their states. They provide the user with feedback that does not have an immediate analogy to the element that they sonify. This most often occurs for components that don’t have an obvious metaphorical equivalent, such as a “list” or a “save action”. The use of non-metaphoric sounds has one clear advantage in that they can provide consistency within the interface by associating the same sound across many components to designate the same feature. For example, a buzzer noise may be used to cue the user when a component is unavailable. It has been noted above that a reason for using speech is that it can be used to quickly communicate information in some cases. A similar principle applies with respect to non-verbal audio. Thus both metaphoric and non-metaphoric sounds should be simple, uncluttered, and short in duration. In summary, sound sequences that are complicated by multiple parts and extended duration should be avoided, and sounds that represent the same task or ones very similar to it should be reused whenever possible. Re-using sounds often involves the transfer of identical sound sequences from one part of the interface to another. However, it can also involve the use of characteristics like melodic structure or timbre to make fine-tuned distinctions between components in the same category. Such techniques should make the interface consistent and efficient, and therefore easy to learn and remember.
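The guidance above amounts to a small design table: each component gets a short non-verbal sound for its type, state and activation, reused consistently, while labels and subtle attributes are spoken. The sketch below is one hypothetical way such a mapping might be organized; the component names and sound identifiers are illustrative and are not drawn from any particular toolkit.

```python
# Hypothetical auditory-interface mapping: non-verbal earcons for component
# type, state and activation; speech for labels and fine-grained attributes.
EARCONS = {
    ("check_box", "navigate"):   "drawer_open.wav",   # metaphoric sound
    ("check_box", "toggle"):     "click_short.wav",   # non-metaphoric, reused
    ("push_button", "navigate"): "drawer_open.wav",   # same sound for the same role
    ("push_button", "press"):    "click_short.wav",
    ("any", "unavailable"):      "buzz_low.wav",      # consistent "disabled" cue
}

def present(component: str, event: str, label: str = "", attributes: str = "") -> list:
    """Return the ordered output for one interaction: an earcon if one is
    defined, then spoken text for the label and any subtle attributes."""
    output = []
    sound = EARCONS.get((component, event)) or EARCONS.get(("any", event))
    if sound:
        output.append(("sound", sound))
    if label:
        output.append(("speech", label))
    if attributes:
        output.append(("speech", attributes))
    return output

print(present("check_box", "navigate", label="Mute notifications", attributes="checked"))
```

2.3.3 Number of Voices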
It has been shown that when more than one voice is used in an interface, users attempt to establish and maintain relationships between the pairs of speakers (dominant-subordinate, etc.) (Reeves & Nass, 1996). As James (1999) has indicated, since the number of pairwise relations between n speakers is equal to n(n-1)/2, the number of different speakers in an interface should be kept small. A usability study in which users responded to sonified HTML documents confirmed this result. This finding is consistent with the visual analogy, where document structures may all be identified by the same type of change, for example, color. Although the judicious use of color can aid the user in identifying the marked text, it will be less useful if too many colors are used; the resulting document merely appears to be colorful and unmarked (James, 1997). Thus, marking techniques as well as other structural
definitions (color for the visual interface and speaker change for an audio interface) are impaired by overuse. 2.4 Auditory Assistive Devices
Synthetic speech is the most common way sound is used in assistive presentation technologies for people with severe visual impairment. There are many hardware and software devices currently available in the market that display information by speaking both text (e.g., the contents of a Word file) and application interfaces (e.g., Windows menus). See, for example, the extensive list compiled by the Adaptive Technology Resource Center at the University of Toronto (http://www.utoronto.ca/atrc/reference/tech/textspeech.html). People who are blind or who have severe visual impairment must use web browsers and web pages that are formatted to be compatible with such screen-reading devices and software in order to listen to textual information that appears on the computer monitor. One of the greatest difficulties experienced by the users of screen readers is their incompatibility with the Graphical User Interface, the most pervasive computing interface used today, and with graphic web pages. Moreover, despite the proliferation of usability guidelines such as the Web Accessibility Initiative (http://www.w3.org/TR/WAI-AUTOOLS/), many web sites continue to be needlessly inaccessible. For example, the extensive use of tables, the use of graphics without "alt text" captions, and the use of frames without a "no frames" option all prevent auditory display of information via a screen reader. 2.5 Beyond sonification: Multimodal displays
2.5 Beyond sonification: Multimodal displays
It is anticipated that future applications of sonification will include multimodal displays in which visually, aurally and haptically displayed information complements or supplements each other (Kramer et al., 1997). The effectiveness of one or more such displays when provided together remains to be investigated (Yu & Brewster, 2003). Interactions between auditory feedback and other sensory displays, such as haptic information, will become more important, especially now that haptic devices are becoming more widely available (Weiss, 1997; Treviranus, 1997, 1999). However, it may be that multimodal displays will not provide the best solution for certain applications. Indeed, the combination of visual and auditory information may be less effective in some circumstances than information presented in one sensory modality (Tzelgov, Srebro, Henik, & Kushelevsky, 1987). Further experimentation identifying the complex perceptual interactions of sonified data used together with other sensory modalities is vital to the success of multimodal applications.
2.6 Multimodal processes
We perceive stimuli in the environment through our sensory channels. Neuroanatomical and neurophysiological studies (e.g., Driver & Spence, 2000) suggest that early processing is modality-specific, and that later on the unimodal data
are integrated into a complete description of the world. Studies in various disciplines have demonstrated that the modalities may interact; this interaction may take place at different levels of processing, and can be manifested in several ways. Conflicting tactual and visual information, for example, may hinder proper cross-modal integration, thus causing perceptual illusions (e.g., Brown, Kerr & Bayon, 1998). Cross-modal interactions are found even at earlier levels of processing. Auditory stimuli may affect visual processing (e.g., Shams, Kamitani, & Shimojo, 2000), and visual information may modulate somatosensory processing (e.g., Taylor-Clark, Kennett, & Haggard, 2002). The different modalities may influence each other even prior to the appearance of stimuli, as manifested in the control of attention. Behavioral studies have shown how shifting attention in one modality causes a shift of attention in another modality (e.g., Spence, Nicholls, & Driver, 2001). Similar findings were also found in patients with sensory extinction, where a visual stimulus may produce tactile neglect (e.g., Maravita, Spence, Clarke, Husain, & Driver, 2000). Longer-lasting effects appear in cross-modal transfer (CMT), where knowledge acquired in one modality improves performance when employing another modality. Rats trained in a maze showed bi-directional CMT between visual and somatosensory modalities (Tran, Adducci, DeBartolo, & Bower, 1994), as well as between auditory and visual modalities (Delay, 1986). This phenomenon is found in humans as well (Krekling, Tellevik, & Nordvik, 1989). While some cross-modal interactions may be described by neurobiological explanations involving neural circuitry (e.g., Driver and Spence, 2000), other results may require an explanation at the level of higher cognitive processes. The redundancy gain achieved by coding a stimulus in more than one modality may be derived from the principles of various approaches; potential candidates are Paivio's (1990) Dual Coding Theory (DCT) or different models of Parallel Distributed Processing (e.g., Rumelhart & McClelland, 1986). In summary, empirical evidence indicates that the sensory modalities interact at several levels. The outcome of these interactions may be short or long term. One such influence may produce improvement in learning and retrieval of representations due to multiple coding, which may lead to better performance in various tasks and actions. Indeed, it has been shown that the addition of substitute sensory haptic feedback improved performance on grasp-and-place tasks, and that the addition of synthesized sound sources to a virtual world increased the sense of presence (a feeling of engagement and involvement in the virtual environment), although not necessarily the sense of realism. Visual displays have been shown to influence the perception of haptic stiffness (Srinivasan, Beauregard, & Brock, 1996), audio cues have been shown to influence the perception of haptic stiffness (DiFranco, Beauregard, & Srinivasan, 1997), and visual and haptic information combine to create better estimates of size and texture.
2.7 Haptic Interfaces
Until recently, the one major component lacking in many VR simulations has been the provision of haptic feedback. In the absence of haptic feedback, users reach out to touch a virtual object, only to place their hands right through the object without feeling it. Haptic information derives from both tactile and kinesthetic receptors. Initially, when some body part makes contact with an object, the sensation of touch is provided by the touch receptors in the skin, which provide information on the geometry of the contact surface, its surface texture, and slippage (Lederman & Klatzky, 1987). When force is applied to the object, kinesthetic receptors provide information about the position and motion of the hand and arm, and the forces acting on these, giving a sense of mass, contact force, and surface compliance. The haptic sense plays an important role in object discrimination and manipulation, and is critical for adequate task performance (Lederman & Klatzky, 1990). A person with sensory loss will have difficulty using the affected hand even if there is no motor loss; such a person usually stops using the affected hand and the phenomenon of "learned non-use" ensues (Taub, Uswatte, & Elbert, 2002). Haptic feedback is provided via an interface that applies forces to the body to display information (Durlach & Mavor, 1994). It has been used successfully in the training of medical diagnostic skills, particularly when used to provide force and touch feedback to the person manipulating an instrumented tool. It is only quite recently that haptic technologies have become capable of delivering credible sensory stimuli at a reasonable cost, using devices of a practical size (Stone, 2001). Example applications include the training of physicians to palpate tumors (Burdea, 2002) and, more recently, the rehabilitation of patients with motor impairment due to stroke (e.g., Adamovich, Merians, Boian, Tremaine, Burdea, Recce, & Poizner, 2003; Goncharenko, Svinin, Matsumoto, Hosoe, & Kanou, 2003).
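To make the idea of "applying forces to display information" concrete, the following is a simplified, hypothetical sketch of penalty-based haptic rendering, one common way a force-feedback device can display contact with a virtual surface. The stiffness and damping values are illustrative assumptions only, not taken from this chapter or from any specific device.

```python
def contact_force(probe_pos: float, surface_pos: float, probe_vel: float,
                  stiffness: float = 500.0, damping: float = 2.0) -> float:
    """Return the force (N) to command along one axis; zero when not in contact.

    When the probe penetrates the virtual surface, a restoring force proportional
    to the penetration depth (a damped virtual spring) is sent to the device.
    """
    penetration = surface_pos - probe_pos   # positive when inside the surface
    if penetration <= 0.0:
        return 0.0
    return stiffness * penetration - damping * probe_vel

# Example: probe 2 mm inside the surface, still moving inward at 0.05 m/s.
print(contact_force(probe_pos=0.098, surface_pos=0.100, probe_vel=-0.05))  # about 1.1 N
```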
3. VIRTUAL REALITY AND NEUROREHABILITATION
The utilization of presentation technologies in novel ways provides intervention tools which can be used in therapy to achieve results superior to those possible via conventional treatment techniques. Several examples of virtual reality in the treatment of patients with neurological impairment will be used to illustrate this application of adapted presentation technology. One of the most exciting developments in recent years in rehabilitation intervention has been the presentation of virtual environments to patients with a variety of neurological impairments. Virtual reality (VR) entails the use of advanced technologies, including computers and various multimedia peripherals, to produce a simulated (i.e., virtual) environment that users perceive as comparable to real-world objects and events (Weiss & Jessel, 1998). Users employ specially designed transducers to interact with displayed images, moving and manipulating virtual objects, and performing other actions in a way that engenders a feeling of actual presence in the simulated environment. This is accomplished by having the simulation adjust to movements of the user's body, so that the sensory cues correspond to what users would expect
were they to perform in the real world. Visual feedback is often supplemented by auditory cues that emanate from appropriate locations and, in rare cases, by appropriately timed haptic, proprioceptive and vestibular information (Rizzo, Buckwalter & Neumann, 1997). One of the cardinal features of virtual reality is the provision of a sense of actual presence in, and control over, the simulated environment (e.g., Witmer & Singer, 1998). This feature is achieved to different extents in the various applications of virtual reality, depending upon the goals of the particular application and the cost and technical complexity its developers are willing and able to assume. In the most technically advanced applications of virtual reality, known as “immersive” VR, the user is essentially isolated from the outside world and fully enveloped within the computer-generated environment. Until recently, presence was measured primarily via questionnaires which were developed for that purpose (Witmer & Singer, 1998). The results of recent studies suggest that physiological measures, such as heart rate and the galvanic skin response and/or movement transitions may provide additional important information about the extent to which a user feels immersed within the virtual environment (Mantovani & Riva, 1999). 3.1 VR Applications in Rehabilitation
Until recently, the application of VR technology in rehabilitation was severely limited by the lack of inexpensive and easy-to-use VR systems. The development of such platforms with more user-friendly software launched a wave of potential applications to medicine in general, and rehabilitation in particular (Riva et al., 1999). VR is being used in the training of surgical procedures (Satava, 1996), the education of patients and medical students, and the treatment of psychological dysfunction including phobias (Rothbaum, Hodges, Kooper, Opdyke, Williford, & North, 1995), post-traumatic stress disorder (e.g., Rothbaum et al., 1999), and eating and body image disorders (Riva and Melis, 1997). It is also used in the rehabilitation of cognitive processes, such as visual perception and executive functions (Rizzo et al., 2000), and for training activities of daily living (e.g., use of public transportation (Brown et al., 1998), meal preparation (Christiansen, Abreu, Ottenbacher, Huffman, Masel, & Culpepper, 1998)). More recently, VR has been used to improve range of motion, strength, endurance, and balance, and to address other impairments (e.g., Adamovich et al., 2003; Burdea, 2002; Sveistrup & McComas, 2002). Our own studies have shown the potential of using a desktop-based system to train stroke patients suffering from unilateral spatial neglect to learn how to cross a street in a safe manner (Weiss et al., 2003a), and have shown its potential as an intervention tool for patients suffering from cognitive and/or motor deficits (Kizony et al., 2002, 2003a, 2003b; Weiss et al., 2003b). A screen shot from this program, shown in Figure 1, illustrates the relatively simple presentation of a functional environment (in this case a typical street crossing in Israel) that, despite its simplicity, provides patients with opportunities to engage in meaningful, purposeful tasks that are related to real-life interests and activities.
Figure 1. Simple desktop presentation of a functional street crossing environment that provides patients with cognitive impairment with opportunities to engage in tasks that are meaningful and purposeful during treatment.
Work in our laboratory over the past two years using a video-capture VR system has resulted in the adaptation of three virtual environments such that it is now possible to completely control the type, speed, location and direction of all stimuli and to record all participant performance (Kizony et al., 2003a). To date, results using video-capture VR to train balance control of patients with paraplegic spinal cord injury (Figure 2a) and to improve upper extremity motor function of patients who have had a stroke (Figure 2b) indicate that the intervention is highly effective (Kizony et al., 2003b). VR presentations have also been used to provide opportunities to young adults with cerebral palsy and mental retardation to engage in recreational activities such as playing soccer (Figure 2c). We are currently developing a new environment, a 3-D virtual shopping center, which promises to provide highly immersive functional applications (Rand, Kizony, Feintuch, Katz, Josman, Rizzo, & Weiss, 2003).
Figure 2. Virtual stimuli presented within customized environments (a) to a patient with spinal cord injury to improve balance control, (b) to a patient with a stroke to improve upper extremity function and (c) to a young adult with intellectual impairment and cerebral palsy to provide recreational opportunities.
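The degree of stimulus control and performance recording described above can be pictured with a small, hypothetical data-structure sketch in Python; the class and field names are illustrative only and do not correspond to any particular system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Stimulus:
    kind: str                  # e.g. "ball" or other virtual object
    speed: float               # screen-space speed
    location: Tuple[float, float]  # (x, y) start position
    direction: float           # heading in degrees

@dataclass
class TrialRecord:
    stimulus: Stimulus
    response_time: float       # seconds from stimulus onset
    success: bool

@dataclass
class Session:
    participant_id: str
    trials: List[TrialRecord] = field(default_factory=list)

    def log(self, stimulus: Stimulus, response_time: float, success: bool) -> None:
        self.trials.append(TrialRecord(stimulus, response_time, success))

    def success_rate(self) -> float:
        return sum(t.success for t in self.trials) / max(len(self.trials), 1)
```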
These initial studies have yielded provocative results suggesting that VR may offer new possibilities within rehabilitation. These include the potential to measure behavior in challenging but safe and ecologically valid environments, while maintaining control over stimulus delivery and measurement (Schultheis & Rizzo, 2002; Weiss & Jessel, 1998). VR also offers the capacity to individualize treatment needs, while providing increased standardization of assessment and training protocols. It is common for patients to practice the use of compensatory strategies within the rehabilitation setting, in the hope that these skills will generalize to their home environment. However, this approach is labor intensive, limited in intensity and duration of repetitions, and has uncertain efficacy outside of the rehabilitation setting. VR can help to address these limitations by allowing the development of training environments consistent with the patient's home setting. Furthermore, virtual environments can offer the capacity to gradually increase the complexity of tasks while decreasing the feedback provided by the therapist.
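The gradual increase of task complexity combined with a gradual reduction of therapist feedback can be automated along the following lines; this is a hypothetical sketch, with thresholds and step sizes chosen purely for illustration.

```python
def adapt_level(level: int, recent_success_rate: float, feedback_amount: float) -> tuple:
    """Return an updated (difficulty level, feedback amount) for the next block of trials."""
    if recent_success_rate > 0.8 and level < 10:
        level += 1                                          # make the task harder
        feedback_amount = max(0.0, feedback_amount - 0.1)   # fade external cues
    elif recent_success_rate < 0.4 and level > 1:
        level -= 1                                          # make the task easier
        feedback_amount = min(1.0, feedback_amount + 0.1)
    return level, feedback_amount

print(adapt_level(level=3, recent_success_rate=0.9, feedback_amount=0.5))  # -> (4, 0.4)
```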
3.2 VR and Haptic Feedback
Sophisticated VR systems employ more than specialized visual displays. The user's engagement in the virtual environment may be enhanced via audio display, either ambient or directed to specific stimuli (Västfjäll, 2003). In recent years, haptic display has been introduced to the field of VR. Haptic feedback enables users to experience the sensation of touch, making the systems more immersive and closer to the real-world experience. Haptic gloves, such as the Rutgers Master II, may provide force feedback while manipulating virtual objects (Jack, Boian, Merians, Tremaine, Burdea, Adamovich, Recce, & Poizner, 2001) or for strength training (Deutsch, Merians, Burdea, Boian, Adamovich, & Poizner, 2002). Haptic information may also be conveyed by simpler means such as a force-feedback joystick (Reinkensmeyer, Iobbi, Kahn, Kamper, & Takahashi, 2003) or a force-feedback steering wheel (Kline-Schoder, 2004). Other, less frequently used ways of making the virtual environment more life-like include letting the user stand on a platform capable of perturbations, thereby providing vestibular stimuli, as is available with Motek's CAREN multisensory system (McFadyen, Malouin, Fung, Comeau, Chapdelaine, Beaudoin, Lamontagne, Laurendeau, & Richards, 2004). Still more rare is the provision of olfactory feedback to add odor to a virtual environment, a feedback channel whose potential is now being investigated (Harel et al., 2003).
4. THE NEED FOR INTELLIGENT PRESENTATION TECHNOLOGIES FOR APPLICATIONS RELATED TO DISABILITY
As indicated in the introduction to this book, the pace of development of applications of information technology is often slower than expected, and it is only within the last two decades that significant strides have been made towards
improving the intelligence of technology systems (Stock, Not and Zancanaro, this volume). Unfortunately, applications of such methodologies that specifically address issues of concern for people with disabilities are still few in number. All of the classic goals of intelligent information technologies, including the flexible presentation of information, the integration of various modes of communication, and the provision of intelligent feedback, are of extreme relevance for people with disabilities. Indeed, given the additional variability of human response due to a given disability, the need to consider a specific user's characteristics, needs and context of action assumes an even greater importance than it does for persons without disabilities. For example, movement kinematics and kinetics encompass a much wider range of performance metrics when one takes into account movement disorders such as hypotonia, hypertonia, tremor, rigidity, contracture, etc. Cognitive performance is equally variable when one considers the large range of cognitive, perceptual and intellectual deficits which result from many pathological conditions. Each of the classic goals of intelligent presentation technologies can be applied to the disability examples reviewed in this chapter. Consider, for example, the use of video-capture virtual reality-based therapy for persons with neurological deficits. Their performance is known to be labile from day to day depending on their physical state, response to medications, time of day, mood and social circumstances. The ability to incorporate techniques such as those now available via neural networks could automate the provision of intelligent feedback, enabling the customized and personalized grading of levels of difficulty uniquely suited to the current user's abilities and therapeutic objectives. The application of intelligence to the automation process would ensure that the therapist and patient would obtain the optimal benefit from the VR system. Moreover, it is conceivable that such an approach could lead to a better understanding of the underlying mechanisms of the pathological kinematics, since the accumulated data could serve as input to model these processes. For example, Baloian, Luther & Sanchez (2002) have applied modeling techniques in an attempt to mimic real-world applications for people with reduced visual or auditory cues. They used modeling to represent two virtual worlds, AudioDoom (which allows children who are blind to explore and interact with virtual worlds created with spatial sound) and Whisper (an environment to help people with impaired auditory abilities to recognize speech errors). The model aims to learn to represent the real world as proposed by the system and also to model the user's knowledge of the virtual world. In another application, intelligent word prediction has been used to enhance text input rate for people with severe motor and speech disabilities (Garay-Vitoria & Gonzalez-Abascal, 1997). They found that the addition of intelligent methods based on syntactic analysis resulted in greater keystroke savings, a therapeutic goal of great importance for such users. In conclusion, it is essential that researchers who have developed paradigms for adding intelligence to presentation technologies of the type presented in this book join together with those who work with people who have disabilities, such that the latter may continue to reap benefit from new technologies.
Flexible presentation of information and integration of modes of communication would enable existing techniques such as the sonification and haptic modes described above to become
richer by the addition of intelligent adaptive processes that would be finely tuned to each user's capabilities.
5. ACKNOWLEDGEMENTS
This paper was developed as part of the collaboration agreement between the Institute for Scientific and Technological Research, ITC (Trento, Italy), and the Caesarea Rothschild Institute for Interdisciplinary Applications of Computer Science (CRI), University of Haifa (Haifa, Israel).
6. REFERENCES
Adamovich, S.V., Merians, A., Boian, R., Tremaine, M., Burdea, G., Recce, M., & Poizner, H. (2003). A virtual reality based exercise system for hand rehabilitation post-stroke. Proceedings of the Second International Workshop on Virtual Reality, Piscataway, NJ.
Baloian, N., Luther, W., & Sanchez, J. (2002). Modeling Educational Software for People with Disabilities: Theory and Practice. Proceedings of Assets 2002, Edinburgh, Scotland.
Brewster, S.A., Wright, P.C., & Edwards, A.D.N. (1993). An evaluation of earcons for use in auditory human-computer interfaces. INTERCHI '93 Conference Proceedings, 222-227, Amsterdam.
Brown, D. J., Kerr, S. J., & Bayon, V. (1998). The development of the virtual city: A user centered approach. Proceedings of the Second European Conference on Disability, Virtual Reality and Associated Techniques, Skovde, Sweden.
Burdea, G. (2002). Virtual reality – benefits and challenges. Proceedings of the First International Workshop on Virtual Reality Rehabilitation, Lausanne, Switzerland.
Christiansen, C., Abreu, B., Ottenbacher, K., Huffman, K., Masel, B., & Culpepper, R. (1998). Task performance in virtual environments used for cognitive rehabilitation after traumatic brain injury. Archives of Physical Medicine and Rehabilitation, 79, 888-892.
Delay, E. R. (1986). Effects of cross-modal transfer on direct and reversal learning in the rat. Animal Learning and Behavior, 14, 427-434.
Deutsch, J.E., Merians, A.S., Burdea, G.C., Boian, R., Adamovich, S.V., & Poizner, H. (2002). Haptics and Virtual Reality Used to Increase Strength and Improve Function in Chronic Individuals Poststroke: Two Case Reports. Neurology Report, 26, 79-86.
DiFranco, D.E., Beauregard, G.L., & Srinivasan, M.A. (1997). The effect of auditory cues on the haptic perception of stiffness in virtual environments. Proceedings of the ASME Dynamic Systems and Control Division, Dallas, Texas.
Driver, J., & Spence, C. (2000). Multisensory perception: Beyond modularity and convergence. Current Biology, 10, 731-735.
Durlach, N.I., & Mavor, A.S. (Eds.). (1994). Virtual Reality: Scientific and Technological Challenges. Washington, D.C.: National Academy Press.
Garay-Vitoria, N., & Gonzalez-Abascal, J. (1997). Intelligent word-prediction to enhance text input rate. Proceedings of the 1997 International Conference on Intelligent User Interfaces, January 6-9, 1997, Orlando, Florida.
Gaver, W.W. (1986). Auditory icons: Using sound in computer interfaces. Human-Computer Interaction, 2, 167-177.
Goncharenko, I., Svinin, M., Matsumoto, S., Hosoe, S., & Kanou, Y. (2003). Design and Implementation of Rehabilitation Haptic Simulators with History Unit. Proceedings of the Second International Workshop on Virtual Reality, Piscataway, NJ.
Hartmann, W.M. (1997). Signals, sound, and sensation. Woodbury, NY: American Institute of Physics.
Horbar, J.D., Badger, G.J., Carpenter, J.H., Fanaroff, A.A., Kilpatrick, S., LaCorte, M., Phibbs, R., & Soll, R.F. (2002). Trends in mortality and morbidity for very low birth weight infants, 1991–1999. Pediatrics, 110, 143–151.
Jack, D., Boian, R., Merians, A., Tremaine, M., Burdea, G.C., Adamovich, S.V., Recce, M., & Poizner, H. (2001). Virtual reality-enhanced stroke rehabilitation. IEEE Transactions on Neural Systems and Rehabilitation Engineering, 9, 308-318.
James, F. (1999). Lessons from developing audio HTML interfaces. Retrieved June 1999 from http://www-pcd.standford.edu/frankie/assets/lessons.html
James, F. (1997). Presenting HTML structure in audio: User satisfaction with audio hypertext. CSLI Technical Report 97-201, CSLI, Stanford University.
Kaye, H. S. (1997). Disability watch: The status of people with disabilities in the United States. San Francisco: Disability Rights Advocates.
Kizony, R., Katz, N., & Weiss, P.L. (2002). Immersion without encumbrance: adapting a virtual reality system for the rehabilitation of individuals with stroke and spinal cord injury. Proceedings of the Fourth International Conference on Disability, Virtual Reality and Associated Technologies, Veszprém, Hungary.
Kizony, R., Katz, N., & Weiss, P.L. (2003a). Adapting an immersive virtual reality system for rehabilitation. Journal of Visualization, 14, 1-7.
Kizony, R., Raz, L., Katz, N., Weingarden, H., & Weiss, P.L. (2003b). Using a video projected VR system for patients with spinal cord injury. Proceedings of the Second International Workshop on Virtual Reality, Piscataway, NJ.
Kline-Schoder, R. (2004). Virtual Reality-Enhanced Physical Therapy System. Proceedings of CyberTherapy 2004, San Diego.
Kolominsky-Rabas, P.L., Weber, M., Gefeller, O., Neundoerfer, B., & Heuschmann, P.U. (2001). Epidemiology of ischemic stroke subtypes according to TOAST criteria: incidence, recurrence, and long-term survival in ischemic stroke subtypes: a population-based study. Stroke, 32, 2735-2740.
Kramer, G., Walker, B., Bonebright, T., Cook, P., Flowers, J., Miner, N., & Neuhoff, J. (1997). Sonification Report: Status of the Field and Research Agenda. Prepared for the National Science Foundation by members of the International Community for Auditory Display. Retrieved May 14, 2004, from http://www.icad.org/websiteV2.0/References/nsf.html
Krekling, S., Tellevik, J. M., & Nordvik, H. (1989). Tactual learning and cross-modal transfer of an oddity problem in young children. Journal of Experimental Child Psychology, 47, 88-96.
Lederman, S. J., & Klatzky, R. L. (1987). Hand movements: a window into haptic object recognition. Cognitive Psychology, 19, 342-368.
Lederman, S. J., & Klatzky, R. L. (1990). Haptic object classification: knowledge driven exploration. Cognitive Psychology, 22, 421-459.
Lunney, D., & Morrison, R. C. (1981). High-Technology Laboratory Aids for Visually Handicapped Chemistry Students. J. Chem. Ed., 58, 228-231.
McFadyen, B.J., Malouin, F., Fung, J., Comeau, F., Chapdelaine, S., Beaudoin, C., Lamontagne, A., Laurendeau, D., & Richards, C.L. (2004). Development of complex virtual environments for locomotor training following stroke. Proceedings of the Fifteenth Congress of the International Society of Electrophysiology and Kinesiology, Boston.
Mantovani, G., & Riva, G. (1999). "Real" presence: How different ontologies generate different criteria for presence, telepresence and virtual presence. Presence: Teleoperators and Virtual Environments, 8, 538-548.
Maravita, A., Spence, C., Clarke, K., Husain, M., & Driver, J. (2000). Vision and touch through the looking glass in a case of crossmodal extinction. Neuroreport, 11, 3521-3526.
Paivio, A. (1990). Mental Representations: A Dual Coding Approach. New York: Oxford University Press.
Raman, T.V. (1993). Audio System for Technical Readings. Ph.D. thesis, Cornell University.
Raman, T.V. (1997). Auditory User Interfaces: Toward the Speaking Computer. Kluwer Academic Publishers.
Rand, D., Kizony, R., Feintuch, U., Katz, N., Josman, N., Rizzo, A.A., & Weiss, P.L. (2003). Video projection versus HMD VR system for rehabilitation: Within and between comparisons. Proceedings of the Second International Workshop on Virtual Reality, Piscataway, NJ.
Reeves, B., & Nass, C. (1996). The Media Equation: How People Treat Computers, Television and New Media Like Real People and Places. New York: Cambridge University Press.
Reinkensmeyer, D.J., Iobbi, M.G., Kahn, L.E., Kamper, D.G., & Takahashi, C.D. (2003). Modeling reaching impairment after stroke using a population vector model of movement control that incorporates neural firing-rate variability. Neural Computation, 15, 2619-2642.
Riva, G., & Melis, L. (1997). Virtual reality for the treatment of body image disturbance. In Riva, G. (Ed.), Virtual reality in neuro-psycho-physiology (pp. 71-79). Amsterdam: IOS Press.
Rizzo, A. A., Buckwalter, J. G., & Neumann, U. (1997). Virtual reality and cognitive rehabilitation: A brief review of the future. Journal of Head Trauma Rehabilitation, 12, 1-15.
Rizzo, A. A., Buckwalter, J. G., Humphrey, L., van der Zaag, C., Bowerly, T., Chua, C., Neumann, U., Kyriakakis, C., van Rooyen, A., & Sisemore, D. (2000). The Virtual Classroom: A Virtual Environment for the Assessment and Rehabilitation of Attention Deficits. CyberPsychology & Behavior, 3, 483-500.
Rothbaum, B. O., Hodges, L. F., Alarcon, R., Ready, D., Shahar, F., Graap, K., Pair, J., Hebert, P., Gotz, D., Wills, B., & Baltzell, D. (1999). Virtual reality exposure therapy for PTSD Vietnam veterans: A case study. Journal of Traumatic Stress, 12, 263-272.
Rothbaum, B. O., Hodges, L. F., Kooper, R., Opdyke, D., Williford, J. S., & North, M. (1995). Effectiveness of computer-generated (virtual reality) graded exposure in the treatment of acrophobia. American Journal of Psychiatry, 152, 626-628.
Rumelhart, D. E., & McClelland, J. L. (1986). Explorations in parallel distributed processing: A handbook of models, programs, and exercises. Computational models of cognition and perception. Cambridge, MA: The MIT Press.
Samet, J.M., Zeger, S., Dominici, F., Dockery, D., & Schwartz, J. (2000). The National Morbidity, Mortality, and Air Pollution Study (NMMAPS). Part I. Methods and methodological issues. Cambridge, MA: Health Effects Institute.
Satava, R.M. (1996). Medical Virtual Reality: The current status of the future. In S.J. Weghorst, H.B. Sieburg, & K.S. Morgan (Eds.), Proceedings of the Medicine Meets Virtual Reality 4 Conference (pp. 100-106). Berlin: IOS Press.
Schultheis, M. T., & Rizzo, A. A. (2002). The application of virtual reality technology in rehabilitation. Rehabilitation Psychology, 46, 296-311.
Shams, L., Kamitani, Y., & Shimojo, S. (2000). What you see is what you hear. Nature, 408, 788.
Sorid, D., & Moore, S.K. (2000). The Virtual Surgeon. IEEE Spectrum, 37, 26-31.
Spence, C., Nicholls, M. R., & Driver, J. (2001). The cost of expecting events in the wrong sensory modality. Perception and Psychophysics, 63, 330-336.
Srinivasan, M.A., Beauregard, G.L., & Brock, D.L. (1996). The impact of visual information on haptic perception of stiffness in virtual environments. ASME Winter Annual Meeting.
Stone, R. (2001). Virtual reality for interactive training: An industrial practitioner's viewpoint. International Journal of Human Computer Studies, 55, 699-711.
Sveistrup, H., & McComas, J. (2002). Virtual Reality Applications in Rehabilitation: Selected Trials in Canada. Electronic Proceedings of the Symposium on Applications of Virtual Reality to Rehabilitation, Haifa, Israel.
Taub, E., Uswatte, G., & Elbert, T. (2002). New treatments in neurorehabilitation founded on basic research. Nature Reviews Neuroscience, 3, 228-236.
Taylor-Clark, M., Kennett, S., & Haggard, P. (2002). Vision modulates somatosensory cortical processing. Current Biology, 12, 233-236.
Thylefors, B., Negrel, A.D., Pararajasegaram, R., & Dadzie, K.Y. (1995). Global data on blindness. Bulletin of the World Health Organization, 73, 115-121.
Tran, T. D., Adducci, L. C., DeBartolo, L. D., & Bower, B. A. (1994). Transfer of visual and haptic maze learning in rats. Animal Learning and Behavior, 22, 421-426.
Treviranus, J. (1999). Adding Feeling, Sound and Equal Access to Distance Education. Proceedings of the 14th Annual International CSUN Conference. Retrieved January 27, 1999, from http://www.dinf.org/csun_99/csun99.htm
Treviranus, J. (1997). Nimble document navigation using alternative access tools. Proceedings of the Sixth International WWW Conference, Santa Clara. Retrieved January 27, 1999, from www6conf.slac.standford.edu
Tzelgov, J., Srebro, R., Henik, A., & Kushelevsky, A. (1987). Radiation detection by ear and by eye. Human Factors, 29, 87-98.
Vanderheiden, G. C. (1997). Design for people with functional limitations resulting from disability, aging, and circumstance. In G. Salvendy (Ed.), Handbook of human factors and ergonomics (2nd ed., pp. 2010-2052). New York, NY: John Wiley & Sons.
Västfjäll, D. (2003). The subjective sense of presence, emotion recognition, and experienced emotions in auditory virtual environments. CyberPsychology & Behavior, 6, 181-188.
Weiss, E. (1997). The Addition of the Haptic Modality to the Virtual Reality Modeling Language. Master's thesis, MIT.
Weiss, P.L., & Jessel, A.S. (1998). Virtual reality applications to work. Work, 11, 277-293.
Weiss, P.L., Naveh, Y., & Katz, N. (2003a). Design and testing of a virtual environment to train CVA patients with unilateral spatial neglect to cross a street safely. Occupational Therapy International, 10, 39-55.
Weiss, P.L., Bialik, P., & Kizony, R. (2003b). Virtual Reality Provides Leisure Time Opportunities for Young Adults with Physical and Intellectual Disabilities. CyberPsychology & Behavior, 6, 335-342.
Witmer, B.G., & Singer, M.J. (1998). Measuring presence in virtual environments: A presence questionnaire. Presence, 7, 225-240.
Yu, W., & Brewster, S.A. (2003). Evaluation of multimodal graphs for blind people. Journal of Universal Access in the Information Society, 2, 105-124.
7. AFFILIATION
Prof. Patrice L. (Tamar) Weiss University of Haifa Haifa, Israel
FUTURE DIRECTIONS
H. BUNT, M. KIPP, M. MAYBURY AND W. WAHLSTER
FUSION AND COORDINATION FOR MULTIMODAL INTERACTIVE INFORMATION PRESENTATION Roadmap, Architecture, Tools, Semantics
1. ROADMAP
Figure 1 illustrates the roadmap in the near term, from 2003 to 2006, for the creation of mobile, human-centered, intelligent multimodal interfaces.
[Figure 1. Near Term Roadmap. Three lanes run from 2003 to 2006 — empirical and data-driven models of multimodality, advanced methods for multimodal communication, and toolkits for multimodal systems — leading to the 2006 goals of adequate corpora for MM research, computational models of multimodality, and a multimodal interface toolkit. Interim milestones include situated and task-specific MM corpora, standards for the annotation of MM training corpora, models of MM mutual disambiguation, multimodal barge-in, multiparty MM interaction, mobile multimodal interaction tools, a multimodal toolkit for universal access, a plug-and-play infrastructure, and a common representation of multimodal content.]
Three "lanes" in the road identify three areas of research and development: empirical and data-driven models of multimodality, advanced methods for multimodal communication, and toolkits for multimodal systems. The end of the roadmap indicates the outcome in 2006, specifically multimodal corpora, computational models, and interface toolkits. Of course there are a variety of interim outcomes along the way. For multimodal corpora this includes annotated corpora of human and natural phenomena (e.g., surveillance, meeting, or broadcast news video) as well as human-machine interactions. Corpora can be used by systems for training or for testing/evaluation purposes. In the methods lane, this includes developments such as multimodal mutual disambiguation, multiparty interaction, and multimodal barge-in. With respect to toolkits, developments include markup standards for multimodal phenomena (e.g., for combinations of speech, gesture, and facial expressions), reusable components for multimodal analysis and generation, and tools for universal and mobile multimodal access.
[Figure 2. Long Term Roadmap. The same three lanes — empirical and data-driven models of multimodality, advanced methods for multimodal communication, and toolkits for multimodal systems — run from 2007 to 2010, ending in ecological multimodal interfaces and tailored and adaptive MM interaction.]
Figure 2 illustrates the roadmap in the far term, from 2007-2010. Using the same three dimensions as in the short term roadmap, we envision a variety of future outcomes. From 2007 to 2010, in the area of models of multimodality, we envision biologically-inspired intersensory coordination models, test suites and benchmarks, and eventually computational models of the acquisition of multimodal communication skills, among other advancements. Advanced methods will include affective, collaborative, and multicultural multimodal communication. Toolkits will
advance from real-time localization and motion/eye tracking, to the incorporation of multimodality into virtual and augmented reality environments, and resource-bounded multimodality.
[Figure 3. Enabling Technologies. Multimodal Input: sensor technologies, vision, speech and audio technology, biometrics, machine learning, pattern recognition. Multimodal Interaction: user modelling, cognitive science, discourse theory, ergonomics, formal ontologies, planning. Multimodal Output: smart graphics, design theory, embodied conversational agents, speech synthesis.]
2. ABSTRACT ARCHITECTURE
In addition to establishing a roadmap for future research, we focused on articulating a common reference architecture to consolidate current understanding, facilitate the description of systems, and help formulate future systems research. Intelligent multimodal systems must satisfy a number of essential functional and technical requirements. Functionally, they need to:
- support modality integration (both fusion of input and design of coordinated output),
- provide situation-appropriate (user, task, application) real-time sensing and response (e.g., supporting barge-in and perceptual sensing/feedback),
- represent modules and data structures at varying levels of granularity,
- manage feedback, both locally and globally,
- support incremental processing,
- support incremental development, and
- be scalable.
In addition to these functional requirements, there are a number of important system/technical requirements these systems should exhibit, including:
- technical means for processing and fusing multimodal input (e.g., parallel processing),
- modular, composable elements and algorithms (possibly distributed processing),
- efficient algorithms and efficient implementations of those,
- support for varying time scales and for varying temporal and spatial resolutions,
- shared data structures (even after partial processing), and
- open and extensible protocols for interprocess and intermodule communication.
The group analyzed approximately a dozen architectures of intelligent multimodal systems to modify the base architecture articulated in Maybury and Wahlster (1998), creating the extended and refined architecture shown in Figure 4.
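As an illustration of the shared data structures and open, extensible inter-module protocols called for above, the following hypothetical Python sketch shows a small message envelope that analysis modules might exchange; the field names are assumptions for illustration, not a standard.

```python
from dataclasses import dataclass, field
from typing import Any, Dict

@dataclass
class ModuleMessage:
    producer: str               # e.g. "speech_analyser", "gesture_recogniser"
    modality: str               # e.g. "speech", "gesture", "graphics"
    t_start: float              # seconds on a shared clock
    t_end: float
    confidence: float           # 0.0 - 1.0
    payload: Dict[str, Any]     # partial or final analysis results
    meta: Dict[str, Any] = field(default_factory=dict)   # open for extension

# Example: a pointing-gesture hypothesis that a fusion module could combine
# with a concurrent spoken utterance.
pointing = ModuleMessage(producer="gesture_recogniser", modality="gesture",
                         t_start=12.40, t_end=12.85, confidence=0.7,
                         payload={"type": "pointing", "target": "map_location_3"})
print(pointing)
```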
[Figure 4. Architecture. The diagram shows user(s) interacting via media input processing (audio, video, text, graphics, biometrics) and media output rendering with an animated presentation agent; media/mode analysis and synthesis for language, graphics, gesture, sound and vision; mode coordination, multimodal fusion and multimodal reference resolution on the input side; multimodal design (select content, design, allocate, coordinate, layout) on the output side; interaction management comprising discourse management, reference resolution, context management, multimodal lexicon management, intention recognition, action planning and user modeling; an application interface (initiate, terminate, request, respond, integrate) to information, applications and people; and, underlying all modules, user/agent, discourse, context, domain, task, media and application models, with representation and inference over their states and histories. A = audio, V = video, T = text, G = graphics.]
The architecture utilizes the definitions of media as a material-centered notion including interactive devices (e.g., keyboard, mouse, microphone) and artifacts (audio, video, text, graphics), mode as human-centered perceptual processes (e.g., visual, auditory, tactile), and code as the formal languages that specify the elements, syntax, semantics, pragmatics and so on that govern the use of media and modes. As can be seen in the figure, this abstract architecture includes functionality for media input processing and media output rendering as well as deeper media/mode analysis and synthesis, which would draw upon at least underlying models of media and modes (language, graphics, and gesture). Following analysis, multimodal input would need to be fused and then interpreted within the current state of the discourse, context (time, space, task,
domain and so on) and user model, including such functions as cross-modal mutual disambiguation. Once the intention of the user (in an interactive setting) has been recognized, the system might interact with the backend application (possibly initiating or terminating sessions, requesting and integrating information, or responding to application requests). Finally the system might plan a response to the user, which in turn might require the design of a multimodal presentation (including content selection, media design, allocation, coordination, and layout), which would then need to be synthesized and rendered on specific media for the user. Underlying all the modules in this architecture are mechanisms for representation and inference on a broad range of models. These include models of the user (identity, capabilities, beliefs, and intentions) and other agents (e.g., system, software agents, intermediaries), a model of the discourse (to help track attention and information about interlocutor turns and also detect and correct errors), context (e.g., physical/spatial and temporal state), domain, task, applications and, of course, the media and modalities (their properties and any associated codes). The underlying infrastructure needs to maintain the states and histories of these models, which might be shared with many of the processes shown in the abstract architecture.
3. SEMANTICS
Figure 4 and many of its arrows indicating inter-function communication rely upon an enabling syntax, semantics and pragmatics. A multimodal meaning representation plays a central role in such a system, supporting both interpretation and generation processes. In particular, a multimodal meaning representation should support the fusion and coordination of multiple input or output modalities at a semantic level, representing the combined and integrated semantic contributions from the different modalities. The interpretation of a multimodal input, such as a spoken utterance combined with a gesture and a certain facial expression, will often have stages of modality-specific processing, resulting in representations of the semantic content of the interactive behaviour in each of the modalities involved. Other stages of interpretation combine and integrate these representations, and take contextual information into account, such as information from the domain model, the discourse model or the user model. A multimodal meaning representation language should support each of these stages of interpretation, as well as the various stages of multimodal output generation, and in that sense support incremental construction and processing of semantic content. To make incremental processing feasible, where possible the representations of various types of input and output should be uniform, in the sense of using the same kinds of building blocks and the same ways in which complex structures can be composed of these building blocks. Moreover, to support the representation of partial and intermediate results of semantic interpretation, the framework should allow meaning representations which are underspecified in various ways and capture unresolved ambiguities. When we are considering inputs and outputs from a semantic point of view, the representation of lower-level modality-specific aspects of interactive behaviour, like syntactic linguistic information or morphological properties of gestures, is not a
primary aim, but some such information may percolate as features associated with a meaning representation, especially at intermediate stages of interpretation, where their relevance for semantic interpretation may not have been fully exploited. At the other end of interpretation, where understanding is rooted in domain models and ontologies, a multimodal meaning representation language should support the connection with frameworks for defining ontologies and specifying domain models, such as OIL and DAML. The main objective of defining multimodal meaning representations is to provide a fundamental interface format to represent a system’s understanding of multimodal user inputs, and to represent meanings that the system will express as multimodal outputs to the user. This means that the first and foremost basic requirement of a semantic representation framework is that (a) it should be expressive enough to correctly represent the meanings of multimodal messages, and (b) that the representation structures themselves have a formal semantics, i.e., their definition should provide a rigorous basis for reasoning (whether deductive, statistical, in the form of plan operators, or otherwise). In order to delineate the task of formulating objectives, constraints and components of multimodal meaning representation, the group adopted a working definition of meaning in multimodal interaction as the specification of how the interpretation of a multimodal input by an understanding system should change the system’s information state (in a broad sense of the term, including domain model, discourse model, user model, task model; cf. Figure 4). While formulated with reference to input interpretation only, this definition can also be related to the generation of multimodal outputs by assuming that an output is generated by the system in order to have an effect on the user through the interpretation of that output by the user. (The generation of appropriate outputs thus depends on the system having an adequate model of what its outputs may mean to the user – which is exactly as it should be.) An additional objective in defining a well-defined representational framework for multimodal dialogue acts is to allow the specification and comparison of existing application-specific representations (e.g. the M3L representation used in the SmartKom project) and the definition of new ones, while ensuring a level of interoperability between these. Finally, the specification of a multimodal meaning representation should also form a basis for the definition of annotation schemes of multimodal semantic content. Since the design of multimodal human-computer systems is an area in which new research results and new technologies may bring new challenges and new approaches for the representation of multimodal meanings, the representational framework should be open to contributions from different theories and approaches, and should be extensible, inviting the use of alternative methods for designing representation schemas, like XML. As a first step in the direction of defining a generic multimodal semantic representation form, we have to establish some basic concepts and corresponding terminology. First, the action-based concept of meaning mentioned above, applicable to multimodal inputs in an interactive situation, means that the meaning of a multimodal ‘utterance’ has two components: one that is often called ‘propositional’ or ’referential’ and that is concerned with the entities that the
utterance refers to and with their properties and relations, and a 'functional' component that expresses a speaker's intention in producing the utterance: what effects does he want to achieve with this utterance (using 'speaker' in a broad, multimodal sense here)? This distinction is familiar from speech act theory, where the two components are called 'propositional content' and 'illocutionary force', and is also prevalent in other theories of language-based communication; it is sometimes viewed as drawing a line between semantics and pragmatics. In the analysis of multimodal interaction it is particularly important to pay attention to these two aspects of meaning, since different modalities often contribute to each aspect in different ways; for instance, in spoken interaction the referential and propositional aspects of meaning are often expressed verbally, while gestures and facial expression contribute primarily to functional aspects. The term 'multimodal content' should not be confused with 'propositional content', and should not make us forget that multimodal messages have meanings with functional aspects that are just as important as their propositional and referential aspects. In this document we use 'multimodal content' as synonymous with 'multimodal meaning', including functional aspects, and we use 'semantic representation' as synonymous with 'representation of meaning'. A convenient term that has become popular in the literature on (multimodal) human-computer dialogue is dialogue act. Though mostly used in an informal, intuitive way, or as a variant of 'speech act', the term also has a formal definition in terms of the effects that a 'speaker' intends to achieve through its understanding by the addressee (see Bunt, 2000), which makes it suitably precise for use in the analysis of the meaning of multimodal inputs and outputs. Without going further into definitions here, we will use the term 'dialogue act' in the rest of this document. Definitions of other useful concepts can be found in Romary (2002). As a second methodological step, we propose to distinguish the following three basic types of ingredients that would seem to go into any multimodal meaning representation framework.
1. Basic components: the basic constructs for building representations of the meanings of multimodal dialogue acts: types of building blocks and ways to connect them.
2. General mechanisms: representation techniques like substructure labeling and structure sharing, that make the representations more powerful, more flexible and more compact.
3. Contextual data categories and values: types of administrative (meta-)data that do not, strictly speaking, contribute to the meanings of semantic representations, but that may nonetheless be relevant for their processing.
Initially, at least the following basic components will be needed for representing the general organization of any semantic structure:
1. Temporal structures ('events'), to represent communicative events, like spoken utterances (input or output dialogue acts) and gestures, and semantic events, such as states and events representing meanings of verbs.
2. Referential structures ('participants'), to represent for instance the speaker of an input utterance, the addressee of a system output dialogue act, or the individuals and objects participating in a semantic event.
3. Restrictions on temporal and referential structures, to represent for instance the type(s) of dialogue act associated with an utterance, a gesture type assigned to a gesture token, or the denotations of linguistic modifiers.
4. Dependency structures, representing semantic relations between temporal and/or referential structures, such as participant roles (like SPEAKER, ADDRESSEE, AGENT, THEME, SOURCE, …), discourse/rhetorical relations and temporal relations.
To this, other components will have to be added, for instance to represent quantified entities, logically complex restrictions, and propositional attitudes. General mechanisms like substructure labeling and substructure sharing are important to make meaning representations suitable for partial and underspecified meanings, to give representations a more manageable form, and to relate them to external sources of information. For example, allowing labels instead of the substructures that they label in argument positions opens the possibility of argument underspecification by means of label variables. Structure sharing makes it possible to represent that a certain part of the representation plays more than one role, e.g., a participant may be both the speaker of an utterance and the performer of a gesture, as well as the agent in a verbally expressed semantic event.
Figure 5. Example of Multimodal Semantic Representation
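The figure itself is not reproduced here. As a stand-in, the following hypothetical Python sketch (using the standard xml.etree.ElementTree module) constructs a simplified representation in the spirit of the example discussed below in this section, for the utterance "I want to go from here to there" accompanied by pointing gestures. Only the element and attribute names explicitly mentioned in the text (semRep, event, participant, source, target, cert) are taken from the chapter; all other names and values are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

rep = ET.Element("semRep", id="m0")

# Communicative events: the spoken utterance and one of the pointing gestures.
ET.SubElement(rep, "event", id="e0", type="utterance")
ET.SubElement(rep, "event", id="e2", type="pointingGesture")
# Semantic event expressed by the utterance (the wanted 'going' event).
ET.SubElement(rep, "event", id="e1", type="go")

# Participants: the speaker and the two locations pointed at.
ET.SubElement(rep, "participant", id="p1", role="speaker", name="Peter")
ET.SubElement(rep, "participant", id="x1", type="location")
ET.SubElement(rep, "participant", id="x2", type="location")

# Dependency structures relating events and participants, with confidence values.
ET.SubElement(rep, "relation", type="agent", source="e1", target="p1")
ET.SubElement(rep, "relation", type="source", source="e1", target="x1", cert="0.8")
ET.SubElement(rep, "relation", type="goal", source="e1", target="x2", cert="0.8")

ET.indent(rep)  # pretty-print (Python 3.9+)
print(ET.tostring(rep, encoding="unicode"))
```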
Finally, meaning representations will need to be annotated with general contextual information, both globally and also at the level of subexpressions, to capture information which is not found inside the elements of interactive behaviour, but which is potentially relevant for their interpretation and generation, such as
environment data (e.g., time stamps and spatial information), processing information (e.g., which module has produced this representation; what is its level of confidence), and interactional information (who is the speaker; what other addressees are there, etc.). We illustrate the possible combination of basic components, general mechanisms, and contextual data into a multimodal meaning representation, and exemplify the general methodology suggested here, by taking up a sample semantic representation derived from an initial example expressed in the ULF+ format (ULF+ is a slightly updated version of a representation language that was developed successively in the PLUS dialogue project, see Geurts and Rentier, 1993, and the multimodal DENK project; see Bunt et al., 1998; Kievit, 1998). Figure 5 shows a fragment of a (simplified) possible XML-style multimodal meaning representation of the utterance "I want to go from here to there", uttered by a speaker named Peter who points at locations on a map while speaking. The top-level element <semRep> corresponds to the multimodal event, consisting of the event of the spoken utterance "e0" and gesture events like "e2". The <event> construct is used both to represent these events and to represent the linguistic content ("e1"); the <participant> element is used to represent the various entities involved in the events. Events and participants are related by means of relation elements, whose source and target attributes point to the corresponding arguments of the relation. Note the use of a dedicated construct to represent an unresolved ambiguity in the interpretation of the utterance as a certain kind of dialogue act, and of the cert attribute for representing the corresponding confidence.
4. TOOLS
On the roadmap in Section 1 we see a whole lane dedicated to "Empirical and Data-driven Models of Multimodality" with a big sign ahead saying: "Adequate Corpora for MM Research". Such "adequate corpora" are the fuel for the systems that want to rush down this lane; they are fundamental for the design of representations (Section 3), they have an impact on architectures (user interface, multimodal fusion/integration, see Section 2), and they serve as training/test material in data-driven systems. We explored needs and existing resources for multimodal corpora and annotation tools. We started out by clarifying what we understood to be a multimodal corpus: what kind of data is most fundamental now and what additional needs will come up in the future. Current resources for multimodal research were assessed, including repositories for corpora and tools, and institutes and initiatives for data collection. For researchers about to start a data collection effort, we asked how they could locate technical guidelines, tools and coding schemes. To get an immediate picture of demands and needs we conducted a questionnaire study at a Dagstuhl Seminar, collecting replies from 28 participating researchers coming from a variety of different backgrounds. The analysis revealed a significant lack of reuse of resources (corpora and tools) due to non-standardization, at the same time establishing a strong interest in a global coordination of resources. This led us to ask what organisational infrastructure
(European and American funding programmes, networks, conferences, workshops, journals, and institutes) there is to support worldwide cooperation. We identified shortcomings and outlined issues for coming conferences and workshops; these were addressed at a workshop at the LREC 2002 conference.
What is a Multimodal Corpus?
A multimodal corpus contains primary data (audio/video/text files) and encodings on different layers, descriptive and interpretative ones, for each modality. For sound, a corpus should have a number of standard encodings for human language (transcription, part-of-speech, syntax, co-reference, rhetorical relations, dialogue acts etc.), possibly conforming to a standard like MPEG-7. Besides language, music and environmental noise must be dealt with in descriptive and interpretative terms. For vision, research is currently investigating the description of nonverbal communication through the human body, usually focussing on the face (see FACS) or hands/arms (gestures). Posture has recently been taken up as well. For the future we will need description/meaning encodings of general visual environments, be they static (museum) or moving (car driving scenario, flight simulator). One challenge here is to remodel annotation tools to cope with spatial and spatiotemporal encodings, since virtually all current tools base their encodings on time. New to research is the haptic modality: pressure on hands, feet or back (force feedback), and even texture can be conveyed (cf. PHANTOM). Encoding aside, we need to ask: what does the primary data look like? Ideally, we would have haptic data as another track in a video file, just as sound is integrated. The same applies to biometric data: heart rate, eye dilation, skin sensitivity, breathing cycles etc. – a modality where communication works solely in the direction from human to computer. For the virtual reality community, we will have to look at smell/taste, sense of balance and thermal perception in terms of primary data and encoding formats. More general concepts like mirror behavior, synchronized and repeated behavior, distance or touch, which encompass more than one subject and probably cross modalities, need to be tackled. It is here that one is forced to draw a line to the higher-level behavioral and social units used in psychology, anthropology etc. They are beyond the scope of standardization efforts for now.
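As a concrete illustration of this layered view, the following minimal sketch shows how a single corpus item might bundle primary data with descriptive and interpretative encodings per modality. It is an illustration only: the class names, field names and values are our own assumptions, not a proposed standard format.

    from dataclasses import dataclass, field

    @dataclass
    class Encoding:
        layer: str    # e.g. "transcription", "part-of-speech", "gesture-phase"
        kind: str     # "descriptive" or "interpretative"
        segments: list = field(default_factory=list)   # (start_sec, end_sec, label) triples

    @dataclass
    class CorpusItem:
        item_id: str
        primary_data: dict                      # modality -> media file
        encodings: list = field(default_factory=list)

    # Hypothetical recording: speech plus pointing gestures.
    item = CorpusItem(
        item_id="session-042",
        primary_data={"audio": "session-042.wav", "video": "session-042.mpg"},
        encodings=[
            Encoding("transcription", "descriptive",
                     [(0.0, 1.9, "I want to go from here to there")]),
            Encoding("dialogue-act", "interpretative",
                     [(0.0, 1.9, "inform")]),
            Encoding("gesture-phase", "descriptive",
                     [(0.6, 0.9, "stroke"), (1.4, 1.7, "stroke")]),
        ],
    )

Haptic or biometric signals would appear as further entries in primary_data, with their own descriptive and interpretative layers.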
Coding Schemes and Tools

As Bird and Liberman (2001) lucidly demonstrated, tools are tailored to a specific annotation framework. If we extract the logical level of an annotation framework and build a core engine with a programmers' interface (API), we could reuse this core for a large number of applications (annotation, visualization, query, analysis etc.; see Figure 6). Two such annotation frameworks are the track-based framework (TBF) and the Annotation Graph framework (AGF). The latter is more general than the former, but not all applications may need the added power of the AGF. Isolating and opening up the logical level is a leap forward in standardization, but in the near future a single, standard core engine used by all researchers is
unlikely to emerge. On the other hand, a number of coding schemes have undergone some standardization, especially in the linguistic community (syntax trees, parts-of-speech, dialogue acts etc.), so one may ask whether a unified coding scheme specification language, as under development by the ISO/TC 37/SC 4 committee (Ide and Romary, 2001), would help in reusing such schemes and in making annotated data accessible to applications that work on those schemes (parsers, speech recognizers, generation planners etc.). We envision a standardized specification language for schemes which is independent of a specific framework, and thus independent of a specific tool. It should allow modular extensibility so that new kinds/classes of information can be added easily from outside sources (see Figure 6). Central repositories could collect, store and distribute sets of standard taxonomies (part-of-speech tags, syntax trees, gesture categories, emblem lexicons etc.) that can be downloaded and readily integrated in a plug-and-play fashion into research schemes. A question in a similar direction is how close a coding scheme's representation can come to a multimodal system's representation language (see Section 3), so that the same representation could be used in empirical research as well as in the running system. Such a close match would shorten development time considerably. For tools, the question of ergonomic design must be emphasized: as more and more mass data is collected, efficiency becomes a central factor. An integrated user interface should not only allow a smooth annotation workflow but also offer complex search options, visual access to coding schemes, and semi-automatic documentation facilities.
[Figure 6 (schematic): scheme-specific applications (gesture, gaze and speech analysis, a parser, a machine learning classifier) and generic applications (visualization, general analysis) on the application layer access, through a coding tool API, the scheme layer and framework layer on the logical layer, which rest on the annotation framework (tracks, types, objects etc.).]
Figure 6. Refined view on the separation between application and logical layer (left/right). In the lower half applications are based on the generic annotation framework, whereas in the upper half they are based on specific coding schemes.
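As a rough illustration of the separation Figure 6 depicts, a track-based core engine might expose a small programmers' interface that scheme-specific and generic applications share, with coding schemes supplied as data. This is a sketch under our own assumptions; the class and method names are hypothetical and do not reproduce the API of any existing tool such as ATLAS or NITE.

    class Annotation:
        """A typed, time-stamped annotation element on one track."""
        def __init__(self, start, end, value):
            self.start, self.end, self.value = start, end, value

    class Track:
        """Logical-layer object of a track-based framework: a sequence of elements."""
        def __init__(self, name, scheme=None):
            self.name = name
            self.scheme = scheme            # coding scheme = admissible labels, supplied as data
            self.annotations = []

        def add(self, start, end, value):
            if self.scheme is not None and value not in self.scheme:
                raise ValueError(f"'{value}' is not in the coding scheme of track '{self.name}'")
            self.annotations.append(Annotation(start, end, value))

        def between(self, start, end):
            """Query call shared by visualization, search and analysis applications."""
            return [a for a in self.annotations if a.start < end and a.end > start]

    # A coding scheme declared as data (e.g. downloaded from a central repository);
    # the movement phase labels follow Kita et al. (1998).
    GESTURE_PHASES = {"preparation", "stroke", "hold", "retraction"}

    gesture = Track("gesture-phase", GESTURE_PHASES)
    gesture.add(0.6, 0.9, "stroke")
    gesture.add(1.4, 1.7, "stroke")
    first_second = gesture.between(0.0, 1.0)   # application-layer query via the core API

Because the scheme is plain data rather than code, swapping in a different taxonomy does not require changing the core engine, which is the point of isolating the logical level.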
The technologies of multimodal interfaces that spring from the annotated data will hopefully themselves become part of annotation tools, allowing intuitive and therefore efficient encoding of data using color, sound, touch etc. Standardized metaphors for coding tool interfaces should develop in the process. Bootstrapping techniques can increase efficiency, especially where standard taxonomies are used (POS, syntax etc.), but may also bias the coder to a certain degree. Multi-coder annotation should be supported by offering update/merge functions (versioning), concurrent coding and reliability checks. To help the interpretation of data, more effort is needed in the standardization of analysis methods, be they simple descriptive statistics or the automatic extraction of meaningful entities, and in the development of evaluation metrics.

Organizations (Existing Corpora and Tools)
We assembled a meta-survey of sources, each of which provides a detailed survey of corpora, tools, and other resources. Survey activities on multimodal resources are undertaken in the ISLE project (formerly EAGLES), especially by the NIMM subgroup (Knudsen et al., 2002a, 2002b), in the TalkBank project, and at MITRE (Bigbee et al., 2001). Extensive documentation of corpora and tools is available in papers and on websites. General data collection and standardization initiatives are NIST and the LDC in the US; ELRA, ELDA and the AHDS (UK) in Europe; and COCOSDA in Japan. Initiatives to build standard tools are ATLAS in the US and NITE (the successor of MATE) in Europe. ELRA fosters the founding of central national agencies for the collection of native-language corpora, and organizes the International Conference on Language Resources and Evaluation (LREC).

Data Collection
One crucial issue for data collection is cost reduction, and hence knowledge of best practice and potential reusability. Reusability requires standard procedures and formats for collecting primary data (video/audio) and standards for transcription (best practice and coding schemes). Since, due to the diverse needs and goals of different projects, this is impossible to achieve on a large scale, one can retreat to a situation where all corpora are at least easy to locate and browse. A corpus must be identifiable as relevant to one's own project by means of meta-data that informs in a concise and standardized way about the technical set-up, file formats and schemes used. Such considerations have led to the founding of OLAC (Bird and Simons, 2001), based on a standard resource description model: the Dublin Core Metadata Set (DCMS). The ISLE MetaData Initiative (IMDI) is also working on meta-data, specifically for multimedia/multimodal corpora, including a mapping to/from OLAC elements. Apart from meta-data, guidelines for best practice are urgently needed. These should include advice on coding scheme design and coding procedures, but also on fundraising and legal issues (ethical and commercial – usually country-specific). The Oxford Text Archive (OTA) is working on an online best practice guide,
commissioned by the AHDS. ISLE, TalkBank and the LDC also provide best practice guidelines.
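To illustrate the kind of concise, standardized meta-data that makes a corpus easy to locate and browse, the following sketch uses Dublin Core style element names for the descriptive part; the record and its values are hypothetical and the layout is our own, not an official OLAC or IMDI profile.

    # Hypothetical metadata record for a multimodal corpus.
    corpus_metadata = {
        # Dublin Core style descriptive elements
        "title":    "Route description dialogues with pointing gestures (fictitious)",
        "creator":  "Some Multimodal Lab",          # placeholder, not a real institute
        "subject":  ["speech", "gesture", "navigation scenario"],
        "language": ["en", "de"],
        "date":     "2002",
        "rights":   "research use only",
        "format":   {"video": "MPEG-1, 25 fps", "audio": "WAV, 16 kHz"},
        # beyond Dublin Core: what a researcher needs to judge reusability
        "technical_setup": "2 cameras, close-talking microphones",
        "annotation_schemes": ["orthographic transcription", "gesture phases"],
        "size": "12 hours, 24 subjects",
    }

A record of this kind, harvested into a central repository, lets a researcher decide whether a corpus matches their project without downloading the primary data.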
Dagstuhl Questionnaire

Corpora and tools were the main topics of the Dagstuhl questionnaire: what was there and what was needed. We collected 28 completed questionnaires; the subjects came from 24 different institutes. We report only the most interesting results here. The dominating modality studied is still speech, closely followed by gesture; facial expression, posture and gaze were less frequent. 20% of the data is still in analogue format, showing that some institutes still use VCR technology. As for the tool situation, 38% use no annotation tool (just a text editor) and 43% developed their own tool. The clearly dominating scenarios were tourism and navigation; the dominating language was English, followed by German and Japanese, then French and Italian. Of the 28 replies, 21 stated that they were going to collect/code data in the near future. We concluded that, on the one hand, there is an overlap in research aims (examined modality, language and scenario); on the other, there is a severe lack of resource coordination. Since most participants are about to collect possibly large corpora soon, this questionnaire is yet another reason to call for initiatives, workshops and conferences where resources are made available and standardization issues are advanced. We suggest that small but concrete case studies be collected and exchanged, exemplifying the different approaches to annotation (schemes and tools). Not only would research results become clearer by giving away the original data, but we would also have a more competitive situation in which schemes as well as tools are used, compared and possibly enhanced. We also suggest finding sample cases that clearly demonstrate the potential and benefits of multimodal systems. As a first step, we held a hands-on annotation exercise as part of the workshop on "Multimodal Resources and Multimodal Systems Evaluation", organized by Mark Maybury and Jean-Claude Martin at LREC 2002. Major issues were a discussion of corpora, schemes, evaluation metrics and annotation tools.

5. FUTURE RESEARCH

Many outstanding research problems must be solved to realize automatically created, user-tailored answers to questions. Important issues include:
1. Multimodal ontologies, including representations of time and space, for processing multimodal inputs and outputs.
2. New devices for cross-modal mutual disambiguation.
3. New models for multimodal interaction, for managing context, discourse, user, and application interactions.
We also need a series of community activities and events to ensure continued scientific progress. This includes:
1. Community agreement on standards and methodologies for annotation and markup.
2. Empirical data and user studies to understand fundamental properties of human processing of multimodal (audio, imagery, tactile) input and output, and the effects on and limitations of human memory and retention.
3. Evaluation tasks, benchmarks and methods.
4. Regular exchange of knowledge.
5. Shared or at least interoperable tools to ensure progress.

6. CONCLUSION
Given the overload of information and knowledge, the growth in heterogeneous computing platforms, and the increasing ubiquity of communications and information access for fixed and mobile users, intelligent interaction may prove to be the most important application of the next decade. Addressing some of the key issues we have outlined in this paper will help our societies move towards important sociotechnical objectives such as information access for all, increased task performance and higher-quality interaction.

7. ACKNOWLEDGEMENTS

This collaborative analysis was initiated at the international Dagstuhl Seminar on Coordination and Fusion in Multimodal Interaction that took place at Schloss Dagstuhl in Germany, October 27 through November 2, 2001. A number of the ideas expressed in this document reflect the contributions of participants at the workshop, including the authors and Sharon Oviatt (OGI), Oliviero Stock (IRST), Rainer Malaka (European Media Lab), Elmar Nöth (Univ. of Erlangen), Norbert Reithinger (DFKI), Koiti Hasida (ETL), Noelle Carbonell (Loria), Candy Sidner (MERL), Laurent Romary (UMR Loria), Justine Cassell (MIT Media Lab), Derek Jacoby (Microsoft Research), Dagmar Schmauks (TU Berlin), Robbert-Jan Beun (Utrecht University), Jean-Claude Martin (Univ. Paris 8), Emiel Krahmer (Tilburg University), Fabio Pianesi (IRST), Elisabeth André (Univ. of Augsburg), Thomas Rist (DFKI), Arne Jönsson (Linköping University), Catherine Pelachaud (Rome), Paul Mc Kevitt (Univ. of Ulster, Londonderry), John Lee (Univ. of Edinburgh) and Lisa Harper (MITRE).

8. REFERENCES

Bigbee, T., Loehr, D. and Harper, L. 2001. "Emerging Requirements for Multi-Modal Annotation and Analysis Tools". In: Proceedings of Eurospeech, pages 1533-1536.
Bird, S. and Liberman, M. 2001. "A Formal Framework for Linguistic Annotation". In: Speech Communication 33 (1/2), pages 23-60.
Bird, S. and Simons, G. 2001. "The OLAC Metadata Set and Controlled Vocabularies". In: Proceedings of the ACL/EACL Workshop on Sharing Tools and Resources, pages 7-18.
Bunt, H. 2000. Dialogue pragmatics and context specification. In: H. Bunt and W. Black, editors, Abduction, Belief and Context in Dialogue. Amsterdam: John Benjamins Publishing Company. (ISBN 90-272-4983-0 (Eur.)/1-55619-794-2 (US)).
Bunt, H. and Beun, R.J., editors, 2001. Cooperative Multimodal Communication. Lecture Notes in Artificial Intelligence 2155. Berlin: Springer Verlag. (ISBN 3-540-42806-2).
Bunt, H.C., Ahn, R., Beun, R.J., Borghuis, T. and van Overveld, C. 1998. Multimodal Cooperation with the DENK System. In: H.C. Bunt, R.J. Beun and T. Borghuis, editors, Multimodal Human-Computer Communication. Berlin: Springer Verlag.
Ekman, P. and Friesen, W.V. 1978. Facial Action Coding System. Palo Alto, CA: Consulting Psychologists Press.
Geurts, B. and Rentier, G. 1993. Quasi-logical form in PLUS. Internal Report, Esprit Project P5254, A Pragmatics-based Language Understanding System. Tilburg University.
Ide, N. and Romary, L. 2001. "Standards for Language Resources". In: Proceedings of the IRCS Workshop on Linguistic Databases, pages 141-149.
Kievit, L.A. 1998. Context-driven Natural Language Interpretation. Ph.D. Thesis, Tilburg University.
Kita, S., van Gijn, I. and van der Hulst, H. 1998. "Movement Phases in Signs and Co-speech Gestures and Their Transcription by Human Coders". In: Wachsmuth, I. and Fröhlich, M. (eds.), Gesture and Sign Language in Human-Computer Interaction, pages 23-35.
Knudsen, M. W., Martin, J. C., Berman, S., Bernsen, N. O., Choukri, K., Dybkjær, L., Heid, U., Mapelli, V., Pelachaud, C. and Poggi, I. 2002a. Survey of NIMM Data Resources, Current and Future User Profiles, Markets and User Needs for NIMM Resources. ISLE Deliverable D8.1, http://isle.nis.sdu.dk.
Knudsen, M. W., Martin, J. C., Bernsen, N. O., Dybkjær, L., Heid, U., Pelachaud, C., Poggi, I., Reithinger, N., van ElsWijk, G., Wittenburg, P., Llisterri, J., Ayuso, M. J. M. and Carletta, J. 2002b. Survey of Multimodal Annotation Schemes and Best Practice. ISLE Deliverable D9.1, http://isle.nis.sdu.dk.
Maybury, M. T., editor. 1993. Intelligent Multimedia Interfaces. AAAI/MIT Press. 405 pp. ISBN 0-262-63150-4.
Maybury, M. T. and Wahlster, W., editors. 1998. Readings in Intelligent User Interfaces. Morgan Kaufmann Press.
Romary, L. 2002. MMIL requirements specification. Project MIAMM – Multidimensional Information Access using Multiple Modalities. EU project IST-2000-29487, Deliverable D6.1. LORIA, Nancy.
9. AFFILIATIONS

Harry Bunt, Tilburg University, Tilburg, The Netherlands
Michael Kipp and Wolfgang Wahlster, DFKI, Saarbrücken, Germany
Mark Maybury, The MITRE Corporation, Bedford, USA
INDEX
Adaptive …. Presentations 100-101 …. Hypermedia 191-192 Affective Presentation Markup Language (APML) 19 Aggregation 186 Animated Agent 19 Auditory …. Assistive devices 313 …. Perception 310 Authoring tool 186, 145 BDI 16 BEETLE Project 242, 244-246 Behavior modelling 287-288 Believable …. Agent 3 …. Virtual Humans 284 Binary space partitioning (BSP) tree 267-268 Body movements 296-298 Cinematic techniques 104-106 Coding schemes 321-322 Cognitive Load Theory 163-165 Coherence 101 Cohesion 102 Collaborations 131-132 Collaborative planning 291 Communicative Repertoire 7-9 Comparisons 188-190 Constraints 263-264 Content …. determination 60-61 …. Selection 183 …. delivery 121-122 Culture 11-14 Deictic signals 54-56 Dialogue 220 …. Goals 39-40 …. Obligations 40-42 …. Trees 40-42 Discourse structure 50-51 Document production 145 Embodied Conversational Agents (ECA) 3,47,51-52, 288
Emotions 9-11,295-296 … and goals 9-10 … triggering 10-11 Gap filling 171-175 GRETA Talking Head 15 Haptic Feedback 318-319 Image transformations 123-124 IMAGEN Project 119 Improved location awareness 115 Intelligent Tutoring Systems (ITS) 161 Label placement 270 Layout 211-212 …. generation 153-154 Learning – disabilities 308, Lexicalisation 56-57 Linguistic …. Focus 302, …. Style 210-211, Location sensitivity 76 Macronodes 191-192 …. formalism 100 Magic Lounge Project 133 MagiCster Project 16 Microplanning 61-62,183 Mission Rehearsal Exercise project 284 Mobile navigation systems 74-75 M-PIRO Project 103,179 Multilinguality 180-182 Multimedia presentation planning 150-151 Multimodal …. corpora 312 …. input 35-37, …. parsing 37-38 Narrative planning 151-153 Natural Language Dialogue 292-295 Natural Language Generation (NLG) 52, 102-104,180-182, 301-302 …. of examples 165-168 …. of referring expressions 54-56 Neurorehabilitation 315-316 Nonverbal signals 49-50 Norms 12, 13 PEACH Project 96 Personal navigation systems 72 Personality 11 Pictures …. and wording 219
…. and meaning 216-218 …. and style 218-219 Planning document structure 208-210 Position constraints 264 Positional stability 273-274 Priority constraints 264 Prosodic models 201 PUPPET Project 27 REAL Project 71 Realisation 62-63 Representational signals 56-57 Resource-adaptated navigation 76 Response generation 238-241 Revision 170-171 Route descriptions 59-60 Self-explanation 171-175 Semantic-based Translation of Graphics 126-127 Size constraints 264 Sonification 309-310 Speech and hand gestures 33-35 Staging Project 27 State hysteresis 273 Story 298-299 Style 14-15 Surface realization 183 Temporal continuity 272 Transparency constraints 264 Turn-taking 241-242 Tutorial dialogue 231-232 Tutoring systems 229-231 Virtual …. Humans 288 …. Reality 315-316 Visibility and occlusion 266-267 Wayfinding 79-80 Z-buffer 268
Text, Speech and Language Technology
1. H. Bunt and M. Tomita (eds.): Recent Advances in Parsing Technology. 1996 ISBN 0-7923-4152-X
2. S. Young and G. Bloothooft (eds.): Corpus-Based Methods in Language and Speech Processing. 1997 ISBN 0-7923-4463-4
3. T. Dutoit: An Introduction to Text-to-Speech Synthesis. 1997 ISBN 0-7923-4498-7
4. L. Lebart, A. Salem and L. Berry: Exploring Textual Data. 1998 ISBN 0-7923-4840-0
5. J. Carson-Berndsen: Time Map Phonology. 1998 ISBN 0-7923-4883-4
6. P. Saint-Dizier (ed.): Predicative Forms in Natural Language and in Lexical Knowledge Bases. 1999 ISBN 0-7923-5499-0
7. T. Strzalkowski (ed.): Natural Language Information Retrieval. 1999 ISBN 0-7923-5685-3
8. J. Harrington and S. Cassidy: Techniques in Speech Acoustics. 1999 ISBN 0-7923-5731-0
9. H. van Halteren (ed.): Syntactic Wordclass Tagging. 1999 ISBN 0-7923-5896-1
10. E. Viegas (ed.): Breadth and Depth of Semantic Lexicons. 1999 ISBN 0-7923-6039-7
11. S. Armstrong, K. Church, P. Isabelle, S. Manzi, E. Tzoukermann and D. Yarowsky (eds.): Natural Language Processing Using Very Large Corpora. 1999 ISBN 0-7923-6055-9
12. F. Van Eynde and D. Gibbon (eds.): Lexicon Development for Speech and Language Processing. 2000 ISBN 0-7923-6368-X; Pb: 0-7923-6369-8
13. J. Véronis (ed.): Parallel Text Processing. Alignment and Use of Translation Corpora. 2000 ISBN 0-7923-6546-1
14. M. Horne (ed.): Prosody: Theory and Experiment. Studies Presented to Gösta Bruce. 2000 ISBN 0-7923-6579-8
15. A. Botinis (ed.): Intonation. Analysis, Modelling and Technology. 2000 ISBN 0-7923-6605-0
16. H. Bunt and A. Nijholt (eds.): Advances in Probabilistic and Other Parsing Technologies. 2000 ISBN 0-7923-6616-6
17. J.-C. Junqua and G. van Noord (eds.): Robustness in Languages and Speech Technology. 2001 ISBN 0-7923-6790-1
18. R.H. Baayen: Word Frequency Distributions. 2001 ISBN 0-7923-7017-1
19. B. Granström, D. House and I. Karlsson (eds.): Multimodality in Language and Speech Systems. 2002 ISBN 1-4020-0635-7
20. M. Carl and A. Way (eds.): Recent Advances in Example-Based Machine Translation. 2003 ISBN 1-4020-1400-7; Pb 1-4020-1401-5
21. A. Abeillé: Treebanks. Building and Using Parsed Corpora. 2003 ISBN 1-4020-1334-5; Pb 1-4020-1335-3
22. J. van Kuppevelt and R.W. Smith (eds.): Current and New Directions in Discourse and Dialogue. 2003 ISBN 1-4020-1614-X; Pb 1-4020-1615-8
23. H. Bunt, J. Carroll and G. Satta (eds.): New Developments in Parsing Technology. 2004 ISBN 1-4020-2293-X; Pb 1-4020-2294-8
24. G. Fant: Speech Acoustics and Phonetics. Selected Writings. 2004 ISBN 1-4020-2373-1; Pb 1-4020-2789-3
25. W.J. Barry and W.A. Van Dommelen (eds.): The Integration of Phonetic Knowledge in Speech Technology. 2004 ISBN 1-4020-2635-8; Pb 1-4020-2636-6
26. D. Dahl (ed.): Practical Spoken Dialog Systems. 2004 ISBN 1-4020-2674-9; Pb 1-4020-2675-7
27. O. Stock and M. Zancanaro (eds.): Multimodal Intelligent Information Presentation. 2004 ISBN 1-4020-3049-5; Pb 1-4020-3050-9
28. W. Minker, D. Bühler and L. Dybkjaer (eds.): Spoken Multimodal Human-Computer Dialogue in Mobile Environments. 2004 ISBN 1-4020-3073-8; Pb 1-4020-3074-6
KLUWER ACADEMIC PUBLISHERS – DORDRECHT / BOSTON / LONDON