Proactive Spoken Dialogue Interaction in Multi-Party Environments
Petra-Maria Strauß • Wolfgang Minker
Proactive Spoken Dialogue Interaction in Multi-Party Environments
Petra-Maria Strauß, Ulm University, Institute of Information Technology, Albert-Einstein-Allee 43, 89081 Ulm, Germany
[email protected]
Wolfgang Minker, Ulm University, Institute of Information Technology, Albert-Einstein-Allee 43, 89081 Ulm, Germany
[email protected]
ISBN 978-1-4419-5991-1
e-ISBN 978-1-4419-5992-8
DOI 10.1007/978-1-4419-5992-8
Springer New York Dordrecht Heidelberg London

Library of Congress Control Number: 2009944071

© Springer Science+Business Media, LLC 2010

All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com)
Preface
This book describes the development and evaluation of a novel type of spoken language dialogue system that proactively interacts in a conversation with two users. Spoken language dialogue systems are being deployed in more and more application domains and environments. As a consequence, the demands posed on these systems are rising rapidly. In the near future, a dialogue system will be expected, for instance, to perceive its environment and users and to adapt accordingly. It should recognise the users' goals and desires and react in a proactive and flexible way. Flexibility is also required in the number of users that take part in the interaction.

An advanced dialogue system that meets these requirements is presented in this work. A specific focus has been placed on the dialogue management of the system, on which the multi-party environment poses new challenges. In addition to the human-computer interaction, the human-human interaction has to be considered for dialogue modelling. A prevalent approach to dialogue management has been adapted accordingly. To enable proactive interaction, a detailed dialogue history has been implemented. As opposed to common dialogue systems, which start from scratch when the interaction begins, the system developed in the scope of this book starts modelling as soon as the conversation enters its specified domain. The knowledge acquired during this early stage of the conversation enables the system to take the initiative and to make meaningful proactive contributions from its very first interaction.

In order to develop this interaction assistant, comprehensive data recordings have been conducted in a multi-modal Wizard-of-Oz setup. A detailed overview and analysis of the resulting corpus of multi-party dialogues is presented. An extensive evaluation of the usability and acceptance of this novel sort of dialogue system constitutes a further significant part of this book.
Contents
Preface . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . V

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
  1.1 Introduction on Spoken Language Dialogue Systems . . . . . . . . . . . 2
    1.1.1 System Architecture . . . . . . . . . . . . . . . . . . . . . . . . 2
    1.1.2 Current Trends in Spoken Language Dialogue Systems . . . . . . . . 6
  1.2 Related Work on Advanced Dialogue Systems . . . . . . . . . . . . . . . 8
  1.3 The Computer as a Dialogue Partner . . . . . . . . . . . . . . . . . . 9
  1.4 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
  1.5 Outline of the Book . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2 Fundamentals . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  2.1 Corpus Development . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
  2.2 Evaluation of Spoken Language Dialogue Systems . . . . . . . . . . . . 20
  2.3 Multi-Party Interaction . . . . . . . . . . . . . . . . . . . . . . . . 22
    2.3.1 Speech Acts and other Linguistic Fundamentals . . . . . . . . . . . 22
    2.3.2 Conversational Roles . . . . . . . . . . . . . . . . . . . . . . . 25
    2.3.3 Human-Human and Human-Computer Interaction . . . . . . . . . . . . 28
  2.4 Dialogue Modelling . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
    2.4.1 Dialogue Context and History . . . . . . . . . . . . . . . . . . . 35
    2.4.2 Dialogue Management . . . . . . . . . . . . . . . . . . . . . . . . 37
    2.4.3 Information State Update Approach to Dialogue Modelling . . . . . . 39
    2.4.4 Multi-Party Dialogue Modelling . . . . . . . . . . . . . . . . . . 43
  2.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

3 Multi-Party Dialogue Corpus . . . . . . . . . . . . . . . . . . . . . . . . 51
  3.1 Existing Multi-Party Corpora . . . . . . . . . . . . . . . . . . . . . . 51
  3.2 Wizard-of-Oz Data Collection . . . . . . . . . . . . . . . . . . . . . . 55
    3.2.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . . . 55
    3.2.2 Procedure . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
    3.2.3 System Interaction Policies . . . . . . . . . . . . . . . . . . . . 58
    3.2.4 WIT: The Wizard Interaction Tool . . . . . . . . . . . . . . . . . 60
  3.3 The PIT Corpus . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
    3.3.1 Data Structure . . . . . . . . . . . . . . . . . . . . . . . . . . 66
    3.3.2 Annotation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
    3.3.3 Dialogue Analysis . . . . . . . . . . . . . . . . . . . . . . . . . 68
  3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

4 Dialogue Management for a Multi-Party Spoken Dialogue System . . . . . . . . 73
  4.1 Multi-Party Dialogue Modelling . . . . . . . . . . . . . . . . . . . . . 75
    4.1.1 Dialogue Model . . . . . . . . . . . . . . . . . . . . . . . . . . 75
    4.1.2 Interaction Protocols . . . . . . . . . . . . . . . . . . . . . . . 78
  4.2 Dialogue Management in the Example Domain of Restaurant Selection . . . 81
    4.2.1 Dialogue Context . . . . . . . . . . . . . . . . . . . . . . . . . 81
    4.2.2 Domain Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 81
    4.2.3 Task Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82
    4.2.4 Information State Updates . . . . . . . . . . . . . . . . . . . . . 84
    4.2.5 Dialogue Plans . . . . . . . . . . . . . . . . . . . . . . . . . . 91
  4.3 Enabling Proactiveness . . . . . . . . . . . . . . . . . . . . . . . . . 92
    4.3.1 Optimistic Grounding and Integration Strategy for Multi-Party Setup 92
    4.3.2 System Interaction Strategy . . . . . . . . . . . . . . . . . . . . 94
    4.3.3 Dialogue History for Proactive System Interaction . . . . . . . . . 98
  4.4 Proactive Dialogue Management Example . . . . . . . . . . . . . . . . . 102
  4.5 Problem Solving Using Discourse Motivated Constraint Prioritisation . . 107
    4.5.1 Prioritisation Scheme . . . . . . . . . . . . . . . . . . . . . . . 109
    4.5.2 Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
  4.6 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113

5 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
  5.1 Usability Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . 116
    5.1.1 Questionnaire Design . . . . . . . . . . . . . . . . . . . . . . . 116
    5.1.2 Participants . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
    5.1.3 Analysing the System Progress . . . . . . . . . . . . . . . . . . . 120
    5.1.4 Assessing the Usability . . . . . . . . . . . . . . . . . . . . . . 122
  5.2 Evaluating System Performance . . . . . . . . . . . . . . . . . . . . . 124
    5.2.1 Descriptive Analysis of the PIT Corpus . . . . . . . . . . . . . . 125
    5.2.2 Evaluation of Discourse Motivated Constraint Prioritisation . . . . 127
  5.3 Gaze Direction Analysis to Assess User Acceptance . . . . . . . . . . . 129
  5.4 Assessing Proactiveness . . . . . . . . . . . . . . . . . . . . . . . . 133
    5.4.1 Addressing Behaviour During First Interaction Request . . . . . . . 133
    5.4.2 Effect of Avatar on Proactiveness . . . . . . . . . . . . . . . . . 134
    5.4.3 Subjective Evaluation . . . . . . . . . . . . . . . . . . . . . . . 136
  5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

6 Conclusions and Future Directions . . . . . . . . . . . . . . . . . . . . . 141
  6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
  6.2 Future Directions . . . . . . . . . . . . . . . . . . . . . . . . . . . 146

A Wizard Interaction Tool . . . . . . . . . . . . . . . . . . . . . . . . . . 151

B Example Dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155

C Questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165
List of Figures

1.1 SLDS architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Interaction model of the dialogue system. . . . . . . . . . . . . . . . . 11

2.1 Interaction model of the dialogue system. . . . . . . . . . . . . . . . . 27
2.2 IBiS1 information state . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.3 Multi-IBiS information state . . . . . . . . . . . . . . . . . . . . . . . 47

3.1 Data collection setup. . . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.2 Recording scene from the viewpoint of cameras C3 and C1 . . . . . . . . . 57
3.3 Wizard Interaction Tool . . . . . . . . . . . . . . . . . . . . . . . . . 62
3.4 Phoneme based mouth positions of the avatar . . . . . . . . . . . . . . . 63
3.5 Example database entry . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.6 Dialogue with crucial points and phases. . . . . . . . . . . . . . . . . . 66

4.1 Dialogue management component of the system. . . . . . . . . . . . . . . . 74
4.2 Information state structure . . . . . . . . . . . . . . . . . . . . . . . 75
4.3 Example information state. . . . . . . . . . . . . . . . . . . . . . . . . 76
4.4 Ontology for restaurant domain . . . . . . . . . . . . . . . . . . . . . . 83
4.5 System life cycle. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
4.6 Dialogue history as it relates to the dialogue . . . . . . . . . . . . . . 100
4.7 Dialogue history . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.8 Example information state after getLatestUtterance of utterance 16. . . . 103
4.9 Example information state after integrate of utterance 16. . . . . . . . . 104
4.10 Example information state after consultDB of utterance 16. . . . . . . . 104
4.11 Example information state after getLatestUtterance of utterance 17. . . . 105
4.12 Example information state after integrate of utterance 17. . . . . . . . 105
4.13 Example information state after getLatestUtterance of utterance 18. . . . 106
4.14 Example information state after integrate of utterance 18. . . . . . . . 106
4.15 Example information state after downdateQUD of utterance 18. . . . . . . 107

5.1 Technical self-assessment . . . . . . . . . . . . . . . . . . . . . . . . 119
5.2 Usability evaluation over all sessions using AttrakDiff . . . . . . . . . 121
5.3 Usability evaluation over all sessions using SASSISV . . . . . . . . . . . 121
5.4 Usability evaluation using SASSISV . . . . . . . . . . . . . . . . . . . . 122
5.5 Usability evaluation over the different setups using AttrakDiff . . . . . 123
5.6 Usability evaluation over the different setups using SASSISV. . . . . . . 124
5.7 Durations of the dialogues of Session I and II . . . . . . . . . . . . . . 126
5.8 Comparison of the dialogue phase durations . . . . . . . . . . . . . . . . 127
5.9 Evaluation of the prioritisation algorithm . . . . . . . . . . . . . . . . 128
5.10 Listening behaviour . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

A.1 WIT software architecture . . . . . . . . . . . . . . . . . . . . . . . . 152
A.2 Class diagram of the WIT dialogue manager . . . . . . . . . . . . . . . . 153

C.1 SASSISV questionnaire . . . . . . . . . . . . . . . . . . . . . . . . . . 158
C.2 SASSI questionnaire without SASSISV items . . . . . . . . . . . . . . . . 159
C.3 System interaction questionnaire . . . . . . . . . . . . . . . . . . . . . 160
List of Tables

1.1 Example dialogue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

2.1 Dialogue snippet. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.2 Interaction protocols . . . . . . . . . . . . . . . . . . . . . . . . . . 42
2.3 Interaction principles by Ginzburg and Fernández . . . . . . . . . . . . . 44
2.4 Interaction protocol adapted to multi-party situation . . . . . . . . . . 45
2.5 Interaction principle by Kronlid . . . . . . . . . . . . . . . . . . . . . 45
2.6 Interaction protocol using the AMA principle . . . . . . . . . . . . . . . 46

3.1 Example scenario description. . . . . . . . . . . . . . . . . . . . . . . 58
3.2 Statistical information of the PIT corpus. . . . . . . . . . . . . . . . . 65
3.3 PIT Corpus dialogue act tagset. . . . . . . . . . . . . . . . . . . . . . 68
3.4 Annotated example dialogue from the PIT corpus. . . . . . . . . . . . . . 70

4.1 New interaction principle . . . . . . . . . . . . . . . . . . . . . . . . 79
4.2 Interaction protocols using ASPS . . . . . . . . . . . . . . . . . . . . . 80
4.3 Dialogue system interaction types. . . . . . . . . . . . . . . . . . . . . 95
4.4 Contentual motivation for proactive interaction. . . . . . . . . . . . . . 96
4.5 Snippet of example dialogue from the PIT corpus. . . . . . . . . . . . . . 102
4.6 Prioritisation scheme applied to an extract of a dialogue. . . . . . . . . 112

5.1 Gaze direction towards dialogue partners according to dialogue phases . . 130
5.2 Percentage of U1 addressing U2. . . . . . . . . . . . . . . . . . . . . . 131
5.3 Gazing behaviour during addressing . . . . . . . . . . . . . . . . . . . . 132
5.4 Gazing behaviour during listening . . . . . . . . . . . . . . . . . . . . 132
5.5 Gazing behaviour during first interaction request. . . . . . . . . . . . . 134
5.6 Statistical analysis of proactiveness in Session III dialogues . . . . . . 135
5.7 Subjective ratings of system interaction in Session III dialogues . . . . 137
1 Introduction
HAL: 'Excuse me, Frank.'
Frank: 'What is it, HAL?'
HAL: 'You got the transmission from your parents coming in.'
Frank: 'Fine. Put it on here, please. Take me in a bit.'
HAL: 'Certainly.'

Quote from '2001 – A Space Odyssey' (1968) by Stanley Kubrick. The HAL 9000 computer is addressing Frank, who is resting on his sun bed, approx. one hour into the film.
The future predicted in 1968 by Stanley Kubrick (1928-1999) and Arthur C. Clarke (1917-2008) in the science fiction movie 2001 – A Space Odyssey [Kubrick, 1968] has arrived: computers now play a prominent role in our everyday lives. Over the past decades they have evolved from big, monstrous, mainly industrial machines into small, mobile and extremely powerful devices that are used in one way or another by presumably every human being in the developed world. The quote by the 'supercomputer' HAL 9000 from Kubrick's movie shows that the computer is equipped with human-like qualities. It possesses natural language capabilities for both understanding and speaking, the ability of logical reasoning, and proactive behaviour, to name just a few traits. The human characters of the movie even describe the computer as a sixth member of their spaceship crew. A HAL-like computer has not been developed to date; however, HAL's characteristics, i.e. his human-like features, are starting to appear in more and more computer systems. Natural language interaction plays an important role because speech is still the easiest and most natural way for humans to interact. Large displays become superfluous, which opens the way for ubiquitous computing and lets computers disappear more and more into the background. Safety is a further factor favouring interaction by speech.

P.-M. Strauß and W. Minker, Proactive Spoken Dialogue Interaction in Multi-Party Environments, DOI 10.1007/978-1-4419-5992-8_1, © Springer Science + Business Media, LLC 2010
This becomes especially apparent in automotive applications. While operating a vehicle, the driver can interact with the navigation, telephony and media applications by speech without taking the eyes off the road. The automotive environment is also a pioneer domain for proactiveness. State-of-the-art head units inform the driver about traffic hazards coming up on the road. Depending on the priority of the message, for instance in terms of the distance to the obstacle, which indicates whether the driver could be affected immediately, even ongoing phone calls may be interrupted so that the driver receives the message as soon as the system learns about the hazard. As an independent crew member, HAL is furthermore able to communicate with multiple users at the same time, while most of today's computer systems are restricted to a single user, i.e. to one-on-one human-computer interaction. If dialogue systems could interact with several users simultaneously, many applications would benefit, for instance in the process of achieving a common task. The research presented in this book addresses these challenges: a spoken language dialogue system that interacts with two users as an independent dialogue partner. It engages proactively in the interaction when the conversational situation requires it, and withdraws when it is no longer needed. We thereby focus on the dialogue management functionality of the system (Chapter 4), for which we performed an extensive data collection (Chapter 3) to support the system development. Furthermore, the evaluation of this novel sort of dialogue system forms another prominent part of this book (Chapter 5). The envisaged system is introduced in more detail in Section 1.3. First, a short introduction to spoken language dialogue systems in general is given, followed by a description of current trends and related work in the area of advanced dialogue systems.
1.1 Introduction on Spoken Language Dialogue Systems

1.1.1 System Architecture

The task of a spoken language dialogue system (SLDS) is to enable and support spoken interaction between human users and the service offered by the application. The SLDS deals with two types of information: the one understood by the user (natural language in speech or text) and the one understood by the system (e.g. semantic frames). The system carries out a number of tasks before it can give a response to the user. These tasks are performed by different modules which are usually connected in a pipeline architecture. Figure 1.1 shows a basic architecture of an SLDS.

Fig. 1.1. SLDS architecture. [The figure shows the pipeline of speech recognition, semantic analysis, dialogue management, natural language generation and text-to-speech synthesis, with the dialogue manager connected to the application, the database and the dialogue context.]

The different modules are described in the following:

Automatic Speech Recognition (ASR). The task of the ASR module is the transcription of the speech input (i.e. acoustic signals) of the user into words (e.g. [Jelinek, 1997, Rabiner and Juang, 1993, Jurafsky and Martin, 2000, Huang et al., 2001]). Using an acoustic model which describes potential signals, a lexicon containing all potential vocabulary, and a language model, i.e. a grammar, the acoustic signals are usually mapped to the resulting words or sentences with statistical methods. Different factors determine the complexity of speech recognition. A system that is to be used by an unknown number of different users, possibly speaking in different accents and dialects, is said to be speaker-independent. The opposite is a speaker-dependent system, which is trained specifically for the individual future user. A third, intermediate option is a speaker-adaptive system, which is developed as a speaker-independent system but can adapt to the actual user through training and usage. The vocabulary of the system further influences the complexity and performance: a small vocabulary is easier to recognise than a large one. Finally, continuous speech poses a greater challenge than isolated keywords.

Natural Language Understanding (NLU). The NLU module tries to extract the semantic information from the word string produced by the speech recogniser (refer to e.g. [Allen, 1995, Jurafsky and Martin, 2000]). It produces a computer-readable representation of the information (e.g. as semantic frames) which is then further processed by the dialogue management module. A common approach is to perform rule-based semantic parsing to extract the semantic meaning, e.g. attribute-value pairs, out of the utterances. Other approaches include statistical methods for semantic analysis (e.g. [Minker et al., 1999]).

Dialogue Management (DM). The dialogue manager is responsible for smooth interaction. It handles the input (in the form of a semantic representation)
which is to be integrated into the dialogue context. It organises turn taking and initiative, and performs task or problem solving by interacting with the application. Finally, it triggers the output generation to return an appropriate response (e.g. the requested information) or to ask for any information still missing to fulfil the task. The DM makes use of various knowledge sources which constitute the dialogue context. The main parts are the task model and the dialogue model [McTear, 2002]. The task model contains all task-related parts of the system, such as the task record which holds all user constraints mentioned so far in the ongoing dialogue. The dialogue model contains information regarding the dialogue itself, such as a conversation model which captures the current speaker, addressee, speech act, etc. The dialogue history can be said to belong to this part of the context, as it holds information about the previous utterances of the ongoing dialogue. A further knowledge source is a domain and world knowledge model which holds the logical structure of the domain and world the dialogue system operates in. In addition, a user model can be deployed which holds information about the users, either to recognise specific users or as more general information that enables the system to make recommendations. All these components are implemented more or less explicitly depending on the type of dialogue management used. Approaches to dialogue management can be classified into three main categories (following the categorisation presented by McTear (2002)):

• Finite-state-based approach. The dialogue is always in a certain predefined dialogue state; certain conditions trigger state changes. In this approach, knowledge base and dialogue management strategy are not separated but are represented together in the dialogue states. The approach is rigid and inflexible but very suitable for small, clearly defined applications.

• Frame-based approach.
Systems implementing this approach deploy a specific task model which determines a set of slots to be filled with values supplied by the user in the course of the dialogue in order for the system to fulfil the task. Conversational aspects of the dialogue are considered only in the scope of task solving. The system is not expected to hold a conversation or to know details of the conversation, such as the order in which constraints were mentioned. Thus, no complex models have to be deployed. The approach is suitable for dialogue systems used for information retrieval, such as train departure times.

• Agent-based approach. This approach is able to model dialogues in a more complex way. With sophisticated models of the conversation and the dialogue participants it overcomes the limitations of the aforementioned approaches. Dialogue is no longer limited to certain dialogue stages; instead, the approach works towards understanding the dialogue as a whole. It models the dialogue from the viewpoint of the dialogue system, which is treated as an agent that has goals (e.g. to fulfil the task), intentions, and plans to achieve its goals.
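To make the frame-based category concrete, the slot-filling loop can be sketched in a few lines. This is a minimal illustration, not the system described in this book; the slot names and prompts are invented for the example:

```python
# Minimal frame-based dialogue manager sketch: the task model is a set of
# slots; the system asks for missing values until the frame is complete.
# Slot names and prompts are invented for illustration.

PROMPTS = {
    "cuisine": "What kind of food would you like?",
    "area": "In which part of town?",
    "price": "What price range?",
}

def next_action(frame):
    """Return the next system action for a partially filled frame."""
    for slot, prompt in PROMPTS.items():
        if frame.get(slot) is None:
            return ("ask", slot, prompt)          # still missing a value
    return ("query_database", dict(frame), None)  # frame complete

def update(frame, slot_values):
    """Integrate attribute-value pairs extracted by the NLU module."""
    new_frame = dict(frame)
    new_frame.update(slot_values)
    return new_frame

frame = {"cuisine": None, "area": None, "price": None}
frame = update(frame, {"cuisine": "italian"})
print(next_action(frame))  # ('ask', 'area', 'In which part of town?')
```

Note how the dialogue strategy is implicit in the order of the slots: the system simply asks for the first unfilled value, with no model of the conversation beyond the frame itself.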
The prominent Information State Update approach (e.g. [Ginzburg, 1996, Larsson, 2002]) belongs to the third category. The dialogue is seen as a state of information that is updated with every utterance and is modelled from the viewpoint of the system, enabling it to 'understand' the dialogue as it occurs. This approach is thus very suitable for our dialogue system, which is to constitute an independent dialogue partner. The approach is introduced in Section 2.4 and later adopted and extended to suit our setup, as presented in Chapter 4.

A further categorisation differentiates between rule-based and statistical processing in dialogue management. All of the above-mentioned categories of dialogue management can be implemented using either approach. The rule-based approach has been state of the art for a long time. Rules, defined by the developer, have to be supplied for all cases that can possibly occur in the dialogue. Accurate processing is thus assured; however, developing the rule base is very time-consuming, and an increase in the complexity of the application brings about a corresponding increase in the complexity of the rule set, which can easily reach an unmanageable dimension. More recently, statistical approaches popular in ASR and also in NLU (e.g. [Minker et al., 1999]) have started to be deployed for dialogue management as well, e.g. [Levin and Pieraccini, 1997, Singh et al., 2002, Lemon et al., 2006, Williams and Young, 2007]. Statistical techniques are based on statistical modelling of the different processes and on learning the parameters of a model from appropriate data. The drawback of this approach is that a large amount of training data is needed for development, which is difficult to obtain.

Another important task of the dialogue management is problem solving. The dialogue management communicates with the application in order to fulfil the task. The simplest form of an application is a database.
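In this simplest case, problem solving amounts to querying the database with the current user constraints and, if nothing is found, relaxing the least important constraint and retrying. The following sketch illustrates the idea; the entries, constraint names and priorities are invented and do not come from the PIT system:

```python
# Sketch of database-driven problem solving with constraint relaxation:
# query with all user constraints; if no result, drop the least important
# constraint and retry. Data and priorities are invented for illustration.

RESTAURANTS = [
    {"name": "Da Mario", "cuisine": "italian", "area": "centre", "price": "medium"},
    {"name": "Asia Garden", "cuisine": "chinese", "area": "suburb", "price": "cheap"},
]

def query(constraints):
    """Return all database entries matching every constraint."""
    return [r for r in RESTAURANTS
            if all(r.get(k) == v for k, v in constraints.items())]

def solve(constraints, priority):
    """Relax the lowest-priority constraint until a result is found."""
    constraints = dict(constraints)
    while constraints:
        results = query(constraints)
        if results:
            return results, constraints
        weakest = min(constraints, key=lambda k: priority.get(k, 0))
        del constraints[weakest]        # relax the least important constraint
    return [], {}

results, used = solve({"cuisine": "italian", "price": "cheap"},
                      priority={"cuisine": 2, "price": 1})
print(results[0]["name"], used)  # Da Mario {'cuisine': 'italian'}
```

Chapter 4 replaces the fixed priorities of such a scheme with priorities motivated by the discourse itself.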
In this case the dialogue management interacts with it by performing database queries based on the current user constraints (contained in the task model) (e.g. [Qu and Beale, 1999]). Problem solving further looks at the outcome of the query and, if necessary, tries to optimise it. For instance, if the query does not yield any results, the constraint set can be modified, for example by relaxing less important user constraints, until a more satisfying result is achieved (e.g. [Walker et al., 2004, Carberry et al., 1999]).

Natural Language Generation (NLG). The response commissioned by the dialogue management module is in this step turned into a natural language utterance. A common practice for NLG is the template-based approach: previously defined templates are filled with the current values. The NLG module is further responsible for structuring the output, i.e. choosing the best output, combining several outputs if more than one is available, or breaking the output down into appropriate chunks if the answer is too large. The dialogue history can be consulted to ensure responses that are consistent and coherent with the preceding interaction. For a multi-modal system, if e.g. visual output is deployed besides the speech output, the different modalities have to be integrated. The
6
1 Introduction
respective output has to be assigned the appropriate modality, always ensuring conformity. In general, NLG is concerned with three tasks [Reiter, 1994, Reiter and Dale, 2000]:

• Content determination and text planning decide what information should be communicated, and in what kind of rhetorical structure.

• Sentence planning determines the structure of the utterance, for instance adapting it in order to fit in well with the current flow of the dialogue. Examples are splitting or conjoining sentences as well as adding references or discourse markers.

• Realisation is responsible for linguistic correctness and for adapting the content to the actual output modality.

Text-to-Speech Synthesis (TTS). Utterances generated in the previous module are converted from textual form into acoustic signals using text-to-speech (TTS) conversion [Dutoit, 2001, Huang et al., 2001]. The text is first converted into a phoneme sequence with prosodic information on a symbolic level. Acoustic synthesis then performs a concatenation of speech units (e.g. diphones are common for German, while syllables are used for Chinese) contained in a database. The generated audio is then played back to the user.
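Returning briefly to generation: the template-based NLG approach mentioned above can be sketched in a few lines. The templates and slot names below are invented for illustration and are not taken from the system described in this book:

```python
# Template-based NLG sketch: predefined templates are filled with current
# values from the dialogue context. Templates are invented for illustration.

TEMPLATES = {
    "suggest": "How about {name}? It serves {cuisine} food in the {area}.",
    "no_result": "Sorry, I found no restaurant serving {cuisine} food in the {area}.",
}

def generate(template_id, **values):
    """Fill the chosen template with the current slot values."""
    return TEMPLATES[template_id].format(**values)

print(generate("suggest", name="Da Mario", cuisine="Italian", area="city centre"))
# How about Da Mario? It serves Italian food in the city centre.
```

In a combined approach, the fixed template text would be a pre-recorded prompt and only the filled-in values would be synthesised.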
The most natural-sounding speech output is achieved by a different option that uses pre-recorded audio files. The duty of the NLG module is then simply to select the adequate audio file to be played back to the user. A combination of these approaches, popular for commercial dialogue systems, is especially convenient for template-based NLG: the fixed template texts are pre-recorded, while all variable parts are generated on the fly (preferably using the same speaker for both recordings). This way, the prompts sound as natural as possible without losing the flexibility of synthetically produced speech prompts.
1.1.2 Current Trends in Spoken Language Dialogue Systems
Today's commercial dialogue systems are usually deployed for simple tasks. They are predominantly slot-filling, small-vocabulary finite-state machines, i.e. systems that match specific incoming keywords to a fixed output, a task that does not demand elaborate dialogue systems. They are mainly found in
telephony applications replacing human operators in call centres. Their main aim is to save cost. A nice side-effect has been achieved by some companies by personifying their dialogue systems to use them as marketing instruments. The systems are given a name and appearance and thus star in commercials and on websites to improve a company's image and level of awareness. A prominent example of such a system is the award-winning Julie1 (deployed in May 2002), who answers the phone when someone calls for train schedule information for travel within the United States.
Insufficient technical performance, however, has been hindering speech-based interfaces from obtaining large-scale acceptance and application. Broad usage requires good recognition performance on speaker-independent, large-vocabulary, continuous natural speech, which has been posing a great challenge to speech recognition. The last years have been marked by technical advancement, and user acceptance has been growing as people gradually get accustomed to SLDSs. The usefulness and convenience of spoken language interaction has been recognised, and thus the range of applications is starting to grow and change. With progressing technology and the quest for smart and apt computer systems, the foundation for accelerated progress has been provided. Possibly, scenarios that have for a long time only been found in science fiction might become ordinary scenes of everyday life in the future.
A current trend addresses the nature of computer systems. Computers are blending more and more into the background, as described by the term ubiquitous computing. Computers are becoming smaller, almost disappearing, and are increasingly deployed in mobile form. Everyday appliances are enriched with computational intelligence to ease human life, building the basis for intelligent environments.
Popular examples are intelligent heating and lighting adjustments and the intelligent refrigerator that keeps track of the contents, recipes, shopping lists and even ingested calories of the users. The overall trend is that computers adapt to the human way of interaction instead of requiring the humans to move towards the system for interaction. All of these facts pose further demands on applications and technology and at the same time show the importance of speech-based interaction. It is an intuitive means of communication that requires neither space (e.g. big screens, as is the case for haptic interaction) nor visual attention, and is thus also a suitable way for human-computer interaction in critical situations, such as in the car, where the driver's gaze should not be drawn from the road if possible2.
Novel demands are posed on future systems in order to realise the adaptation to new applications and environments. The objective of future systems is to actually understand the dialogue they are involved in and to adapt to
1 http://www.amtrak.com
2 In practice, as an intermediate step towards speech interaction, current systems mostly adopt speech interaction as an alternative on top of the common forms of interaction, this way trying to gain user acceptance.
the surroundings and users, to autonomously perceive the user's needs and desires, and to react flexibly and proactively. Future dialogue systems are thus endowed with perceptive skills from different sensory channels (vision, hearing, haptics, etc.) to capture the spatial, temporal, and user-specific context of an interaction process. Elaborate conversational skills are required to be able to capture and analyse spoken, gestural, as well as emotional utterances. Integration of perception, emotion processing, and multimodal dialogue skills in interactive systems is expected to improve not only human-computer communication but also human-human communication over networked systems.
There is further an increasing demand for flexibility in terms of the number of users that are able to take part in the interaction. A system could this way, for instance, assist a group of users already during the decision process by providing information and immediately reporting problems, thus accelerating the task-solving process. Interaction between several humans and possibly also several computers will thus become possible, integrating the dialogue system as an independent dialogue partner in the conversation.
The research presented in this book focuses on a dialogue system of this kind: the system resembles an independent dialogue partner. It interacts with two users and engages proactively in the conversation when it considers it necessary with respect to advancing the task-solving process in the best possible way. A description of the system and the objective of this book is presented in detail below, after a look at related work on advanced dialogue systems.
1.2 Related Work on Advanced Dialogue Systems
Various research projects investigate the possibilities that open up by enriching multi-party interaction and advanced dialogue systems with the perception of the users' state, context and needs. Most of the research on multi-party interaction at present is concerned with the meeting scenario, as it can benefit greatly from the use of intelligent computer systems which enhance and assist the human communication during (and also after) the meetings. Great effort is put into the design and development of adequate software tools for meeting support and into investigating multi-party interaction phenomena. Meeting assistants can be deployed as meeting browsers or summarisers, i.e. they obtain information about the course and content of a meeting. They can be used, for example, during the meeting to assist participants who have come late, summing up what has been said and who has committed to what. In the same way, easy and fast access to the meeting content is enabled at a later point in time. An example of a tool of this kind is the meeting browser developed in the framework of the Augmented Multi-Party Interaction (AMI) project [Renals, 2005] (and its successor AMIDA). The aim is to develop new multimodal technologies in the context of instrumented meeting rooms and
remote meeting assistants, e.g. a meeting browser which enables browsing through videos and transcripts of the meetings. A second category of meeting assistants denotes tools that directly interact in the meetings. Exemplary tools to support and guide meetings in organisational, informational, and social aspects are developed in the scope of the Neem project [Barthelmess and Ellis, 2005]. Kwaku, for instance, is a virtual meeting partner which is equipped with emotion and personality and displayed using animation and speech output. It performs organisational tasks, such as watching over the time spent on certain agenda items and reminding the participants proactively to go on to the next point, if necessary.
Another example which aims at developing tools that ease human life is the project Computers in the Human Interaction Loop (CHIL) [Stiefelhagen et al., 2004]. It focuses on developing environments in which computers serve humans, giving them more freedom to concentrate on the interaction with other humans by minimising the attention that has to be spent on operating the computer systems. One of the tools developed in the scope of CHIL is the Connector [Danninger et al., 2005], which perceives the activities, preoccupations, and social relationships of its users in order to determine their disposability and the appropriate device for communication. Another tool is the Memory Jog, a context- and content-aware service providing the user with helpful background information and memory assistance related to an ongoing event or other participants, as a sort of personal assistant. While these systems aim at assisting the users with helpful background information, none of them has the goal of getting involved in the conversation as an independent interaction partner.
1.3 The Computer as a Dialogue Partner
The presented research aims at the development of a spoken language dialogue system which acts as an independent dialogue partner in the conversation with two human users. In the following, the system is presented in detail, including a description of the key characteristics that define this novel system. The system proactively takes the initiative when required by the conversational situation and gets meaningfully involved in the conversation. When it is not needed any more, e.g. when the task is solved, the system withdraws. At the beginning, the system silently observes the conversation between the dialogue partners, capturing the relevant conversational context, and detects whether the current discourse falls within the specified domain. It becomes active, in terms of paying close attention and modelling what the users say, as soon as the dialogue partners come to speak about its specified domain. Thus, at the point in time of proactive or reactive interaction, the system already knows what the users have spoken about and can directly participate in the ongoing conversation instead of starting from scratch. The system eventually reacts to interaction requests of the users, who signal their intention to communicate by turning their attention towards the system, either explicitly by addressing
the system directly or implicitly by looking at it. To facilitate the development of the system, the setup is currently limited to allowing only one user, the so-called main interaction partner, to pose direct interaction requests towards the system.
We adopt the example domain of restaurant selection to show the system's functionality. In this case, the system becomes active when the users come to speak of wanting to go out to eat. The system notices the topic change and immediately 'listens' attentively to the users discussing their preferences for choosing an appropriate restaurant. The system stores all relevant data in the dialogue context and history. Starting with its first interaction, the system assists the users in finding a suitable restaurant by performing database queries based on the users' preferences and dialogue context and by providing information about the restaurants. The system is displayed in the form of an avatar which moves its lips and talks through synthesised speech output. A restaurant's menu, city map or local bus schedule can further be presented on the screen when suitable. A short interaction example is shown in Table 1.1. The most significant characteristics of our system are described in detail in the following.
U1: Look at this weather, isn't it beautiful!
U2: Absolutely. Why don't we leave it for today?
U1: Good idea. I'm getting hungry anyway.
U2: Do you want to go eat something? I feel like a huge pizza!
U1: Sounds great! Let's go somewhere where we can sit outside. [looks at computer] Is there anything over in the park maybe?
S: Unfortunately not. However, there is Pizzeria Napoli. It's close to here and has a patio.
U1: That sounds good. What's the address?
S: Pizzeria Napoli is situated on Forest Avenue number fifteen. Would you like to see the menu?
U1: Oh yes, please. [Menu pops up on the screen.] . . .
Table 1.1. Example dialogue [Strauß, 2006].
Multi-Party Interaction Setup
The interaction the system encounters is that of a multi-party scenario, i.e. more than two dialogue participants (in short: DPs) are engaged in a conversation, also called a multi-party dialogue. In our setup, the three dialogue partners interact with each other in different ways. Figure 1.2 shows a simple
communication model. The system's main user (in the following referred to as user U1) interacts directly with both dialogue partners: the human (U2) and the computer (S). The conversation with U2 is natural; the system is addressed either explicitly by calling it by its name or 'computer', or implicitly by looking at it.
Fig. 1.2. Interaction model of the dialogue system.
The interaction between the second user U2 and the computer is by definition indirect. The system does not react when U2 addresses it with a request, although it hears and (hopefully) understands the utterances3. Vice versa, it cannot be expected that U2 reacts to the system's actions, although they are equally perceived by both users. The only difference lies in the physical access to the computer: U1 is situated directly in front of the screen and can thus be assumed to feel spoken to, while U2, lacking this direct physical access, could easily feel like a side participant. The interaction setup is taken up again from the linguistic viewpoint in Section 2.3.
One may ask why we chose a setup that privileges one user, and whether a different example scenario involving two users who are not equal from the system's point of view might be better suited, e.g. a doctor and patient or expert and customer scenario. For such a scenario it is imaginable to transmit information which is not intended for user U2 via some kind of secret channel that U2 has no access to. In our example domain of restaurant selection, however, the information is intended for both users equally. The restriction to one user as the system's main interaction partner is made for complexity reasons, as the system is expected to recognise when the user looks at it as a form of interaction request. The office environment, with the system running on the main user's personal computer, further supports this scenario.
3 However, the system's proactive and cooperative behaviour would still lead to a system interaction if a pause longer than a certain threshold occurs after the request and the system has something meaningful to contribute.
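The addressing rules described above, including the proactive fallback of footnote 3, can be sketched as a simple decision function. The function signatures, the keyword set and the threshold value are assumptions made for illustration, not the actual implementation of the system:

```python
# Sketch of the interaction rules: the system reacts to direct requests
# only from the main user U1 (explicit naming or gaze), while a request
# by U2 can at most trigger a delayed proactive contribution (footnote 3).
# All names and the threshold value are invented for illustration.

SYSTEM_NAMES = {"computer"}  # the system's own name would be added here
PAUSE_THRESHOLD = 2.0        # seconds of silence before a proactive turn

def is_direct_request(speaker: str, words: list, gaze_at_system: bool) -> bool:
    """Only U1 can pose direct requests, by naming the system or looking at it."""
    explicit = any(w.lower() in SYSTEM_NAMES for w in words)
    return speaker == "U1" and (explicit or gaze_at_system)

def system_should_speak(speaker: str, words: list, gaze_at_system: bool,
                        silence: float, has_contribution: bool) -> bool:
    if is_direct_request(speaker, words, gaze_at_system):
        return True  # reactive interaction with the main user
    # indirect request (e.g. by U2) or no request at all: only a proactive
    # turn after a long enough pause, and only with something to contribute
    return has_contribution and silence >= PAUSE_THRESHOLD
```

For example, a request by U2 would not be answered immediately, but after a pause exceeding the threshold the system would step in if it has something meaningful to contribute.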
Proactiveness
One of the most important character traits of an independent dialogue partner is its independence: system interaction should not only rely upon interaction requests by the users but should also be possible on the system's own initiative. Proactiveness can be defined as the dialogue system taking the initiative to control a situation instead of waiting to respond reactively to something after it has happened. A system should operate independently and anticipatorily, always keeping an eye on its goals. Proactive involvement in the dialogue also implies that the system withdraws when it is not needed anymore, e.g. when the task is solved or the users have permanently switched to a different topic.
Proactive interaction behaviour requires complete understanding of the conversation and its context at any point in time that is relevant for the task-solving process. Thus, in contrast to conventional dialogue systems that become active with the system's first interaction, our envisaged system has to start modelling the dialogue before its first interaction. Only then can the system make meaningful and proactive contributions. An extensive dialogue history and context modelling is thus required to keep track of the complete dialogue (refer to Section 4.3.3).
Different Levels of Attentiveness
The envisaged dialogue system is an always-on system, i.e. it is not turned on when needed but constantly running in the background. It becomes active and interactive when certain criteria are met (as described below). The system takes on the following levels of attentiveness:
• Inactive: While the human users talk about anything but the specified domain, the system does not pay full attention. It 'overhears' the conversation and waits for keywords in order to detect the point in time when the conversation topic changes to its specified domain.
• Active: As soon as the system recognises certain keywords, such as 'hungry' uttered by U1 in the third utterance of the example dialogue in Table 1.1, it assumes that the conversation has entered the specified domain and switches over to paying full attention. From that point on, the computer 'listens' actively to the conversation. It analyses the complete utterances, memorises the users' preferences and builds a dialogue history.
• Interactive: When required by the conversational situation, the system gets involved in the conversation, either reactively upon an interaction request or proactively if it has a solution or problem to report, by interacting, task and problem solving and presenting the results. After the task has been solved or the dialogue has permanently moved on to a different topic, the system switches back to inactive mode. At this point, the dialogue history is reset; the system does not deploy a memory across different interactions.
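The levels of attentiveness can be illustrated as a small state machine. The keyword list and the class interface below are invented for illustration; the interactive level and its trigger criteria are omitted, and only the keyword-driven transition from inactive to active plus the final reset are shown:

```python
# Minimal sketch of the attentiveness levels as a state machine.
# Keyword list and interface are illustrative assumptions; the
# 'interactive' level is reduced here to the final reset operation.

DOMAIN_KEYWORDS = {"hungry", "restaurant", "eat", "pizza"}

class AttentivenessModel:
    def __init__(self):
        self.state = "inactive"
        self.history = []        # detailed dialogue history (cf. Section 4.3.3)

    def observe(self, utterance: str):
        words = set(utterance.lower().split())
        if self.state == "inactive":
            if words & DOMAIN_KEYWORDS:        # domain keyword spotted
                self.state = "active"
                self.history.append(utterance)  # start modelling immediately
        elif self.state == "active":
            self.history.append(utterance)      # full analysis, build history

    def interaction_finished(self):
        # task solved or topic permanently changed: back to inactive,
        # no memory is kept across interactions
        self.state = "inactive"
        self.history.clear()

m = AttentivenessModel()
m.observe("look at this weather")   # stays inactive
m.observe("I am getting hungry")    # keyword detected, becomes active
```

A real implementation would of course operate on recognised speech and semantic representations rather than raw word sets.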
1.4 Challenges
The dialogue system development is in the scope of this work limited to the dialogue management component, as this is where the researched qualities of the system described above reside. Efforts on the development of an approach to semantic analysis for the system have been described in [Strauß and Jahn, 2007]. The system further contains text generation and speech synthesis components which are adopted from the WIT tool presented in Section 3.2.4. In detail, the present book elaborates on the following challenges.
• Proactiveness. We introduce a way to enable proactive behaviour, as it denotes a main character trait of an independent dialogue partner. Our system encounters a situation so far unusual for dialogue systems: it steps into an existing conversation held between multiple dialogue participants. The ongoing conversation thus has to be modelled by the system already before its first interaction, so that it can interact at an appropriate point with knowledge about the ongoing conversation. We see the potential to handle this challenge in the dialogue history, which common dialogue systems use mainly to enable backtracking in case of e.g. misunderstanding. We propose to start modelling the dialogue history as soon as the conversation of the users enters the system's specified domain in order to build a complete picture of the dialogue and the users' preferences. We further determine points in the dialogue that are suitable for proactive system interaction according to the analysis of our interaction data.
• Multi-party dialogue management. The main focus of the development of the presented dialogue system is put on the multi-party dialogue management. It is based on a popular approach for two-party interaction which models dialogue as a state of information that is updated with every new utterance. Existing modifications of this approach that allow for multi-party interaction modelling do not suffice for our setup.
We thus introduce an extended and altered version of the approach in order to suit our three-party setup and to allow proactive interaction by side participants.
• Discourse-oriented user constraint prioritisation. Within the functionality of problem solving, a novel approach to prioritising user constraints is introduced. It comes into play when an over-constrained situation occurs, i.e. too many or too restrictive preferences were provided by the users, restraining the database query so that it yields no result. As far as the authors are aware, there is up to now no such approach for multi-party interaction. We make use of the new challenges and possibilities that come with the multi-party setup and integrate the course of the conversation into the decision process to determine the priority of the constraints.
• Multi-party dialogue corpus. As a further important part of this book, the collection of a rich data corpus is described. 76 multi-party dialogues were recorded within an extensive Wizard-of-Oz environment. The corpus, which consists of audio and video data, contains realistic interactions of the users with the envisaged system and thus forms an ideal basis to assist the development of the dialogue system. As far as the authors are aware, there is no comparable corpus available that consists of dialogue recordings in a human-human-computer setup.
• Evaluation. Our presented independent dialogue partner poses a whole new situation for the users. Thus, it is of special interest to us to assess the usability and user acceptance of a novel system of this kind. Usability evaluation is performed on the basis of the questionnaires that the participants of the dialogue recordings filled out. An assessment of user acceptance is performed by analysing the main user's gaze direction, comparing the way the user behaves towards the system versus the other user. A further evaluation considers the system's proactive interaction behaviour.
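The handling of an over-constrained query, the situation in which constraint prioritisation comes into play, can be sketched as follows. The toy database, constraint names and numeric priorities below stand in for the discourse-derived priorities developed later in this book and are purely illustrative:

```python
# Sketch of constraint relaxation for over-constrained queries: the least
# important constraints are dropped until the (simulated) database query
# yields a result. In the approach of this book the priorities would be
# derived from the course of the conversation; here they are fixed numbers.
# Database contents and attribute names are invented.

RESTAURANTS = [
    {"cuisine": "italian", "patio": True,  "area": "centre"},
    {"cuisine": "italian", "patio": False, "area": "park"},
]

def query(constraints: dict) -> list:
    """Return all database entries matching every constraint."""
    return [r for r in RESTAURANTS
            if all(r.get(k) == v for k, v in constraints.items())]

def relax_and_query(constraints: dict, priority: dict):
    """Drop lowest-priority constraints until the query yields a result."""
    active = dict(constraints)
    order = sorted(active, key=lambda k: priority[k])  # least important first
    for key in order:
        results = query(active)
        if results:
            return results, active
        del active[key]  # relax the least important remaining constraint
    return query(active), active

constraints = {"cuisine": "italian", "patio": True, "area": "park"}
priority = {"cuisine": 3, "patio": 2, "area": 1}  # higher = more important
results, kept = relax_and_query(constraints, priority)
```

Here the 'area' constraint has the lowest priority and is relaxed first, after which the query succeeds with the remaining two constraints.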
1.5 Outline of the Book
A short introduction on spoken language dialogue systems was provided in this chapter, followed by a short listing of current trends and ongoing related research on advanced dialogue systems. We introduced the dialogue system that forms the centre of this work and listed the scientific challenges that this book aims to address. The remainder of the book is structured as follows:
Chapter 2 presents fundamentals to provide a background for the subsequent chapters. Section 2.1 provides a description of data collection and corpus development in general and by means of the prominent Wizard-of-Oz method. A short introduction on dialogue system evaluation is given in Section 2.2. Section 2.3 focuses on multi-party interaction, presenting theoretical fundamentals on general linguistic theories and properties of human-human and human-computer interaction. Section 2.4 presents basics on dialogue modelling and dialogue management using the Information State Update approach. Existing multi-party extensions to the approach are discussed in Section 2.4.4.
Chapter 3 focuses on the development of the multi-modal PIT corpus of German multi-party dialogues. Section 3.1 provides a listing of existing multi-party corpora which have been collected over the past years. As none of these corpora comprises the features required for our research, we build our own corpus. The setup and process of data collection using the Wizard-of-Oz method are outlined in Section 3.2, followed by a description of the corpus itself, which consists of 76 multi-party dialogues (Section 3.3). The section is concluded with a description of the annotation and analysis performed on the collected
data.
Chapter 4 presents the proactive dialogue management developed for our dialogue system. Section 4.1 describes the multi-party dialogue modelling as well as a new interaction principle introduced in order to allow for proactive interaction. Section 4.2 introduces our example domain of restaurant selection. The remaining task-related components of dialogue management are presented within the scope of this section. Section 4.3 concentrates on proactiveness-enabling dialogue management strategies and components: our optimistic grounding and integration strategy, the system's interaction strategy, as well as the dialogue history keeping. The functioning of the proactive dialogue management is illustrated using an example dialogue extract in Section 4.4. Finally, a look is taken at the constraint-based problem solving process deployed by our system, in the scope of which a novel algorithm is introduced to prioritise user constraints according to the ongoing dialogue (Section 4.5).
Chapter 5 is dedicated to the evaluation of the system. Section 5.1 presents the usability evaluation based on questionnaires the participants filled out prior as well as subsequent to the data recordings. The recordings were conducted with the simulated system (presented in Chapter 3); the evaluation thus rates the system as it is envisaged to perform in its final state. Section 5.2 presents the evaluation of the system performance, including the evaluation of the discourse-oriented user constraint prioritisation. An assessment of the user acceptance of the dialogue system is presented in Section 5.3. It analyses the behaviour of the main dialogue partner towards the system throughout the interaction in comparison with the behaviour towards the other human dialogue partner. Finally, the system's proactiveness is assessed in Section 5.4. The book is concluded in Chapter 6 with a summary and description of future work.
The appendix contains a more detailed description of the WIT tool (Appendix A). The original German version of the example dialogue listed in Section 3.3.3 is presented in Appendix B. Finally, the part of the questionnaire used for the subjective evaluation as filled out by the participants of Session III recordings (refer to Section 5.1) is displayed in Appendix C.
2 Fundamentals
Dialogue involving two persons has long been a popular topic of research. Dialogue systems involving two participants (the system and the user) have by now been firmly established in everyday life, especially in the field of call centre applications. Multi-party dialogue, i.e. conversation between more than two participants, is on the other hand a rather novel field of research. It arguably started with the reclassification of the then-conventional conversational roles of speaker and hearer (e.g. [Searle, 1969, Austin, 1962]). Clark and Carlson (1982) modified Searle's speech act theory (1969) to cover multi-party interaction, which initiated multi-faceted research in multi-party dialogue and thus also opened the way for multi-party dialogue systems, which have just started to emerge within the last few years.
This chapter presents fundamentals to provide a basis for the work presented in the following chapters. Section 2.1 describes the process of data collection that is performed for corpus development and introduces the prominent Wizard-of-Oz technique used in this context. In Section 2.2 fundamentals on dialogue system evaluation are presented. Section 2.3 elaborates on fundamental properties of multi-party interaction from the linguistic point of view regarding speech acts and speaker roles and takes a further look at particular attributes of human-human and human-computer interaction. All of these aspects are presented in general, followed by a discussion of how each aspect relates to our setup and envisaged dialogue system. Finally, the Information State Update approach to dialogue management (e.g. [Ginzburg, 1996, Larsson, 2002]) is introduced in Section 2.4 to provide the theoretical basis for the dialogue management presented in the scope of our work.
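To give a first impression of the Information State Update idea, a minimal sketch: the dialogue is represented as an information state that update rules modify with every new utterance. The state layout and the single rule shown here are invented for illustration and are much simpler than the information states used in the ISU literature:

```python
# Minimal illustration of the Information State Update (ISU) idea:
# an information state is modified by update rules after each utterance.
# The state fields and the rule below are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class InformationState:
    common_ground: list = field(default_factory=list)  # grounded propositions
    agenda: list = field(default_factory=list)         # pending system moves

def integrate_user_constraint(state: InformationState, constraint: str):
    """Update rule: ground a new user preference and schedule a reaction."""
    if constraint not in state.common_ground:
        state.common_ground.append(constraint)
        state.agenda.append(("acknowledge", constraint))

s = InformationState()
integrate_user_constraint(s, "cuisine=italian")
```

Chapter 2 (Section 2.4) introduces the actual ISU framework on which the dialogue management of this book is built.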
2.1 Corpus Development
P.-M. Strauß and W. Minker, Proactive Spoken Dialogue Interaction in Multi-Party Environments, DOI 10.1007/978-1-4419-5992-8_2, © Springer Science + Business Media, LLC 2010
Dialogue data can be collected in various ways. What kind of data is to be collected and in which way depends mostly on the purpose of the data collection, i.e. what the data is to be used for. The data can be used to analyse dialogue
in general or to perform user behaviour studies, i.e. to find out how conversations about a certain topic in a certain setup proceed, as is the case when the data is used in the process of developing a dialogue system. Another important purpose is data material acquisition, i.e. collecting data to be used to train statistical models at all processing stages of the SLDS.
Dialogue data can be classified according to different characteristics. It can for instance be distinguished in terms of the dialogue participants: Are only human dialogue partners involved, or humans as well as computer systems? Do the users who take part in the recordings resemble the future user group? Will the recorded dialogues be real, realistic or artificial conversations? Recording real dialogues, i.e. conversations the recorded people would also hold if they were not being recorded, yields the most authentic data. In a real setting and situation, people act and interact out of their own motivation to achieve a certain goal. In artificially set up recordings, participants often receive some sort of compensation, such as money, cinema tickets or credits at university, which might present a greater motivation than their inner drive to complete an artificial task. A behaviour change is also described by the Hawthorne effect [Adair, 1984], which reports on phenomena that occur when people are aware of being under observation, e.g. of being recorded by camera and microphone: they act differently from how they would normally act in the same circumstances if they were not under observation.
Large amounts of data already exist and are freely available. TV shows, news recordings, radio programmes and audio books are only a few examples of available speech data. Whether these data can be used depends greatly on the intended purpose. While such material features a wide range of interaction amongst humans, it is for instance not suited if the interaction is to include a specific dialogue system.
A further dimension is the recording procedure and equipment used. It can range from a simple recording device with integrated microphones to high-class equipment, optionally complemented by webcams or video cameras. What equipment should be used depends generally on the purpose as well as the setup and surroundings. The data can be collected in 'field recordings', i.e. realistic surroundings where this sort of conversation would normally take place, or in a laboratory. The laboratory could be made to represent a realistic setting, e.g. an office or an intelligent home. Finally, what kind of technical realisation of the recordings is needed? Conversations between humans can be recorded without great technical effort. If the recordings are to include a dialogue system, however, a system has to be deployed in order to obtain realistic data. If the data is needed in order to build the system, the famous chicken-and-egg problem is encountered. A popular approach to overcome this problem is the Wizard-of-Oz approach, which is described in the following.
Wizard-of-Oz Approach
The Wizard-of-Oz (WOZ) technique for data recordings is a popular procedure to obtain realistic interaction data (e.g. [Dahlbäck et al., 1993]) with a system which does not yet or only partly exists. The technique is used to collect data in order to help develop the final dialogue system and at the same time enables system evaluation at an early development stage. The name Wizard-of-Oz derives from the children's tale [Baum, 1900] about the 'man behind the curtain', as the setup is similar to the situation in the story. A human, the so-called wizard, simulates a dialogue system (or essential components thereof) that interacts with the human users like the envisaged system. Ideally, the users do not notice the simulation and behave as if they were interacting with an automatic system rather than a human. The wizard is thereby situated in a different room (or some other hidden place).
The behaviour of the simulated system has to be as close as possible to the behaviour of the envisaged system to obtain realistic data. Besides correctness of the simulation, another very important point is speed: if the users are to believe they are interacting with a real system, they will not accept long reaction times, as these are not typical for computer systems. The interface the wizard uses for the interaction thus has to ensure easy and quick usage. Equally, deterministic behaviour has to be assured, i.e. the system always has to react the same way under the same circumstances. Thus, in order to assure quick and deterministic system responses, the system is generally already partly functional. For instance, the application backend could already be functional (e.g. using the module at an early stage of development or a prototype with the same functionality) in order to assure deterministic behaviour in the system output. Modules that are more difficult to develop or that require the data to be collected for their development (e.g.
speech recognition and language understanding modules) are simulated by the wizard. Besides assisting in the development process, WOZ recordings are a powerful tool for evaluation early in the development process. It can save a lot of effort and expense if potential problems and necessary changes are identified before the actual implementation and continuously throughout the development process. This is facilitated by the fact that users do not interact with a prototype of limited functionality but usually with the (partly simulated) envisaged system in its full extent. Evaluation can be performed via observation of the interaction and also by subjective evaluation of the system's usability in the form of interviews or questionnaires the users fill out after interacting with the system. An introduction to these evaluation methods is given in the following section. Besides audio recording functionality, the setup can further be equipped with all sorts of logging mechanisms and cameras to record the behaviour of the users in more detail and thus allow further analysis and evaluation of the interaction.
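The requirement of fast, deterministic wizard behaviour can be illustrated with a minimal sketch. The semantic frames and canned utterances below are invented for this example; a real WOZ tool would offer the wizard a richer interface. The point is that mapping every wizard choice to a fixed response table keeps the simulated system's output quick and identical under identical circumstances.

```python
# Hedged sketch of a Wizard-of-Oz response interface; frames and canned
# responses are invented placeholders, not part of any actual WOZ tool.

RESPONSES = {
    # (dialogue act, topic) -> canned system utterance
    ("request", "restaurant_list"): "I found three Italian restaurants nearby.",
    ("confirm", "booking"): "The table has been booked for you.",
    ("unknown", None): "I am sorry, could you please rephrase that?",
}

def wizard_respond(act, topic=None):
    """Return the canned response for the semantic frame chosen by the wizard."""
    return RESPONSES.get((act, topic), RESPONSES[("unknown", None)])
```

Because the table is fixed, the same wizard choice always yields the same system output, which is exactly the deterministic behaviour demanded above.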
2 Fundamentals
Our aim is to obtain realistic dialogue data to analyse the interaction in a novel multi-party setup. No such data is currently available for analysis (refer to Section 3.1). Thus, we perform data collection using the presented WOZ method, which perfectly suits our needs. We obtain realistic data as the human users believe they are interacting with a fully working system and thus behave in a realistic way. The WOZ setup and recording procedure are presented in Section 3.2. Besides assisting the development by providing interaction models, the WOZ approach further enables us to evaluate the envisaged system already at this early stage of development to assess user acceptance.
2.2 Evaluation of Spoken Language Dialogue Systems

Evaluation of SLDS can be differentiated into subjective and objective evaluation [Möller, 2005]. Subjective evaluation measures factors such as usability and attractiveness of a dialogue system using perceptive judgements obtained from a subjective (i.e. the user's) point of view. Usability evaluation measures the quality of an interactive system in terms of usefulness and user-friendliness. Interactive systems are used to manipulate the user's world. Usability evaluation rates the way in which this manipulation takes place, i.e. how useful and usable the system, and especially its design, is to a user. According to Nielsen (1994), usability consists of five quality attributes: learnability, efficiency, memorability, errors and satisfaction. These components are said to be precise and measurable. However, in the evaluation of novel interactive systems such as the one presented in this book, pragmatic quality, which describes accurate functioning and design, does not cover all aspects that influence a subjective rating of the interaction, i.e. the user assesses for herself whether the product satisfies her needs personally. Thus, further quality characteristics such as stimulation and identity also play an important role. Objective evaluation aims at rating and predicting system performance and quality by analysing interaction parameters collected from the dialogues. Examples of interaction parameters are task success, recognition performance or the number of timeouts, which can be extracted from the dialogues at runtime or afterwards from the transcribed dialogues. These measures are sometimes also used during the ongoing dialogue, e.g. to choose an adequate dialogue strategy.
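The extraction of such interaction parameters from a logged dialogue can be sketched as follows. The event format (dictionaries with 'type' and 'duration' keys) is an assumption made purely for this illustration.

```python
# Minimal sketch of extracting objective interaction parameters from a
# dialogue log; the event schema is invented for this example.

def interaction_parameters(log):
    """Compute simple objective measures from a list of logged events."""
    return {
        "n_turns": sum(1 for e in log if e["type"] == "turn"),
        "n_timeouts": sum(1 for e in log if e["type"] == "timeout"),
        "total_time": sum(e.get("duration", 0.0) for e in log),
        "task_success": any(e["type"] == "task_success" for e in log),
    }
```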
A common method for objective dialogue system evaluation is the PARAdigm for DIalogue System Evaluation (PARADISE) [Walker et al., 1997, Kamm et al., 1999, Walker et al., 2000] developed by Walker and colleagues, which predicts user satisfaction using task success rates and cost functions. The goal is to maximise user satisfaction by maximising task success while minimising cost. Fully automatic metrics are furthermore extractable in real time, enabling the system to adapt based on its assessment of its current performance. Möller (2005) studied in what ways subjective and objective measures correlate. Different evaluation methods are evaluated and compared to find
only moderate correlations between subjective quality judgements and interaction parameters. Thus, automatic analysis such as that performed in the PARADISE framework cannot replace subjective evaluation. Möller concludes by noting that 'subjective tests persist as the final reference for describing quality' [Möller, 2005, p.311]. For our evaluation presented in Chapter 5, we do not use the PARADISE framework as our focus is a different one. We aim at appraising the novel features our dialogue system possesses, which is primarily achieved through subjective usability evaluation that is able to assess the system as a whole and how it comes across to the users.

The process of collecting data for evaluation can be performed in different ways (e.g. [Bortz and Döring, 2006, Diekmann, 2007, Beywl et al., 2007]). Empirical social research identifies three main evaluation methodologies [Beywl et al., 2007]: content analysis, observation and survey, as shortly described in the following.

• Content analysis is a method that analyses the content of data by breaking it down into components which are then assigned to a system of categories.
• Observation can be used to collect verbal or non-verbal actions. An observer observes a certain process and classifies the observed behaviour according to certain categories. The observation can be performed in different ways: the observer can be concealed or visible to the observed persons; the observer can be participating in the interaction or not; the interaction can be natural or artificial, etc.
• Survey is an evaluation method used to collect opinions and attitudes of the respondents. Surveys can be carried out in written or oral form. They can be performed via different communication channels (telephone, direct, online, mail etc.) and in the form of questionnaires or interviews. Many different forms of questions can be used (e.g. open or closed) depending on the aim of the question, i.e. what shall be achieved.
The evaluation of our dialogue system uses all three of the presented techniques. Content analysis is used to examine the interactions with respect to design decisions on dialogue modelling and dialogue management issues, as described in Section 3.3. The observation technique is used for quantitative analysis of the interactions. The video recordings of the data are analysed in terms of the gaze direction of the main interaction partner, as described in Section 3.3.2. The survey technique is deployed in the form of questionnaires that are filled out by the participants before and after the interaction. The design of the questionnaires used for our study is described in Section 5.1.1. The evaluation results are presented in Chapter 5.
2.3 Multi-Party Interaction

Dialogue can be classified by how many participants are involved in the conversation. In monologue, only one person is speaking; all other persons present form the audience. In dialogue, two persons are involved in the interaction, taking turns one after the other. If one person is speaking, the other person is addressed and listening and will take the next turn, during which the first speaker will in turn be addressed and take the subsequent turn, and so on. Thus, the interaction pattern is clearly defined. In multi-party dialogue, more than two speakers are involved in the conversation as active participants. It is not self-evident who will take the next turn, as there are usually no strict interaction patterns. Multi-party dialogue is thus more flexible with respect to turn-taking, as anyone could possibly take the next turn, even if this person was the addressee in the previous turn. In the following, a closer look is taken at linguistic fundamentals such as speech act theory and grounding. This is followed by a discussion of the dialogical roles which participants adopt in a conversation. The last part of the section considers the different combinations of dialogue participants that can occur in an interaction, namely human-human and human-computer.

2.3.1 Speech Acts and other Linguistic Fundamentals

Austin phrased the fundamental linguistic question of what people do with words as 'How to do things with words' [Austin, 1962] and with it brought about speech act theory (see also, among others, [Searle, 1969, Searle, 1975, Bach and Harnish, 1979, Grice, 1968]). He noted that a speaker says something not just for the sake of speaking itself but rather with the intention of communicating something, i.e. pursuing a certain goal. The act of speaking is therefore divided into three different speech acts: the locutionary, illocutionary, and perlocutionary act. The speaking itself, i.e.
uttering words or even meaningful sounds, is called the locutionary act. The illocutionary act describes the form in which the speaker intentionally utters the words, e.g. asserting, giving an order, asking a question, or promising something, to name just a few. What the speaker wants to achieve with the uttered words is denoted by the perlocutionary act, which can e.g. be an action or a change of a state of mind. An illocutionary act is thus successful if the designated effect is achieved. As an example, consider the utterance "Shut the window". It clearly shows the intention of the speaker to be understood by the hearer as an order (illocutionary act) and further intends that the addressee should close the window (perlocutionary act). The illocutionary act (the order) is successful if the addressee closes the window. It would not be successful if the addressee ignored the request. Hence, for successful communication it is necessary that the addressee understands the speaker's intention and behaves cooperatively. Conversation can thus be characterised as a joint action of speaker and addressee [Clark, 1996].
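The three-way division of the window example can be made explicit in a small sketch. The class and attribute names are our own illustration of the theory, not an established formalisation.

```python
# Sketch: representing the locutionary, illocutionary, and perlocutionary
# components of one speech act; names are our own illustrative choices.

class SpeechAct:
    def __init__(self, locution, illocution, perlocution):
        self.locution = locution        # the words actually uttered
        self.illocution = illocution    # the intended force, e.g. an order
        self.perlocution = perlocution  # the intended effect on the addressee

    def successful(self, observed_effect):
        """An illocutionary act succeeds if the designated effect is achieved."""
        return observed_effect == self.perlocution

order = SpeechAct("Shut the window", "order", "window closed")
```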
Clark and Carlson (1982) were the first to recognise a shortcoming in the speech act model. The theory is limited to speaker and hearer [1]. It does not differentiate between different types of hearers. If a conversation is held among more than two persons, the conversational roles of speaker and hearer might not be sufficient to describe all participants. In a circle of three dialogue participants (DPs), for instance, one speaker gives an order to another DP, addressing her directly by her name: "Anna, please shut the window". The third participant (i.e. neither Anna nor the speaker) is also part of the audience and is assumed to hear what has been said; however, only the addressed person, Anna, is expected to perform the perlocutionary act of closing the window. Thus, Clark and Carlson replace hearer with addressee and argue that a speaker performs two illocutionary acts with each utterance: the traditional illocutionary act, which is directed at the addressee (or addressees), and additionally an informative act, which is directed at all participants of the conversation, addressees and side-participants alike (refer to the section on dialogical roles below for a definition of the dialogical roles).

Principle of Responsibility

Clark and Carlson (1982) define the Principle of Responsibility, which indicates that every participant of a conversation is 'responsible at all times for keeping track of what is being said, and for enabling everyone else to keep track of what is being said' [Clark and Carlson, 1982, p.334]. A speaker therefore has to design his or her contribution according to the audience, i.e. according to who is known to be listening and who might be listening, possibly unintended [2]. 'Speakers design their utterances to be understood against the common ground they share with their addressees - their common experience, expertise, dialect, and culture.' [Clark and Schober, 1989, p.211].
The speaker then 'presupposes the common ground already established; and all the parties, the speaker included, add what is new in that contribution to their common ground.' [Clark and Carlson, 1982, p.334]. The common ground can differ immensely between different people, which 'ordinarily gives addressees an advantage over overhearers in understanding. Addressees have an additional advantage [...] because they can actively collaborate with speakers in reaching the mutual belief that they have understood what was said, whereas overhearers cannot.' [Clark and Schober, 1989, p.211]. However, the principle of responsibility places not only the responsibility of audience design on the speaker but also a responsibility on the audience. It indicates that all participants have the task to constantly keep track of the conversation to build on their common ground. If they fail to understand,
[1] In some literature, hearer is referred to as audience [Grice, 1968] or addressee [Clark and Carlson, 1982].
[2] Potential unintended listeners might, for example, cause the speaker to use techniques like concealment [Clark and Schaefer, 1987].
they are expected to pose clarification requests in order to be able to resolve everything fully.

Grounding

The common ground denotes the shared basis of knowledge between dialogue partners. Grounding is the process of adding to this common ground. Due to differing previous knowledge and experience, different common grounds exist between different dialogue partners. Accordingly, Branigan (2006) claims that in multi-party dialogue there is no common common ground but rather several common grounds between the pairs of participants. The common ground between the speaker and a side-participant is a partial common ground, as it could contain utterances that are not fully grounded, and there is no necessity that full grounding will take place in the further course of the dialogue. In contrast, the real common ground between the speaker and the direct addressee(s) deserves the name due to the fact that an addressee has the possibility and the responsibility to ground everything fully. However, in a conversation all utterances are witnessed by all DPs, and it is the informative act introduced above that informs all participants of the illocutionary act performed towards the addressees. Thus, it is this informative act which is introduced into the common ground, in the same way for all DPs. As a consequence, in dialogue modelling a single common ground can generally be modelled for all DPs. Clark and Schaefer (1989) claim that an acknowledgement is required for dialogue participants to perform grounding. Being informed about the occurrence of an act does not imply that it has been understood, and certainly not by all DPs in the same way. Depending on the nature of the dialogue, there are different kinds and requirements of acknowledgement or acceptance. In two-party dialogues, acceptance is always of an individual nature, i.e. there is only one addressee and it has to be this addressee who accepts or acknowledges the move.
In the multi-party case, however, three different kinds of acceptance can be distinguished: individual, communal or distributive acceptance [Kronlid, 2008]. Individual acceptance takes place if there is only one addressee, corresponding to the two-party case. Communal acceptance means that if there are several addressees, one of them accepts as a sort of spokesperson for all addressees [3]. Distributive acceptance denotes that each addressee gives feedback individually. The question is thus which sort of acceptance is expected and required in each case. Can any utterance be considered grounded after communal acceptance? When is distributive acceptance required? There is of course no single valid answer; it depends greatly on the nature of the question.
[3] Differing opinions are not considered and would require at least as many responses as there are opinions in the group of addressees.
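The three kinds of acceptance distinguished above can be stated as a simple predicate. The function is our own illustration of the definitions from [Kronlid, 2008], not part of that work.

```python
# Sketch: has an utterance been accepted, given who responded?
# "kind" names follow the text; the predicate itself is illustrative.

def is_accepted(kind, addressees, responders):
    responders = set(responders) & set(addressees)
    if kind == "individual":    # single addressee, as in two-party dialogue
        return len(addressees) == 1 and len(responders) == 1
    if kind == "communal":      # one addressee accepts as spokesperson
        return len(responders) >= 1
    if kind == "distributive":  # every addressee responds individually
        return set(addressees) == responders
    raise ValueError("unknown kind of acceptance: %s" % kind)
```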
Further, different grounding strategies can be deployed. An optimistic grounding strategy assumes acceptance and performs grounding without waiting for explicit feedback from all addressees. However, grounding might in this case be performed hastily. A pessimistic grounding strategy, on the other hand, waits for all feedback before an utterance is considered grounded. Larsson (2002) defines a third strategy, a cautiously optimistic grounding strategy, which follows the optimistic strategy but enables rollback in case grounding has been performed hastily. In our setup we deploy one single common ground to which all of the participants contribute. We argue that previous common ground, which might exist between the users if they are acquainted, is not (and cannot be) considered. Only the ongoing conversation is relevant for the aim of the dialogue, namely to find a solution that fits all dialogue participants best. Thus, the dialogue participants - even while they are side-participants - are expected to naturally follow the principle of responsibility, as finding the best solution is also their goal in the dialogue. As discussed thoroughly in Section 4.3.1, we deploy an optimistic grounding strategy, which relieves us of the need to classify utterances according to their individual, communal, or distributive nature.

2.3.2 Conversational Roles

A participant takes on certain roles in a dialogue. Dialogical roles denote the part a person plays in the dialogue, such as speaker or addressee. Social roles indicate the social position or relationship between the dialogue partners, such as roles within families and professional settings or the level of acquaintance. Further, task roles describe the position a dialogue participant takes in the dialogue in order to achieve a certain task. These roles are often assigned by profession, e.g. a person adopts a role as a moderator or as a judge in a trial, etc. (e.g. [Traum, 2004]).
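Returning briefly to the grounding strategies discussed in Section 2.3.1: Larsson's cautiously optimistic strategy can be sketched as follows. The data structures are our own invention for this illustration; facts are grounded immediately but remain marked tentative until confirmed, so they can be rolled back if grounding turns out to have been hasty.

```python
# Sketch of cautiously optimistic grounding with rollback (illustrative).

class CommonGround:
    def __init__(self):
        self.facts = []       # the current common ground
        self.tentative = []   # optimistically grounded, awaiting feedback

    def add_optimistic(self, fact):
        """Ground immediately, without waiting for addressee feedback."""
        self.facts.append(fact)
        self.tentative.append(fact)

    def confirm(self, fact):
        """Positive feedback arrived: the fact is no longer tentative."""
        if fact in self.tentative:
            self.tentative.remove(fact)

    def rollback(self, fact):
        """Undo grounding that was performed too hastily."""
        if fact in self.tentative:
            self.tentative.remove(fact)
            self.facts.remove(fact)
```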
Below, a closer look is taken at dialogical and social roles.

Dialogical Roles

Clark and colleagues [Clark and Carlson, 1982, Clark and Schober, 1989, Clark, 1996] make standard speech act theory capable of handling multi-party situations by dividing the hearers into addressees, participants, and overhearers. Participants thereby denote the audience intended by the speaker, divided again into addressees and side-participants (who are currently not addressed). Overhearers are those who are not ratified participants. From the sociological viewpoint, Erving Goffman (1981) distinguishes two types of unintended listeners, eavesdroppers and overhearers: 'Correspondingly, it is evident that when we are not an official participant in the encounter, we might still be following the talk closely, in one of the two socially different ways: either we have purposely engineered this, resulting in "eavesdropping",
or the opportunity has unintentionally and inadvertently come about, as in "overhearing"' [Goffman, 1981, p.131-132]. In the dyadic case, dialogue participants adopt the dialogical roles of speaker and addressee in rotation as turns are taken. In multi-party interaction, however, each participant can take on various roles depending on the circumstances of the conversation. Interaction patterns cannot be clearly defined. With the dialogical role adopted, a participant takes on certain responsibilities. As an addressee, a participant is expected (besides following the principle of responsibility described above) to ground and to respond, whereas as a side-participant one is expected to ground but not to respond, and as an eavesdropper one is expected not to reveal one's presence. However, it cannot be taken for granted that the intended addressee takes the next turn, as a side-participant or any other participant may take the next turn instead. Instead of the strict interaction patterns a dialogue system can easily adhere to in the dyadic case, in the multi-party situation the system has to be flexible in order to change roles, always adapting to the current situation of the conversation. According to the role the system adopts in the dialogue, it is expected to act accordingly. For instance, if it is an addressee it is expected to respond in the following turn. If, however, a different participant steals the turn, the system has to wait for a new chance to speak, and during this time the obligation to respond might even have dissolved. In our setup, the dialogical roles the system can adopt depend first of all on the state of attentiveness the system is currently in. Before the dialogue participants come to speak of the specified domain, they are engaged in dyadic dialogue. The system is in its inactive state and scans the conversation for a familiar trigger. The users are only partially aware that the system is listening. The second participant might not know about the system at all.
The main interaction partner might have forgotten about the system for the moment, as it has not been active. It cannot be assumed that the main interaction partner designs her utterances with the system in mind at that stage. From the viewpoint of the main interaction partner, the system is an intended listener and could thus be assigned the role of a side-participant; however, this does not necessarily reflect the viewpoint of the second participant. The remaining roles of eavesdropper and overhearer in Goffman's definition are reserved for unintended listeners. As it comes closest to the actual situation in the literal sense, we nevertheless define our system to adopt the role of an overhearer during the inactive state. A similar argumentation holds during the next phase of the dialogue, while the system is in its active state. The system is now closely following the conversation and is thus said to adopt the role of a side-participant, although it does not get involved in the conversation. The second interaction partner might not be aware of the presence of the system at first. However, the role of the side-participant is justified by the fact that the main interaction partner could at any time address the system with a request, which means that she is
Fig. 2.1. Interaction model of the dialogue system.
aware of the system's presence and is thus also expected to include it in her audience design. During the subsequent phase of the dialogue, when the system has joined the conversation, both users are aware of the system as an interaction partner. The three interaction partners thus switch between speaker, addressee and side-participant, with the restriction that by definition only the main interaction partner is supposed to address the system directly. This, however, has no further effect on the role distribution. The 'indirect' interaction between the second user (U2) and the system poses an interesting situation. A model of the interaction between the users and the system is depicted in Figure 2.1. As user U2 is not allowed to directly interact with the system, the system (S) is not obligated to respond when U2 is the speaker. However, U2 might quickly realise (as can be observed in the recorded dialogues) that the system understands him, and perform his audience design accordingly, in that he addresses the system without the expectation of receiving a direct response or includes the system as a side-participant.

Social Roles

The situation in which the communication takes place has a substantial impact on the conversation. What are the communication circumstances? Who is talking to whom? What is the relationship between the dialogue participants? The effect on dialogue of observable differences in status between dialogue partners is confirmed by the theory of status organising processes [Berger et al., 1980]. In Western cultures, people with higher status tend to speak more, in dyadic and multi-party dialogues alike. Thus, the social roles the participants take on in a conversation have an impact on it. People act differently depending on which roles they hold in a conversation, in general and towards the other DP(s).
The dialogue partners can either be at an equal social level or at different levels, putting one dialogue partner in a superior position, as found e.g. in a conversation between an adult and a child or between two adults at unequal professional levels, as encountered in an employer-employee relationship. The influence of
the social factor on the dialogue context has been widely acknowledged; e.g. Bunt (1994) has introduced the social context as part of the complete dialogue context, and Traum (2004) describes specific task roles which relate dialogue participants in certain ways. In order to obtain a wide range of different (such as superior and inferior) behaviour in our corpus, we randomly assigned different roles and scenarios to the dialogue partners during the Wizard-of-Oz recordings. These included e.g. employer and employee, lovers, business colleagues, or friends.

2.3.3 Human-Human and Human-Computer Interaction

This section is dedicated to human-human and human-computer interaction. Both types of interaction are first considered separately and discussed from the viewpoint of our own setup, which combines both types. Finally, research comparing the two types is presented.

Human-Human Interaction

Face-to-face interaction with other human dialogue partners is for humans the most natural and comfortable way of communicating. Presumably, it is also the fastest and most efficient, e.g. due to the human ability to resolve ambiguity by interpreting paralinguistic phenomena of communication such as emotions and facial expressions of the other dialogue partner. Semantic content is said to make up only seven percent of a message [Bolton, 1979]; the remaining 93 percent is attributed to nonverbal communication. Human dialogue is normally held in order to achieve a certain goal. Each dialogue partner might thereby have her own goal, different from the other's; however, both dialogue partners work together towards achieving their goals cooperatively, which is what characterises human conversation. Human-human interaction follows certain patterns. In dyadic dialogue, turn-taking, as mentioned above (refer to Section 2.3), has the participants switch between the two possible dialogical roles of speaker and addressee.
A further characterisation of interaction patterns concerns the purpose of an utterance, i.e. the combination of occurring dialogue acts. An initiating dialogue act induces a reaction from the other dialogue partner in the form of another dialogue act. Such initiation-response pairs are called adjacency pairs, as they are likely to occur adjacent to each other. Prevalent examples of adjacency pairs are question and answer or greeting and counter-greeting. Depending on the sort of interaction, a dialogue adopts a specific structure, i.e. certain adjacency pairs occur regularly. Knowledge about this structure is necessary in order to understand and model the dialogue and thus facilitates the design of a dialogue system which is to understand conversation between humans. Due to the nature of the task-oriented dialogue we consider, all of our interactions follow a similar pattern. It can be observed that the
same phenomena occur regularly and predictably, e.g. the structure of the dialogues is very similar and the same adjacency pairs occur. This information is taken into account, e.g. in terms of choosing a suitable dialogue act tagset (refer to Chapter 3.3.2), for the system to be able to model the conversation appropriately. The most frequently occurring adjacency pairs in our scenario consist of a proposal and a follow-up act: a proposal from one of the dialogue partners induces a reaction from the other dialogue participant. This response may consist of a simple acknowledgement, an accept or reject, a response with further content, or possibly a counter-proposal. Sometimes the dialogue partner repeats the proposal, which can have the function of an acknowledgement, of checking whether it was understood correctly, or of deferring the dialogue in order to win time to think. Generally, the response follows the proposal directly; in some cases, however, it can also occur various turns later in the conversation, possibly even with an off-topic discourse happening in the meantime. Table 2.1 shows a short example dialogue snippet labelled with the corresponding dialogue act and the number of the utterance it refers to. User U1 proposes to go to an Italian restaurant. Instead of accepting right away, user U2 repeats U1's proposal, whereupon U1 acknowledges U2's repetition. In this case, the repetition is to be interpreted as a request for clarification (check act).

Utterance                                  DA, Reference
U1 5: Let's go to an Italian restaurant.   (sugg,{})
U2 6: An Italian restaurant?               (check,A5)
U1 7: Yes.                                 (ack,B6)
U2 8: Ok.                                  (acc,A5)

Table 2.1. Dialogue snippet.
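A labelled snippet like the one in Table 2.1 can be processed programmatically. The sketch below pairs each responding act with its antecedent via the reference column; the tag names follow the table, while the tuple format and the set of responding acts are our own simplification.

```python
# Sketch: recovering adjacency pairs from dialogue-act-tagged turns.
# RESPONDING lists the acts treated as responses in this illustration.

RESPONDING = {"check", "ack", "acc"}

def adjacency_pairs(turns):
    """turns: list of (utterance_id, dialogue_act, referenced_id or None)."""
    known_ids = {utt_id for utt_id, _, _ in turns}
    return [(ref, utt_id)
            for utt_id, act, ref in turns
            if ref is not None and act in RESPONDING and ref in known_ids]

snippet = [(5, "sugg", None), (6, "check", 5), (7, "ack", 6), (8, "acc", 5)]
```

Note that the pair (5, 8) is not adjacent in the surface order: the acceptance in utterance 8 responds to the suggestion in utterance 5, an instance of a long distance response.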
The design of our dialogue system is advantageous as it integrates human-human interaction with human-computer interaction. The users are able to first come to an initial agreement among themselves before the system gets involved in the conversation. Only when both dialogue partners have agreed on a choice of preferences is the computer addressed to start the query process. This seems more efficient and faster than if the interaction had included the system throughout the entire process.

Scaling up the Number of Users

If the number of participants in a conversation increases beyond two, certain factors change. Turn-taking becomes less predictable with an increasing number of participants that take part in the dialogue and can possibly take the next turn. The speaker can address more than one person simultaneously.
All dialogue participants that are not addressed directly are side-participants. If the set of direct addressees consists of more than one person, an interesting question is who responds and who does not. Who is obligated to respond? Are all addressees equally expected to respond to the speaker, in the form of an answer, an accept or reject, or even just an acknowledgement to inform the speaker that what was said has been heard? Can an utterance be rated as successfully communicated if it has been acknowledged by all of the addressed participants, by some of them, or not at all? This question cannot be answered in general. The only possible generalisation lies in the distinction of the sort of question or request posed by the speaker. Thus, a question can presuppose either communal (i.e. one addressee takes on the role of a spokesperson) or distributive (every addressee has to respond individually) acceptance. Long distance responses occur more frequently in a setup with more than two interaction partners. If more than one participant responds to an issue raised, only one response can directly follow the original utterance (not considering overlapping speech); all others have to follow each other. Ginzburg and Fernández (2005) empirically evaluated human-human interaction, comparing two-party with multi-party dialogues. Both types of dialogue exhibit adjacency between grounding and acceptance moves and their antecedents. However, while dialogues with two DPs also show adjacency between short answers and their antecedents, in multi-party dialogues long distance short answers are common. Thus, if a long distance short answer is given various turns after the issue was raised in the conversation, different issues might have been raised in the meantime. The response then has to be related to the right issue, posing the problem of issue accommodation.
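Issue accommodation can be sketched as matching a short answer against the stack of issues still open, most recent first. Using the expected answer type as the matching criterion is a deliberate simplification invented for this illustration; real accommodation relies on much richer semantic information.

```python
# Sketch of issue accommodation for long distance short answers:
# the answer is related to the most recent open issue it could resolve.

def accommodate(short_answer, open_issues):
    """open_issues: list of (question, expected_type), most recent last."""
    for question, expected_type in reversed(open_issues):
        if isinstance(short_answer, expected_type):
            return question
    return None

issues = [("Which day shall we meet?", str),
          ("How many people are coming?", int)]
```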
Noticeable differences in interaction patterns between large and small groups are shown by Carletta and colleagues (2002). Discussions in small groups involving up to seven participants resemble two-way conversations that occur between all pairs of participants; every participant can initiate conversation. Discussions in large groups, in contrast, are more like a series of conversations [Jovanovic et al., 2006]. In our setup, the first part of the interaction is the only one involving only humans. The multi-party situation includes a dialogue system and thus belongs to the situation described below. However, if the number of users of the system were increased, which could be done without great effort, multi-party conversation among humans would be encountered. The qualities of multi-party dialogue described above would then have to be considered in the design of the system. However, as we deploy an optimistic way of grounding, the difference would be less significant, as the system does not wait for every response before it performs grounding (refer to Section 4.3.1).
2.3 Multi-Party Interaction
Human-Computer Interaction

The interaction of a human with a computer, such as a task-oriented dialogue system, generally proceeds in a certain way: The user asks questions or poses requests to the system; the system then cooperatively gives information or suggests different options the user can choose from. The user evaluates the information obtained from the system. The interaction ends when the task has been successfully solved and the user is satisfied with the outcome. The user is obviously aware of the fact that he or she is interacting with a computer. Nevertheless, conventional social expressions such as thanking and greeting are observed regularly from the user. A system is therefore expected to respond appropriately so as not to appear rude. Dialogue systems are not necessarily aware of who the user is, unless they deploy user modelling, in which case a profile for each user is stored that enables the system to retrieve, for instance, information about previous interactions or the user's preferences. For systems that provide the user with specific or personal information, such as a status request within a booking process, the identification of the user is indispensable. User modelling is further necessary if a system needs to keep track of previous interactions. In the general case, however, a system does not have to be aware of who it is interacting with. From the viewpoint of human-computer interaction, our system in a way integrates both single-user and multi-user setups. The single-user setup is reflected in the interaction of the main interaction partner with the system. All remaining factors are multi-party related and will be addressed in the corresponding section below. The main interaction partner addresses the system mainly with requests and thereby holds the initiative. A typical request would be one of the following:
• with propositional content, e.g.
’Can you tell us an Italian restaurant in the city center?’
  – repetition of information
    · complete, i.e. all information collected so far is summed up,
    · partial, i.e. repeating only parts of the propositions, e.g. what has just been agreed upon
  – new information
• without propositional content, e.g. ’Do you have anything like that?’

Any information that is supplied has to be integrated by the system, i.e. the new information is compared to what has already been collected, and a database query is performed (provided the information is consistent) to return what the user has requested, e.g. specific information or a list of suggestions. The main other sort of interaction towards the system consists of responses to information from the system, e.g. in the form of acknowledgements.
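The integration step described above can be sketched as follows. This is an illustrative toy example, not the book's implementation; the slot names, the stand-in database and the function name are invented:

```python
# Illustrative sketch: new propositions are compared against the
# collected set, and a database query is performed if they are consistent.

collected = {}  # current set of propositions, e.g. {"cuisine": "italian"}

RESTAURANTS = [  # stand-in for the system's restaurant database
    {"name": "Da Mario", "cuisine": "italian", "area": "city center"},
    {"name": "Peking",   "cuisine": "chinese", "area": "suburb"},
]

def integrate_and_query(new_info):
    """Merge new propositions; return matching entries or a conflict report."""
    for slot, value in new_info.items():
        if slot in collected and collected[slot] != value:
            return ("conflict", slot)  # new information is not consistent
        collected[slot] = value
    matches = [r for r in RESTAURANTS
               if all(r.get(s) == v for s, v in collected.items())]
    return ("results", matches)

status, results = integrate_and_query({"cuisine": "italian",
                                       "area": "city center"})
```

A request without propositional content ("Do you have anything like that?") would simply re-run the query over the unchanged `collected` set.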
2 Fundamentals
Scaling up the Number of Users

A system can either be aware of the fact that it is interacting with multiple users or ignore it. Systems which serve as pure information seeking applications, without a cooperative task solving process behind them, do not need to be aware of the different users. Being aware of a user means in this context that the different users are modelled in dialogue modelling (refer to Section 2.4). Systems that interact cooperatively with their users, i.e. that hold a conversation, should be aware of their counterparts to be able to react to the users appropriately. A further dimension is flexibility in the number of users, i.e. the set of users can be fixed or change during an interaction. Our system is aware of the two interaction partners. It knows which one is speaking at any moment. However, it does not perform complex user modelling, as this is not required for the task at hand and due to the setup limiting the direct interaction. The two interaction partners build on the same common ground. Our system deploys an optimistic grounding and integration strategy (refer to Section 4.3.1) that integrates propositions when they are first mentioned, without waiting for feedback from the other user. The number of users could be increased with minor alterations in the dialogue modelling, as long as one user stays the system's main interaction partner. The more participants take part in the conversation, the more long distance short answers are expected to occur, which complicates the assignment to the right initiating speech act. However, the fact that people tend to sum up and repeat agreements and decisions after long discussions is fortunate, as it alleviates the understanding process of the system. This way, it can be assumed that the system will not fail to understand what to include in or exclude from the current set of propositions.
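The optimistic grounding strategy mentioned above can be sketched in a few lines. The class and method names are illustrative only; the point is that integration happens on first mention, with retraction as the fallback:

```python
# Sketch of optimistic grounding: a proposition is integrated into the
# common ground as soon as it is first mentioned, and retracted only if
# another participant rejects it later.

class OptimisticGround:
    def __init__(self):
        self.common_ground = set()

    def mention(self, proposition):
        # integrate immediately -- do not wait for feedback
        self.common_ground.add(proposition)

    def reject(self, proposition):
        # a later rejection retracts the optimistically grounded item
        self.common_ground.discard(proposition)

g = OptimisticGround()
g.mention("cuisine=italian")  # user A proposes Italian food
g.mention("day=friday")       # user A proposes Friday
g.reject("day=friday")        # user B disagrees; the item is retracted
```

A cautious strategy would instead hold each proposition in a pending buffer until every addressee has acknowledged it; the optimistic variant trades occasional retractions for simpler bookkeeping.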
Impact of System Interaction on the User Behaviour

The interaction of the system has a strong bearing on the flow of the conversation in various respects. To serve and please the user in the best possible way, negative answers should be presented as favourably for the user as possible. Thus, user preferences should be collected and integrated into the system so that it can adapt to the users. If the system understands which constraints are more important to the users than others, it can of its own accord alter the query to present results that come as close as possible to what the users request. In our recorded dialogues (refer to Section 3.3), different reactions from the users according to the result or information supplied by the system are observed, some of them quite affective. A positive answer and presentation of the query results or retrieved information induces behaviour ranging from a neutral reaction up to the users showing vivid delight, some of them thanking and even praising the system. In a few cases, the users are not pleased with the
result and modify the query accordingly. On the other hand, if the answer from the computer is negative (in terms of no results, or too many to be read out), this leads to a neutral to negative reaction from the users. While some accept the fact and simply change the request until they are pleased with the outcome, others show frustration and, in the worst case, even grumble at the system. However, it has to be noted that during the recordings the outcome of the database query was not altered. If the query did not yield any result, this was reported to the users. Likewise, if the set of results was too large to name all of them, the number of results was reported together with the suggestion to constrain the query further. The final system takes on a more cooperative way of problem solving. We therefore integrate a way to prioritise user constraints in the problem solving process in order to achieve satisfying answers even if a query with the original set of preferences yields no result (refer to Section 4.5).

Comparing Human-Human and Human-Computer Interaction

Differences in human-human and human-computer interaction have been noted for a long time. Addressing a computer has been categorised as ’formal’ [Grosz, 1977] or ’computerese’ [Reilly, 1987], showing a telegraphic tendency [Guindon et al., 1987]. Linguistic analysis has revealed that when speaking to a computer users tend to use very few pronouns but a high number of complex nominal phrases [Jönsson and Dahlbäck, 1988, Guindon et al., 1987]. In more recent studies, Doran and colleagues (2001) investigated the differences in human-human and human-computer interaction in terms of initiative and dialogue act patterns. They compare the same task as it is performed in two different setups, i.e. in human-human and human-computer interaction. The human-computer data consists of telephone-based interaction with different dialogue systems. All dialogues are set in the travel domain.
They are annotated in terms of dialogue acts, initiative and unsolicited information. The empirical evaluation shows that while in the human-human dialogues the initiative is equally shared between expert and user, the human-computer dialogues exhibit mostly expert (i.e. system) initiative. Users generally talk more in the human-human dialogues, and the dialogue act patterns show great differences, e.g. in terms of confirmation. Short confirmations in the form of acknowledgements occur very rarely in human-computer interaction. Systems use both long and short confirmations a lot more frequently than the users do, and long confirmations about five times more often than short ones. In human-human interaction the experts still use more confirmations than the users; in this case, however, shorter confirmations are the norm. Considering the dialogue act distribution, it can be said that in human-computer dialogues the users mainly supply task-related information, whereas in the human-human setup they tend to also request information. In terms of the exchange of information, the dialogue between the human communication partners is more evenly distributed. Taken together, in human-computer interaction the initiative is dominated by the system for most of the time (although all systems deployed for the recordings are said to be mixed-initiative). A difference in the dialogue act patterns shows further domination by the system. Only speculative explanations can be given for these phenomena. They could, for instance, result from the often poor speech recognition ability of automated systems, which makes the system fall back to system initiative and forces the users to adapt and interact with the system in a different way, i.e. learned behaviour on the users' side. Despite the mentioned differences, and possibly differing task expectations towards a computer as opposed to another person [Jönsson and Dahlbäck, 2000], human-human dialogues are regularly used as a model for human-computer interaction, as they demonstrate real interactions and are easier to acquire than using a WOZ setting. Refer to Section 2.1 for a further discussion of characteristics and techniques for data accumulation.
2.4 Dialogue Modelling

Dialogue can be described as a sequence of events (i.e. utterances) that act upon the current state of the dialogue, i.e. the state of information regarding the dialogue. Each event modifies or updates the current dialogue state according to the event's contents. The 'world' in which a dialogue system operates consists of different kinds of information which, taken together, are referred to as the dialogue context. The context contains all information the dialogue system has access to, consisting of static world knowledge and dynamic information about the ongoing dialogue. The dialogue history is an important part of the context, holding all relevant information accumulated during the dialogue up to the current point in time. Dialogue modelling provides concepts and tools to organise this information, i.e. to integrate incoming information into the context at each dialogue step, perform updates on the data according to the input and dialogue state, and finally determine an appropriate next move of the system. Dialogue management applies the dialogue modelling concepts to a real domain and organises turn-taking and grounding (refer to Section 2.4.2 below). This section starts with a closer look at the dialogue context and at the dialogue history, a particular part thereof. Subsequently, a prevalent approach to dialogue management, the Information State Update (ISU) approach (e.g. [Larsson and Traum, 2000, Matheson et al., 2000, Larsson, 2002]), is presented. The approach is a straightforward and flexible technique and a suitable basis for the dialogue management of our system. Chapter 4 presents in which way it is adapted to our scenario and situation. The approach is introduced in detail below using Larsson's IBiS1 system [Larsson, 2002] as an example implementation. The chapter concludes with a discussion of recent work on adapting the ISU approach to multi-party environments.
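The view of dialogue as a sequence of events updating a state can be illustrated minimally as a fold over the event sequence. The state layout and event format below are invented for this sketch:

```python
# Minimal illustration: a dialogue is a sequence of events, each of
# which updates the current dialogue state according to its contents.

def update(state, event):
    """Apply one utterance event to the dialogue state."""
    new_state = dict(state)                           # states are immutable snapshots
    new_state["history"] = state["history"] + [event] # remember the event
    new_state.update(event.get("propositions", {}))   # integrate its content
    return new_state

events = [
    {"speaker": "A", "propositions": {"cuisine": "italian"}},
    {"speaker": "B", "propositions": {"day": "friday"}},
]

state = {"history": []}
for e in events:
    state = update(state, e)
```

Everything that follows in this section, up to the full ISU machinery, can be read as refinements of this single `update(state, event)` step.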
2.4.1 Dialogue Context and History

The dialogue context contains all information accessible to the dialogue system. It consists of a static and a dynamic part. The static part consists of the world knowledge relevant to the system, which describes information about e.g. the periphery of the system, such as the location, time, date and day of the week, weather conditions or whatever might be relevant for the application. It further includes the concepts of general conversational principles and of the system's domain, and the database the system accesses to solve the task. All this information is said to be static, where static does not mean that it cannot change (as is naturally the case for weather, time and date, and possibly also for the database in case of updates). However, the information is not changed by the dialogue. The dynamic part of the context comprises all the information that the dialogue participants provide during the course of the conversation. The current state of the dialogue changes with every new utterance that bears relevant semantic content. Bunt (2000, p. 101) claims: 'There is no room [..] for an "objective" notion of context, since the participants' communicative behaviour depends solely on how they view the situation, not on what the situation "really" is.' Clearly, every participant has a different perception of things, which could also be a clear argument against the concept of maintaining one common ground between all dialogue partners. However, we claim that for dialogue systems such as ours, with a focus on the task, any possibly occurring differences can be disregarded as they are not expected to have any effect. We thus take the context to be valid for all participants in the same way. The Dynamic Interpretation Theory (DIT) introduced by Bunt (1999) defines a concept of a comprehensive dialogue context.
Five different types of context information are distinguished:

• Linguistic context comprises the linguistic information of the conversation as a whole, in raw (utterance as plain text) and analysed (representation of the utterance after semantic analysis) form. It further contains the information about how the dialogue has proceeded so far (the dialogue history) as well as concepts for the future course of the dialogue (dialogue plans). Thus, this context contains static and dynamic information. It is further responsible for anaphora and ellipsis resolution, phenomena that occur regularly in natural dialogue.
• Semantic context holds the task-related information, such as the task record, which is often simply referred to as the task model. All propositions that have been uttered during the dialogue (i.e. user preferences or constraints) are stored here and used as a basis for the database queries. Another part of the semantic context is the domain model which describes the domain the system acts in, in the form of an ontology, for instance. An ontology is used as a formal representation of a domain describing concepts,
their properties and relationships among the concepts (refer to Section 4.2 for the description of an ontology of our example domain).
• Physical and perceptual context comprises the physical context that is of relevance to the system, e.g. the availability of communicative channels, as well as assumptions on the partner's physical and perceptual presence and attention.
• Social context is composed of interactive and reactive pressures on the agent, such as communicative rights, obligations and turn management. For instance, if the user posed a request to the system, the system then has the obligation to respond.
• Cognitive context comprises the agent's processing status, which is relevant e.g. if certain tasks take some time. If the system is aware of this fact it can report it to the user or use the meantime for something else. The dialogue partners' states are further modelled in this context, e.g. what their attention is currently directed at, whether they are speaking and to whom, where they are looking, and so on.

The dialogue history denotes a particular part of the dialogue context. Some systems use the term dialogue history interchangeably to refer to the complete context model; other systems keep the two terms clearly separate, a notion which we support. The dialogue history represents the flow of the dialogue by storing what is and, especially, what has been talked about throughout the dialogue. What kind of dialogue history is deployed depends on the complexity and requirements of a dialogue system. Simple slot-filling dialogues normally get by with a simple task model during the course of the dialogue, without using an explicit dialogue history. It is not necessary to be able to trace back when and in which order certain slots were filled, as only the values matter. More dialogically complex dialogue systems that, e.g.,
are able to hold a conversation with the user(s), are expected to be able to recall events which happened previously in the interaction and to know what has been spoken about. A system should for instance know what it has previously suggested in case the users want to go back to this option. The dialogue history is also used for the resolution of elliptical sentences and anaphoric references to entities contained in previous utterances. Further, in case understanding errors occur and the system reaches a state of incorrect information, it has to be able to fix this by restoring the last correct state, e.g. through backtracking. Given the variety of tasks, the way of storing the data differs as much from system to system as does the type of data to be stored. The representations range from complex hierarchical structures containing various pieces of information to simple sequential representations. For instance, the LINLIN system [Jönsson, 1997] deploys a dialogue history which has the form of a dialogue tree with three levels corresponding to the whole dialogue, to discourse segments and to speech acts. The VERBMOBIL system [Alexandersson et al., 1995] contains a dialogue memory which holds representations of intentional, thematic and referential information. The intentional layer contains dialogue phases and speech acts which are represented in a tree-like manner, the thematic layer contains domain-related information relevant to the task-solving process, and the referential information consists of the lexical realisation of utterances. The WITAS system [Lemon et al., 2001] is another system which deploys a tree, called the dialogue move tree, which stores dialogue states. The edges are dialogue moves and the branches represent conversational threads. The dialogue history is thus integrated into the dialogue model. The GALAXY system [Seneff et al., 1996] uses semantic frames to represent the data, which is stored in the dialogue history in a sequential way. It maintains a history table of objects which can possibly be referred to in the future, making the interpretation process easier. The WAXHOLM system [Carlson et al., 1995] is another representative of systems that use semantic frames to model the dialogue history. The TRAINS system [Traum, 1996] deploys an approach intermediate between the two kinds presented. The dialogue history in this case is modelled as a stack of discourse units which represent the utterances, the corresponding speech acts, as well as further information about the utterance such as speaker and initiative. Our system aims at being an intelligent context-aware system. We thus deploy an extensive context model following Bunt's DIT concept. Our dialogue history has a sequential structure storing the dialogue states in the order they occur, similar to the approach adopted by the TRAINS project. Our dialogue history forms an essential part of the system, as it is what enables the proactive interaction behaviour.
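A sequential dialogue history of the kind just described can be sketched as follows. The state contents and method names are illustrative, not the book's implementation:

```python
# Sketch of a sequential dialogue history: dialogue states are stored in
# the order they occur, which lets the system recall earlier suggestions
# and restore a previous (correct) state through backtracking.

class DialogueHistory:
    def __init__(self):
        self.states = []                 # chronological list of states

    def record(self, state):
        self.states.append(dict(state))  # store a snapshot, not a reference

    def restore(self, steps_back=1):
        """Backtrack: return a copy of an earlier state."""
        return dict(self.states[-1 - steps_back])

history = DialogueHistory()
history.record({"turn": 1, "suggested": None})
history.record({"turn": 2, "suggested": "Da Mario"})
history.record({"turn": 3, "suggested": "Peking"})  # later found incorrect
previous = history.restore(1)  # go back to the state as of turn 2
```

A tree-structured history (as in LINLIN or WITAS) would additionally group states by discourse segment or conversational thread; the sequential form trades that structure for simplicity.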
All the systems described above (all of them single-user systems) start building their dialogue history at the beginning of the conversation, which coincides with the system's first interaction. As our system encounters a different situation, with the users already speaking before the system interacts, the dialogue history also has to model this interaction between the users (if it is task-relevant), even before the system's first interaction. Only then is the system able to achieve full contextual awareness and interact proactively (refer to Section 4.3.3 for the description of our dialogue history). We further integrate another sort of history within the structure of the task model to memorise already mentioned constraints, as described in Section 4.5.

2.4.2 Dialogue Management

The dialogue context (we treat the dialogue history as part of the dialogue context in this description) is incorporated in dialogue systems in different ways and at different levels of detail. Xu et al. (2002) differentiate dialogue systems along two dimensions, namely whether the underlying dialogue model and task model are explicit or implicit. An explicit dialogue model represents the linguistic context, whereas an explicit task model offers means to represent the semantic
knowledge. Systems whose dialogue structure is predefined in the form of a finite state machine model the context implicitly. The entire context is represented implicitly in the respective state: the linguistic context as well as task-related information and other context information relevant for the system. In frame-based systems the task is explicitly represented and modelled in the frame; often little attention is paid to the dialogue model (e.g. [Chu-Carroll, 1999]). The opposite holds for systems motivated by linguistic theories rooted in semantics and pragmatics. Here, the linguistic and semantic context is represented in great detail, while less attention is paid to the task model. Examples of such systems are the GODIS system [Larsson et al., 2000], which uses questions under discussion (QUD), originally developed by Ginzburg and colleagues (1996), the IBiS systems [Larsson, 2002], and the EDIS system [Matheson et al., 2000], which applies grounding and obligation, put forward by Traum and colleagues (1994). These systems further deploy the Information State Update (ISU) approach, which is presented in detail in the following section. So far, nearly all spoken dialogue systems are concerned exclusively with dyadic dialogue involving two participants. One of the first multi-party dialogue systems was developed by Traum and colleagues [Rickel et al., 2002, Swartout, 2006]. The Mission Rehearsal Project is used for training in the Army. A doctor, a village elder and a sergeant interact in a critical situation set in the context of a Middle Eastern country. They use PTT [Matheson et al., 2000, Poesio and Traum, 1998, Traum et al., 1999] for dialogue modelling. Based on the experience with this system, Traum proposes that the multi-party case in dialogue offers the chance for more exact theories of dialogue by exposing the shortcomings of different dialogue models which perform similarly in the two-party case [Traum, 2003].
However, Traum considers only direct communication in the multi-party setup, limiting each contribution to one speaker addressing only one single addressee, which does not resemble a fully-fledged multi-party situation. Dialogue management for multi-user dialogue systems is more complex than for single-user systems. The dialogue needs to consider the different dialogue participants. Each interaction partner has to be modelled individually, as each one might have different beliefs and goals. Grounding is a complex process in multi-party interaction, as each dialogue partner, and thus also the system, has a common ground with every other dialogue partner. In order to achieve one's goal, all relevant information from any of the witnessed conversations has to be integrated. For instance, if the system's goal was to find a common solution between all of the users, it would collect the constraints from all of the users and utterances and use the resulting set to find a common solution. Turn-taking constitutes another difference, as multi-party interaction does not follow clear patterns. The dialogue modelling thus has to provide flexibility in turn-taking, including repair mechanisms and ways to proceed differently, e.g. in case a different participant steals one's turn. Depending on the requirements of each system, the modelling can be more or less
complex. For our system we adopt the Information State Update approach, which models and thus understands dialogue step by step, which we believe is an important objective for an intelligent dialogue system. The approach is straightforward, deploys a sophisticated dialogue model, and provides the flexibility to install extensive task and context models. The ISU approach is presented in detail in the following section, followed by recent work on extending the approach towards the multi-party situation.

2.4.3 Information State Update Approach to Dialogue Modelling

The Information State Update (ISU) approach is a common and suitable approach for extensive dialogue modelling. It models dialogue as a state of information from the perspective of a dialogue participant. The so-called Information State (IS) is updated according to the content of the incoming utterances. Attempts to modify the ISU approach to make it multi-party capable have been undertaken by Ginzburg and Fernández (2005) and Kronlid (2008) and will be discussed in the following section. The remainder of this section is dedicated to a brief introduction of the general ISU approach. The ISU approach has mainly been shaped by Larsson [Cooper and Larsson, 1998, Larsson and Traum, 2000, Larsson et al., 2000, Larsson, 2002] and Poesio and Traum [Traum et al., 1999, Matheson et al., 2000] and is rooted in Ginzburg's dialogue gameboard theory (1996). Example implementations have emerged especially out of the context of the Trindikit [Larsson and Traum, 2000], a framework for implementing and experimenting with systems using the ISU approach (such as e.g. EDiS [Matheson et al., 2000], GODIS [Larsson et al., 2000] and its successors IBiS1-4 [Larsson, 2002]). The information state theory typically consists of the following components [Larsson, 2002]:

• A description of the informational components that should be represented in the IS.
• Formal representations of the above components.
• A set of dialogue moves that trigger the updates on the IS.
• A set of update rules defining the update operations performed on the IS.
• An update strategy to decide which rules to apply to the IS at any given point.

The information state is the central part of the ISU approach, describing the dialogue at the current point in time. It is able to describe complex parts of a conversation, such as the dialogue participant's mental state (beliefs), obligations (e.g. to respond) and goals (as well as ways to achieve these). The IS consists of the dynamic information of the dialogue context as described above and models the dialogue at a specific state from the viewpoint of one dialogue
participant. It is divided into two parts. The private part contains information known only to the participant that is modelled, such as the participant's own beliefs, agenda and plans regarding the ongoing conversation. The public part contains information which is believed to be grounded and known by all participants of the dialogue. The public part can either be modelled as shared between all participants [Larsson, 2002] or as agent-specific, as done by Traum [Traum et al., 1999, Matheson et al., 2000] and Ginzburg (1996), who therefore call it quasi-shared. In the agent-specific case the view on grounded information is believed to be different for each participant. Generally, however, if the interaction is successful, all DPs end up having the same grounded information (provided that no malicious agents are involved in the interaction who, e.g., purposely deceive other agents, making them believe something the others do not believe). As an example, the IBiS1 information state is depicted in Figure 2.2. It consists of a private and a shared part and contains the following components:

• AGENDA describes actions to perform in the near future.
• PLAN contains longer-term actions to achieve a certain goal.
• BEL (beliefs) is used to store the results of the database queries, i.e. information only known to the modelled DP.
• COM (commitments) contains propositions that the user and system have mutually agreed upon during the dialogue.
• QUD (questions under discussion) is a stack of questions raised in the dialogue that have not yet been resolved. The topmost element is the question that is currently being discussed.
• LU (latest utterance) contains the information about the latest utterance, i.e. speaker and dialogue move.
PRIVATE:  AGENDA : Stack(Action)
          PLAN   : Stack(Action)
          BEL    : Set(Prop)
SHARED:   COM    : Set(Prop)
          QUD    : Stack(Question)
          LU     : SPEAKER : Participant
                   MOVE    : Move

Fig. 2.2. Example information state as deployed in IBiS1 [Larsson, 2002].
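One possible rendering of the information state of Figure 2.2 as a concrete data structure is sketched below. The field names follow the figure; using Python lists as stacks and dictionaries for LU is an implementation choice of this sketch, not part of the IBiS1 formalism:

```python
# IBiS1-style information state rendered as Python data structures.

from dataclasses import dataclass, field

@dataclass
class Private:
    agenda: list = field(default_factory=list)  # Stack(Action)
    plan:   list = field(default_factory=list)  # Stack(Action)
    bel:    set  = field(default_factory=set)   # Set(Prop)

@dataclass
class Shared:
    com: set  = field(default_factory=set)      # Set(Prop)
    qud: list = field(default_factory=list)     # Stack(Question), top = last
    lu:  dict = field(default_factory=dict)     # {"speaker": ..., "move": ...}

@dataclass
class InformationState:
    private: Private = field(default_factory=Private)
    shared:  Shared  = field(default_factory=Shared)

s = InformationState()
s.shared.qud.append("?cuisine")  # push a question under discussion
```

The private/shared split mirrors the figure directly: update rules may read and write both parts, but only the shared part is assumed to be grounded.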
Every utterance of the modelled dialogue induces an update of the IS. The possible updates are determined by update rules, which consist of preconditions and effects (both working on the IS), the rule name and the name of the class the rule belongs to. All preconditions have to be true for a rule to fire,
which induces actions on the IS as defined by the effects part of the rule. The update rules are grouped into certain classes according to their functionality, e.g. rules for the integration of the semantic content of an utterance, or rules for general grounding procedures, to load plans, or to select next moves. The order in which the different classes of update rules should fire is regulated by an update strategy. Algorithm 1 shows an example update algorithm, again borrowed from the IBiS1 system [Larsson, 2002]. It consists of conditions and calls to classes of update rules.

if NOT latest move == failed then
    apply clear(/PRIVATE/AGENDA);
    getLatestMove;
    integrate;
    try downdateQUD;
    try loadPlan;
    repeat execPlan
end
Algorithm 1: Update algorithm of IBiS1 [Larsson, 2002].
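The control flow of Algorithm 1 can be sketched in Python as follows. The rule functions here are minimal placeholders of our own invention; in a real ISU system each call would apply a whole class of update rules to the information state:

```python
# Control flow of an IBiS1-style update algorithm (sketch).

def update(is_state, latest_move):
    if latest_move == "failed":
        return is_state                    # leave the IS untouched
    is_state["private"]["agenda"].clear()  # clear(/PRIVATE/AGENDA)
    get_latest_move(is_state, latest_move)
    integrate(is_state)
    downdate_qud(is_state)                 # 'try': may do nothing
    load_plan(is_state)                    # 'try': may do nothing
    while exec_plan(is_state):             # 'repeat execPlan'
        pass
    return is_state

# Placeholder rule classes, just enough to run the loop once:
def get_latest_move(s, m): s["shared"]["lu"] = m
def integrate(s): s["shared"]["qud"].append(s["shared"]["lu"])
def downdate_qud(s): pass
def load_plan(s): pass
def exec_plan(s): return False             # nothing left to execute

state = {"private": {"agenda": ["old"]}, "shared": {"qud": [], "lu": None}}
state = update(state, "ask(q)")
```

Note how the update strategy is nothing more than this fixed ordering of rule-class invocations; changing the strategy means reordering or conditioning these calls.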
The update rule getLatestMove fires as the first rule when a new utterance comes in. It loads the data of the latest utterance into the corresponding parts of the IS (the MOVE and SPEAKER fields of LU) to make it accessible for further operations. The rule looks as follows (the notation used here is slightly modified from Larsson's notation (2002)):

RULE:  getLatestMove
CLASS: grounding
PRE:   $Latest Move == Move
       $Latest Speaker == Participant
EFF:   copy(/SHARED/LU/MOVE, Move)
       copy(/SHARED/LU/SPEAKER, $Latest Speaker)
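A rule such as getLatestMove can be encoded as a precondition/effects pair over a dictionary-shaped information state. This encoding is our own illustration, not Larsson's formalism; the context dictionary and its keys are assumptions of the sketch:

```python
# Sketch: an update rule as a precondition function plus an effects
# function, both operating on a dictionary-shaped IS.

def precondition(ctx):
    # $Latest Move == Move, $Latest Speaker == Participant:
    # a latest move and a latest speaker must be available
    return ctx.get("latest_move") is not None \
        and ctx.get("latest_speaker") is not None

def effects(ctx, is_state):
    # copy the latest utterance data into /SHARED/LU
    is_state["shared"]["lu"] = {
        "move": ctx["latest_move"],
        "speaker": ctx["latest_speaker"],
    }

ctx = {"latest_move": "ask(q)", "latest_speaker": "usr"}
is_state = {"shared": {"lu": {}}}
if precondition(ctx):   # all preconditions must hold for the rule to fire
    effects(ctx, is_state)
```

A rule engine would hold many such pairs, grouped by class, and let the update strategy decide which class to try at each step.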
After loading the latest utterance, the data has to be integrated. Thus, as a next step in the algorithm, a rule of the integrate class of update rules is called. Thereby, the rule whose complete set of preconditions is true will fire. Assuming, for instance, that the latest move was a question posed by the user, the update rule integrateUsrAsk will fire. The rule looks like this:

RULE:  integrateUsrAsk
CLASS: integrate
PRE:   $/SHARED/LU/SPEAKER == usr
       in($/SHARED/LU/MOVE, ask(q))
EFF:   push(/SHARED/QUD, q)
       push(/PRIVATE/AGENDA, respond(q))

The preconditions of the rule check whether the latest speaker was the user and whether the latest move was an ask move. The firing of this rule causes an update of QUD, i.e. q is pushed on top of QUD, as it denotes the new question that is currently under discussion. Further, an action stipulating a response to the question q is pushed onto the AGENDA. Analogously to this rule,
integrateSysAsk and integrateAnswer are examples of rules that handle dialogue moves7. Further update rules are deployed to manage plans and actions. They are defined by the rule classes select move, select action, find plan, exec plan, etc. Example rules are selectAsk, raiseIssues, findPlan, managePlan. Interaction protocols, which describe a common flow of a conversation, emerge from the application of the update rules to a dialogue. Table 2.2 shows a simple query and an assertion protocol, slightly modified from their specification in [Ginzburg and Fernández, 2005].

Query:
  LatestMove == ask(A:B,q)
  A: push(QUD,q); release turn
  B: push(QUD,q); take turn; make max-qud-specific utterance; release turn

Assertion:
  LatestMove == assert(A:B,p)
  A: push(QUD,p?); release turn
  B: push(QUD,p?); take turn; Option 1: Discuss p?; Option 2: Accept p
     LatestMove == accept(B:A,p)
  B: increment(FACTS,p); pop(QUD,p?); release turn
  A: increment(FACTS,p); pop(QUD,p?)

Table 2.2. Query and assertion interaction protocols of IBiS1.
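As a toy trace of the assertion protocol above: A asserts p, B accepts, and both agents integrate p into FACTS and pop p? off QUD. The dictionary representation of an agent and the proposition name are illustrative assumptions of this sketch.

```python
# Hedged sketch of the IBiS1 assertion protocol from Table 2.2; agent state
# is reduced to a QUD list and a FACTS set for illustration.

def assert_move(agent, p):
    agent["QUD"].append(p + "?")      # push(QUD, p?)

def accept_move(agent, p):
    agent["FACTS"].add(p)             # increment(FACTS, p)
    agent["QUD"].remove(p + "?")      # pop(QUD, p?)

A = {"QUD": [], "FACTS": set()}
B = {"QUD": [], "FACTS": set()}

assert_move(A, "open_sunday")         # A: assert(A:B, p); release turn
assert_move(B, "open_sunday")         # B: push(QUD, p?); take turn
accept_move(B, "open_sunday")         # B: accept p; LatestMove = accept(B:A, p)
accept_move(A, "open_sunday")         # A: integrate the accepted p as well
print(A["FACTS"], B["FACTS"])         # {'open_sunday'} {'open_sunday'}
```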
The presented components of the ISU approach are taken from the IBiS1 system [Larsson, 2002], which assumes perfect communication at all times, in the sense that all utterances are understood and accepted and no references are used (all of which occur frequently in natural dialogue).7 The successor system IBiS2 modifies the information state structure to enable grounding. This is done by adding temporary fields in the private part of the information state which store information before it is grounded (at which point it is integrated into the corresponding fields). We do not go into details here and refer to [Larsson, 2002] for further information. Another extension worth mentioning, however, is a field that holds the previous move (the one before the Latest Move). Besides checking the next move for relevance (what it was originally designed for by Larsson), this field also enables backtracking, i.e. false assumptions can be taken back. This is a first step towards a dialogue history, which we claim to be necessary for a complex dialogue system with multiple dialogue participants involved. In the following section we present existing approaches that extend the ISU approach to handle multi-party interaction, building on the principles introduced in this section.

7. Which dialogue move is assigned to an utterance is defined by the relation between the content and the activity in which it occurs. The set of dialogue moves deployed by the IBiS1 system is ask, answer, greet and quit.

2.4.4 Multi-Party Dialogue Modelling

Introducing additional participants into the conversation affects the dialogue in various ways (e.g. refer to Section 2.3, [Traum, 2003, Ginzburg and Fernández, 2005, Kronlid, 2008]). It cannot be assumed that all dialogue participants share the same common ground, neither the one established during the conversation nor the common ground built during previous interactions of the participants. Thus, the more participants take part in the conversation, the less shared common ground can be presumed. In two-party conversation, turns switch constantly between the two participants: when one speaks, the other one is addressed and takes the next turn. In multi-party conversation, turn-taking is not that simple; many different situations can occur. Thus, interaction protocols have to be more flexible to be able to cover all possible situations. Another issue regards the fact that in multi-party dialogue a varying number of participants can be addressed simultaneously. Either one single participant or several participants are addressed, or possibly both situations occur within one single turn. A side-participant could steal the turn and speak instead of the intended addressee. Does this take away the obligation of the addressee to answer? Is a question considered resolved (and can it be taken from QUD) after being answered by one participant while a set of participants was addressed? Below, two approaches to multi-party dialogue modelling are introduced which address and try to answer some of these questions.

Traum's approach to multi-party dialogue management [Traum, 2003] regards multi-party dialogue as sets of pairs of two-party dialogues, which simplifies the multi-party situation by enabling the use of existing dialogue models. Only direct addressing is considered, i.e. one speaker and one addressee. Different conversations between different pairs of interaction partners may take place at the same time. For natural multi-party interaction, this approach is not considered flexible enough.

The first approach discussed in more detail in the following is introduced by Ginzburg and Fernández (2005). They were the first to extend the ISU approach [Larsson, 2002] to enable multi-party dialogue (or multilogue, as they call it). They introduce a way to scale up the interaction protocols to integrate overhearers and side-participants and thus enable distributive and collective addressing and answering behaviour of groups of addressees. The extended protocols regard the different conversational roles presented in Section 2.3.2 and two benchmarks derived from investigating multi-party dialogues [Ginzburg and Fernández, 2005]:

• Multilogue long distance short answers (MLDSA): Querying protocols for multi-party dialogue must license short answers an unbounded number of turns from the original query.
• Multilogue adjacency of grounding/acceptance (MAG): Assertion and grounding protocols for multi-party dialogue should license grounding and acceptance moves only adjacently to their antecedent utterance or after an acceptance of a fellow addressee.

Ginzburg and Fernández (2005) thus propose three possible modifications of the interaction protocols used for conversational update in multi-party situations, according to the principles shown in Table 2.3.

Add Overhearers (AOV): Silent overhearers are added as recipients of the informative speech act, i.e. they add what has been said to their common ground without the chance to interact for clarification or feedback.

Duplicate Responders (DR): The addressee role is multiplied. Each of the responders updates the information state in the way an addressee does and is expected to respond, one after another, in the form of distributive acceptance. This protocol extension does not enable the reaction of a later responder to an earlier responder's utterance (which, however, occurs regularly in conversations).

Add Side Participants (ASP): The audience consists of a set of DPs who update their information state without responding. One member of the audience instantiates the addressee role and acts as a sort of spokesperson whose acceptance counts for the whole audience. This protocol extension enforces communal acceptance.

Table 2.3. Interaction principles by Ginzburg and Fernández (2005).
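The three principles can be summarised operationally: each determines which dialogue participants update their information state, which are expected to respond, and what kind of acceptance results. The function below and the triple it returns are an illustrative abstraction of this sketch, not Ginzburg and Fernández's formalisation.

```python
# Illustrative sketch of the principles in Table 2.3: for a query by A to an
# addressee B with further audience members, return who updates QUD, who is
# expected to respond, and the resulting acceptance mode.

def apply_principle(principle, addressee, others):
    """Return (updaters, responders, acceptance) for a query under `principle`."""
    audience = [addressee] + list(others)
    if principle == "AOV":     # overhearers update silently, only B responds
        return audience, [addressee], "individual"
    if principle == "DR":      # every addressee responds, one after another
        return audience, list(audience), "distributive"
    if principle == "ASP":     # B's acceptance counts for the whole audience
        return audience, [addressee], "communal"
    raise ValueError(f"unknown principle: {principle}")

updaters, responders, acceptance = apply_principle("DR", "B", ["C1", "C2"])
print(updaters)    # ['B', 'C1', 'C2'] (all addressees update QUD)
print(responders)  # ['B', 'C1', 'C2'] (and all are expected to respond)
```

Note that AOV and ASP yield the same responder set; they differ in whether the silent participants are ratified side-participants whose acceptance is delegated to a spokesperson (ASP) or mere overhearers (AOV).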
In the following, the interaction protocols presented in the foregoing section (refer to Table 2.2) are listed after the modifications performed according to the multi-party principles suggested by Ginzburg and Fernández (2005). Table 2.4 shows the querying protocol adopting each of the three proposed principles.
Query + AO:
  LatestMove == ask(A:B,q)
  A: push(QUD,q); release turn
  B: push(QUD,q); take turn; make max-qud-spec utt; release turn
  Ci: push(QUD,q)

Query + DR:
  LatestMove == ask(A:{B,C1,..,Cn},q)
  A: push(QUD,q); release turn
  B: push(QUD,q); take turn; make max-qud-spec utt; release turn
  For all Ci: push(QUD,q); take turn; make max-qud-spec utt; release turn

Query + ASP:
  LatestMove == ask(A:{B,C1,..,Cn},q)
  A: push(QUD,q); release turn
  B: push(QUD,q); take turn; make max-qud-spec utt; release turn
  Ci: push(QUD,q)

Table 2.4. Interaction protocol query extended to the multi-party situation following the principles of Ginzburg and Fernández (2005).
Kronlid (2008) argues that this approach is insufficient to adequately handle multi-party situations. While the above modifications allow for the distinction of communal and distributive questions, they are not very flexible. Applying DR might be suitable, for instance, for a disagreeing audience or a tutor-student situation; in other situations, however, it would seem unnatural to have every addressee of an audience respond one after the other, and only to the principally uttered question, since DR does not allow for a reaction of one DP to the acceptance of another dialogue participant. ASP might be too restrictive in general. Kronlid defines one of the main challenges of modelling multi-party dialogue as the identification of the sort of question asked. He differentiates between individual, communal and distributive questions. Each type of question specifies a different set of responders. Table 2.5 shows the principle Kronlid introduces to multiply the addressee role [Kronlid, 2008].

Add Multiple Addressees (AMA): The responder role is duplicated; the responders' max-qud-specific contributions are optional.

Table 2.5. Interaction principle by Kronlid (2008).
The anticipated behaviour lies somewhere between DR and ASP, allowing but not requiring all addressees to respond. One of the addressees
responds (by self-selection), after which all other addressees have the chance to do the same. The querying interaction protocol extended with Kronlid's AMA principle is shown in Table 2.6.

Query + AMA:
  LatestMove == ask(A:{B,C1,..,Cn},q)
  A: push(QUD,q); release turn
  B: push(QUD,q); take turn; make max-qud-spec utt; release turn
  Ci: push(QUD,q); ( Optional: take turn; make max-qud-specific utterance; release turn )

Table 2.6. Interaction protocol query using the AMA principle.
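The AMA protocol can be simulated in a few lines: every addressee pushes q on QUD, one addressee responds by self-selection, and the remaining addressees respond only optionally. Modelling the optional contribution as a coin flip is purely an illustrative assumption of this sketch.

```python
import random

# Hedged sketch of Kronlid's AMA principle from Table 2.6. The first element
# of `addressees` plays the role of the self-selecting responder B.

def run_query_ama(addressees, q, rng):
    quds = {dp: [q] for dp in addressees}   # all addressees: push(QUD, q)
    responders = [addressees[0]]            # B self-selects and responds
    for dp in addressees[1:]:               # Ci: optional max-qud-spec utt
        if rng.random() < 0.5:
            responders.append(dp)
    return quds, responders

quds, responders = run_query_ama(["B", "C1", "C2"], "where?", random.Random(0))
print(responders)  # 'B' always responds; C1 and C2 only sometimes
```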
Applying AMA allows for communal as well as distributive acceptance. However, Kronlid suggests deploying DR for distributive questions. Commenting on earlier answers or comments of fellow addressees is enabled by introducing acceptance moves and changing the way in which QUD functions. Kronlid (2008) extends Larsson's IBiS1 system to Multi-IBiS for multi-party interaction. The information state of Multi-IBiS is depicted in Figure 2.3, followed by a description of its components. The main modifications performed on the traditional IBiS1 version of the information state are the following:

• AGENDA: The agenda is implemented as a queue instead of a stack for better handling of the chronological order. Agenda items are changed such that they contain the DP to address as a second argument, if necessary.
• PLAN: The plan element is divided into three parts.
  – ISSUE: The issue which the plan aims to resolve.
  – THE PLAN: The actual plan. Plans deployed in Multi-IBiS are extended to include the name of a DP who should be addressed to resolve the plan.
  – OPEN FOR ME: A boolean value denoting whether the modelled DP can contribute to the plan or not.
PRIVATE:
  AGENDA: Queue(Action)
  PLAN:
    ISSUE: Question
    THE PLAN: Stack(Action)
    OPEN FOR ME: Boolean
  BEL: Set(Prop)
SHARED:
  COM: Set(Prop)
  QUD: StackSet(
    ISSUE: Question
    SPKR: Participant
    ASET: Set(Participant)
    OSET: Set(Participant)
    STATUS: [OPEN|CLOSED])
  PU:
    SPKR: Participant
    MOVE: Set(Move)
  LU:
    SPEAKER: Participant
    MOVE: Set(Move)

Fig. 2.3. Multi-IBiS information state [Kronlid, 2008].
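The information state in Figure 2.3 can be transcribed into code. The following dataclass sketch follows Kronlid's field names but simplifies the container types: Stack, Queue and StackSet are all rendered as plain Python lists, and the concrete member types are assumptions of this sketch.

```python
from dataclasses import dataclass, field

@dataclass
class Plan:
    issue: str = ""                               # ISSUE: the question the plan resolves
    the_plan: list = field(default_factory=list)  # THE PLAN: Stack(Action)
    open_for_me: bool = False                     # OPEN FOR ME: can this DP contribute?

@dataclass
class QudItem:
    issue: str                                 # ISSUE: Question
    spkr: str                                  # SPKR: DP who raised the question
    aset: set = field(default_factory=set)     # ASET: directly addressed DPs
    oset: set = field(default_factory=set)     # OSET: obligated responders
    status: str = "OPEN"                       # STATUS: OPEN | CLOSED

@dataclass
class Utterance:
    speaker: str = ""                          # SPEAKER / SPKR: Participant
    move: set = field(default_factory=set)     # MOVE: Set(Move)

@dataclass
class Private:
    agenda: list = field(default_factory=list) # AGENDA: Queue(Action)
    plan: Plan = field(default_factory=Plan)
    bel: set = field(default_factory=set)      # BEL: Set(Prop)

@dataclass
class Shared:
    com: set = field(default_factory=set)      # COM: Set(Prop)
    qud: list = field(default_factory=list)    # QUD: StackSet of QudItem
    pu: Utterance = field(default_factory=Utterance)  # previous utterance
    lu: Utterance = field(default_factory=Utterance)  # latest utterance

@dataclass
class InformationState:
    private: Private = field(default_factory=Private)
    shared: Shared = field(default_factory=Shared)
```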
• QUD (questions under discussion): is from now on a list, allowing the addressing of items that are not maximal. Each QUD item contains:
  – ISSUE: The question.
  – SPKR: The DP who raised the question.
  – ASET: Set of DPs who are directly addressed and who also have the right to address the issue.
  – OSET: Set of DPs with the obligation to address the issue.
  – STATUS: Addressing status of the issue. It signals whether the ASET or OSET could still be extended (which could be the case while the current utterance is not yet finished).
• PU (previous utterance): The PU field is borrowed from Larsson's IBiS2 implementation (2002), where it was called PM. While LU holds the latest utterance, PU holds the same information about the utterance before LU. This enables an agent to respond to something uttered up to two turns ago, as well as backtracking in case of overhasty grounding due to the cautiously optimistic grounding strategy that is deployed.

The Multi-IBiS data model contains further elements besides the information state [Kronlid, 2008]:

• Name of the modelled dialogue participant
• Names of all dialogue participants: At the beginning of the interaction the list is empty. Participants are added as soon as they start speaking.8

8. This way, it does not seem possible to consider silent side-participants as addressees. It is not specified how this case is handled.
• Addressing information, i.e. which participant is currently addressing which other DP(s). The information is accessed by the QUD object. When a participant addresses a different DP, the information is updated accordingly.

By extending QUD to include the set of addressees (ASET) and obligated responders (OSET), the fulfilment of obligations to answer can be observed. The ASET and OSET fields contain the same participants at the beginning. With every answer, the name of the current DP is removed from OSET. When all obligated responders have answered, OSET is empty. For communal acceptance, the set is emptied after one answer. If ASET is empty from the start, the addressees are determined by addressing-by-attribution, i.e. participants self-select and make a contribution if they know the answer.

Kronlid modifies the dialogue moves and update rules in order to handle multi-party dialogue. The structure of the update rules is changed, as the simple assumption of a perfect two-party dialogue with speaker and responder switching every turn does not apply in the multi-party case. Therefore, every move has to be integrated regardless of speaker and addressees (getLatestMove is omitted). The update rules need to differentiate between individual, collective and distributive questions. Kronlid's approach does not suggest a solution to identify each question according to its type. The context is not taken into account; rather, it is assumed that a certain question is always of a certain type. Further changes towards the multi-party situation include the renaming of some rules, e.g. integrateUsrAsk is changed to integrateOtherAsk, and the extension of all rules to integrate the ASET and OSET to verify whether a DP is permitted or obligated to answer. By introducing an acceptance move accept(q) that is added to QUD when a participant answers q, the subsequent speaker is enabled to address this acceptance move or q. QUD is thus not downdated after any resolving answer; instead, q is left on QUD until the latest move does not contain a q-specific move.

Kronlid's approach addresses the prevalent multi-party specific questions in the following way. Whether an obligated addressee is released from the obligation to answer if another participant answers instead depends on the type of question raised. If the question is distributive, the obligation continues and the participant is still expected to answer. If the question is of communal character, the participant is released from the obligation. In our case, a distinction of the different question types is not necessary as all questions are of individual nature. Analysing the recordings shows that if the side-participant (who is not addressed) interrupts the conversation, this is done for a significant reason and mostly entails a response from the former speaker, outweighing the response from the former addressee and thus taking away the obligation. The question of when issues are resolved is also dealt with, in that the QUD as well as the conditions for downdate
were loosened in Kronlid's approach. In our setup, we adopt this and some other modifications that Kronlid made to the original information state; however, we put more focus on the task model and dialogue history (refer to Section 4.3.3). Regarding the extensions of the interaction protocols, none of the suggested principles fully serves our needs. While Ginzburg's ASP is not flexible enough, Kronlid's approach considers multiple addressees (a situation we do not encounter). Thus, we suggest a new principle, which is presented in Section 4.1.
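The OSET bookkeeping described in this section can be sketched compactly: with a distributive question every obligated responder must answer before the obligation is discharged, while a communal question is discharged by a single answer. Passing the question type as a flag is an assumption of this sketch; as noted above, Multi-IBiS does not infer the type from context.

```python
# Illustrative sketch of OSET downdating in the style of Multi-IBiS.

def update_oset(oset, responder, question_type):
    """Update the set of obligated responders after `responder` answers."""
    if question_type == "communal":
        oset.clear()             # one answer counts for the whole group
    else:                        # distributive: each DP must answer itself
        oset.discard(responder)
    return oset

oset = {"U1", "U2"}
update_oset(oset, "U1", "distributive")
print(oset)   # {'U2'} (U2 is still obligated to answer)
update_oset(oset, "U2", "distributive")
print(oset)   # set() (all obligations have been fulfilled)
```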
2.5 Summary

In this chapter, fundamentals of different techniques and research areas were provided as a basis for the remaining chapters of this book. The Wizard-of-Oz data collection technique introduced in Section 2.1 is deployed for our data recordings presented in Chapter 3. The introduction to dialogue system evaluation is relevant for Chapter 5, which presents the evaluation conducted for our dialogue system. The linguistic fundamentals and general human-computer interaction discussions (Section 2.3) are considered for the development of the dialogue management component (Chapter 4), which is based on the Information State Update approach introduced in Section 2.4 together with existing multi-party extensions. We discussed that these extensions to the ISU approach do not sufficiently cover the novel situation that our system encounters. We thus introduce a new interaction principle to allow for proactive interaction of side-participants and adapt the data structures to the multi-party situation of our system, as presented in Chapter 4.
3 Multi-Party Dialogue Corpus
Adequate dialogue data is needed to investigate multi-party interaction with a computer system acting as an independent dialogue partner in a conversation with several humans. A look is taken at existing multi-party corpora in order to see whether any of these could be used to investigate our research questions. The presented collection is limited to corpora that make use of multiple modalities. Although a wide variety of corpora is available, as far as the authors are aware, there is no existing collection of data that stresses the designated features important for our research. Thus, we perform the data collection presented in this chapter.

We deploy the Wizard-of-Oz technique as introduced in Section 2.1 to simulate the envisaged system in the example domain of restaurant selection (refer to Section 1.3 for a description of the system) in order to obtain realistic interaction data, which is then used to assist the development of the system (Chapter 4) and for evaluation (Chapter 5). The WOZ recording setup and procedure are presented in Section 3.2, followed by a detailed description of the software tool we have developed to support the recordings. The tool is easily adaptable to other domains and requirements and publicly available to other developers. The collected data results in the PIT corpus, which is presented in Section 3.3.
3.1 Existing Multi-Party Corpora

P.-M. Strauß and W. Minker, Proactive Spoken Dialogue Interaction in Multi-Party Environments, DOI 10.1007/978-1-4419-5992-8_3, © Springer Science + Business Media, LLC 2010

Recent multi-party corpora are presented in the following. Only corpora that deploy multiple modalities are considered; thus, corpora focusing on the audio modality only, such as the ICSI corpus [Janin et al., 2003] and the ISL audio corpus [Burger et al., 2002, Burger and Sloane, 2004], are not listed.

AMI(DA) Corpus

In the framework of the AMI (Augmented Multi-party Interaction) project [Renals, 2005], which aims at developing meeting browsing technology as well as
remote meeting assistants, a multimodal data corpus of about 100 hours of meeting recordings was collected [Carletta et al., 2005]. Some of the meetings were naturally occurring meetings, others were scenario-driven meetings put on for the recordings. The meetings generally had four participants and were held in English, with a large proportion of the speakers being non-native English speakers. Instrumented meeting rooms were set up on three different sites, equipped with individual and room microphones, individual and long-shot video cameras, individual electronic pens, presentation slide capture and white-board capture devices.

CHIL Corpus

The CHIL Audiovisual Corpus for Lecture and Meeting Analysis inside Smart Rooms [Mostefa et al., 2007] was created in the framework of the CHIL (Computers in the Human Interaction Loop) project. It consists of synchronised video and audio streams of real lectures and meetings recorded in five smart rooms. Each of the rooms is equipped with a large amount of recording equipment, for instance a minimum of five cameras and 88 microphones per room. The interaction scenarios are meetings (40) and lectures (46). Both consist of one person giving a presentation in front of an audience, differing in the size of the audience (ten to twenty persons in lectures, three to five in meetings) and in the amount of interaction (a lot of interaction in meetings versus little interaction with the audience in lectures, where the focus lies only on the presenter). The interaction language is English, but most of the speakers have a non-native English accent. Lectures were between 40 and 60 minutes long, meetings approximately 30 minutes. Manual annotation includes multi-channel verbatim orthographic transcription of the audio modality, including speaker turns and identities, acoustic condition information and named entities for part of the corpus. The corpus is used within the CHIL project to develop audiovisual perception technologies for human activity analysis during lectures and meetings. This includes person localisation and tracking, person and speaker identification, face and gesture recognition, speech recognition, emotion identification, acoustic scene analysis, topic identification, head-pose estimation, focus-of-attention analysis, question answering and summarisation.

MSC1 Corpus

The MSC1 corpus [Pianesi et al., 2007] and its successor corpus [Mana et al., 2007] consist of task-oriented meetings of four persons who are to solve a survival task. The data is used to study social behaviour and personality traits from audio-visual cues such as 3D body tracking and speech activity. The aim is to develop a system that automatically predicts these personality traits using the audio-visual cues. The meetings of MSC1 were on average around 19 minutes long. The successor corpus features 52 participants in 13 sessions recorded by cameras and microphones and also includes task-success measures.
NIST Corpus

The NIST audio-visual corpus of the American National Institute of Standards and Technology1 consists of multimodal multi-channel data sets recorded in a smart room using a variety of microphones and video cameras (enabling 2D tracking). The recorded meetings are partly real, partly scenario-driven. The number of participants per meeting ranges from three to nine. The first part of the corpus [Garofolo et al., 2004] contains 19 meetings, resulting in approximately 15 hours of data. 61 participants were involved, about a third of them non-native English speakers. Phase two of the corpus [Michel et al., 2007] contains 17 meetings and approximately 19 hours of data, involving a total of 55 subjects, only four of them non-native English speakers.

1. http://www.nist.gov

VACE Corpus

The VACE meeting corpus [Chen et al., 2005] was collected in order to support research on understanding meetings by analysing multimodal cues such as speech, gaze, gestures and postures. The meetings were held in the domain of war game scenarios and military exercises, recorded in an instrumented lecture room holding up to eight participants sitting around a meeting table. Each participant is recorded by various microphones and at least two video cameras to enable 3D tracking of the head, torso, shoulders and hands.

M4 Corpus

The audio-visual corpus of the MultiModal Meeting Manager (M4) project [McCowan et al., 2005] features meetings of four participants, scripted in terms of the type and schedule of group actions or note taking. Recordings were conducted with various microphones and video cameras enabling 2D tracking. A sort of subcorpus of M4, the multimodal corpus described in [Jovanovic et al., 2006], consists of twelve meetings recorded at the IDIAP smart meeting room [Moore, 2002]: ten dialogues from M4, one from AMI and one other. Mainly four participants took part in each meeting, with a total of 23 participants. The total duration of the corpus is approximately 75 minutes. The research aim pursued with this data collection is to study addressing behaviour in face-to-face conversations. The corpus is hand-annotated with dialogue acts, adjacency pairs, addressees and gaze directions of meeting participants.

MRE Corpus

A multimodal corpus of task-oriented multi-party dialogues [Robinson et al., 2004] was collected in the scope of the Mission Rehearsal Exercise project
(MRE) [Swartout et al., 2005]. The aim of the project is virtual reality training of a decision-maker in a multi-party mission-related setting. One human trainee interacts with several virtual characters. The corpus comprises approximately ten hours of audio (human simulation radio data) and approximately five hours of video and audio face-to-face interactions between human trainees and virtual agents. Part of the dialogues were recorded using the Wizard-of-Oz technique (refer to Section 2.1).

ATR Corpus

The corpus referred to by [Campbell, 2008] was collected at ATR (Advanced Telecommunications Research Institute International2) with the purpose of interaction analysis focusing on engagement in dialogue. It contains (besides dyadic dialogues) a set of conversations (not task-oriented meetings) with four participants from different cultural backgrounds. The dialogues were audio and video recorded, enabling head and body tracking.

2. http://www.atr.jp

Our research aim is to study human-computer interaction in a multi-party scenario. The question is how the computer is integrated into the conversation as an independent dialogue partner. Thus, data suited for the analysis needs to feature the intended setup of dialogue partners. All (but one) of the described multi-party corpora feature several human dialogue partners; however, none of them deploys a virtual interaction partner. The MRE corpus is, as far as the authors are aware, up to now the only corpus deploying virtual characters. The data, however, is not suitable for our research as it features multiple virtual characters interacting with a single human, while we require multiple humans interacting with a single computer system.

Our focus further lies on task-oriented dialogue, i.e. the interaction pursues a certain goal and ends when the task has been solved. The dialogue system is to play the role of the expert and contribute to solving the task. Dialogues that feature multiple humans involved in the same sort of dialogue could possibly be used for analysis in order to learn from the human expert the designated behaviour the system should possess. However, task-oriented interaction data of this kind is not available in the form of multi-party dialogues; most of the presented corpora act in the meeting domain, which follows different principles of interaction than task-oriented dialogue and is thus not suitable. As far as the authors are aware, there is no existing collection of data stressing the designated features important for our research. Therefore, we build our own data corpus, which enables us to study the interaction of the humans with the computer system. The data collection procedure and corpus are presented in the remainder of this chapter.
3.2 Wizard-of-Oz Data Collection

An extensive Wizard-of-Oz environment was set up for the collection of multimodal data. The recordings took place in the scope of the research project 'Perception and Interaction in Multi-User Environments - The Computer as a Dialogue Partner', conducted within the competence centre Perception and Interactive Technologies (PIT) at Ulm University (Germany), whose aim is the development of components and technologies for intelligent and user-friendly human-computer interaction in multi-user environments. Over a time span of 18 months, 76 dialogues were recorded in three recording sessions. The setup stayed the same over all three sessions, while the system for the wizard interaction was improved from one session to another, introducing new features and speeding up the reaction time. The final system is described in Section 3.2.4 and in more technical detail in Appendix A.

3.2.1 Experimental Setup

The setup of the system is shown in Figure 3.1. The human dialogue partners U1 and U2 interact with the system S, which is operated by the human wizard situated in a different room. U1 is the system's main interaction partner. The
Fig. 3.1. Data collection setup.
dialogue system server runs on the computer S. It produces acoustic output in the form of synthesised system utterances and visual output consisting of an avatar and further items displayed on the screen (a restaurant's menu, city map or bus schedule). The wizard's computer is connected to S, i.e. the dialogue system server, via a network connection. The wizard controls the system, hearing what the users are saying through microphones M1 and M2, whose signals are transmitted via wireless connection and recorded on the wizard's computer. A webcam is further installed pointing towards the screen for the wizard to check whether the system's display is working correctly.

Audio Recordings

The speech signals are recorded by three different microphones: one lapel microphone for each human dialogue partner (M1, M2)3 and one room microphone (M3)4 to capture the entire scene including the system output. The signals recorded by the lapel microphones are transmitted via wireless connection to the wizard's computer, where they are recorded. All audio data is recorded at 16 kilohertz with 16 bit resolution – the standard quality for speech recognition. External sound cards5 are used for improved quality and to be independent of the recording computer. The audio signals from the room microphone are recorded on a MiniDisc recorder6.

Video Recordings

Three video cameras are installed to record the dialogues (C1-C3)7. For complexity reasons we use one human user as the main interaction partner (U1) for the system and consider only this user's gaze. Thus, camera C1 is responsible for recording U1's gaze direction. Camera C2 records user U1's perspective, C3 captures the entire scene from the long shot. Figure 3.2 shows the scene from the viewpoint of cameras C3 (left) and C1 (right).

3.2.2 Procedure

The recordings proceed in the following way: two participants take part in each recording. Before the interaction with the system, they fill out the first part of the questionnaire and are given a random scenario as a guideline for the interaction. After the recording, the second part of the questionnaire is completed. From the system's perspective, the wizard has to follow certain guidelines in order to achieve homogeneous recordings. In some of the recordings, an emotion-eliciting strategy is deployed, as described below.

3. AKG 97L with AKG WMS40
4. AKG 1000S
5. CREATIVE Sound Blaster 24-bit S80300
6. SONY MZ-R700
7. JVC GR-D270E
Fig. 3.2. Video recordings from the viewpoint of cameras C3 (left) and C1 (right) during a Session I dialogue [Strauß et al., 2007].
Participants

The participants (n=152)8 were students and employees of the university who gave written consent to participate in this study. They were between 19 and 51 years of age (on average 24.4 years); 53 of them were female (4 at Session I (10.5%), 18 at Session II (45.0%), 31 at Session III (41.9%)). Except for seven participants (I: 2, II: 4, III: 1), the native language of all participants is German. The main professional backgrounds are computer science (n=41, i.e. 27.0%) and natural sciences (n=27, i.e. 17.8%). Engineering and medicine are equally represented, each with 24 persons (15.8%). The daily computer usage time of 149 participants lies between 10 and 720 minutes, with an average of 252 minutes. Computer experience lies between 4 and 21 years, with an average of 11.43 years. The participants can be said to be very familiar with technology and adept with computers. For participation they did not receive any compensation other than coffee and cookies.

Questionnaires

A comprehensive questionnaire was completed by the participants prior as well as subsequent to the interaction with the system for evaluation purposes. The questionnaires collected data about the participants and self-assessments of technical skills, as well as usability and subjective ratings of the system. A detailed description of the questionnaires and evaluation results is given in Section 5.1.
8 150 of the participants filled out the questionnaire; the presented data is hence based on this number.
3 Corpus Development
Scenarios

We randomly assigned different scenarios to the dialogue partners to supply them with a starting point and theme for the dialogue. For that, each dialogue partner received a few statements describing the situation and the role to adopt in the dialogue. An example of such a scenario is shown in Table 3.1. The specified information includes, e.g., the relation to the other dialogue partner, a motivation for the restaurant visit, or culinary preferences. The combinations of roles include, amongst others, employer and employee, business colleagues, friends, or a couple in a loving relationship. Occasionally, the roles contained contrary preferences to induce a lively and emotional discussion. Given that people act differently according to the roles they take in a conversation (refer to Section 2.3.2), we tried in this way to obtain interesting and diverse dialogues in terms of dialogue behaviour, e.g. relating to equality or dominance of the dialogue participants. For instance, in a conversation between professionally unequal dialogue partners, the 'superior person' is likely to also play a superior role in the dialogue. Less objection and conflict is observed in this case; the 'inferior person' is more likely to give in. In the case of two friends discussing their choice of a restaurant, by contrast, it can be observed that constraints are changed more often, as the dialogue partners articulate their own preferences more freely, trying to convince the counterpart. Although the scenarios and roles are fictitious, these phenomena can be observed in the recorded dialogues. Thus, a wide range of different behaviour is achieved in the corpus of dialogues. For the major part, the participants followed the role play throughout the dialogues. In a few cases, however, the scenario was used only to stimulate the conversation and was not referred back to later on in the dialogue.

Situation: Two colleagues want to go out to eat after a long day at work. It is about 7 in the evening and the sun is still shining.

Person A (dialogue partner of the computer)
- loves Greek food
- had a bad day (is tetchy)
- is short of money

Person B
- is disgusted by garlic
- likes plain German food

Table 3.1. Example scenario description.
3.2.3 System Interaction Policies

One of the characteristics of computers is deterministic behaviour. In order to perform an adequate simulation, it is therefore necessary to define the system's behaviour. Wizard guidelines have to be drawn up to determine how the wizard should control the system, so as to ensure uniform system behaviour throughout all dialogues and thus obtain unbiased dialogue data. We deploy two different strategies for interaction: one for standard interaction, simulating a perfectly working system, and another one to elicit emotional behaviour from the users. Both strategies start with the same general principles: The conversation between the two dialogue partners starts without system interaction while the users act out the provided scenarios. The differences emerge once the system has joined the conversation, following the policies pointed out in the following paragraphs.

Standard Policy

During a standard recording the system, or the wizard respectively, does not interrupt the conversation of the users. Speech recognition and language understanding are simulated to be perfect. Correct and helpful answers are given as frequently and as promptly as necessary and wherever possible. The system's first and further interactions are triggered by one of the following situations:

• Reactively, upon user U1 addressing the system directly
• Proactively, in order to make a significant contribution to the conversation (e.g. to report a problem in the task-solving process)
• After pauses in the dialogue exceeding a certain threshold (if a meaningful contribution can be made)

In the case where U2 addresses the system directly, the utterance is recognised; however, no direct response is given. Two different reactions can be observed during the dialogues: either user U1 instantly takes the turn and poses the same or a similar request to the system, or U2's request is followed by a pause, which again would mostly lead to an interaction by the wizard. After the users have found a suitable restaurant, the wizard generally closes the dialogue, saying goodbye and thanking the users.
In some cases, however, the participants decide to search for another location, such as a bar for a cocktail after dinner. Overall, the recordings that adhere to the standard policy resemble a perfectly working dialogue system. Thus, the data deriving from these recordings are suitable for a usability evaluation of the envisaged dialogue system.

Emotion Policy

An average dialogue recorded deploying the scenarios described above attains (besides neutral data) the emotional characteristics of happiness and at times surprise (e.g. when the avatar appears on the screen for the first time). In order to obtain data which exhibits a wider range of emotions, we try to induce emotions by directing the system's reaction in a certain way.
Thus, instead of simulating a perfectly working system, the wizard steers the system's reaction in an unexpected way. The system does not appear perfect but rather rude and faulty, in order to provoke anger, boredom, annoyance and surprise. The following types of errors and disorders are deployed throughout the entire conversation:

• Speech recognition errors
• Understanding errors
• Wrong answers
• Occasional interruptions of the users
• Arbitrary pauses or omissions
For instance, if a user says "not expensive restaurant", the wizard returns a selection of expensive restaurants, ignoring the "not". As another example, the wizard might purposely suggest a restaurant that does not match the users' preferences or is even situated in a different city. Needless to say, the emotions expressed by the users are more moderate than the artificial emotions played by actors, as e.g. in the Berlin Database of Emotional Speech [Burkhardt et al., 2005], a prevailing corpus for research on emotion recognition. We consider these moderate emotions more realistic and common in human-computer interaction and therefore very useful for affective computing tasks, which, however, are not within the scope of this work. In Section 5.1.4 the usability ratings of the dialogues recorded using the two different interaction policies are compared to each other.

3.2.4 WIT: The Wizard Interaction Tool

The dialogue system in use for the Wizard-of-Oz recordings has to be built so as to simulate the envisaged system's behaviour as closely as possible. We developed the Wizard Interaction Tool (WIT), which allows fast and faultless wizard interaction. It is a client-server tool implemented in Java and freely available for download9 and customisation to individual needs, domains and languages [Scherer and Strauß, 2008]. The tool enables the wizard to substitute for the modules that are not yet functional, such as speech recognition and semantic analysis. As a further interaction point, the wizard can check the automatically generated prompts before they are sent to the synthesiser. The first recording session was performed while the development of the system was still in its early stages. For the second recording session, the location of the setup was changed and the system had evolved to a more elaborate client-server architecture; further functionality was added for the third and final WOZ recording session. The differences in system setup and functionality between the recording sessions are listed in the following. In the remainder of this section, the interface of the tool and the applied restaurant application are described. Technical details on the implementation and customisation of the tool are listed in Appendix A.

9 http://www.uni-ulm.de/in/pit
• Session I. The initial system deployed for the wizard interaction is simple and not very elaborate. The wizard is provided with a VoiceXML frontend to type in the keywords. The backend, i.e. the system itself, is a preliminary version of the final system, implemented in Java. The system is represented by synthesised speech output. Restaurants' menus can be displayed on the screen. Various wizards are involved in the recordings. The main drawback of this architecture is the slow response time of the system, presumably caused in part by unclear or indirect phrasing of the queries, where it was ambiguous for the wizard how to react. More practice and training for the wizard might have mitigated this to some extent. Further, there is a high chance of misspelling words in this frontend. Additionally, the participants would start to speak again while waiting for the answer, which, of course, the wizard (resp. the system) did not interrupt. They would pick up a random topic or alter the request, which induced the wizard to change the query once more before responding. On the other hand, if participants ask direct questions and wait for the answer, the response time is considerably shorter.
• Session II. The following improvements are realised:
  – Speeding up the response time: The slow response time of the system was not acceptable; therefore, the system architecture evolved to a client-server architecture entirely implemented in Java. The new frontend permits only specified keywords to be entered, deploying an auto-completion functionality. Thus, faster and more correct interaction is enabled for the wizard, which improves the response time considerably.
  – Giving the system a face: An avatar in the form of a head is integrated as a personification of the system in the visual modality. It moves its mouth according to the synthesised speech output and occasionally blinks with one eye.
  – Additional functionality: The system's range of functionality is extended to include the feature to show street maps pointing to the selected restaurant.
• Session III. The system is altered in the following points:
  – Single wizard: One single wizard conducts all of the recordings to ensure as little variation as possible regarding system behaviour.
  – Additional functionality: The functionality is expanded in terms of the ability to display bus schedules. The website of the local public transportation service is consulted for the next bus connections from the current location (i.e. the university) to the bus station closest to the selected restaurant.
  – Avatar or no avatar: Recordings can be performed with or without the avatar displayed by simply (un)selecting a check box.
  – Emotion-eliciting interaction: An emotion-inducing policy for the wizard behaviour is introduced to obtain emotionally rich dialogue data.

Human Computer Interface

Figure 3.3 shows screenshots of the computer screens of the wizard (a) and the users (b) when interacting with the system10. The different interfaces are described in the following.
Fig. 3.3. (a) Screenshot of the Wizard client, (b) Screenshot of the system from the users’ perspective.
Wizard Interface. Figure 3.3(a) shows the system as seen from the wizard's perspective. Besides the WIT's graphical user interface (GUI) on the left side, two further objects are displayed on the wizard's computer screen. On the upper right side, the image of the users' screen is shown. It is forwarded over the Ethernet using a webcam in order to check whether the graphical output of the server is correct. An audio recording tool, displayed on the lower right side, records the users' conversation, which is transmitted by the wireless microphones. The design of the tool is very functional, focusing on the basic features. It has only a few buttons and text fields that can be controlled by either mouse or keyboard. The wizard enters commands (either preferences picked out of the dialogue or commands to control the system's functionality) into the field in the top row with auto-completion functionality, i.e. only permitted words can be entered, and they are automatically completed when typing the beginning of a word. This way, no invalid or non-existent commands can be typed by the wizard, greatly reducing errors and enabling very fast interaction. The line below the command box depicts the automatically generated system utterances, which are displayed before prompting in order for the wizard to control and edit the content of the utterances (if necessary). This further gives the wizard the opportunity to determine the prompting time more precisely. Although generally not used for our recordings, the functionality of entering custom utterances is also provided. The current constraints (i.e. up to the point where the wizard starts a new query) as well as the history of the prompts are shown above the result list. Entries can be selected from the prompt history to reload a prompt into the prompt field. The result list is displayed in the lower part of the tool and holds the result set of the latest database query. The restaurants are displayed in a table to give the wizard an overview of the matching restaurants and their properties. Entries can be selected, again using the mouse or keyboard, to load a restaurant as the currently selected restaurant, which enables the wizard to access the information about this restaurant.

User Interface. The users' perspective of the system is shown in Figure 3.3(b). The interface is operated by the server and consists of synthesised speech as well as graphical output in the form of an avatar (seen on the left side of the screenshot) and objects displayed on the screen (e.g. the map on the right side of the screenshot).

• Speech Output. The system utterances to be prompted are transmitted by the client to a module which converts text to phonemes. Hadifix11 is used to generate phoneme files, which are interpreted by MBROLA12 to produce WAV files that are played on the server.
• Avatar. The avatar figure consists of various transparent images of the head, eyes, and different mouth positions, as seen in Figure 3.4. The phoneme files are further used to produce synchronised avatar output, as the avatar moves its mouth according to the phonemes. It further blinks randomly with one eye to give it a lively touch.
• Displayed Objects. The objects displayed on the system's screen are the restaurants' menus, street maps and bus schedules. The objects are controlled by scripts which are triggered by the wizard to run on the server. A script, for instance, starts a web browser showing a map that points to the address of the currently selected restaurant. The maps are dynamically created using Google Maps. Directions to or from addresses cannot be displayed. The restaurants' menus are stored as HTML objects and are displayed in a browser window. Bus schedules are created dynamically by consulting a website for a connection from the university's bus stop to the bus stop closest to the selected restaurant. At the moment, no specific departure time can be enquired; the current date and time are used.

10 The screenshots were obviously taken at different points in the interaction, as (a) has no object displayed on the system's screen while (b) does.
11 http://www.ikp.uni-bonn.de/dt/forsch/phonetik/hadifix/HADIFIXforMBROLA
12 http://tcts.fpms.ac.be/synthesis/mbrola
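The phoneme-driven mouth animation described above (cf. the mouth positions F, U/O and A in Fig. 3.4) can be sketched as a simple scheduler over MBROLA-style phoneme files, whose lines contain a phoneme symbol followed by a duration in milliseconds. The phoneme-to-viseme grouping below is an illustrative assumption, not WIT's actual mapping:

```python
# Sketch of phoneme-to-viseme mapping for the avatar's mouth animation.
# The grouping is an assumption for illustration; MBROLA .pho lines have
# the form "phoneme duration_ms [pitch targets...]".

VISEME_GROUPS = {
    "F": {"f", "v"},                # lip-teeth contact
    "U/O": {"u:", "o:", "O", "U"},  # rounded lips
    "A": {"a:", "a", "E", "e:"},    # open mouth
}

def viseme_for(phoneme: str) -> str:
    """Return the mouth image to display for a given phoneme."""
    for viseme, phonemes in VISEME_GROUPS.items():
        if phoneme in phonemes:
            return viseme
    return "closed"  # default position (silence, other consonants)

def schedule(pho_lines):
    """Turn MBROLA-style lines into (start_ms, viseme) display events."""
    t, events = 0, []
    for line in pho_lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        phoneme, duration = parts[0], int(parts[1])
        events.append((t, viseme_for(phoneme)))
        t += duration
    return events

print(schedule(["h 60", "a: 120", "l 70", "o: 150"]))
```

A renderer would then swap the transparent mouth images at the scheduled offsets while the corresponding WAV file plays.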
Fig. 3.4. Three different examples of phoneme based mouth positions (from left to right: F, U/O, A).
Restaurant Application

The example application implemented for the recordings consists of a simple database which the system accesses to perform queries based on the users' preferences and to provide information about certain restaurants. The restaurant database contains information about approximately 100 restaurants in and close to Ulm, Germany. It is implemented in XML. Figure 3.5 shows an example database entry (in the upper part). Each restaurant is represented by various fields which serve as search criteria and to provide information, such as name, restaurant category, address, phone number, opening hours, public transport access, cuisine, setting (such as terrace, beer garden, children's corner, non-smoking area), take-away and delivery, specials (such as happy hour or lunch special), and web address. The database is complemented by a menu in HTML format for most of the restaurants. Street maps and bus schedules are not part of the database but are generated at runtime using database information such as the address and the closest bus station. All permitted values and synonyms for the database entries are defined by a grammar. These are the values the system understands and which can be entered by the wizard. The grammar is also implemented in XML; an extract of the example grammar entry for 'price category' is shown in the lower part of Figure 3.5.

Fig. 3.5. Extracts of an example restaurant entry (the restaurant "Asia Wan am Muensterplatz", Muensterplatz 14, phone 0731 1537371, price category 'exclusive', specials 'lunch') and of the grammar entry for the price category (the synonyms "expensive"/"exclusive" map to the value 'exclusive', "standard"/"moderate" to 'moderate', and "cheap"/"inexpensive" to 'inexpensive'); both translated from the original German version into English.
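The interplay of the two XML resources can be illustrated with a short sketch: the grammar normalises a synonym entered by the wizard to its canonical database value, which is then matched against the restaurant entries. The tag names, the second restaurant entry, and the function names are assumptions for illustration; the original XML schema is not reproduced here:

```python
# Hedged sketch: grammar-based synonym normalisation plus a constraint
# query over an XML restaurant database. Tag/field names and the entry
# "Pizzeria Bella" are invented for the example.
import xml.etree.ElementTree as ET

GRAMMAR_XML = """
<grammar>
  <field name="pricecategory">
    <item val="exclusive">expensive</item>
    <item val="exclusive">exclusive</item>
    <item val="moderate">standard</item>
    <item val="moderate">moderate</item>
    <item val="inexpensive">cheap</item>
    <item val="inexpensive">inexpensive</item>
  </field>
</grammar>
"""

DB_XML = """
<restaurants>
  <restaurant>
    <name>Asia Wan am Muensterplatz</name>
    <pricecategory>exclusive</pricecategory>
  </restaurant>
  <restaurant>
    <name>Pizzeria Bella</name>
    <pricecategory>inexpensive</pricecategory>
  </restaurant>
</restaurants>
"""

def load_grammar(xml_text):
    """field -> {synonym entered by the wizard: canonical value}."""
    grammar = {}
    for fld in ET.fromstring(xml_text):
        mapping = grammar.setdefault(fld.get("name"), {})
        for item in fld:
            mapping[item.text] = item.get("val")
    return grammar

def query(db_text, grammar, **constraints):
    """Return names of restaurants matching all normalised constraints."""
    results = []
    for rest in ET.fromstring(db_text):
        if all(rest.findtext(f) == grammar.get(f, {}).get(v, v)
               for f, v in constraints.items()):
            results.append(rest.findtext("name"))
    return results

grammar = load_grammar(GRAMMAR_XML)
print(query(DB_XML, grammar, pricecategory="cheap"))
```

Because every wizard command is constrained to the grammar's synonym set by the frontend's auto-completion, a lookup of this kind can never fail on an unknown value.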
3.3 The PIT Corpus

The corpus consists of 76 dialogues recorded within the setup presented above. Three recording sessions were performed: Session I was recorded in July and August 2006, Session II in May and June 2007, and Session III in December 2007. Table 3.2 shows descriptive information on the corpus. Between the recording sessions the system underwent changes, which are described in Section 3.2.4. About half of the dialogues were recorded with the avatar on the screen, the other half without. One setup was used for all recordings of Sessions I and II, whereas in Session III four different setups were deployed (IIIa-d).
Session                    | I       | II      | III (IIIa/IIIb/IIIc/IIId) | Total
Number of dialogues        | 19      | 20      | 37 (14/11/5/7)            | 76
Duration of session        | 3:47 h  | 4:18 h  | 5:40 h                    | 13:45 h
Min dialogue duration      | 3:15 m  | 4:18 m  | 2:43 m                    | 2:43 m
Max dialogue duration      | 26:11 m | 33:39 m | 18:24 m                   | 33:39 m
Mean dialogue duration     | 12 m    | 13 m    | 9:44 m                    | 11 m
Avatar                     | -       | +       | (+ / - / + / -)           | 51.3%
Emotion-eliciting strategy | -       | -       | (- / - / + / +)           | 15.8%

Table 3.2. Statistical information of the PIT corpus.
3.3.1 Data Structure

All dialogues follow a certain pattern, which arises from the fact that the structure of the dialogue is defined by various crucial points, as depicted in Figure 3.6. The point in time when the conversation between the users enters the specified domain is the first point which induces a change in the system's state. The second decisive moment is the first interaction of the dialogue system. From this point on, three dialogue partners are involved in the conversation. The behaviour of the dialogue partners changes in terms of addressing and gazing. This point thus induces a phase change for the audio and video data, as well as for the system. A further crucial point occurs with the display of an object (other than the avatar) on the screen. While this does not change the interaction itself, the object attracts the main dialogue partner's attention and therefore influences the gazing behaviour. Thus, a phase change occurs in the video data. When the object disappears, the former phase is re-entered. When the task is solved and the system retracts from the dialogue, the last decisive moment is reached. This point denotes the end of the dialogue.
Fig. 3.6. Dialogue with crucial points and phases.
Audio Data. The audio data can be divided into three phases with different characteristics. Each dialogue starts with a domain-independent chat between the participants. The second phase begins at the point when the conversation switches over to the specified domain and the users start discussing their preferences and aversions regarding different aspects of the restaurant domain. The third phase is characterised by the involvement of the dialogue system in the conversation to solve the joint task. The dialogues typically end when the users find a suitable restaurant and thank the system. Some recordings contain several iterations of the restaurant search, i.e. after finding one restaurant, the users, instead of ending the dialogue, started to look for another (remaining in the third phase).
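The three audio phases can be made concrete with a tiny helper that maps an utterance timestamp to its phase, given the two crucial points of a dialogue (the timestamps and the function name are illustrative, not part of the corpus tools):

```python
# Minimal sketch: assign an utterance time to one of the three audio
# phases described above, given the dialogue's two crucial points.

def audio_phase(t, domain_start, first_system_turn):
    """Phase 1: off-domain chat; 2: in-domain human-human; 3: with system."""
    if t < domain_start:
        return 1
    if t < first_system_turn:
        return 2
    return 3

# Example: domain entered at 40 s, system joins at 95 s (invented values).
print([audio_phase(t, 40, 95) for t in (10, 60, 200)])
```

The video phases would need the additional show/hide events of displayed objects, which, as noted below, are only recoverable from the video data.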
Video Data. The video data can also be structured into three interaction phases, considering the gaze direction of the main user U1. These phases differ from the dialogue phases described above. The first interaction phase is characterised by the conversation between the human dialogue partners before the first system interaction. During this time, there is almost no gaze directed towards the computer screen. The first interaction of the system initiates phase two and displays the avatar on the screen (where applicable). During this phase, U1's gaze switches between the computer and user U2, depending on speaker and addressee. The third phase is characterised by an object (other than the avatar) displayed on the screen: generally, while a restaurant's menu, a street map, or a bus schedule is shown, most of U1's gaze points towards the system. When the object disappears, the dialogue returns to the second phase. When the task is solved and the dialogue system retracts from the conversation, the avatar is removed from the screen.

3.3.2 Annotation

The data is transcribed at the utterance level and annotated with speaker, addressee and dialogue acts. Table 3.3 presents the basic tagset of dialogue acts we used for the annotation of the dialogues. It consists of nine one-dimensional dialogue acts. The tagset was generated in a bottom-up fashion by empirically analysing the dialogues in order to identify the necessary dialogue acts. This small number of dialogue acts is in our case sufficient to cover all the actually occurring phenomena we aim to identify in the dialogue in order to solve the task. Figure 1.1 shows the example dialogue from Section 1.3 annotated with the corresponding dialogue acts. Natural dialogue is generally too complex to be coded with a one-dimensional tagset [Petukhova and Bunt, 2007]. For computational dialogue modelling, however, the usage and purpose of the dialogue are limited. Thus, sacrifices have to be made, and generally simplicity is preferred over completeness in annotation (e.g. [Traum, 2000]). Complex modelling might contain a lot of information that is redundant for the system, as it cannot be used for the limited task. The concepts of our tagset are partly borrowed from Bunt's DIT++ tagset (e.g. [Petukhova and Bunt, 2007]), a fine-grained tagset that considers 11 dimensions for annotation, and partly from the VERBMOBIL tagset [Alexandersson et al., 1998], a one-dimensional tagset which consists of 33 dialogue acts. Although especially the Verbmobil tagset is less complex than DIT++ and the prominent four-dimensional DAMSL tagset [Core and Allen, 1997], they are all still too complex to be deployed for our system. For instance, there is no need to differentiate between different forms of suggest; the system does not treat exclamations differently from offers or statements, just to name a few examples. If the system were to be altered in the future to be able to react differently upon these fine-grained dialogue acts, a different tagset would have to be used.
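A possible in-memory representation of such annotations is sketched below: each utterance carries speaker, addressee and a list of acts drawn from the tagset of Table 3.3 (the class layout and names are assumptions, not the system's actual data model):

```python
# Sketch of an annotated utterance validated against the Table 3.3 tagset.
from dataclasses import dataclass, field

TAGSET = {"sug", "req", "inf", "acc", "rej", "ack", "chk", "sta", "gre", "oth"}

@dataclass
class Utterance:
    uid: str
    speaker: str          # "U1", "U2" or "S"
    addressee: str
    text: str
    acts: list = field(default_factory=list)

    def __post_init__(self):
        # Reject annotations outside the defined tagset.
        unknown = set(self.acts) - TAGSET
        if unknown:
            raise ValueError(f"acts outside the tagset: {unknown}")

u = Utterance("24b", "U1", "S", "and Mexican?", ["req"])
print(u)
```

Allowing a list of acts per utterance anticipates the multi-act annotations discussed in Section 3.3.3.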
Dialogue Act | Abbr. | Meaning and significance in the domain
suggest | sug | Task-related proposal
request | req | Request for information or a database query when directed towards the system; towards the other dialogue partner, mainly in order to find out the partner's preferences
inform | inf | Supplying information (e.g. as done by the system after a query)
accept | acc | Mostly a positive reaction to a suggest
reject | rej | Mostly a negative reaction to a suggest
acknowledge | ack | Signalling understanding of the previous contribution
check | chk | Repetition of previously mentioned utterance snippets to elicit approval of what has been understood; check acts are occasionally also used as a sort of stall act in order to win more time to think
stall | sta | Any fillers uttered in order to defer the dialogue, to keep the floor, or to win more time to think; stall acts can also denote uncertainty
greet | gre | Social act complying with cultural conventions of greeting and introducing oneself, as well as saying goodbye
other | oth | Anything irrelevant for the goal of the dialogue (e.g. discussion of the displayed menu)

Table 3.3. PIT Corpus dialogue act tagset.
Video Data. Anvil [Kipp, 2001] is used for the annotation of the video data. A set of Session I and II dialogues (refer to Section 5.3) was hand-annotated by one expert with speaker, addressee, gaze direction of user U1 and screen display. Refer to Section 5.3 for the results of the gaze direction analysis of the annotated dialogues. Automatic labelling of the gaze direction in the Session III dialogues is part of future work (refer to Section 6.2).

3.3.3 Dialogue Analysis

Table 3.4 depicts one of the dialogues of the corpus13 as an example to show the analysis performed on the data. The dialogue is annotated with dialogue acts, as described above, as well as speaker and addressee information. The dialogue presented here is translated from its original German version, which is listed in Appendix B.

13 The dialogue originates from Session III and has the ID III 09 061207.
ID | Spkr | Addr | Utterance | DA
1 | U1 | U2 | Oh, it's late already. | inf
2 | U2 | U1 | Yeah. | ack
3 | U1 | U2 | So, should we eat something? | sug
4 | U2 | U1 | I just had something a little while ago. I'm not hungry. | rej
5 | U1 | U2 | You are not hungry? Well, what would you.. Where would you go? | che, req
6 | U2 | U1 | I'd be up for a cocktail. | sug
7 | U1 | U2 | Yeah, that's a good idea, but eat a bite at first? | acc, sug
8 | U2 | U1 | Yes, if there's good cocktails... | acc
9 | U1 | U2 | Yeah, that's a good idea. But what else could we eat? Well.. I feel like Mexican. What do you think? | acc, req, sta, sug
10 | U2 | U1 | Hmm, although, then maybe I'll eat a bit as well. But I'd rather have Italian. | inf, sug
11 | U1 | U2 | Yes, that's also a good idea. | acc
12 | U2 | U1 | But it shouldn't be so expensive. | sug
13 | U1 | U2 | Yeah, true. | acc
14 | U2 | U1 | I'm running low on money. | inf
15 | S | U1 | Hello, my name is Helmut. I think I could help you. Your query for a locality with Italian cuisine and inexpensive price category yields 6 results. Please confine your request further. | gre, inf
16 | U1 | U2 | So, let's go to the centre cause we're gonna meet the others later on anyway.. | sug
17 | U2 | U1 | Right. | acc
18 | S | U1 | Your query for a locality with Italian cuisine in the centre with inexpensive price category yields 5 results. Please confine your request further. | inf
19 | U1 | S | Well, Italian.. And then cocktails with that? | ack, req
20 | S | U1 | Your query for a locality with Italian cuisine in the centre with inexpensive price category yields Café Statthäusle. Would you like to see the menu? | sug, sug
21 | U1 | S | Oh yes, please %laughter% | acc
22 | U1 | U2 | OK. So, there is Pizza, and Snacks, Soups, Baguettes.. | inf
23 | U2 | U1 | Didn't you want Mexican? | sug
24a | U1 | U2 | Yes, %laughter% that would be an idea %laughter% | acc
24b | U1 | S | and Mexican? | req
25 | S | U1 | Your query for a locality with Mexican cuisine yields 2 results. My suggestions are Restaurant Flamenco and Restaurant Pepper's. | inf, sug
26 | U2 | U1 | Yeah, but do they serve cocktails? | ack, req
27 | U1 | S | Cocktails? %laughter% | req
28 | S | U1 | Your query for a locality with Mexican cuisine and cocktails yields 2 results. My suggestions are Restaurant Flamenco and Restaurant Pepper's. | inf, sug
29 | U1 | S | Well, that sounds good now. | acc
30 | U2 | U1 | OK. | ack
31 | U1 | S | Then it was the same. So, are they in the centre? %laughter% | req
32 | S | U1 | Flamenco is situated in Herdbruckerstrasse 18. Would you like to see a map? | inf, sug
33 | U1 | S | Yeah, %laughter% yeah. | acc
34 | U1 | S | And what's the menu like then? | req
35 | U1 | U2 | %unintelligible% | oth
36 | U1 | U2 | You can't see it, right? | sug
37 | U1 | U2 | Then let's go there, %laughter% OK? | sug
38 | U1 | U2 | Yes? | che
39 | U2 | U1 | Yes. | acc
40 | S | U1 | Goodbye. I hope I could help you. Enjoy your meal! | gre
41 | U1 | S | %laughter% | oth
42 | U2 | S | Thanks. | oth

Table 3.4: Annotated example dialogue from the PIT corpus.
In the presented dialogue transcript, different characteristics can be identified that have an influence on the design of the dialogue management, as follows.

Data structure. To demonstrate the data structure described in Section 3.3.1, the transition points that induce phase changes are identified in this dialogue. The domain starts with utterance 3. The system's first interaction occurs at utterance 15. The system retracts after its closing utterance (40). These three points delimit the phases with respect to the dialogue transcripts and audio files. The additional points relevant for the video data are the showing and hiding of the displayed objects other than the avatar, which appears at utterance 15. These points can only be determined by analysing the video data. By consulting the transcript only, the approximate points of showing the objects can be identified as utterances 22 and 33. The point of
hiding the objects cannot be identified by merely looking at the transcript. In utterance 34, the user asks to see another object. From the transcript alone, it is not clear whether the menu is shown or whether the conversation goes on about the displayed map; the analysis of the video data, however, reveals that the restaurant's menu is displayed upon utterance 34.

Proactive system interaction. The system gets proactively involved in the dialogue in three cases: utterances 15, 18 and 40 are proactive interactions. All other system interactions are reactive, as they follow a direct interaction request by user U1. The addressee of the utterances can be recognised clearly in the dialogue transcript. Observing the sequences of dialogue acts that precede the proactive interactions, a certain pattern becomes apparent: a suggestion is in each case followed by an acceptance move. Investigating the dialogues of the corpus, the sequence of dialogue acts that provides an ideal point in the dialogue for proactive interaction is defined more precisely in the scope of the dialogue management in Section 4.3.2. Another proactive behaviour of the system is the offering of additional helpful resources such as a restaurant's menu, street map or bus schedule. This can occur in proactive and reactive interactions alike. In the example dialogue, a menu is offered in the (reactive) utterance 20, a street map in utterance 32. The menu shown upon utterance 34 is displayed at the request of the user.

Negative feedback. A further interesting point in the dialogue occurs in utterances 9 and 10, where user U2 rejects a suggestion by U1. User U1 suggests Mexican food in utterance 9. In the subsequent utterance 10, U2 implicitly rejects Mexican food, stating that she prefers Italian. How negative feedback should be treated in the further course of a dialogue is a general question. The value could either be taken out of the currently valid set of preferences or adopted as a negative constraint. In the present case, the user's intention does not seem to be the latter: the reject is of a weak kind, and it is assumed that the user does not intend to introduce Mexican as a negative constraint. A further possibility would be to add Italian to the current set of valid constraints (i.e. alongside Mexican). The way the system is designed to proceed in such cases is stated in Section 4.3.2.

Dialogue acts. The utterances of the dialogues are analysed in order to pick out a set of suitable dialogue acts that cover all possible user (and system) contributions relevant for the task-solving process. The present example contains suggestions proposing new constraints (utt. 6, 9, 10, 12, etc.) and requests, with propositional content (e.g. utt. 24b, 27) or without (e.g. utt. 5). The responses are of an accepting (e.g. utt. 11) or rejecting form (utt. 4), acknowledgements (utt. 2, 30, etc.) or check acts (utt. 5, 38). Further, less frequently occurring acts are stall (utt. 9), inform (e.g. utt. 1, 15) and greet
(utt. 15 and 40). All utterances that cannot be categorised with the defined dialogue acts are marked as other. The tagset is kept as compact as possible, as described in Section 3.3.2. Dialogue moves. A further categorisation is performed in terms of dialogue moves, which define the function an utterance adopts in the dialogue so that the system knows how to integrate its content. The finer-grained dialogue acts are thus abstracted accordingly. The main moves at this level are, again, suggest and request; the responses are subsumed under a reply move. Refer to Section 4.2.4 for a description of the dialogue moves. Multiple moves per utterance. Multiple moves can occur within the scope of one utterance. For instance, utterance 26 contains the two moves reply and request. Both of these need to be integrated within the same utterance. This is especially relevant if an utterance contains several propositions which need to be integrated as part of the same utterance for constraint prioritisation reasons, as described in Section 4.5. Thus, dialogue modelling should be designed to handle multiple moves per utterance.
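As a sketch of this requirement, the following Python fragment integrates all moves of one utterance within a single cycle, so that co-occurring propositions count as parts of the same contribution. This is purely illustrative: the names (Move, Utterance, integrate) are assumptions, not the actual implementation, which is in Java.

```python
from dataclasses import dataclass, field

# Illustrative sketch only; names are assumptions, not the system's API.

@dataclass
class Move:
    act: str                                      # e.g. "reply", "request"
    content: dict = field(default_factory=dict)   # attribute-value pairs

@dataclass
class Utterance:
    speaker: str
    addressees: set
    moves: list            # several moves may share one utterance

def integrate(utterance, task_model):
    """Integrate all moves of one utterance in a single cycle, so that
    co-occurring propositions count as parts of the same contribution
    (relevant for constraint prioritisation)."""
    for move in utterance.moves:
        if move.act in ("suggest", "reply") and move.content:
            task_model.append(move.content)

# Utterance 26 carries a reply with a proposition plus a content-free request:
u26 = Utterance("U1", {"S"}, [Move("reply", {"cuisine": "italian"}),
                              Move("request")])
task = []
integrate(u26, task)
print(task)  # [{'cuisine': 'italian'}]
```

Integrating per utterance rather than per move keeps propositions that were uttered together grouped, which the prioritisation mechanism of Section 4.5 relies on.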
3.4 Summary
This chapter focused on the data collection and development of the PIT corpus of German multi-party dialogues. For the development of a dialogue system, suitable dialogue data is needed to obtain interaction models. In Section 3.1 we listed state-of-the-art multi-party multi-modal dialogue corpora. However, none of them possesses the dialogue properties that we intend to investigate. Thus, we recorded our own corpus within an extensive Wizard-of-Oz setup. The resulting PIT corpus consists of audio and video data of 76 dialogues. Post-processing includes the transcription and annotation of the data as described in Section 3.3. The transcribed dialogues were used for the development of suitable dialogue management (refer to Chapter 4), e.g. to decide on an appropriate range of domain-relevant components, to perform interaction modelling (e.g. to determine appropriate system responses) and to select the dialogue moves (refer to Section 3.3.3). The questionnaires obtained from the recordings form the basis for the evaluation presented in Chapter 5. The annotated video data was deployed for the evaluation presented in Section 5.3.
4 Dialogue Management for a Multi-Party Spoken Dialogue System
This chapter is dedicated to the multi-party dialogue management functionality of our dialogue system, which constitutes the most important part of this book. The dialogue management deployed is built on the basis of the Information State Update (ISU) approach (e.g. [Cooper and Larsson, 1998, Traum et al., 1999, Matheson et al., 2000, Larsson and Traum, 2000, Larsson, 2002]), as introduced in Section 2.4. The ISU approach models dialogue as a state of information that is updated according to the content of the latest incoming utterance. It is very well suited to model an agent-like system such as ours, which is to stand independently as a conversation partner. The system's (and also the other users') state of 'mind', including beliefs about itself and the others as well as the goals it aims to achieve, can be modelled. The approach gives the developer the flexibility to decide what information should be specified and in what way. It allows modifications and adjustments and can thus be adapted to fit the requirements of our proactive multi-party dialogue system. Modifications of the ISU approach become necessary because the existing multi-party extensions to the approach do not fully meet the requirements of our system, as described in Section 2.4.4. Figure 4.1 shows a schematic overview of the dialogue management of the system. Our modified information state forms the central part of the dialogue manager. It is described in Section 4.1 together with a new interaction principle that becomes necessary to extend the functionality of the interaction protocols to enable proactive interaction (refer to the discussion in Section 2.4.4). The task model as part of the information state is described in Section 4.2, in which our example domain is applied and thus all task-relevant components of the dialogue manager, such as context, domain model and update mechanisms, are presented.
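The core ISU idea can be sketched in a few lines of Python. All names here are illustrative assumptions; the system itself uses Larsson's rule notation, not this encoding.

```python
# Minimal sketch of the ISU idea: the information state is updated by
# rules whose preconditions match the latest incoming utterance.

state = {"qud": [], "task": set()}

def rule_integrate_suggest(state, utt):
    """Precondition: the latest move is a suggest. Effect: the proposition
    enters the task model and is raised as a question under discussion."""
    if utt["act"] != "suggest":
        return False
    state["task"].add(utt["content"])
    state["qud"].append(utt["content"] + "?")
    return True

RULES = [rule_integrate_suggest]

def update(state, utt):
    # apply the first rule whose precondition holds
    for rule in RULES:
        if rule(state, utt):
            break

update(state, {"act": "suggest", "content": "cuisine=italian"})
print(state)  # {'qud': ['cuisine=italian?'], 'task': {'cuisine=italian'}}
```

The chapters that follow refine exactly these two ingredients: the shape of the state and the rules that update it.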
Section 4.3 describes the strategies the system deploys in dialogue management in order to render it capable of proactive interaction. For instance, the dialogue history constitutes an important part of our setup as it stores the content of the information state at each point of the dialogue, even before the system's first interaction, and thus enables proactive system interaction (refer to Section 4.3.3). The dialogue manager
[Figure: schematic of the dialogue management component. The input (speaker, addressee, semantic content, dialogue act) feeds the Information State (PRIVATE and SHARED parts, including the task), which is connected to the dialogue history, the problem solving component, the update rules & dialogue plans, and context & domain; the output consists of result, dialogue act and addressee.]
Fig. 4.1. Dialogue management component of the system.
further contains a problem solving functionality that performs the database queries and evaluates the returned results, as presented in Section 4.5. In that context, we introduce a new approach to user constraint prioritisation that enables efficient handling of over-constrained situations in the problem solving process. The input to the dialogue manager consists of the latest utterance in an already processed state: the semantic content has been extracted and is entered in the form of a semantic representation, attribute-value pairs in our case (refer to Section 4.2.3). For a live system, a natural language input functionality would have to be implemented; so far, this has only been developed in the form of a proposed semantic parser [Strauß and Jahn, 2007] (refer to Section 6.2). Information about the current speaker, the addressee and the corresponding dialogue act(s) is input along with the semantic information. The output of the dialogue manager, e.g. if the system is to interact, such as to reply to the previous utterance, consists of the database result (where applicable) and the dialogue act that states in which form the reply should be presented. Further, the addressee of the utterance is listed. The output functionality implemented in the scope of the WIT system (Section 3.2.4) is deployed to generate the system prompts. Besides the components of the dialogue management listed explicitly in the diagram, the system's optimistic grounding and integration strategy (Section 4.3.1) as well as its interaction strategy (Section 4.3.2) are presented in this chapter. Finally, Section 4.4 presents an example sequence of information states to illustrate how the dialogue management works in practice.
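The input/output interface just described can be sketched as Python types. Field names and the trivial stand-in behaviour are assumptions for illustration only; the actual module is implemented in Java.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch of the dialogue manager's interface.

@dataclass
class DMInput:
    speaker: str              # e.g. "U1"
    addressee: str            # e.g. "S" for the system
    semantics: dict           # attribute-value pairs, e.g. {"cuisine": "italian"}
    dialogue_acts: List[str]  # act(s) of the utterance

@dataclass
class DMOutput:
    result: Optional[list]    # database result, where applicable
    dialogue_act: str         # form in which the reply is to be presented
    addressee: str            # whom the system addresses

def dialogue_manager_step(inp: DMInput) -> Optional[DMOutput]:
    # Trivial stand-in: reply only when the system is directly addressed;
    # as a side-participant it would merely update its state (omitted here).
    if inp.addressee != "S":
        return None
    return DMOutput(result=[], dialogue_act="inform", addressee=inp.speaker)

out = dialogue_manager_step(DMInput("U1", "S", {"cuisine": "italian"}, ["request"]))
print(out.dialogue_act, out.addressee)  # inform U1
```

Note that the real system also interacts proactively when not addressed; the stand-in above only covers the reactive case.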
4.1 Multi-Party Dialogue Modelling
We modified the ISU approach in order to suit the requirements of the multi-party setup as well as the task-oriented and proactive nature of our system. In the following, we present the modified dialogue model and interaction protocols. The task model and other contextual parts of the system (e.g. the domain model) as well as the update mechanisms are presented in the scope of our example domain in the subsequent section.
4.1.1 Dialogue Model
The information state (IS) we deploy is based on the information state introduced by Larsson (2002) for his IBiS1 system (refer to Section 2.4). The private part of the IS is basically used in its original form; the structure of the shared part is modified. Some of the alterations proposed by Kronlid (2008) (refer to Section 2.4.4) are adopted for our approach; overall, however, we keep it simpler and more task-oriented, as described below. Our information state is depicted in Figure 4.2; its particular elements are listed in the following:
PRIVATE:
  AGENDA : stack(Action)
  PLAN : stack(Action)
  BEL : list(Proposition)
SHARED:
  TASK : set(Proposition)
  SO : Object
  RES : list(Object)
  QUD : list( Q : Question, SPKR : Participant, ADDS : set(Participant) )
  LU : t_i ( SPKR : Participant, ADDS : set(Participant), MVS : list(Move) )
DH : { t_{i-1} [DH Obj], t_{i-2} [DH Obj], ... }
Fig. 4.2. Information state structure at point in time t_i of the dialogue.
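The structure of Figure 4.2 can be rendered as Python types. The real system is implemented in Java, so the classes below are a hypothetical sketch, not its actual data model.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set

@dataclass
class QudItem:
    question: str
    spkr: str                 # who raised the question
    adds: Set[str]            # who may be addressed about it

@dataclass
class LatestUtterance:
    spkr: str
    adds: Set[str]
    mvs: List[str]            # an utterance may carry several moves

@dataclass
class Private:
    agenda: List[str] = field(default_factory=list)   # stack of actions
    plan: List[str] = field(default_factory=list)     # stack of actions
    bel: List[object] = field(default_factory=list)   # not-yet-public beliefs

@dataclass
class Shared:
    task: Set[str] = field(default_factory=set)       # valid user constraints
    so: Optional[object] = None                       # selected object
    res: List[object] = field(default_factory=list)   # presented result set
    qud: List[QudItem] = field(default_factory=list)  # a list, not a stack
    lu: Optional[LatestUtterance] = None

@dataclass
class InformationState:
    private: Private = field(default_factory=Private)
    shared: Shared = field(default_factory=Shared)
    dh: List[Shared] = field(default_factory=list)    # dialogue history

ist = InformationState()
ist.shared.qud.append(QudItem("l1?", "U1", {"U2"}))
print(len(ist.shared.qud))  # 1
```

Storing QUD as a list rather than a stack is what later allows non-maximal QUD elements to be addressed.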
• AGENDA holds the short-term plan, i.e. the actions the system is to perform next.
• PLAN holds the long-term plan the system pursues in order to achieve its goals. The plan changes according to the activity state the system adopts.
• BEL holds the beliefs of the system which are not (yet) public, e.g. the result of the database query before it is presented to the users. We use a list as the data structure in order to enable the users to refer to restaurants in the presented order (e.g. 'the second').
• TASK constitutes the task model, i.e. it contains all currently valid user constraints and is thus used as the basis for the database queries. A more detailed description is given below.
• QUD (Questions Under Discussion) contains all unresolved questions (Q) that are currently open for discussion. It is used for anaphora resolution and ellipsis interpretation and stores the SPKR (Speaker) and ADDS (Addressees) of each question, so that it is known who can address whom about this issue.
• SO (Selected Object) contains the object which is currently being discussed (usually one of RES).
• RES (Result Set) holds the objects returned by the database query after they have been presented to the users, i.e. it contains the data formerly stored in BEL. Thus, the data structure is the same as for BEL.
• LU (Latest Utterance) comprises all relevant information about the most recent utterance: SPKR (Speaker), ADDS (Addressees), and MVS (Moves). The content of LU becomes part of DH after integration.
• DH (Dialogue History) contains all states of the IS throughout the dialogue in chronological order. Each element contains all relevant information at that specific point in the dialogue. Refer to Section 4.3.3 for a detailed description of the dialogue history.
Our information state differs from Larsson's IS [Larsson, 2002] (refer to Figure 2.2) as described in the following. Figure 4.3 shows an example information state that depicts the processing of utterance 16 from the dialogue listed in Table 3.4 (Section 3.3.3). A detailed description of the individual elements contained in the IS follows in the remainder of this chapter. The
PRIVATE:
  AGENDA : inform(res2)
  PLAN : findRestaurant
  BEL : res2 = (R1, ..., R5)
SHARED:
  TASK : [f1] f2 p1 l1
  SO : -
  RES : -
  QUD : ( Q : l1?, SPKR : U1, ADDS : U2 )
  LU : t16 ( SPKR : U1, ADDS : U2, MVS : sug(l1) )
DH : { t15 = [..], t14 = [..], ..., t1 = [..] }
Fig. 4.3. Example information state.
private part (AGENDA, PLAN and BEL) of the information state can be deployed in its original version as it serves our needs well as is: it provides the plan and agenda the system follows and supplies a field for the temporary storage of the query result (as seen in Table 3.4). The data structure of these fields does not have to be modified from Larsson's version. The extensions proposed by Kronlid in terms of modifying the structure of the PLAN field are not necessary, as we do not aim at deploying complex plans at this point (refer to Section 4.2). Instead of the commitments field (COM) used in the original IS, we deploy various task-related fields: TASK, RES and SO, all of which belong to the shared part of the IS. TASK can be regarded as the counterpart of COM, as it holds the task model, i.e. all constraints that have been collected from the interaction of all participants (refer to the description of the task model in the following section). The RES field holds the result set obtained from the database query after it has been presented to the participants (before that, it is stored in the BEL field). The SO field holds the currently selected object, usually one of the presented objects from RES which the participants selected for discussion. The RES field keeps the complete result set while one of the objects is under discussion, as the participants could switch to another of the presented objects afterwards. The case that multiple restaurants are selected at one point is handled as selecting one after the other. The structure of our QUD field is extended following the complex QUD object proposed by Kronlid, which enables the addressing of QUD elements that are not maximal (by storing the QUD elements in a list as opposed to Larsson's stack).
We further store the speaker and addressee with each issue in order to know who is to be addressed for which question; however, not in Kronlid's way, as in our case there is no need for a special field for obligated addressees, nor for the status of the element. Note that we add an ADDS field although, by definition, utterances should be directed at only one addressee per utterance. However, in order to render the system flexible in terms of modelling actually occurring situations, the addressee field can contain more than one participant. Multiple addressees are understood as more than one addressee for one utterance, as opposed to different (sets of) addressees for different parts of the utterance, i.e. different moves. In the latter case, the utterance is split and a new utterance commences at the change of the addressee. To accommodate the fact that multiple moves can be contained in one utterance, we allow a set of moves in the LU field, as also done by Kronlid. The restriction remains that the addressees are the same for all moves, as described above. For instance, utterance 19 of the example dialogue presented in Table 3.4 in Section 3.3.3 contains two moves: ack and req. These moves are integrated successively within the same integration cycle. In the case that several propositions are uttered within the same utterance (e.g. several suggestions), all propositions are integrated into TASK as part of one utterance
which becomes relevant in terms of the priority value calculation introduced in Section 4.5. Utterance 24 of the indicated example dialogue shows an utterance that has two different addressees for its two dialogue moves: U2 for acc (utterance 24a) and S for req (24b). The utterance is thus split into two separate utterances for integration. Larsson and Kronlid both deploy a field (PM in IBiS2 and PU in MultiIBiS) to store the utterance preceding the latest one (which is stored in LU). While this may suffice for a single-user system, we claim that for a multi-party system it is not sufficient to store only one previous utterance for backtracking, due to the nature of multi-party dialogue where responses are often not adjacent. Our system, as an intelligent dialogue partner, further needs a more complex memory of the conversation, which leads us to deploy an extensive dialogue history as presented in detail in Section 4.3.3. 4.1.2 Interaction Protocols The discussion in Section 2.4.4 revealed that the existing extensions to the ISU approach do not provide the functionality required by the multi-party setup of our system. Kronlid's Add Multiple Addressees (AMA) principle [Kronlid, 2008] regards multiple addressees with an option to respond. Our setup, however, does not consist of multiple addressees: every utterance is by definition directed at one addressee; the third DP is a side-participant. Thus, AMA is not the correct principle for our case. Duplicate Responders (DR) is, for the same reason, not suitable. An extension considering one direct addressee and an unspecified number of side-participants is provided with the Add Side-Participants (ASP) principle introduced by Ginzburg and Fernández (2005). ASP includes contextual updates for the side-participants in the same way they are performed by the addressee.
This way, the side-participants obey the principle of responsibility, which requires all participants of a conversation to keep track of the conversation at all times (refer to Section 2.3). They are thus enabled to make QUD-specific utterances once they are addressed. In a conversation conducted perfectly in terms of turn-taking, the current speaker selects the next speaker by addressing (either by words or gaze, implicitly or explicitly). The current speaker could at the same time address several participants, of which one or more will be the next speaker(s) (see the discussion on distributive and collective questions above). The only case that considers self-selection of the next speaker is when no participant was addressed in particular. Kronlid (2008) calls the situation of what happens (in terms of self-selection) when a group of participants is addressed with a question that only some of them can answer addressing-by-attribution. (The term is borrowed from Clark and Carlson (1982), where it has a slightly different meaning: addressing someone about an attribute without knowing which particular hearer the attribute relates to, such as, for instance, possessing something the speaker wants.) Our setup considers natural dialogue in which the
situation is normally not as ordered. We encounter side-participants who actually make QUD-specific utterances, i.e. who interact without being asked to (as in addressing) - in other words, proactively - in response to contributions that were not primarily directed at them. We thus define proactive side-participants as follows:
Definition: Proactive Side-Participants (PSP): Proactive side-participants are participants of a conversation who interact without being currently addressed.
We introduce a new interaction principle, displayed in Table 4.1, to integrate proactive side-participants. For this purpose, ASP is modified so that optional answering by the side-participants is allowed, as a step towards AMA.
Add Proactive Side-Participants (APSP): The audience consists of a set of dialogue participants (of which at least one is an addressee) who update their information state in the way addressees do. The obligation of the addressees' QUD-specific response persists until a side-participant makes a relevant (i.e. QUD-specific) contribution. After that, contributions by addressees are optional. Contributions by side-participants are always optional.
Table 4.1. New interaction principle.
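The obligation logic behind APSP can be sketched as follows; the class and method names are assumptions for illustration, not part of the system.

```python
# Sketch of the APSP obligation logic from Table 4.1.

class Obligations:
    def __init__(self, addressees, side_participants):
        self.obliged = set(addressees)       # must give a QUD-specific response
        self.side = set(side_participants)   # may always contribute

    def contribution(self, participant, qud_specific):
        """A relevant (QUD-specific) contribution by a side-participant
        dissolves the addressees' obligation; their further contributions
        become optional."""
        if qud_specific and participant in self.side:
            self.obliged.clear()

ob = Obligations(addressees={"S"}, side_participants={"U2"})
ob.contribution("U2", qud_specific=True)     # U2 interacts proactively
print(ob.obliged)  # set() - S is no longer obliged to answer
```

Contributions by side-participants never create an obligation of their own, which mirrors the last clause of the principle.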
Due to the modified information state (especially the structure of the QUD field), APSP allows for the addressing of the original question as well as comments by fellow addressees or side-participants. The situation that no specific addressee is assigned is not considered.² The principles apply to our system in the following way: while the system is still inactive, i.e. while it is domain-spotting, the AO principle is deployed, which allows an investigation of the propositional content of the user utterances. As soon as the dialogue enters the specified domain (i.e. the active state), the system is modelled as a side-participant in order to allow for proactive interaction, and thus APSP is deployed. Table 4.2 shows the interaction protocols of simple suggest and request moves. The latter move is displayed with and without propositional content. Depending on whether an object is currently being discussed (i.e. SO is not empty), the effect of requests with propositional content differs. If no object is currently selected (as is the case in our protocol), the effect is the same as a suggest followed by a request without content; otherwise,
² By definition, this case does not occur in the setup of our system.
Suggest + APSP (LatestMove == suggest(U1:U2, p)):
  1. U1: release turn
  2. ALL: push(QUD, p?)
  3. S: update(TASK, p); queryDB(TASK)
  3. U2: take turn; address p
  Next: LatestMove == accept(U2:U1, p) ...

Request + APSP (LatestMove == request(U1:S, q)):
  1. U1: release turn
  2. ALL: push(QUD, q)
  3. S: take turn; answer
  Next: LatestMove == inform(S:U1, result) ...

Request + APSP (LatestMove == request(U1:S, q(p))):
  1. U1: release turn
  2. ALL: push(QUD, p?); push(QUD, q)
  3. S: update(TASK, p); queryDB(TASK)
  3. U2: take turn; address p
  Next: LatestMove == reject(U2:U1, p) ...

Table 4.2. Interaction protocols suggest, request and request using APSP.
the propositional content is matched with the selected object. For a more detailed discussion of the different possible kinds of requests that can occur, refer to Section 4.2.4. The protocols show a normal flow of the conversation. The first two proceed normally: the addressed DP responds to the speaker's contribution. The third protocol makes use of the proactive interaction enabled by APSP, performed by a side-participant who was not addressed. The system is addressed by U1 with a request containing proposition p. U2 does not agree with p and grabs the turn, before S can take it to respond, in order to state her disapproval. U2 was only a side-participant in this situation; however, due to the urgency of her contribution, U2 interrupts the dialogue before the proper addressee (S) proceeds. After U2's utterance, the situation has changed and S cannot simply continue where it left off but has to consider the changed situation. The obligation to answer might in this case even disappear, as U1 will address U2 in the following turn with a reaction to U2's interruption. In this section, the dialogue model used for our dialogue management was introduced in detail, stating in what ways the information state approach had to be adapted to our needs. Additionally, we introduced a new interaction principle necessary to enable proactive interaction by side-participants. The new principle was then incorporated into our interaction protocols. In the following, our dialogue modelling is adapted to the example domain of restaurant selection. For an example run of our dialogue management, refer to Section 4.4, which shows a sequence of information state updates for part of the example dialogue of Section 3.3.3.
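The third protocol can be traced in code. This is a hedged sketch under the assumption (stated later for the task model) that a rejected constraint is not rolled back but has its polarisation flipped; all names are illustrative.

```python
# U1 addresses S with request(q(p)); side-participant U2 rejects p
# before S takes the turn (enabled by APSP).

qud = []
task = {}     # constraint -> polarisation (+1 preference, -1 dislike)

def request_with_content(p, q):
    qud.append(p + "?")     # ALL: push(QUD, p?)
    qud.append(q)           # ALL: push(QUD, q)
    task[p] = +1            # S: update(TASK, p) - optimistic integration

def proactive_reject(p):
    # U2 takes the turn although S was the addressee
    task[p] = -1            # no rollback: the constraint stays, negated

request_with_content("cuisine=mexican", "which restaurants?")
proactive_reject("cuisine=mexican")
print(task)  # {'cuisine=mexican': -1}
```

After the interruption, S must re-evaluate the changed state rather than answer the original request as if nothing had happened.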
4.2 Dialogue Management in the Example Domain of Restaurant Selection
The modified approach to dialogue modelling introduced in the foregoing section is now adapted to our domain of restaurant selection, resulting in the dialogue management module of our system. The module has been prototypically implemented in Java. For reasons of continuity and understandability, we continue the description in easily understandable pseudocode and the notation presented by Larsson. The knowledge base for the application (i.e. restaurant database and menu files), the querying functionality, as well as the NLG and TTS modules are reused from the wizard tool described in Section 3.2.4 and Appendix A. ASR and NLU components are not integrated at this point. A suitable statistical semantic parser is proposed in [Strauß and Jahn, 2007] (refer to Section 6.2). However, further work is necessary to achieve a performance sufficient for the scope of this work on dialogue management, which assumes perfect parsing functionality. Thus, utterances are currently input in the form of the required semantic representation together with speaker and addressee information and the corresponding DA, as illustrated in Figure 4.1. This section starts with a description of the domain-related components and knowledge sources before presenting the update mechanisms of the system.
4.2.1 Dialogue Context
The dialogue context of the system comprises the five types of context according to Bunt's DIT (1999) (refer to Section 2.4.1). The dialogue model and task model are explicit implementations of the linguistic and semantic context, respectively. The domain model, a knowledge source belonging to the semantic context, describes the domain in the form of an ontology and is presented below. The dialogue history forms an important part of the linguistic context and is described in detail in Section 4.3.3 below. The social context is implicitly contained in the update rules.
It is made up of the interactive and reactive pressures of the system, i.e. the system's obligation or intention to address an open issue or to respond to a user request. The cognitive context denotes the system's state and processing status. For this, a variable is deployed that denotes the current activity state of the system: inactive, active, or interactive. Finally, the physical and perceptual context incorporates the physical world the system is interacting in and with. A variable is used to indicate what is currently displayed on the screen, as it influences the system's as well as the users' actions. Further, another variable holds information about whether the main interaction partner is currently looking at the system.
4.2.2 Domain Model
The domain model is an external knowledge base that belongs to the semantic context. It describes the concepts used in the task model and database and
how they relate to each other. Our example domain is restaurant selection. It is modelled in the form of the (simplified) ontology depicted in Figure 4.4. An ontology is used as a formal representation of the domain, describing concepts, their properties and the relationships among these concepts. Concepts (or classes) are abstract objects that can be divided into subclasses and possess superclasses. Individuals are instances of the classes; they denote real objects and represent the knowledge in the ontology. In our case, the five main classes of the ontology are 'cuisine', 'category', 'ambiance', 'price' and 'location'. Each of these classes is divided into subclasses. For instance, the class 'cuisine' subsumes all different kinds of cuisines that exist in the modelled world (the subclasses), e.g. 'Hispanic', of which 'Spanish' and 'Mexican' are two instances. Various relations exist between the elements of an ontology in the form of object properties. This way, it can, for example, be determined that certain districts of the city are adjacent to other districts (on the concept level) and that places are adjacent to other places (on the individuals level). The ontology could theoretically also be used as a database holding the actual contents of the modelled domain. It is, however, not a flexible data structure in terms of updates, and we further aim at keeping the conceptual and data levels separate. The ontology is used in problem solving to consider related objects in the database search if the designated preferences do not yield a result (refer to Section 4.5). It further serves as a dictionary for pragmatic interpretation, resolving relations between the actually entered values and concepts the system can map to entries in the database. For instance, the users want a restaurant 'close to' where they currently are. With the help of the ontology, the system maps 'close to' to the areas neighbouring the users' current location.
The obtained locations can then be used for the database query.
4.2.3 Task Model
The task model constitutes the main part of the semantic context as it contains the propositional information necessary to perform the task. Thus, the task model deployed by our system contains all user preferences that have come up in the conversation. These so-called constraints are stored as attribute-value pairs, i.e. the category of the constraint together with its value. For instance, the preference for Italian food would be represented as category cuisine, value Italian. The polarisation of the value (whether it is a user preference (positive constraint) or dislike (negative constraint)) is stored in an associated priority value, which adopts a positive or negative value accordingly. Prioritisation is discussed further in Section 4.5 as part of the problem solving mechanism of the system. The task model is managed through the update mechanisms that integrate the propositions which come up in the conversation into the context. An optimistic integration strategy is deployed to build the task model (refer to Section 4.3.1), i.e. any proposition that comes up in an utterance is
Fig. 4.4. Ontology deployed for the restaurant domain in a slightly simplified form. [The figure shows the five main classes - cuisine, category, ambiance, price and location - below the root concept 'thing', with their subclasses and individuals, e.g. Hispanic, Asian and oriental cuisines, districts of Ulm and Neu-Ulm, and categories such as restaurant, bar and cafe.]
integrated into the task model right away. This happens regardless of whether it is accepted or rejected in the following utterance, or whether it is even acknowledged and thus explicitly grounded. The aim is to memorise every proposition that has been mentioned during the dialogue, even if it is ultimately not included in the set of valid user constraints: all participants are aware of the fact that it has been mentioned. Should it come up again later in the dialogue, the proposition is reintroduced, as opposed to being mentioned for the first time. A rejection of the introduced proposition in the following utterance (e.g. if the other participant does not agree with a suggested proposition) does
not result in a rollback, i.e. the constraint is not taken out of the task model. Instead, the polarisation and value of the constraint are adapted accordingly, which excludes it from the set of currently valid constraints. It will thus not be considered further for database queries; however, besides being implicitly included in the dialogue history, it is still explicitly contained in the task model.
4.2.4 Information State Updates
The designated flow of the interaction is described by the interaction protocols (refer to Section 4.1), which are enabled by a composition of information state updates as described in the following. Every utterance induces an update of the information state as described in Section 2.4.3. What kind of update is required is determined by the dialogue move an utterance (or part of an utterance) is assigned. Dialogue moves denote the actions that are performed with an utterance in the dialogue. The type of dialogue move that is assigned to an utterance is defined by the 'relation between the content of the utterance and the activity in which the utterance occurs' [Larsson, 2002, p.32]. The different dialogue moves have to cover the range of all different sorts of utterances that can occur in the dialogue and need to be differentiated by the system. The task-oriented dialogue we are considering mainly contains suggestions and requests from the users' side as well as reactions towards these moves, as can be seen in the example dialogue in Section 3.3.3. We therefore provide functionality in terms of the following set of dialogue moves:
• suggest(p) with p:Proposition
• request(q) with q:Question
• reply()
• (*)greet()
• (*)inform(o) with o:Object
• *respond(o) with o:Object
• *consultDB() or *consultDB(o) with o:Object
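The move inventory can be encoded together with the many-to-one mapping from dialogue acts to moves; this is an illustrative sketch in which the acts check, stall and other are left out, as they do not map to a task move here.

```python
# Mapping from dialogue acts to dialogue moves (illustrative encoding).

ACT_TO_MOVE = {
    "suggest": "suggest",
    "request": "request",
    "accept": "reply",        # several response acts collapse
    "reject": "reply",        # into the single reply() move
    "acknowledge": "reply",
    "inform": "inform",
    "greet": "greet",
}

# Starred moves the system puts on its agenda as actions:
SYSTEM_ACTIONS = {"respond", "inform", "consultDB", "greet"}

print(ACT_TO_MOVE["reject"])  # reply
```

The many-to-one mapping is what lets the update rules treat an accept, a reject and an acknowledgement uniformly as replies while the finer-grained act still determines how the content is integrated.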
While suggest, request and also greet moves occur mostly as initiating moves, reply is a so-called backward-looking move referring to something already mentioned in the dialogue. The content of the moves is further defined by dialogue acts as introduced in Section 3.3.2. The tagset of dialogue acts used consists of suggest, request, inform, accept, reject, acknowledge, check, stall, greet, and other. Some of these acts map directly to a dialogue move; for others, several acts map to one move: a reply() move can, for instance, be an accept, reject or acknowledge. Besides the dialogue moves which denote the actions that occur in relation with certain utterances, further moves are deployed by the system as actions. These moves are marked with a star in the list above; the moves marked with a star in parentheses occur in both cases. The system puts the moves respond(o), inform(o), consultDB() or consultDB(o), and greet() on its
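To illustrate the many-to-one mapping between annotated dialogue acts and dialogue moves described above, the following sketch encodes the move inventory and the act-to-move mapping in Python. The enum names and the dictionary are illustrative assumptions, not the book's implementation; in particular, accept, reject and acknowledge acts all surface as reply() moves.

```python
from enum import Enum, auto

class Move(Enum):
    """Dialogue moves of the system (names follow the list above)."""
    SUGGEST = auto()      # suggest(p), p: Proposition
    REQUEST = auto()      # request(q), q: Question
    REPLY = auto()
    GREET = auto()
    INFORM = auto()       # inform(o), o: Object
    RESPOND = auto()      # system-only action
    CONSULT_DB = auto()   # system-only action

# Hypothetical many-to-one mapping from dialogue acts to moves:
# several agreement-type acts all realise a reply() move.
ACT_TO_MOVE = {
    "suggest": Move.SUGGEST,
    "request": Move.REQUEST,
    "accept": Move.REPLY,
    "reject": Move.REPLY,
    "acknowledge": Move.REPLY,
    "inform": Move.INFORM,
    "greet": Move.GREET,
}

def moves_for_utterance(acts):
    """Map the dialogue acts of one utterance to the moves it performs."""
    return [ACT_TO_MOVE[a] for a in acts if a in ACT_TO_MOVE]
```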
4.2 Dialogue Management in the Example Domain of Restaurant Selection
agenda to appoint its immediate actions. The move respond(o) with o:Object is put on the agenda after the system has been addressed with a request for reactive interaction; the object o denotes the piece of information that is requested. The move inform(o) with o:Object is used in the proactive case when the system decides to interact, for instance after evaluating the result set of the latest database query; the object o is in this case the result set. The moves consultDB() and consultDB(o) with o:Object are used to perform a database query, either in general (based on the task model) or about an object o (i.e. a specific restaurant or a proposition) specified within the latest utterance. Update rules are used to integrate the content of incoming utterances into the information state and to decide upon appropriate further actions for the system. When updates are performed, and of which kind and in which order, is regulated by the update strategy depicted in Algorithm 2. It is based on Larsson's update strategy (described in Section 2.4.3), with modifications addressing the fact that in our case agenda items may remain on the agenda for several turns (whereas the agenda was cleared at the beginning of every cycle in Larsson's algorithm [Larsson, 2002]). Further, we allow for interruptions (as described below) and change the way QUD is downdated: newly added QUD elements resolve existing elements as before, but resolved elements stay on QUD for another turn to enable a further reaction (possibly from a different dialogue partner) to the same issue. Only then is an element taken from QUD entirely.
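The modified QUD downdating just described — resolved issues lingering on QUD for exactly one more turn — can be sketched as follows. This is an illustrative approximation; field and method names are assumptions.

```python
class QUD:
    """Questions under discussion. Resolved issues stay for one extra
    turn so a second dialogue partner can still react to the same issue."""

    def __init__(self):
        self.open = []       # unresolved issues
        self.resolved = []   # pairs [issue, turns since resolution]

    def raise_issue(self, q):
        self.open.insert(0, q)

    def resolve(self, q):
        if q in self.open:
            self.open.remove(q)
            self.resolved.append([q, 0])

    def downdate(self):
        """Called once per update cycle: age resolved issues and remove
        those that have already survived one full turn after resolution."""
        kept, dropped = [], []
        for entry in self.resolved:
            entry[1] += 1
            (dropped if entry[1] > 1 else kept).append(entry)
        self.resolved = kept
        return [q for q, _ in dropped]
```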
while NOT $LatestUtterance == failed do
    load $LatestUtterance;
    integrate;
    try downdateQUD;
    checkAgenda;
    try loadPlan;
    repeat execAgenda
end
Algorithm 2: Update algorithm.
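As an illustration, the control flow of Algorithm 2 might be rendered in Python as follows. The method names mirror the algorithm's steps and are assumptions about a hypothetical dialogue manager object, not the book's implementation.

```python
def update_loop(dm):
    """Sketch of Algorithm 2; `dm` is assumed to expose one method per step."""
    while True:
        utterance = dm.load_latest_utterance()
        if utterance is None:          # corresponds to $LatestUtterance == failed
            break
        dm.integrate(utterance)        # apply integrate-class update rules
        try:
            dm.downdate_qud()          # remove issues resolved one turn ago
        except Exception:
            pass                       # a 'try' step: failure is non-fatal
        dm.check_agenda()              # drop obligations made obsolete
        try:
            dm.load_plan()             # align the agenda with the current plan
        except Exception:
            pass
        while dm.exec_agenda():        # 'repeat execAgenda' until nothing left
            pass
```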
The procedure of the algorithm is presented in the following; an illustrative example is provided in Section 4.4. The algorithm starts by loading the new utterance into the corresponding field (LU) of the IS. In the following step, the data is integrated into the information state. Update rules of the class integrate handle all sorts of input that can occur in the dialogue, depending on the dialogue move. Actions to be performed by the system are here put on the AGENDA. In the next step the QUD is checked for possible downdates, i.e. issues that have been resolved (by the preceding utterance and were not addressed by the current contribution) can be taken from QUD. Similarly, the AGENDA is now checked for released or remaining obligations. An intended system action could have become superfluous due to a changed situation, caused e.g. by an interruption of another dialogue partner. In the following step, existing items on the AGENDA are aligned with the system's current plan. Finally, the agenda items are executed. The algorithm proceeds in an atomic manner up to this point. Database queries are performed subsequent to the above block. After the evaluation of the result set and the corresponding agenda update, the system pauses to check the context before it produces a prompt. At this point, another dialogue partner might already have taken the turn, either because he was addressed or of his own accord. If no other contribution has been uttered the system proceeds as intended; otherwise, the cycle starts again by loading the next utterance. Obligations that could not be fulfilled remain on the AGENDA and are reconsidered at checkAgenda. The update rules define the operations that can be performed on the information state and are in general based on the rules from the original ISU approach. Modifications have been performed throughout all rules to account for the modified interaction protocols and the changes implemented on the information state in the process of the adaptation to the multi-party situation. We do not list the complete collection but present exemplary rules (adhering to Larsson's notation) to demonstrate the modifications. Kronlid (2008) argues that for multi-party dialogue modelling the rule getLatestMove is no longer needed, due to the fact that all utterances have to be integrated regardless of their origin. We claim that the rule is still needed to load the latest utterance into the information state and thus keep it. However, we change its name to getLatestUtterance as it loads more information than just the latest move. The rule is listed in Rule 5.1.
It is modified from the original in order to allow multiple moves per utterance, which occurs frequently in the dialogues (refer to the example dialogue in Section 3.3.3). Further, a set of addressees is deployed in order to know who can speak to whom about what. As input, the values of the latest utterance are assumed to be stored in the variables $Latest_Speaker, $Latest_Addressees, and $Latest_Moves.

RULE 5.1: getLatestUtterance
CLASS: grounding
PRE:  $Latest_Moves == Set(Move)
      $Latest_Speaker == Participant
      $Latest_Addressees == Set(Participant)
EFF:  copy($/SHARED/LU/MVS, $Latest_Moves)
      copy($/SHARED/LU/SPKR, $Latest_Speaker)
      copy($/SHARED/LU/ADDS, $Latest_Addressees)
Another example is provided with two update rules of the integrate class, shown in Rule 5.2 and Rule 5.3. Rule 5.2 integrateSuggest is performed after U1 suggested a proposition to U2 (suggest moves towards the system are not common). Rule 5.3 integrateRequest describes the integrating actions performed after U1 requests q from S. Let q be, for instance, a detail of information about the currently selected restaurant (i.e. SO != 0). The rules integrate the information stored in LU into the appropriate fields of the IS. Thereby, issues are put on QUD (together with speaker and addressee), any propositions are added to TASK, and induced system actions are put on AGENDA (e.g. the action to perform a database query after a constraint is introduced).

RULE 5.2: integrateSuggest
CLASS: integrate
PRE:  $/SHARED/LU/SPKR == U1
      $/SHARED/LU/ADDS == U2
      in($/SHARED/LU/MVS, suggest(p))
EFF:  put(/SHARED/QUD/SPKR, $/SHARED/LU/SPKR)
      put(/SHARED/QUD/ADDS, $/SHARED/LU/ADDS)
      put(/SHARED/QUD/Q, p?)
      put(/SHARED/TASK, (p))
      put(/PRIVATE/AGENDA, reply())
      put(/PRIVATE/AGENDA, performDB())
RULE 5.3: integrateRequest
CLASS: integrate
PRE:  $/SHARED/LU/SPKR == U1
      $/SHARED/LU/ADDS == S
      in($/SHARED/LU/MVS, request(q))
      $/SHARED/SO != 0
EFF:  put(/SHARED/QUD/SPKR, $/SHARED/LU/SPKR)
      put(/SHARED/QUD/ADDS, $/SHARED/LU/ADDS)
      put(/SHARED/QUD/Q, q)
      put(/PRIVATE/AGENDA, reply(SO(q)))
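One way to mechanise such ISU update rules is to encode each rule as a pair of precondition and effect functions over a dict-based information state, with a generic driver that fires the rule only if its preconditions hold. The following sketch does this for integrateRequest; the field layout and tuple encoding of moves are assumptions for illustration.

```python
# Illustrative encoding of an ISU update rule over a dict-based
# information state (field names follow the rules above).

def pre_integrate_request(info_state):
    """Preconditions of integrateRequest: U1 addresses S with a request
    while a restaurant is under discussion (SO set)."""
    lu = info_state["SHARED"]["LU"]
    return (lu["SPKR"] == "U1"
            and "S" in lu["ADDS"]
            and any(m[0] == "request" for m in lu["MVS"])
            and info_state["SHARED"]["SO"] is not None)

def eff_integrate_request(info_state):
    """Effects: push the issue on QUD and schedule a reply on AGENDA."""
    lu = info_state["SHARED"]["LU"]
    q = next(m[1] for m in lu["MVS"] if m[0] == "request")
    info_state["SHARED"]["QUD"].insert(0, (lu["SPKR"], lu["ADDS"], q))
    info_state["PRIVATE"]["AGENDA"].append(("reply", ("SO", q)))

def try_rule(pre, eff, info_state):
    """Fire the rule iff its preconditions hold; report whether it fired."""
    if pre(info_state):
        eff(info_state)
        return True
    return False
```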
Requests occur frequently throughout the dialogues in different forms (as can be seen in the example dialogue in Section 3.3.3). A request can occur plain, without content (e.g. in 'Do you have anything like that?'), or supplying information such as a proposition (e.g. as in 'Do you have anything Indian?'). These moves belong to the category request for query, which aims at queries to obtain a set of restaurants or one specific restaurant, e.g. 'Do you know Bella Napoli?'. The move can further appear in the context of a request for information, asking about properties of a specific restaurant, such as a piece of database information (e.g. 'Where is it?') or information concerning a specific category (e.g. 'Does it have a non-smoking section?'). The difference is that in the latter case the SO field is not empty (or is filled during this move). In the following we take a closer look at the different possible request moves and system reactions. For this, we abstract from the low-level update rules which operate directly on the fields of the information state. Instead, we list rules in more compact, function-like pseudocode which enables the comprehension of the system's functionality and line of action. The focus lies on the effect which is to be achieved by the execution of a composition of update rules, similar to the interaction protocols. However, instead of looking at each interaction partner and how they relate, only the viewpoint of the system is described. An example is listed in Rule 5.3' showing the compact version of Rule 5.3. (As opposed to Larsson, we omit the speaker information in the rule names, e.g. integrateUsrAsk and integrateSysAsk, and implement various rules with the same name. The speaker information is redundant here because ambiguity is prevented by imposing different, speaker-dependent preconditions which all need to be true for a rule to fire.)

RULE 5.3': request
PRE:  mt == request(U1:S, p)
      SO != 0
EFF:  mt+1 = reply(S:U1, SO(cat(p)))
The simple syntax is briefly described: mt == Move(Speaker:Addressee, 'content') denotes the move at point in time t; thus mt+1 indicates the subsequent move. In the example Rule 5.3' the target move is a reply of the system addressing U1 with the requested information. The compact rules differ from the original update rules especially in the relationship between the precondition and effect parts. They do not describe one step after the other in terms of operations on the information state but the intended target move, leaving out the steps in between. This way, the focus is on what the system does and not on how it is performed. When these rules are actually applied, further steps are performed in between, and newly incoming moves can influence the flow of the dialogue in that they interrupt and render the planned move redundant (as described in the scope of the update algorithm above). The compact notation does not include the name of a rule class as the rules belong to more than one class of update rules; Rule 5.3', for instance, involves integration as well as generation and also grounding. The rules are listed with the different circumstances in which they can appear, whereas only the direct addressing of the system is considered. The moves are again not given distinct names as the preconditions disambiguate the appropriate candidate.
Rule 5.4 shows a request for query without specifying a value; thus, the result set of the latest database query is returned. Note that the query result is actually stored in BEL first and only at presentation time loaded into RES. However, we skip BEL for presentation purposes and use only RES and SO instead, the latter indicating one single result. An example utterance for this sort of request would be 'Do you have anything like that?'.

RULE 5.4: request
PRE:  mt == request(U1:S, {})
      SO == 0
EFF:  RES = consultDB(TASK)
      mt+1 = reply(S:U1, RES)
Rule 5.5 contains a request for query with the content of a proposition p. The proposition is integrated into the task model, which induces a database query; the result set is subsequently returned. An example would be: 'We would like to eat Indian food.'

RULE 5.5: request
PRE:  mt == request(U1:S, p)
      SO == 0
EFF:  try update TASK(p);
      BEL = consultDB(TASK);
      mt+1 = reply(S:U1, BEL)
Rule 5.6 depicts a request for information supplying a specific category c. The value of the requested category of the restaurant under discussion (SO) is returned. An example for an utterance of this kind is: 'What's the address?'

RULE 5.6: request
PRE:  mt == request(U1:S, c)
      SO != 0
EFF:  mt+1 = reply(S:U1, SO(c))
Rule 5.7 shows a request for information specifying a proposition p (note that this is the same rule as displayed in Rule 5.3'). It is tested whether p is part of the currently selected restaurant (SO). In both cases, SO's value for the category (cat) of p is returned, however, in two distinct prompts. A suitable example is: 'Does it have a non-smoking area?'

RULE 5.7: request
PRE:  mt == request(U1:S, p)
      SO != 0
EFF:  mt+1 = reply(S:U1, SO(cat(p)))

Rule 5.8 depicts a request for query supplying a specific restaurant R. General information about the restaurant is returned if the requested restaurant is contained in the database. An example would be: 'Do you know the restaurant Bella Napoli?'

RULE 5.8: request
PRE:  mt == request(U1:S, R)
EFF:  try SO = consultDB(R)
      mt+1 = reply(S:U1, SO)
Rule 5.9 contains a request for information supplying a specific restaurant R together with a category c. The value of the requested category c of the specified restaurant (if it is contained in the database) is returned. An example of this move is: 'What kind of food does restaurant Hong-Kong have?'

RULE 5.9: request
PRE:  mt == request(U1:S, {R, c})
EFF:  try SO = consultDB(R);
      mt+1 = reply(S:U1, SO(c))
Rule 5.10 lists a request for information supplying a restaurant R and a proposition p. It is tested whether p is part of the specified restaurant R. In both cases, R's value (if R is contained in the database) for the category cat of p is returned in a prompt according to the result of the test. Example: 'Does restaurant Hong-Kong serve Japanese food?'

RULE 5.10: request
PRE:  mt == request(U1:S, {R, p})
EFF:  try SO = consultDB(R);
      mt+1 = reply(S:U1, SO(cat(p)))
For a request for query, the system reply contains information about how many results the query yielded. In the case of one single result, general information about this restaurant is read out; if a menu is available, the system further offers to show it. If the query yielded 2 to 4 results, the restaurants' names are listed. In the case of a request for information, the requested information is supplied.
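The reply policy just described can be sketched as a small function that selects a prompt form by result-set size. The prompt wording and the helper parameter `menus_available` are assumptions for illustration, not the system's actual output.

```python
def compose_reply(result_set, menus_available=()):
    """Sketch of the request-for-query reply policy: report the count,
    read out a single result (offering its menu when one exists),
    list names for 2-4 results, ask to narrow down otherwise."""
    n = len(result_set)
    if n == 0:
        return "No restaurant matches your constraints."
    if n == 1:
        r = result_set[0]
        reply = f"I found one restaurant: {r}."
        if r in menus_available:
            reply += " Shall I show you the menu?"
        return reply
    if n <= 4:
        return f"I found {n} restaurants: " + ", ".join(result_set) + "."
    return f"I found {n} restaurants. Please narrow down your constraints."
```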
4.2.5 Dialogue Plans

In task-oriented dialogue, the dialogue system has to accomplish a task. The way in which it has to act in order to achieve its goals is specified in the dialogue plans. The overall plan of the system consists of collecting all constraints that come up during the conversation, performing database queries, and finally informing the users about the results until one is agreed upon. Different sub-plans are thus deployed which the system adopts according to its current activity state. During the inactive state, the system's plan consists of keyword spotting in order to detect the specified domain. The plan is depicted in Plan 5.1. Once this plan has been successfully achieved, the system's state changes to active. The plan is now (as well as during the interactive state) to find a suitable restaurant. While the system is in active mode it adopts the plan shown in Plan 5.2; the primary aim is to find an interaction point at which the state, and with it also the plan, changes. Plan 5.3 depicts the plan the system follows once it has adopted the interactive state.

Plan 5.1: spotDomain {
    while NOT domainFound do (
        spot(keywords)
    )
}

Plan 5.2: findInteractionPoint {
    while NOT interactionCriteria do (
        collect(constraint)
        queryDB
    )
}

Plan 5.3: findRestaurant {
    while NOT restaurantFound do (
        collect(constraint)
        queryDB
        inform(result)
    )
}

In this section, the dialogue modelling of the foregoing section was put into context using the example domain of restaurant selection. Thus, the task and domain model were described, as well as the system's update mechanisms and dialogue plans. In order to operate, the system further needs a grounding strategy and an interaction strategy, which are described in the following.
4.3 Enabling Proactiveness

The dialogue management strategies for our system are designed to enable proactive system interaction. The system's interaction strategy is described in Section 4.3.2, in the scope of which we introduce suitable points for proactive interaction in the dialogue, identified by empirical analysis. Section 4.3.3 presents the dialogue history deployed in our system, which provides the system with the knowledge about the conversation needed to interact proactively already at its first interaction. First, however, the grounding and integration strategy is presented. It operates optimistically, i.e. the system does not wait for explicit grounding by the dialogue partners before processing the incoming information. This way, new information is available to the system as soon as possible.

4.3.1 Optimistic Grounding and Integration Strategy for the Multi-Party Setup

Grounding ensures that all dialogue participants are at the same level of information at each point in the conversation. The common procedure would be to process contributions only when it is ensured that the information has been understood by all dialogue participants. In our case, we deploy an optimistic grounding strategy. That means the system does not await the reaction of the other dialogue partner(s) that would signal grounding (e.g. in the form of an acknowledgement) before processing a contribution. This strategy is enabled by assuming perfect understanding in our system, as was the case in the WOZ-recorded dialogues facilitated by the simulation of ASR and NLU. The system does not provide an action to be taken in the case of misunderstanding. If understanding problems were to occur, the system as is would have to be changed: a differentiation between negative feedback and understanding problems would be necessary.
Wrong values that originate in understanding errors should not be kept in the dialogue history, as these values do not denote an intended part of the conversation. The misunderstood value has to be taken out of the task model, and no trace of it should stay in the dialogue history. A cautiously optimistic grounding strategy that enables a rollback of the misunderstood value could be deployed to handle this situation. Larsson (2002) defines a cautiously optimistic grounding strategy as an optimistic strategy with the option of rollback in the case of overhasty integration of a value. In our case, a rollback in that sense is not needed for overhasty integration; a rollback is only needed in the case of understanding problems. A temporary field would have to be integrated into the information state to hold ungrounded information; as soon as the information gets grounded it can be integrated into the task model. The optimistic grounding strategy is chosen in order to be able to immediately integrate the incoming information into the information state. We
further deploy an optimistic integration strategy, i.e. the content of every utterance is immediately integrated into the task model without waiting for the other dialogue partner's response of acceptance or rejection. This is necessary for the system's proactive behaviour: the aim is for the system to be up to date with the current set of constraints as well as the query result set at each point in the dialogue, so that it can interact proactively whenever necessary. If the constraint is accepted later on in the dialogue partner's response, nothing has to be done. Negative feedback effects a change in the task model, i.e. the newly added constraint either has to be taken out or changed in its polarity, as described in the following. Every proposition ever mentioned in the dialogue is thus included in the dialogue history, since all participants are aware of the fact that it has been mentioned. In the case of a reintroduction in the further flow of the conversation, it is not an entirely new constraint and therefore should not be treated as such. There are different ways of dealing with negative feedback. A rejection indicates discontent, i.e. the participant does not agree on a certain point or constraint. The constraint could either be taken out of the set of valid constraints and thus no longer be considered for the database queries; or the introduced preference could be marked as a negative constraint, indicating something the users do not want, and in this way be considered further for the database queries. It is not a simple task to determine which constraint to integrate in which way. A decision is further necessary with every introduction of a new constraint in the case that the task model already contains another constraint of the same category. Should the constraints be treated conjunctively or disjunctively in the query (i.e. 
they are each treated as independent query criteria, as opposed to being combined into one criterion), or should the second one replace the first? These questions cannot clearly be decided without taking the participant's intention into account, which can rarely be made out by analysing the utterance alone. We briefly describe the way these points are handled in our integration strategy, as a thorough investigation of this topic goes beyond the scope of our work. In our system, discontent results in a rollback of the conventional kind, i.e. the disagreed value is taken out of the currently valid set of constraints (though not entirely, see above). This procedure is adopted based on a subjective analysis of the recorded dialogues, where this is thought to have been the users' intention in the majority of cases. Further, this way is less restrictive and thus less likely to result in an over-constrained situation. If necessary, the users can still adopt the constraint as a negative one later on in the dialogue by mentioning it again. We treat multiple constraints of the same category as disjunctive values, i.e. each one is considered individually, as this is less restrictive. However, for exclusive categories we adopt replacement if the proposed constraints are not related to each other according to the ontology. As an example, if the users propose Chinese food and request Sushi (as in Japanese) a few utterances later, both values stay in the set of constraints. However, had the second proposal been Italian cuisine, the first ('Chinese') would be replaced, as the two values are not related. The same procedure is adopted e.g. for the category 'location'; refer to Section 4.5 for a detailed discussion of the system's problem solving procedure.

4.3.2 System Interaction Strategy

The interaction strategy of the system aims at balancing between showing presence in the conversation as a third interaction partner and not appearing too intrusive. Thus, the system also interacts proactively at appropriate points if no interaction request has been addressed to it. At the same time, it tries not to interrupt the interaction by obeying adjacency pair conventions, i.e. awaiting answers or reactions from the counterpart (which would be especially important if grounding was not performed optimistically). The different possible types of system interaction are described in the following. Reactive interaction is the most common type of interaction, i.e. the system is addressed directly by the main interaction partner. In proactive interaction the system interacts without being addressed. It takes place after pauses in the dialogue or when there is additional information the users could be provided with or should be informed about. In the recorded dialogues of the PIT corpus (refer to Section 3.3) it can be observed that most of the proactive interaction takes place at the first interaction of the system as well as the last (as is the case, for instance, in the example dialogue presented in Section 3.3.3). In between, the system is mostly addressed and included as a third interaction partner who is endowed with expert knowledge (refer to Section 5.4 for statistics on this matter). The system responds with the requested data and, whenever appropriate, offers additional information, such as a menu or street map. The system's interaction takes place as depicted in Table 4.3.
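The interaction types summarised in Table 4.3 can be approximated by a dispatch function over the activity state and the current triggers. The boolean parameters are simplifications of the table's conditions (U1:S becomes `addressed`, BEL != 0 becomes `bel_nonempty`, and so on); this is an illustrative sketch, not the system's implementation.

```python
def system_action(state, addressed, pip, spip, latest_is_request,
                  bel_nonempty, so_selected):
    """Map activity state and triggers to the system's next action
    (or None to stay quiet), roughly following Table 4.3."""
    if state == "inactive" and addressed:
        return "greet.inform" if latest_is_request else "greet"
    if state == "active" and (addressed or pip):
        return "greet.inform" if bel_nonempty else "greet"
    if state == "interactive":
        if addressed:
            return "greet" if so_selected else "respond"
        if spip and so_selected:
            return "greet"        # proactive closing statement
        if pip and so_selected:
            return "suggest"      # e.g. offer a menu or a city map
        if pip and bel_nonempty:
            return "inform"       # present the latest query result
    return None
```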
The system's first interaction is a greeting phrase in which the system introduces itself and offers assistance. The system proceeds differently depending on the activity state it is in. Interaction from the inactive state is only possible upon a direct interaction request (U1:S) by the main interaction partner. The addressing is recognised by the keyword spotting mechanism. The utterance is then analysed entirely, which enables the detection of propositional content. If the contribution was of the form request(p), the system reacts with greet directly followed by an inform action which presents the result of the database query performed after integrating the proposition. (The 'dot' in the notation denotes moves performed successively by one dialogue participant, i.e. the system in this case, whereas the 'comma' notation used in the following denotes two moves performed one after another by different interaction partners.) Either way, the system in this case skips the active state and proceeds in the interactive state right away. When the system is in the active state, the interaction can either be reactive or proactive.

State        Action                     Condition                               Interaction
inactive     greet                      U1:S                                    reactive
             greet.inform               U1:S && LatestUtterance==request(p)     reactive
active       greet                      U1:S                                    reactive
             greet                      PIP                                     proactive
             greet.inform               U1:S && BEL != 0                        reactive
             greet.inform               PIP && BEL != 0                         proactive
interactive  respond                    U1:S                                    reactive
             inform                     PIP && BEL != 0                         proactive
             suggest(o) with o:Object   PIP && SO != 0                          proactive
             greet                      U1:S && SO != 0                         reactive
             greet                      SPIP && SO != 0                         proactive

Table 4.3. Dialogue system interaction types.

Reactive interaction is once again induced by a direct
interaction request in the same way as described above. Proactive interaction occurs at certain proactive interaction points (PIP) which are specified below. The system's interaction consists of a greet action which, in the case that constraints have already been defined (or are defined within this request) and thus a database query has been (or is) performed, is followed by an inform action presenting the result set. (The condition BEL != 0 implies that BEL is cleared after presentation, i.e. after RES is set with RES = BEL.) In the case that a specific object has been selected (i.e. spoken about) by the users, information about this object is given. From the interactive mode the system interaction proceeds likewise; however, the actions performed by the system are respond in the reactive and inform in the proactive case. In the latter case, if the condition SO != 0 additionally holds, i.e. an object has already been selected by the users, the system proceeds by offering to show a menu or a city map, depending on the previous utterances. If the users have asked about the location or address, a city map is offered, whereas if they have been speaking about means of transportation, the bus schedule is offered. In the general case, the restaurant's menu is offered. The closing statement (i.e. the greet move) can be reactive or proactive after a restaurant has been agreed upon. For proactive interaction two criteria have to be considered: the contentual motivation of the system to interact, and the appropriate point in time for interaction. The system's aim of interaction is to assist the users in solving the task. It thus reports facts to the users that help promote the task solving process, such as results and also problems. The question is what kind of information is considered important enough to be reported proactively. The system can encounter four different situations when performing a database query, which
are of different significance for the dialogue, as listed in Table 4.4. The contentually relevant situations occur after a database query, i.e. the result has to be evaluated. Another content to be considered for proactive interaction is the successful ending of the task solving process: if the users have agreed upon a specific restaurant, the system can place its closing prompt and the interaction is over. The evaluation of the query result can expose four different situations: an over-constrained situation, which yields no results; an under-constrained situation, i.e. the result set is too big to be presented; or a presentable number of results, which can either be one object or a small number of objects (2 to 4 in our case) that can be read out individually. The situation that one single result is returned from a query is considered optimal. It is closest to a possible solution denoting a successful end to the conversation and is thus rated most significant. The over-constrained situation constitutes another important case as it denotes a dead end in the task solving process and thus requires an intervention in the form of a modification of the constraint set, which can either be performed directly by the system in terms of compromises or relaxation as described in Section 4.5, or with the help of the users. The presentable number of results is possibly only one step away from the solution as it presents a small number of objects the users can choose from. The under-constrained case is considered least significant as it requires further constraints to be defined before a solution can be found. The different levels of importance of the contentual motivators do not, at this point, influence the interaction decision to a large extent: whenever the system encounters a point in the dialogue suitable for proactive interaction, it takes this chance.
As seen from the recorded dialogues, most of the interactions during the interactive phase of the system are reactive. The first and last interactions, however, very frequently occur in a proactive manner. These cases are thus considered separately, as described below.

Result Set                      Situation           Significance
SO != 0 AND FINALLY ACCEPTED    restaurant found    task solved
BEL == 1                        optimal             task is possibly solved
BEL == 0                        over-constrained    problem, intervention required
BEL == 2-4                      choice of results   ok for presentation
BEL > 4                         under-constrained   constraint definition ongoing

Table 4.4. Contentual motivation for proactive interaction.
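The query-result evaluation of Table 4.4 can be sketched as a small classifier that maps the result-set size (and the final-agreement flag) to the situation and a rough significance rank. The numeric ranks (0 = most significant) and parameter names are assumptions for illustration.

```python
def evaluate_result(bel_count, agreed=False, presentable_max=4):
    """Classify a database query result per Table 4.4; returns the
    situation label and a significance rank (0 = most significant)."""
    if agreed:                      # SO != 0 and finally accepted
        return ("restaurant found", 0)
    if bel_count == 1:
        return ("optimal", 0)
    if bel_count == 0:
        return ("over-constrained", 1)
    if bel_count <= presentable_max:
        return ("choice of results", 2)
    return ("under-constrained", 3)
```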
The point in time for interaction is determined by observing the flow of the dialogue. A pause, for instance, i.e. a break in the dialogue that exceeds a certain threshold in duration, denotes an interruption or time which the dialogue partners use to think; it thus poses an ideal interaction point. The same effect is often achieved in natural dialogue by short utterances, such as acknowledgements or stall and check acts, denoting interruptions or fillers. It can be observed that certain sequences of dialogue acts that occur repeatedly in the recorded dialogues lead to a proactive system interaction (respectively wizard interaction during the recordings). From observation of the Session III dialogues of the PIT corpus, the most common sequence for proactive interaction consists of a suggest act (uttered by either the system or one of the users) which is followed by at least two acts of agreement by the users, such as one or more accept acts followed by an arbitrary number of acknowledge acts. At times, a less significant stall or check act is additionally part of the sequence. In pseudo-code, the dialogue act sequence of agreement denoting a strong proactive interaction point (SPIP) is described as follows:

[SPIP]  suggest, {[accept]+, [acknowledge]*}[2+]
It can be observed that this sort of sequence occurs at points in the dialogue that conclude something that has been discussed before and thus indicates a suitable point for a proactive system contribution. Linguistically, the sequence displays the termination of an interaction cycle consisting of an initiating act and concluding acts, i.e. several of them in this case. After this, the dialogue continues with a new initiating act by an arbitrary speaker, i.e. either the previous speaker keeps the turn or the other dialogue partner grabs it. This point is thus ideally suited for system interaction of any kind. If this sequence occurs in the process of discussing a certain object, it denotes its acceptance. The system then offers to show the object or interacts with a closing statement, depending on the previous interaction (as described above). If this sort of agreement occurs at an earlier point in the dialogue, the system provides up-to-date information, i.e. a presentation of the database result, which can take the form of any of the four kinds listed above. The following sequence of dialogue acts also shows a complete interaction cycle in terms of dialogue acts, however in a less assertive manner. It denotes a suitable proactive interaction point which occurs more frequently throughout the centre part of the dialogue, i.e. between the system's first and last interaction (as can be seen e.g. in the example dialogue in Section 3.3.3 and Section 4.4 below). It can be described as follows:

[PIP]
suggest,accept+
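The two patterns can be made concrete with a minimal matcher over dialogue-act labels. This is an illustrative sketch of one possible reading of the patterns; the function names are our own, not the system's implementation.

```python
def is_spip(acts):
    """Strong proactive interaction point: a suggest followed by at least two
    agreement acts -- one or more 'accept', then any number of 'acknowledge'.
    Less significant stall/check acts may appear in between and are skipped.
    Illustrative reading of: suggest,{[accept]+,[acknowledge]*}[2+]"""
    acts = [a for a in acts if a not in ("stall", "check")]
    if not acts or acts[0] != "suggest":
        return False
    tail = acts[1:]
    n_acc = 0
    while n_acc < len(tail) and tail[n_acc] == "accept":
        n_acc += 1
    rest = tail[n_acc:]
    # at least one accept, only acknowledges after, two agreement acts in total
    return n_acc >= 1 and all(a == "acknowledge" for a in rest) and len(tail) >= 2

def is_pip(acts):
    """Weaker interaction point: a suggest followed by one or more accepts
    (the pattern suggest,accept+)."""
    return len(acts) >= 2 and acts[0] == "suggest" and all(a == "accept" for a in acts[1:])
```

A sequence such as suggest, accept, acknowledge qualifies as SPIP, while suggest followed by a single accept only qualifies as PIP.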
The suggest is uttered by one of the users addressing the other user, who then responds with an accept act, denoting that both users agree to the proposed item. The system always waits for a reply of any kind by the other dialogue partner so as not to appear rude and to provide the other participant a chance to react to the proposition. In the case that the answer to the suggest was a reject, no proactive interaction can take place. A rejection induces a change in the task model denoting a rollback which takes out the optimistically inserted constraint. The resulting database query might thus yield the same result that was possibly presented before; the system does not repeat the same result a second time. The system proactively chooses an interaction strategy based on the presented proactive interaction points and contentual motivators. As seen in Table 4.3, PIP induces proactive interaction in all cases except for the last and final system utterance, for which SPIP is required. In which way pauses or silence are to influence and induce proactive behaviour is not considered at this point as the system is not operating in real time. It poses an interesting question, however, which is subject to further research in the domain of non-verbal expressions in interaction and beyond the scope of this work.

⁷ The most common types of proactive interaction are the system's first and last utterance, which could potentially have an impact on this fact.

4.3.3 Dialogue History for Proactive System Interaction

Proactive behaviour is achieved by full contextual awareness of the system. Constantly observing the situation and context relevant for the system enables it to instantly detect any occurring problems. Conversational and situational awareness should not only prevail at the current point in the interaction but include the past dialogue. Thus, a dialogue history is installed to provide the system with a memory. Besides the dialogical information, it is further extended to include all task-relevant information in order to be able to refer back to something said or restaurants discussed at an earlier point in the conversation. The aim is to endow our dialogue system with proactive behaviour. To recall, the system goes through the following life cycle: Figure 4.5 shows the different levels of attentiveness the system operates in.
While the users talk about a different topic, the system is in inactive mode, spotting for its specified domain. As soon as the conversation enters the domain of the system's expertise, the system adopts the active state. It pays close attention in order to model the dialogue between the users and to build the task model and dialogue history. With the system's first interaction it enters interactive mode, during which it acts as an independent dialogue partner assisting the users in every way possible. When the task is solved, the system retracts from the conversation and once again adopts inactive mode. The system should not rely solely on the users asking for interaction but be able to decide for itself when it is appropriate and necessary to interact in order to solve a task in the best way possible. Thus, it interacts not only upon explicit interaction requests by the main user (reactive interaction) but also on its own initiative when important occurrences arise that are worth reporting to the users or when pauses occur in the dialogue (proactive interaction). The system's interaction strategy is described in Section 4.3.2.
[Figure 4.5 shows the system activity cycle: in INACTIVE mode the system performs domain spotting; when the conversation enters the domain, the system switches to ACTIVE mode and constructs the dialogue history; when interaction is required, it switches to INTERACTIVE mode for pro-/reactive interaction; when the task is solved, it returns to INACTIVE mode.]

Fig. 4.5. System life cycle.
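The activity cycle of Figure 4.5 can be sketched as a small state machine. The class, method and state names below are illustrative assumptions for demonstration, not the system's actual API.

```python
class InteractionAssistant:
    """Minimal sketch of the activity cycle in Fig. 4.5:
    inactive -> active -> interactive -> inactive."""

    def __init__(self):
        self.state = "inactive"  # domain spotting

    def observe(self, utterance_in_domain, interaction_required, task_solved):
        """Advance the state machine on the observed dialogue conditions."""
        if self.state == "inactive" and utterance_in_domain:
            self.state = "active"       # start building task model and history
        elif self.state == "active" and interaction_required:
            self.state = "interactive"  # first system contribution
        elif self.state == "interactive" and task_solved:
            self.state = "inactive"     # retract from the conversation
        return self.state
```

Note that the dialogue history construction (described below) spans both the active and the interactive state.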
A conversation with a dialogue system normally starts with the first interaction of or towards the system, which denotes the first utterance of the dialogue in single-party systems. At this point, the interaction and also the dialogue history start being modelled. For our system this would be the point when the system steps into interactive mode, in order to model the entire human-computer interaction. However, the multi-party situation our system encounters changes the requirements of the dialogue history modelling: the dialogue history also has to represent the period during which the system is active in order to enable proactive system interaction, i.e. the system must already be endowed with knowledge at the point of its first interaction. The dialogue history thus starts to be built as soon as the dialogue between the users enters the specified domain, i.e. at the point when the system enters active mode. It lasts until the task has been solved and the system withdraws from the conversation. The system thus has knowledge about the complete conversation and can refer back to values mentioned by the users while it was only silently 'listening', and it does not require the users to repeat their preferences in a request to the system when the system is finally to interact. At the end of the interaction, the dialogue history is discarded. Figure 4.6 pictures the way the dialogue history relates to the dialogue.
[Figure 4.6 shows the dialogue history relative to the dialogue: the system is inactive until the domain starts, active until its first interaction, and interactive until the problem is solved; the dialogue history spans the active and interactive phases.]

Fig. 4.6. Dialogue history as it relates to the dialogue.
Besides the common linguistic information necessary for smooth interaction and for ellipsis and anaphora resolution, our dialogue history also contains task-related information. The dialogue is tracked in order to be able to reconstruct the conversation. By deploying the dialogue history, the system can resolve references to things mentioned earlier in the dialogue, as it is able to trace back to former states of the dialogue. It can adopt a former state completely, i.e. reload the users' preferences and the result set of an earlier point in time, or extract some information from it. Thus, when the users refer back to earlier results, as in e.g. 'Let's just take the one from before', the system looks in the dialogue history for the last presented result that had been agreed upon, depending on the context, and presents it again. The dialogue history consists of, in sequential representation, snapshots of the dynamic context at the respective point in time. As part of the information state, all presented query results are stored (RES), along with the information which of these objects is currently under discussion (SO). The users' reaction to the presented objects can be accessed in the chronologically adjacent item of the dialogue history holding the next contribution. Dialogical information is stored to the extent of being able to fully reproduce the task-related issues. The private part of the information state, as well as the QUD field, is not part of the dialogue history. These components depend on the current point in the dialogue. There is no reason why the PLAN or AGENDA from former states should be accessed. The PLAN could possibly give information about the system's activity state at the considered former dialogue state; however, this information can e.g. also be retrieved through an utterance's relative position to the one containing the unambiguous dialogue move of the system's first interaction.
BEL holds the temporary query result and can thus be recalculated with a retrieved task model. The information contained in QUD is also of momentary value and will change to integrate the retrieved information into the current QUD. The structure of the dialogue history is depicted in Figure 4.7. The dialogue history elements are ordered chronologically, identified by a specific utterance number. The dialogue depicted in the excerpt is currently at number ti. Therefore, the dialogue history starts with ti−1. The following elements are contained:
DH : list

    ti−1 [ SPK  : Participant
           ADS  : set(Participant)
           MVS  : list(Move)
           TASK : set(Proposition)
           RES  : list(Object)
           SO   : Object ] , ti−2 [...], ..., t1 [...]

Fig. 4.7. Dialogue history at point in time ti.
• SPK (Speaker): The speaker of this utterance.
• ADS (Addressees): The addressee or set of addressees of this utterance.
• MVS: The move(s) contained in the utterance.
• TASK: The task model in its current state, contained as a snapshot of the specific point in time. For a detailed description of TASK and its elements refer to Section 4.5 below. The current set of user preferences is contained in the dialogue history as it cannot be reconstructed in any other way. Depending on the point in the dialogue at which a preference has been introduced, its priority value is high or low; however, no immediate conclusion can be drawn about which preferences were valid at a certain point in time, due to the fact that a preference can be taken out in the course of the dialogue (i.e. its value stops rising) and later be reintroduced (i.e. the value starts rising again from where it had stopped before). The prioritisation values do not have to be remembered and are thus not contained in the dialogue history. Reloading a previous set of user preferences happens on a different basis of conversational history at a later point in time: the dialogue has proceeded in the meantime and all dialogue participants are aware of this. Reconstructing an earlier situation thus happens on the basis of the current prioritisation values, in terms of a reintroduction of constraints that had been taken out in the meantime, while constraints that have been added in the meantime are taken out.
• RES (Result Set): The result set of objects returned by the latest database query, i.e. on the basis of the constraints contained in the TASK of this utterance. For this database query, the original prioritisation values had been used.
• SO (Selected Object): The currently selected and discussed object (one of RES).

The multi-party setup is only explicitly contained in the dialogue history in the speaker and addressee fields.
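A snapshot with the fields of Figure 4.7 can be sketched as a record type, together with the kind of backward lookup used to resolve references such as 'the one from before'. The concrete types and method names are illustrative stand-ins, not the system's implementation.

```python
from dataclasses import dataclass
from typing import List, Optional, Set

@dataclass
class HistoryItem:
    """One snapshot t_i of the dialogue history (fields as in Fig. 4.7)."""
    spk: str            # SPK: speaker of the utterance
    ads: Set[str]       # ADS: addressee(s)
    mvs: List[str]      # MVS: dialogue move(s)
    task: Set[str]      # TASK: snapshot of the current constraint set
    res: List[str]      # RES: latest presented query result
    so: Optional[str] = None  # SO: currently selected object, if any

class DialogueHistory:
    """Chronologically ordered snapshots, discarded at the end of the interaction."""

    def __init__(self):
        self.items: List[HistoryItem] = []

    def append(self, item: HistoryItem):
        self.items.append(item)

    def last_selected_object(self) -> Optional[str]:
        """Walk backwards to the most recent snapshot with a selected object."""
        for item in reversed(self.items):
            if item.so is not None:
                return item.so
        return None
```

The backward walk illustrates how a former state can be adopted completely or partially; a fuller version would also restore the TASK snapshot of that state.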
The information about who proposed specific constraints, which could in general be traced using the speaker field, is not necessary for the system. However, if the dialogue were to be modelled more thoroughly with respect to the individual users, by deploying user models to keep track of each user's individual preferences, the dialogue history would need to
be extended. It would have to include, with each constraint, information about the proposing dialogue participant as well as the other participant's reaction towards it. User modelling would further require speaker identification so that the system could load the right user's profile if more than one main user is to be able to use the system. Situations can be thought of where user modelling is not intended, e.g. if a user does not want the other dialogue partner to know the content of previous interactions with the system, for instance, in the restaurant domain, her favourite or frequently visited restaurants. For our setup we do not deploy user modelling and thus the presented task-related modelling of the dialogue history is sufficient. The dialogue history presented in this section contains all dynamic context information relevant for reproducing information from earlier states of the dialogue. While prevalent dialogue systems (refer to Section 2.4.1) deploy a dialogue history mainly with a focus on linguistic and conversational matters, our dialogue history puts an emphasis on the task-related components to enable references to previously selected objects and the restoring of formerly active constraint sets. Common dialogue histories start modelling with the first interaction of the system, unlike ours, which starts to model the dialogue as soon as the users come to speak of the specified domain and this way enables proactive system interaction. All task-relevant information is collected before the system has joined the interaction. The system can thus interact at any appropriate point, proactively or upon request, and directly provide relevant information.
4.4 Proactive Dialogue Management Example

A part of the example dialogue presented in Section 3.3.3 is listed in Table 4.5. In the following, the extract is depicted as a sequence of information states in order to illustrate the presented proactive dialogue management approach.
ID  Spkr  Addr  Utterance                                                   DA
16  U1    U2    So, let's go to the centre cause we're gonna meet the       sug
                others later on anyway..
17  U2    U1    Right.                                                      acc
18  S     U1    Your query for a locality with Italian cuisine in the       inf
                centre with inexpensive price category yields 5 results.
                Please confine your request further.

Table 4.5: Snippet of example dialogue from the PIT corpus.
Figure 4.8 shows utterance 16, "So, let's go to the centre cause we're gonna meet the others later on anyway..", at the point of getLatestUtterance at the beginning of the update cycle. Thus, LU is filled with the latest details: U1 is the speaker, U2 the addressee. The utterance contains one move, suggest. TASK contains all propositions collected in the dialogue up to this point: f2 stands for Italian, p1 denotes inexpensive. The proposition f1 stands for Mexican but is not active at the moment, which is indicated by the parentheses. The RES field contains the set of restaurants presented in the foregoing utterance. The QUD field does not contain an entry for res1, due to the fact that res1 was uttered in the form of an inform, as it contains too many entries to be listed in detail; the users were thus asked to alter their query. DH contains the previous utterances. The private part of the IS contains integrate on the AGENDA as it is the next action to be performed, as regulated by the update strategy. Generally, these actions are handled implicitly without appearing on the agenda; in this case, we place the actions of the update algorithm explicitly on the AGENDA for demonstration purposes. The plan in this phase of the dialogue is findRestaurant (refer to the description of dialogue plans in Section 4.2.5).
PRIVATE:
    AGENDA : integrate
    PLAN   : findRestaurant
    BEL    :
SHARED:
    TASK : [f1] f2 p1
    SO   :
    RES  : res1 = (R1, ..., R6)
    QUD  : {}
    LU   : t16 [ SPK: U1, ADS: U2, MVS: sug(l1) ]
    DH   : {t15, t14, ..., t1}

Fig. 4.8. Example information state after getLatestUtterance of utterance 16.
Figure 4.9 depicts the subsequent step in the sequence, where the utterance is integrated into the IS. The suggest move causes the associated proposition l1 (which stands for towncentre) to be put on QUD as the currently discussed topic. The proposition is further included in TASK due to our optimistic handling of the integration (refer to Section 4.3.1). A change in the TASK field induces a database query; thus, consultDB is put on the AGENDA. The RES field is cleared as res1 is not the subject of the current utterance and is furthermore outdated after the change of TASK. Before the database query takes place, QUD has to be checked for possible downdates as the next step in the update algorithm.
PRIVATE:
    AGENDA : downdateQUD, consultDB
    PLAN   : findRestaurant
    BEL    :
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  :
    QUD  : { [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t16 [ SPK: U1, ADS: U2, MVS: sug(l1) ]
    DH   : {t15, t14, ..., t1}

Fig. 4.9. Example information state after integrate of utterance 16.
Figure 4.10 shows the information state after the database query has been performed. The query result is stored in BEL. The dialogue is at this point ready for the next utterance. The expected next speaker is U2, with a reply to utterance 16. The system waits for the other party to reply in order not to interrupt. The proactive interaction criteria (as described in Section 4.3.2) are not fulfilled at this point; system interaction would only take place here if a 'long' pause followed this utterance. Figure 4.11 depicts the information state of the dialogue after loading the next utterance, "Right.", an accept move by U2. The utterance is integrated in the following step.
PRIVATE:
    AGENDA : inform(res2)
    PLAN   : findRestaurant
    BEL    : res2 = (R1, ..., R5)
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  :
    QUD  : { [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t16 [ SPK: U1, ADS: U2, MVS: sug(l1) ]
    DH   : {t15, t14, ..., t1}

Fig. 4.10. Example information state after consultDB of utterance 16.
PRIVATE:
    AGENDA : integrate, inform(res2)
    PLAN   : findRestaurant
    BEL    : res2 = (R1, ..., R5)
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  :
    QUD  : { [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t17 [ SPK: U2, ADS: U1, MVS: acc() ]
    DH   : {t16, t15, ..., t1}

Fig. 4.11. Example information state after getLatestUtterance of utterance 17.
Figure 4.12 illustrates the information state after the integration of utterance 17. The accept move is added to QUD with the accepted proposition l1 and the respective speaker and addressee. The content of the previous latest utterance has been put on DH as described in Section 4.3.3. The next step addresses the QUD downdate. The element put on QUD by the accept move resolves the other element contained in QUD. As described in Section 4.2.4, resolved QUD elements are left on QUD for one extra turn to enable other dialogue partners to address the issue. Thus, no change takes place in the information state. The dialogue modelling is at this point ready to load the next utterance. The system takes the initiative and informs the user about the re-
PRIVATE:
    AGENDA : downdateQUD, inform(res2)
    PLAN   : findRestaurant
    BEL    : res2 = (R1, ..., R5)
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  :
    QUD  : { [ Q: l1, SPK: U2, ADS: U1 ],
             [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t17 [ SPK: U2, ADS: U1, MVS: acc() ]
    DH   : {t16, t15, ..., t1}

Fig. 4.12. Example information state after integrate of utterance 17.
PRIVATE:
    AGENDA : integrate
    PLAN   : findRestaurant
    BEL    : res2 = (R1, ..., R5)
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  :
    QUD  : { [ Q: l1, SPK: U2, ADS: U1 ],
             [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t18 [ SPK: S, ADS: U1, MVS: inf(res2) ]
    DH   : {t17, t16, ..., t1}

Fig. 4.13. Example information state after getLatestUtterance of utterance 18.
sult of the latest database query (performing the inform move on the AGENDA), making use of the detected proactive interaction point: the past two moves constitute a dialogue sequence defined as a convenient interaction point for the system (refer to Section 4.3.2). Figure 4.13 depicts the information state after the details of the new utterance have been loaded into LU and DH has been updated. Figure 4.14 shows the information state after utterance 18 has been integrated. The inform move causes res2 to be cleared from BEL and instead put in RES. The result set contains five items in this case, too many to be presented individually. Thus, the same situation is encountered as described above and no item is added to QUD.
PRIVATE:
    AGENDA : downdateQUD
    PLAN   : findRestaurant
    BEL    :
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  : res2 = (R1, ..., R5)
    QUD  : { [ Q: l1, SPK: U2, ADS: U1 ],
             [ Q: l1?, SPK: U1, ADS: U2 ] }
    LU   : t18 [ SPK: S, ADS: U1, MVS: inf(res2) ]
    DH   : {t17, t16, ..., t1}

Fig. 4.14. Example information state after integrate of utterance 18.
In the following, QUD is downdated. This time, the latest utterance did not address the issue resolved in the previous utterance; it is thus taken off QUD. The result is shown in Figure 4.15. The example sequence ends at this point, as the system is waiting for the next utterance.
PRIVATE:
    AGENDA :
    PLAN   : findRestaurant
    BEL    :
SHARED:
    TASK : [f1] f2 p1 l1
    SO   :
    RES  : res2 = (R1, ..., R5)
    QUD  : {}
    LU   : t18 [ SPK: S, ADS: U1, MVS: inf(res2) ]
    DH   : {t17, t16, ..., t1}

Fig. 4.15. Example information state after downdateQUD of utterance 18.
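The update cycle walked through in Figures 4.8 to 4.15 can be sketched in simplified form. The state representation (a plain dict) and function name are illustrative; in particular, the one-extra-turn handling of resolved QUD issues is omitted for brevity.

```python
def update_cycle(state, utterance, db_query):
    """One simplified pass of the update steps getLatestUtterance ->
    integrate -> downdateQUD -> consultDB. `state` holds TASK/BEL/QUD/LU/DH;
    `db_query` maps a constraint set to a result list."""
    # getLatestUtterance: the previous LU moves onto the dialogue history
    if state["LU"] is not None:
        state["DH"].append(state["LU"])
    state["LU"] = utterance

    # integrate: optimistically add suggested propositions to TASK,
    # raising them on QUD as the currently discussed topic
    task_changed = False
    for move, prop in utterance["MVS"]:
        if move == "suggest":
            state["TASK"].add(prop)
            state["QUD"].append(prop)
            task_changed = True

    # downdateQUD: an accept resolves the raised issue (the full algorithm
    # keeps resolved issues on QUD for one extra turn)
    for move, prop in utterance["MVS"]:
        if move == "accept" and state["QUD"]:
            state["QUD"].pop()

    # consultDB: every change of TASK induces a fresh query into BEL
    if task_changed:
        state["BEL"] = db_query(state["TASK"])
    return state
```

Running this on utterances 16 and 17 of Table 4.5 reproduces the main transitions of the figures: the suggest extends TASK and QUD and triggers a query, the accept resolves the issue.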
4.5 Problem Solving Using Discourse Motivated Constraint Prioritisation

This section is dedicated to the problem solving mechanism of the system. The task at hand is rather simple and deploys constraint-based problem solving (similar to the approach presented in [Qu and Beale, 1999]), a common and straightforward way of dealing with problems of this kind. During the course of the conversation, the system collects all task-relevant information, which forms the basis for the database queries. Positive (what the users want) and negative (what they do not want) so-called constraints are combined to narrow down the result set. Depending on its size, the result set of objects is then presented to the users, who can choose a suitable object. If the result set is empty or exceeds the number of presentable objects (four or fewer in our case), further actions are necessary. If the database query yields too many results, i.e. a number too high for the individual objects to be read out, an under-constrained situation (in short, UC) has occurred. The result set has to be further constrained to obtain a presentable number of results. Alternatively, the results can be grouped in a certain way; these groups or their topmost elements (according to some sort of rating) can then be presented to the users. If the query yields no results, an over-constrained situation (in short, OC) has occurred, i.e. there are too many or too restrictive constraints. In this case, the system has to find an alternative solution to present to the users. The current constraint set needs to be altered in order to yield a solution. A
common way to handle this problem is to relax one or more constraints until the result is non-empty. The crucial decision hereby is which constraint(s) to relax, preferably the ones that are least important to the users. How can the importance or priority of constraints be determined? This section focuses on a solution that takes the course of the dialogue into account without needing to deploy a user model, as is often done in related work of this kind. For instance, Walker and colleagues (2004) weight user preferences in the restaurant domain using multi-attribute decision theory in combination with user models. They divide the domain into six categories [Whittaker et al., 2002], differentiating between quantitative attributes (food quality, cost, decor, service) and categorical attributes (food type, neighbourhood). While the categories are user-independent, the weighting of the attributes is user-dependent; thus, user models are needed to rank the categories and hence also the preferences. The ranking is used mainly for the generation of adequate system output, in order to present the most important information while not overloading the user, i.e. presenting only an amount of information that can easily be taken in. An approach to prioritising user preferences without deploying a user model is presented by Carberry and colleagues (1999) within the domain of university course selection. Based on the assumption that it is the obligation of a collaborative dialogue system to come up with the best possible solution, the system recognises preferences that are expressed directly by the user and also infers them from patterns in the user's reactions. A combination of conversational circumstances and semantics is considered to weight the user preferences. Conversational circumstances regard the situation in which a proposition is uttered, e.g.
if it is uttered deliberately in the initial description of the problem (more important) or as a response to a proposal of the system (less important). Semantic aspects include whether a preference is expressed directly or obtained through deduction. A proposal history is further implemented to track all presented proposals and the users' reactions towards them in order to find trends that emerge over time. Our approach is similar in that we also take time into account and do without user models. One way to find out about priorities is to simply ask the user which constraint is more important. However, this method is out of the question for an intelligent system of our kind. Another common and straightforward way to prioritise user preferences in single-user systems utilises the semantics of a constraint-bearing utterance to determine the importance of the constraint to the user. This method is also part of the Carberry approach introduced above. The semantic content of an utterance is analysed for specific words that signal importance, e.g. 'maybe', 'definitely', etc. The semantic analysis depends only on the constraints and the single utterances they appear in. In the multi-party case, however, each introduced constraint is further discussed in the course of the dialogue, i.e. rejected or accepted by the other dialogue partner. For instance, a suggestion is introduced with a 'maybe' (obviously a
weak constraint); this low priority is recognised by the dialogue partner, who can then accept the suggestion, thereby possibly raising the constraint from weak to moderate, or, being aware of the low priority of the proposal, make his or her own counter-suggestion with higher importance. Naturally, conversation partners often have different and diverging preferences, which complicates the problem solving process immensely and makes automatic semantic analysis very difficult. We claim that the situation of (at least) two users pursuing a common goal supersedes the need for exhaustive semantic analysis. In our case, the conversation serves the purpose of finding a mutually liked object. The dialogue partners are generally not interested in a long discussion but rather in coming to a quick consensus. Besides uttering their own preferences and dislikes, the dialogue partners evaluate each other's preferences against their own and react accordingly. The second dialogue participant analyses the proposal of the first and constructs the consequent own utterance according to her liking, e.g. underlining her own preferred suggestion in an attempt to convince the first dialogue participant that her proposal is the better choice⁸. Thus, there can be different reasons why a dialogue participant speaks in a certain way, and rating preferences across both interaction partners poses a difficult challenge. Instead, we suggest using the ongoing discourse, especially the order of occurrence of the constraints, for prioritisation. It thereby does not matter who introduces which preference as the dialogue aims at a mutual solution. The system does not model the preferences of each user independently but collects all task-relevant information to form a set of common constraints for the database queries. If the query does not yield any results, an intelligent system is expected to provide an alternative solution. Our solution to this problem is presented in the following.
The evaluation of the approach is presented in Section 5.2.2. We compare the actual reaction of the users to over-constrained situations in the recorded dialogues (during which the system only reports the OC situation without proposing a solution) to the way our system performs in the same situation. Further, we compare our approach to manual semantic analysis. The results are very promising.

⁸ Clearly, however, it is not just that easy. Different factors influence the utterance design, such as the social roles dialogue participants take on in the conversation (refer to Section 2.3.2).

4.5.1 Prioritisation Scheme

For prioritisation, information is extracted from each utterance according to three categories: Changing Categories, Current Preferences, and Prioritisation Values.

• Changing Categories (CC) indicate the topic(s) of the current utterance. For instance, if one of the participants states the wish to eat Italian food, the CC field is tagged with category F, which stands for food or cuisine. The other distinguishable categories are location (L), ambiance (A), category (C), price range (P), and specials (S).
• Current Preferences (CP) lists all currently valid constraints, represented by individuals of the respective category, and is thus used for a database query. In the example above, 'Italian' would be categorised as belonging to category food (F) and is thus allocated the individual F1 (given it is the first F-subject in this conversation). A second F-value later on in the dialogue, e.g. Mexican food, would then be tagged F2, etc. This applies analogously to all other categories (L1, L2, P1, etc.).
• Prioritisation Values (PV) manage the priority values of the individuals. Each individual is stored with its priority value and the actual value it represents. With every recalculation (induced by a change in the CP section) all currently valid values rise by 1 Priority Point (PP). A new individual is introduced with the value '1 PP', i.e. it has risen 1 PP from the default value of 0 PP. Negative constraints or dislikes are represented with negative values accordingly (starting at '-1 PP').

At the beginning of a dialogue, the table contains no entries. As soon as a relevant topic is raised, it is displayed in CC. The corresponding individual is inserted into CP and its PV value is assigned 1 PP (or -1 PP in case of negation). Every time the users modify the constraint set, e.g. by proposing or dismissing a constraint, a change in the CP section occurs and the PVs are recalculated: the values of all individuals currently represented as valid preferences (in CP) are raised by 1 PP (or lowered by 1 PP for negative values). Thus, the longer a subject stays valid, the higher its priority value becomes, which is obviously the desired effect.
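The priority-point bookkeeping can be sketched as follows; the class and method names are illustrative assumptions, and negative constraints are omitted for brevity.

```python
class Prioritiser:
    """Sketch of the PP bookkeeping: every change of CP raises all currently
    valid constraints by 1 PP; dismissed constraints keep their value and
    resume from it when reintroduced."""

    def __init__(self):
        self.pv = {}     # PV: individual -> priority points
        self.cp = set()  # CP: currently valid preferences

    def _recalculate(self):
        # every change of CP raises all currently valid values by 1 PP
        for ind in self.cp:
            self.pv[ind] += 1

    def introduce(self, individual):
        # new individuals start at the default 0 PP; reintroduced ones
        # resume from their former value
        self.pv.setdefault(individual, 0)
        self.cp.add(individual)
        self._recalculate()

    def dismiss(self, individual):
        # the PV keeps its current value; only CP changes
        self.cp.discard(individual)
        self._recalculate()

    def first_request_bonus(self):
        # all individuals valid at the first request to the system get +1 PP
        for ind in self.cp:
            self.pv[ind] += 1
```

A newly introduced individual thus ends up at 1 PP (0 PP default plus the recalculation its own introduction triggers), and constraints that stay valid longer accumulate higher values.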
That means, as long as a subject is not explicitly abandoned or replaced by a different value due to incompatibility between constraints (and disjunctive values), it is considered valid and part of the current preferences. If a constraint is dismissed, it is taken out of CP and its PV stays at the current value. Should it be re-introduced into the dialogue with the same polarity, it is reinserted into CP and the priority calculation resumes from the former value. A change in the polarity of a valid constraint is performed by simply adding or removing a minus sign ('-') on the PP value. All currently valid individuals are listed in CP, which serves as the basis for the system's database queries. Every change in the constraint set induces a database query so that the system is always up-to-date and ready to interact. Generally, the system interacts for the first time after the users have already come to an initial agreement. As also noted by [Carberry et al., 1999], this first request to the computer deserves special attention as it displays the users' original preference. Thus, all individuals valid at the time of the first computer request receive a First Request Bonus (FRB) of an extra 1 PP. At present, the prioritisation only comes into play in the case of an over-constrained situation, i.e. if the database query does not yield any results. In order to offer the users the best possible alternative result the system has
to decide which constraint(s) to relax. We deploy the (slightly simplified) Algorithm 3:

    while overconstrained OR $resultset == $previous_resultset do
        if onto check($relaxcandidate).succeed then
            present($resultset); break;
        else if relax($relaxcandidate).succeed then
            present($resultset); break;
        end
        $relaxcandidate++;
    end
Algorithm 3: Relaxation algorithm.

After execution of the algorithm, the obtained result set is further examined. If it is empty again, the algorithm proceeds. It also continues in case the result set is not satisfactory: the result set is compared to the one that was presented to the users in the system’s last turn before the initial OC situation. If the two are equal, the system would be presenting the very result set that had obviously just been rejected or further constrained by the users. Thus, the relaxation algorithm proceeds at this point. The constraint with the lowest priority value is chosen as the initial relaxation candidate (RC). The procedure relaxcandidate++ assigns the next candidate for relaxation. If no result was obtained after the first relaxation, the relaxed constraint (the former RC) is reinserted before the next RC is considered for relaxation. After another unsatisfying result, both constraints are relaxed, etc. The presented algorithm is simplified in this respect, and also in that it assumes that there is always exactly one constraint with minimal priority value, which is not always the case. The implemented algorithm handles this by trying out each of the potential RCs and taking the one with the best results. In the process of trying to honour all preferences and avoid relaxation, the system inspects the ontology for similar or related values of the RC (onto check in Algorithm 3). If, for instance, no restaurant can be found near the town hall, it is checked, before relaxing this constraint, whether the query would be successful if the area around the cathedral were considered. The query is thus changed to extend the location by the adjacent neighbourhood around the cathedral. This kind of ontology check can be performed for all exclusive categories (L, F, P, C, and A). Observation of the recorded dialogues showed that there is a need to exclude certain constraints from the relaxation process.
Values from category S (e.g. ’cocktails’) or the value ’expensive’ from category P, as well as negative constraints,
were very important to the users. Regardless of the point at which these values were introduced in the dialogue, they were repeatedly mentioned and never relaxed. Thus, a protection flag is applied to these constraints so that they are not considered for relaxation.

4.5.2 Example

The prioritisation scheme is applied to the dialogue snippet shown in Table 4.6. The dialogue proceeds in the following way:

| Utterance | CC | CP | PV |
|---|---|---|---|
| ... U2 1: What do you prefer to eat? U1 2: Let’s go to a Chinese restaurant! | F | {F1,C1} | F1=1PP Chinese; C1=3PP restaurant |
| U2 3: Oh yeah, Chinese is a good idea. S 4: Your query for a Chinese restaurant returned three hits. My suggestions are fast food restaurant Asia Wok, fast food restaurant Asia Wan and restaurant Panda. U1 5: Ok. But we want one that is a bit more exclusive. | P | {P1,F1,C1} | P1=1PP exclusive; F1=2PP Chinese; C1=4PP restaurant |
| S 6: I found no result for an exclusive Chinese restaurant. U1 7: OK, then we have to go somewhere else. U2 8: Should we go to an Italian restaurant? | F | {F2,P1,C1} | F2=1PP Italian; P1=2PP exclusive; F1=2PP Chinese; C1=5PP restaurant |
| U1 9: Ok, that’s fine. I love Pizza. ... | | | |

Table 4.6. Prioritisation scheme applied to an extract of a dialogue.
• Utterance 2 introduces ’Chinese’ food. Thus, CC is assigned F for cuisine, and CP adds F1 to the list of current preferences. The change in CP induces a recalculation of PV: F1 is added at value 1 PP. Note that C1 is already there because, earlier in the conversation, a ’restaurant’ was requested. Before the recalculation, C1 was at value 2 PP, which means it had already received an extra point as FRB; utterance 4 is thus not the system’s first interaction.
• In utterance 3, ’Chinese’ is repeated. There is no change of the current constraint set, thus no recalculation.
• The database query resulted in a presentable result set of three restaurants, which is therefore read out (utterance 4).
• In utterance 5, the query is further constrained: ’exclusive’ is introduced as P1 with a PV of 1 PP. The change in CP induces a recalculation.
• S 6: OC. The database query did not yield any results. The system did not have any means to resolve the problem and present an alternative result; it simply notified the users about the OC. The users then change the constraint set accordingly.
• In utterance 8, ’Italian’ is suggested and integrated into the according fields. Italian is disjunctive to Chinese, so the two constraints cannot coexist and ’Chinese’ (F1, respectively) is taken out, i.e. it is the constraint the users relax.

How would our algorithm have performed in this situation? The algorithm would have tried to relax ’exclusive’ in the first round. However, the new result set would be exactly the same as the one presented before. The algorithm thus goes on and chooses ’Chinese’ as the next RC; ’exclusive’ is reinserted into the constraint set used for the query. The check in the ontology for related values yields a strong relation between Chinese and Thai or Japanese food. As there are exclusive Thai as well as Japanese restaurants in the database, the system would thus be able to present a solution very close to the initial request. We claim that this solution might even be better (i.e. closer to what the users wanted originally) than what the users actually found when proceeding in the dialogue and switching to Italian. Clearly, a statement of this kind can only be made on the basis of common sense; objective evaluation is hardly possible in fictitious scenarios, as no real motive drives the users’ actions and preferences. A more extensive evaluation is presented in Section 5.2.2.
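Replayed in code, the relaxation procedure of Algorithm 3, including the ontology check, can be sketched as follows. The database and ontology interfaces (query_db, onto_related) are hypothetical placeholders for the system's actual components, and the cumulative relaxation of several constraints is omitted.

```python
# Sketch of the relaxation procedure of Algorithm 3.  query_db and
# onto_related are hypothetical placeholders, not the system's components.

def relax_overconstrained(constraints, query_db, onto_related,
                          previous_result, protected=frozenset()):
    """constraints: list of (individual, value, priority) triples.
    Tries relaxation candidates (RCs) in order of rising priority; each
    former RC is reinserted before the next one is tried.  Returns a
    result set, or None if no single candidate helps."""
    candidates = sorted((c for c in constraints if c[0] not in protected),
                        key=lambda c: c[2])
    for ind, value, prio in candidates:
        rest = [c for c in constraints if c[0] != ind]
        # Ontology check first: substitute a related value (e.g. Thai or
        # Japanese for Chinese) before actually dropping the constraint.
        for alt in onto_related(ind, value):
            result = query_db(rest + [(ind, alt, prio)])
            if result and result != previous_result:
                return result
        # Plain relaxation: drop the constraint entirely.
        result = query_db(rest)
        if result and result != previous_result:
            return result
    return None
```

Applied to the example of Table 4.6, ’exclusive’ (lowest priority) is tried first but only reproduces the rejected result set, so ’Chinese’ becomes the next RC and the ontology check finds an exclusive Thai or Japanese alternative.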
Footnote 9: ’U1 2’ stands for DP U1 and the second utterance of this dialogue snippet.

4.6 Summary

This section focused on the dialogue management for our proactive multi-party dialogue system. The prevalent ISU approach (e.g. [Larsson, 2002]) was adopted, providing an ideal and flexible basis for an agent-like system such as ours. Thus, in order to endow our system with its designated functionalities of
proactive interaction behaviour, intelligent problem solving and multi-party capability, we built on top of the approach (and partly on the existing multi-party extensions). In Section 4.1 we introduced our modified information state, for which we partly adopted the multi-party extensions proposed by Kronlid (2008) (refer to Section 2.4.4) and partly performed additional alterations to suit our specific setup. We further introduced a new interaction principle to allow proactiveness in interaction protocols. Our example domain of restaurant selection was applied to dialogue management in Section 4.2. All task-related parts, such as the task model, the domain and context model, as well as the update mechanism, were presented. The components were again adapted to the multi-party situation of our system. Section 4.3 concentrated on the different dialogue management strategies that enable proactive system interaction. Our optimistic grounding and integration strategies were presented, as well as the system’s interaction strategy that identifies points in the dialogue that are suitable for proactive interaction. Furthermore, the extensive dialogue history was described, which starts modelling as soon as the dialogue enters the specified domain and thus enables proactive interaction already for the system’s first utterance. We illustrated how the dialogue management performs in practice by listing an example sequence of information states. Finally, we described the constraint-based problem-solving functionality of the system in Section 4.5. In order to always provide the best possible solutions, we introduced a new discourse-motivated algorithm to prioritise user constraints in multi-party dialogues, which allows user-friendly handling of over-constrained situations. The evaluation of this algorithm, presented in Section 5.2.2, demonstrates its strong performance.
5 Evaluation
In this section, the evaluation of our dialogue system is presented. Established methods for evaluating spoken language dialogue systems differentiate between subjective and objective methods, as described in Section 2.2. The aim of the evaluation of our dialogue system is to appraise the user acceptance and rating of this novel sort of interactive system. Thus, the main focus is put on subjective evaluation, for which data was obtained through the questionnaires filled out by the participants prior and subsequent to the data recordings. Usability evaluation is performed using two established methods (AttrakDiff [Hassenzahl et al., 2003] and a modified version of SASSI [Hone and Graham, 2000]). Evaluation is performed over the different recording sessions to trace the improvement of the system, as well as by comparing the setups with and without avatar using the data of the Session III dialogues. A technical self-assessment of the participants was further conducted in order to validate the comparison of the different recording sessions. The results of the evaluation are presented in Section 5.1. Section 5.2 concentrates on objective measures of the dialogues. First, a data analysis is presented. Then, the performance of our novel algorithm for discourse-oriented user constraint prioritisation is examined. For this, we apply our system with the implemented prioritisation algorithm to the recorded dialogues that contain at least one over-constrained situation. The relaxation candidate proposed by the algorithm is then compared to the users’ actual behaviour during the recordings, i.e. which constraint the users relaxed. A second evaluation compares our algorithm to a manual semantic analysis. A further prominent point of the evaluation is the avatar we deploy as a form of personification of the system to provide an additional, i.e. visual, modality.
The aim of this is to augment the presence of the system in the interaction in order to work towards the acceptance of the system as an equivalent interaction partner. We study the effect the avatar has on the users’ interaction behaviour and subjective ratings. Section 5.3 presents an analysis of the main user’s gaze direction. It is investigated in which ways the human-computer interaction differs when deploying an avatar as opposed to only voice output
and in what ways the interaction differs compared to the human-human interaction with the other user. The subjective ratings are analysed in the scope of the usability evaluation presented in Section 5.1.4. Finally, the proactiveness of the system is assessed as presented in Section 5.4. Objective measures are used to obtain information about the system’s actual interaction behaviour. Further, user ratings provide subjective information about how the system’s interaction is perceived.
5.1 Usability Evaluation

Usability evaluation is an important means to measure a dialogue system’s usefulness and user-friendliness (refer to Section 2.2). We deploy two different standardised questionnaires to measure usability: AttrakDiff [Hassenzahl et al., 2003] and SASSI [Hone and Graham, 2000] (in a modified form, see below), and further apply a technical self-assessment to be able to relate the users’ technical affinity to their ratings. The design of our questionnaire is described below. Usability evaluation is performed only for the participants who interact directly with the system (group U1). It cannot be determined in what way participants of group U2 can judge the interaction as, by definition, they do not interact directly with the system. The presented usability evaluation is performed in two ways. The aim of the first evaluation is to assess the improvement of the system between the different recording sessions (in between which the system was improved in terms of speed and functionality) and the impact this has on the participants’ ratings. Further, the system is evaluated in terms of its acceptance and appraisal by the participants. How does such a system, which acts as an independent dialogue partner, come across to the users? The third and final recording session, which deploys a system simulated as closely as possible to the envisaged system, is thoroughly evaluated. Half of the Session III dialogues include an avatar as visual output of the system, the other half use only speech output. This further allows an assessment of the effect the avatar has on the interaction.

5.1.1 Questionnaire Design

The questionnaire used for the recordings is composed of questions about demographic data of the participant, questions concerning technical self-assessment, and a subjective rating of the system (including AttrakDiff and SASSISV as described below). It closes with open questions about the recordings, i.e.
what the participant liked, did not like, and what could be improved. The questionnaire was extended for the third recording session to include additional subjective ratings regarding the system interaction.

Footnote 1: In reality, this is of course not always the case; however, it is not clearly definable which users interacted directly with the system and which did not.

Appendix C
displays SASSISV, the remaining SASSI items, and further items for subjective ratings as deployed during Session III. The different evaluation techniques used are presented in the following.

AttrakDiff

AttrakDiff [Hassenzahl et al., 2003] is a standardised evaluation method to assess the attractiveness of any sort of product in terms of usability and appearance. It consists of 28 word pairs of opposite adjectives placed at both ends of a seven-point scale. The system is evaluated in terms of the following scales:
• Pragmatic Quality (PQ) addresses the human needs for security, control and confidence. Errorless and accurate functioning of the system and user-friendly design are measured in this aspect. Example: ”human - technical”
• Hedonic Quality (HQ) addresses the human needs for excitement and pride. A distinction is made between the following two aspects:
– Identification (HQI) describes whether the users can express and identify themselves with the system in the desired way. Example: ”alienating - integrating”
– Stimulation (HQS) is for instance obtained through visual design and novel features. It satisfies the user’s needs for excitement and discoveries. Example: ”dull - captivating”
• Attractiveness (ATT) denotes how pleasant and likeable the system appears to the user. Example: ”motivating - discouraging”

SASSI(SV)

Hone and Graham introduced the 34-item questionnaire for Subjective Assessment of Speech System Interfaces (SASSI) [Hone and Graham, 2000], an approach for a multidimensional analysis designed especially for dialogue systems. Items such as ”I felt confident using the system.” are rated on a 7-point Likert scale. The higher the rating, the more positive it is (except for the Annoyance scale, where the opposite holds). We perform the evaluation with a modified, i.e. shortened, version of the questionnaire.
The ’SASSI Short Version’ (SASSISV) contains 16 items covering the same six scales as the original SASSI, with the factors listed in the following. It has to be noted that SASSISV shows highly significant (p<.0001) correlations with the original SASSI for all six scales (tested in Session III with 37 participants), which justifies the shortening of the questionnaire. Appendix C lists SASSISV as its first 16 items; the following 18 items are the SASSI items that were not used for SASSISV. The questionnaire was deployed in this form in Session III and thus additionally lists the mean values of the ratings of the IIIa and IIIb data.

Footnote 2: The version used is AttrakDiff 2.0 in German.
• System Response Accuracy (SRA) determines the correctness of the system actions from the perspective of the user’s intention and expectation. Example: ”The system didn’t always do what I wanted.”
• Likeability (L) contains statements of opinion about the system. Example: ”The system is friendly.”
• Cognitive Demand (CD) addresses the amount of effort and mental workload needed to interact with the system. Example: ”I felt confident using the system.”
• Annoyance (A) addresses irritating factors. Example: ”The interaction with the system is boring.”
• Habitability (H) contains questions that relate to how clear the interaction is for the user, whether the user knows what to say at which point and knows what the system is doing. Example: ”I always knew what to say to the system.”
• Speed (S) addresses the speed of the system. Example: ”The interaction with the system is fast.”

Technical Self-Assessment

To assess the validity of the evaluation with respect to the participants, a technical self-assessment is performed. For this, part of a German questionnaire for estimating complacency potential [Feuerberg et al., 2005] is deployed, enabling us to judge how technically adept the participants are. 11 items on a 5-point Likert scale are evaluated in terms of the following scales:
• Confidence in Technology (CIT) rates the participant’s trust in technical devices. Example: ”The operation of fully automated airplanes that function without pilots would increase security in flying.”
• Attitude towards Technology (ATE) describes the participant’s individual position towards technology. Example: ”I am very interested in technology.”

5.1.2 Participants

152 participants took part in the recordings over the three sessions. They were between 19 and 51 years of age (mean 24.4 years). Their main professional backgrounds are computer science (n=41, i.e. 27.3%) and natural sciences (n=27, i.e. 18%). Engineering and medicine are equally represented with 24 persons each (16%).
The daily computer usage time of 149 participants lies between 10 and 720 minutes, with an average of 252 minutes. Computer experience lies between 4 and 21 years, with an average of 11.43 years. The users can thus be said to be very familiar with technology and adept with computers.

Footnote 3: 150 of the participants filled out the questionnaire; the demographic data presented here is thus based on this number.
The number of evaluated questionnaires used to compare the different recording sessions in the following section is 19 for Session I, 20 for Session II and 14 for Session III. Only the users acting as U1 are considered. For Session III, only the dialogues that deploy a perfect system simulation, i.e. with avatar and without the emotion-eliciting wizard interaction strategy, are used. The evaluation of the technical self-assessment shows that all participants have a similar attitude towards technology. The Kruskal-Wallis test [Kruskal and Wallis, 1952] shows no significant differences between the three groups with regard to the participants’ technical comprehension. Gender and professional background also have no influence on the evaluation. Thus, statistical correctness is ensured for the usability evaluation presented below, which compares the data of the different recording sessions. Due to the scientific background of nearly all participants, their self-assessment ratings show interest and confidence in technical devices. For example, most participants state that they are very interested in technical aspects (mean: 4.21, SD: .78 on a scale between 1 and 5) and that they enjoy activities involving technical devices (mean: 4.16, SD: .83). Contact with technical devices is very common for them (mean: 4.74, SD: .45). The results of the two scales are shown in Figure 5.1. The values are very similar over all sessions for both scales, which statistically validates comparisons between the different recording sessions.
Fig. 5.1. Technical self-assessment: mean values of the scales Confidence in Technology (CIT) and Attitude towards Technology (ATE) for the participants of each individual recording session as well as for all participants together.
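The between-session comparisons rely on the Kruskal-Wallis H test. A stdlib-only sketch of the H statistic is given below; the tie-correction factor of the full test is omitted for brevity, and the rating lists are invented for illustration, not the study's data.

```python
# Stdlib-only sketch of the Kruskal-Wallis H statistic used for the
# between-session comparisons.  Ties receive mid-ranks; the tie correction
# of the full test is omitted for brevity.

def kruskal_wallis_h(*groups):
    pooled = sorted(x for g in groups for x in g)
    n = len(pooled)
    rank = {}
    i = 0
    while i < n:                      # assign mid-ranks to tied values
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        rank[pooled[i]] = (i + 1 + j) / 2.0
        i = j
    s = sum(sum(rank[x] for x in g) ** 2 / len(g) for g in groups)
    return 12.0 / (n * (n + 1)) * s - 3.0 * (n + 1)

# Invented CIT scale scores for the three sessions (NOT the study's data).
# With three groups (df = 2), H below the 5% critical value of 5.99 of the
# chi-square distribution indicates no significant difference.
h = kruskal_wallis_h([4.0, 4.5, 3.5, 4.0],
                     [4.5, 4.0, 4.0, 3.5],
                     [4.0, 3.5, 4.5, 4.0])
```

In practice one would use a library routine (e.g. scipy.stats.kruskal) that also returns the p-value; the sketch only illustrates the rank-based statistic itself.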
5.1.3 Analysing the System Progress

The three recording sessions are evaluated to show how the improvements the system underwent between the sessions influence the usability ratings. The composition of the groups of participants is the same as described in the participants section above: 19 for Session I, 20 for Session II and 14 for Session III. The system’s setup and modifications are described in detail in Section 3.2.4. Here, we briefly sketch the modifications the system underwent between the different sessions in order to relate the results directly to the system improvements.
• Session I: Basic wizard front-end, therefore slow system interaction. System represented by speech only; menus are displayed on screen.
• Session II: Elaborate client-server architecture with an easy-to-use wizard front-end including e.g. auto-completion functionality, and thus faster wizard interaction. System represented by speech and avatar. System’s functionality extended by street maps pointing at a restaurant’s address.
• Session III: System’s functionality extended by bus schedules. One single wizard used for all recordings. Avatar can be shown or hidden (50% of the recordings with, 50% without avatar). Emotion-eliciting interaction strategy for the wizard deployed in some recordings.

Results: AttrakDiff. Figure 5.2 shows the evaluation of the system over the three recording sessions using the AttrakDiff evaluation method. It displays the mean values and standard deviations of the ratings. Session I revealed that there was still room for improvement in various areas. For Session II, the system was improved and accordingly rated more positively; the results show a clear improvement between Session I and Session II. The difference between Session II and Session III is not very large, which is not surprising, as the system is in general the same in both sessions.
The small difference that can be observed may tentatively be explained by the number of wizards that took part in the recordings. Session III deployed one single wizard, while the recordings of Session II were conducted by various wizards. However, as the number of analysed cases is not very large, the difference is not very prominent.

Results: SASSISV. The evaluation over all three recording sessions shows similar results using SASSISV. Figure 5.3 again displays the mean values and standard deviations of the ratings. Note that the A-scale denotes Annoyance; the observed degrading ratings on this scale are thus an expected result. The scores of the Habitability scale show the biggest disparity: for Session III, the ratings rise by about 1.5 points. The Kruskal-Wallis test [Kruskal and Wallis, 1952] applied to the different factors shows significant differences over all three evaluated sessions for Speed (p=.004) and Habitability (p=.031). The effect on the Habitability scale may again tentatively be explained by the fact that in Session III only one wizard was deployed, who was more acquainted with the system and thus ensured consistent behaviour, which is reflected in the users’ ratings.
Fig. 5.2. Usability evaluation over all three sessions using AttrakDiff. The scales are Pragmatic Quality (PQ), Attractivity (ATT), Hedonic Quality Identification (HQI), and Hedonic Quality Stimulation (HQS).
Fig. 5.3. Usability evaluation over all three sessions using SASSISV. The scales are System Response Accuracy (SRA), Likeability (L), Cognitive Demand (CD), Annoyance (A), Habitability (H), and Speed (S).
Highly significant correlations are observed throughout all factor combinations of AttrakDiff and SASSISV. This is an anticipated effect, as both methods aim at evaluating similar aspects of the system. Although different in execution, both methods can be said to work equally well and achieve similar results.
5.1.4 Assessing the Usability

To evaluate the usability of the dialogue system, we use the dialogues obtained in Session III, which were recorded with a simulation of the perfectly working envisaged system (n=25). The recordings, i.e. the corresponding questionnaires, that deploy an emotion-eliciting strategy for the system interaction (refer to Section 3.2.3) are not considered in this evaluation. 14 of the Session III recordings used for evaluation deploy an avatar (IIIa), 11 use only speech output (IIIb). Thus, besides the usability of the system in general, the presented evaluation also investigates the effect the avatar has on the user ratings. What difference does a face make in this context? Figure 5.4 shows the mean values and standard deviations of the ratings obtained using SASSISV. The overall result shows a clear difference between the two setups: the system deploying the avatar performs better throughout the evaluation. The Habitability scale shows the greatest improvement with avatar, which is plausible considering that the Habitability scale contains questionnaire items relating to how clear the interaction is for the user: whether the user knows what to say at which point and knows what the system is doing. Hone and Graham (2000) also relate it to the concept of ’visibility’. Thus, a face can be thought to make, and obviously does make, a meaningful difference on this point. The Cognitive Demand scores are at an equal level, which means that the amount of concentration necessary for the interaction is the same for both setups. The evaluation with AttrakDiff yields equivalent results and is thus not listed.
Fig. 5.4. Usability evaluation of the system using SASSISV. The scales are [Hone and Graham, 2000]: System Response Accuracy (SRA), Likeability (L), Cognitive Demand (CD), Annoyance (A), Habitability (H), and Speed (S).
The Effect of the Emotion-Eliciting Interaction Strategy on Usability

A comparison of the Session III data over the four different setups is presented using AttrakDiff. It includes the 14 IIIa (avatar) and 11 IIIb (no avatar) dialogues used in the evaluation above and, further, the 5 IIIc (avatar) and 7 IIId (no avatar) dialogues which deploy the emotion-eliciting interaction strategy. Overall, the setups with avatar score higher than those without. IIIa scores best of all setups regarding Pragmatic Quality (mean: 5.07); however, the Attractivity of the IIIc dialogues with emotion strategy scores higher than that of the IIIa dialogues (IIIa mean: 5.03, IIIc mean: 5.49). IIIc also yields a higher rating on the Stimulation (HQS) scale. This shows that the users do not necessarily rate a perfectly working system best (except in the pragmatic rating, as expected). Instead of feeling disturbed, they might feel more involved and stimulated by a system which commits a few mistakes once in a while, and thus rate this setup better. Due to our small basis of data, these results are only speculative. However, the Mann-Whitney U test [Mann and Whitney, 1947] shows a significant difference (p=.03) in the rating of Attractivity between IIIc and IIId, which also uses the emotion strategy but no avatar (IIIc mean: 5.49, IIId mean: 4.53). While the ratings diverge between all pairs of setups with and without avatar, the difference in the ratings is more prominent for the emotion-eliciting dialogues. The results are depicted in Figure 5.5.
Fig. 5.5. Usability evaluation over the different setups using AttrakDiff.
The evaluation with SASSISV shows consistent results; the same phenomenon is observed. Again, the IIIc setup scored best, ahead of IIIa (second best), regarding the following scales: Likeability (IIIc mean: 5.68 vs.
124
5 Evaluation
IIIa mean: 5.54), Cognitive Demand (IIIc mean: 5.55 vs. IIIa mean: 5.18) and Annoyance (IIIc mean: 1.5 vs. IIIa mean: 2.25). In all remaining scales (SRA, H, S), IIIa scored best. The Mann-Whitney U test [Mann and Whitney, 1947] finds no significant differences between IIIa and IIIc, but a highly significant difference between IIIb and IIId for System Response Accuracy (p=.002), which is an expected result due to the fact that the system’s behaviour was directed by the wizard to be less accurate. It can again be observed that the ratings for the emotion dialogues (IIIc and IIId) are mostly more divergent than is the case with the regular dialogues. The results are depicted in Figure 5.6.
Fig. 5.6. Usability evaluation over the different setups using SASSISV.
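The pairwise setup comparisons above use the Mann-Whitney U test. A minimal stdlib sketch of the U statistic is given below; deriving the p-value (from U tables or the normal approximation) is omitted, and the example ratings are invented, not the study's data.

```python
# Stdlib-only sketch of the Mann-Whitney U statistic used for the pairwise
# setup comparisons (e.g. IIIc vs. IIId Attractivity ratings).

def mann_whitney_u(a, b):
    """Count pairwise wins of a over b (ties count 0.5) and return the
    smaller of the two possible U values, as used for a two-sided test."""
    u_a = sum(1.0 if x > y else 0.5 if x == y else 0.0
              for x in a for y in b)
    return min(u_a, len(a) * len(b) - u_a)

# Invented 7-point Attractivity ratings for two small groups (NOT the
# study's data); the significance of U would then be looked up in tables.
u = mann_whitney_u([6, 5, 6, 5, 5], [4, 5, 4, 3, 5, 4, 4])
```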
5.2 Evaluating System Performance

Besides the subjective evaluation ratings obtained through questionnaires, instrumental measurements can be used for objective evaluation, i.e. to predict system performance and quality. For this, general interaction parameters, such as task success rates, speech recognition performance, efficiency, naturalness of interaction, or the number of timeouts, are collected from the dialogues at runtime or afterwards from the transcribed dialogues. Standard methods for objective evaluation exist, such as PARADISE (refer to Section 2.2); however, it is difficult to define universally agreed metrics, as many different types of tasks for which the systems can be deployed have to be considered. While dialogue system evaluation in general is already difficult, it poses an even greater challenge for multi-party systems [Traum, 2004]. The metrics have to be adapted accordingly, for instance concerning the response
time of the system: Can the real time be used? How should interruptions by the other dialogue partner be accounted for? In the scope of the evaluation presented here, our aim is not to develop an appropriate evaluation technique; we rather perform a less elaborate objective evaluation considering only the characteristics that are of interest to appraise user acceptance and proactiveness of the system. The main focus lies on the subjective usability evaluation (as presented in the foregoing section), which is able to assess the system as a whole and how it comes across to the users. This proceeding is legitimate due to the fact that a dialogue system’s behaviour when controlled by the wizard can be taken as the gold standard [Paek, 2001] if the simulation is flawless. This of course depends on the system used for the WOZ recordings and the way it is operated by the wizard. If, for instance, a controlled number of speech recognition errors is simulated by deploying a simple ’leave-every-nth-word-out’ error generation strategy (deployed by e.g. [Dudda, 2001]), the system performance in this dimension is not perfect. The gold standard, however, should always target perfect performance [Möller, 2005]. In our setup, we simulate neither speech recognition nor understanding errors. Further, maximal task success is basically guaranteed as the system is directed by a human. A perfectly working system is part of the definition and purpose of the presented work: to create an intelligent dialogue partner, to assess the system quality and to examine the users’ reaction towards such a system. In the following, we present a small analysis of the recorded data, i.e. parts of the PIT corpus. The evaluation that was conducted in order to assess the performance of our discourse-oriented algorithm to prioritise user constraints (as presented in Section 4.5) is described in Section 5.2.2.
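The 'leave-every-nth-word-out' strategy mentioned above is easily sketched; the lines below are our own illustration of the idea, not the implementation of [Dudda, 2001].

```python
# Sketch of the 'leave-every-nth-word-out' error generation strategy:
# simulated recognition errors are produced by deleting every n-th word.

def leave_every_nth_word_out(utterance, n):
    words = utterance.split()
    return " ".join(w for i, w in enumerate(words, start=1) if i % n != 0)

# e.g. with n = 3 every third word is dropped:
# leave_every_nth_word_out("let us go to a Chinese restaurant", 3)
#   -> "let us to a restaurant"
```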
Further objective evaluation is presented in order to assess user acceptance via gaze direction analysis (Section 5.3) and to rate the system’s proactiveness (Section 5.4). 5.2.1 Descriptive Analysis of the PIT Corpus The PIT corpus is introduced in Section 3.3 with a description of the recording setup, the different recording sessions, the number of dialogues and sort of data obtained. The first part of statistical information of the dialogues presented in this section is based on 20 dialogues from Session I and II. The video data of the dialogues have been hand-annotated in order to perform the gaze direction evaluation described in Section 5.3. Of the 20 dialogues, 8 originate from Session I (without avatar), 12 from Session II (with avatar). The analysed subset of the corpus was chosen due to exclusion criteria which rated some of the dialogues not suitable for evaluation5 . 4
5
Except, of course, for the emotion-eliciting dialogues which are not included in the consideration presented here. For instance, in some dialogues user U2 took over the part of the main interaction partner performing most of the interaction with the system which of course does not yield correct gazing data. In some videos, the data was corrupt or incomplete
126
5 Evaluation
The analysed Session I dialogues result in a total of 66 minutes (average dialogue duration: 8:16 minutes), the Session II dialogues in 88 minutes (average: 7:23 minutes) of video data. Figure 5.7 displays statistics of the durations of the entire dialogues. The durations of the dialogues in general vary a lot. The shortest analysed dialogue lasts less than 3 minutes (196 seconds), the longest nearly 16 minutes (943 seconds), both stem from Session I. The number of requests posed to the computer varies accordingly and ranges from 6 to over 60, with a median of 18.5 requests per dialogue.
900s 800s 700s 600s 500s 400s 300s 200s Session I
Session II
Fig. 5.7. Durations of the dialogues of Session I and Session II in seconds. The displayed boxes indicate the interquartile range, i.e. 50% of the samples. The lines inside the boxes show the median, the whiskers indicate the range of all samples.
As described in Section 3.3, the video data is divided into three phases. Figure 5.8 shows the mean values of the percentages of durations of the different dialogue phases of Session I and Session II. It can be observed that the major part of all dialogues takes place in the interaction phase of all three dialogue partners, Phase 2 (I: 51.7% vs. II: 67.6%), with a significant difference between the two sessions (p=.022). Phase 1 denotes the time before the system’s first interaction and is generally very short, independent of the total duration of the dialogue. Phase 3 spans over all times that an object other than the avatar is displayed on the screen. Its values differ to a great extent (I: 31.5% vs. II: 19.4%), however, can partly be explained by the fact that at Session I the displayed menus were often left on the screen for longer than necessary, i.e. longer than it was focus of the conversation. or did not catch the user in the right angle. Some interaction partners thought they were supposed to speak into the room microphone placed on the desk and thus did not attend to the screen but the microphone for interaction.
5.2 Evaluating System Performance
127
80% 70% 60% 50% 40% 30% 20% 10% 0%
I
Phase 1
II
I
II
Phase 2
I
II
Phase 3
Fig. 5.8. Comparison of the durations of the three dialogue phases of the analysed Session I (dark grey bars) and Session II (light grey bars) dialogues. The depicted values state the mean values of the percentages of the durations.
Statistical information for Session III is based on the analysis of audio data of 20 dialogues. 12 of the dialogues are IIIa dialogues, i.e. include the avatar, 8 are IIIb dialogues whithout avatar. The dialogues with emotioneliciting strategy are not considered. The durations of the entire dialogues are very similar with mean values of IIIa: 6:51 minutes (min: 197s, max: 720s) and IIIb: 7:26 minutes (min: 197s, max: 706s). No significant correlations are found between the duration of the dialogues and the evaluation. However, the duration correlates significantly with the number of interactions. The numbers are again similar for both setups with an average value of 13.92 interactions for IIIa (min: 6, max: 28), for IIIb the average value is 12.25 interactions (min: 5, max: 23). In general, the system speaks an average of 13.9% of all utterances. The number of total utterances per dialogue varies from 45 to 153 with an average of 90.5 for IIIa and between 43 and 160 with an average of 103.5 for IIIb. While the range of duration and utterances is the same for both setups, IIIb dialogues are on average slightly longer and contain more utterances than IIIa dialogues. We cautiously suggest this to be an indication that the interaction with avatar proceeds faster than without avatar. An analysis of the video data is currently still pending for Session III. Investigations about the proactive and reactive nature of the dialogues are presented in Section 5.4. 5.2.2 Evaluation of Discourse Motivated Constraint Prioritisation A novel algorithm to prioritise user constraints was presented in Section 4.5. The algorithm is used in the scope of the problem solving functionality of the system. It uses the ongoing discourse as a measure to calculate priorities for the user constraints. In a normal course of a dialogue all constraints are
128
5 Evaluation
considered for each database query regardless of their priority. If the query does not yield any result, i.e. an over-constrained case (OC) occurred, the query is altered until a result is achieved. At this point, the priority comes into play to determine the relaxation candidate, i.e. one of the constraints which is disregarded for the following database query. We perform evaluation on a set of 14 dialogues from Session II and III of the PIT corpus (ref. to Section 3.3). Each of the used dialogues exhibits at least one OC situation. When an OC occurred during the data recordings, the system told the users that no results have been found and that the users should alter their query. The users then modified their request according to their own preferences. Evaluation is performed in two ways. First, we deploy our algorithm to the transcribed dialogues and compare the outcome to the users’ actual proceeding after the OC in the dialogue, i.e. which constraints they relaxed or modified. The results are depicted in Figure 5.9 on the left side. Our prioritisation algorithm performed equally well or better in 13 out of 14 dialogues (93%), i.e. the algorithm lead to relaxing or modifying the same constraints as the users did. By conducting the ontology check, in 5 of the 13 cases (36%) the outcome would have been even better as the system would have suggested a result closer to the original preferences than what was obtained in the dialogue. In the remaining 1 case (7%), our algorithm lead to relaxing a different constraint than the users. 7%
7%
-
57%
=
+
43% 36%
=
+ 50%
Fig. 5.9. Evaluation results of the discourse-oriented constraint prioritisation algorithm, on the left in comparison with the actual dialogue proceeding (users’ choice), on the right side compared to the performance of a semantic algorithm.
An example where our system provides a better solution could be seen in the dialogue in Table 4.6 in Section 4.5. The users are looking for an exclusive Chinese restaurant. Unfortunately, there is no such restaurant in the database. The users change the query to request an exclusive Italian restaurant as an alternative. Our system in this case would have chosen the same constraint (“Chinese”) as relaxation candidate and consulted the ontology
5.3 Gaze Direction Analysis to Assess User Acceptance
129
before actually relaxing. As a result, the system would have suggested an exclusive Japanese or Thai restaurant. It proceeds in the assumption that these restaurants generally serve similar food, i.e. offer dishes of the other cuisine on their menus. This solution is thus considered better, i.e. closer to the users’ original preferences, than what the users actually ended up choosing. The second evaluation is conducted comparing the algorithm to a semantic prioritisation. In Section 4.5, we describe the semantic approach as a common way to perform constraint prioritisation in single-user dialogue systems. Thus, we hand-annotated the constraints in the dialogues considering keywords that denote importance. The annotation was performed by one expert annotator deploying a weighting scheme from ’1’ (little interest) to ’5’ (strongest interest). The same range was applied to dislikes (’-5’ to ’-1’, with ’-5’ meaning strongest dislike). Weights were dynamically adapted during the course of the dialogue, if necessary. Figure 5.9 shows on the right side the results obtained in this evaluation. The semantic algorithm performed equally well as ours in 6 of 14 cases (43%), i.e. the same result was obtained. In 1 case (7%) it performed better than ours in the way that it relaxed the same constraint as the users when ours did not. In all other cases, our algorithm outperformed the semantic algorithm (50%). In general, it can be said that the semantic algorithm repeatedly tried to relax one or more of the users’ main preferences which e.g. becomes apparent in one of the dialogues just after the OC situation when the users tried to rephrase their main preferences which at this point the system already would have had relaxed using the semantic prioritisation. 
The overall result is very affirmative: Our algorithm represents user preferences equally well or better than a similar (and by far more complicated) method using semantic analysis for prioritising user constraints in all but one evaluated cases. To confirm a universally valid good performance of the algorithm, it should be installed and evaluated in other systems using different domains.
5.3 Gaze Direction Analysis to Assess User Acceptance As a further means to assess the acceptance of the envisaged multi-party dialogue system as an independent dialogue partner we analyse the main user’s gaze direction to detect differences in the interaction with the system versus with the other human dialogue partner. We differentiate between addressing and listening behaviour of the main user and further investigate the impact of the avatar. The evaluation is based on the video data of 20 hand-annotated dialogues. The same set of dialogues has been deployed for the analysis on general statistical information about the dialogues presented in Section 5.2.1. The annotation was performed as described in Section 3.3.2: Speaker and addressee, displayed object and the main user’s gaze direction. Each dialogue is divided into three phases. Phase 1 corresponds to the conversation between the two participants before the system interaction. Phase 2 starts when the
130
5 Evaluation
system gets involved in the conversation and switches to Phase 3 as long as an object is displayed on the screen. In the course of the interaction, once leaving Phase 1 the dialogue can alternate between Phases 2 and 3. Table 5.1 lists the general gazing behaviour of the main user throughout the entire dialogue and according to the different phases. It lists the mean values of the percentages of U1’s gaze pointing at either one of the dialogue partners, i.e. the system (S) or the other user (U2), regardless of who is speaking or being addressed. It shows that the values differ only slightly comparing the two different setups. While the values for the dialogues as a whole are similar also for the different dialogue partners, the values differ somewhat more if the particular phases are considered analogously for both Session I and Session II dialogues. In Phase 1, the gaze towards the dialogue partner is predominant, as expected, due to the fact that no interaction with the computer has taken place so far. During the interaction with the system the user’s gaze towards the computer screen increases analogously for both setups. In Phase 3, while a menu (or other object for the Session II dialogues) is displayed on the screen the user looks mostly towards the screen. However, the values go further apart which can tentatively be explained by the fact that during the Session I recordings the displayed objects (menus in this case) tend to have been left on the screen longer than necessary. The conversation might have already left the focus of the object before it was removed. During Phase 2, the phase in which all three dialogue partners interact, the system represented by the avatar is looked at in about the same amount as if it is only represented by voice. This fact does not entitle the claim that the system is treated the same way in both setups. Different factors have to be considered regarding the speaking frequency and amount of the different dialogue partners. 
In the following, we take a look at the different phases in particular considering also current speaker and addressee to get a more complete picture of the gazing behaviour between the different dialogue partners and the role of the avatar.
no avatar (I) U1 looking at
S
Phase 1
10.0%
Phase 2
51.1%
Phase 3
67.4%
Dialogue
44.0%
U2
avatar (II) S
U2
70.4%
9.2%
58.1%
39.1%
50.8%
37.7%
27.5%
77.8%
18.8%
43.9%
45.0%
40.5%
Table 5.1. Means of percentages of U1’s gaze direction towards the dialogue partners S and U2 according to dialogue phases.
5.3 Gaze Direction Analysis to Assess User Acceptance
131
A further question arises when looking at the numbers presented in Table 5.1: Why is the system looked at more than the other human dialogue partner during Phase 2? Regardless of the setup a difference of about 10 percentage points occurs. To investigate if U1 is possibly in general interacting less with U2 than with system we look at the addressing behaviour of U1 towards U2. Table 5.2 depicts how much U1 addresses U2 during the dialogue, regardless of the gaze direction, i.e. the percentage of the total speaking time of user U1 when addressing U2. It can be observed that user U2 is addressed more in the dialogues with avatar. While no difference occurs in Phase 1, in Phase 2 U2 is addressed more in Session II recordings (I: 38.6% vs. II: 52.6%), the same phenomenon occurs in Phase 3 (which might once again stem from the longer displaying time of the object in Session I). With the interaction of the system at the beginning of Phase 2, the amount of addressing U2 decreases to a great extent for both system setups. During Phase 3, the value rises again due to the fact that this phase often consists of U1 explaining U2 what is seen on the screen. The fact that U1 addresses U2 more in Phase 2 with avatar than without avatar but at the same time is looked at less does not have a straightforward explanation. Thus, we take a look at the particular situations of when an interaction partner is speaking as opposed to the entire duration of the phases or dialogues we have been considering up to now.
no avatar (I)
avatar (II)
U1 addressing
U2
U2
Phase 1
95.7%
96.4%
Phase 2
38.6%
52.5%
Phase 3
57.1%
79.3%
Dialogue
62.2%
70.0%
Table 5.2. Percentage of U1 addressing U2.
Gazing Behaviour while speaking. The evaluation presented in Table 5.3 considers the main interaction partner’s gaze direction while speaking and addressing one of the other dialogue partners. For the following evaluations we present the results of Phase 1 and 2 only. While the system is addressed the user generally looks more at the system with avatar when speaking to it (I: 71.5% vs. II: 77.8%). However, the difference is not very large and not statistically significant (p=0.23). The values are in level with the values regarding the human dialogue partner. In the case that U2 is addressed, the system has no impact on this particular analysis, the difference in the two setups is thus very small, as expected.
132
5 Evaluation no avatar (I)
avatar (II)
U1 addressing and looking at
S
U2
S
U2
Phase 1
45.9%
69.0%
97.7%
66.8%
Phase 2
72.3%
72.7%
77.4%
74.9%
Dialogue
71.5%
69.4%
77.8%
72.6%
Table 5.3. Gazing behaviour of U1 during addressing the other dialogue partners according to dialogue phases.
Gazing Behaviour while Listening. Table 5.4 lists U1’s gazing behaviour while listening to the other dialogue partners; Figure 5.10 depicts the same values graphically. The system attracts the user’s gaze significantly more if represented by an avatar (p=0.023) than if the system is represented only by voice. The difference in values while U2 is speaking is not so prominent. However, it can be observed that in general, U2 attracts U1’s gaze a lot more than the system (I: 37.6% vs. II: 55.2%). The aim of our dialogue system is to act as an independent dialogue partner, thus, deploying an avatar is for this aspect a big step into this direction.
no avatar (I) U1 looking at speaker
S
Phase 1
U2
avatar (II) S
84.0%
U2 77.0%
Phase 2
37.6%
83.8%
55.2%
78.8%
Dialogue
37.6%
80.3%
55.2%
77.9%
Table 5.4. Gaze behaviour of U1 while being addressed by the other dialogue partner.
Gazing Behaviour during Screen Display. Finally, we look at U1’s gazing behaviour during Phase 3 when an object is displayed on the screen, regardless of who is currently speaking. U1’s gaze points mainly at the screen, as expected. The significant difference between Session I (mean: 67.4%) and II (mean: 77.8%) dialogues (p=.018) can again be explained by the fact that in Session I the phases of screen display lasted longer, i.e. the menus generally remained longer on the screen than in Session II. The longer an object is displayed the less interesting it becomes and the less attention it attracts. The dialogue might have moved to a different subject before the object is removed.
5.4 Assessing Proactiveness
133
100% 90% 80% 70% 60% 50% 40% 30% 20%
I
S
II
I
U2
II
Fig. 5.10. Listening behaviour of the main user, i.e. mean values of U1 gazing towards the currently speaking other dialogue partner.
A second evaluation of Phase 3 is conducted contemplating the difference between the kinds of objects displayed in Session II. The display of a menu attracted 75.9% of the gaze while the display of a city map caught a mean value of 82.8% of the main dialogue partner’s gaze.
5.4 Assessing Proactiveness The focus of this section is on assessing the system’s proactiveness. For this, we combine subjective and objective measures. We use objective measures extracted from the data and subjective ratings from the questionnaire (both Session III), and consider findings of the gaze direction analysis (Session I and II dialogues). 5.4.1 Addressing Behaviour During First Interaction Request A thorough evaluation of the main user’s gazing behaviour is presented in Section 5.3. Here, we are interested in the gazing behaviour during the main user’s interaction request that leads to the system’s first interaction in the dialogue, i.e. reactive first interaction. Table 5.5 displays the results. Of the analysed Session I dialogues (without avatar), 5 first interactions occur reactively (62.5%), 3 proactively (37.5%). During these five interaction requests to the system the gazing behaviour consists of 45.9% of the gaze pointing towards the system. Of the Session II dialogues (with avatar), 8 have a proactive first system interaction (66.7%), 4 reactive (33.3%). The percentage of gaze pointing at the system while uttering the interaction request for the four
134
5 Evaluation no avatar (I)
avatar (II)
First interaction proactive
3 (37.5%)
8 (66.7%)
First interaction reactive
5 (62.5%)
4 (33.3%)
45.9%
97.7%
U1 looking at S during first interaction request
Table 5.5. Gazing behaviour during first interaction request.
dialogues is 97.7%. There is no plausible explanation for this significant difference due to the fact that the situation before the system’s first interaction was the same for both setups, the avatar was not visible on the screen until the system’s first interaction. Further, the Session I value seems rather low compared to the average value of gaze pointing at S during addressing (I: 71.5% vs. II: 77.8%). Thus, no conclusions can be drawn from these divergent results. 5.4.2 Effect of Avatar on Proactiveness For the next evaluation conducted in order to investigate the system’s proactiveness we analysed 20 Session III dialogues to extract interaction parameters about proactive system behaviour as shown in Table 5.6. The dialogues with emotion-eliciting strategy are not considered due to the fact that the system behaviour was changed in various aspects in terms of the interaction behaviour (refer to Section 2.1). 12 of the analysed dialogues deploy an avatar (IIIa), 8 do not (IIIb). Over the entire dialogues 22.6% of the interactions are proactive (ranging from 0 to a maximum value of 4 proactive interactions per dialogue), the majority of interactions are reactive (77.4%). Most of the proactive interactions occur as the first and last interaction of the system. 80% of the system’s first interactions in the dialogue are proactive. Only in 4 cases (20%) the system is addressed with an interaction request (one time through direct addressing by the system’s name). Of the remaining interactions (excluding the first), on average 1.5 proactive interactions occur per dialogue (16.5% of all interactions). The percentage goes down once again if the system’s last interactions are taken out: A low average number of 1.05 interactions per dialogue occur, ranging from 0 to 3 interactions in both IIIa and IIIb data. This makes out 10.8% for IIIa and 12.1% for IIIb dialogues considering all system interactions that occur in the middle of the interaction (i.e. 
without the first and last interactions). Of the last interactions of the system 50% were proactive. A difference can be observed between the value with avatar (41.7%) and without avatar (62.5%). That means, 58.3% who interacted with the avatar and only 37.5% of the users who did not interact with the avatar addressed the system with a thanking or good-bye phrase upon which the system reacted accordingly with its closing prompt. A tentative explanation could assume to find the reason in the avatar itself. As it lets the system
5.4 Assessing Proactiveness
135
appear more like a real character than with only speech output (also refer to the evaluation presented below in Section 5.4.3), the users are more likely to politely close the interaction. The system is addressed by its name at least once in 60% of the dialogues with an average of 4.1 times (maximum 9 times) if mentioned at all (mean value over all dialogues: 2.45 times). The system’s proactive offering to show an object such as a restaurant’s menu, street map or bus schedule mostly occurred subsequent to presenting the query result or other information (e.g. address). It is thus not considered in the presented numbers, however, it should be noted that the users’ reactions to it were consistently positive.
Number of dialogues
no avatar
avatar
Total
(IIIb)
(IIIa)
8
12
20
Average dialogue duration per dialogue
446s
411s
425s
Average number of utterances per dialogue
103.8
90.5
95.7
Average number of system interactions (percentage of all utterances)
12.3
13.9
13.3
(11.9%)
(15.0%)
(13.8%)
Proactive interactions (i.e. percentage of all system interactions)
2.5
2.25
2.35
(21.8%)
(23.2%)
(22.6%)
6
10
16
(75.0%)
(83.3%)
(80.0%)
5
5
10
(62.5%)
(41.7%)
(50.0%)
Average number of proactive interactions and as percentage of all system interactions (without first interaction)
1.8
1.4
1.5
(16.2%)
(16.8%)
(16.5%)
Average number of proactive interactions and as percentage of all system interactions (without first and last interaction)
1.13
1
1.05
(12.1%)
(10.8%)
(11.6%)
System is addressed by it’s name at least once in dialogue
62.5%
58.3%
60.0%
Average number of times system’s name is mentioned in dialogue (only if mentioned at all)
4.2
4.0
4.1
First interaction proactive in number of dialogues Last interaction proactive in number of dialogues
Table 5.6. Statistical analysis of proactiveness in Session III dialogues.
Using the same subset of Session III dialogues we look at the subjective user ratings. The question if the avatar facilitated the interaction with the system was agreed to by 83.3% of the users who interacted with the avatar.
136
5 Evaluation
Of the users who interacted with the system without avatar 62.5% voted with yes, i.e. they suppose an avatar would have facilitated the interaction, and 25.0% did not suppose an avatar would have made a difference. The users who interacted with the avatar were further asked if the avatar should have a more realistic appearance to which 58.3% agreed. 5.4.3 Subjective Evaluation The interaction behaviour of the system is investigated with the help of a further part of the questionnaire as listed in Appendix C (items 35-50). All Session III dialogues are evaluated (IIIa-d, n=37). Table 5.7 shows the scores of selected items. The values in the table denote mean values on a Likert scale from 1 (worst) to 7 (best). Values in bold face stand for positive values while values in italic denote less positive values as described in the following. The users who interacted with an avatar rated the system more like a human and felt less interrupted (refer to items 40 and 46). The dialogues recorded with emotion-eliciting strategy scored worse ratings for the question if the system utterances were too long (item 48). This can potentially be explained by the fact that users need to interact more in these dialogues trying to repair the system’s mistakes. While waiting for the system to finish speaking in order to repair the request time might pass slower and the utterances appear longer. It can further be observed that IIId was rated worst in terms of the system’s voice, politeness, interruption and intrusiveness (43, 45, 46, 49). At the same time, IIIc scored best in terms of finding the system’s voice pleasant, considering the system as polite and feeling less interrupted (43, 45, 46). The high score for the IIIc dialogues can tentatively be explained in the way that the users enjoyed the less perfect interaction with the system. The same phenomenon was observed before when examining the effect of the emotion-eliciting strategy on the usability in general (refer to Section 5.1.4). 
The following significant correlations are observed from the viewpoint of the presented subjective ratings considering all items of the questionnaire as presented in Appendix C (i.e. including SASSI and AttrakDiff). The more the participants rated the system human-like the more they found the system utterances varied (p=.014), the system worthwhile using (p=.043) and found the first interaction less unexpected (p=.024). Significant correlations are further found with the SASSISV scales System Response Accuracy (p=.012), Likeability (p=.028), Habitability (p=.011) and negative correlation with Annoyance (p=.028). The participants who rated the voice as pleasant found using the system was worthwhile (p=.001). They rated the system as polite (p=.02), not monotonous (p=.015) and did not find the system utterances too long (p=.026). The more the participants found the first interaction unexpected, the less human-like they rated the system (p=.024). Who felt often interrupted by the system found the system has spoken too often (p<.0001) and behaves intrusive (p<.0001). They further rated the system utterances as too long (p=.009). The users who found the conversation lead quickly to the
5.4 Assessing Proactiveness
137
ID
Questionnaire item
IIIa
IIIb
IIIc
IIId
Total
36
The first interaction of the system came unexpected.
4.79
6.09
5.80
5.00
5.57
40
The system reacts like a human.
3.36
2.27
2.80
2.29
2.76
43
The voice of the system is pleasant.
3.57
4.00
4.6
3.00
3.73
45
The system is polite.
5.93
5.82
6.20
5.00
5.76
46
The system frequently interrupted me.
2.43
2.91
2.20
3.14
2.68
48
The system utterances were too long.
3.07
3.18
3.60
3.57
3.27
49
The system is intrusive.
2.21
2.27
2.20
2.86
2.35
Table 5.7. Subjective ratings of system interaction in Session III dialogues. The values denote mean values on a Likert scale from 1 to 7. Bold face denotes positive values, italic font negative.
desired result felt well supported (p=.02) and did not find the conversation too long (p=.006). They found using the system was worthwhile (p=.004), felt well understood (p=.011) and did not rate the system monotonous (p=.005). The participants who rated the system as polite felt well supported (p=.002) and found the voice pleasant (p=.022). The system did not appear intrusive (p=.017). Further, a highly significant correlation occurs with SASSISV’s System Response Accuracy (p=.005). To sum up, the analysis of the main user’s gazing behaviour during the first interaction request reveals a great difference between dialogues with and without avatar (45.9% vs. 97.7%) for which there is no apparent explanation. The numbers also differ from the addressing behaviour throughout the dialogue.6 The next evaluation aimed at analysing the recorded dialogues for proactive interaction behaviour. Thus, it can be observed that proactive interactions occur rather sparsely throughout the dialogues (22.6%). Most of the first interactions of the system are proactive (80%) as well as about half of the last interactions. This leaves the low number of 1.05 (or 11.6%) proactive interaction on average throughout the dialogue not counting the first and last interactions. It also shows that the system is addressed for most of the times, i.e. it is well included in the conversation. In general, the numbers are rather similar for both setups (with and without avatar). They go apart only for the system’s last interaction: 58.3% who interacted with the avatar and only 37.5% of the users who did not interact with the avatar addressed the system with a closing phrase such as thanking or saying good-bye (before the system could do the same, at least). A possible explanation for this could be 6
Only Session I and II data were used for this evaluation, while Session III data was used for the subsequent evaluations.
138
5 Evaluation
that the system with avatar appears more human-like which was proven also in the next evaluation. For this, the subjective user ratings were analysed. Over all, it can be said that the IIId dialogues (without avatar and with emo) scored worst whereas IIIc dialogues (avatar and emo) achieve high ratings, sometimes better than the IIIa dialogues. The same phenomenon occurred already in Section 5.1.4. The result has to be treated with caution due to the small number of dialogues, however, we tentatively explained above, that it seems that the users enjoyed the less perfect interaction with the system which sometimes gave them reason for amusement.
5.5 Summary In this chapter, we presented the evaluation of our independent dialogue partner. We deployed SASSISV (a modified version of SASSI [Hone and Graham, 2000]) and AttrakDiff [Hassenzahl et al., 2003] to perform usabiliy evaluation as presented in Section 5.1. Both methods seem equally well suited for the usability evaluation of a dialogue system as ours. They both appeared to be rating the system in a very similar way, showing highly significant correlations between all different factors of the two questionnaires. First, we evaluated the effect the system enhancement between the three recording sessions has on the usability which shows clear improvement towards the final version of the system in Session III. During Session III different system setups were recorded which were compared in the following. We compared the two setups within a perfect system simulation which showed that deploying an avatar has a positive effect on all aspects of the evaluation. In the following step we considered also the dialogues that were recorded with an emotion-eliciting interaction strategy which provided less flawless interaction (refer to Section 2.1). Surprisingly, the evaluation showed (with both evalution methods in the same way) that interaction with emotion strategy and avatar scored best in all prominent scales denoting the system’s attractiveness (AttrakDiff: attractivity and stimulation, SASSISV: likeability and annoyance). It can further be noted th at the difference in ratings of avatar and non-avatar dialogues in the emotion setup is more prominent than in the regular setup. However, due to the small amount of dialogues available in the emotion setup (7 with and 5 without avatar) all of these findings have to be treated with caution. Further, an analysis of two sets of Session I and II dialogues from the PIT corpus (refer to Section 3.3) was presented in Section 5.2.1. 
In the evaluation of the Session III dialogues, the durations and numbers of utterances of all dialogues (only non-emotional dialogues are considered) lie within the same range with respect to minimum and maximum values. However, the average length of a dialogue, and thus also the number of utterances, is lower when deploying an avatar, which could indicate that an avatar accelerates the dialogue and thus the task-solving process.
The discourse-motivated constraint prioritisation algorithm (refer to Section 4.5) was evaluated in two ways, as presented in Section 5.2.2. Our algorithm was compared to the actual proceeding in the recorded dialogues and to the performance of semantically obtained prioritisation values. In the first comparison, our algorithm performed equally well or better in 92.8% of all cases. It performed better than the semantic algorithm in 50% of the cases, equally well in 42.9% of the cases, and was outperformed in only one case (7.1%). The results are very promising and should be verified by adapting the algorithm to other domains and systems. User acceptance was evaluated in Section 5.3 by analysing the main interaction partner’s (U1) gazing behaviour based on a set of 20 Session I and II dialogues. It was observed that U1 looks more at the system than at user U2 in general, but addresses U2 more, especially in the dialogues with avatar. While U1 is addressing the system, the system with avatar is looked at slightly more than the system without avatar; the amount of time that U1 spends looking at the system with avatar is roughly the same as when U1 is addressing the other human dialogue partner. The listening behaviour of U1, however, differs to a great extent: while U2 is looked at for 80.3% of the time U2 is speaking, the system is looked at for only 37.6% of its speaking time in Session I dialogues. Deploying an avatar brought a great improvement: 55.2% (vs. U2: 77.9%). Finally, a look was taken at the system’s interaction behaviour in terms of proactiveness (Section 5.4). Objective measures were used to obtain information about the system’s actual interaction behaviour. The system’s interactions occur proactively in 22.6% of the cases, mainly in the form of the system’s first interaction in the dialogue (80% of first interactions are proactive) and of the last interactions concluding the task-solving process (50% proactive).
The analysis of user ratings shows how the system’s interaction is perceived by the users. Participants who interacted with the avatar rated the system as more human-like. They further felt less interrupted and rated the system as more polite. Overall, we obtained very positive results throughout the entire evaluation, denoting good user acceptance of the system. The usability ratings, which still achieved only moderate results for Session I dialogues, improved significantly, yielding satisfying usability scores for Session III. The attention of the main user, measured in terms of gazing behaviour during the interaction, shows equivalent values while addressing either dialogue partner (S or U2). During listening periods, attention was very low in Session I dialogues but considerably improved by deploying an avatar. The avatar can thus be said to contribute to the positive ratings of the interaction partner. Throughout all evaluations, the system deploying an avatar scored at least as well as, and mostly better than, the system without avatar.
6 Conclusions and Future Directions
6.1 Summary
The present book features a novel sort of spoken language dialogue system that acts as an independent dialogue partner in the interaction with two users. As the field of application of dialogue systems has started to rapidly change and expand, established dialogue systems are exposed to new challenges, some of which have been taken up in the scope of this work. The book started with a short general introduction to spoken language dialogue systems, followed by a description of current trends and related work in dialogue systems research. Chapter 1 concluded with a description of the said dialogue system. Chapter 2 provided the basis for the topics covered in the subsequent chapters: fundamentals on the relevant subjects of corpus development, (multi-party) human-human and human-computer interaction, and evaluation were presented. A detailed description of dialogue management focused on the information state update approach, further taking a look at recent efforts in multi-party dialogue management. Chapter 3 listed a selection of up-to-date multi-party dialogue corpora. As none of those corpora exhibits the characteristics we aim to investigate in order to obtain interaction models suitable for the development of our dialogue management, the PIT corpus of multi-party dialogues was recorded in an extensive Wizard-of-Oz environment, as presented in the remainder of the chapter. Chapter 4 presented the multi-party dialogue management that enables proactive interaction in the conversation. Finally, the novel sort of dialogue system was evaluated in terms of usability and user acceptance, as presented in Chapter 5. The contributions of this work are described in detail in the following.
Proactiveness. The presented dialogue system has been endowed with proactive behaviour, i.e.
it independently detects the point in the ongoing conversation where it has to start paying close attention in order to grasp the full content relevant for the system’s task-solving process. While conventional dialogue systems start from scratch with the system’s first interaction, our system has at that point already acquired all the knowledge possible and can thus immediately provide relevant query results or information to advance the task-solving process. This approach is enabled by building an extensive dialogue history starting with the beginning of the users’ domain-relevant conversation instead of with the system’s first interaction. The dialogue history contains the task- and dialogue-related information of all information states up to the current point in the dialogue. The system can further interact proactively in the ongoing dialogue at appropriate interaction points, which were revealed by an analysis of the recorded dialogues. The interaction points depict moments in the dialogue where the latest discussed issue has just been concluded with the reply of the formerly addressed user (to the other user) and which thus denote an ideal time to bring up a new issue or for a proactive interaction of the system. Our approach assumes perfect understanding, which in real interaction might not always be given: the sometimes limited understanding capabilities of dialogue systems in terms of speech recognition and also language understanding are not considered. For deploying the approach in a real-time system, however, the actual situation has to be taken into account. Further, in real conversations the dialogue might be less restricted and involve many off-topic discourses and discussions, which could make proactive interaction throughout the dialogue more difficult and might render contributions, as well as interaction points that the system assumed appropriate, unsuitable and out of context. Another challenge is the timing of the dialogue. Interaction between users takes place with often only milliseconds of pauses between turns. The human ability of turn-taking and anticipating the end of the counterpart’s contribution is difficult or impossible to copy.
Since, to the authors’ knowledge, there is currently no recipe for handling this problem, proactive interaction will presumably continue to occasionally interrupt the users’ contributions. However, this also happens to humans. The evaluation of the recorded dialogues showed that 22.6% of all system interactions in the analysed dialogues occurred proactively. Of the first interactions of the system, 80% were proactive, as were about 50% of the last, concluding system interactions. During the dialogue (i.e. without considering the first and last interactions), only one proactive interaction occurs on average. This means the system is well integrated in the interaction, as it is mostly turned to, resulting in a reactive interaction (88.4%). The only significant difference that can be observed between the different setups (avatar vs. no avatar) is in the last interactions. While the users who did not interact with an avatar turned to the system in 37.5% of the cases to induce a reactive final system interaction, 58.3% of the users who interacted with the avatar addressed the system with a thanking or closing statement. This could tentatively signify that the system with avatar is treated more like a real character by the users, as was also revealed by analysing the subjective ratings: users who interacted with the avatar rated the system as more ‘human-like’ than users who interacted with the system with voice output only.
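As an illustration, the interaction-point heuristic described above — the latest issue counts as concluded once the formerly addressed user replies to the other user — can be sketched as follows. The class and method names are our own illustrative choices, not taken from the actual implementation.

```java
// Illustrative sketch (not the actual implementation): detect candidate
// interaction points where the formerly addressed user has just replied,
// concluding the latest issue and opening a slot for a proactive system turn.
public class InteractionPointDetector {
    private String lastAddressee = null;
    private boolean issueOpen = false;

    // Observe one user turn: 'speaker' addresses 'addressee'.
    // Returns true if this turn concludes the current issue, i.e. the
    // previously addressed user replies -- a candidate moment for a
    // proactive system contribution.
    public boolean observeTurn(String speaker, String addressee) {
        boolean concludes = issueOpen && speaker.equals(lastAddressee);
        issueOpen = !concludes;   // a non-concluding turn opens a new issue
        lastAddressee = addressee;
        return concludes;
    }
}
```

In this simplified view, a turn by U1 towards U2 opens an issue, and U2's reply to U1 concludes it; a real detector would additionally need prosodic and timing cues to anticipate turn ends.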
Multi-party dialogue management. The main focus of the development of our dialogue system was put on the multi-party dialogue management. We presented a dialogue management approach based on the prevalent Information State Update approach and existing multi-party extensions to it, which, however, did not suffice for the requirements of our system. We thus introduced alterations to the approach to make it suit our interaction setup and enable efficient dialogue management: A new interaction principle was introduced to allow for proactive interaction by side-participants. The information state was adapted to include more task-relevant components and a dialogue history. Further, more flexible data structures were adopted to enable, for instance, long-distance answers, which are especially necessary for multi-party dialogues: dialogue participants can refer to something that was uttered several turns ago, as opposed to being able to reply only to what was said in the previous utterance. We assume perfect understanding, which enabled us to deploy an optimistic grounding and integration strategy. The interaction and task-solving process are accelerated this way, as the system immediately integrates constraints as they are uttered, without waiting for the other user’s reaction in terms of grounding signals or acceptance. An interaction strategy was developed that identifies appropriate interaction points to endow the system with proactive behaviour. The interaction setup of our system contains by definition only single addressees and did not allow the second user to address the system directly. In real situations, however, it can hardly be avoided that the second interaction partner addresses the system. While in the recordings the second user mostly adhered to the convention not to address the system directly, reactions towards the system’s prompts nevertheless occurred frequently.
This shows that the second user feels equally addressed, or at least affected, by the system’s contributions. As a next step for the system, a true three-party interaction with equal rights for all participants should be considered and implemented accordingly in the dialogue management. The consideration of understanding and interpretation problems could further be addressed in the future, for instance by deploying a two-step integration strategy that handles understanding problems by restoring the former state. The dialogue management was designed to enable proactive interaction of the system and, on the whole, to follow the behaviour of the wizard during the recordings in terms of system prompts and functionality. An evaluation of the dialogue management can be performed by rating interaction parameters. However, as the system does not currently function online, evaluation could only be performed using the simulated envisaged system during the WOZ recordings. While at this point our main focus lies on the user acceptance of the novel sort of dialogue system, a thorough objective evaluation of the dialogue management is still pending.
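Such a two-step integration strategy could, for instance, keep a copy of the former information state so that it can be restored when a misunderstanding surfaces. A minimal sketch, with illustrative names of our own (the actual information state holds far more than the task constraints shown here):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of optimistic integration with a restorable former state.
public class OptimisticIntegrator {
    private Map<String, String> taskState = new HashMap<>();
    private Map<String, String> formerState = new HashMap<>();

    // Step 1: integrate a constraint immediately, without waiting for
    // grounding signals or acceptance from the other user.
    public void integrate(String category, String value) {
        formerState = new HashMap<>(taskState); // remember the former state
        taskState.put(category, value);
    }

    // Step 2 (only if needed): an understanding problem was detected,
    // so the former state is restored.
    public void restoreFormerState() {
        taskState = new HashMap<>(formerState);
    }

    public String get(String category) {
        return taskState.get(category);
    }
}
```

The optimistic step keeps the task-solving process fast in the common case, while the rollback step covers the rarer case of a detected misunderstanding.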
Discourse-oriented user constraint prioritisation. A novel algorithm to prioritise user constraints according to the ongoing conversation has been introduced, to be deployed within the problem-solving functionality of the system. The algorithm comes into play at the point in the dialogue when the database query does not yield any result with the current constraint set, i.e. an over-constrained situation has occurred. The aim of the system is to present an alternative result set instead of informing the user that no results were found and that the constraint set should be altered. Based on empirical analysis, it was observed that the longer a constraint has been spoken about, the more important it is considered by the users; more important constraints are usually mentioned early in the dialogue. The algorithm observes the dialogue, noting where and in which order the constraints are mentioned. It prioritises accordingly and proposes the least important constraint as a relaxation candidate when an over-constrained situation occurs. Then, the ontology is first consulted to consider related constraints for the query before the relaxation candidate is relaxed. The evaluation of the algorithm yields very promising results in our setup. We compared the performance and priority of the constraints to the actual proceeding in the dialogue, i.e. which constraints the users actually relaxed after they were told that no results were found. In 57% of the cases our algorithm suggested the same relaxation candidate as the users chose; in 36% of the cases our algorithm would have yielded a better result, i.e. closer to the original request than what the users chose in the end, possibly because they were not aware of the alternative. Our algorithm further outperformed a conventional semantic algorithm that it was compared to. Evaluation was performed within our example domain of restaurant selection.
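The core of the prioritisation scheme can be sketched as follows. The names are our own illustrative choices, and the sketch reduces the discourse evidence to the order of first mention; the actual algorithm also takes into account how long each constraint is discussed, and consults the ontology before relaxing.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch: prioritise constraints by when they were first
// mentioned (earlier = more important) and propose the least important
// one as relaxation candidate in an over-constrained situation.
public class ConstraintPrioritiser {
    private final Map<String, Integer> firstMention = new LinkedHashMap<>();

    // Record the turn in which a constraint category is first mentioned;
    // repeated mentions do not change the recorded first mention.
    public void observe(String category, int turnIndex) {
        firstMention.putIfAbsent(category, turnIndex);
    }

    // The relaxation candidate is the constraint mentioned latest.
    public String relaxationCandidate() {
        String candidate = null;
        int latest = Integer.MIN_VALUE;
        for (Map.Entry<String, Integer> e : firstMention.entrySet()) {
            if (e.getValue() > latest) {
                latest = e.getValue();
                candidate = e.getKey();
            }
        }
        return candidate;
    }
}
```

For example, if the users discuss the location first, then the cuisine, and finally the price range, the price constraint would be proposed for relaxation when the query becomes over-constrained.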
A next step would be to deploy the algorithm in different domains in order to see whether the results are similarly good. Additionally, further aspects could be considered for the calculation of the relaxation candidate, such as the frequency of changes that occur within a single constraint category. For instance, in the restaurant domain, if the cuisine was altered by the users a few times, it could be rated less important than the location, which was defined once and not discussed any further.
Multi-party dialogue corpus. The development of the PIT corpus constitutes another prominent part of this book. We set up an extensive Wizard-of-Oz environment in which we recorded 76 multi-party dialogues over three recording sessions. The wizard system was improved between the sessions in terms of usability and functionality. About half of the dialogues were recorded using an avatar as a personification of the system; for the other half of the recordings, the system was represented by acoustic output only. Objects (such as a restaurant’s menu) were displayed in all recordings, where applicable. Part of the recordings was conducted using an emotion-inducing interaction policy for the wizard, simulating different sorts of errors and misbehaviours. This way, a wider range of user emotions
was achieved than would occur in the normal course of the dialogue when the wizard simulates a perfectly working system. The audio and video data of the recordings were transcribed and analysed, extracting typical interaction characteristics which assist in the development of the dialogue management component. Prior and subsequent to the recordings, the participants filled out questionnaires which form the basis for the evaluation of the system in terms of usability and acceptability. The corpus was collected because no corpus was available that stresses the designated features of multi-party interaction with a dialogue system acting as an independent dialogue partner. It is thus expected to be of great value for the research community on multi-party human-computer interaction.
Evaluation. Finally, the evaluation of the dialogue system forms another focus of this book. The users interact with a new sort of dialogue system that acts as an independent dialogue partner; we are thus interested in how the system would be accepted and rated by the users. Usability evaluation was performed on the basis of the questionnaires the participants filled out prior and subsequent to the recordings. First, the three recording sessions were compared, showing a great improvement towards the third session, where the final version of the wizard system was deployed. Second, the usability was assessed for the third recording session, which overall yielded very positive results. The different recording setups were compared, with the setup with avatar rated higher than the setup without avatar. A further, somewhat surprising finding is that the setup with avatar and emotion-eliciting interaction strategy (which resulted in minor mistakes by the system) scored best among all evaluated setups regarding likeability factors. However, only a small number of recordings was available in this setup (n = 7); these findings thus have to be treated with caution.
User acceptance was further measured by a more objective form of evaluation, analysing the gazing behaviour of the main interaction partner during the dialogue. The gazing behaviour towards the system was compared to the behaviour when interacting with the other user in the same situation. The evaluation shows equivalent values for the addressing behaviour towards both dialogue partners, i.e. U1 was looking at whichever dialogue partner was being addressed. During listening periods, the attention of U1 towards the system was very low (i.e. little gaze towards S while S was speaking) in Session I dialogues, where no avatar was deployed (38%); however, considerably more attention was paid when an avatar was deployed in Session II (55%). For comparison, U2 was attended to 78% of the time. The avatar can be said to contribute to a greater acceptance of the dialogue system. This was also observed in the final evaluation addressing the system’s proactiveness. As already briefly mentioned above, 22.6% of the system interactions throughout the entire dialogue were proactive. As soon as the system gets involved in the interaction, it is addressed frequently and the demand for proactive interaction decreases. Thus, as expected, a high proportion of first interactions is proactive (80%). Not considering
the system’s first and last interactions, 1.05 (11.6%) proactive interactions occur throughout the dialogue. This result is taken as a sign of good acceptance of the system, as it is integrated in the dialogue and turned to frequently. A positive effect of the avatar was also affirmed in the subjective ratings of the users regarding proactiveness. The values are in general similar but differ regarding the system’s last interactions: when deploying an avatar, only 41.7% of these interactions were proactive (versus 62.5% without avatar). We take this as a further sign that the users treated the system with avatar differently, and as more human-like, as could be observed in the subjective evaluation regarding proactiveness. Overall, the results of the evaluation are very positive and promising.
6.2 Future Directions
Future work can be performed in various directions, from integrating missing system modules and extending the system’s functionality to online gaze detection or possibly allowing direct interaction by both dialogue partners. We elaborate on these and further points in the following.
Extending the System Functionality
Adding new system features. The system could be extended with a wider variety of features, some of which were suggested by the participants of the recordings in the concluding feedback section of the questionnaire. The functionalities requested most were the ability to display pictures of the restaurants as well as an option to make a reservation at the selected restaurant through the system. The database could be extended to include ratings or comments about the restaurants from former guests, as well as information about handicapped accessibility and parking facilities. A route planner for both pedestrians and cars could assist the users in finding their way to the restaurant. The system could further be made aware of the content of the displayed objects, in order to answer questions regarding a currently displayed menu or to consider this information in the query.
Integration of ASR and NLU. Automatic speech recognition and natural language understanding modules need to be integrated into the system. While automatic speech recognition has made great improvements over the last years, the requirements of our system are still difficult to fulfil. Besides the fact that our system needs to understand colloquial natural language, it is further challenged with speech which is not primarily designed for the computer to understand (as is the case in single-user systems) but for another human. ASR thus remains a great challenge, and further efforts are necessary. Work has been performed on the development of a statistical natural language parser
for the system [Strauß and Jahn, 2007]. The parser deploys frame semantics (e.g. [Baker et al., 1998]) within a framework of existing, freely available tools: the Stanford Parser [Klein and Manning, 2003] for syntactic parsing, SALTO [Burchardt et al., 2006] for semantic annotation, and the Shalmaneser toolchain [Erk and Pado, 2006] for frame and role assignment. The parser deploys a tagset of 27 frames, 69 predicates and 453 instances. It was developed using a set of 50 dialogues, adding up to 241 sentences for training (with 10% left out for testing). Evaluation yielded 78.16% correct frame assignment (9.2% wrong and 12.64% no assignment) and 74.39% correct role assignment (6.93% wrong and 18.68% no assignment). Improvements of the results are expected if the development is resumed with a larger set of dialogue data.
System as a personal assistant. The fact that the system relates to one of the dialogue partners (the main interaction partner) in a special way constitutes a suitable situation for deploying the system as a personal assistant to this user. The system could be endowed with a long-term memory and user modelling. It would then remember conversations over several interactions in terms of the user’s preferences and chosen objects, enabling it to provide personalised information and recommendations. However, it would have to be considered that the user might not want the other dialogue partner to know private information, such as frequented restaurants. A long-term memory could further be deployed for the system to build up a general user model which could be used for recommendations of popular objects.
Automatic gaze direction tracking. Including an automatic gaze direction tracking mechanism would enrich the system immensely in terms of becoming a fully-fledged multi-modal system.
The principle of symmetry that prevails in human-computer interaction, which asks for alike input and output modalities in multi-modal systems, is not adhered to in the current state: the system deploys visual output and is thus expected to be endowed also with the functionality of visual input. The evaluation in Section 5.3 showed that the main user turned towards the system for the first interaction request in 97.7% of the cases (based on the analysis of Session II dialogues presented in Section 5.4). Throughout the ongoing conversation, the percentage of time the user looks at the system is similar to the values regarding the other human dialogue partner (77.4% vs. 74.8%). Both are in a moderate range, which derives from the fact that in natural conversation a speaker does not look at the interaction partner the whole time, but rather does so, e.g., to give visual feedback or to determine the next speaker [Carletta et al., 2002]. The latter seems to explain the high degree of visual attention towards the system during the first interaction request, as a way of telling the system that it is expected to respond. The system would gain in robustness by being able to
automatically recognise this first (and of course also any other) interaction request, supporting the dialogue facility (i.e. recognising the request directed at the system from the utterance alone).
Dealing with uncertainty and misunderstanding. As perfect understanding is difficult to achieve in natural conversation, means have to be included in a future version of the system that allow the handling of misunderstanding and uncertainty. Different levels of understanding problems have to be distinguished, as they can occur at every level of processing, from speech recognition to semantic parsing and pragmatic interpretation, or as a database problem.
Further Evaluation
Evaluation of gazing behaviour of Session III data. The Session III video data have not yet been analysed in terms of the main user’s gaze direction. The annotation is to be performed automatically (as opposed to by hand, as for the Session I and II data) using the following procedure. Each video frame is annotated with a class label to differentiate between frames where the person (U1) attends to the system and frames where the person attends to the human dialogue partner (U2). The problem is closely linked to the field of automatic image annotation [Cusano et al., 2004, Jeon and Manmatha, 2004], where a system automatically assigns metadata in the form of keywords to an image. An AdaBoost classifier [Viola and Jones, 2004] is trained with a small subset of manually annotated frames in order to automatically extract pose information (direct gaze or averted gaze) from the videos; AdaBoost is chosen because it is known to be very fast and efficient at detecting faces in images. Two classifiers are trained: one that finds frontal faces against the background and one that finds averted faces against the background. Both classifiers are then applied to each of the remaining, previously unlabelled video frames to determine their pose label. This information enables a statistical evaluation of the amount of time user U1 spent focusing on the system as opposed to the other user.
Further evaluation of the prioritisation algorithm. The general aptitude of our discourse-motivated user constraint prioritisation algorithm requires further evaluation in the scope of different domains and systems.
Scaling Up the Number of Users
Additional dialogue partners can be introduced into the conversation in different ways. First, the limitation of deploying one user as the main user could be dropped; the system would then be interacting with two ‘equal’ users.
In this case, the system would have to react to both users’ interaction requests in the
same way. Automatic gaze direction tracking could not be deployed as easily as in the current setup: many combinations of the gaze directions of both users would have to be considered, involving a number of potentially ambiguous situations. A second extension is possible, allowing a larger number of users while continuing the convention of having one main interaction partner. The dialogue management would have to be adapted in both cases; however, by deploying our optimistic interaction and grounding strategy, both kinds of extensions should be feasible.
A Wizard Interaction Tool
The implementation and customisation of the Wizard Interaction Tool (WIT) is described in the following. WIT is implemented in Java and uses a client-server architecture [Scherer and Strauß, 2008]. It is freely available for download and customisation to individual needs, domains and also languages, as described below.
System Setup and Implementation
Figure A.1 shows a general overview of the system’s client-server architecture. It consists of three parts, all of which are implemented in Java: the Wizard Client, the Wizard Server and the Dialogue Manager. The wizard provides input to the client by typing and sending commands (button clicks), which are transmitted to the server. The server forwards the information to the dialogue manager, which, after performing the database queries etc., generates the system output. The output is sent back to the server, where it is presented to the users and communicated back to the client and wizard, respectively. The modules are described in more detail in the following. The Wizard Client is the frontend that runs on the wizard’s computer. It is composed of one central class named WizardClientForm, which constitutes the GUI the wizard interacts with. It is a pure interface and has no functionality other than the communication with the Wizard Server, which is realised by transmitting simple Java String objects over the network. The content of the auto-complete field, i.e. the permitted values to be entered, is loaded upon initialisation. The keywords and commands entered by the wizard, as well as the possibly modified system utterances, are transmitted to the server, where they are processed. In response, the generated system utterances and the results of the database queries are received from the server and displayed in the corresponding fields of the GUI.
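The string-based exchange between client and server can be illustrated with plain Java sockets. The port handling, class name and message contents below are purely illustrative, not those of the actual WIT.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintWriter;
import java.net.ServerSocket;
import java.net.Socket;

// Illustrative sketch of the client-server message passing: the client
// sends a plain text command, the server answers with a generated prompt.
public class StringLink {
    public static String roundTrip(String command) throws Exception {
        try (ServerSocket server = new ServerSocket(0)) { // any free port
            Thread serverThread = new Thread(() -> {
                try (Socket s = server.accept();
                     BufferedReader in = new BufferedReader(
                             new InputStreamReader(s.getInputStream()));
                     PrintWriter out = new PrintWriter(s.getOutputStream(), true)) {
                    String received = in.readLine();        // wizard command
                    out.println("PROMPT for " + received);  // placeholder reply
                } catch (IOException e) {
                    e.printStackTrace();
                }
            });
            serverThread.start();
            try (Socket client = new Socket("localhost", server.getLocalPort());
                 PrintWriter out = new PrintWriter(client.getOutputStream(), true);
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream()))) {
                out.println(command);          // e.g. a keyword typed by the wizard
                String reply = in.readLine();  // generated system utterance
                serverThread.join();
                return reply;
            }
        }
    }
}
```

Exchanging plain strings keeps the client thin: all domain logic stays in the server-side dialogue manager, and the client only renders what it receives.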
[Figure A.1 shows the workplace of the wizard (Wizard Client) and the user interface (Wizard Server with Dialogue Manager): input in the form of commands, text and run_script calls flows from the client to the server; output in the form of avatar scripts and speech is produced using the scripts, database and further resources.]
Fig. A.1. Schematic description of the WIT software architecture.
The Wizard Server runs on the computer the users interact with (S). The system produces output in the form of synthesised speech as well as graphical output. The class WizardWebServer adopts the task of communicating commands and query results between the WizardClientForm and the DialogManager. The Dialogue Manager is the heart of the dialogue system. The class DialogManager induces the other modules to perform actions and communicates back to the WizardWebServer. The coarse class structure of the dialogue system is shown in Figure A.2. The value entered by the wizard arrives at the DialogManager and is analysed according to its nature. Keywords, i.e. user constraints, induce a database query and the generation of a system utterance in the following way: the value is added to the TaskModel, which defines the search criteria for the database queries. Any update to the TaskModel induces a database query, which is performed by the Query class accessing the database through the XMLParser. The result set is analysed by the DialogManager for quantity and returned to the server to be displayed by the client. Finally, the PromptMaker is assigned the generation of an appropriate system utterance, for which it communicates with Inlingua, the module responsible for language-specific properties such as providing different cases, gender-specific endings, singular or plural endings, etc. The necessary operations are listed in Inlingua; the specific words or endings are listed in property files that are loaded on startup. The generated system utterance is then transmitted to the client to give the wizard a chance to check the utterance before prompting. If the returned result set consists of a single restaurant, or if the wizard selects a particular restaurant, the DialogManager retrieves the data about this
Fig. A.2. Class diagram of the dialogue manager of the WIT system (classes: DialogManager, TaskModel, Query, SelectedRestaurant, XMLParser, PromptMaker, Inlingua).
restaurant from the XMLParser and stores it in SelectedRestaurant. This enables fast access to any information available about the restaurant and allows further actions such as displaying the menu, showing the location on a map, or providing bus connections.

The input to the DialogManager may also consist of commands that trigger the execution of scripts, clear all fields to start a new dialogue, or initiate prompting. These commands can be entered either textually or via clicks on the respective buttons in the GUI. The prompting command causes the DialogManager to retrieve the text to be prompted from the text field of the GUI and pass it on to third-party software (described below) for text-to-speech conversion. The text is first converted into phonemes, which are used both for speech synthesis and by the avatar for lip movements synchronised to the system's speech output. Subsequently, the actual playback is started.

Several knowledge sources, such as the restaurant database and the aforementioned property files, are available to the server. XML is used to store the restaurant database and grammar (described in Section 3.2.4) as well as the property files containing rules, vocabulary, and template texts used for the generation of the system utterances. Further resources are the avatar images and all files to be displayed on the screen (the restaurant menus).

Third-Party Software Requirements

The presented software runs on any standard computer; no special hardware is required. It is completely implemented in Java, providing flexibility in the choice of platforms. However, it has to be noted that the
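The keyword-handling loop described above can be sketched in a few lines of Java. This is a minimal illustration, not the original implementation: the class names TaskModel, Query, and SelectedRestaurant follow Figure A.2, but the signatures, the map-based restaurant database, and the prompt strings are invented for the example.

```java
import java.util.*;
import java.util.stream.Collectors;

/** Simplified sketch of the WIT dialogue manager's keyword handling. */
class DialogManagerSketch {

    /** TaskModel: search criteria accumulated from the wizard's input. */
    private final Map<String, String> taskModel = new LinkedHashMap<>();

    /** Stand-in for Query/XMLParser: each restaurant is an attribute map. */
    private final List<Map<String, String>> restaurants;

    DialogManagerSketch(List<Map<String, String>> restaurants) {
        this.restaurants = restaurants;
    }

    /** A keyword (user constraint) updates the task model, which in turn
     *  triggers a database query and the generation of a system prompt. */
    String handleKeyword(String category, String value) {
        taskModel.put(category, value);              // update the TaskModel
        List<Map<String, String>> hits = query();    // every update queries the DB
        if (hits.size() == 1) {                      // unique hit -> SelectedRestaurant
            return "Match: " + hits.get(0).get("name");
        }
        return hits.size() + " hits";                // ask the users to narrow the search
    }

    /** A restaurant matches if it satisfies every accumulated constraint. */
    private List<Map<String, String>> query() {
        return restaurants.stream()
                .filter(r -> taskModel.entrySet().stream()
                        .allMatch(c -> c.getValue().equals(r.get(c.getKey()))))
                .collect(Collectors.toList());
    }
}
```

Mirroring the example dialogue of Appendix B, a first constraint such as cuisine=italian yields several hits, and adding drinks=cocktails narrows the result set down to a single restaurant, which would then be stored in SelectedRestaurant.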
freely available text-to-speech software we use, txt2pho1, only runs on Linux. We further deploy the MBROLA2 speech synthesiser, which is quite flexible with regard to the language used: it supports German, English, and many other languages. The transmission of the audio and video signals from the recording studio to the wizard's room is realised using wireless microphones and a webcam. To stream the video signal, the broadcasting capabilities of the freely available multi-platform VLC3 video player are used; the stream is received by a second installation of VLC on the client's computer. A web browser is needed to run the scripts. Additionally, an SQL database could be used, which would require an SQL installation on the server (e.g. the freely available MySQL4). However, as our domain and database are not very voluminous, we use an XML database.

Distribution and Customisation of the Software

The presented software is freely available for download5. It assists developers in rapid prototyping for WOZ recordings. The software can easily be adapted to different domains and languages, as described in the following. The distribution package consists of a template architecture containing the software for the client, the server (which comprises all the data and rules), and the possible output mechanisms, such as the avatar, text-to-speech, and runnable scripts. All files used for our setup as described above are included as an example application. To customise the system for a different application, adaptations need to be made in the following parts of the system. First, it is necessary to set up an SQL database or XML file to define categories, data items, and templates for system prompts. By providing different property files, the system can be changed to a different language. The speech synthesis engine we use, MBROLA, is quite flexible with languages6.
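The interface between the text-to-phoneme stage (txt2pho) and the MBROLA synthesiser is MBROLA's plain-text .pho format: one phoneme per line, giving the SAMPA symbol, the duration in milliseconds, and optional pitch targets as pairs of position (percent of the phoneme) and F0 (Hz); comment lines start with a semicolon. The small builder below only illustrates this interchange format; the class name and API are invented for the example.

```java
/** Minimal builder for MBROLA .pho input. Each line holds a SAMPA
 *  phoneme symbol, its duration in ms, and optional pitch targets
 *  as alternating (position-in-%, F0-in-Hz) values. */
class PhoBuilder {
    private final StringBuilder sb = new StringBuilder("; generated .pho file\n");

    /** Append one phoneme line, e.g. phoneme("h", 80, 0, 110, 100, 120). */
    PhoBuilder phoneme(String sampa, int durationMs, int... pitchTargets) {
        sb.append(sampa).append(' ').append(durationMs);
        for (int p : pitchTargets) sb.append(' ').append(p);
        sb.append('\n');
        return this;
    }

    String build() { return sb.toString(); }
}
```

Such a file is what txt2pho produces from German text and what MBROLA consumes together with a voice database (e.g. de1) to render the waveform; for the exact command-line invocation, refer to the documentation of the two tools.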
A different text-to-phoneme software has to be used for any language other than German; suitable tools are also listed on the MBROLA website. Further, images for the avatar can be designed and integrated. Our sample avatar consists of three types of transparent images: head, eyes, and different mouth positions. Additionally, scripts can be defined similar to the ones distributed, e.g. for running a web browser. The scripts have to be assigned to specific triggers, which need to be implemented for the respective needs in order to induce their execution. For this, a template class that can be adapted to the individual needs of the developers (to run customised scripts, start programs, etc.) is supplied as part of the open source implementation.
1 http://www.ikp.uni-bonn.de/dt/forsch/phonetik/hadifix/HADIFIXforMBROLA
2 http://tcts.fpms.ac.be/synthesis/mbrola
3 http://www.videolan.org/vlc
4 http://www.mysql.com
5 http://www.uni-ulm.de/in/pit
6 For all available languages refer to: http://tcts.fpms.ac.be/synthesis/mbrola
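The property-file based customisation described above can be illustrated with java.util.Properties: templates and language-specific word forms live in property files, so switching the language or domain only means loading a different file. This is a sketch under assumptions — the keys, the {n}/{cuisine} placeholder syntax, and the class name are invented for illustration and are not the original Inlingua/PromptMaker code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

/** Sketch of property-file driven prompt generation: template texts and
 *  language-specific word forms are read from a property file. */
class PromptTemplates {
    private final Properties props = new Properties();

    /** Load template and vocabulary definitions (property-file syntax). */
    void load(String propertyFileContent) {
        try {
            props.load(new StringReader(propertyFileContent));
        } catch (IOException e) {
            throw new RuntimeException(e); // cannot happen for an in-memory reader
        }
    }

    /** Fill the {n} and {cuisine} slots of a named template, looking up the
     *  language-specific form of the cuisine value in the same file. */
    String fill(String templateKey, String n, String cuisine) {
        return props.getProperty(templateKey)
                .replace("{n}", n)
                .replace("{cuisine}", props.getProperty("cuisine." + cuisine, cuisine));
    }
}
```

A German property file would provide the same keys with German templates and inflected word forms, leaving the dialogue manager code untouched.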
B Example Dialogue
The example dialogue presented in Section 3.3.3 is listed in the following in the original German version. The dialogue was recorded in the scope of Session III with the ID III 09 061207.

1. U1: Oh, ist aber schon spät.
2. U2: Ja.
3. U1: Ja, sollen wir mal was essen?
4. U2: Also ich hab vorher erst was gegessen, ich hab keinen Hunger.
5. U1: Du hast keinen Hunger? Ja, was würdest 'n du.. Also, wo würdest denn hingehen?
6. U2: Also ich hätt Lust auf 'n Cocktail.
7. U1: Ja doch, des ist 'ne gute Idee, aber davor noch 'ne Kleinigkeit essen.
8. U2: Ja, wenn's gute Cocktails gibt...
9. U1: Ja, das stimmt, das ist 'ne gute Idee. Aber, was könnten wir noch essen.. Ähm.. Ich hab Lust auf Mexikanisch. Was meinst du?
10. U2: Hmm, obwohl, dann ess ich vielleicht doch 'ne Kleinigkeit. Aber ich hätt mehr Lust auf Italienisch.
11. U1: Ja, das ist auch 'ne gute Idee.
12. U2: Aber es soll nicht so teuer sein.
13. U1: Ja, stimmt.
14. U2: Ich hab kaum Geld noch.
15. S: Hallo, mein Name ist Helmut. Ich glaube ich könnte Ihnen helfen. Die Anfrage nach einer Lokalität mit italienischer Küche ergab 6 Treffer. Bitte schränken Sie die Suche weiter ein.
16. U1: Ja, dann geh' ma in die Innenstadt, weil wir treffen uns nachher sowieso mit den andern...
17. U2: Ja, stimmt.
18. S: Die Anfrage nach einer Lokalität mit italienischer Küche in der Innenstadt zu niedriger Preisklasse ergab 5 Treffer. Bitte schränken Sie die Suche weiter ein.
19. U1: Ja, und Italienisch, und dann noch Cocktails dazu?
20. S: Auf Ihre Anfrage nach einer Lokalität mit italienischer Küche in der Innenstadt zu niedriger Preisklasse mit Cocktails passt Café Statthäusle. Möchten Sie die Speisekarte sehen?
21. U1: Oh ja, bitte, %lachen%
22. U1: Ok, also es gibt Pizza, und Snacks, Suppen, Baguettes..
23. U2: Wolltest du nicht Mexikanisch haben?
24. U1: Ja, haha, wär auch 'ne Idee, %lachen%
25. U1: und für Mexikanisch?
26. S: Die Anfrage nach einer Lokalität mit mexikanischer Küche ergab 2 Treffer. Meine Vorschläge sind Restaurant Flamenco und Restaurant Pepper's.
27. U2: Ja, aber gibt's da auch Cocktails?
28. U1: Cocktails? %lachen%
29. S: Die Anfrage nach einer Lokalität mit mexikanischer Küche mit Cocktails ergab 2 Treffer. Meine Vorschläge sind Restaurant Flamenco und Restaurant Pepper's.
30. U1: Ja das hört sich jetzt auch gut an.
31. U2: Ok, dann war das das gleiche.
32. U1: Ja, und sind die auch in der Innenstadt? %lachen%
33. S: Flamenco befindet sich in Herdbruckerstrasse 18. Wollen Sie einen Stadtplan sehen?
34. U1: Ja, %lachen% ja.
35. U1: Und wie sieht dann die Speisekarte aus?
36. U1: %unverständlich%
37. U1: Du kannst gar nicht sehen, gell?
38. U1: Dann geh ma da hin, %lachen% oder?
39. U1: Ja?
40. U2: Ja.
41. S: Auf Wiedersehen. Ich hoffe, ich konnte Ihnen helfen, und wünsche einen guten Appetit.
42. U1: %lachen%
43. U2: Danke.
C Questionnaire
The questionnaire deployed for the Session III recordings is listed in part in the following. AttrakDiff [Hassenzahl et al., 2003] was additionally used in its original form for all recordings and is thus not listed here. The first part of the displayed questionnaire (16 items) shows SASSISV; refer to Figure C.1. The second part, shown in Figure C.2, presents the SASSI [Hone and Graham, 2000] items that are not deployed in SASSISV. The final part, displayed in Figure C.3, presents the questionnaire items added in order to obtain subjective evaluation results concerning the system interaction. The questionnaire in this form was deployed for the Session III recordings only; Sessions I and II used only SASSISV. The evaluation results displayed denote the mean values of the ratings of the Session IIIa and IIIb data.
01 Das System hat nicht immer das gemacht was ich wollte. The system didn't always do what I wanted.
02 Ich hatte das Gefühl die Kontrolle über die Interaktion mit dem System zu haben. I felt in control of the interaction with the system.
03 Ich wusste immer, wie ich mit dem System zu sprechen habe. I always knew what to say to the system.
04 Der Umgang mit dem System ist frustrierend. The interaction with the system is frustrating.
05 Die Interaktion mit dem System ist schnell. The interaction with the system is fast.
06 Das System ist freundlich. The system is friendly.
07 Das System macht wenige Fehler. The system makes few errors.
08 Das System ist unzuverlässig. The system is unreliable.
09 Es ist klar, wie man mit dem System zu sprechen hat. It is clear how to speak to the system.
10 Ich fühlte mich angespannt im Umgang mit dem System. I felt tense using the system.
11 Der Umgang mit dem System ist langweilig. The interaction with the system is boring.
12 Das System ist einfach zu benutzen. The system is easy to use.
13 Ein hohes Maß an Konzentration ist im Umgang mit dem System nötig. A high level of concentration is required when using the system.
14 Ich fühlte mich vertraut im Umgang mit dem System. I felt confident using the system.
15 Das System ist nützlich. The system is useful.
16 Ich würde das System im Alltag benutzen. I would use this system.

(Items were rated on a 7-point scale from 1 = starke Ablehnung / strongly disagree over 4 = neutral to 7 = starke Zustimmung / strongly agree; the figure shows the mean ratings separately for Sessions IIIa and IIIb.)

Fig. C.1. SASSISV questionnaire.
17 Das System ist präzise. The system is accurate.
18 Das System hat nicht immer das gemacht was ich erwartete. The system didn't always do what I expected.
19 Das System reagiert zu langsam. The system responds too slowly.
20 Das System ist angenehm. The system is pleasant.
21 Die Interaktion mit dem System ist unvorhersehbar. The interaction with the system is unpredictable.
22 Ich konnte mich nach Fehlern leicht wieder zurechtfinden. I was able to recover easily from errors.
23 Es hat mir Spass gemacht das System zu benutzen. I enjoyed using the system.
24 Das System ist effizient. The interaction with the system is efficient.
25 Während der Interaktion ist es leicht den Überblick darüber zu verlieren, wo in der Interaktion man sich gerade befindet. It is easy to lose track of where you are in an interaction with the system.
26 Der Umgang mit dem System ist irritierend. The interaction with the system is irritating.
27 Das System ist widerspruchsfrei. The interaction with the system is consistent.
28 Es ist einfach zu lernen wie man mit dem System umzugehen hat. It is easy to learn to use the system.
29 Das System ist zuverlässig. The system is dependable.
30 Ich fühlte mich gelassen im Umgang mit dem System. I felt calm using the system.
31 Ich habe mich manchmal gefragt, ob ich das System mit den richtigen Worten angesprochen habe. I sometimes wondered if I was using the right word.
32 Der Umgang mit dem System beinhaltet viele Wiederholungen. The interaction with the system is repetitive.
33 Ich war nicht immer sicher, was das System gerade macht. I was not always sure what the system was doing.
34 Das System ist zu unflexibel. The system is too inflexible.

(Same 7-point scale and IIIa/IIIb mean ratings as in Figure C.1.)

Fig. C.2. SASSI questionnaire without SASSISV items.
35 Das System hat mich bei meiner Restaurant-Suche gut unterstützt. The system has supported me well during the restaurant search.
36 Die erste Interaktion des Systems kam unerwartet. The first interaction of the system came unexpected.
37 Das Gespräch führte schnell zum gewünschten Ziel. The conversation quickly led to the desired goal.
38 Der Gesprächsverlauf war holprig. The flow of the conversation was bumpy.
39 Das Gespräch war zu lang. The conversation was too long.
40 Das System reagiert wie ein Mensch. The system reacts like a human.
41 Die Benutzung des Systems hat sich gelohnt. Using the system was worth it.
42 Das System hat mich immer gut verstanden. The system always understood me well.
43 Die Stimme des Systems ist sympathisch. The voice of the system is pleasant.
44 Das System hat sich zu oft zu Wort gemeldet. The system interacted too frequently.
45 Das System ist höflich. The system is polite.
46 Das System hat mich oft unterbrochen. The system frequently interrupted me.
47 Die System-Äußerungen sind abwechslungsreich. The system utterances are varied.
48 Die System-Äußerungen waren zu lang. The system utterances were too long.
49 Das System ist aufdringlich. The system is intrusive.
50 Das Verhalten des Systems ist eintönig. The behaviour of the system is monotonous.

(Same 7-point scale and IIIa/IIIb mean ratings as in Figures C.1 and C.2.)

Fig. C.3. Questionnaire items for system interaction survey.
Index
2001 – A Space Odyssey, 1
adjacency pairs, 28, 29, 94
AMI, 8, 51
AMIDA, 8
annotation, 52, 67, 129, 147, 148
Anvil, 68
ATR, 54
attentiveness, 12, 26, 98
AttrakDiff, 115–117, 120, 123, 138, 157
attribute value pairs, 3, 74, 82
automatic speech recognition, 2, 60, 81
avatar, 10, 61, 63, 115, 123, 130, 134, 142, 153
Berlin Database of Emotional Speech, 60
CHIL, 9, 52
common ground, 23, 24, 35, 38, 43, 92
context, see dialogue context
conversational roles, 17, 23, 44, 58
  dialogical, 25, 26
  social, 25, 27
  task, 25
corpus, 17, 51
  multi-party, 14, 51, 65, 141
DAMSL, 67
data collection, 14, 17, 55
dialogue
  acts, see dialogue acts
  analysis, 68
  context, see dialogue context
  history, 4, 12, 13, 36, 43, 49, 73, 81, 92, 98, 142
  initiative, see initiative
  management, see dialogue management
  model, 4, 34, 37, 80, 81
  moves, 39, 42, 72, 84
  multi-party, 10, 22, 30, 38, 43
  plan, 91
  system, see spoken language dialogue system
  task oriented, 28
dialogue acts, 28, 33, 68, 71, 81, 84, 97
  tagset, 29, 67, 84
dialogue context, 4, 12, 34, 35, 37, 81
  cognitive, 36, 81
  linguistic, 35, 81
  physical and perceptual, 36, 81
  semantic, 35, 81
  social, 28, 36, 81
Dialogue Gameboard Theory, 39
dialogue management, 3, 34
  agent-based, 4
  finite-state-based, 4, 38
  frame-based, 4, 38
  multi-party, 13, 43, 73, 86, 143
  rule-based, 5
  statistical, 5
discourse motivated constraint prioritisation, 109, 111, 112, 127, 139, 144
domain, 4, 10, 51, 73, 139, 144, 151
  model, 35, 73, 81, 91
Dynamic Interpretation Theory, 35, 37, 67, 81
EDIS, 38, 39
evaluation, 19, 115
  methods, 19, 21
  objective, 20, 115, 124
  participants, 118
  subjective, 19, 20
  usability, see usability
GALAXY, 37
gaze direction analysis, 21, 68, 129, 133, 139, 145
GODIS, 38, 39
grounding, 24, 34, 38, 43
  cautiously optimistic, 92
  optimistic, 25, 92, 143
  pessimistic, 25
  strategy, 25, 32, 74, 92
Grounding and Obligation, 38
Hadifix, 63, 154
Hawthorne effect, 18
HTML, 64
IBiS, 38, 39, 42, 46, 75
ICSI, 51
information
  retrieval, 4
  state, 39, 43, 75, 86, 100
  state updates, 84, 85
Information State Update Approach, 5, 17, 38, 39, 42, 73, 75, 143
initiative, 4, 31, 33
intelligent environments, 7
interaction
  human-computer, 2, 29, 33, 60, 99, 115, 141
  human-human, 28–30, 33, 116, 141
  parameters, 20
  patterns, 22, 26
  principle, 79, 143
  protocols, 42, 49, 78, 84
ISL, 51
Java, 60, 61, 81, 151
Kruskal-Wallis Test, 119, 120
Likert scale, 117, 136
LINLIN, 36
M4 project, 53
Mann-Whitney U Test, 123, 124
MBROLA, 63, 154
MRE, 38, 54
MSC1, 52
Multi-Attribute Decision Theory, 108
Multi-IBiS, 46, 47
MySQL, 154
natural language
  generation, 5, 81
  understanding, 3, 81
NEEM, 9
NIST, 53
ontology, 35, 82, 144
PARADISE, 20, 124
PIT, 14, 55
  corpus, 51, 125, 128, 138, 141, 144
principle of responsibility, 23
proactive interaction point, 95, 142
proactiveness, 2, 12, 13, 71, 98, 116, 133, 139, 141
problem solving, 5, 13, 82, 107, 127, 144
PTT, 38
Question Under Discussion, 38, 40
questionnaire, 57, 118, 133, 145, 157
SASSI, 115–117, 138, 157
SASSISV, 117, 120, 123, 136, 157
semantic
  analysis, see semantic analysis
  parser, 74, 81, 146
  representation, 3, 35, 74
semantic analysis, 3, 35, 60, 74, 108, 129
  rule-based, 3
  statistical, 3
Speech Act Theory, 17, 22, 25
spoken language dialogue system
  architecture, 2
  multi-modal, 5, 147
  multi-party, 2, 38
  speaker-adaptive, 3
  speaker-dependent, 3
  speaker-independent, 3
  task-oriented, 31
system interaction
  policies, 59
  strategy, 58, 94
task model, 4, 37, 49, 73, 77, 82, 92, 101
technical self assessment, 115, 116, 118, 119
text to phoneme, 63
text to speech, 6, 81
TRAINS, 37
TRINDIKIT, 39
turn taking, 4, 22, 28, 34, 38, 43, 86
ubiquitous computing, 1, 7
usability, 19, 20, 116, 119, 138, 145
user
  constraints, 35, 82, 93, 108, 127
  modelling, 32, 101, 108
VACE, 53
VERBMOBIL, 37, 67
VoiceXML, 61
WAXHOLM, 37
WITAS, 37
Wizard Interaction Tool, 13, 60, 62, 74, 151
Wizard-of-Oz, 14, 19, 51, 55, 60, 125, 141, 144
XML, 64, 154
References
[Adair, 1984] Adair, G. (1984). The Hawthorne Effect: A Reconsideration of the Methodological Artifact. Journal of Applied Psychology, 69(2):334–345.

[Alexandersson et al., 1998] Alexandersson, J., Buschbeck-Wolf, B., Fujinami, T., Kipp, M., Koch, S., Maier, E., Reithinger, N., Schmitz, B., and Siegel, M. (1998). Dialogue acts in VERBMOBIL-2, second edition. Technical Report 226, DFKI, Saarbrücken, Germany.

[Alexandersson et al., 1995] Alexandersson, J., Maier, E., and Reithinger, N. (1995). A robust and efficient three-layered dialogue component for a speech-to-speech translation system. In Proceedings of the 7th Conference of the European Chapter of the ACL (EACL), pages 188–193, Dublin, Ireland.

[Allen, 1995] Allen, J. (1995). Natural language understanding. Benjamin-Cummings, Redwood City, CA, USA, 2nd edition.

[Austin, 1962] Austin, J. L. (1962). How to do Things with Words. Oxford University Press.
[Bach and Harnish, 1979] Bach, K. and Harnish, R. (1979). Linguistic Communication and Speech Acts. M.I.T. Press, Cambridge, MA, USA. [Baker et al., 1998] Baker, C. F., Fillmore, C. J., and Lowe, J. B. (1998). The Berkeley FrameNet project. In Boitet, C. and Whitelock, P., editors, Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pages 86–90, San Francisco, CA, USA. Morgan Kaufmann Publishers. [Barthelmess and Ellis, 2005] Barthelmess, P. and Ellis, C. (2005). The Neem Platform: An Evolvable Framework for Perceptual Collaborative Applications. Intelligent Information Systems, 25(2):207–240.
[Baum, 1900] Baum, L. F. (1900). The wonderful Wizard of Oz. George M. Hill, Chicago, IL, USA.

[Berger et al., 1980] Berger, J., Rosenholtz, S. J., and Zelditch, Jr., M. (1980). Status organizing processes. Annual Review of Sociology, 6:479–508.

[Beywl et al., 2007] Beywl, W., Kehr, J., Mäder, S., and Niestroj, M. (2007). Evaluation Schritt für Schritt: Planung von Evaluationen, volume 20(26). Heidelberger Institut Beruf und Arbeit (hiba), Heidelberg, Germany.

[Bolton, 1979] Bolton, R. (1979). People Skills: How to Assert Yourself, Listen to Others, and Resolve Conflicts. Simon & Schuster, New York, NY, USA.

[Bortz and Döring, 2006] Bortz, J. and Döring, N. (2006). Forschungsmethoden und Evaluation: für Human- und Sozialwissenschaftler. Springer, Heidelberg, Germany, 4th edition.

[Branigan, 2006] Branigan, H. (2006). Perspectives on multi-party dialogue. Research on Language and Computation, 1:1–25.

[Bunt, 1999] Bunt, H. (1999). Context representation for dialogue management. In Proceedings of the 2nd International and Interdisciplinary Conference on Modeling and Using Context (CONTEXT), volume 1688 of Lecture Notes in Computer Science, pages 77–90. Springer.

[Bunt, 2000] Bunt, H. (2000). Dialogue pragmatics and context specification. In Abduction, Belief and Context in Dialogue; Studies in Computational Pragmatics, pages 81–150. John Benjamins.

[Bunt, 1994] Bunt, H. C. (1994). Context and dialogue control. THINK Quarterly, 3:19–31.

[Burchardt et al., 2006] Burchardt, A., Erk, K., Frank, A., Kowalski, A., Pado, S., and Pinkal, M. (2006). SALTO – A Versatile Multi-Level Annotation Tool. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.

[Burger et al., 2002] Burger, S., MacLaren, V., and Yu, H. (2002). The ISL Meeting Corpus: The Impact Of Meeting Type on Speech Type. In Proceedings of the International Conference on Spoken Language Processing (ICSLP), Denver, CO, USA.
[Burger and Sloane, 2004] Burger, S. and Sloane, Z. (2004). The ISL meeting corpus: Categorical features of communicative groups interactions. In Proceedings of the NIST ICASSP Meeting Recognition Workshop, Montreal, Canada. [Burkhardt et al., 2005] Burkhardt, F., Paeschke, A., Rolfes, M., Sendlmeier, W., and Weiss, B. (2005). A Database of German Emotional Speech. In Proceedings of the 9th Conference on Speech Communication and Technology (INTERSPEECH),
Lisbon, Portugal.

[Campbell, 2008] Campbell, N. (2008). Tools and resources for visualising conversational-speech interaction. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), Marrakech, Morocco.

[Carberry et al., 1999] Carberry, S., Chu-Carroll, J., and Elzer, S. (1999). Constructing and utilizing a model of user preferences in collaborative consultation dialogues. Computational Intelligence, 15:185–217.

[Carletta et al., 2002] Carletta, J., Anderson, A. H., and Garrod, S. (2002). Seeing eye to eye: An account of grounding and understanding in work groups. Cognitive Studies: Bulletin of the Japanese Cognitive Science Society, 9(1):1–20.

[Carletta et al., 2005] Carletta, J., Ashby, S., Bourban, S., Flynn, M., Guillemot, M., Hain, T., Kadlec, J., Karaiskos, V., Kraaij, W., Kronenthal, M., Lathoud, G., Lincoln, M., Lisowska, A., McCowan, I., Post, W., Reidsma, D., and Wellner, P. (2005). The AMI meetings corpus. In Proceedings of the Measuring Behavior 2005 Symposium on Annotating and Measuring Meeting Behavior.

[Carlson et al., 1995] Carlson, R., Hunnicutt, S., and Gustafsson, J. (1995). Dialog management in the Waxholm system. In Proceedings of the 8th Swedish Phonetics Conference, Working Papers 43, pages 137–140, Lund, Sweden.

[Chen et al., 2005] Chen, L., Rose, R. T., Qiao, Y., Kimbara, I., Parrill, F., Welji, H., Han, X., Tu, J., Huang, Z., Harper, M., Queck, F., Xiong, Y., McNeill, D., Tuttle, R., and Huang, T. (2005). VACE multimodal meeting corpus. In Proceedings of the 2nd Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI), Edinburgh, UK.

[Chu-Carroll, 1999] Chu-Carroll, J. (1999). Form-based reasoning for mixed-initiative dialogue management in information-query systems. In Proceedings of the 6th European Conference on Speech Communication and Technology (EUROSPEECH), pages 1519–1522, Budapest, Hungary.

[Clark, 1996] Clark, H. (1996). Using language. Cambridge University Press.

[Clark and Carlson, 1982] Clark, H. and Carlson, T. (1982). Hearers and Speech Acts. Language, 58(2):332–373.

[Clark and Schaefer, 1987] Clark, H. and Schaefer, E. (1987). Concealing one's meaning from overhearers. Journal of Memory and Language, 26:209–225.

[Clark and Schaefer, 1989] Clark, H. and Schaefer, E. (1989). Contributing to discourse. Cognitive Science, 13:259–294.

[Clark and Schober, 1989] Clark, H. and Schober, M. (1989). Understanding by addressees and overhearers. Cognitive Psychology, 21:211–232.
[Cooper and Larsson, 1998] Cooper, R. and Larsson, S. (1998). Dialogue moves and information states. In Proceedings of the 3rd International Workshop on Computational Semantics (IWCS), Tilburg, The Netherlands.

[Core and Allen, 1997] Core, M. G. and Allen, J. F. (1997). Coding Dialogs with the DAMSL Annotation Scheme. In AAAI Fall Symposium on Communicative Action in Humans and Machines, Cambridge, MA, USA.

[Cusano et al., 2004] Cusano, C., Ciocca, G., and Schettini, R. (2004). Image annotation using SVM. In Internet Imaging IV. SPIE.

[Dahlbäck et al., 1993] Dahlbäck, N., Jönsson, A., and Ahrenberg, L. (1993). Wizard of Oz studies - Why and how. In Proceedings of the ACM International Workshop on Intelligent User Interfaces (IUI), pages 193–200, Orlando, FL, USA.

[Danninger et al., 2005] Danninger, M., Flaherty, G., Bernardin, K., Ekenel, H. K., Köhler, T., Malkin, R., Stiefelhagen, R., and Waibel, A. (2005). The Connector: Facilitating context-aware communication. In Proceedings of the 7th International Conference on Multimodal Interfaces (ICMI), pages 69–75, New York, NY, USA. ACM Press.

[Diekmann, 2007] Diekmann, A. (2007). Empirische Sozialforschung. Grundlagen, Methoden, Anwendungen. Rowohlt Taschenbuchverlag, Reinbek bei Hamburg, Germany, 18th edition.

[Doran et al., 2001] Doran, C., Aberdeen, J., Damianos, L., and Hirschman, L. (2001). Comparing Several Aspects of Human-Computer and Human-Human Dialogues. In von Kuppevelt, J. and Smith, R., editors, Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, Aalborg, Denmark.

[Dudda, 2001] Dudda, C. (2001). Evaluierung eines natürlichen Dialogsystems für Restaurantauskünfte. PhD thesis, Institut für Kommunikationsakustik, Ruhr-Universität, Bochum, Germany.

[Dutoit, 2001] Dutoit, T. (2001). An Introduction to Text-to-Speech Synthesis. Kluwer, Norwell, MA, USA.
[Erk and Pado, 2006] Erk, K. and Pado, S. (2006). Shalmaneser - a flexible toolbox for semantic role assignment. In Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC), Genoa, Italy.

[Feuerberg et al., 2005] Feuerberg, B. V., Bahner, J. E., and Manzey, D. (2005). Interindividuelle Unterschiede im Umgang mit Automation - Entwicklung eines Fragebogens zur Erfassung des Complacency-Potentials, pages 199–202. Fortschrittberichte VDI, Nr. 22, Düsseldorf, Germany.

[Garofolo et al., 2004] Garofolo, J., Laprun, C., Michel, M., Stanford, V., and Tabassi, E. (2004). The NIST meeting room pilot corpus. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC),
Lisbon, Portugal.

[Ginzburg, 1996] Ginzburg, J. (1996). Interrogatives: Questions, facts and dialogue. In The Handbook of Contemporary Semantic Theory, pages 385–422. Blackwell Publishers, Oxford.

[Ginzburg and Fernández, 2005] Ginzburg, J. and Fernández, R. (2005). Action at a distance: the difference between dialogue and multilogue. In Proceedings of the 9th Workshop on the Semantics and Pragmatics of Dialogue (DIALOR), Nancy, France.

[Goffman, 1981] Goffman, E. (1981). Footing. University of Pennsylvania Press.

[Grice, 1968] Grice, H. (1968). Utterer's meaning, sentence meaning, and word-meaning. Foundations of Language, 4:225–242.

[Grosz, 1977] Grosz, B. J. (1977). The representation and use of focus in dialogue understanding. In Grosz, B. J., Jones, K. S., and Webber, B. L., editors, Readings in Natural Language Processing. Morgan Kaufmann Publishers Inc.

[Guindon et al., 1987] Guindon, R., Shuldberg, K., and Conner, J. (1987). Grammatical and ungrammatical structures in user-adviser dialogues: Evidence for sufficiency of restricted languages in natural language interfaces to advisory systems. In Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (ACL), pages 41–44, Stanford, CA, USA.

[Hassenzahl et al., 2003] Hassenzahl, M., Burmester, M., and Koller, F. (2003). AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität. Mensch & Computer 2003. Interaktion in Bewegung, pages 187–196.

[Hone and Graham, 2000] Hone, K. S. and Graham, R. (2000). Towards a tool for the Subjective Assessment of Speech System Interfaces (SASSI). Natural Language Engineering, 6(3-4):287–303.

[Huang et al., 2001] Huang, X., Acero, A., and Hon, H.-W. (2001). Spoken Language Processing: A Guide to Theory, Algorithm, and System Development. Prentice Hall, Upper Saddle River, NJ, USA.
[Janin et al., 2003] Janin, A., Baron, D., Edwards, J., Ellis, D., Gelbart, D., Morgan, N., Peskin, B., Pfau, T., Shriberg, E., Stolcke, A., and Wooters, C. (2003). The ICSI meeting corpus. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 364–367, Hong Kong. [Jelinek, 1997] Jelinek, F. (1997). Statistical methods for speech recognition. MIT Press, Cambridge, MA, USA. [Jeon and Manmatha, 2004] Jeon, J. and Manmatha, R. (2004). Using maximum entropy for automatic image annotation. In Proceedings of the 3rd International
Conference on Image and Video Retrieval (CIVR), pages 24–32, Dublin, Ireland.

[Jönsson, 1997] Jönsson, A. (1997). A model for habitable and efficient dialogue management for natural language interaction. Natural Language Engineering, 3(2):103–122.

[Jönsson and Dahlbäck, 1988] Jönsson, A. and Dahlbäck, N. (1988). Talking to a computer is not like talking to your best friend. In Proceedings of the First Scandinavian Conference on Artificial Intelligence (SCAI), pages 53–68, Tromso, Norway.

[Jönsson and Dahlbäck, 2000] Jönsson, A. and Dahlbäck, N. (2000). Distilling dialogues - a method using natural dialogue corpora for dialogue systems development. In Proceedings of the 6th Applied Natural Language Processing Conference, pages 44–51. McGraw-Hill.

[Jovanovic et al., 2006] Jovanovic, N., op den Akker, R., and Nijholt, A. (2006). A corpus for studying addressing behaviour in multi-party dialogues. Language Resources and Evaluation, 40(1):5–23.

[Jurafsky and Martin, 2000] Jurafsky, D. and Martin, J. H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition. Prentice Hall, Upper Saddle River, NJ, USA.

[Kamm et al., 1999] Kamm, A., Walker, M., and Litman, D. (1999). Evaluating spoken language systems. In Proceedings of the American Voice Input/Output Society (AVIOS).

[Kipp, 2001] Kipp, M. (2001). ANVIL – a generic annotation tool for multimodal dialogue. In Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH), Aalborg, Denmark.

[Klein and Manning, 2003] Klein, D. and Manning, C. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics (ACL), pages 423–430, Sapporo, Japan.

[Kronlid, 2008] Kronlid, F. (2008). Steps Towards Multi-Party Dialogue Management. PhD thesis, Göteborg University, Sweden.

[Kruskal and Wallis, 1952] Kruskal, W. and Wallis, W. A. (1952). Use of ranks in one-criterion analysis of variance. Journal of the American Statistical Association, 47:583–621.

[Kubrick, 1968] Kubrick, S. (Director). (1968). 2001 – A Space Odyssey. [Motion Picture] Metro-Goldwyn-Mayer Pictures Inc. (MGM), USA.

[Larsson, 2002] Larsson, S. (2002). Issue-based Dialogue Management. PhD thesis, Göteborg University, Sweden.
[Larsson et al., 2000] Larsson, S., Ljunglöf, P., Cooper, R., Engdahl, E., and Ericsson, S. (2000). GoDiS - an accommodating dialogue system. In Proceedings of the ANLP/NAACL Workshop on Conversational Systems, pages 7–10, Seattle, WA, USA.

[Larsson and Traum, 2000] Larsson, S. and Traum, D. R. (2000). Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering, 6(3-4):323–340.

[Lemon et al., 2001] Lemon, O., Bracy, A., Gruenstein, A., and Peters, S. (2001). The WITAS multi-modal dialogue system I. In Proceedings of the 7th European Conference on Speech Communication and Technology (EUROSPEECH), pages 1559–1562, Aalborg, Denmark.

[Lemon et al., 2006] Lemon, O., Georgila, K., and Henderson, J. (2006). Evaluating effectiveness and portability of reinforcement learning dialogue strategies with real users: The TALK TownInfo evaluation. In Proceedings of the 1st IEEE-ACL Workshop on Spoken Language Technology (SLT), pages 182–186, Aruba.

[Levin and Pieraccini, 1997] Levin, E. and Pieraccini, R. (1997). A stochastic model of computer-human interaction for learning dialogue strategies. In Proceedings of the 5th European Conference on Speech Communication and Technology (EUROSPEECH), pages 1883–1886, Rhodes, Greece.

[Mana et al., 2007] Mana, N., Lepri, B., Chippendale, P., Cappelletti, A., Pianesi, F., Svaizer, P., and Zancanaro, M. (2007). Multimodal corpus of multi-party meetings for automatic social behaviour analysis and personality traits detection. In Proceedings of the 9th International Conference on Multimodal Interfaces (ICMI), Nagoya, Japan.

[Mann and Whitney, 1947] Mann, H. B. and Whitney, D. R. (1947). On a test of whether one of two random variables is stochastically larger than the other. Annals of Mathematical Statistics, 18:50–60.

[Matheson et al., 2000] Matheson, C., Poesio, M., and Traum, D. (2000). Modelling grounding and discourse obligations using update rules. In Proceedings of the 1st Conference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 1–8, Seattle, WA, USA.

[McCowan et al., 2005] McCowan, I., Gatica-Perez, D., Bengio, S., Lathoud, G., Barnard, M., and Zhang, D. (2005). Automatic analysis of multimodal group actions in meetings. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27:305–317.

[McTear, 2002] McTear, M. F. (2002). Spoken dialogue technology: enabling the conversational user interface. ACM Press, New York, NY, USA.

[Michel et al., 2007] Michel, M., Ajot, J., and Fiscus, J. (2007). The NIST Meeting Room Corpus 2 Phase 1, volume 4299/2006, pages 13–23. Lecture Notes in
172
References
Computer Science, Springer, Heidelberg, Germany. [Minker et al., 1999] Minker, W., Waibel, A., and Mariani, J. (1999). Stochasticallybased semantic analysis. Kluwer Academic Publishers, Boston, MA, USA. [M¨ oller, 2005] M¨ oller, S. (2005). Quality of Telephone-Based Spoken Dialogue Systems. Springer, New York, NY, USA. [Moore, 2002] Moore, D. (2002). The IDIAP smart meeting room. report, IDIAP, Martigny, Switzerland.
Technical
[Mostefa et al., 2007] Mostefa, D., Moreau, N., Choukri, K., Potamianos, G., Chu, S. M., Tyagi, A., Casas, J. R., Turmo, J., Cristofetti, L., Tobia, F., Pnevmatikakis, A., Mylonakis, V., Talantzis, F., Burger, S., Stiefelhagen, R., Bernardin, K., and Rochet, C. (2007). The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms. Language Resources and Evaluation, 41(3–4):389–407.
[Nielsen, 1993] Nielsen, J. (1993). Usability Engineering. Academic Press, Boston, MA, USA.
[Paek, 2001] Paek, T. (2001). Empirical methods for evaluating dialogue systems. In von Kuppevelt, J. and Smith, R., editors, Proceedings of the 2nd SIGdial Workshop on Discourse and Dialogue, pages 100–107, Aalborg, Denmark.
[Petukhova and Bunt, 2007] Petukhova, V. and Bunt, H. (2007). A multidimensional approach to multimodal dialogue act annotation. In Proceedings of the 7th International Workshop on Computational Semantics (IWCS-7), pages 142–153, Tilburg, The Netherlands.
[Pianesi et al., 2007] Pianesi, F., Zancanaro, M., Lepri, B., and Cappelletti, A. (2007). A multimodal annotated corpus of consensus decision making meetings. Language Resources and Evaluation, 41(3–4):409–429.
[Poesio and Traum, 1998] Poesio, M. and Traum, D. (1998). Towards an axiomatization of dialogue acts. In Proceedings of the Twente Workshop on the Formal Semantics and Pragmatics of Dialogues, pages 207–222, Twente, The Netherlands.
[Qu and Beale, 1999] Qu, Y. and Beale, S. (1999). A constraint-based model for cooperative response generation in information dialogues. In Proceedings of the 16th National Conference on Artificial Intelligence (AAAI/IAAI), pages 148–155, Orlando, FL, USA.
[Rabiner and Juang, 1993] Rabiner, L. and Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall, Upper Saddle River, NJ, USA.
[Reilly, 1987] Reilly, R. (1987). Ill-formedness and miscommunication in person-machine dialogue. Information and Software Technology, 29(2):69–74.
[Reiter, 1994] Reiter, E. (1994). Has a consensus NL generation architecture appeared, and is it psycholinguistically plausible? In Proceedings of the 7th International Workshop on Natural Language Generation, pages 163–170, Kennebunkport, ME, USA.
[Reiter and Dale, 2000] Reiter, E. and Dale, R. (2000). Building Natural Language Generation Systems. Cambridge University Press, New York, NY, USA.
[Renals, 2005] Renals, S. (2005). AMI: Augmented Multiparty Interaction. In Proceedings of the NIST Meeting Transcription Workshop, Montreal, Canada.
[Rickel et al., 2002] Rickel, J., Marsella, S., Gratch, J., Hill, R., Traum, D., and Swartout, W. (2002). Toward a new generation of virtual humans for interactive experiences. IEEE Intelligent Systems, 17:32–38.
[Robinson et al., 2004] Robinson, S., Martinovski, B., Garg, S., Stephan, J., and Traum, D. (2004). Issues in corpus development for multi-party multi-modal task-oriented dialogue. In Proceedings of the 4th International Conference on Language Resources and Evaluation (LREC), pages 1707–1710, Lisbon, Portugal.
[Scherer and Strauß, 2008] Scherer, S. and Strauß, P.-M. (2008). A flexible Wizard-of-Oz environment for rapid prototyping. In Proceedings of the 6th International Conference on Language Resources and Evaluation (LREC), pages 958–961, Marrakech, Morocco.
[Searle, 1969] Searle, J. (1969). Speech Acts: An Essay in the Philosophy of Language. Cambridge University Press, Cambridge, England.
[Searle, 1975] Searle, J. R. (1975). Indirect speech acts. In Syntax and Semantics, volume 3: Speech Acts, pages 59–82.
[Seneff et al., 1996] Seneff, S., Goddeau, D., Pao, C., and Polifroni, J. (1996). Multimodal discourse modelling in a multi-user multi-domain environment. In Proceedings of ICSLP '96, volume 1, pages 192–195, Philadelphia, PA, USA.
[Singh et al., 2002] Singh, S., Litman, D., Kearns, M., and Walker, M. (2002). Optimizing dialogue management with reinforcement learning: Experimenting with the NJFun system. Journal of Artificial Intelligence Research, 16:105–133.
[Stiefelhagen et al., 2004] Stiefelhagen, R., Steusloff, H., and Waibel, A. (2004). CHIL – Computers in the Human Interaction Loop. In Proceedings of the NIST ICASSP Meeting Recognition Workshop, Montreal, Canada.
[Strauß, 2006] Strauß, P.-M. (2006). A SLDS for perception and interaction in multi-user environments. In Proceedings of the 2nd IEE International Conference on Intelligent Environments (IE), Athens, Greece.
[Strauß et al., 2007] Strauß, P.-M., Hoffmann, H., and Scherer, S. (2007). Evaluation and user acceptance of a dialogue system using Wizard-of-Oz recordings. In Proceedings of the 3rd IET International Conference on Intelligent Environments (IE), Ulm, Germany.
[Strauß and Jahn, 2007] Strauß, P.-M. and Jahn, M. (2007). Using frame semantics on a domain dependent corpus. In Proceedings of the IJCAI Workshop on Modeling and Representation in Computational Semantics (MRCS), Hyderabad, India.
[Swartout et al., 2005] Swartout, W., Gratch, J., Hill Jr., R. W., Hovy, E., Lindheim, R., Marsella, S., Rickel, J., and Traum, D. (2005). Simulation meets Hollywood: Integrating graphics, sound, story and character for immersive simulation. In Stock, O. and Zancanaro, M., editors, Multimodal Intelligent Information Presentation, volume 27 of Text, Speech and Language Technology. Kluwer.
[Swartout, 2006] Swartout, W. R. (2006). Virtual humans. In Proceedings of the 21st National Conference on Artificial Intelligence and the 18th Innovative Applications of Artificial Intelligence Conference (AAAI), Boston, MA, USA.
[Traum, 2003] Traum, D. (2003). Semantics and pragmatics of questions and answers of dialogue agents. In Proceedings of the International Workshop on Computational Semantics, pages 380–394.
[Traum, 2004] Traum, D. (2004). Issues in multi-party dialogues. In Dignum, F., editor, Advances in Agent Communication, Lecture Notes in Artificial Intelligence 2922, pages 201–211. Springer.
[Traum and Allen, 1994] Traum, D. and Allen, J. (1994). Discourse obligations in dialogue processing. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (ACL), pages 1–8.
[Traum et al., 1999] Traum, D., Bos, J., Cooper, R., Larsson, S., Lewin, I., Matheson, C., and Poesio, M. (1999). A model of dialogue moves and information state revision. Technical report, Department of Linguistics, Göteborg University, Sweden.
[Traum, 1996] Traum, D. R. (1996). Conversational agency: The TRAINS-93 dialogue manager. In Proceedings of the Twente Workshop on Language Technology: Dialogue Management in Natural Language Systems (TWLT), pages 1–11, Twente, The Netherlands.
[Traum, 2000] Traum, D. R. (2000). 20 questions on dialogue act taxonomies. Journal of Semantics, 17(1):7–30.
[Viola and Jones, 2004] Viola, P. and Jones, M. (2004). Robust real-time face detection. International Journal of Computer Vision, 57(2):137–154.
[Walker et al., 2000] Walker, M., Kamm, A., and Bol, J. (2000). Developing and testing general models of spoken dialogue system performance. In Proceedings of the 2nd International Conference on Language Resources and Evaluation (LREC), Athens, Greece.
[Walker et al., 1997] Walker, M. A., Litman, D. J., Kamm, C. A., and Abella, A. (1997). PARADISE: A framework for evaluating spoken dialogue agents. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL), pages 271–280, Madrid, Spain.
[Walker et al., 2004] Walker, M. A., Whittaker, S. J., Stent, A., Maloor, P., Moore, J. D., Johnston, M., and Vasireddy, G. (2004). Generation and evaluation of user tailored responses in multimodal dialogue. Cognitive Science, 28(5):811–840.
[Whittaker et al., 2002] Whittaker, S. J., Walker, M. A., and Moore, J. D. (2002). Fish or fowl: A Wizard of Oz evaluation of dialogue strategies in the restaurant domain. In Proceedings of the 3rd International Conference on Language Resources and Evaluation (LREC), Las Palmas, Canary Islands, Spain.
[Williams and Young, 2007] Williams, J. and Young, S. (2007). Partially observable Markov decision processes for spoken dialogue systems. Computer Speech and Language, 21(2):393–422.
[Xu et al., 2002] Xu, W., Xu, B., Huang, T., and Xia, H. (2002). Bridging the gap between dialogue management and dialogue models. In Proceedings of the 3rd SIGdial Workshop on Discourse and Dialogue, pages 201–210, Philadelphia, PA, USA.