Lecture Notes in Artificial Intelligence Edited by R. Goebel, J. Siekmann, and W. Wahlster
Subseries of Lecture Notes in Computer Science
6392
Gary Geunbae Lee Joseph Mariani Wolfgang Minker Satoshi Nakamura (Eds.)
Spoken Dialogue Systems for Ambient Environments Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010 Gotemba, Shizuoka, Japan, October 1-2, 2010 Proceedings
Series Editors Randy Goebel, University of Alberta, Edmonton, Canada Jörg Siekmann, University of Saarland, Saarbrücken, Germany Wolfgang Wahlster, DFKI and University of Saarland, Saarbrücken, Germany Volume Editors Gary Geunbae Lee Pohang University of Science and Technology Department of Computer Science and Engineering San 31, Hyoja-dong, Nam-gu, Pohang, 790-784, South Korea E-mail:
[email protected] Joseph Mariani Centre National de la Recherche Scientifique Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur B.P. 133 91403 Orsay cedex, France E-mail:
[email protected] Wolfgang Minker University of Ulm, Institute of Information Technology Albert-Einstein-Allee 43, 89081 Ulm, Germany E-mail:
[email protected] Satoshi Nakamura National Institute of Information and Communications Technology 3-5 Hikaridai, Keihanna Science City, Kyoto, Japan E-mail:
[email protected]
Library of Congress Control Number: 2010935212
CR Subject Classification (1998): I.2, H.5, H.4, H.3, I.4, I.5
LNCS Sublibrary: SL 7 – Artificial Intelligence
ISSN 0302-9743
ISBN-10 3-642-16201-0 Springer Berlin Heidelberg New York
ISBN-13 978-3-642-16201-5 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. springer.com © Springer-Verlag Berlin Heidelberg 2010 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper 06/3180
Preface
It is our great pleasure to welcome you to the 2nd International Workshop on Spoken Dialogue Systems Technology (IWSDS), which was held as a satellite event of INTERSPEECH 2010 at Gotemba Kogen Resort in the Fuji area, Japan, October 1–2, 2010. The annual workshop brings together researchers from all over the world working in the field of spoken dialogue systems. It provides an international forum for the presentation of research and applications and for lively discussions among researchers as well as industrialists. Building on the success of IWSDS 2009 in Irsee, Germany, this year’s workshop designated “Spoken Dialogue Systems for Ambient Environments” as a special theme of discussion. We also encouraged discussions of common issues of spoken dialogue systems including but not limited to:
– Speech recognition and semantic analysis
– Dialogue management
– Adaptive dialogue modelling
– Recognition of emotions from speech, gestures, facial expressions and physiological data
– User modelling
– Planning and reasoning capabilities for coordination and conflict description
– Conflict resolution in complex multi-level decisions
– Multi-modality such as graphics, gesture and speech for input and output
– Fusion and information management
– Learning and adaptability
– Visual processing and recognition for advanced human-computer interaction
– Databases and corpora
– Evaluation strategies and paradigms
– Prototypes and products
The workshop program consisted of 22 regular papers and 2 invited keynote talks. This year, we were pleased to have two keynote speakers: Prof. Ramón López-Cózar, Universidad de Granada, Spain and Prof. Tetsunori Kobayashi, Waseda University, Japan. We would like to take this opportunity to thank the scientific committee members for their timely and efficient contributions and for completing the review process on time. In addition, we would like to express our sincere gratitude to the local organizing committee, especially to Dr. Teruhisa Misu, who contributed to the success of this workshop with careful consideration and timely and accurate action. Furthermore, we have to mention that this workshop would not have been possible without the support of the Korean Society of Speech Scientists and the National Institute of Information and Communications Technology.
Finally, we hope all the attendees benefited from the workshop and enjoyed their stay at the base of beautiful Mount Fuji. July 2010
Gary Geunbae Lee Joseph Mariani Wolfgang Minker Satoshi Nakamura
Organization
IWSDS 2010 was organized by the National Institute of Information and Communications Technology (NICT), in cooperation with Pohang University of Science and Technology; Centre National de la Recherche Scientifique, Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur; Dialogue Systems Group, Institute of Information Technology, Ulm University; and The Korean Society of Speech Sciences (KSSS).
Organizing Committee
Gary Geunbae Lee (Pohang University of Science and Technology, Korea)
Joseph Mariani (Centre National de la Recherche Scientifique, Laboratoire d’Informatique pour la Mécanique et les Sciences de l’Ingénieur, and Institute for Multilingual and Multimedia Information, France)
Wolfgang Minker (Dialogue Systems Group, Institute of Information Technology, Ulm University, Germany)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
Local Committee
Hisashi Kawai (National Institute of Information and Communications Technology, Japan)
Hideki Kashioka (National Institute of Information and Communications Technology, Japan)
Chiori Hori (National Institute of Information and Communications Technology, Japan)
Kiyonori Ohtake (National Institute of Information and Communications Technology, Japan)
Sakriani Sakti (National Institute of Information and Communications Technology, Japan)
Teruhisa Misu (National Institute of Information and Communications Technology, Japan)
Referees
Jan Alexandersson, Germany
Masahiro Araki, Japan
André Berton, Germany
Sadaoki Furui, Japan
Rainer Gruhn, Germany
Joakim Gustafson, Sweden
Paul Heisterkamp, Germany
David House, Sweden
Kristiina Jokinen, Finland
Tatsuya Kawahara, Japan
Hong Kook Kim, Korea
Lin-Shan Lee, Taiwan
Li Haizhou, Singapore
Ramón López-Cózar Delgado, Spain
Mike McTear, UK
Mikio Nakano, Japan
Elmar Nöth, Germany
Norbert Reithinger, Germany
Laurent Romary, France
Gabriel Skantze, Sweden
Kazuya Takeda, Japan
Hsin-min Wang, Taiwan
Wayne Ward, USA
Table of Contents
Long Papers

Impact of a Newly Developed Modern Standard Arabic Speech Corpus on Implementing and Evaluating Automatic Continuous Speech Recognition Systems . . . . . 1
  Mohammad A.M. Abushariah, Raja N. Ainon, Roziati Zainuddin, Bassam A. Al-Qatab, and Assal A.M. Alqudah

User and Noise Adaptive Dialogue Management Using Hybrid System Actions . . . . . 13
  Senthilkumar Chandramohan and Olivier Pietquin

Detection of Unknown Speakers in an Unsupervised Speech Controlled System . . . . . 25
  Tobias Herbig, Franz Gerl, and Wolfgang Minker

Evaluation of Two Approaches for Speaker Specific Speech Recognition . . . . . 36
  Tobias Herbig, Franz Gerl, and Wolfgang Minker

Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models . . . . . 48
  Ryuichiro Higashinaka, Yasuhiro Minami, Kohji Dohsaka, and Toyomi Meguro

Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses . . . . . 61
  Naoto Kimura, Chiori Hori, Teruhisa Misu, Kiyonori Ohtake, Hisashi Kawai, and Satoshi Nakamura

Evaluation of Facial Direction Estimation from Cameras for Multi-modal Spoken Dialog System . . . . . 73
  Akihiro Kobayashi, Kentaro Kayama, Etsuo Mizukami, Teruhisa Misu, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

D3 Toolkit: A Development Toolkit for Daydreaming Spoken Dialog Systems . . . . . 85
  Donghyeon Lee, Kyungduk Kim, Cheongjae Lee, Junhwi Choi, and Gary Geunbae Lee

New Technique to Enhance the Performance of Spoken Dialogue Systems by Means of Implicit Recovery of ASR Errors . . . . . 96
  Ramón López-Cózar, David Griol, and José F. Quesada
Simulation of the Grounding Process in Spoken Dialog Systems with Bayesian Networks . . . . . 110
  Stéphane Rossignol, Olivier Pietquin, and Michel Ianotto

Facing Reality: Simulating Deployment of Anger Recognition in IVR Systems . . . . . 122
  Alexander Schmitt, Tim Polzehl, and Wolfgang Minker

A Discourse and Dialogue Infrastructure for Industrial Dissemination . . . . . 132
  Daniel Sonntag, Norbert Reithinger, Gerd Herzog, and Tilman Becker

Short Papers

Impact of Semantic Web on the Development of Spoken Dialogue Systems . . . . . 144
  Masahiro Araki and Yu Funakura

A User Model to Predict User Satisfaction with Spoken Dialog Systems . . . . . 150
  Klaus-Peter Engelbrecht and Sebastian Möller

Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach . . . . . 156
  Hansjörg Hofmann, Sakriani Sakti, Ryosuke Isotani, Hisashi Kawai, Satoshi Nakamura, and Wolfgang Minker

Rational Communication and Affordable Natural Language Interaction for Ambient Environments . . . . . 163
  Kristiina Jokinen

Construction and Experiment of a Spoken Consulting Dialogue System . . . . . 169
  Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

A Study Toward an Evaluation Method for Spoken Dialogue Systems Considering User Criteria . . . . . 176
  Etsuo Mizukami, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

A Classifier-Based Approach to Supporting the Augmentation of the Question-Answer Database for Spoken Dialogue Systems . . . . . 182
  Hiromi Narimatsu, Mikio Nakano, and Kotaro Funakoshi

The Influence of the Usage Mode on Subjectively Perceived Quality . . . . . 188
  Ina Wechsung, Anja Naumann, and Sebastian Möller
Demo Papers

Sightseeing Guidance Systems Based on WFST-Based Dialogue Manager . . . . . 194
  Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Etsuo Mizukami, Akihiro Kobayashi, Kentaro Kayama, Tetsuya Fujii, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

Spoken Dialogue System Based on Information Extraction from Web Text . . . . . 196
  Koichiro Yoshino and Tatsuya Kawahara

Author Index . . . . . 199
Impact of a Newly Developed Modern Standard Arabic Speech Corpus on Implementing and Evaluating Automatic Continuous Speech Recognition Systems

Mohammad A.M. Abushariah¹,², Raja N. Ainon¹, Roziati Zainuddin¹, Bassam A. Al-Qatab¹, and Assal A.M. Alqudah¹

¹ Faculty of Computer Science and Information Technology, University of Malaya, 50603, Kuala Lumpur, Malaysia
² Department of Computer Information Systems, King Abdullah II School for Information Technology, University of Jordan, 11942, Amman, Jordan
[email protected], [email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. Being the current formal linguistic standard and the only acceptable form of the Arabic language for all native speakers, Modern Standard Arabic (MSA) still lacks sufficient spoken corpora compared to other forms such as Dialectal Arabic. This paper describes our work towards developing a new speech corpus for MSA, which can be used for implementing and evaluating any Arabic automatic continuous speech recognition system. The speech corpus contains 415 (367 training and 48 testing) sentences recorded by 42 (21 male and 21 female) Arabic native speakers from 11 countries representing three major regions (Levant, Gulf, and Africa). The impact of using this speech corpus on the overall performance of Arabic automatic continuous speech recognition systems was examined. Two development phases were conducted based on the size of the training data, the number of Gaussian mixture distributions, and the number of tied states (senones). Overall results indicate that a larger training data size results in higher word recognition rates and lower Word Error Rates (WER).

Keywords: Modern Standard Arabic, text corpus, speech corpus, phonetically rich, phonetically balanced, automatic continuous speech recognition.
1 Introduction

Arabic is the largest still-living Semitic language and one of the six official languages of the United Nations (UN). It is the official language in 21 countries situated in the Levant, the Gulf, and Africa. Arabic is ranked fourth after Mandarin, Spanish and English in terms of the number of first-language speakers. According to [1], Standard Arabic and Dialectal Arabic are the two major forms of the Arabic language. The Standard Arabic form includes both Classical Arabic and Modern Standard Arabic (MSA).
Dialectal Arabic varies from one country to another and includes the daily spoken Arabic. This form deviates from Standard Arabic, and sometimes more than one dialect can be found within a single country [1]. Being the most formal and standard form of Arabic, Classical Arabic can be found in The Holy Qur’an scripts. These scripts carry full diacritical marks, so Arabic phonetics are completely represented [1]. Modern Standard Arabic (MSA) is the current formal linguistic standard of the Arabic language; it is widely taught in schools and universities, and used in offices and the media. Although almost all written Arabic resources use MSA, diacritical marks are mostly omitted and readers must infer the missing marks from context. MSA contains 34 phonemes (28 consonants and 6 vowels). Any Arabic utterance or word must start with a consonant. Arabic vowels are classified into 3 short and 3 long vowels, where long vowels last approximately double the duration of short vowels [1, 2]. Since MSA is the only acceptable form of the Arabic language for all native speakers [1], it has become the main focus of current Arabic Automatic Speech Recognition (ASR) research, whereas earlier Arabic ASR research was directed towards dialectal and colloquial Arabic serving a specific cluster of Arabic native speakers [3]. Section 2 emphasizes the need for a Modern Standard Arabic (MSA) speech corpus. The speech corpus description and analysis are presented in Section 3. Section 4 presents the implementation requirements and components for the development of the Arabic automatic continuous speech recognition system. The speech corpus testing and evaluation for Arabic ASR systems is presented in Section 5. Section 6 analyzes the experimental results. We finally present the conclusions in Section 7.
2 The Need for Modern Standard Arabic (MSA) Speech Corpus

The lack of spoken and written training data is one of the main issues encountered by Arabic ASR researchers. A list of the most popular corpora (from 1986 through 2005) is provided in [4], showing only 19 corpora (14 written, 2 spoken, 1 written and spoken, and 2 conversational). A survey on industrial needs for Arabic language resources was conducted among 20 companies situated in Lebanon, Palestine, Egypt, France, and the US [5]. Responses highlighted the need for read, prepared, prompted, elicited, and spontaneous Arabic spoken data. In most cases, responding companies did not show much interest in telephone and broadcast news spoken data. According to [5], responding companies commented that available resources are too expensive and do not meet standard quality requirements. They also lack adaptability, reusability, quality, coverage, and adequate information types. In a complementary survey [6], a total of 55 responses were received (36 institutions and 19 individual experts) representing 15 countries located in North Africa, the Near and Middle East, Europe, and North America. Respondents insisted on the need for Arabic language resources for both Modern Standard Arabic (MSA) and Colloquial Arabic speech corpora. Over 100 language resources (25 speech corpora, 45 lexicons and dictionaries, 29 text corpora, and 1 multimodal corpus) were identified [6].
Based on our literature investigation, our research work provides Arabic language resources that meet academic and industrial expectations and recommendations. The Modern Standard Arabic (MSA) speech corpus was developed in order to provide a state-of-the-art spoken corpus that bridges the gap between currently available Arabic spoken resources and the research community’s expectations and recommendations. The following motivational factors and speech corpus characteristics were considered for developing our spoken corpus:
1. Modern Standard Arabic (MSA) is the only acceptable form of the Arabic language for all native speakers and is in high demand for Arabic language research; therefore, our speech corpus is based on the MSA form.
2. The newly developed Arabic speech corpus was prepared in a high-quality, specialized noise-proof studio, which suits a wide horizon of systems, especially for the office environment, as recommended by [6].
3. The speech corpus was designed in a way that would serve any Arabic ASR system regardless of its domain. It focused on covering the Arabic phonemes as much as possible using the fewest possible Arabic words and sentences, based on the phonetically rich and balanced speech corpus approach.
4. The opportunity to explore differences in speech patterns between Arabic native speakers from 11 different countries representing the three major regions (Levant, Gulf, and Africa).
5. The need for read and prepared Arabic spoken data, as highlighted in [5], was also considered. Companies did not show interest in Arabic telephone and broadcast news spoken data. Therefore, this Arabic speech corpus is neither a telephone- nor a broadcast-news-based spoken data set; it is prepared and read Arabic spoken data.
3 Speech Corpus Description and Analysis

A speech corpus is an essential requirement for developing any ASR system. The developed corpus contains 415 sentences in Modern Standard Arabic (MSA). 367 written phonetically rich and balanced sentences were developed in [7], and were recorded and used for training the acoustic model. For testing the acoustic model, 48 additional sentences representing Arabic proverbs were created by an Arabic language specialist. The speech corpus was recorded by 42 (21 male and 21 female) Arabic native speakers from 11 different Arab countries representing three major regions (Levant, Gulf, and Africa). Since this speech corpus contains training and testing written and spoken data from a variety of speakers who represent different genders, age categories, nationalities, and professions, and is also based on phonetically rich and balanced sentences, it is expected to be usable for the development of many MSA speech- and text-based applications, such as speaker-independent ASR, text-to-speech (TTS) synthesis, speaker recognition, and others. The motivation behind the creation of our phonetically rich and balanced speech corpus was to provide large amounts of high quality recordings of Modern Standard
Arabic (MSA) suitable for the design and development of any speaker-independent continuous automatic Arabic speech recognition system. The phonetically rich and balanced Arabic speech corpus was initiated in March 2009. Although participants volunteered based on their interest in joining this work, speakers were indirectly selected based on agreed-upon characteristics. Participants were selected so that they:
• Have a fair distribution of gender and age.
• Have different current professions.
• Have a variety of educational backgrounds, with a minimum of a high school certificate. This is important to secure an efficient reading ability of the participants.
• Belong to a variety of native Arabic-speaking countries.
• Belong to any of the three major regions where Arabic native speakers mostly live (Levant, Gulf, and Africa). This is important to produce a comprehensive speech corpus that can be used by the whole Arabic language research community.
As a result, 51 (23 male and 28 female) participants were selected and asked to record the prepared text corpus. Recordings of 3 participants were incomplete; 2 participants were from Eritrea living in Saudi Arabia and were therefore non-native speakers. In addition, 2 participants had a resonance or voice disorder, whereby the quality of their voice was poor and it was difficult to get a single correct recording. Finally, 2 other participants had an articulation disorder, whereby some sounds were not pronounced clearly or were even replaced in some cases with another sound. Therefore, recordings of 9 participants were excluded. Speech recordings of 42 participants were finally shortlisted in order to form our speech corpus, as shown in Table 1. Shortlisted participants belong to two major age groups, as shown in Table 2.

Table 1. Shortlisted participants

Region   Country        Male   Female   Total   Total/Region
Levant   Jordan            8        4      12
         Palestine         –        –       2
         Syria             –        –       1        15
Gulf     Iraq              –        –       4
         Saudi Arabia      –        –       3
         Yemen             –        –       3
         Oman              –        –       1        11
Africa   Sudan             4        3       7
         Algeria           3        3       6
         Egypt             –        –       2
         Morocco           –        –       1        16
Total                     21       21      42        42
Total (%)                50%      50%    100%      100%
Table 2. Participants’ age and gender distribution

No.   Age Category           Male   Female   Total
1     Less than 30 years        7       14      21
2     30 years and above       14        7      21
      Total                    21       21      42
Recording sessions were conducted in a sound-attenuated studio. Sound Forge 8 software was installed and used for making the recordings. Default recording attributes were initially used, as shown in Table 3.

Table 3. Initial recording attributes

Recording Attribute    Value
Sampling Rate (Hz)     44100 Hz
Bit-Depth              16 bits
Channels               2 channels (Stereo)
These recording attributes were then converted at a later stage to those used for developing speech recognition applications, as shown in Table 4.

Table 4. Converted recording attributes

Recording Attribute    Value
Sampling Rate (Hz)     16000 Hz
Bit-Depth              16 bits
Channels               1 channel (Mono)
In order to use our phonetically rich and balanced speech corpus for training and testing any Arabic ASR system, a number of Matlab programs were developed to produce a ready-to-use speech corpus. These Matlab programs were developed for the purposes of 1) automatic Arabic speech segmentation, 2) parameter conversion of the speech data, 3) directory structure and sound filename conventions, and 4) automatic generation of the training and testing transcription files. A manual classification and validation of the correct speech data was conducted, requiring great human effort. This process was crucial in order to ensure and validate the pronunciation correctness of the speech data before using it to train the system’s acoustic model.
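As an illustration of the parameter-conversion step (Tables 3 and 4), a minimal Python sketch is shown below; the original work used Matlab programs, and the file names here are hypothetical placeholders.

```python
# Sketch of the parameter conversion described above (the original used
# Matlab): a 44.1 kHz stereo recording is converted to 16 kHz mono.
# File names are hypothetical placeholders.
import soundfile as sf
from scipy.signal import resample_poly

data, rate = sf.read("speaker01_sent001_44k_stereo.wav")  # rate == 44100
mono = data.mean(axis=1) if data.ndim == 2 else data      # stereo -> mono
# 16000/44100 reduces to 160/441, so resample with these integer factors
converted = resample_poly(mono, up=160, down=441)
sf.write("speaker01_sent001_16k_mono.wav", converted, 16000, subtype="PCM_16")
```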
4 Arabic Automatic Continuous Speech Recognition System

This section describes the major implementation requirements and components for developing the Arabic automatic speech recognition system. They are shown in Fig. 1, which complies with the generic architecture of the Carnegie Mellon University (CMU) Sphinx engine. A brief description of each component is given in the following sub-sections.
Fig. 1. Components of Arabic automatic continuous speech recognition system
4.1 Feature Extraction

Feature extraction, also referred to as the front-end component, is the initial stage of any ASR system; it converts speech inputs into feature vectors to be used for training and testing the speech recognizer. The dominant feature extraction technique, known as Mel-Frequency Cepstral Coefficients (MFCC), was applied to extract features from the set of spoken utterances. A feature vector represents the unique characteristics of each recorded utterance and is the input to the classification component.

4.2 Arabic Phonetic Dictionary

The phoneme pronunciation dictionary serves as an intermediary link between the acoustic model and the language model in all speech recognition systems. A rule-based approach was used to automatically generate a phonetic dictionary for a given transcription. A detailed description of the development of this Arabic phonetic dictionary can be found in [8]. Arabic pronunciation follows certain rules and patterns when the text is fully diacritized. A detailed description of these rules and patterns can be found in [9]. In this work, the transcription file contains 2,110 words and the vocabulary list contains 1,626 unique words. The number of pronunciations in the developed phonetic dictionary is 2,482 entries. Fig. 2 shows a sample of the generated phonetic dictionary.
ﺁﻟَﺎ ُمE AE: L AE: M UH ﻦ ٍ ﺁ ِﻣE AE: M IH N IH N ت ُ ﺁﻳَﺎE AE: Y AE: T UH َأ َﺑ َﺪE AE B AE D AE َأﺑِﻲE AE B IY ﺠَﻠﻨِﻲ َ ْ َأﺑE AE B JH AE L AE N IY ﻄَﺄ َ ْ َأﺑE AE B TT AH E AE ﺞ ُ َأﺑَْﻠE AE B L AE JH UH Fig. 2. Sample of the rule-based phonetic dictionary
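As a rough illustration of this rule-based generation (the full rule set is given in [8, 9]), the toy Python sketch below maps a few diacritized Arabic characters to phone symbols; the character-to-phone mapping is a hypothetical subset only, and the contextual rules the real generator applies are omitted.

```python
# Toy sketch of rule-based Arabic grapheme-to-phoneme transcription for
# fully diacritized text. The mapping is an illustrative subset only;
# the real generator [8, 9] handles long vowels, gemination and other
# contextual rules omitted here.
LETTER_TO_PHONE = {
    "\u0628": "B",    # baa
    "\u062A": "T",    # taa
    "\u0644": "L",    # laam
    "\u0645": "M",    # miim
    "\u064E": "AE",   # fatha (short vowel /a/)
    "\u0650": "IH",   # kasra (short vowel /i/)
    "\u064F": "UH",   # damma (short vowel /u/)
}

def transcribe(word: str) -> str:
    """Map each character of a diacritized word to a phone symbol;
    characters outside the toy mapping are skipped in this sketch."""
    return " ".join(LETTER_TO_PHONE[c] for c in word if c in LETTER_TO_PHONE)
```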
4.3 Acoustic Model Training

The acoustic model component provides the Hidden Markov Models (HMMs) of the Arabic tri-phones used to recognize speech. The basic HMM structure, known as the Bakis model, has a fixed topology consisting of five states, with three emitting states, for tri-phone acoustic modeling. In order to build a better acoustic model, CMU Sphinx 3 uses tri-phone based acoustic modeling. The Continuous Hidden Markov Model (CHMM) technique is also supported in CMU Sphinx 3 for parametrizing the probability distributions of the state emission probabilities. There are two development phases for the acoustic model training. The first phase is based on 4.07 hours of training data, whereas the second phase is based on 8 hours of training data.

4.3.1 Acoustic Model Training Based on 4.07 Hours
During our first development phase, speech recordings of 8 speakers (4 males and 4 females) were manually segmented. Each speaker recorded both training and testing sentences, whereby the training sentences are used to train the acoustic model and the testing sentences are used to test the performance of the speech recognizer. Out of the 8 speakers, only 5 (3 males and 2 females) are used to train the acoustic model in this phase, and the other 3 speakers are mainly used to test the performance. A total of 3604 utterances (4.07 hours) are used to train the acoustic model. The acoustic model is trained using a continuous state probability density of 16 Gaussian mixture distributions. However, the state distributions were tied to different numbers of senones, ranging from 350 to 2500. The different results obtained are shown in Section 5.

4.3.2 Acoustic Model Training Based on 8 Hours
During our second development phase, a small portion of the entire speech corpus is used in the experiments. A total of 8,043 utterances are used, amounting to about 8 hours of speech data collected from 8 (5 male and 3 female) Arabic native speakers from 6 different Arab countries, namely Jordan, Palestine, Egypt, Sudan, Algeria, and Morocco. In order to provide a fair testing and evaluation of the Arabic ASR performance, a round-robin testing approach was applied, where in every round the speech data of 7 out of 8 speakers are used for training and the speech data of the 8th speaker are used for testing, as sketched below. This is also important to show how speaker-independent the system is.
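A minimal sketch of this round-robin (leave-one-speaker-out) split follows; the speaker identifiers and the training/evaluation calls are hypothetical stand-ins for the toolkit-specific commands.

```python
# Sketch of the round-robin evaluation: in each round one speaker is held
# out for testing and the remaining seven are used for training.
# Speaker identifiers are hypothetical placeholders.
speakers = [f"spk{i:02d}" for i in range(1, 9)]  # the 8 speakers (5 male, 3 female)

for held_out in speakers:
    train_speakers = [s for s in speakers if s != held_out]
    # train_acoustic_model(train_speakers) and evaluate(held_out) stand in
    # for the toolkit-specific training and decoding steps
    print(f"train on {train_speakers}, test on {held_out}")
```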
Acoustic model training was divided into two stages. During the first stage, one of the eight training data sets was used in order to identify the best combination of Gaussian mixture distributions and number of senones. The acoustic model is trained using continuous state probability densities ranging from 2 to 64 Gaussian mixture distributions. In addition, the state distributions were tied to different numbers of senones ranging from 350 to 2500. A total of 54 experiments were run at this stage, producing the results shown in Section 5. During the second stage, the best combination of Gaussian mixture distributions and number of senones was used to train the other seven of the eight training data sets.

4.4 Language Model Training

The language model component provides the grammar used in the system. The grammar’s complexity depends on the system to be developed. In this work, the language model is built statistically using the CMU-Cambridge Statistical Language Modeling toolkit, which models the uni-grams, bi-grams, and tri-grams of the language for the subject text to be recognized. Creation of a language model consists of computing the word uni-gram counts, which are then converted into a task vocabulary with word frequencies, generating the bi-grams and tri-grams from the training text based on this vocabulary, and finally converting the n-grams into a binary format language model and the standard ARPA format. For both development phases, the number of uni-grams is 1,627, whereas the numbers of bi-grams and tri-grams are 2,083 and 2,085 respectively.
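The counting step that underlies this process can be sketched as follows; discounting, smoothing and ARPA-format export, which the CMU-Cambridge toolkit handles, are omitted, and the sample corpus is a transliterated placeholder.

```python
# Sketch of the uni-/bi-/tri-gram counting that underlies the statistical
# language model; discounting and ARPA-format export (handled by the
# CMU-Cambridge SLM toolkit) are omitted.
from collections import Counter

def ngram_counts(sentences):
    uni, bi, tri = Counter(), Counter(), Counter()
    for sentence in sentences:
        words = ["<s>"] + sentence.split() + ["</s>"]
        uni.update(words)
        bi.update(zip(words, words[1:]))
        tri.update(zip(words, words[1:], words[2:]))
    return uni, bi, tri

uni, bi, tri = ngram_counts(["kam thamanu alwajbati", "ayna yaqaAu almatAamu"])
print(len(uni), len(bi), len(tri))
```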
5 Systems’ Testing and Evaluation

This section presents the testing and evaluation of the two development phases of the Arabic automatic continuous speech recognition system.

5.1 First Development Phase Based on 4.07 Hours

The testing and evaluation was done based on 3 different testing data sets: 1) 444 sound files (same speakers, but different sentences), 2) 84 sound files (different speakers, but same sentences), and 3) 130 sound files (different speakers, and different sentences). Results are shown in Tables 5, 6, and 7 respectively. Table 8 compares the system’s performance based on diacritical marks.

Table 5. System’s performance for testing data set 1

Version       Densities   Senones   Word Recognition Rate (%)
Experiment1          16      1000                       87.26
Experiment2          16      1500                       81.10
Experiment3          16      2500                       72.05
Experiment4          16       500                       90.39
Experiment5          16       350                       90.91
Experiment6          16       400                       91.23
Experiment7          16       450                       90.43
Table 6. System’s performance for testing data set 2

Version       Densities   Senones   Word Recognition Rate (%)
Experiment8          16       400                       89.42
Table 7. System’s performance for testing data set 3

Version       Densities   Senones   Word Recognition Rate (%)
Experiment9          16       400                       80.83
Table 8. Effect of diacritical marks on the overall system’s performance

Testing Sets   With Diacritical Marks   Without Diacritical Marks
Set 1                           91.23                       92.54
Set 2                           89.42                       90.81
Set 3                           80.83                       80.83
5.2 Second Development Phase Based on 8 Hours

There are 8 different data sets used to train and test the system’s performance based on 8 hours, as shown in Table 9.

Table 9. Training and testing data sets for the 8 hours speech corpus

                           Testing Data
Experiment   Training   Same Speakers,        Different Speakers,   Different Speakers,     Total      Ratio of
ID           Data       Different Sentences   Same Sentences        Different Sentences     Testing    Testing Data (%)
Exp.1            6379                   906                   678                    80        1664       20.69
Exp.2            6288                   871                   769                   115        1755       21.82
Exp.3            5569                   755                  1488                   231        2474       30.76
Exp.4            6308                   888                   749                    98        1735       21.57
Exp.5            6296                   889                   761                    97        1747       21.72
Exp.6            6331                   891                   726                    95        1712       21.29
Exp.7            6219                   861                   838                   125        1824       22.68
Exp.8            6009                   841                  1048                   145        2034       25.29
During the first stage of training the acoustic model, the first data set (Exp.1) was used to identify the best combination of Gaussian mixture distributions and number of senones. It was found that 16 Gaussians with 500 senones obtained the best word recognition rate of 93.24%, as shown in Fig. 3. Therefore, this combination was used for training the acoustic model on the Exp.2 through Exp.8 data sets.
Fig. 3. Word recognition rate (%) in reference to number of senones and Gaussians
Tables 10 and 11 show the word recognition rates (%) and the Word Error Rates (WER) with and without diacritical marks respectively.

Table 10. Overall system’s performance with full diacritical marks

                   Same Speakers BUT       Different Speakers BUT   Different Speakers AND
                   Different Sentences     Same Sentences           Different Sentences
Experiment ID      Rec. Rate   WER (%)     Rec. Rate   WER (%)      Rec. Rate   WER (%)
Exp.1                  93.24     10.73         94.98      6.28          90.11     13.48
Exp.2                  91.80     11.96         93.30     10.62          83.00     27.87
Exp.3                  93.07     10.53         97.22      3.66          89.81     14.94
Exp.4                  92.72     11.42         96.89      4.16          91.44     11.76
Exp.5                  93.43     10.09         94.92      7.13          89.49     14.86
Exp.6                  92.61     11.56         95.55      7.37          90.64     14.23
Exp.7                  92.65     11.15         96.37      4.51          88.15     14.25
Exp.8                  91.85     12.75         98.10      2.51          89.99     13.31
Average Results        92.67     11.27         95.92      5.78          89.08     15.59
6 Experimental Results Analysis

During the first development phase, based on 4.07 hours, it is noticed that as the number of senones increases, the recognition rate declines. The combination of 16 Gaussian mixtures and 400 senones is the best for this corpus size, achieving a 91.23% word recognition rate and a 14.37% Word Error Rate (WER) for set 1. This result improved when tested without diacritical marks, achieving 92.54% and a 13.06% WER.
Table 11. Overall system’s performance without diacritical marks
                   Same Speakers BUT       Different Speakers BUT   Different Speakers AND
                   Different Sentences     Same Sentences           Different Sentences
Experiment ID      Rec. Rate   WER (%)     Rec. Rate   WER (%)      Rec. Rate   WER (%)
Exp.1                  94.41      9.57         95.22      6.04          90.79     12.81
Exp.2                  93.02     10.74         93.95     10.33          84.38     26.49
Exp.3                  94.29      9.31         97.42      3.46          90.88     13.87
Exp.4                  93.86     10.29         97.33      3.73          92.87     10.34
Exp.5                  94.57      8.95         95.32      6.73          90.76     13.59
Exp.6                  93.75     10.41         95.91      7.00          91.39     13.48
Exp.7                  94.06      9.74         96.68      4.20          89.42     12.98
Exp.8                  93.04     11.56         98.50      2.11          91.33     11.97
Average Results        93.88     10.07         96.29      5.45          90.23     14.44
On the other hand, during the second development phase, based on 8 hours, the best combination was 16 Gaussian mixture distributions with 500 senones, obtaining 93.43% and 94.57% word recognition accuracy with and without diacritical marks respectively. The best number of senones therefore increases with the amount of training speech data, and it is expected to increase further when our speech corpus is fully utilized. Speaker independence is clearly realized in this work, as testing was conducted to assess this aspect. For different speakers but identical sentences, the system obtained word recognition accuracies of 95.92% and 96.29% and Word Error Rates (WER) of 5.78% and 5.45% with and without diacritical marks respectively. For different speakers and different sentences, the system obtained word recognition accuracies of 89.08% and 90.23% and WERs of 15.59% and 14.44% with and without diacritical marks respectively. It is noticed that the developed systems perform better without diacritical marks than the same systems with diacritical marks. Therefore, the issue of diacritics needs to be addressed in future developments. Further parameter tuning is needed in order to reduce the WER; this includes the language model weights, the beam width, and the word insertion penalty (wip).
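The word recognition rates and WERs reported above follow the standard Levenshtein alignment between reference and hypothesis transcriptions; a minimal Python sketch of this computation (assuming a non-empty reference):

```python
# Sketch of the standard WER computation (Levenshtein distance over words):
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)
```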
7 Conclusions

This paper reports our work towards building a phonetically rich and balanced Modern Standard Arabic (MSA) speech corpus, which is necessary for developing a high performance Arabic speaker-independent automatic continuous speech recognition system. This work includes creating the phonetically rich and balanced speech corpus with full diacritical-mark transcriptions from speakers with a wide variety of attributes, and carrying out all preparation and pre-processing steps in order to produce ready-to-use speech data for further training and testing purposes. This speech corpus can be used for any Arabic speech based application, including speaker recognition and text-to-speech synthesis, covering different research needs.
The obtained results are comparable to those for other languages with the same vocabulary size. This work adds a new kind of speech data for Modern Standard Arabic (MSA) based text and speech applications besides other kinds such as broadcast news and telephone conversations. Therefore, this work is an invitation to all Arabic ASR developers and research groups to utilize and capitalize on this corpus.
References
1. Elmahdy, M., Gruhn, R., Minker, W., Abdennadher, S.: Survey on common Arabic language forms from a speech recognition point of view. In: International Conference on Acoustics (NAG-DAGA), Rotterdam, Netherlands, pp. 63–66 (2009)
2. Alotaibi, Y.A.: Comparative Study of ANN and HMM to Arabic Digits Recognition Systems. Journal of King Abdulaziz University: Engineering Sciences 19(1), 43–59 (2008)
3. Kirchhoff, K., Bilmes, J., Das, S., Duta, N., Egan, M., Ji, G., He, F., Henderson, J., Liu, D., Noamany, M., Schone, P., Schwartz, R., Vergyri, D.: Novel approaches to Arabic speech recognition. In: Report from the 2002 Johns-Hopkins Summer Workshop, ICASSP 2003, Hong Kong, vol. 1, pp. 344–347 (2003)
4. Al-Sulaiti, L., Atwell, E.: The design of a corpus of Contemporary Arabic. International Journal of Corpus Linguistics, John Benjamins Publishing Company, 1–36 (2006)
5. Nikkhou, M., Choukri, K.: Survey on Industrial Needs for Language Resources. Technical Report, NEMLAR – Network for Euro-Mediterranean Language Resources (2004)
6. Nikkhou, M., Choukri, K.: Survey on Arabic Language Resources and Tools in the Mediterranean Countries. Technical Report, NEMLAR – Network for Euro-Mediterranean Language Resources (2005)
7. Alghamdi, M., Alhamid, A.H., Aldasuqi, M.M.: Database of Arabic Sounds: Sentences. Technical Report, King Abdulaziz City of Science and Technology, Saudi Arabia, in Arabic (2003)
8. Ali, M., Elshafei, M., Alghamdi, M., Almuhtaseb, H., Al-Najjar, A.: Generation of Arabic Phonetic Dictionaries for Speech Recognition. In: IEEE Proceedings of the International Conference on Innovations in Information Technology, UAE, pp. 59–63 (2008)
9. Elshafei, A.M.: Toward an Arabic Text-to-Speech System. The Arabian Journal of Science and Engineering 16(4B), 565–583 (1991)
User and Noise Adaptive Dialogue Management Using Hybrid System Actions Senthilkumar Chandramohan and Olivier Pietquin SUPELEC - IMS Research Group, Metz - France {senthilkumar.chandramohan,olivier.pietquin}@supelec.fr
Abstract. In recent years reinforcement-learning-based approaches have been widely used for policy optimization in spoken dialogue systems (SDS). A dialogue management policy is a mapping from dialogue states to system actions, i.e., given the state of the dialogue, the dialogue policy determines the next action to be performed by the dialogue manager. So far, policy optimization has primarily focused on mapping the dialogue state to simple system actions (such as confirming or asking one piece of information), and the possibility of using complex system actions (such as confirming or asking several slots at the same time) has not been well investigated. In this paper we explore the possibilities of using complex (or hybrid) system actions for dialogue management and then discuss the impact of user experience and channel noise on complex action selection. Our experimental results obtained using simulated users reveal that user and noise adaptive hybrid action selection can perform better than dialogue policies which can only perform simple actions.
1 Introduction

Spoken Dialogue Systems (SDS) are systems which have the ability to interact with human beings using speech as the medium of interaction. The dialogue policy plays a crucial role in dialogue management: it tells the dialogue manager what action to perform next given the state of the dialogue. Thus building an optimal dialogue management policy is an important step when developing any spoken dialogue system. Using a hand-coded dialogue policy is one of the simplest ways to build a dialogue system, but as the complexity of the dialogue task grows it becomes increasingly difficult to code a dialogue policy manually. Over the years, various statistical approaches such as [9, 3, 21] have been proposed for dialogue management problems with reasonably large state spaces. Most of the literature on spoken dialogue systems (policy optimization) focuses on the optimal selection of elementary dialogue acts at each dialogue turn. In this paper, we investigate the possibility of learning to combine these simple dialogue acts into complex actions to obtain more efficient dialogue policies. Since complex system acts combine several system acts together, they can lead to shorter dialogue episodes. Also, by using complex system acts, system designers can introduce some flexibility into the human-computer interaction by allowing users with prior knowledge about the system to furnish and receive as much information as they wish in one user/system act. The use of complex
system actions for dialogue management has been studied only to a limited extent; work related to the use of open-ended questions is reported in [11]. The primary focus of this contribution is to learn a hybrid action policy which can choose to perform simple system acts as well as more complex and flexible system acts. The challenge in learning such a hybrid policy is the unavailability of dialogue corpora with which to explore complex system acts. Secondly, the impact of noise and user experience on complex system acts is analyzed, and means to learn a noise and user adaptive dialogue policy are discussed. This paper is organized as follows: In Section 2 a formal description of Markov Decision Processes (MDP) is presented, after which casting and solving the dialogue problem in the framework of an MDP is discussed. In Section 3 complex system actions are formally defined and the impact of channel noise and user experience is discussed. Section 4 outlines how channel noise can be simulated using user simulation and how a noise adaptive hybrid action policy can be learned. Section 5 describes how a user-adaptive hybrid-action policy can be learned. Section 6 outlines our evaluation set-up and analyzes the performance of the different policies learned. Eventually Section 7 concludes.
2 MDP for Dialogue Management

The MDP [1] framework comes from the optimal control community. It is originally used to describe and solve sequential decision making problems in stochastic dynamic environments. An MDP is formally a tuple $\{S, A, P, R, \gamma\}$ where $S$ is the (finite) state space, $A$ the (finite) action space, $P \in \mathcal{P}(S)^{S \times A}$ the family of Markovian transition probabilities¹, $R \in \mathbb{R}^{S \times A \times S}$ the reward function and $\gamma$ the discounting factor ($0 \leq \gamma \leq 1$). According to this formalism, during the interaction with a controlling agent, an environment steps from state to state ($s \in S$) according to the transition probabilities $P$ as a consequence of the controller’s actions ($a \in A$). After each transition, the system produces an immediate reward ($r$) according to its reward function $R$. A so-called policy $\pi \in A^S$ mapping states to actions models the way the agent controls its environment. The quality of a policy is quantified by the so-called value function $V^\pi(s)$, which maps each state to the expected discounted cumulative reward given that the agent starts in this state and follows the policy $\pi$:

$$V^\pi(s) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, \pi\right] \qquad (1)$$

An optimal policy $\pi^*$ maximizes this function for each state:

$$\pi^* = \underset{\pi}{\operatorname{argmax}}\, V^\pi \qquad (2)$$

Suppose that we are given the optimal value function $V^*$ (that is, the value function associated with an optimal policy); deriving the associated policy would require knowing the transition probabilities $P$. Yet, these are usually unknown, and the optimal control policy has to be learned from interactions only. This is why the state-action value (or Q-) function is introduced. It adds a degree of freedom on the choice of the first action:

$$Q^\pi(s, a) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a, \pi\right] \qquad (3)$$

$$Q^*(s, a) = E\left[\sum_{i=0}^{\infty} \gamma^i r_i \mid s_0 = s, a_0 = a, \pi^*\right] \qquad (4)$$

¹ Notation $f \in A^B$ is equivalent to $f: B \to A$.
where $Q^*(s, a)$ is the optimal state-action value function. An action-selection strategy that is greedy with respect to this function ($\pi(s) = \operatorname{argmax}_a Q^*(s, a)$) provides an optimal policy. There are many algorithms that solve this optimization problem. When this optimization is done without any information about the transition probabilities and the reward function, and only transitions and immediate rewards are observed, the solving algorithms belong to the Reinforcement Learning (RL) family [20].

2.1 Dialogue as an MDP

The spoken dialogue management problem can be seen as a sequential decision making problem. It can thus be cast into an MDP and the optimal policy can be found by applying an RL algorithm. Indeed, the role of the dialogue manager (or decision maker) is to select and perform dialogue acts (actions in the MDP paradigm) when it reaches a given dialogue turn (state in the MDP paradigm) while interacting with a human user (its environment in the MDP paradigm). There can be several types of system dialogue acts. For example, in the case of a restaurant information system, possible acts are request(cuisine type), provide(address), confirm(price range), close etc. The dialogue state is usually represented efficiently by the Information State paradigm [3]. In this paradigm, the dialogue state contains a compact representation of the history of the dialogue in terms of system acts and the subsequent user responses (user acts). It summarizes the information exchanged between the user and the system until the desired state is reached and the dialogue episode is eventually terminated. A dialogue management strategy is thus a mapping between dialogue states and dialogue acts. Still following the MDP paradigm, the optimal strategy is the one that maximizes some cumulative function of the rewards collected all along the interaction. A common choice for the immediate reward is the contribution of each action to the user’s satisfaction [17]. This subjective reward is usually approximated by a linear combination of objective measures (dialogue duration, number of ASR errors, task completion etc.). The weights of this linear combination can be computed from empirical data [10]. Yet, most of the time, simpler reward functions are used, taking into account that the most important objective measures are task completion and the length of the dialogue episode.

2.2 Restaurant Information MDP-SDS

The dialogue problem studied in the rest of this paper is a slot-filling restaurant information system. The dialogue manager has 3 slots to be filled and confirmed by the user:
(1) Cuisine (Italian-French-Thai), (2) Location (City center-East-West) and (3) Price range (Cheap-Moderate-Expensive). Here the goal of the dialogue system is to fill these slots with the user’s preferences and also to confirm the slot values, if the confidence in the retrieved information is low, before proceeding to seek the relevant information from the database. The initial list of possible (commonly used) system actions is: (1) Ask cuisine, (2) Ask location, (3) Ask restaurant type, (4) Explicit confirm cuisine, (5) Explicit confirm location, (6) Explicit confirm type and (7) Greet the user. The dialogue state of the restaurant information MDP-SDS includes 6 binary values to indicate whether the 3 slots have been filled and confirmed. It also includes a binary value to indicate whether the user has been greeted or not. The reward function is defined as follows: the system receives a completion reward of 300 if the task is successfully completed and a time-step penalty of -20 for every transition.

2.3 Dialogue Policy Optimization

Once the dialogue management problem is cast into an MDP, Dynamic Programming or RL methods [20] can be applied to find the optimal dialogue policy [9]. The goal of the policy optimization task is to find the dialogue policy which maximizes the expected discounted sum of rewards that can be obtained by the agent over an infinite time horizon. Most of the recent work done in this direction [6] focuses on using online reinforcement learning algorithms such as SARSA for policy optimization. Online RL algorithms like SARSA are data intensive, so it is customary to simulate or model the user behavior based on the available dialogue corpus [8, 18, 13] and to artificially generate simulated dialogues. The RL policy learner then interacts with the simulated user to find the optimal dialogue policy. The DIPPER dialogue management framework [4], together with REALL, a hierarchical reinforcement learning policy learner [7], was used to learn and test the dialogue policies discussed in this paper (the exploration rate of the RL policy learner was set to 0.2). The user simulation used in the experiments was trained using the town information corpus discussed in [4]. The policy learned using the reward function, action set and state space described in 2.2 will serve as the baseline and will be referred to as the simple action policy.
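As an illustration of this setup, the Python sketch below encodes the 7-flag dialogue state and the reward scheme described above with a simple tabular Q-learning update. This is a simplified stand-in: the actual experiments used the DIPPER/REALL framework with SARSA-style online learning, and the discount and learning-rate values here are illustrative assumptions (only the 0.2 exploration rate comes from the text).

```python
# Sketch of the restaurant-task MDP above: the state is 7 binary flags
# (3 slots filled, 3 slots confirmed, user greeted); the reward is +300 on
# successful completion and -20 per transition. A tabular Q-learning update
# is shown for illustration; the paper used DIPPER with the REALL learner.
import random
from collections import defaultdict

ACTIONS = ["greet", "ask_cuisine", "ask_location", "ask_price",
           "confirm_cuisine", "confirm_location", "confirm_price"]
Q = defaultdict(float)                  # Q[(state, action)], state = 7-tuple
GAMMA, ALPHA, EPSILON = 0.95, 0.1, 0.2  # EPSILON matches the 0.2 exploration rate

def choose_action(state):
    if random.random() < EPSILON:                      # explore
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(state, a)])   # exploit (greedy)

def q_update(state, action, reward, next_state):
    best_next = max(Q[(next_state, a)] for a in ACTIONS)
    Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
```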
3 Complex System Actions

Simple actions are commonly used system acts which relate to one slot, such as asking for a slot value or explicitly confirming a slot value. The actions listed in Section 2.2 are all examples of simple system acts (except implicit confirmation). Complex actions are system actions which are formed by combining two or more simple system actions. Complex actions deal with multiple slots, for example confirming two slot values or asking for three slot values. Thus for the restaurant information dialogue system there are several possible complex actions that can be performed, among them: (1) Ask two slot values, (2) Ask three slot values, (3) Explicitly confirm two slot values, (4) Explicitly confirm three slot values, (5) Implicitly confirm two slots and ask the third slot and (6) Implicitly confirm a slot and ask a slot value (a commonly used complex action).
3.1 Hybrid Action Policy

This section explains how to learn a hybrid action policy which can choose to perform simple system acts as well as complex system acts. First, the action set of the restaurant information MDP-SDS described in 2.2 is extended with the following complex system actions: (1) Ask values for two slots, (2) Explicitly confirm two slot values, (3) Implicitly confirm two slot values and ask the value of the third slot. Since the action set now contains both simple and complex actions, the RL policy learner will explore both types of actions. But the user simulation learned from the dialogue corpora (which only contained simple actions) can respond only to simple system actions. Thus the user behavior for the complex system acts is hand-coded and combined with the learned user simulation. The hand-coded behavior for complex actions is as follows: (1) ask 2 slots {ProvideTwoSlotValue 0.9, ProvideOneSlotValue 0.1, SayNothing 0}, (2) explicitConfirm 2 slots {SayYes 1.0, SayNo 0, SayNothing 0}, (3) implicitConfirm 2 slots and ask value for slot {ProvideOneSlotValue 1.0, SayYes 0, SayNo 0}. The updated user simulation can thus respond to both simple (behavior learned from corpora) and complex system actions (hand-coded behavior). As explained in 2.3, a dialogue policy is learned using the updated user simulation and MDP-SDS; this policy will be referred to as the hybrid action policy.

System: Hello
User: Silence
System: What type of restaurant are you looking for and in which location? (complex)
User: Italian restaurant in city center
System: In what price-range are you looking for an Italian restaurant?
User: A cheap one
System: Did you say you are looking for a cheap restaurant in city-center? (complex)
User: Yes

The dialogue episode presented here is an interaction between the RL policy learner (exploration rate set to zero) and the user simulation using the hybrid action policy. One can observe that the policy can now choose complex system actions as well as simple actions when required. Given the action set of the restaurant information dialogue system, the sample dialogue presented here is an optimal behavior for grounding the three slots.

3.2 Effect of Noise and User on Complex Actions

The hand-coded user behavior for complex actions discussed in Section 3.1 simulates the zero-channel-noise scenario, i.e., when the user says something it is assumed that the system will capture it correctly and there is no chance of error. This is not always true, and there may be some noise in the transmission channel. Thus ideally the probability of the SayNo user act is not zero (the fact that the system does not understand what the user said is modeled as the user saying nothing). But if the user response to the complex system act ImplicitConfirm2slotsAndAskASlot is SayNo, it would be difficult to identify which of the two slots is wrong. Based on this, our 1st assumption is: when there is
noise in the automatic speech recognition (ASR) channel, it is advisable to perform simple system acts and not complex actions.

System: Can you let me know what your preferences for restaurant selection are?
User 1: Nothing (Novice user)
User 2: Italian restaurant (Novice user)
User 3: Cheap Italian restaurant in City Center (Experienced user)

The users who intend to use the restaurant information SDS may range from novice (new) users to experienced (frequent) users. Now let us consider the above example. Here the user under-provides information in the first two cases but provides all the necessary information in the third case. Based on this, our 2nd assumption is: it is ideal to perform simple system actions while interacting with novice users and hybrid actions while interacting with experienced users.
4 Noise Adaptive Policy

Based on our 1st assumption, action selection has to be performed depending on the noise level in the ASR channel. The first step towards learning a noise dependent policy is to have a noise simulation module. Several works such as [14, 12, 5, 16, 19] have been carried out in the recent past to simulate channel noise for dialogue modeling. A simple approach to simulating the channel noise is to tune the probabilities of user responses to confirmation system actions [15]. By increasing the probability of negation we can simulate the high noise scenario, and by reducing the probability of negation we can simulate the low noise scenario. The user behaviors for complex confirmation actions presented in Section 3.1 are modified as shown in Table 1 to simulate the low noise and high noise scenarios. Let us refer to these user simulations as the low noise user model and the high noise user model; a sampling sketch follows the table.

Table 1. Handcrafted user behavior for noise simulation

Channel noise   Action                        Provide 1 Value   SayYes   SayNo
Low             ImplicitConfirm2AndAskASlot               0.9        0     0.1
Low             ExplicitConfirm2                            0      0.9     0.1
High            ImplicitConfirm2AndAskASlot               0.6        0     0.4
High            ExplicitConfirm2                            0      0.6     0.4
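A minimal Python sketch of sampling user responses from these handcrafted distributions (the probabilities are taken from Table 1; the function and dictionary names are illustrative):

```python
# Sketch of the handcrafted noisy user behaviour of Table 1: raising the
# probability of SayNo on confirmations simulates a noisier ASR channel.
import random

USER_BEHAVIOUR = {
    "low": {
        "ImplicitConfirm2AndAskASlot": [("Provide1Value", 0.9), ("SayNo", 0.1)],
        "ExplicitConfirm2":            [("SayYes", 0.9), ("SayNo", 0.1)],
    },
    "high": {
        "ImplicitConfirm2AndAskASlot": [("Provide1Value", 0.6), ("SayNo", 0.4)],
        "ExplicitConfirm2":            [("SayYes", 0.6), ("SayNo", 0.4)],
    },
}

def user_response(noise_level, system_act):
    acts, probs = zip(*USER_BEHAVIOUR[noise_level][system_act])
    return random.choices(acts, weights=probs, k=1)[0]
```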
The goal now is to learn a hybrid dialogue policy which is sensitive to noise in the ASR channel. A new variable is introduced in the Information State to measure the negation index (i.e., the frequency of negations made by the user). This approach is similar to the one suggested in [2]. The reward function defined in Section 2.2 is modified to include the negation index as shown below:

Reward = Completion reward − Time step penalty − Negation count ∗ 20

Using this updated reward function and the user simulations with channel noise simulation, noise sensitive policies can be learned. First a policy is learned using the low noise user model and a second using the high noise user model. Let us refer to these dialogue policies as the low noise policy and the high noise policy. An interaction between the policy learner (learning rate set to zero) using the low noise policy and the user simulation with simulated low channel noise is presented below. It can be observed that the system performs complex system actions as well as simple actions.

System: greet
User: silence
System: ask2Slot [askCuisine, askLocation]
User: provide 2 values [cuisine, centralLocation]
System: implConfAskASlot [confirmCuisine, askPrice]
User: provide 1 value [Moderate]
System: explicitConfirm2 [confirm centralLocation, confirm Moderate]
User: yesAnswer

An interaction between the policy learner (learning rate set to zero) using the high noise policy and the user simulation with simulated high channel noise is presented below. It can be observed that the system performs only simple (confirmation) actions.

System: greet
User: silence
System: ask2Slot [askLocation, askPrice]
User: provide 2 values [centralLocation, Moderate]
System: askASlot [askCuisine]
User: provide 1 value [Italian]
System: explicitConfirm [centralLocation]
User: yesAnswer
System: explicitConfirm [Moderate]
User: yesAnswer
System: explicitConfirm [Italian]
User: yesAnswer

In real-life situations it is practically impossible to say when the channel noise will be low and when it will be high; thus one may not know when to switch between the high noise and low noise policies. One possible solution is to learn a dialogue policy which can adapt itself to different ASR channel noise levels. A noise adaptive dialogue policy is learned by using the high noise user model and the low noise user model in parallel. For every dialogue cycle the policy learner randomly chooses one of the two user simulations for interaction. This way one can learn a policy that can adapt to different channel noise levels. Let the policy learned by randomly switching between the user models during policy optimization be called the noise adaptive hybrid (action) policy.
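As a sketch of this training regime (Python): the negation weight of 20 is taken from the text, while the other reward constants and the `run_episode` interface are illustrative assumptions.

```python
import random

def noise_adaptive_reward(completed, n_turns, n_negations,
                          completion_reward=100.0, turn_penalty=1.0):
    # Section 4 reward: completion reward minus a per-turn penalty
    # minus 20 per user negation (only the negation weight is from
    # the text; the other constants are placeholders).
    return (completion_reward if completed else 0.0) \
        - turn_penalty * n_turns - 20.0 * n_negations

def learn_noise_adaptive_policy(learner, low_noise_user, high_noise_user,
                                n_dialogues=100_000):
    """For every dialogue cycle, randomly pick one of the two user
    simulations so the learner experiences both noise conditions."""
    for _ in range(n_dialogues):
        user = random.choice([low_noise_user, high_noise_user])
        learner.run_episode(user, reward_fn=noise_adaptive_reward)
```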
5 User Adaptive Policy

The goal now is to first simulate the user's experience level in the user simulation and then use it to learn a user experience dependent policy. To perform this task, novice users are assumed (as in the Section 3.2 example) to under-provide information for complex actions, whereas experienced users provide the necessary slot values. To simplify the problem, novice users are assumed to say nothing in response to complex (information seeking) actions, whereas experienced users provide the necessary slot values in most cases. Tuning the probabilities of user behavior in this way results in two user behaviors; let us term them the novice user simulation and the experienced user simulation. In addition to simulating user experience, the user behavior also simulates the low noise level scenario. Novice and experienced user behaviors with low channel noise are outlined in Table 2.

Table 2. Handcrafted user behavior for user experience simulation

User          Noise   Action                   Give 2 values   Give 1 value   Yes   No    Nothing
Novice        Low     ImplicitConf2&AskASlot   0               0.9            0     0.1   0
Novice        Low     ExplicitConf2            0               0              0.9   0.1   0
Novice        Low     Ask2Slots                0               0              0     0     1.0
Experienced   Low     ImplicitConf2&AskASlot   0               0.9            0     0.1   0
Experienced   Low     ExplicitConf2            0               0              0.9   0.1   0
Experienced   Low     Ask2Slots                0.9             0              0     0     0.1
Similar to the negation index, we introduce a term called the experience index in the state representation of the restaurant information MDP-SDS. The reward function of Section 4 is updated again as follows:

Reward = Completion reward − Time step penalty − Negation count ∗ 5 − Experience index ∗ 10

By using the novice and experienced user behaviors one can learn two different dialogue policies. Let us term these policies the novice user policy and the experienced user policy. Also, as explained in the previous section, by using these user simulations simultaneously, i.e., by randomly switching between them during policy optimization, one can learn a user adaptive policy. Let us term this policy the user adaptive hybrid (action) policy.
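The corresponding reward computation, as a one-function sketch: the weights 5 and 10 are from the text, and the time step penalty is taken to be subtracted as in Section 4.

```python
def user_adaptive_reward(completion_reward, time_step_penalty,
                         negation_count, experience_index):
    # Section 5 reward: the negation count is weighted by 5 and the
    # experience index by 10, penalizing complex actions wasted on
    # under-informative (novice) users.
    return (completion_reward - time_step_penalty
            - 5 * negation_count - 10 * experience_index)
```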
6 Policy Evaluation and Analysis

Table 3 presents a comparison between the simple action policy and the hybrid action policy derived in Sections 2.3 and 3.1. The results are based on 300 dialogue cycles between the policy learner using the two policies (learning rate set to zero) and the user simulation (which was used to learn the hybrid action policy). One can observe that by using complex actions along with simple actions the dialogue length is considerably reduced, and hence the overall reward of the dialogue manager is improved.
Table 3. Simple action vs. hybrid action policy

Policy Name   Average Reward   Completion   Average Length
Simple        160              300          7.0
Hybrid        214              300          4.2
Table 4 presents a comparison between the low noise policy and the noise adaptive hybrid action policy derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the two policies (learning rate set to zero) and the user simulation simulating low channel noise. It can be observed that the adaptive noise policy performs as well as the low noise policy in the low channel noise scenario.

Table 4. Low noise policy vs. adaptive noise policy in the low noise scenario

Policy Name      Average Reward   Completion   Average Length
Low noise        216.51           300          4.12
Adaptive noise   214.06           300          4.20
Table 5 presents a comparison between the high noise policy and the noise adaptive hybrid action policy derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the two policies (exploration rate set to zero) and the user simulation simulating high channel noise. It can be observed that the adaptive noise policy performs as well as the high noise policy in the high channel noise scenario, but there is a small degradation in the task completion average.

Table 5. High noise policy vs. adaptive noise policy in the high noise scenario

Policy Name      Average Reward   Completion   Average Length
High noise       160.52           300          6.65
Adaptive noise   175.99           295.84       5.30
Table 6 presents a comparison between the low noise, high noise and noise adaptive hybrid action policies derived in Section 4. The results are based on 300 dialogue cycles between the policy learner using the three policies (exploration rate set to zero) and the two user simulations, simulating mixed channel noise. It can be observed that the adaptive noise policy performs better than both the high noise policy and the low noise policy in the mixed channel noise scenario. This shows that the adaptive policy learns a trade-off, switching between complex and simple actions with regard to changing noise levels (whereas the low noise policy always tries to perform complex actions and the high noise policy always performs simple actions). It takes advantage of the extended state representation to perform this adaptation.
Table 6. Low noise policy vs. high noise policy vs. adaptive noise policy in the mixed noise scenario

Policy Name      Average Reward   Completion   Average Length
Low noise        140.19           297.56       7.38
High noise       170.06           300          6.33
Adaptive noise   191.38           298.07       4.87
Table 7 presents a comparison between the novice user policy, the experienced user policy and the user adaptive hybrid action policy derived in Section 5. The results are based on 250 dialogue cycles between the policy learner using the three policies (exploration rate set to zero) and both the novice and experienced user simulations, randomly switched to simulate mixed user experience. It can be observed that the user adaptive policy performs better than the novice user policy and the experienced user policy in the mixed user scenario. This shows that the user adaptive policy learns a trade-off, switching between complex and simple actions with regard to changing user experience levels (whereas the novice user policy always performs simple actions and the experienced user policy always performs complex actions).

Table 7. Novice vs. experienced vs. adaptive user policy in the mixed user scenario

Policy        Avg. Reward   Completion   Avg. SimpleAct   Avg. ComplexAct   Avg. Length
Novice        145.9         250.0        7.66             0                 7.66
Experienced   -197.7        228.6        2.9              16.0              18.9
Adaptive      151.9         250.0        4.69             1.0               5.69
7 Conclusion

So far the possibility of using complex system actions along with simple actions for spoken dialogue management has not been much investigated. Based on the experimental results presented in this contribution, one can conclude that complex action selection can considerably reduce the dialogue length, but at the same time it is important to consider the channel noise and user experience factors before choosing complex actions. Since it is not possible to predict the channel noise level or the experience level of the user in real-life scenarios, one can instead learn an adaptive hybrid action policy that adapts to the channel noise and user experience. Yet, this requires extending the state representation to take into account the behaviour of the user (SayNo or over-informative users, for example). All the tasks (learning and testing) presented in this paper were carried out using simulated users partially learned from corpora and partially hand-tuned; it will thus be important to test these policies with real users in the future. Hybrid action selection may move human-machine interaction a step closer to human-human communication. Another interesting direction for future work is to explore the possibility of automatically generating new complex actions from a given list of simple actions and using online policy learning approaches to learn a hybrid dialogue policy. This way we may come across potentially new and interesting system actions which may not be available in the dialogue corpus.
References

[1] Bellman, R.: A Markovian decision process. Journal of Mathematics and Mechanics 6, 679–684 (1957)
[2] Janarthanam, S., Lemon, O.: User simulations for online adaptation and knowledge alignment in troubleshooting dialogue systems. In: Proceedings of LONDial, London, UK (2008)
[3] Larsson, S., Traum, D.R.: Information state and dialogue management in the TRINDI dialogue move engine toolkit. Natural Language Engineering 6, 323–340 (2000)
[4] Lemon, O., Georgila, K., Henderson, J., Stuttle, M.: An ISU dialogue system exhibiting reinforcement learning of dialogue policies: generic slot-filling in the TALK in-car system. In: Proceedings of the Meeting of the European Chapter of the Association for Computational Linguistics (EACL 2006), Morristown, NJ, USA (2006)
[5] Lemon, O., Liu, X.: Dialogue policy learning for combinations of noise and user simulation: transfer results. In: Proceedings of SIGdial 2007, Antwerp, Belgium (2007)
[6] Lemon, O., Pietquin, O.: Machine learning for spoken dialogue systems. In: Proceedings of the International Conference on Speech Communication and Technologies (InterSpeech 2007), Antwerp, Belgium (2007)
[7] Lemon, O., Liu, X., Shapiro, D., Tollander, C.: Hierarchical reinforcement learning of dialogue policies in a development environment for dialogue systems: REALL-DUDE. In: Proceedings of the 10th SemDial Workshop, BRANDIAL 2006, Potsdam, Germany (2006)
[8] Levin, E., Pieraccini, R., Eckert, W.: A stochastic model of human-machine interaction for learning dialog strategies. IEEE Transactions on Speech and Audio Processing 8, 11–23 (2000)
[9] Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialogue strategies. In: Proceedings of ICASSP, Seattle, Washington (1998)
[10] Kamm, C.A., Walker, M.A., Litman, D.J., Abella, A.: PARADISE: A framework for evaluating spoken dialogue agents. In: Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics (ACL 1997), Madrid, Spain, pp. 271–280 (1997)
[11] Pietquin, O.: A Framework for Unsupervised Learning of Dialogue Strategies. PhD thesis, Faculté Polytechnique de Mons, TCTS Lab (Belgium) (April 2004)
[12] Pietquin, O., Dutoit, T.: A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech and Language Processing 14(2), 589–599 (2006)
[13] Pietquin, O., Dutoit, T.: A probabilistic framework for dialog simulation and optimal strategy learning. IEEE Transactions on Audio, Speech & Language Processing 14(2), 589–599 (2006)
[14] Pietquin, O., Renals, S.: ASR system modeling for automatic evaluation and optimization of dialogue systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2002), Orlando, FL, USA (May 2002)
[15] Rieser, V.: Bootstrapping Reinforcement Learning-based Dialogue Strategies from Wizard-of-Oz Data. PhD thesis, Saarland University, Dpt of Computational Linguistics (July 2008)
[16] Rieser, V., Lemon, O.: Learning effective multimodal dialogue strategies from wizard-of-oz data: bootstrapping and evaluation. In: Proceedings of the Association for Computational Linguistics (ACL 2008), Columbus, USA (2008)
[17] Singh, S., Kearns, M., Litman, D., Walker, M.: Reinforcement learning for spoken dialogue systems. In: Proceedings of the Annual Meeting of the Neural Information Processing Society (NIPS 1999), Denver, USA. Springer, Heidelberg (1999)
[18] Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review 21(2), 97–126 (2006)
[19] Schatzmann, J., Young, S.: Error simulation for training statistical dialogue systems. In: Proceedings of ASRU 2007, Kyoto, Japan (2007)
[20] Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction (Adaptive Computation and Machine Learning). The MIT Press, Cambridge (March 1998)
[21] Williams, J.D., Young, S.: Partially observable Markov decision processes for spoken dialog systems. Computer Speech & Language 21(2), 393–422 (2007)
Detection of Unknown Speakers in an Unsupervised Speech Controlled System

Tobias Herbig¹,³, Franz Gerl², and Wolfgang Minker³

¹ Nuance Communications Aachen GmbH, Ulm, Germany
² SVOX Deutschland GmbH, Ulm, Germany
³ University of Ulm, Institute of Information Technology, Ulm, Germany
Abstract. In this paper we investigate the capability of our self-learning speech controlled system, comprising speech recognition, speaker identification and speaker adaptation, to detect unknown users. Our goal is to enhance automated speech controlled systems by an unsupervised personalization of the human-computer interface. New users should be allowed to use a speech controlled device without the need to identify themselves or to undergo a time-consuming enrollment. Instead, the system should detect new users during the operation of the device. New speaker profiles should be initialized and incrementally adjusted without any additional intervention of the user. Such a personalization of human-computer interfaces represents an important research issue. As an example, in-car applications such as speech controlled navigation, hands-free telephony and infotainment systems are investigated. Results for detecting unknown speakers are presented for a subset of the SPEECON database.
1 Introduction
Speech recognition has attracted attention for various applications such as office systems, manufacturing, telecommunication, medical reports and infotainment systems [1]. For in-car applications both the usability and security can be increased for a wide variety of users. The driver can be supported to safely participate in road traffic and to operate technical devices such as navigation systems or hands-free sets. Infotainment systems with speech recognition for navigation, telephony or music control are typically not personalized to a single user. The speech signal may be degraded by varying engine, wind and tire noises, or transient events such as passing cars or babble noise. Computational efficiency and memory consumption are important design parameters. On the other hand, a large vocabulary, e.g. city or street names for navigation, has to be reliably recognized. However, for a variety of practical applications a small number of users, e.g. 5-10 recurring speakers, can be assumed. The benefits of a device that can recognize the voices of its main users are obvious:
This work was conducted at Harman-Becker. Tobias Herbig is now with Nuance Communications. Franz Gerl is now with SVOX.
The dialog flow can be personalized to specific user habits, new ways of simplifying the interaction with the device can be suggested, and inexperienced users can be introduced to the system. Furthermore, speech recognition can be improved. Since speech recognizers are typically trained on a large set of speakers, there is a mismatch between the trained speech pattern and the voice characteristics of each individual speaker, degrading speech recognition accuracy [2]. Enhanced statistical models can be obtained for each speaker by adapting on speaker specific data. Without speaker tracking, all information acquired from a particular speaker is either lost or degraded with each speaker turn. Therefore, it seems reasonable to employ speaker identification and to perform speaker adaptation separately for different speakers. A simple implementation would force the user to identify himself whenever the system is initialized. However, we aim for a more natural and convenient human-computer communication by identifying the current user in an unsupervised way.

We developed a speech controlled system which includes speaker identification, speech recognition and speaker adaptation. We succeeded in tracking different speakers after a short enrollment phase of only two command and control utterances [3]. This is enabled by combining the strengths of two adaptation schemes [4]: in the learning phase only a few parameters have to be estimated, allowing the system to capture the main speech and speaker characteristics, while in the long run individual adjustment is achieved. A unified approach to speaker identification and speech recognition was developed as an extension of a standard speech recognizer, so multiple recognitions can be avoided.

In this paper we investigate the detection of unknown speakers to overcome the limitation of an enrollment. The goal is to initialize new speaker profiles in an unsupervised manner during the first few utterances of a new user. To this end, the unified approach of joint speaker identification and speech recognition and a standard speaker identification technique were evaluated for several training levels of the employed statistical models. First, speaker identification of known and unknown speakers is introduced. An implementation of an automated speech recognizer is described, and the speaker adaptation scheme employed to capture and represent speaker characteristics is briefly introduced. Then, the unified approach is summarized and the results of our experiments are presented for the unified approach and a standard technique. Finally, a summary is given and an extension of our approach is suggested for future work.
2 Speaker Identification
Gaussian Mixture Models (GMMs) have emerged as the dominating statistical model for speaker identification [5]. GMMs comprise a set of N multivariate Gaussian density functions subsequently denoted by the index k. The multimodal probability density function

\[ p(x_t \mid \Theta_i) = \sum_{k=1}^{N} w_k^i \, \mathcal{N}\left\{ x_t \mid \mu_k^i, \Sigma_k^i \right\} \tag{1} \]

is a convex combination of their component densities. Each speaker model i is completely defined by the parameter set Θ_i which contains the weighting factors w_k^i, mean vectors μ_k^i and covariance matrices Σ_k^i. The parameter set will be omitted for reasons of simplicity. x_t denotes the feature vector, which may contain Mel Frequency Cepstral Coefficients (MFCCs) [6] or mean-normalized MFCCs [5], for example. For speaker identification the log-likelihood

\[ \log\left( p(x_{1:T} \mid i) \right) = \sum_{t=1}^{T} \log \left( \sum_{k=1}^{N} w_k^i \, \mathcal{N}\left\{ x_t \mid \mu_k^i, \Sigma_k^i \right\} \right) \tag{2} \]

is calculated for each utterance characterized by the sequence of feature vectors x_{1:T}. Independently and identically distributed (iid) feature vectors are assumed. The speaker with the highest posterior probability or likelihood is identified, as found by Reynolds and Rose [7].

The detection of unknown speakers is a critical issue for open-set speaker identification since unknown speakers cannot be explicitly modeled. A simple extension to open-set scenarios is to introduce a threshold θ_th for the absolute log-likelihood values, as found by Fortuna et al. [8]:

\[ \log\left( p(x_{1:T} \mid \Theta_i) \right) \le \theta_{th}, \quad \forall i. \tag{3} \]

If the speaker's identity does not correspond to a particular speaker model, a low likelihood value is expected. However, we expect high fluctuations of the absolute likelihood in adverse environments such as automobiles, which may affect the threshold decision. Advanced techniques may use normalization techniques comprising a Universal Background Model (UBM) [8, 9]. Log-likelihood ratios of the speaker models and the UBM can be examined for out-of-set detection [8]. If the inequality

\[ \log\left( p(x_{1:T} \mid \Theta_i) \right) - \log\left( p(x_{1:T} \mid \Theta_{UBM}) \right) \le \theta_{th}, \quad \forall i \tag{4} \]

holds for all speaker models, an unknown speaker is likely. The latter approach offers the advantage of lowering the influence of events which affect all statistical models in a similar way. For example, phrases spoken in an adverse environment may cause a mismatch between the speaker models and the audio signal due to background noises. Furthermore, text-dependent fluctuations in a spoken phrase, e.g. caused by unseen data or the training conditions, can be reduced [8]. In those cases the likelihood ratio appears to be more robust than absolute likelihoods.
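A minimal sketch (Python with NumPy/SciPy) of the likelihood-ratio test of Eqs. (2) and (4); model parameters are passed as plain (weights, means, covariances) tuples, and numerical safeguards such as log-sum-exp are omitted for brevity.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_log_likelihood(X, weights, means, covs):
    """log p(x_1:T | Theta) for iid frames (Eq. 2); X has shape (T, d)."""
    frame_lik = sum(w * multivariate_normal.pdf(X, mean=m, cov=c)
                    for w, m, c in zip(weights, means, covs))
    return float(np.sum(np.log(frame_lik)))

def is_unknown_speaker(X, speaker_models, ubm, theta_th):
    """Eq. (4): hypothesize an unknown speaker if no enrolled model
    beats the UBM log-likelihood by more than theta_th."""
    l_ubm = gmm_log_likelihood(X, *ubm)
    ratios = [gmm_log_likelihood(X, *model) - l_ubm
              for model in speaker_models]
    return max(ratios) <= theta_th
```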
3 Implementation of an Automated Speech Recognizer
We use a speech recognizer based on Semi-Continuous HMMs (SCHMMs) [10]. All states s_t share the mean vectors and covariances of one GMM and only differ in their weighting factors:

\[ p_{SCHMM}(x_t) = \sum_{s_t=1}^{M} \sum_{k=1}^{N} w_k^{s_t} \, \mathcal{N}\left\{ x_t \mid \mu_k, \Sigma_k \right\} \, p(s_t) \tag{5} \]

where M denotes the number of states. For convenience, the parameter set Θ which includes the initial state probabilities, state transitions and the GMM parameters is omitted. The basic setup of the speech recognizer is shown in Fig. 1.
Fig. 1. Block diagram of a speech recognizer based on SCHMMs: the front-end computes feature vectors x_t, which the codebook turns into a soft quantization q_t for the decoder.
Noise reduction is calculated by a standard Wiener filter. 11 MFCC coefficients are extracted. The 0th coefficient is substituted by a normalized energy. Cepstral mean subtraction and a Linear Discriminant Analysis (LDA) are applied to obtain a compact representation which is robust against environmental influences. We use windows of 9 frames of MFCCs, where the dynamics of delta and delta-delta coefficients have been incorporated into the LDA using a bootstrap training [11]. Each feature vector x_t is compared with a speaker independent codebook subsequently called the standard codebook. The standard codebook consists of about 1000 multivariate Gaussian densities defined by the parameter set

\[ \Theta^0 = \left\{ w_1^0, \ldots, w_N^0, \mu_1^0, \ldots, \mu_N^0, \Sigma_1^0, \ldots, \Sigma_N^0 \right\}. \tag{6} \]

The soft quantization

\[ q_t = \left( p(x_t \mid k=1), \ldots, p(x_t \mid k=N) \right) \tag{7} \]

is used for speech decoding. The speech decoder comprises the acoustic models, lexicon and language model. The acoustic model is realized by Markov chains. The lexicon represents the corpus of all word strings to be recognized. The prior probabilities of word sequences are given by the language model [10].
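As a brief illustration, the soft quantization of Eq. (7) amounts to evaluating all codebook densities for one frame; a sketch with NumPy/SciPy follows (the decoding itself, and any pruning of the real recognizer, is not modeled).

```python
import numpy as np
from scipy.stats import multivariate_normal

def soft_quantization(x_t, means, covs):
    """Eq. (7): likelihood scores of one feature vector under all N
    codebook Gaussians; this vector q_t is handed to the decoder."""
    return np.array([multivariate_normal.pdf(x_t, mean=m, cov=c)
                     for m, c in zip(means, covs)])
```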
4 Speaker Adaptation
Speaker adaptation allows adjusting or initializing speaker specific statistical models, e.g. GMMs for speaker identification or codebooks for enhanced speech recognition. The capability of adaptation algorithms depends on the number of available parameters, which is limited by the amount of speaker specific training data. The Eigenvoice (EV) approach is advantageous when facing few data since only some 10 parameters have to be estimated to adapt the codebooks of a speech recognizer [12]; to modify the mean vectors of our speech recognizer directly, about 25,000 parameters would have to be optimized. The adapted mean vector μ_k^{EV} results from a linear combination of the original speaker independent mean vector μ_k^0 and a weighted sum of the eigenvoices e_{k,l}^{EV}:

\[ \mu_k^{EV} = \mu_k^0 + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{EV} \tag{8} \]

where E_i{μ_k^{EV}} = μ_k^0 is assumed. Only the scalar weighting factors α_l have to be optimized. When sufficient speaker specific training data is available, the Maximum A Posteriori (MAP) adaptation allows individual adjustments of each Gaussian density [5]:

\[ \mu_k^{MAP} = (1 - \alpha_k) \cdot \mu_k^0 + \alpha_k \cdot \mu_k^{ML} \tag{9} \]
\[ \alpha_k = \frac{n_k}{n_k + \eta}, \quad \eta = \text{const} \tag{10} \]
\[ n_k = \sum_{t=1}^{T} p(k \mid x_t, \Theta^0). \tag{11} \]

When GMMs for standard speaker identification are adapted, we use μ_k^{UBM} instead of μ_k^0. For convenience, we employ only the sufficient statistics of the standard codebook or a UBM. On extensive training data the MAP adaptation approaches the Maximum Likelihood (ML) estimates

\[ \mu_k^{ML} = \frac{1}{n_k} \sum_{t=1}^{T} p(k \mid x_t, \Theta^0) \cdot x_t. \tag{12} \]

Thus, we use a simple yet efficient combination of EV and ML estimates to adjust the mean vectors of the codebooks [4]:

\[ \mu_k^{opt} = (1 - \beta_k) \cdot \mu_k^{EV} + \beta_k \cdot \mu_k^{ML} \tag{13} \]
\[ \beta_k = \frac{n_k}{n_k + \lambda}, \quad \lambda = \text{const}. \tag{14} \]

The smooth transition from globally estimated mean vectors μ_k^{EV} to locally optimized ML estimates allows efficiently retrieving speaker characteristics for enhanced speech recognition. Fast convergence on limited data and individual adaptation on extensive data are achieved. For convenience, the speech decoder's state alignment is omitted in our notation for codebook optimization.
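In NumPy terms, the combined adaptation of Eqs. (8) and (12)-(14) can be sketched as follows; the shapes are stated in the docstring, λ = 4 is the value used later in the evaluation, and the estimation of the eigenvoice weights α_l is assumed to happen elsewhere.

```python
import numpy as np

def adapt_codebook_means(mu0, eigenvoices, alpha, gamma, X, lam=4.0):
    """Combined EV/ML mean adaptation.
    mu0:         (N, d) speaker independent mean vectors
    eigenvoices: (L, N, d) eigenvoices e_{k,l}
    alpha:       (L,) estimated eigenvoice weights
    gamma:       (T, N) posteriors p(k | x_t, Theta_0)
    X:           (T, d) feature vectors of the adaptation data
    """
    mu_ev = mu0 + np.tensordot(alpha, eigenvoices, axes=1)   # Eq. (8)
    n_k = gamma.sum(axis=0)                                  # Eq. (11)
    mu_ml = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # Eq. (12)
    beta = (n_k / (n_k + lam))[:, None]                      # Eq. (14)
    return (1.0 - beta) * mu_ev + beta * mu_ml               # Eq. (13)
```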
5 Unified Speaker Identification and Speech Recognition
We obtain an unsupervised speech controlled system by fusing all of the components introduced: speaker specific codebooks are initialized and continuously adjusted. The basic setup of the speaker independent speech recognizer shown in Fig. 1 is extended by N_Sp speaker specific codebooks which are operated in parallel to the standard codebook.
Fig. 2. System architecture for joint speaker identification and speech recognition. One front-end is employed for speaker specific feature extraction. Speaker specific codebooks are used to decode the spoken phrase (I) and to estimate the speaker identity (II) in a single step. Both results are used for speaker adaptation to enhance future speaker identification and speech recognition. Furthermore, speaker specific cepstral mean and energy normalization is controlled.
To avoid parallel speech decoding, a two-stage processing could be used: first, the most probable speaker is determined by standard methods for speaker identification; then the entire utterance is re-processed employing the corresponding codebook for speech decoding to generate a transcription. To avoid the high latencies and increased computational complexity caused by re-processing, we developed a unified approach which realizes speech decoding and speaker identification simultaneously.

Speaker specific codebooks are considered as common GMMs representing the speaker's pronunciation characteristics. Class et al. [13] and the results in [4] give evidence that speaker specific codebooks can be employed to track different speakers. We model speaker tracking by an HMM whose states represent the enrolled speakers. The emission probability density functions are represented by the adapted codebooks of the speech recognizer. For speaker specific speech recognition with online speaker tracking we employ the forward algorithm to select the optimal codebook on a frame level. Only the soft quantization of the hypothesized speaker is processed by the speech decoder. This technique can be viewed as a fast but probably less confident speaker identification to be used for speaker specific speech recognition under real-time conditions. In this context, codebooks are used to decode a spoken phrase and to determine the current speaker. In parallel, an improved guess of the speaker identity is provided for speaker adaptation, which is performed after speech decoding.

Each speaker specific codebook is evaluated in the same way as common GMMs for speaker identification. We only employ the simplification of equal weighting factors w_k^s to avoid the requirement of a state alignment. The log-likelihood

\[ L_i = \frac{1}{T} \sum_{t=1}^{T} \log \left( \sum_{k=1}^{N} \mathcal{N}\left\{ x_t \mid \mu_k^i, \Sigma_k^0 \right\} \right) \tag{15} \]
denotes the accumulated log-likelihood normalized by the length T of the recorded utterance. The weighting factors w_k = 1/N are omitted for convenience. In addition, the speech recognition result is used to discard speech pauses and garbage words which do not contain speaker specific information. The likelihood values of each codebook are buffered until a precise segmentation is available.

Our target is to automatically personalize speech controlled devices. To obtain a strictly unsupervised speech controlled system, new users should be automatically detected without the requirement to attend an enrollment. In the following, we investigate whether new speakers can be detected when a simple threshold θ_th is applied to the log-likelihood ratios of the speaker specific codebooks and the standard codebook. If no log-likelihood ratio exceeds this threshold,

\[ L_i - L_0 < \theta_{th}, \quad \forall i, \tag{16} \]

an unknown speaker is detected. In the experiments carried out, the performance of the joint speaker identification and speech recognition in detecting unknown speakers was evaluated and compared to a standard technique based on GMMs purely optimized to represent speaker characteristics.
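A compact sketch of this decision rule (Eqs. (15)-(16)); the exclusion of speech pauses and garbage segments is assumed to have happened upstream, and each codebook is passed as a (means, covariances) tuple sharing the standard covariances Σ_k^0.

```python
import numpy as np
from scipy.stats import multivariate_normal

def codebook_score(X, means, covs):
    """Eq. (15): length-normalized log-likelihood of an utterance under a
    codebook, with the equal (omitted) component weights."""
    frame_lik = sum(multivariate_normal.pdf(X, mean=m, cov=c)
                    for m, c in zip(means, covs))
    return float(np.mean(np.log(frame_lik)))

def identify_or_reject(X, speaker_codebooks, standard_codebook, theta_th):
    """Eq. (16): return the best enrolled speaker index, or None if no
    codebook beats the standard codebook by at least theta_th."""
    l0 = codebook_score(X, *standard_codebook)
    ratios = [codebook_score(X, *cb) - l0 for cb in speaker_codebooks]
    best = int(np.argmax(ratios))
    return None if ratios[best] < theta_th else best
```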
6 Evaluation

We conducted several experiments to investigate how accurately unknown speakers are detected by our unified approach. In addition, we examined a reference implementation representing a standard speaker identification technique.

6.1 Reference Implementation
For the standard approach a UBM-GMM with diagonal covariance matrices was trained by the Expectation Maximization (EM) algorithm. About 3.5 h of speech data originating from 41 female and 36 male speakers of our USKCP¹ development database was incorporated into the UBM training. Mean-normalized 11 MFCCs and delta-features were extracted.

¹ The USKCP is an internal speech database for in-car applications. The USKCP comprises command and control utterances such as navigation commands, spelling and digit loops. The language is US-English.
Speaker specific GMMs are initialized and continuously adapted by MAP adaptation of the mean vectors [5]. We tested several implementations concerning the number of component densities, 32 ≤ N ≤ 256, and tuning parameters 4 ≤ η ≤ 20 for adaptation. The best results for speaker identification were obtained for η = 4, as shown in [3].

6.2 Database
For the evaluation of both techniques we employed a subset of the SPEECON [14] database. This subset comprises 50 male and 23 female speakers recorded in an automotive environment. The sampling rate is 11,025 Hz. The language is US-English. Colloquial utterances with more than 4 words and mispronunciations were discarded. Digit and spelling loops were kept.

6.3 Results
First, the joint speaker identification and speech recognition was evaluated. The detection of unknown speakers is realized by the threshold decision given in (16). We employed λ = 4 in speaker adaptation since we observed the best speaker identification results for closed-set scenarios with this value; several implementations of the combined adaptation with 4 ≤ λ ≤ 20 were compared [3]. For evaluation we use a two-stage technique: first, the best in-set speaker model, characterized by the highest likelihood, is identified; then a threshold decision is used to test for an unknown speaker. The performance of the binary in-set / out-of-set classifier is evaluated by the Receiver Operating Characteristic (ROC), where the detection rate is plotted versus the false alarm rate. To evaluate the detection accuracy of a self-learning system, we defined several training stages given by the number of utterances N_A used for adaptation. Speaker models during the learning phase (N_A < 20), moderately trained codebooks and extensively trained models (N_A > 100) are investigated. Maladaptations with respect to the speaker identity are neglected. Confidence intervals are given by a gray shading. The ROC curves in Fig. 3(a) and Fig. 3(b) show that the accuracy of open-set speaker identification seems to be highly dependent on the adaptation level of the statistical models and the number of enrolled speakers. Especially for speaker models trained on only a few utterances in an adverse environment, a global threshold seems not to be feasible. This observation agrees with our former experiments [4]. Even for extensively trained codebooks, e.g. N_A ≥ 100, relatively high error rates can be observed. The same experiment was repeated with the reference implementation. The results for N = 256 and η = 4 are exemplarily shown in Fig. 4(a) and Fig. 4(b). Unknown speakers are detected by a threshold decision similar to (4). However, log-likelihoods are normalized by the length of the current utterance to be robust against short commands. In summary, significantly worse detection rates are achieved compared to Fig. 3.
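The ROC curves in the following figures are obtained by sweeping the threshold over the observed scores. A generic sketch of that bookkeeping is given below; the paper's exact evaluation protocol and the confidence intervals are not reproduced.

```python
import numpy as np

def roc_points(in_set_ratios, out_of_set_ratios):
    """Sweep theta_th over all observed log-likelihood ratios and collect
    (false alarm rate, detection rate) pairs. A trial is flagged 'unknown'
    when its best ratio falls below the threshold (Eq. 16)."""
    thresholds = np.sort(np.concatenate([in_set_ratios, out_of_set_ratios]))
    curve = []
    for th in thresholds:
        detection = float(np.mean(out_of_set_ratios < th))  # unknown caught
        false_alarm = float(np.mean(in_set_ratios < th))    # known rejected
        curve.append((false_alarm, detection))
    return curve
```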
Fig. 3. Detection of unknown speakers for the unified system which integrates speaker identification and speech recognition - N_A ≤ 20 (◦), 20 < N_A ≤ 50 (⋄), 50 < N_A ≤ 100 (□) and 100 < N_A ≤ 200 (+). (a) 5 speakers are enrolled. (b) 10 speakers are enrolled. [ROC curves of detection rate versus false alarm rate]
Fig. 4. Detection of unknown speakers for the reference implementation with 256 Gaussian distributions - N_A ≤ 20 (◦), 20 < N_A ≤ 50 (⋄), 50 < N_A ≤ 100 (□) and 100 < N_A ≤ 200 (+). (a) 5 speakers are enrolled. (b) 10 speakers are enrolled. [ROC curves of detection rate versus false alarm rate]
Fig. 5. Comparison of speaker specific codebooks (solid line) and GMMs comprising 32 (◦), 64 (⋄), 128 (□) and 256 (+) Gaussian densities for 100 < N_A ≤ 200. MAP adaptation with η = 4 is employed. [ROC curve of detection rate versus false alarm rate]
To compare the influence of the number of Gaussian densities on the detection accuracy, all implementations are shown in Fig. 5. Here, only extensively trained speaker models characterized by N_A > 100 are considered. The detection accuracy of the reference implementations evidently starts to saturate for N > 64, and it also seems to be significantly inferior to the codebook-based approach.
7 Summary and Conclusion
The evaluation has shown that our unified speaker identification and speech recognition technique is able to detect unknown speakers. The unified approach produced significantly higher detection rates than the investigated reference implementations. However, the detection rates achieved do not allow operating a speech recognizer in a completely unsupervised manner. In summary, it seems to be difficult to detect new users from only one utterance, especially for short command and control utterances. It became evident that the training level of each speaker model should be reflected in the in-set / out-of-set decision; a global threshold seems to be inadequate. In the future, we will develop more sophisticated posterior probabilities representing the adaptation level of each speaker model. When series of utterances are used for speaker identification, a significant improvement in detecting unknown speakers and in speaker identification rates can be expected. Still, speaker identification and the detection of unknown speakers will never be perfect. This presents a challenge for dialog developers: dialog strategies will have to deal with ambiguous information about the user's identity and avoid erratic behavior. The dialog may have to wait for increased confidence in subsequent utterances, or take the initiative in confirming the user's identity. When these challenges are met, however, more natural speech understanding systems become possible.
References

1. Rabiner, L., Juang, B.-H.: Fundamentals of Speech Recognition. Prentice-Hall, Englewood Cliffs (1993)
2. Zavaliagkos, G., Schwartz, R., McDonough, J., Makhoul, J.: Adaptation algorithms for large scale HMM recognizers. In: EUROSPEECH 1995, pp. 1131–1135 (1995)
3. Herbig, T., Gerl, F., Minker, W.: Evaluation of two approaches for speaker specific speech recognition. In: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010 (2010) (to appear)
4. Herbig, T., Gerl, F., Minker, W.: Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th International Conference on Intelligent Environments, IE 2010 (2010) (to appear)
5. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1-3), 19–41 (2000)
6. Reynolds, D.A.: Large population speaker identification using clean and telephone speech. IEEE Signal Processing Letters 2(3), 46–48 (1995)
7. Reynolds, D.A., Rose, R.C.: Robust text-independent speaker identification using Gaussian mixture speaker models. IEEE Transactions on Speech and Audio Processing 3(1), 72–83 (1995)
8. Fortuna, J., Sivakumaran, P., Ariyaeeinia, A., Malegaonkar, A.: Open set speaker identification using adapted Gaussian mixture models. In: INTERSPEECH 2005, pp. 1997–2000 (2005)
9. Angkititrakul, P., Hansen, J.H.L.: Discriminative in-set/out-of-set speaker recognition. IEEE Transactions on Audio, Speech, and Language Processing 15(2), 498–508 (2007)
10. Schukat-Talamazzini, E.G.: Automatische Spracherkennung. Vieweg (1995) (in German)
11. Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH 1993, pp. 803–806 (1993)
12. Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N.: Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8(6), 695–707 (2000)
13. Class, F., Haiber, U., Kaltenmeier, A.: Automatic detection of change in speaker in speaker adaptive speech recognition systems. US Patent Application 2003/0187645 A1 (2003)
14. Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A.: SPEECON - speech databases for consumer devices: Database specification and validation. In: Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, pp. 329–333 (2002)
Evaluation of Two Approaches for Speaker Specific Speech Recognition

Tobias Herbig¹,³, Franz Gerl², and Wolfgang Minker³

¹ Nuance Communications Aachen GmbH, Ulm, Germany
² Harman/Becker Automotive Systems GmbH, Ulm, Germany
³ University of Ulm, Institute of Information Technology, Ulm, Germany
Abstract. In this paper we examine two approaches for the automatic personalization of speech controlled systems. Speech recognition may be significantly improved by continuous speaker adaptation if the speaker can be reliably tracked. We evaluate two approaches for speaker identification suitable for identifying 5-10 recurring users even in adverse environments, where only a very limited amount of speaker specific data can be used for training. A standard speaker identification approach is extended by speaker specific speech recognition; multiple recognitions of speaker identity and spoken text are avoided to reduce latencies and computational complexity. In comparison, the speech recognizer itself is used to decode spoken phrases and to identify the current speaker in a single step. The latter approach is advantageous for applications which have to run on embedded devices, e.g. speech controlled navigation in automobiles. Both approaches were evaluated on a subset of the SPEECON database, which represents realistic command and control scenarios for in-car applications.
1 Introduction
During the last few decades steady progress in speech recognition and speaker identification has been achieved, leading to high recognition rates [1]. Complex speech controlled applications can now be realized. Especially for in-car applications speech recognition may help to improve usability and security. The driver can be supported to safely participate in road traffic and to operate technical devices such as navigation systems or hands-free sets. However, the speech signal may be degraded by various background noises, e.g. varying engine, wind and tire noises, passing cars or babble noise. In addition, changing environments, speaker variability and natural language input may have a negative influence on the performance of speech recognition [2]. For a variety of practical applications, e.g. infotainment systems with speech recognition for navigation, telephony or music control, typically only 5-10 recurring speakers are expected to use the system. The benefits of a device that can identify the voices of its main users are obvious:
This work was conducted while Tobias Herbig and Franz Gerl were affiliated with Harman/Becker. Tobias Herbig is now with Nuance Communications.
The dialog flow can be personalized to specific user habits, new ways of simplifying the interaction with the device can be suggested, and inexperienced speakers can be introduced to the system, for example. Furthermore, speech recognition can be improved by adapting the statistical models of a speech recognizer on speaker specific data. A speech recognition engine offers a very detailed modeling of the acoustic feature space. Given a correct decoding of an utterance, there are techniques that enable adaptation on one single utterance, as found in [3], for example. Combining speech recognition and speaker identification offers the opportunity to keep long-term adaptation profiles. The reasoning that recognizing the utterance may help to achieve reasonable speaker identification rates after a short training period prompted the work we report in this paper.

We have developed a speech controlled system which combines speaker identification, speech recognition and speaker adaptation. Different speakers can be reliably tracked after a short enrollment phase of only two command and control utterances. Fast information retrieval is realized by combining the strengths of two adaptation schemes [4]: in the learning phase only a few parameters have to be estimated to capture the most relevant speech and speaker characteristics, and in the long run this adaptation scheme smoothly transits to an individual adjustment of each speaker profile.

To meet the demands of an efficient implementation suitable for embedded devices, speaker identification has to be performed online. We employ the speech recognizer's detailed modeling of speech and speaker characteristics for a unified approach of speaker identification and speech recognition: a standard speech recognizer is extended to identify the current user and to decode the spoken phrase simultaneously. Alternatively, speaker identification can be implemented by standard techniques known from the literature, e.g. Reynolds et al. [5]. To limit the computational overhead and latencies caused by re-processing of spoken phrases, we combine speaker identification with our approach for online speaker profile selection.

In this paper, speech recognition and speaker identification are briefly introduced. Then we discuss our approach for integrated speaker identification and speech recognition combined with speaker adaptation. The architecture of a reference system combining standard techniques for speaker identification and speaker adaptation is explained. Finally, the evaluation results for realistic command and control applications in automobiles are presented, and a summary and conclusion are given.
2 Automated Speech Recognition
We use Hidden Markov Models (HMMs) to represent both the static and dynamic speech characteristics. The Markov models represent the speech dynamics, and the emission probability density function is modeled by Gaussian Mixture Models (GMMs). The probability density function of a GMM

\[ p(x_t \mid \Theta) = \sum_{k=1}^{N} w_k \, \mathcal{N}\left\{ x_t \mid \mu_k, \Sigma_k \right\} \tag{1} \]
comprises a convex combination of N multivariate Gaussian densities which are denoted by the index k. x_t represents the feature vector at time instance t. GMMs are defined by their parameter sets Θ which contain the weights w_k, mean vectors μ_k and covariance matrices Σ_k. For speech recognition, we use so-called Semi-Continuous HMMs (SCHMMs):

\[ p_{SCHMM}(x_t) = \sum_{s_t=1}^{M} \sum_{k=1}^{N} w_k^{s_t} \, \mathcal{N}\left\{ x_t \mid \mu_k, \Sigma_k \right\} \, p(s_t), \tag{2} \]

as can be found in Schukat-Talamazzini [6]. All states s_t of an SCHMM share the mean vectors and covariances of one GMM and only differ in their weighting factors. M denotes the number of states. For convenience, we omit the parameter set Θ_SCHMM comprising the initial state probabilities, state transitions and the GMM parameters. The basic setup of our speech recognizer is shown in Fig. 1.
Fig. 1. Block diagram of a speech recognizer based on SCHMMs: the front-end computes feature vectors x_t, which the codebook turns into a soft quantization q_t for the decoder.
Noise reduction is calculated by a standard Wiener filter and 11 Mel Frequency Cepstral Coefficients (MFCCs) are extracted. The 0th coefficient is substituted by a normalized energy. Cepstral mean subtraction and a Linear Discriminant Analysis (LDA) are applied to obtain a compact representation which is robust against environmental influences. We use windows of 9 frames of MFCCs, where the dynamics of delta and delta-delta coefficients have been incorporated into the LDA using a bootstrap training [7]. Each feature vector x_t is compared with a speaker independent codebook subsequently called the standard codebook. The standard codebook consists of about 1000 multivariate Gaussian densities defined by the parameter set

\[ \Theta^0 = \left\{ w_1^0, \ldots, w_N^0, \mu_1^0, \ldots, \mu_N^0, \Sigma_1^0, \ldots, \Sigma_N^0 \right\}. \tag{3} \]

The soft quantization

\[ q_t^0 = \left( p(x_t \mid k=1, \Theta^0), \ldots, p(x_t \mid k=N, \Theta^0) \right) \tag{4} \]

contains the likelihood scores of all Gaussian densities. The soft quantization is employed for speech decoding.
The speech decoder comprises the acoustic models, lexicon and language model. The acoustic models are realized by Markov chains. The lexicon represents the corpus of all word strings to be recognized. The prior probabilities of word sequences are given by the language model [6]. The Viterbi algorithm is used to determine the most likely word string.
3 Speaker Identification
Speaker variability can be modeled by GMMs, which have emerged as the dominating generative statistical model in speaker identification [5]. For each speaker one GMM can be trained on speaker specific data using the EM algorithm. Alternatively, a speaker independent GMM, the so-called Universal Background Model (UBM), can be trained for a large group of speakers. Speaker specific GMMs can then be obtained by speaker adaptation [5]. For testing, independently and identically distributed (iid) feature vectors are assumed by neglecting temporal statistical dependencies. Log-likelihood computation can then be realized by a sum of logarithms

\[ \log\left( p(x_{1:T} \mid \Theta_i) \right) = \sum_{t=1}^{T} \log \left( \sum_{k=1}^{N} w_k^i \, \mathcal{N}\left\{ x_t \mid \mu_k^i, \Sigma_k^i \right\} \right) \tag{5} \]

where x_{1:T} = {x_1, ..., x_t, ..., x_T} represents a sequence of feature vectors, e.g. mean-normalized MFCCs, and i denotes the speaker index. Subsequently, the speaker with the highest log-likelihood score is identified as the current speaker

\[ i^{ML} = \arg\max_i \left\{ \log\left( p(x_{1:T} \mid \Theta_i) \right) \right\} \tag{6} \]

according to the Maximum Likelihood (ML) criterion.
4 Joint Speaker Identification and Speech Recognition
Speech recognition can be significantly improved when codebooks are adapted to specific speakers. Speaker specific codebooks can be considered as common GMMs representing the speaker’s pronunciation characteristics. Class et al. [8] and the results in [4] give evidence that speaker specific codebooks can be employed to track different speakers. To avoid latencies and computational overhead caused by multiple recognitions of the spoken phrase and speaker identity, we employ speech recognition and speaker identification simultaneously. The basic architecture of our speech controlled system is depicted in Fig. 2. In the front-end standard Wiener filtering is employed for speech enhancement to reduce background noises. MFCC features are extracted to be used for both speech recognition and speaker identification. For each speaker energy normalization and cepstral mean subtraction are continuously adjusted starting from initial values.
Fig. 2. System architecture for joint speaker identification and speech recognition comprising two stages. Part I and II denote the speaker specific speech recognition and speaker identification, respectively. The latter controls speaker specific feature vector normalization. Speaker adaptation is employed to enhance speaker identification and speech recognition. Codebooks are initialized in the case of an unknown speaker. The statistical modeling of speaker characteristics is continuously improved.
For speech recognition, appropriate speaker specific codebooks are selected on a frame level. N_Sp speaker specific codebooks are operated in parallel to the standard codebook. The posterior probability p(i_t | x_{1:t}) is estimated for each speaker i given the history of observations x_{1:t}:

\[ p(i_t \mid x_{1:t}) \propto p(x_t \mid i_t) \cdot p(i_t \mid x_{1:t-1}), \quad i_t = 0, 1, \ldots, N_{Sp} \tag{7} \]
\[ p(i_t \mid x_{1:t-1}) = \sum_{i_{t-1}} p(i_t \mid i_{t-1}) \cdot p(i_{t-1} \mid x_{1:t-1}) \tag{8} \]
\[ p(i_1 \mid x_1) \propto p(x_1 \mid i_1) \cdot p(i_1). \tag{9} \]

The codebook i_t^{MAP} characterized by the highest posterior probability is selected according to the Maximum A Posteriori (MAP) criterion. Only the corresponding q_t^i is forwarded to the speech decoder to generate a transcription of the spoken phrase. The corresponding state alignment is used for codebook adaptation.

For speaker identification, codebooks are considered as GMMs with equal weighting factors to avoid the requirement of a state alignment. We calculate the log-likelihood per frame

\[ L_i = \frac{1}{T} \sum_{t=1}^{T} \log \left( \sum_{k=1}^{N} \mathcal{N}\left\{ x_t \mid \mu_k^i, \Sigma_k^0 \right\} \right) \tag{10} \]

to identify the most likely speaker i^{ML} according to the ML criterion. The speech recognition result is employed to exclude speech pauses and garbage words by buffering the likelihood scores until a precise segmentation can be given. The speaker identification result enables adapting the corresponding codebook and controlling feature extraction. A sketch of the frame-level posterior computation is given below.
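The following NumPy sketch implements the forward update of Eqs. (7)-(9); the frame likelihoods p(x_t | i_t) are assumed to be given, e.g. evaluated per frame from the codebooks as in Eq. (10).

```python
import numpy as np

def codebook_posteriors(frame_lik, transition, prior):
    """Forward algorithm over speaker codebooks.
    frame_lik:  (T, S) p(x_t | i_t) for standard + N_Sp speaker codebooks
    transition: (S, S) speaker-change model, transition[i, j] = p(j | i)
    prior:      (S,)  p(i_1)
    Returns (T, S) posteriors p(i_t | x_{1:t}); the per-frame argmax gives
    the MAP codebook whose soft quantization is decoded."""
    T, S = frame_lik.shape
    post = np.zeros((T, S))
    p = prior * frame_lik[0]                    # Eq. (9), unnormalized
    post[0] = p / p.sum()
    for t in range(1, T):
        predicted = transition.T @ post[t - 1]  # Eq. (8)
        p = frame_lik[t] * predicted            # Eq. (7), unnormalized
        post[t] = p / p.sum()
    return post
```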
We use speaker adaptation to initialize and continuously adapt speaker specific codebooks based on recognized utterances. We only use the sufficient statistics of the standard codebook for reasons of computational efficiency. Due to limited adaptation data, the number of available parameters has to be balanced with the amount of speaker specific data. Eigenvoice (EV) adaptation is suitable when facing few data since only some 10 parameters α_l have to be estimated [3]. Mean vector adaptation can be implemented by a weighted sum of the eigenvoices e_{k,l}^{EV} and an offset, e.g. the original speaker independent mean vector μ_k^0:

\[ \mu_k^{EV} = \mu_k^0 + \sum_{l=1}^{L} \alpha_l \cdot e_{k,l}^{EV}. \tag{11} \]

Principal Component Analysis (PCA) can be applied to extract the eigenvoices. MAP adaptation allows individual adjustment of each Gaussian density when sufficient data is available [9, 5]. On extensive data the MAP adaptation approaches the Maximum Likelihood (ML) estimates

\[ \mu_k^{ML} = \frac{1}{n_k} \sum_{t=1}^{T} p(k \mid x_t, \Theta^0) \cdot x_t \tag{12} \]

where n_k = \sum_{t=1}^{T} p(k | x_t, Θ^0) denotes the number of softly assigned feature vectors. Therefore, we use a simple combination

\[ \mu_k^{opt} = (1 - \alpha_k) \cdot \mu_k^{EV} + \alpha_k \cdot \mu_k^{ML} \tag{13} \]
\[ \alpha_k = \frac{n_k}{n_k + \lambda}, \quad \lambda = \text{const} \tag{14} \]

to efficiently adjust the mean vectors of codebooks [4]. Covariance matrices are not modified. For convenience, the state alignment is omitted in our notation.
5 Independent Speaker Identification and Speech Recognition
In the preceding section a unified approach for speaker identification and speech recognition was introduced. Alternatively, a standard technique for speaker identification based on GMMs purely optimized to capture speaker characteristics can be employed. In combination with a speech recognizer in which several speaker profiles are operated in parallel, a reference implementation can easily be obtained. The corresponding setup, depicted in Fig. 3, can be summarized as follows:

– Front-end. The recorded speech signal is preprocessed to reduce background noises. The feature vectors comprise 11 mean normalized MFCCs and delta features. The 0th coefficient is replaced by a normalized energy.
Fig. 3. System architecture for parallel speaker identification and speaker specific speech recognition. Codebook selection is implemented as discussed before. Speaker identification is realized by additional GMMs.
– Speech recognition. Appropriate speaker specific codebooks are selected for the decoding of the recorded utterance as discussed before.

– Speaker identification. Subsequently, common GMMs purely representing speaker characteristics are used to identify the current user. A speaker independent UBM with diagonal covariance matrices is used as a template for speaker specific GMMs [5]. The UBM was trained by the EM algorithm. About 3.5 h of speech data originating from 41 female and 36 male speakers of the USKCP¹ database was incorporated into the UBM training. For each speaker about 100 command and control utterances, e.g. navigation commands, spelling and digit loops, were used. For testing, the ML criterion is applied to identify the current user. The codebooks of the speech recognizer and the GMMs are adapted according to this estimate.

– Speaker adaptation. The GMM models and the speaker specific codebooks of the identified speakers are continuously adjusted. Codebook adaptation is implemented as discussed before. GMMs are adjusted by MAP adaptation as found by Reynolds et al. [5]. However, we only use the sufficient statistics of the UBM:

\[ \mu_k^{MAP} = (1 - \alpha_k) \cdot \mu_k^{UBM} + \alpha_k \cdot \mu_k^{ML} \tag{15} \]
\[ \alpha_k = \frac{n_k}{n_k + \eta}, \quad \eta = \text{const} \tag{16} \]
\[ n_k = \sum_{t=1}^{T} p(k \mid x_t, \Theta_{UBM}). \tag{17} \]

Adaptation accuracy is supported here by the moderate complexity of the applied GMMs.

¹ The USKCP is an internal speech database for in-car applications which was collected by TEMIC Speech Dialog Systems, Ulm, Germany. The language is US-English.
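A sketch of the MAP mean update of Eqs. (15)-(17) from the UBM's sufficient statistics (NumPy); η = 4 is the best-performing value reported in the experiments below.

```python
import numpy as np

def map_adapt_means(mu_ubm, gamma, X, eta=4.0):
    """MAP adaptation of GMM mean vectors.
    mu_ubm: (N, d) UBM means
    gamma:  (T, N) posteriors p(k | x_t, Theta_UBM)
    X:      (T, d) speaker specific feature vectors"""
    n_k = gamma.sum(axis=0)                                  # Eq. (17)
    mu_ml = (gamma.T @ X) / np.maximum(n_k, 1e-10)[:, None]  # ML estimate
    alpha = (n_k / (n_k + eta))[:, None]                     # Eq. (16)
    return (1.0 - alpha) * mu_ubm + alpha * mu_ml            # Eq. (15)
```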
6 Experiments
Several experiments were conducted for both implementations to investigate speaker identification accuracy and the benefit for speech recognition.

6.1 Database
For the evaluation we employed a subset of the US-SPEECON [10] database. This subset comprises 50 male and 23 female speakers recorded in an automotive environment. The sampling rate is 11,025 Hz. Colloquial utterances with more than four words and mispronunciations were discarded, whereas digit and spelling loops were kept.

6.2 Evaluation
The evaluation was performed on 60 sets of five enrolled speakers selected randomly. From one utterance to the next, the probability of a speaker change is approximately 10 %. In the learning phase of each set, 10 utterances are employed for unsupervised initialization of each speaker model. Only the first two utterances of a new speaker are indicated; afterwards the current speaker has to be identified in a completely unsupervised manner. Then the speakers appear randomly, with at least five utterances spoken between two speaker turns. Both the Word Accuracy (WA) and the identification rate are examined. The speech recognizer without any speaker adaptation is used as the baseline. Short-term adaptation is implemented by an EV approach which applies an exponential weighting window to the adaptation data. This decay guarantees that speaker changes are captured within approximately five or six utterances if no speaker identification is employed. The speech recognizer applies grammars for digit and spelling loops, dedicated numbers and a grammar which contains all remaining utterances.

6.3 Results for Joint Speaker Identification and Speech Recognition
First, the joint speaker identification and speech recognition is examined for specific values of λ employed in speaker adaptation. The results are given in Table 1. They show a significant improvement of the WA with respect to both the baseline and short-term adaptation. The two special cases ML (λ ≈ 0) and EV (λ → ∞) clearly fall behind the combination of both adaptation techniques. MAP adaptation with speaker independent prior parameters is not able to track different speakers in our scenario for η ≥ 8. Furthermore, no notable difference in WA can be observed for 4 ≤ λ ≤ 20. Thus, speaker identification can be optimized independently of the speech recognizer and seems to reach an optimum of 94.64 % for λ = 4. For higher values the identification rates drop significantly.
Table 1. Comparison of different adaptation techniques for joint speaker identification and speech recognition

Speaker adaptation         WA [%]   Speaker ID [%]
Baseline                   85.23    -
Short-term adaptation      86.13    -
Combined adaptation:
  ML (λ ≈ 0)               86.89    81.54
  λ = 4                    88.10    94.64
  λ = 8                    88.17    93.49
  λ = 12                   88.16    92.42
  λ = 16                   88.18    92.26
  λ = 20                   88.20    91.68
  EV (λ → ∞)               87.51    84.71
MAP adaptation:
  η = 4                    87.47    87.43
  η = 8                    85.97    21.17

6.4 Results for Independent Speaker Identification and Speech Recognition
In comparison to the unified approach, the same experiments were repeated with the reference system characterized by separate modules for speaker identification and speech recognition. In Table 2 the results of this scenario are presented for several implementations with respect to the number of Gaussian distributions and values of the parameter η. Both the speaker identification and speech recognition rate reach an optimum for η = 4 and N = 64 or 128. For higher values of η this optimum is shifted towards a lower number of Gaussian distributions, as expected. Since the learning rate of the adaptation algorithm is reduced, only a smaller number of distributions can be efficiently estimated at the beginning. The performance of the speech recognizer is marginally reduced with higher η.

Table 2. Realization of parallel speaker identification and speech recognition. Speaker identification is implemented by several GMMs comprising 32, 64, 128 and 256 Gaussian distributions. MAP adaptation of mean vectors is used. Codebook adaptation uses λ = 12.

        η = 4            η = 8            η = 12           η = 20
N      WA [%]  ID [%]   WA [%]  ID [%]   WA [%]  ID [%]   WA [%]  ID [%]
32     88.01   88.64    88.06   88.17    87.98   87.29    87.97   87.50
64     88.13   91.09    88.06   89.64    87.98   87.92    87.92   85.30
128    88.04   91.18    87.94   87.68    87.87   84.97    87.82   80.09
256    87.92   87.96    87.97   85.59    87.90   81.20    87.73   76.48
In the next experiment not only mean vectors but also weights are modified by the MAP adaptation. The results are summarized in Table 3.
Table 3. Realization of parallel speaker identification and speech recognition. Speaker identification is implemented by several GMMs comprising 32, 64, 128 or 256 Gaussian distributions. MAP adaptation of weights and mean vectors is used. Codebook adaptation uses λ = 12.

        η = 4            η = 8            η = 12           η = 20
N      WA [%]  ID [%]   WA [%]  ID [%]   WA [%]  ID [%]   WA [%]  ID [%]
32     87.92   87.24    87.97   88.24    87.97   87.61    88.02   87.04
64     88.11   90.59    88.06   89.99    88.03   88.80    87.93   86.64
128    88.11   91.32    88.03   89.42    88.03   88.10    87.91   84.26
256    88.10   91.62    87.97   88.71    88.02   86.01    87.88   82.88
In the preceding experiment the speaker identification accuracy could be improved for η = 4 by increasing the number of Gaussian distributions to N = 128, whereas for N = 256 the identification rate dropped significantly. Now a steady improvement and an optimum of 91.62 % can be observed for N = 256. However, the identification rate approaches a limit: for η = 4, doubling the number of Gaussian distributions from 32 to 64 results in a 26 % relative improvement of the error rate, whereas the relative improvement achieved by the increase from 128 to 256 Gaussian distributions is only about 3.5 %. The optimum for speech recognition is again about 88.1 % WA. Finally, the comparison with the combined approach characterized by an integrated speaker identification is shown in Fig. 4 and Fig. 5.
Fig. 4. Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speech recognition (WA [%]). (a) Speech recognition realized by the reference implementation: MAP adaptation (η = 4) of mean vectors and weights (black) and of mean vectors only (dark gray), plotted over N = 32, 64, 128, 256. (b) Speech recognition implemented by the unified approach: results for speaker adaptation with predefined speaker identity (black) [4] as well as for joint speaker identification and speech recognition (dark gray), plotted over λ; the speaker independent baseline (BL) and short-term adaptation (ST) are shown for comparison.
Fig. 5. Comparison of the reference implementation (left) and the joint speaker identification and speech recognition (right) with respect to speaker identification (ID [%]). (a) Speaker identification rates of the reference implementation: MAP adaptation (η = 4) of mean vectors and weights (black) and of mean vectors only (dark gray), plotted over N = 32, 64, 128, 256. (b) Speaker identification rates of the joint speaker identification and speech recognition, plotted over λ.
Mean vector and weight adaptation are depicted for η = 4, which represents the best speech recognition and speaker identification rates in our experiments. Furthermore, the upper bound for speaker specific speech recognition is shown; there, the speaker identity is known when codebook adaptation is performed [4].
7 Summary and Conclusion
Two approaches have been developed to solve the problem of an unsupervised system comprising self-learning speaker identification and speaker specific speech recognition. Speaker identification and speech recognition use an identical front-end so that parallel feature extraction for speech recognition and speaker identification is avoided. Speaker specific speech recognition is realized by an on-line codebook selection. On an utterance level the speaker identity is estimated in parallel to speech recognition; multiple recognition passes are not required. The speech recognizer is thus enabled to create and continuously adapt speaker specific codebooks which allow a higher recognition accuracy in the long run. A 94.64 % speaker identification rate and 88.20 % WA were achieved by the unified approach for λ = 4 and λ = 20, respectively. The results for the baseline and the corresponding upper bound were 85.23 % and 88.90 % WA [4]. In the latter case it was assumed that the speaker identity is known. For the reference system, several GMMs are required for speaker identification in addition to the HMMs of the speech recognizer. Complexity therefore increases since both models have to be evaluated and adapted. An optimum of 91.18 % speaker identification rate was achieved for 128 Gaussian distributions and η = 4 when only the mean vectors were adapted. The best speech recognition result of 88.13 % WA was obtained for 64 Gaussian distributions. By adapting both
the mean vectors and weights, the speaker identification rate could be increased to 91.62 % for 256 Gaussian distributions and η = 4. The WA remained at the same level. For both implementations similar results for speech recognition were achieved even though the identification rates of the reference were significantly worse. This observation supports the finding that the speech recognition accuracy is relatively insensitive to moderate speaker identification error rates. Thus, different strategies can be applied to identify speakers without affecting the performance of the speech recognizer as long as appropriate codebooks are selected for speech decoding. However, the unified approach seems to be advantageous when unknown speakers have to be detected, as shown in [11]. Therefore, we propose to employ the unified approach to implement a speech controlled system which is operated in a completely unsupervised manner.
References
1. Furui, S.: Selected topics from 40 years of research in speech and speaker recognition. In: INTERSPEECH 2009, pp. 1–8 (2009)
2. Junqua, J.-C.: Robust Speech Recognition in Embedded Systems and PC Applications. Kluwer Academic Publishers, Dordrecht (2000)
3. Kuhn, R., Junqua, J.-C., Nguyen, P., Niedzielski, N.: Rapid speaker adaptation in eigenvoice space. IEEE Transactions on Speech and Audio Processing 8(6), 695–707 (2000)
4. Herbig, T., Gerl, F., Minker, W.: Fast adaptation of speech and speaker characteristics for enhanced speech recognition in adverse intelligent environments. In: The 6th International Conference on Intelligent Environments, IE 2010 (2010) (to appear)
5. Reynolds, D.A., Quatieri, T.F., Dunn, R.B.: Speaker verification using adapted Gaussian mixture models. Digital Signal Processing 10(1–3), 19–41 (2000)
6. Schukat-Talamazzini, E.G.: Automatische Spracherkennung. Vieweg (1995) (in German)
7. Class, F., Kaltenmeier, A., Regel-Brietzmann, P.: Optimization of an HMM-based continuous speech recognizer. In: EUROSPEECH 1993, pp. 803–806 (1993)
8. Class, F., Haiber, U., Kaltenmeier, A.: Automatic detection of change in speaker in speaker adaptive speech recognition systems. US 2003/0187645 A1 (2003)
9. Gauvain, J.-L., Lee, C.-H.: Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Transactions on Speech and Audio Processing 2(2), 291–298 (1994)
10. Iskra, D., Grosskopf, B., Marasek, K., van den Heuvel, H., Diehl, F., Kiessling, A.: SPEECON – speech databases for consumer devices: Database specification and validation. In: Proceedings of the Third International Conference on Language Resources and Evaluation, LREC 2002, pp. 329–333 (2002)
11. Herbig, T., Gerl, F., Minker, W.: Detection of unknown speakers in an unsupervised speech controlled system. In: Second International Workshop on Spoken Dialogue Systems Technology, IWSDS 2010 (2010) (to appear)
Issues in Predicting User Satisfaction Transitions in Dialogues: Individual Differences, Evaluation Criteria, and Prediction Models

Ryuichiro Higashinaka¹, Yasuhiro Minami², Kohji Dohsaka², and Toyomi Meguro²

¹ NTT Cyber Space Laboratories, NTT Corporation, 1-1 Hikarinooka, Yokosuka, 239-0847 Kanagawa, Japan
² NTT Communication Science Laboratories, NTT Corporation, 2-4 Hikaridai, Seika-cho, Soraku-gun, 619-0237 Kyoto, Japan
[email protected], {minami,dohsaka,meguro}@cslab.kecl.ntt.co.jp
Abstract. This paper addresses three important issues in automatic prediction of user satisfaction transitions in dialogues. The first issue concerns the individual differences in user satisfaction ratings and how they affect the possibility of creating a user-independent prediction model. The second issue concerns how to determine appropriate evaluation criteria for predicting user satisfaction transitions. The third issue concerns how to train suitable prediction models. We present our findings for these issues on the basis of the experimental results using dialogue data in two domains.
1 Introduction

Although predicting the overall quality of dialogues has been actively studied [7,12,13], only recently has work begun on ways to automatically predict user satisfaction transitions during a dialogue [2]. Predicting such transitions would be useful when we want to perform an in-depth turn-by-turn analysis of the performance of a dialogue system, and also when we want to pinpoint situations where the dialogue quality begins to degrade or improve, the discovery of which could be used to improve dialogue systems as well as to assist human operators at contact centers in improving customer satisfaction [9,11]. Since the work on automatic prediction of user satisfaction transitions is still in a preliminary phase, there are a number of issues that need to be clarified. This paper addresses three such issues and presents our findings based on experimental results. The first issue concerns the individual differences of user satisfaction ratings. In any work that deals with predicting user satisfaction, it is important to determine whether we should aim at creating user-independent or user-dependent prediction models. We investigate how the user satisfaction ratings of individuals differ on the basis of correlations and distributions of ratings, and discuss the feasibility of creating a user-independent prediction model. The second issue concerns the evaluation criteria for the prediction of user satisfaction transitions. In any engineering work, it is necessary to establish an evaluation measure. Previous work has used the mean squared error (MSE) of rating probabilities [2]; however, the MSE has a serious limitation: the dialogue has to follow a predefined scenario. We consider the MSE to be too restrictive for common use.
Table 1. Dialogue statistics in the AD and AL domains. Avg and SD denote the average number and the standard deviation of dialogue-acts within a dialogue. Since an utterance can contain multiple dialogue-acts, the number of dialogue-acts is always larger than that of utterances.

AD Domain: 90 dialogues
           # Utterances   # Dialogue-acts   Avg     SD
All        5180           5340              59.33   17.54
User       1890           2050              22.78    6.60
System     3290           3290              36.56   11.81

AL Domain: 100 dialogues
           # Utterances   # Dialogue-acts   Avg     SD
All        3951           4650              46.50    8.99
Speaker    2103           2453              24.53    5.69
Listener   1848           2197              21.97    5.25
In this paper, we propose several candidates for evaluation criteria and discuss which criteria should be used. The third issue concerns how to train suitable prediction models. In previous work, hidden Markov models (HMMs) have been used [2]. However, HMMs may not offer the best solution. Recent studies on sequential labeling have shown that conditional random fields (CRFs) [5] provide state-of-the-art performance in many NLP tasks, such as chunking and named entity recognition [10]. In addition, HMMs are generative models, whereas CRFs are discriminative ones. In this paper, we compare HMMs and CRFs to investigate which kind of model is more appropriate for the task of predicting user satisfaction transitions. The next section describes the dialogue data we use in detail. Section 3 describes the individual differences in user satisfaction ratings between human judges. Section 4 presents our candidates for the evaluation criteria, and Section 5 describes our experiments for comparing the prediction performance of HMMs and CRFs. Section 6 summarizes the paper and mentions future work.
2 Data Collection

We collected dialogue data in two domains: the animal discussion (AD) domain and the attentive listening (AL) domain. All dialogues are in Japanese. In both domains, the data were text dialogues. We did not use spoken dialogue data because we wanted to avoid particular problems of speech, such as filled pauses and overlaps, although we plan to deal with spoken dialogue in the future. The dialogues in the AD domain are human-machine dialogues and those in the AL domain are human-human dialogues; hence, we cover both cases of human-machine and human-human dialogues. In addition, neither domain has specific tasks/scenarios, meaning that our setting is more general than that in the previous work [2], where the course of a dialogue was strictly controlled by using scenarios.

2.1 Animal Discussion Domain

In the AD domain, the system and user talk about likes and dislikes about animals via a text chat interface. The data consist of 1000 dialogues between a dialogue system
and 50 human users. Each user conversed with the system 20 times, including two example dialogues at the beginning. All user/system utterances have been annotated with dialogue-acts. There are 29 dialogue-act types, including those related to self-disclosure, question, response, and greetings. For example, a dialogue-act DISC-P denotes one's self-disclosure about a proposition (whether one likes/dislikes a certain animal) and DISC-R denotes one's self-disclosure of a reason for a proposition (see [3] for the description of dialogue-acts and sample dialogues). From the data of the initial ten users sorted by user ID, we randomly extracted nine dialogues per user to form a subset of 90 dialogues (see Table 1 for the statistics). Then, two independent annotators (hereafter, AD-annot1 and AD-annot2), who were not the authors, labeled them with utterance-level user satisfaction ratings. More specifically, they provided three different user satisfaction ratings related to "Smoothness of the conversation", "Closeness perceived by the user towards the system", and "Willingness to continue the conversation". The ratings ranged from 1 to 7, where 1 is the worst and 7 the best. Before actual annotation, the annotators took part in a tutorial session so that their standards for rating could be firmly established. The annotators carefully read each utterance and gave a rating after each system utterance according to how they would have felt after receiving each system utterance if they had been the user in the dialogue. To make the situation more realistic, they were not allowed to look ahead at the dialogue beyond the current utterance. At the beginning of a dialogue, the ratings always started from four (neutral). We obtained 3290 ratings for 3290 system utterances (cf. Table 1) from each annotator. In this work, we had third persons (not the actual participants of the conversations) judge user satisfaction for the sake of reliability and consistency.

2.2 Attentive Listening Domain

In the AL domain, a listener attentively listens to the other in order to satisfy the speaker's desire to speak and make himself/herself heard. Figure 1 shows an excerpt of a listening-oriented dialogue together with utterance-level user satisfaction ratings (see [6] for details of this domain). We collected such listening-oriented dialogues using a website where users taking the roles of listeners and speakers were matched up to have conversations. A conversation was done through a text-chat interface. The participants were instructed to end the conversation after approximately ten minutes. Within a three-week period, each of the 37 speakers had about two conversations a day with each of the ten listeners, resulting in our collecting 1260 listening-oriented dialogues. All dialogues were annotated with dialogue-acts. There were 46 dialogue-act types in this domain. Although we cannot describe the full details of our dialogue-acts for lack of space, we have dialogue-acts DISC-EVAL-POS for one's self-disclosure of his/her positive evaluation towards a certain entity, DISC-HABIT for one's self-disclosure of his/her habit, and INFO for delivery of objective information. Then, we made a subset of the data by randomly selecting ten dialogues for each of the ten listeners to obtain 100 dialogues for annotating user satisfaction ratings (see Table 1 for the statistics).
Two independent annotators (hereafter, AL-annot1 and AL-annot2), who were neither the authors nor the annotators for the AD domain, provided utterance-level ratings after all listeners' utterances to express how they would have felt after receiving the listeners' utterances.
Fig. 1. Excerpt of a dialogue with AL-annot1's utterance-level user satisfaction ratings for smoothness (Sm), closeness (Cl), and good listener (GL) in the AL domain. SPK and LIS denote speaker and listener, respectively. Both the speaker and listener are human.

Utterance (dialogue-acts)                                             Sm  Cl  GL
LIS: You know, in spring, Japanese food tastes delicious.              5   5   5
     (DA: DISC-EVAL-POS)
SPK: This time every year, I make a plan to go on a healthy diet.
     But . . . (DA: DISC-HABIT)
LIS: Uh-huh (DA: ACK)                                                  6   5   6
SPK: The temperature goes up suddenly! (DA: INFO)
SPK: It's always too late! (DA: DISC-EVAL-NEG)
LIS: Clothing worn gets less and less when not being able to           6   6   6
     lose weight. (DA: DISC-FACT)
SPK: Well, people around me soon get used to my body shape
     though. (DA: DISC-FACT)

Table 2. Correlation (ρ) of ratings. Granularity indicates the levels of user satisfaction ratings. The granularity (a) uses the original 7 levels of ratings, (b) uses 3 levels (we assigned low for 1-2, middle for 3-5, and high for 6-7), (c) uses the same 3 levels with different thresholds [low for 1-3, middle for 4, high for 5-7], (d) uses 2 levels [low for 1-4, high for 5-7], and (e) uses the same 2 levels but with the thresholds [low for 1-3, high for 4-7].

                       AD Domain                         AL Domain
Granularity    Smoothness Closeness Willingness  Smoothness Closeness Good Listener
(a) 7 ratings    0.18       0.15      0.27         0.18       0.10      0.11
(b) 3 ratings    0.17       0.13      0.18         0.04       0.05      0.11
(c) 3 ratings    0.13       0.11      0.21         0.14       0.08      0.08
(d) 2 ratings    0.20       0.17      0.31         0.18       0.13      0.14
(e) 2 ratings    0.30       0.30      0.32         0.18       0.11      0.04
After a tutorial session, the annotators gave three ratings as in the AD domain; namely, smoothness, closeness, and "good listener". Instead of willingness, we have a "good listener" criterion here, asking how good the annotator thinks the listener is from the viewpoint of attentive listening; for example, how well the listener is making it easy for the speaker to speak. All ratings ranged from 1 to 7. We obtained 1848 ratings for 1848 listener utterances (cf. Table 1) from each annotator.
3 Individual Differences

We investigated how the user satisfaction ratings of two independent annotators differ in order to gain insight into whether it is reasonable for us to aim for user-independent prediction models. Table 2 shows the rather low correlation coefficients (Spearman's rank correlation coefficients, ρ) of the ratings of our two independent annotators for the AD and AL domains.
Fig. 2. Distributions of the smoothness ratings in the AD domain. The histogram on the left is the distribution for AD-annot1; that on the right is the distribution for AD-annot2. (Both histograms show frequency over ratings 1–7.)

Fig. 3. Distributions of the good listener ratings in the AL domain. The histogram on the left is the distribution for AL-annot1; that on the right is the distribution for AL-annot2. (Both histograms show frequency over ratings 1–7.)
Here, we first calculated the correlation coefficient for each dialogue and then averaged the coefficients over all dialogues. Since it may be too difficult for the 7 levels of user satisfaction ratings to correlate, we changed the granularity of the ratings to 3 levels (i.e., low, middle, high) and even 2 levels (i.e., low and high) for calculating the correlation coefficients. However, this did not greatly improve the correlations in either domain. It is quite surprising that even the simple choice of high/low shows very low correlation. From these results, it is clear that the ratings given to user satisfaction transitions are likely to differ greatly among individuals and that it may be difficult to create a user-independent prediction model; therefore, as a preliminary step, we deal with user-dependent prediction models in this paper. We also investigated the distributions of the ratings for the annotators. Figure 2 shows the distributions for the smoothness rating in the AD domain, and Fig. 3 shows the distributions for the good listener rating in the AL domain. It can be seen that, in the AD domain, the distributions are rather similar, meaning that the two annotators provided ratings with roughly the same ratios. This, together with the low correlation shown in Table 2, indicates that the annotators allocate the same rating very differently. As for the AL domain, we see that the distributions differ greatly: AL-annot1 rated most of the utterances 4-5, whereas AL-annot2's ratings follow a normal-distribution-like pattern, which is another indication of the difficulty of creating a user-independent prediction model; the ranges of ratings as well as their output probabilities could differ greatly among individuals. Here, the fact that AL-annot1 rated most of the utterances 4-5 can be rather problematic for training prediction models because the output distribution of
the trained model would follow a similar distribution, producing only 4-5 ratings. Such a model would not be able to detect good [rating=7] or bad [rating=1] ratings, which may make the prediction models useless. We examine how this bias of ratings affects the prediction performance in Section 5.
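For illustration, the granularity mapping and the per-dialogue averaging of correlation coefficients can be sketched as follows; the thresholds implement variant (b) of Table 2, and the use of scipy as well as the treatment of constant rating sequences are our assumptions, not the authors' code.

```python
import numpy as np
from scipy.stats import spearmanr

def to_3_levels(ratings):
    """Granularity (b): low for 1-2, middle for 3-5, high for 6-7."""
    return [0 if r <= 2 else (1 if r <= 5 else 2) for r in ratings]

def mean_dialogue_correlation(dialogues, coarsen=None):
    """dialogues: list of (annot1_ratings, annot2_ratings), one pair per dialogue."""
    rhos = []
    for a, b in dialogues:
        if coarsen is not None:
            a, b = coarsen(a), coarsen(b)
        rho, _ = spearmanr(a, b)     # Spearman's rank correlation per dialogue
        rhos.append(rho)
    return float(np.nanmean(rhos))   # constant sequences give NaN and are ignored

# e.g. mean_dialogue_correlation(data)               -> granularity (a)
#      mean_dialogue_correlation(data, to_3_levels)  -> granularity (b)
```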
4 Evaluation Criteria

We conceived of two kinds of evaluation criteria: one for evaluating individual matches and the other for evaluating distributions. We do not consider the MSE of rating probabilities [2] because its use is too restrictive and because we believe the ideal evaluation criterion should be applicable to any hypothesis ratings as long as reference ratings are available.

4.1 Evaluating Individual Matches

Since our task is to predict user satisfaction transitions, it is obviously important that the predicted rating matches that of the reference (i.e., human judgment). Therefore, we have the match rate (MR) and the mean absolute error (MAE) to calculate the rating matches. Here, the MR treats all mismatches equally, whereas the MAE takes the distance between ratings into account; namely, 6 is closer to 7 than to 1. In addition, we calculate Spearman's rank correlation coefficient (ρ) so that the correspondence of the hypothesis and reference ratings can be taken into account. They are derived using the equations below. In the equations, R (= {R_1 ... R_L}) and H (= {H_1 ... H_L}) denote the reference and hypothesis rating sequences for a given dialogue, respectively. L is the length of R and H. Note that they have the same length.

(1) Match Rate (MR):
$$\mathrm{MR}(R, H) = \frac{1}{L}\sum_{i=1}^{L}\mathrm{match}(R_i, H_i), \tag{1}$$
where 'match' returns 1 or 0 depending on whether R_i matches H_i.

(2) Mean Absolute Error (MAE):
$$\mathrm{MAE}(R, H) = \frac{1}{L}\sum_{i=1}^{L}|R_i - H_i|. \tag{2}$$

(3) Spearman's rank correlation coefficient (ρ):
$$\rho(R, H) = \frac{\sum_{i=1}^{L}(R_i - \bar{R})(H_i - \bar{H})}{\sqrt{\sum_{i=1}^{L}(R_i - \bar{R})^2 \sum_{i=1}^{L}(H_i - \bar{H})^2}}, \tag{3}$$
where $\bar{R}$ and $\bar{H}$ denote the average values of R and H, respectively.
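These three criteria translate almost literally into code; the following sketch is a direct NumPy transcription of Eqs. (1)–(3), assuming R and H are equal-length integer rating sequences.

```python
import numpy as np

def match_rate(R, H):                                     # Eq. (1)
    R, H = np.asarray(R), np.asarray(H)
    return float(np.mean(R == H))

def mean_absolute_error(R, H):                            # Eq. (2)
    R, H = np.asarray(R, float), np.asarray(H, float)
    return float(np.mean(np.abs(R - H)))

def correlation(R, H):                                    # Eq. (3)
    R, H = np.asarray(R, float), np.asarray(H, float)
    num = np.sum((R - R.mean()) * (H - H.mean()))
    den = np.sqrt(np.sum((R - R.mean()) ** 2) * np.sum((H - H.mean()) ** 2))
    return float(num / den)
```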
4.2 Evaluating Rating Distributions

As we saw in Fig. 3, the rating distributions of the annotators may vary greatly. Therefore, it may be important to take the rating distributions into account in evaluation. To this end, we can use the Kullback-Leibler divergence (KL), which can measure the similarity of distributions. Having a similar distribution may not necessarily mean that the prediction is successful, because in cases where reference ratings gather around just a few rating values (see, for example, the left-hand side of Fig. 3 for AL-annot1's distribution), there is a possibility of inappropriately giving high value to prediction models that output only a few frequent ratings; such models cannot predict other ratings, which is not a desirable property of a prediction model. In the practical as well as the information-theoretic sense, we have to correctly predict rare but still important cases. Therefore, in addition to the KL, we use the match rate per rating (MR/r) and the mean absolute error per rating (MAE/r). These criteria evaluate how accurately each individual rating can be predicted; namely, the accuracy for predicting one rating is valued equally with that for any other rating, irrespective of the distribution of ratings in the reference. We use the following equations for the KL, MR/r and MAE/r.

(4) Kullback-Leibler Divergence (KL):
$$\mathrm{KL}(R, H) = \sum_{r=1}^{K} P(H, r)\cdot\log\frac{P(H, r)}{P(R, r)}, \tag{4}$$
where K is the maximum user satisfaction rating (i.e., 7 in our case), R and H denote the sequentially concatenated reference/hypothesis rating sequences of all dialogues, and P(∗, r) denotes the occurrence probability that a rating r is found in an arbitrary rating sequence.

(5) Match Rate per Rating (MR/r):
$$\mathrm{MR/r}(R, H) = \frac{1}{K}\sum_{r=1}^{K}\frac{\sum_{i\in\{i|R_i=r\}}\mathrm{match}(R_i, H_i)}{\sum_{i\in\{i|R_i=r\}} 1}, \tag{5}$$
where R_i and H_i denote the ratings at the i-th positions.

(6) Mean Absolute Error per Rating (MAE/r):
$$\mathrm{MAE/r}(R, H) = \frac{1}{K}\sum_{r=1}^{K}\frac{\sum_{i\in\{i|R_i=r\}}|R_i - H_i|}{\sum_{i\in\{i|R_i=r\}} 1}. \tag{6}$$
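The distribution-oriented criteria can be sketched analogously; how ratings that never occur in the reference are handled (skipped here rather than dividing by zero) is our assumption, with K = 7 as in the text.

```python
import numpy as np

def kl_divergence(R, H, K=7):                             # Eq. (4)
    pR = np.bincount(np.asarray(R), minlength=K + 1)[1:] / len(R)
    pH = np.bincount(np.asarray(H), minlength=K + 1)[1:] / len(H)
    mask = pH > 0            # terms with P(H, r) = 0 contribute nothing
    return float(np.sum(pH[mask] * np.log(pH[mask] / pR[mask])))

def match_rate_per_rating(R, H, K=7):                     # Eq. (5)
    R, H = np.asarray(R), np.asarray(H)
    total = 0.0
    for r in range(1, K + 1):
        idx = R == r
        if idx.any():        # skip ratings that never occur in the reference
            total += np.mean(H[idx] == r)
    return total / K

def mae_per_rating(R, H, K=7):                            # Eq. (6)
    R, H = np.asarray(R, float), np.asarray(H, float)
    total = 0.0
    for r in range(1, K + 1):
        idx = R == r
        if idx.any():
            total += np.mean(np.abs(R[idx] - H[idx]))
    return total / K
```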
4.3 Selecting Appropriate Evaluation Criteria

We have so far presented six evaluation criteria. Although they can all be useful, it would still be desirable if we could choose a single criterion for simplicity and also for practical use. We made three assumptions for selecting the most suitable criterion.
Fig. 4. Topology of our HMM. The states for ratings 1 and 2 are connected ergodically. An oval marked speaker1/speaker2 indicates a state for speaker1/speaker2. Arrows denote transitions and the numbers before speaker1/speaker2 are state IDs. Boxes group together the states related to a particular rating. (The diagram shows states 1:speaker1 and 2:speaker2 for rating 1, and 3:speaker1 and 4:speaker2 for rating 2.)
First, the suitable criterion should not evaluate “random choice” highly. Second, it should not evaluate “no-choice” highly, such as when the prediction is done simply by using a single rating value. In other words, since “random choice” and “no-choice” do not perform any prediction, they should show the lowest performance when we use the suitable criterion. Third, the suitable criterion should be able to evaluate the prediction accuracy independent of individuals because it would be difficult for researchers and developers in the field to adopt a criterion that is too sensitive to individual differences for a reliable comparison. We also believe that the prediction accuracy should be similar among individuals because of the fundamental difficulty in predicting user satisfaction [4]; for a computational model, predicting one person’s ratings would be as difficult as predicting the other person’s. Therefore, we consider the suitable evaluation criterion should produce similar values for different individuals. In the next section, we experimentally find the best evaluation criterion that satisfies these assumptions.
5 Prediction Experiment

We trained our prediction models using HMMs and CRFs and compared their prediction performance. Note that we trained these models for each annotator in each domain, following the results in Section 3. As baselines and as the requirements for selecting the best evaluation criterion, we prepared a random baseline (hereafter, RND) and a "no-choice" baseline. Our "no-choice" baseline produces the most common rating 4 as predictions; hence, this is a majority baseline (hereafter, MJR).

5.1 Training Data

Our task is to predict a user satisfaction rating at each evaluation point in a dialogue. We decided to predict the user satisfaction rating after each dialogue-act because a dialogue-act is one of the basic units of dialogue. We created the training data by aligning the dialogue-acts with their user satisfaction ratings. Since we have ratings only after system/listener utterances, we first assumed that the ratings for dialogue-acts corresponding to user/speaker utterances were the same as those after the previous system/listener utterances.
Fig. 5. Topology of our CRF. The area within the dotted line represents the scope of our features for predicting the rating r_0. (Each position i carries a speaker ID s_i, a dialogue-act DA_i, and a rating r_i; the feature window spans positions −2 to +2.)
In addition, since a system/listener utterance may contain multiple dialogue-acts, its dialogue-acts are given the same rating as the utterance. This process results in our creating a sequence <s_1, DA_1, r_1> ··· <s_N, DA_N, r_N> for each dialogue, where s_i denotes the speaker of a dialogue-act, DA_i the i-th dialogue-act, r_i the rating for DA_i, and N the number of dialogue-acts in a dialogue. We created such sequences for our dialogue data. Our task is to predict r_1 ... r_N from <s_1, DA_1> ··· <s_N, DA_N>.

5.2 Training HMMs

From the training data, we trained HMMs in a manner similar to [2]. We have K groups of states, where K is the maximum rating value, i.e., 7. Each group represents a particular rating k (1 ≤ k ≤ K). Figure 4 shows the HMM topology. For the sake of simplicity, the figure only shows the case when we have only two ratings: 1 and 2. Each group has two states: one for representing the emission of one speaker (conversational participant) and the other for the emission of the other speaker. We used this topology because it has been successfully utilized to model two-party conversations [6]. In this HMM, all states are connected ergodically; that is, all states can transit to all other states. As emissions, we used a speaker ID (a binary value s ∈ {0, 1}, indicating speaker1 or speaker2), a dialogue-act, and a rating score. The number of dialogue-acts is 29 in the AD domain and 46 in the AL domain. A speaker ID s is emitted with a probability of 1.0 from the states corresponding to the speaker s. A rating score k is emitted with a probability of 1.0 from the states representing the rating k. Therefore, a datum having a speaker ID s and a rating k is always assigned to a state representing s and k in the training phase. We used the EM algorithm for the training. In decoding, we made the HMM ignore the output probability of rating scores and searched for the best path using the Viterbi algorithm [8]. Since the states on the best path represent the most likely ratings, we can translate the state IDs into the corresponding rating values. For example, if the best path goes through state IDs {1,3,4,2} in Fig. 4, the predicted rating sequence becomes <1,2,2,1>.
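The translation from the best state path to ratings is mechanical; the sketch below assumes state IDs are assigned as in Fig. 4, i.e., states 2k−1 and 2k belong to rating k (this numbering scheme is our assumption).

```python
def states_to_ratings(state_path):
    """Map Viterbi state IDs to ratings: states 2k-1 and 2k both encode rating k."""
    return [(s + 1) // 2 for s in state_path]

# The example from Fig. 4: the best path through states 1, 3, 4, 2
# yields the predicted rating sequence <1, 2, 2, 1>.
assert states_to_ratings([1, 3, 4, 2]) == [1, 2, 2, 1]
```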
5.3 Training CRFs

We used a linear-chain CRF based on a maximum a posteriori probability (MAP) criterion [5]. The most probable rating for each dialogue-act was estimated using the following features: the current dialogue-act, the previous and succeeding two dialogue-acts, the speaker IDs for these dialogue-acts, and the previous and succeeding two ratings. Figure 5 illustrates the topology of our CRF and the scope of the features (a concrete sketch of this feature window is given at the end of this section).

5.4 Evaluation Procedure

We performed a ten-fold cross validation. We first separated the training data into ten disjoint sets. Then, we used nine sets for training the HMMs and CRFs, and used the remaining one for testing. We repeated this ten times in a round-robin fashion. In the evaluation, from the output of our prediction models, we extracted predictions only after system/listener dialogue-acts because the reference ratings were originally given only after them. We compared the predictions with the reference sequences using the six evaluation criteria we proposed in Section 4.

5.5 Results

Tables 3 and 4 show the evaluation results for the AD domain. Tables 5 and 6 show the results for the AL domain. To compare the means of the MR, MAE, and ρ, we performed a non-parametric multiple comparison test (Steel-Dwass test [1]). We did not perform a statistical test for the other criteria because it was difficult to perform sample-wise comparison for distributions. Before looking into the individual values, we first need to fix the evaluation criterion. According to our assumptions for choosing appropriate criteria (see Section 4.3), RND and MJR should not show good performance when they are compared to any prediction model because they do not perform any prediction. Since MJR outperforms the others in the MR, MAE, and MAE/r, we should not be using such criteria. Using the third assumption, we can also eliminate ρ and KL because their values differ greatly among individuals. For example, ρ for the smoothness in the AD domain for AD-annot1 is 0.187 (column HMM), whereas that for AD-annot2 is just 0.05 (column HMM), and the KL for the closeness in the AL domain for AL-annot1 is 0.093 (column CRF), whereas that for AL-annot2 is 0.029 (column CRF). The elimination of the KL can also be supported by the fact that the similar rating distributions of AD-annot1 and AD-annot2 did not result in high correlations, which suggests that the shape of rating distributions does not necessarily mean the match of ratings (cf. Fig. 2). As a result, we end up with only one evaluation criterion: MR/r, which becomes our recommended evaluation criterion. Here, we do not argue that the MR/r is the best possible measure. There could be more appropriate ones that we could not introduce in this paper. In addition, we do not mean that other measures are not useful; for example, both the MR and MR/r approach the same value of 1 as the prediction accuracy improves. We recommend the MR/r simply because, among our proposed criteria, it evaluates non-predicting baselines lowly and seems less susceptible to individual differences than the others. When we focus on the MR/r, we see that HMMs have consistently better values than CRFs (except for just one case). Therefore, we can say that the current best model is achieved by HMMs. One explanation of this result may be that the parameters of the CRFs over-tuned to the data with higher posterior probabilities; consequently, the CRFs showed poor performance for the data with lower posterior probabilities.
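To make the feature window of Section 5.3 concrete, the sketch below builds the observation features around a position i; the feature-name strings follow the style of common CRF toolkits (e.g., crfsuite) and are our own convention, and the neighboring ratings are treated as label context handled by the linear chain itself rather than as observed features.

```python
def crf_features(speakers, dialogue_acts, i):
    """Observation features for predicting the rating at position i (window -2..+2)."""
    feats = {}
    for off in range(-2, 3):
        j = i + off
        if 0 <= j < len(dialogue_acts):
            feats["da[%+d]" % off] = dialogue_acts[j]   # dialogue-act in the window
            feats["spk[%+d]" % off] = speakers[j]       # its speaker ID
    return feats

# Example with hypothetical tags:
# crf_features([0, 1, 1, 0], ["DISC-P", "ACK", "DISC-R", "QUESTION"], 2)
```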
Table 3. The MR, MAE, ρ, KL, MR/r and MAE/r for the random baseline (RND), majority baseline (MJR), HMMs and CRFs for the AD domain. The ratings of AD-annot1 were used as references. The asterisks, '+', 'h', and 'c' indicate statistical significance (p<0.01) over RND, MJR, HMM, and CRF, respectively. Bold font indicates the best value for a certain user satisfaction rating.

           Smoothness                        Closeness                         Willingness
        RND     MJR      HMM     CRF      RND     MJR     HMM     CRF       RND     MJR     HMM     CRF
MR      0.142   0.376∗h  0.275∗  0.308∗   0.146   0.340∗  0.279∗  0.273∗    0.155   0.298∗  0.283∗  0.305∗
MAE     1.995   0.996∗h  1.420∗  1.252∗   2.004   1.094∗  1.431∗  1.392∗    1.947   1.085∗  1.403∗  1.245∗
ρ      -0.007   NA       0.187∗  0.109   -0.002   NA      0.213∗  0.110     0.025   NA      0.169   0.183∗
KL      0.284   1.011    0.162   0.031    0.184   1.079   0.092   0.061     0.208   1.222   0.125   0.013
MR/r    0.149   0.143    0.217   0.172    0.143   0.143   0.231   0.162     0.150   0.143   0.224   0.208
MAE/r   2.280   1.714    1.782   1.820    2.242   1.714   1.702   1.836     2.219   1.714   1.705   1.636
Table 4. Evaluation results for the AD domain with the ratings of AD-annot2 as references. See Table 3 for the notations in the table.

           Smoothness                        Closeness                         Willingness
        RND     MJR      HMM     CRF      RND     MJR     HMM     CRF       RND     MJR      HMM     CRF
MR      0.147   0.306∗   0.257∗  0.278∗   0.146   0.320∗  0.240   0.275     0.154   0.288∗   0.277∗  0.312∗
MAE     2.010   1.132∗h  1.738   1.431∗   2.025   1.157∗  1.600∗  1.419∗    2.068   1.183∗h  1.595∗  1.225∗
ρ      -0.024   NA       0.050   0.166∗  -0.011   NA      0.202∗  0.171∗    0.009   NA       0.105   0.245∗
KL      0.154   1.231    0.162   0.027    0.140   1.217   0.184   0.041     0.258   1.280    0.215   0.038
MR/r    0.149   0.143    0.210   0.177    0.136   0.143   0.232   0.176     0.158   0.143    0.234   0.238
MAE/r   2.246   1.714    2.017   1.922    2.292   1.714   1.726   1.852     2.250   1.714    1.938   1.680
Table 5. Evaluation results for the AL domain with the ratings of AL-annot1 as references. See Table 3 for the notations in the table.

           Smoothness                        Closeness                          Good Listener
        RND     MJR     HMM     CRF       RND     MJR      HMM     CRF       RND     MJR     HMM     CRF
MR      0.150   0.472∗  0.436∗  0.519∗    0.135   0.556∗   0.421∗  0.551∗h   0.142   0.370∗  0.423∗  0.505∗+
MAE     1.878   0.688∗  0.803∗  0.642∗h   1.832   0.593∗h  0.897∗  0.575∗h   1.937   0.806∗  0.850∗  0.619∗+h
ρ      -0.002   NA      0.221∗  0.241∗    0.005   NA       0.101   0.209∗   -0.054   NA      0.214∗  0.294∗
KL      0.944   0.781   0.090   0.080     1.000   0.611    0.122   0.093     1.005   1.020   0.088   0.081
MR/r    0.158   0.143   0.228   0.193     0.161   0.143    0.231   0.190     0.138   0.143   0.222   0.202
MAE/r   2.327   1.714   1.878   1.962     2.221   1.714    1.994   1.579     2.366   1.714   1.805   1.596
Although HMMs performed comparatively better than CRFs, it should also be noted that the absolute values of MR/r are only 0.2–0.24; further improvements are crucial. One possibility for improving the HMMs is to incorporate more features, such as word-level features and those related to the dialogue history. Another possibility may be to increase the number of states for more accurate modeling of the dialogue-act/rating sequences. We leave further analyses as future work.
Table 6. Evaluation results for the AL domain with the ratings of AL-annot2 as references. See Table 3 for the notations in the table.

           Smoothness                        Closeness                          Good Listener
        RND     MJR      HMM     CRF      RND     MJR       HMM     CRF      RND     MJR       HMM     CRF
MR      0.138   0.292∗   0.263∗  0.289∗   0.146   0.310∗h   0.226∗  0.263∗   0.150   0.300∗h   0.244∗  0.273∗
MAE     2.031   1.128∗h  1.489∗  1.264∗   1.972   1.023∗ch  1.508∗  1.297∗h  1.945   1.053∗ch  1.522∗  1.313∗
ρ      -0.018   NA       0.183∗  0.161∗   0.017   NA        0.077   0.044    0.021   NA        0.130   0.074
KL      0.426   1.251    0.101   0.023    0.337   1.188     0.094   0.029    0.342   1.207     0.129   0.038
MR/r    0.132   0.143    0.210   0.185    0.148   0.143     0.195   0.168    0.149   0.143     0.208   0.185
MAE/r   2.330   1.714    1.974   1.799    2.293   1.714     1.772   1.760    2.210   1.714     1.841   1.699
6 Summary and Future Work

This paper addressed three important issues in the automatic prediction of user satisfaction transitions in dialogues: individual differences, evaluation criteria, and prediction models. We first showed, by our observation of great individual differences in our rating data, that it is rather difficult to create a user-independent prediction model. Then, we introduced six possible candidates for evaluation criteria of user satisfaction transitions and experimentally found that the match rate per rating (MR/r) is currently the most appropriate criterion. Finally, in our experiment, we found that HMMs provide better prediction accuracies than CRFs. Our contribution lies in setting a course for future research in predicting user satisfaction transitions in dialogues, especially by suggesting an appropriate evaluation criterion and by revealing the standard performance attainable with the currently available prediction models. Our future work includes creating more appropriate evaluation criteria, improving the prediction accuracy using more features, and further verifying our findings using the data of more individuals and more domains.
References
1. Dwass, M.: Some k-sample rank-order tests. In: Olkin, I., et al. (eds.) Contributions to Probability and Statistics, pp. 198–202. Stanford University Press, Stanford (1960)
2. Engelbrecht, K.P., Gödde, F., Hartard, F., Ketabdar, H., Möller, S.: Modeling user satisfaction with hidden Markov models. In: Proc. SIGDIAL, pp. 170–177 (2009)
3. Higashinaka, R., Dohsaka, K., Isozaki, H.: Effects of self-disclosure and empathy in human-computer dialogue. In: Proc. SLT, pp. 109–112 (2008)
4. Higashinaka, R., Miyazaki, N., Nakano, M., Aikawa, K.: Evaluating discourse understanding in spoken dialogue systems. ACM Trans. Speech Lang. Process. 1, 1–20 (2004)
5. Lafferty, J.D., McCallum, A., Pereira, F.C.N.: Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In: Proc. ICML, pp. 282–289 (2001)
6. Meguro, T., Higashinaka, R., Dohsaka, K., Minami, Y., Isozaki, H.: Analysis of listening-oriented dialogue for building listening agents. In: Proc. SIGDIAL, pp. 124–127 (2009)
7. Möller, S., Engelbrecht, K.P., Schleicher, R.: Predicting the quality and usability of spoken dialogue services. Speech Communication 50(8–9), 730–744 (2008)
8. Rabiner, L.R., Juang, B.H.: An introduction to hidden Markov models. IEEE ASSP Magazine 3(1), 4–16 (1986)
9. Subramaniam, L.V., Faruquie, T.A., Ikbal, S., Godbole, S., Mohania, M.K.: Business intelligence from voice of customer. In: Proc. ICDE, pp. 1391–1402 (2009)
10. Suzuki, J., McDermott, E., Isozaki, H.: Training conditional random fields with multivariate evaluation measures. In: Proc. COLING-ACL, pp. 217–224 (2006)
11. Takeuchi, H., Subramaniam, L.V., Nasukawa, T., Roy, S., Balakrishnan, S.: A conversation-mining system for gathering insights to improve agent productivity. In: Proc. IEEE International Conference on E-Commerce Technology and IEEE International Conference on Enterprise Computing, E-Commerce, and E-Services, pp. 465–468 (2007)
12. Walker, M.A., Langkilde-Geary, I., Hastie, H.W., Wright, J., Gorin, A.: Automatically training a problematic dialogue predictor for a spoken dialogue system. Journal of Artificial Intelligence Research 16(1), 293–319 (2002)
13. Walker, M.A., Litman, D., Kamm, C.A., Abella, A.: PARADISE: A framework for evaluating spoken dialogue agents. In: Proc. EACL, pp. 271–280 (1997)
Expansion of WFST-Based Dialog Management for Handling Multiple ASR Hypotheses

Naoto Kimura¹,², Chiori Hori¹, Teruhisa Misu¹, Kiyonori Ohtake¹, Hisashi Kawai¹, and Satoshi Nakamura¹

¹ National Institute of Information and Communications Technology (NICT), MASTAR Project, Keihanna Science City, Japan
{naoto.kimura,chiori.hori}@nict.go.jp
² Nara Institute of Science and Technology (NAIST), 8916-5 Takayama-cho, Ikoma-shi, Nara, 630-0192, Japan
Abstract. We previously proposed a weighted finite-state transducer-based dialog manager (WFSTDM) as a platform for expandable and adaptable dialog systems. In this platform, all rules and/or models for dialog management (DM) are expressed in WFST form, and the WFSTs are used to accomplish various tasks via multiple modalities. With this framework, we constructed a statistical dialog system using the user concept and system action tags, acquired from an annotated corpus of human-to-human spoken dialogs, as input and output labels of the WFST. We introduced a spoken language understanding (SLU) WFST for converting user utterances to user concept tags, a dialog scenario WFST for converting user concept tags to system action tags, and a sentence generation (SG) WFST for converting system action tags to system utterances. The tag sequence probabilities of the dialog scenario WFST were estimated using a spoken dialog corpus for hotel reservation. The SLU, scenario and SG WFSTs were then composed into a dialog management WFST which determines the next action of the system in response to the user input. In our previous research, we evaluated the dialog strategy by referring to the manual transcription. In this paper, we present the performance of the WFSTDM when speech recognition hypotheses are input. To alleviate the degradation of DM performance caused by speech recognition errors, we expand the WFSTDM to handle multiple hypotheses of speech recognition together with confidence scores which indicate the acoustic and linguistic reliability of speech recognition. We also evaluated the accuracy of the SLU results and the correctness of the system actions selected by the dialog management WFST. We confirmed that the performance of dialog management was enhanced by choosing the optimal action among all the WFST paths for multiple hypotheses (N-best) of speech recognition in consideration of the confidence scores.

Keywords: Spoken dialog, spoken language understanding, weighted finite-state transducer (WFST), statistical dialog management, speech recognition, confidence score, N-best.
1 Introduction

We aim to construct robust spoken dialog systems through which human and machine can converse as freely as two humans do. Since a conventional dialog
system requires a user to answer in response to system questions, the user's responses are limited by the questions and the user can rarely behave in a flexible manner during the dialog. In such a system-driven dialog, accurate automatic speech recognition (ASR) results are obtained by restricting the users' spontaneous input. State-of-the-art ASR technologies recognize spontaneous speech in real time with vocabularies of more than 100M words [1]. It is therefore high time to construct dialog systems which accept spontaneous user dialog behaviors. To realize such dialog systems, corpus-based dialog management is a promising means. Humans show typical patterns of dialog, especially when the target task is limited, and thus statistical models of a dialog scenario to determine the system's next action can be learned from a dialog corpus. Furthermore, a dialog corpus makes it possible to cover more linguistic expressions for understanding the user's intention and to generate system responses as natural as those in the corpus. Statistical models for spoken language understanding (SLU) and the dialog scenario are trained using a dialog corpus which is annotated with the concept tags of user and agent. Dialog management (DM) determines the system's next action in response to user input using such statistical models. We proposed a weighted finite-state transducer (WFST) based DM platform [2]. WFSTs are widely used in speech and language processing [3]. The WFST-based DM (WFSTDM) system accepts user inputs and outputs system responses according to a dialog management WFST which represents rules/statistical models. The WFSTDM makes it possible to construct an expandable and adaptable dialog system which handles multiple tasks and modalities [4], since various statistical models represented by multiple WFSTs can be composed using the WFST operations. To build more complex dialog systems, more statistical models are required to be composed into a dialog management WFST, and the size of the WFST becomes huge. The cost of decoding such a huge WFST can be reduced by the WFST optimization operations [5]. Statistical models for SLU, dialog scenario, and sentence generation (SG) for system responses can be transformed into WFSTs, and these WFSTs are then composed into a dialog management WFST [6]. The dialog management WFST can be decoded using the WFSTDM system, in which user utterances input to the system are converted into system responses, i.e., sentences. Figure 1 shows the WFSTDM system.
Fig. 1. WFST-based dialog management system: the SLU, scenario and SG WFSTs are composed into a dialog management WFST; the user's speech is decoded by a speech recognition system, processed by the WFSTDM, and the system's response is output as synthesized speech via text-to-speech.
Focusing on statistical models for dialog systems, a statistical model of system action sequences for dialog strategy [7] and one of user concept sequences for understanding [8][9][10] have thus far been investigated independently. We constructed a statistical model for the dialog scenario obtained from the tag sequences of both the clerk and customer sides [11]. This model determines the system's next action based on the probabilities of multiple user concept hypotheses conditioned on the previous system action. We examined the performance of the WFSTDM using this model as a dialog management WFST with the manual transcriptions of user utterances as input. In this paper we present the performance of the WFSTDM for ASR hypothesis inputs. Since the SLU models are trained from the manual transcription of the dialog corpus, ASR errors may degrade the performance of DM. Thus far, to minimize such degradation of the DM performance, multiple ASR hypotheses and confidence scores indicating the acoustic and linguistic reliability of hypotheses have been used for SLU in dialog systems [12][13]. User inputs are rejected by dialog systems if the confidence score for an ASR result is lower than a threshold, and efficient dialog strategies are chosen by avoiding redundant system confirmations. As another approach, SLU is carried out by selecting the most likely understanding, weighted by the confidence scores. In this paper, we expand the WFSTDM to handle multiple ASR hypotheses and confidence scores in determining the most likely next system action.
2 WFST-Based Dialog System

2.1 Spoken Language Understanding WFST

A spoken language understanding (SLU) WFST is a pattern detector which detects a phrase expressing a user concept in an input sentence and translates it to the corresponding user concept tag. We extracted a set of utterance sentences corresponding to each concept tag from the corpus, then extracted n-word phrases with high relative frequency from each sentence and embedded them in the SLU WFST as distinguishable expression patterns of the user concept tags. Specifically, the transition weights in the WFST were determined so that the paths corresponding to longer phrases have a lower cost, i.e., the matching is based on longest pattern matching. In this paper, we constructed the SLU WFST using frequent n-word phrases (n = 1–6) in a corpus for hotel reservation.

2.2 Scenario WFST

A statistical model for the dialog scenario was trained using the sequence of annotation tags in the corpus. Although there are alternatives in choosing responses to the user, the scenario WFST enables the dialog system to determine which system action should be taken in response to the user input in each state of the dialog discourse. In this paper, we constructed the scenario WFST from a 3-gram model of the annotation tag sequences in the corpus.

2.3 Dialog Management WFST

The SLU WFST, scenario WFST and SG WFST were composed and then optimized using WFST operations. The finally composed WFST is denoted as the dialog management WFST. The manual transcription or the speech recognition result of a user utterance
is input to the dialog management WFST, and the next system action tag is output from the WFST. In this experiment, the SLU WFST and the scenario WFST, without the SG WFST, were combined into the dialog management WFST.

2.4 Algorithm of WFSTDM

A WFST T over a semiring K is defined by an 8-tuple T = (Σ, Δ, Q, i, F, E, λ, ρ) where:
(1) Σ is a finite set of input symbols;
(2) Δ is a finite set of output symbols;
(3) Q is a finite set of states;
(4) i ∈ Q is an initial state;
(5) F ⊂ Q is a set of final states;
(6) E ⊂ Q × (Σ ∪ {ε}) × (Δ ∪ {ε}) × K × Q is a finite set of transitions;
(7) λ is an initial weight;
(8) ρ : F → K is a final weight function.
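For concreteness, the 8-tuple can be encoded as plain data structures; this minimal Python sketch mirrors the definition above (with tropical-semiring defaults as an assumption) and is not the authors' software.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Transition:              # an element of E
    src: int                   # origin state in Q
    inp: str                   # input symbol in Sigma ("" for epsilon)
    out: str                   # output symbol in Delta ("" for epsilon)
    weight: float              # weight in the semiring K
    dst: int                   # destination state in Q

@dataclass
class WFST:
    initial: int                       # i
    finals: Dict[int, float]           # F together with the final weights rho(s)
    transitions: List[Transition]      # E
    initial_weight: float = 0.0        # lambda (0 is the tropical "multiplication" identity)
```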
Given an input symbol sequence, the output symbol sequence of a WFST can be obtained as that on the best path with the minimum (or maximum) cumulative weight. The best path can be found efficiently with Dynamic Programming (DP) among the successful paths, from the initial state to one of the final states, which accept the input sequence. In dialog management, however, the system has to respond to the user immediately in each turn. Thus the system needs to choose the most appropriate output sequence according to the current situation. We show the algorithm of our WFSTDM in Table 1. Steps 1 to 5 perform initial actions that can be taken by epsilon transitions from the initial state. Steps 6 to 10 execute actions responding to the user's input at each turn. Steps 11 to 13 check task completion. In the algorithm, π indicates a path consisting of consecutive transitions e_1, …, e_L in the WFST. For a path π, we denote its origin state by p[π], its end state by n[π], and its cumulative weight by w[π], where w[π] = w[e_1] ⊗ ··· ⊗ w[e_L]. Here "⊕" and "⊗" are the two formally defined binary operations, i.e., "addition" and "multiplication", over the semiring. In this paper, we use the tropical semiring, in which the "addition" and "multiplication" of two real-valued weights are defined as the minimum of the two and ordinary addition, respectively. We could also use the log semiring, in which each weight is defined as a minus log probability and the corresponding "addition" and "multiplication" are defined accordingly. Note that a (cumulative) weight is assumed to be better than the others if it is smaller than the others. For a set of states S, an input symbol sequence x, and an output symbol sequence y, we define P(S, x, y) as the set of all paths each of which originates from one of the states in S, accepts x, and outputs y. In Steps 1 and 7, the system selects the most appropriate action sequence on the paths in P(S, x, y). C(π) is the expected cost for taking π and the possible future transitions from n[π]. This is a look-ahead for choosing a more appropriate action at each turn, which may be set as a (discounted) reward as used in POMDPs. S_t is the set of states the system occupies at turn t, and W_t(s) is the cumulated weight for each state s in S_t.
Table 1. Algorithm of WFST-based dialog management

// execute initial actions
1. ŷ_0 ← argmin_{y ∈ Δ*} ⊕_{π ∈ P({i}, ε, y)} λ ⊗ w[π] ⊗ C(π)    // choose the best action sequence in paths with epsilons
2. execute actions corresponding to ŷ_0
3. S_0 ← {s′ | s′ = n[π], π ∈ P({i}, ε, ŷ_0)}
4. foreach s′ ∈ S_0 do W_0(s′) ← ⊕_{π: s′ = n[π], π ∈ P({i}, ε, ŷ_0)} λ ⊗ w[π]
5. t ← 1
// execute actions for user's input
6. receive a user's input symbol sequence x_t
7. ŷ_t ← argmin_{y ∈ Δ*} ⊕_{π ∈ P(S_{t−1}, x_t, y)} W_{t−1}(p[π]) ⊗ w[π] ⊗ C(π)    // choose the best action sequence for x_t
8. execute actions corresponding to ŷ_t
9. S_t ← {s′ | s′ = n[π], π ∈ P(S_{t−1}, x_t, ŷ_t)}
10. foreach s′ ∈ S_t do W_t(s′) ← ⊕_{π: s′ = n[π], π ∈ P(S_{t−1}, x_t, ŷ_t)} W_{t−1}(p[π]) ⊗ w[π]
// check task completion
11. W̃_F ← {⊕_{s′ ∈ S_t ∩ F} W_t(s′) ⊗ ρ(s′)} ⊗ {⊕_{s″ ∈ S_t} W_t(s″)}^{−1}
12. if W̃_F < Threshold then exit
13. t ← t + 1 and go to Step 6    // proceed to the next turn
Steps 11 and 12 check task completion based on W̃F, the relative overall cumulated weight of all the successful paths; Threshold is a pre-defined constant. In Step 13, control returns to Step 6 to receive the next user input. In general, a set of (weighted) rules or (hidden) Markov models can be represented as a WFST, and once such a model is embedded into a WFST it can be combined with other WFSTs through a number of well-defined operations. In particular, the composition of two WFSTs yields a WFST that performs the two translations in sequence. Suppose we prepare two WFSTs independently, where one translates a word sequence into its corresponding concept sequence for language understanding, and the other translates a concept sequence into system actions for dialog management. If we compose these two WFSTs, the dialog manager can accept word sequences
directly using the composed WFST. In addition, optimization operations are effective for reducing the size of the WFST and its computational cost at runtime.

2.5 Handling Multiple ASR Hypotheses

In our previous work, we used transcripts of the utterances of several dialogs as input to a dialog management WFST in order to evaluate the effectiveness of the dialog management. In this work, we instead use speech recognition results, including recognition errors, as the input. To minimize the impact of recognition errors on dialog management, we extend the WFSTDM to handle the N-best hypotheses of speech recognition together with confidence scores that indicate the acoustic and linguistic reliability of the recognition. The new WFSTDM chooses the optimal action over all the WFST paths for the multiple hypotheses, while each path is weighted with the confidence score of the corresponding hypothesis. The N-best hypotheses often include a hypothesis with a lower error rate than the best-scored one, and, depending on the dialog context, such a hypothesis can be selected instead. In this way, we aim not only to reduce the impact of recognition errors but also to enhance the accuracy of selecting the next system action. To enable the WFSTDM to accept N-best hypotheses with confidence scores from a speech recognizer, we modify Step 7 in Table 1 as:
ŷt ← argmin_{y∈Δ*} ⊕_{π∈P(St−1, xt,n, y), xt,n∈xt} Wt−1(p[π]) ⊗ w[π] ⊗ CM(xt,n) ⊗ C(π),
where we consider xt = { xt,n | n = 1, ..., N }, i.e., xt represents the list of N-best hypotheses and xt,n is the n-th hypothesis in the list. CM(xt,n) denotes a confidence score for xt,n. For example, it can be calculated from the word posterior probabilities for the speech input O as:
CM(xt,n) = − Σ_{w∈xt,n} log P(w | O)
Note that we use the negative log probability, following the usual WFST convention. In our previous work, we assumed that sentence boundaries were established facts, since they were annotated in the transcripts. It is, however, difficult to detect the boundaries precisely, because they become ambiguous in speech recognition results. Therefore, when using speech recognition results, we change the WFSTDM so that it accepts multiple sentences at once and translates them into an output tag sequence without regard to sentence boundaries, i.e., we consider multiple paths with different sentence boundaries for the input sequence.
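As an illustration of the confidence scoring above, the following hedged sketch computes CM(xt,n) from per-word posterior probabilities and ranks an N-best list; the hypothesis texts and posterior values are invented for the example:

```python
import math

def cm_score(word_posteriors):
    """CM(x) = - sum over words of log P(w | O); smaller is better,
    matching the tropical-semiring convention of the paper."""
    return -sum(math.log(p) for p in word_posteriors)

# Hypothetical 3-best list: (hypothesis text, per-word posteriors P(w|O)).
nbest = [
    ("i want a twin room",  [0.91, 0.88, 0.95, 0.60, 0.97]),
    ("i want a queen room", [0.91, 0.88, 0.95, 0.35, 0.97]),
    ("i want to win rooms", [0.91, 0.80, 0.40, 0.30, 0.50]),
]

for text, posts in nbest:
    print(f"{cm_score(posts):6.3f}  {text}")
# The dialog manager adds these scores (via the semiring product) to each
# path that accepts the corresponding hypothesis.
```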
3 Evaluation Experiment

3.1 Evaluation Data

We used 25 hotel-reservation dialogs between an English speaker and a Japanese speaker as evaluation data [6]. They are annotated with the Interchange Format (IF), an interlingua for speech translation systems. Since IF was not originally designed for DM, we modified the original IF tag set into a consistent set for DM and revised the user concept tags and the system action tags. This resulted in 146 tags, comprising 58 user concept tags and 88 system action tags. An example of IF tags and the numbers of turns and tags are shown in Tables 2 and 3, respectively.

Table 2. An example of IF tags
Turn 1 (System). IF-tags: a:greeting, a:introduce-self, a:offer+help
  お電話ありがとうございます、(Thank you for calling,) ニューヨークシティホテルでございます。(this is New York City Hotel,) ご用件をお伺いいたします。(may I help you?)

Turn 2 (User). IF-tags: c:greeting, c:introduce-self, c:request-action+reservation+hotel
  もしもし、(Hello,) わたし田中弘子といいますが、(my name is Hiroko Tanaka,) 部屋の予約をお願いしたいんですけれども。(and I would like to make a reservation.)

Turn 3 (System). IF-tags: a:acknowledge, a:request-information+temporal
  はい、(Yes,) いつがご希望でしょうか。(and when would you like to stay?)
Table 3. Numbers of turns and tags used in the systems

                  User             System
#turn/dialog      10.76 (269/25)   10.8 (270/25)
#tag/turn         1.79 (482/269)   2.91 (786/270)
3.2 Speech Recognition

We used Julius for speech recognition [8]. The acoustic model is a Japanese gender-independent tri-phone model, and the language models are 2-gram and 3-gram models learned from a travel dialog corpus that includes 87,194 utterances in 2,206 dialogs. For the 25 test dialogs, the test-set perplexity of the 3-gram language model is 31.97 and the number of out-of-vocabulary (OOV) words is 111 (0.81%). The word accuracy of the speech recognition is 78.7%.
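As a reminder of how such a test-set perplexity is computed from an n-gram LM (this is the generic definition, not anything specific to Julius; the probabilities below are invented):

```python
import math

def perplexity(logprobs_base10, n_words):
    # PP = 10 ** ( -(1/N) * sum of log10 P(w_i | history) )
    return 10 ** (-sum(logprobs_base10) / n_words)

# Hypothetical per-word log10 probabilities over a tiny test set:
lp = [-1.2, -0.8, -2.1, -1.5, -0.9]
print(round(perplexity(lp, len(lp)), 1))  # ~20.0
```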
3.3 Evaluation Method

To evaluate the dialog system, we simulated dialog discourses using the test set as the reference dialog. For the corpus of 25 IF-tagged dialogs, we used a leave-one-out method in which each dialog serves as the test set and the other 24 dialogs as the training set. To measure the performance of SLU and of predicting the system's next actions, we input the set of user utterances at each turn to the WFST, following the dialog discourse in the test set, performed spoken language understanding, and made the system predict the next action tag sequence in response to the user input. We use Mean Reciprocal Rank (MRR) as the evaluation metric for the prediction of the system's next actions. MRR is defined as:
MRR = (1/M) ∑_{i=1}^{M} 1/Ri
where Ri is the rank of the correct system action tag sequence at the i-th turn, and M is the number of system turns. MRR becomes larger when the correct answer is ranked higher among the estimated candidates, which makes it suitable for evaluating prediction performance over the many candidates arising in dialogs.
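For concreteness, a small worked sketch of this metric (the ranks are invented for illustration):

```python
def mrr(ranks):
    # MRR = (1/M) * sum over turns of 1/R_i
    return sum(1.0 / r for r in ranks) / len(ranks)

# Hypothetical ranks of the correct action tag sequence over five system turns:
print(mrr([1, 4, 2, 10, 5]))  # (1 + 0.25 + 0.5 + 0.1 + 0.2) / 5 = 0.41
```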
4 Experimental Result

4.1 Accuracy of Spoken Language Understanding

Since the SLU WFST is learned from the corpus and operates by converting user utterances into user concept tags based on longest-phrase matching, the dialog system cannot precisely understand a user utterance when the word-N-gram-based phrases representing user concepts are absent from the training data or are misrecognized by the speech recognizer. To determine the upper bound of the performance of the SLU WFST, we evaluated how many correct user concept tags are covered by the multiple concept tag hypotheses obtained from the longest-phrase patterns in the user utterances. Figure 2 shows the coverage of the correct concept tags when the manual transcription, the 1-best ASR hypothesis, and multiple ASR hypotheses were input. Sentence boundaries are given in the manual transcriptions, while they are unknown for the ASR results. The results show that if user utterances are recognized correctly by the ASR, the SLU WFST has the potential to understand at most 80% of the user concepts correctly. When ASR results are input, using multiple hypotheses limits the degradation of the coverage compared with the 1-best ASR results: 75.3% of the user concepts can still be understood even though speech recognition results tend to be degraded. However, the potential shown in Figure 2 does not fully represent the performance of the SLU WFST, because it is still difficult to select the correct concept tags from multiple hypotheses. To evaluate the performance of the SLU WFST, the concept tag sequences output by the SLU WFST were compared with the references. The accuracy, taking insertions, deletions and substitutions into account, is shown in Figure 3.
Fig. 2. Coverage of correct concept tags in concept tag hypotheses for SLU [%]. Transcription: manual transcription, ASR(N-best): N-best speech recognition results, N= 1, 5, 10.
[Figure omitted: concept accuracy, 35-65%, for six input conditions: transcription with/without sentence boundaries; 1-best and 5-best speech recognition results, each with and without confidence scores.]
Fig. 3. Performance of Spoken Language Understanding using SLU WFSTs
The decline in concept accuracy caused by missing sentence boundaries is 20% for the manual transcription. None of the speech recognition results have explicit sentence boundaries, so the upper bound of their performance is the 45% given by the manual transcription without sentence boundaries. Roughly 5.5% of the accuracy decline was induced by recognition errors. While the accuracy of the 1-best was 39.5%, that of the 5-best improved to 40.5% thanks to the consideration of multiple ASR hypotheses; this matches the tendency shown in Figure 2. Comparing the ASR results with and without the confidence score, the accuracies for both the 1-best and the 5-best were slightly enhanced by using the confidence score. These results also show that the multiple hypotheses and confidence scores of ASR contribute to spoken language understanding by the SLU WFST.
4.2 Prediction Performance of the System's Next Action

The SLU WFST whose performance is shown in Figure 3 was composed with the scenario WFST and then optimized into a dialog management WFST. Table 4 lists the average sizes of the 25 WFSTs.

Table 4. Sizes of the WFSTs
WFST                    #state   #transition
Scenario (tag 3-gram)      708        2976
SLU                       2833       15737
Composed                  7611       47462
Optimized                 7194       44744
To test the performance of the WFSTDM, the prediction power of the system's next action tags was evaluated using MRR. The MRRs for the manual transcriptions and the 1-best and 5-best ASR results are compared, and the effect of the confidence scores on the DM is shown in Figure 4.
[Figure omitted: MRR, 0.15-0.25, for the same six input conditions as in Figure 3.]
Fig. 4. Prediction power of the system's next action tag using the WFSTDM
The performance for the manual transcription is 0.246 MRR, which is reduced to 0.219 MRR when the sentence boundaries are unknown. The 1-best and 5-best ASR results yield MRRs of 0.173 and 0.179, respectively. These results show that the performance of SLU is reflected directly in the overall performance of the WFSTDM. We confirmed that the performance is enhanced by using the N-best (5-best) hypotheses with confidence scores. This is because longer phrases representing concept tags are covered by the multiple ASR hypotheses, and the increased coverage improves both spoken language understanding and dialog management. Additionally, the confidence score raises the rank of the correct action tag among the multiple hypotheses produced by the WFSTDM, and this boosting improves the MRR.
Figure 5 shows the correlation between the speech recognition accuracy and the SLU accuracy/MRR of the WFSTDM. The word accuracies averaged over the 25 dialogs ranged from 61% to 76.3%. The SLU accuracy and the MRR improve as the performance of the ASR increases, and there is a strong correlation among them.
[Figure omitted: concept accuracy (35-41%) and MRR (0.14-0.18) plotted against speech recognition accuracy (61.74%, 72.19%, 76.3%), for spoken language understanding and next-action prediction with N = 1 and N = 5.]
Fig. 5. Comparison of spoken language understanding performance and MRR
5 Conclusions

We evaluated the WFSTDM, in which statistical models trained on a human-to-human dialog corpus for hotel reservation respond to manual transcriptions and ASR results of user inputs. To minimize the degradation of DM performance due to speech recognition errors, we extended the WFSTDM to handle multiple ASR hypotheses and confidence scores indicating the acoustic and linguistic reliability of speech recognition. We evaluated the accuracy of the spoken language understanding (SLU) results and, using MRR, the correctness of the system actions selected by the dialog management WFST. To address the problem of sentence boundaries, we applied a framework in which user concept tags can be output repeatedly from multiple utterances in each turn. We confirmed that the performance of the SLU WFST and of the DM was enhanced by choosing the optimal action among all the WFST paths for the multiple (N-best) hypotheses of speech recognition and by taking the confidence scores into account. Although we implemented a sentence generation (SG) WFST module [6], the SG WFST was not composed into the dialog management WFST, so the performance of the entire dialog management WFST including the SG WFST in response to speech input has not yet been tested. Future work includes human judgment of the appropriateness of the sentences generated as system responses by the SG WFST and evaluation experiments with humans via speech input and output.
References

[1] Hori, T., Hori, C., Minami, Y., Nakamura, A.: Efficient WFST-based one-pass decoding with on-the-fly hypothesis rescoring in extremely large vocabulary continuous speech recognition. IEEE Transactions on Audio, Speech and Language Processing 15, 1352–1365 (2007)
[2] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Dialog management using weighted finite-state transducers. In: 9th Annual Conference of the International Speech Communication Association, pp. 211–214 (2008)
[3] Mohri, M., Pereira, F., Riley, M.: Weighted finite-state transducers in speech recognition. Computer Speech and Language 16, 69–88 (2002)
[4] Kayama, K., Kobayashi, A., Mizukami, E., Misu, T., Kashioka, H., Kawai, H., Nakamura, S.: Spoken Dialog System on Plasma Display Panel Estimating Users' Interest by Image Processing. In: 1st International Workshop on Human-Centric Interfaces for Ambient Intelligence (July 2010) (to appear)
[5] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Recent Advances in WFST-based Dialog System. In: 10th Annual Conference of the International Speech Communication Association, pp. 268–271 (2009)
[6] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Weighted Finite State Transducer Based Statistical Dialog Management. In: IEEE Workshop on Automatic Speech Recognition & Understanding, pp. 490–495 (2009)
[7] Hurtado, L.F., Griol, D., Sanchis, E., Segarra, E.: A stochastic approach to dialog management. In: IEEE Automatic Speech Recognition and Understanding Workshop, pp. 226–231 (2005)
[8] Nagata, M., Morimoto, T.: First steps towards statistical modeling of dialogue to predict the speech act type of the next utterance. Speech Communication 15, 193–203 (1994)
[9] Higashinaka, R., Nakano, M., Aikawa, K.: Corpus-based Discourse Understanding in Spoken Dialogue Systems. In: 41st Annual Meeting of the Association for Computational Linguistics, vol. 1, pp. 240–247 (2003)
[10] Higashinaka, R., Nakano, M.: Ranking Multiple Dialog States by Corpus Statistics to Improve Discourse Understanding in Spoken Dialog Systems. IEICE Trans. Inf. & Syst. E92.D(9), 1771–1782 (2009)
[11] Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical dialog management applied to WFST-based dialog systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4793–4796 (2009)
[12] Komatani, K., Kawahara, T.: Flexible mixed-initiative dialogue management using concept-level confidence measures of speech recognizer output. In: International Conference on Computational Linguistics, vol. 1, pp. 467–473 (2000)
[13] Hazen, T.J., Seneff, S., Polifroni, J.: Recognition confidence scoring and its use in speech understanding systems. Computer Speech and Language 16, 49–67 (2002)
[14] Lee, A., Shikano, K., Kawahara, T.: Real-time word confidence scoring using local posterior probabilities on tree trellis search. In: IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 1, pp. I-793–I-796 (2004)
Evaluation of Facial Direction Estimation from Cameras for Multi-modal Spoken Dialog System

Akihiro Kobayashi, Kentaro Kayama, Etsuo Mizukami, Teruhisa Misu, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura

Spoken Language Communication Group, MASTAR Project, National Institute of Information and Communications Technology (NICT), Japan
{akihiro-k,kayama,etsuo.mizukami,teruhisa.misu,hideki.kashioka,hisashi.kawai,satoshi.nakamura}@nict.go.jp
http://www.nict.go.jp/
Abstract. This paper presents the results of an evaluation of image-processing techniques for estimating facial direction from a camera for a multi-modal spoken dialog system on a large display panel. The system is called a "proactive dialog system" and aims to present acceptable information in an acceptable time. It can detect non-verbal information, such as changes in gaze and facial direction as well as head gestures of the user during dialog, and recommend suitable information. We implemented a dialog scenario that presents sightseeing information on the system. Experiments consisting of 100 sessions with 80 subjects were conducted to evaluate the system's efficiency. The system's clarity rating rises particularly when the dialog contains recommendations. Keywords: Facial direction estimation, Head detection, User interface.
1 Introduction
Image processing techniques that estimate face and gaze direction from cameras have been widely studied in recent years, and they are used as multimodal user interfaces because these directions are thought to indicate the user's attention [1,2]. It is, however, difficult to evaluate the efficiency of these techniques in multimodal applications because many factors influence users' impressions. In this paper, we developed a multi-modal dialog system for digital signage using image processing techniques, and evaluated the performance of the image processing and its efficiency in our application based on dialog corpora and videos of 100 sessions in which 80 subjects actually spoke to our system. "Digital signage" is an advertising medium that selects and displays appropriate content in real time according to the user, and it has been studied actively in recent years [3]. Most such systems, however, display content one-sidedly or require explicit input devices such as touch panels. We expect natural interfaces to make these systems more user-friendly. For example, the ability for a user to draw out desirable information from a system via spoken dialogs, and for the system to predict the user's interests and recommend appropriate information,
would produce an ambient intelligence. The construction of such systems can lead to applications such as next-generation digital signage. Therefore, we proposed a novel interactive information display system that realizes proactive dialogs between human and computer based on image processing techniques [4,5]. A "proactive dialog system" refers to a system that actively presents acceptable information in an acceptable time, in addition to being able to respond adequately to queries [6]. The proposed system is based on spoken dialog. It is also able to detect non-verbal information, such as changes in gaze and facial direction and head gestures of the user during dialog, and to recommend appropriate information. We constructed a prototype of this system with data and dialog scenarios for sightseeing guidance for Kyoto. Experiments were held with 80 subjects (100 sessions in total) to analyze user behavior during system use and to evaluate the system's usefulness and the performance of the image processing used in it. In this paper, we present the software and hardware architecture, the image processing technology of the system, and user evaluations. The hardware and software architecture, with details of the spoken language recognition and display control parts, is described in section 2. The image processing that detects users and estimates gaze and facial directions is explained in section 3. The application implemented in this system is described in section 4. The experiment to evaluate the system's abilities is reported in section 5.
2 System Architecture

2.1 Outline of the Total System
This section discusses the prototype of a spoken dialog system with a plasma display panel that we constructed as a system integrating non-verbal information recognition and spoken dialog. The system is built on the premise of being fixed in a public space, such as a tourist information office, and presenting information to a general audience. The main input interface is assumed to be spoken language, and image processing is used to enhance dialog quality by estimating user interests. The output interface utilizes a wide screen divided into four windows displaying a range of information. The character shown in Fig. 2 appears on the screen, explains the displayed content, and controls the dialog via speech synthesis.
2.2 Hardware
The prototype of the proposed system is shown in Fig. 1. It consists of the following parts:

50-inch plasma display panel (PDP). A 70-cm-wide and 120-cm-high portrait display on an 80-cm-high base, with a resolution of 1080 × 1920.
Three pose-controllable monocular cameras. Grasshopper cameras made by Point Grey Research Inc., with lenses attached so that the face of a user is correctly placed in view.
Fig. 1. Spoken dialog system on plasma display panel
Fig. 2. Screen example of PDP information dialog system
Stereo vision camera. A Bumblebee2, also made by Point Grey Research Inc., with a horizontal viewing angle of about 60 degrees.
Directional microphone. A CS-3e made by SANKEN.
Loudspeaker. A PN-AZ10 made by Sony.
2.3 Software
The software modules are divided into four function types:
Image processing Speech recognition and parsing Dialog control Display control and speech synthesis
The image processing algorithms implemented in the system are human and head area detection and face and gaze direction estimation, The details of the algorithms are described in the next section and the dialog control which includes integration method for image and speech inputs is described in section 4. The speech recognition and parsing are explained in this section. First, voice activity detection (VAD) is performed for the input audio signal and the uttering part is cut out. This part is sent to a module that contains ATRASR [7] which was developed by ATR as a speech recognition engine and performs speech recognition and parsing. Parsed results are sent finally to the dialog control.
The display control outputs a screen as in Fig. 2. In principle, the screen is divided into two or four windows, each rendered using HTML. In Fig. 2, a brief description of Kinkaku-ji temple is displayed on the upper left, a list of restaurants near Kinkaku-ji on the lower left, and details of one restaurant on the lower right. The character agent is displayed in the center of the lower-right window. She performs several actions as a virtual conversational partner and is equipped with lip synchronization: the shape of her mouth is generated from the current vowel in cooperation with the speech synthesis.
3 Non-verbal Information Processing via Images

3.1 Detection of Head Area
Candidates for the human head area are detected from the stereo vision camera with the following steps:

1. Construct a three-dimensional occupancy grid (10-cm cells)
2. Divide the grid into 20-cm-height slices and cluster the occupied areas
3. Segment the clusters into individual person areas with an adaptation of the crossing hierarchy method [8]
4. Detect candidate areas for the human head
5. Filter the candidates by evaluating their head likelihood
Fig. 3. Person detection from stereo images
An example of the processing is shown in Fig. 3. In the crossing hierarchy method, three-dimensional space is divided into overlapping plates: for example, 20-cm plates at heights of 180–160 cm, 170–150 cm, and 160–140 cm. Areas that are determined to be a human region in an upper plate are then propagated to
lower plates, and candidate areas of the human region are decided sequentially; in this way entire human regions are extracted. The original method used multiple stereo vision cameras, but this system uses only one. Moreover, the camera of our system is mounted for a plan view, which causes serious occlusion but makes installation easy. In this system, clustering is first performed on each plate. Each cluster is then classified according to whether it inherits a cluster of the upper plate or appears as a new region. If a cluster inherits multiple regions it is divided appropriately, and if multiple clusters inherit a single region they are integrated if necessary. These processes enable robust detection of each human region even when there are occlusions, close passes, and contacts among multiple people, as sketched in the code below. The center part of Fig. 3 shows the 18 plates at heights ranging from 200–180 cm down to 30–10 cm used in this system. The PDP lies at the lower center of each square. Black regions contain no objects; grey regions indicate invisible areas that are out of view or occluded; the other colored regions are candidates for human regions, with the same color indicating the same person. After that, the upper region, about 30 cm from the top of each human region, is detected as a candidate area for the human head. The possibility that each candidate area is a human head is then evaluated based on its height, distance from the PDP, size, shape, and the head position in the previous frame. Areas that exceed an a priori threshold are finally decided to be human heads.
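The paper does not give an implementation, but the top-down label propagation across plates could be sketched roughly as follows; this simplified version omits the divide/merge handling described above, and the grid shape and overlap test are assumptions for illustration:

```python
import numpy as np
from scipy import ndimage

def propagate_person_labels(occupancy):
    """occupancy: boolean array (n_plates, H, W), plates ordered top to bottom.
    Returns integer person labels per plate, inherited from the plate above."""
    labels = np.zeros(occupancy.shape, dtype=int)
    next_id = 1
    for p in range(occupancy.shape[0]):
        clusters, n = ndimage.label(occupancy[p])      # cluster this plate
        for c in range(1, n + 1):
            mask = clusters == c
            if p > 0:
                overlap = labels[p - 1][mask]          # labels on the plate above
                overlap = overlap[overlap > 0]
            else:
                overlap = np.array([], dtype=int)
            if overlap.size:                           # inherit the dominant upper label
                labels[p][mask] = np.bincount(overlap).argmax()
            else:                                      # otherwise a new region appears
                labels[p][mask] = next_id
                next_id += 1
    return labels
```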
3.2 Facial Direction Estimation
The system controls three high-resolution monocular cameras to capture the human head regions obtained from the above processes. Facial direction is then estimated by each monocular camera as follows (Fig. 4), and the mean over the cameras is used to determine where a user is looking:

1. Detect face regions. In this system, images of 800 × 600 pixels are input at 15 frames per second. If the system fails to detect a face or to track facial parts in an image, a face detection routine using Haar-like features is executed in the next frame.
2. Detect and track facial parts. The system detects 45 feature points on the face using an active appearance model (AAM); the initial coordinates of the points are the values of the previous frame (if facial parts were also detected there) or a priori values (if the face is newly detected) [9]. AAM applies principal component analysis to the vector consisting of the coordinates of the facial-part features in the image and the intensities of the pixels in the face region, and learns the correlation between changes of the feature point locations and changes of the view. This enables tracking non-rigid objects such as facial parts.
3. Estimate facial direction. Six degrees of freedom (DOF) of the facial pose, i.e., three in rotation and three in translation, are estimated with the steepest descent method by fitting the three-dimensional coordinates of each feature in an a priori three-dimensional face-shape model to the coordinates calculated in the previous step.
Fig. 4. Facial direction estimation
4. Estimate gaze direction. A candidate for the iris region is obtained by binarizing the eye regions and fitting an ellipse. The three-dimensional coordinates of the center of the eyeball, obtained in the previous step, and the coordinates of the center of the iris, calculated from the iris-region candidate in the image and the facial direction, are then estimated. The gaze direction is the direction of the line through these two points [10].

We use the facial direction to determine where a user is looking, because iris detection is not robust given the variety of individual eye shapes and the lighting noise of the real world. We define a base-line of facial direction from the center of the PDP to the facial gravity-point of a user who is looking at the center of the PDP; the blue lines in Fig. 4 show these base-lines. The system calculates the motion of the base-line from the six DOF and finally identifies the point where the base-line crosses the PDP plane.
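That final step amounts to a ray-plane intersection. A minimal geometric sketch of the computation (the coordinate frame and variable names are our own, not from the paper):

```python
import numpy as np

def gaze_point_on_display(head_pos, face_dir, plane_point, plane_normal):
    """Intersect the facial-direction ray with the display plane.
    head_pos, face_dir: 3D head position and unit facial-direction vector.
    plane_point, plane_normal: any point on the PDP plane and its unit normal.
    Returns the 3D intersection point, or None if the user faces away."""
    denom = np.dot(face_dir, plane_normal)
    if abs(denom) < 1e-9:                      # ray parallel to the display
        return None
    t = np.dot(plane_point - head_pos, plane_normal) / denom
    if t <= 0:                                 # display is behind the user
        return None
    return head_pos + t * face_dir

# Example: user about 1 m in front of a display plane at z = 0, facing slightly left.
p = gaze_point_on_display(np.array([0.2, 1.5, 1.0]),
                          np.array([-0.1, 0.0, -0.995]),
                          np.array([0.0, 1.2, 0.0]),
                          np.array([0.0, 0.0, 1.0]))
print(p)  # approximately [0.0995, 1.5, 0.0]
```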
4 Application

4.1 Acceptable Queries
We implemented a Kyoto sightseeing guide scenario on the prototype system, based on a system for a portable PC [11]. Examples of queries accepted by the system are as follows:
1. "Show me sightseeing spots that are famous for ⟨determinant⟩."
2. "Show me ⟨sightseeing spot⟩."
3. "Show me ⟨subject⟩ of ⟨sightseeing spot⟩."
4. (If a search result list is displayed) "Show me details on the n-th item."
We assigned cherry blossoms, autumn foliage, and gardens as determinants. A bus schedule, how to get to the site, a map, restaurants near the site, and so on are prepared as subjects in query type 3. When these words are recognized, the system displays the corresponding contents. If a sightseeing spot is omitted, the system infers that the spot mentioned just before is the current sightseeing spot. The system has a database of about 2,000 sites. If a user utters words not contained in the database, the system treats the words as keywords, searches Google, and displays the results in list form. If the user then utters a number such as "4," the corresponding website is displayed.
4.2 Recommendation Based on Non-verbal Information
A distinctive function of this dialog system is recommendation dialog based on non-verbal information. The system shows information in two or four windows on the PDP, as shown in Fig. 2, and automatically recommends content by estimating which window the user is looking at when the user does not know what to say.¹ The system then changes state automatically with the system utterance "I will explain this content." Fig. 5 shows the details of the dialog states.
Fig. 5. Dialog state transition
1. Initial state: four determinants are displayed. The "determinants" are shown on the four-section screen.
2. Four sightseeing spots are displayed. In this mode, pictures and brief explanations of four spots that fulfill the condition selected in state 1 are randomly selected and displayed.
¹ The system determines the timing of a recommendation from the time span between the utterances of the user and the system.
3. Four content names are displayed. In this mode, an abstract of the sightseeing spot selected in state 2 is displayed in the upper left of the screen. In addition, the types of information searchable in the system, such as "how to get to the site," "map," and "restaurants near the site," are displayed.
4. Two or four contents are displayed. Contents are, e.g., a map around the site of the current topic, a list of restaurants near the site, details of one restaurant on the list, or the bus schedule from Kyoto station to the spot. When the system estimates that a user's query requests content that is not prepared, the result list of a Google search for the keywords contained in the utterance and the first-ranked website on the list are displayed.

These states normally transit from 1 to 2, 2 to 3, and 3 to 4. Regardless of the current state, however, uttering a determinant causes a transition to state 2, uttering something such as "Show me ⟨sightseeing spot⟩" (a sightseeing spot different from the current topic) causes a transition to state 3, and uttering something such as "Show me ⟨subject⟩ of ⟨sightseeing spot⟩" causes a transition to state 4. If no human is detected by the image processing for a certain period, or the user says "Thank you," the system is reset and returns to state 1.
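The recommendation relies on mapping the estimated gaze point to one of the display windows. A minimal sketch of that mapping for the four-window layout (the display dimensions come from the 70 cm × 120 cm panel described in Sect. 2.2; the equal four-way split is our assumption):

```python
def window_at(x_cm, y_cm):
    """Map a gaze point on the display (origin at lower-left, in cm)
    to one of four equal windows, or None if off-screen."""
    width, height = 70.0, 120.0
    if not (0 <= x_cm <= width and 0 <= y_cm <= height):
        return None
    col = "left" if x_cm < width / 2 else "right"
    row = "upper" if y_cm >= height / 2 else "lower"
    return f"{row} {col}"

print(window_at(20.0, 95.0))  # -> "upper left"
```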
5 Experiments

5.1 Outline of the Experiments
We conducted the following real-world experiments with the system in December 2009. The number of sessions was 100, comprising 70 subjects who used the system once and 10 subjects who used it three times on different days. We conducted two types of experiments. The first was an evaluation of the usability of the application: subjects were instructed to search for sightseeing information and choose a site they wanted to visit by using the system freely. The total duration of these 100 sessions was about 22 hours, during which the system recognized 4,807 user utterances. The second experiment was an evaluation of the image processing performance: subjects were instructed to look at 24 markers shown on the display.
5.2 Performance of Image Processing
To evaluate the performance of the facial direction estimation described in Sec. 3.2, we instructed subjects to look at markers whose positions were known, and compared these known positions with the recognized positions where users were looking. Each subject stood 1 meter away from the PDP. The PDP sequentially displayed each of 24 markers (arranged 4 across by 6 down) at 15-cm intervals; each marker was 1 cm in size and was displayed for 2 seconds. We defined success as the case in which the system correctly recognized the window (one of the four windows shown in Fig. 5) the user was looking at, because the application uses this information for recommendation, as described in Section 4. We calculated the success rate of each of the 100 sessions. Fig. 6 shows the
Fig. 6. Number of Subjects at Each Success Rate
distribution of the success rate. For most subjects, the success rate lies between 20% and 30%, as shown in Fig. 6. We analyzed the main reasons for the failures. In Fig. 7 and Fig. 8 we draw the markers as small circles and plot the recognized destinations of the users' gazes. Fig. 7 shows the result for a subject with a success rate exceeding 90%, while Fig. 8 shows the result for a subject with a 20–30% success rate. As these figures show, some users did not move their heads when they looked at the markers. These results indicate that improvements in both recognition and instruction are needed: for example, the system could use not only the absolute position but also the flow of motion in the recognition process, and it should give some instruction that encourages users' head motions, such as motions of the character agent.
5.3 Evaluation of the Application
We implemented recommendations based on facial direction during periods of no utterance, as an application demonstrating the integration of the spoken dialog system and the image processing of non-verbal information. To evaluate the effect of this function, 40 sessions were conducted with recommendations and 60 without. The system recommends information if a user makes no utterance for a certain period; the threshold was set to 8 or 10 seconds. The experiments were carried out as follows:

1. The subject is given an overview of the system and of a typical dialog between a user and the system.
Fig. 7. Distribution of recognized facial directions of a user who moves their head
Fig. 8. Distribution of recognized facial directions of a user who moves their eyes
2. The subject asks the system six questions, specified by an assistant, to which the system can respond. The assistant supports the subject (about 2 minutes).
3. The subject uses the system freely, unassisted (about 10 minutes).
4. The subject answers a questionnaire.

During the 40 sessions the system made 85 recommendations, as described in Section 4.2. Classifying whether the user utterance just after a recommendation was related to the recommended site gives:

– Continued the recommended topic (A): 25
– Uttered another topic (R): 41
– Reset the system (I): 19

The recommendation acceptance rate is A/(A + R) = 37.9%. If we regard the recommendation as refused also in case I, it is A/(A + R + I) = 29.4%.² Moreover, we analyzed the questionnaire answers to evaluate users' impressions of the system's recommendations [12]. Table 1 shows the results. The questionnaire is a modified and translated version of the ITU-T Rec. P.851 questionnaire, and contains four-level scores (−1.5, −0.5, 0.5, 1.5) for the whole system and for individual items such as its potential, along with free-form entries about the impression of the system. We analyzed the questionnaires of the 54 sessions with an adequacy response rate higher than 50%.
² In this experiment, subjects were required to reset the system (case I) when they were satisfied with a site displayed on the PDP. Therefore, we could not decide whether this type of utterance was positive or negative.
Table 1. Evaluation of recommendations

Question                                              R (N=21)          NR (N=33)         F(1,52)
                                                      Avg     SD        Avg     SD
You knew at each point of the dialogue what
the system expected from you.                         0.119   0.973    −0.621   0.857     9.611, p<0.01**
The system's behavior was always as expected.        −0.595   0.768    −1.046   0.617     5.638, p<0.05*
You prefer a human operator.                         −0.310   0.814    −0.712   0.781     3.302, p<0.1+
We excluded the other sessions because, for most subjects with a low adequacy response rate, the dialogs were not smooth due to low speech recognition accuracy, and we judged their answers inadequate for evaluating the recommendations. Among the 54 sessions, 21 were in recommendation mode (R) and 33 were not (NR). Analyzing the answers to the questions, the evaluation values for system R are mostly higher than those for system NR. Table 1 shows the average (Avg), standard deviation (SD), and F-test values computed from the scores of the 54 sessions; the difference is especially notable for the items shown in Table 1. This result indicates that the system's clarity rises when the dialog provides recommendations. It was, however, often observed that a subject was confused by the recommendations. The suspected reasons are that the delay before the system recommends information is too short, and that the system did not distinguish a user who was perplexed about selecting among the displayed information from one who was puzzled about how to use the system, uniformly treating both as the former case and making recommendations.
6 Conclusion
We constructed a prototype of a proactive spoken dialog system: a smart interactive information presentation system that incorporates non-verbal information recognition into a spoken dialog system. The system consists of a 50-inch plasma display panel, a microphone for voice input, and cameras (three monocular and one stereo) for image input. We implemented a Kyoto sightseeing guide scenario on the system, in which the dialog progresses mainly by spoken language. When there is no utterance for a given duration, the system recommends information by estimating the user's interests via the facial direction estimated from the image information. We conducted an experiment with 100 sessions to evaluate the system's efficiency. The head detection rate was almost 100%, but the success rate of detecting where a user was looking was about 30%, which seems to be the reason why the recommendation acceptance rate was also about 30%. Evaluations of the system with recommendations, however, were mostly higher than those without. In particular, the evaluation of the system's clarity increased when the dialog provided recommendations.
Future work includes improving the recognition rates for speech and images, adding a function for estimating user reactions to the system's responses, such as positive or negative, and constructing a system that better integrates non-verbal information recognition and spoken dialog.
References

1. Kobayashi, Y., Sugimura, D., Sato, Y., Hirasawa, K., Suzuki, N., Kage, H., Sugimoto, A.: 3D Head Tracking using the Particle Filter with Cascaded Classifiers. In: Proc. British Machine Vision Conference (BMVC 2006), pp. 37–46 (2006)
2. Fujie, S., Yamahata, T., Kobayashi, T.: Conversation robot with the function of gaze recognition. In: Proc. 2006 IEEE-RAS Int'l Conf. on Humanoid Robots (Humanoids 2006), pp. 364–369 (2006)
3. Lao, S., Yamaguchi, O.: Facial Image Processing Technology for Real Applications: Recent Progress in Facial Image Processing Technology. IPSJ Magazine 50(4), 319–326 (2009)
4. Kobayashi, A., Kayama, K., Lee, D., Sumi, K., Kato, T., Kadobayashi, R., Yamazaki, T.: Proposition of Proactive Information Display System Using Face Directions and Head Gestures Estimation. In: Proc. of the IEICE General Conference (2009)
5. Minakuchi, M., Asano, S., Satake, J., Kobayashi, A., Hirayama, T., Kawashima, H., Kojima, H., Matsuyama, T.: Mind Probing: Active Stimulation of Gaze Patterns for Inference of User's Interest. IPSJ SIG Technical Reports (Human-Computer Interaction, HCI) 125, 1–8 (2007)
6. Kawahara, T., Kawashima, H., Hirayama, T., Matsuyama, T.: "Automated Information Concierge" based on Proactive Dialog and Information Retrieval. IPSJ Magazine 49(8), 912–918 (2008)
7. Itoh, G., Ashikari, Y., Jitsuhiro, T., Nakamura, T.: Summary and evaluation of the speech recognition integrated environment ATRASR. In: Proc. of 2005 Acoustical Society of Japan Fall Meeting, pp. 221–222 (2005)
8. Yoda, I., Sakaue, K.: Concept of Ubiquitous Stereo Vision and Applications for Human Sensing. In: Proc. 2003 IEEE International Symposium on Computational Intelligence in Robotics and Automation (CIRA 2003), pp. 1251–1257 (2003)
9. Kobayashi, A., Satake, J., Hirayama, T., Kawashima, H., Matsuyama, T.: Person-Independent Face Tracking Based on Dynamic AAM Selection. In: IEEE Int. Conf. on Automatic Face and Gesture Recognition (FG) (2008)
10. Satake, J., Kobayashi, A., Hirayama, T., Kawashima, H., Matsuyama, T.: Accuracy Improvement of Real-Time Gaze Estimation using a High Resolution Camera. Technical report of IEICE, PRMU 107(491), 137–142 (2008)
11. Kashioka, H., Misu, T., Ohtake, K., Hori, C., Nakamura, S.: Development of dialog system keeping step with users. Technical report of IPSJ, SLP 2008(68), 93–97 (2008)
12. Mizukami, E., Kashioka, H., Kawai, H., Nakamura, S.: A study toward an evaluation method for spoken dialogue systems with considering criteria of users. In: Proc. of 2nd International Workshop on Spoken Dialogue Systems Technology (IWSDS 2010) (2010) (to appear)
D3 Toolkit: A Development Toolkit for Daydreaming Spoken Dialog Systems Donghyeon Lee, Kyungduk Kim, Cheongjae Lee, Junhwi Choi, and Gary Geunbae Lee Department of Computer Science and Engineering Pohang University of Science and Technology {semko,getta,lcj80,chasunee,gblee}@postech.ac.kr
Abstract. Recently, various data-driven spoken language technologies have been applied to spoken dialog system development. However, the high cost of maintaining spoken dialog systems remains one of the biggest challenges, and a fixed corpus collected by humans is never enough to cover the diverse utterances of real users. The concept of a daydreaming dialog system can solve this problem by making the system learn from previous human-machine dialogs. This paper introduces the D3 (Daydreaming Dialog system Development) toolkit, a back-end support toolkit for developing daydreaming spoken dialog systems. To reduce human effort, the D3 toolkit generates new utterances with semantic annotations, as well as new knowledge, by analyzing the usage log files. The corpus additions are determined by verifying suitable candidates using semi-automatic methods. The augmented corpus is used to build improved models, and self-evolution of the dialog system is achieved by replacing the old models. We implemented the D3 toolkit using web-based technologies to provide a familiar environment to non-expert end-users. Keywords: Spoken Dialog System, Statistical NLP, Daydreaming Computer, Failure-driven Learning, Dialog Development Toolkit.
1 Introduction

In recent years, data-driven spoken dialog systems have been studied by many researchers. In general, a spoken dialog system (SDS) consists of three major components: automatic speech recognition (ASR), spoken language understanding (SLU) and dialog management (DM). To build each model, developers need to prepare an annotated dialog corpus and domain-specific knowledge. However, corpus preparation requires tedious human effort. To reduce this laborious work, several development toolkits, such as SUEDE [1], the CSLU Toolkit [2] and DialogDesigner [3], have been developed to help developers with rapid system design. For example, Jung et al. [4] developed Dialog Studio to reduce engineering work and to upgrade all components, including ASR, SLU, and DM, together. Nevertheless, problems remain when using and managing data-driven spoken dialog systems in practical settings: a fixed corpus collected by humans is never enough to cover a real user's utterance patterns.
Data-driven user simulation techniques are widely used for learning optimal dialog strategies in statistical dialog management frameworks and for the automated evaluation of spoken dialog systems [5]. User simulation is an alternative way to address common weaknesses of dialog systems, such as the scarcity of the training corpus and the cost of evaluation by real users. However, the problem with data-driven user simulation is the limited range of user patterns: the response patterns of a data-driven simulated user tend to be restricted to the training data, even when exploration algorithms are used to find unseen patterns. The simulated patterns therefore lack realism, and it is not easy to simulate unseen user behavior. These problems can be addressed by the concept of a daydreaming computer [6]: a computer left unemployed by its users should not be idle, but daydreaming. Learning from experience, one important role of daydreaming, can be applied to the SDS, so that a daydreaming dialog system keeps trying to learn from failures and successes while users are not using it. We developed a support tool for daydreaming dialog system development, called the D3 (Daydreaming Dialog Development) toolkit, to help the developer find real users' new utterance patterns. This paper is organized as follows: Section 2 introduces the concept of the daydreaming dialog system. The D3 toolkit components and detailed strategies are presented in Section 3. Preliminary experiments and the implementation of the D3 toolkit are described in Section 4. Finally, Section 5 draws conclusions and outlines future work.
2 Daydreaming Dialog System

An intriguing aspect of humans is that we spend much time engaged in thought not directly related to the current situation or environment. This type of thought, usually called daydreaming, involves recalling or imagining personal or vicarious experiences in the past or future. For both humans and computers, Mueller and Dyer [6] discussed the following functions of daydreaming: support for creativity, future planning and rehearsal, learning from experience, emotion modification, and motivation. A daydreaming computer takes situational descriptions as input and produces as output the actions it would perform in a given situation, as well as several daydreams when it is idle. The daydreaming computer learns as it daydreams by indexing daydreams, planning strategies, and future plans into memory for later use. We believe that the concept of daydreaming can be extended to the SDS because daydreaming enables the SDS to learn from successes and failures. While the SDS is not processing user utterances, self-evolutionary modules in the daydreaming dialog system can learn to process problematic utterances in the background. This daydreaming process corrects the problematic situations, so that the system can generate appropriate responses when facing the same situation again. For example, suppose the user says "I am travelling for sightseeing", but errors occur due to wrong predictions in the ASR, SLU or DM modules. While the SDS is idle, the
Fig. 1. Overview of daydreaming spoken dialog system architecture
system automatically detects and corrects the errors and updates all models so that it can smoothly process the same utterance in the future. After daydreaming, the system can generate a response such as "How long are you staying?" when the user says the same utterance again. Fig. 1 illustrates the overview of the daydreaming spoken dialog system architecture. The log manager creates log files, which are regarded as the system's experiences, including successes and failures. The user manager manages user accounts and loads the models for a user. When the system's idle time comes, the system starts a daydreaming process. Daydreaming is performed by a self-evolutionary process that analyzes the human-machine dialog logs, tries to extract new patterns, and updates the models to learn from successes and failures. Our D3 toolkit supports this self-evolutionary process so that it can easily be carried out by a real user.
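As a rough illustration of this cycle (not the authors' code; the three stage functions are trivial stand-ins for the toolkit's real modules):

```python
IDLE_THRESHOLD_S = 600  # assumed idle period before daydreaming starts

def analyze_logs(logs):
    # The ASR/SLU/knowledge analysis would produce new-pattern candidates here.
    return [entry for entry in logs if entry.get("confidence", 1.0) < 0.5]

def verify(candidates):
    # Semi-automatic verification (thresholds plus a human check) would go here.
    return candidates

def retrain(models, accepted):
    # Rebuilding the ASR/SLU/knowledge models from the augmented corpus goes here.
    return dict(models, corpus_size=models.get("corpus_size", 0) + len(accepted))

def daydream_once(logs, models):
    """One self-evolution cycle: analyze logs, verify candidates, retrain,
    and return the replacement models."""
    return retrain(models, verify(analyze_logs(logs)))
```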
3 Daydreaming Dialog System Development Toolkit

Dialog Studio provides a convenient way to prepare a dialog corpus and update all models without large human effort. The annotation step, however, is still time-consuming, because the developer must annotate new utterances manually, without suggestions of the most probable semantic tags, in order to add new patterns. To address this problem, we have developed the D3 support toolkit for easy maintenance of an SDS. The D3 toolkit supports three main functions: 1) finding out-of-patterns
Fig. 2. Procedures of the D3 Toolkit
that cause errors in human-machine dialog logs, 2) suggesting their transcriptions for training a language model, and 3) tentatively annotating semantic tags.

3.1 Procedures of the D3 Toolkit

As shown in Fig. 2, the D3 toolkit starts by loading a set of log files generated when real users talked to the dialog system in the past. Log analysis is divided into three parts: the ASR, SLU, and knowledge parts. The log analyzer generates candidates that include system errors caused by out-of-patterns. The developer is only required to verify and modify the auto-generated candidates, as in active learning [7]. The new utterance patterns are then added to the existing corpus and all models are retrained.

3.2 Logging Step

All processed human-machine dialogs are saved as XML-format log files by the log manager of the SDS. An example log file is illustrated in Fig. 3. Each module of the SDS generates information that is used in the analysis step. The ASR module provides the recognized text (ASR_RESULT), a confidence score (CONFIDENCE) and a wave filename (WAVE). The SLU module provides a semantic structure (DIALOG_ACT, MAIN_GOAL, SLOT) and a confidence score (SLU_CONFIDENCE). The DM module records the discourse history, such as the previous user intention (PREV_DA, PREV_MG, PREV_SL), the previous system action (PREV_SA), the filled slots (FILLED_SLOT) and the current system action (SYSTEM_ACTION). FILLED_SLOT indicates which slots have been filled by the user during the current dialog.

3.3 Analysis Step

ASR Evolution. Traditional ASR requires both a language model (LM) and an acoustic model (AM). In our system, we use a domain-specific language model and a generic acoustic model to build the spoken dialog system.
```xml
<WAVE>/wav/result005.wav</WAVE>
<ASR_RESULT>오늘 21시에는 무슨 프로그램 하지?</ASR_RESULT>
<!-- "What program is on at 21:00 today?" -->
<CONFIDENCE>0.8587191</CONFIDENCE>
<SLU>
  <DIALOG_ACT>wh_question</DIALOG_ACT>
  <MAIN_GOAL>search_program</MAIN_GOAL>
  <SLOT name="date">today</SLOT>
  <SLOT name="time">21:00</SLOT>
  <SLU_CONFIDENCE>0.87671</SLU_CONFIDENCE>
</SLU>
<PREV_DA>request</PREV_DA>
<PREV_MG>search_program</PREV_MG>
<PREV_SL>date,genre</PREV_SL>
<PREV_SA>inform(program)</PREV_SA>
<FILLED_SLOT>date,time</FILLED_SLOT>
<SYSTEM_ACTION>inform(program)</SYSTEM_ACTION>
```

Fig. 3. Example of the log file
ASR evolution involves LM adaptation and AM adaptation (Fig. 4). In Fig. 4, ASR-1 denotes the ASR of the SDS and ASR-2 the ASR of the toolkit; they are the same module except that they use different LMs. To learn out-of-patterns, our toolkit first tries to find mis-recognized utterances in the log files. Out-of-patterns may be recognized incorrectly because they were never seen when training the LM of ASR-1. We preliminarily used the ASR confidence score [8] to detect out-of-patterns, since the confidence score is commonly used to detect erroneous utterances; an utterance is flagged as an out-of-pattern if its utterance-level ASR confidence score is lower than a pre-defined threshold. To obtain the most probable transcriptions, the out-of-patterns are recognized again by ASR-2, whose LM was trained on an extended corpus. The extended corpus contains the basic dialog corpus of the SDS, a general conversational corpus, and web data related to the domain, so as to cover a wide variety of utterance patterns. In ASR-2 we can use a larger LM because this process is off-line and the decoding time is not critical. After re-recognizing the out-of-patterns, the toolkit extracts the re-recognized utterances that have a high confidence score; these are likely to be recognized correctly now, although they were mis-recognized before. These utterances are used to train a new LM for ASR-1, so that ASR-1 will recognize them correctly when facing similar patterns.
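A simplified sketch of this two-pass selection (the thresholds are assumed values, the field names follow the log format of Sect. 3.2, and the second-pass recognizer call is a placeholder):

```python
LOW_CONF = 0.5    # assumed threshold for flagging out-of-patterns in ASR-1
HIGH_CONF = 0.9   # assumed threshold for accepting ASR-2 re-recognitions

def rerecognize_with_extended_lm(wave_path):
    # Placeholder for decoding the wave file with ASR-2's larger LM.
    return {"text": "...", "confidence": 0.0}

def collect_lm_candidates(log_entries):
    """Select utterances that ASR-1 likely got wrong but ASR-2 decodes
    confidently; these become candidates for retraining ASR-1's LM."""
    candidates = []
    for entry in log_entries:
        if entry["CONFIDENCE"] < LOW_CONF:          # suspected out-of-pattern
            second = rerecognize_with_extended_lm(entry["WAVE"])
            if second["confidence"] >= HIGH_CONF:
                candidates.append(second["text"])
    return candidates
```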
Fig. 4. ASR evolution flow
In addition, well-recognized utterances in the log files are used to adapt the AM with the HTK toolkit [9], because acoustic adaptation to a specific speaker is important for increasing recognition accuracy.

SLU Evolution. The flow of SLU evolution is illustrated in Fig. 5. For SLU evolution, we consider three cases:

– Case 1: low ASR confidence score
– Case 2: high ASR confidence score and low SLU confidence score
– Case 3: high ASR confidence score and high SLU confidence score

In Case 1, the new utterances can be extracted by ASR evolution. In Case 2, errors occur in the current SLU model. Therefore, the utterances of Case 1 and Case 2 should be newly annotated with appropriate semantic tags to train the statistical SLU model. In our system, a hybrid approach to SLU [10] is used to extract the semantic frames of the user's utterances. The hybrid intention recognizer in Fig. 5 is based on a hybrid model that combines an utterance model and a dialog context model. In the SDS, the utterance model is usually used for SLU to predict the user intention from the utterance itself, while the dialog context model predicts the probable user intention given the current dialog context. For the dialog context model, we use a CRF model trained on the dialog corpus of the SDS, with the following features: the previous dialog act, previous main goal, previous component slots and previous system action. For each utterance, this information can be obtained from the log file. The hybrid model merges hypotheses from the current SLU model with hypotheses from the dialog context model to find the best overall match to the user intention.
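The paper does not spell out the merging formula; one common way to realize such a hybrid recognizer is a linear interpolation of the two models' scores, sketched here with made-up scores and an assumed weight:

```python
def merge_hypotheses(utterance_scores, context_scores, alpha=0.7):
    """Combine per-intention scores from the utterance model and the dialog
    context model; alpha weights the utterance model (an assumed value)."""
    intentions = set(utterance_scores) | set(context_scores)
    merged = {
        i: alpha * utterance_scores.get(i, 0.0)
           + (1 - alpha) * context_scores.get(i, 0.0)
        for i in intentions
    }
    return max(merged, key=merged.get)

# Hypothetical scores over user intentions:
utt = {"search_program": 0.55, "set_channel": 0.45}
ctx = {"search_program": 0.20, "set_channel": 0.80}
print(merge_hypotheses(utt, ctx))  # context evidence can flip the decision
```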
Fig. 5. SLU evolution flow
In Case 3, we treat the SLU results as machine-labeled utterances for the unlabeled data, in the manner of semi-supervised learning [11].

Knowledge Database. One of the most important resources in building an SDS is the knowledge database (KDB), which contains the domain-specific information (e.g., the TV program schedule) to be presented or suggested to the user. The SDS sometimes cannot find the requested information because the KDB does not contain what the user said. For example, the user says "I want to watch the drama 'Cinderella's Sister'", but the system cannot provide any information because the new program 'Cinderella's Sister' has not yet been added. In general, the KDB is used to build the ASR and SLU modules, because they need the set of slot values to be recognized and extracted. However, users do not know which items are included in the KDB and therefore often utter out-of-vocabulary (OOV) terms for the slot values. In this case, the daydreaming dialog system has to detect the OOV terms so that it can handle the unseen items next time. In our toolkit, the OOV terms are detected with a word confidence scoring method [12]: a search network for recognition is constructed from the utterance patterns, and a keyword slot is assumed to be an OOV term if its word confidence score is low. Next, a phonetic recognizer transcribes the phonemes of the OOV term, and a phoneme-to-grapheme conversion module generates n-best keywords. After that, the new knowledge is extracted using a knowledge importer, a domain-specific module that extracts structured information from external
Fig. 6. A strategy of knowledge acquisition for OOV terms
knowledge sources such as the Web. Fig. 6 shows the strategy of knowledge acquisition for OOV terms.
3.4 Verification Step
At the analysis step, the D3 toolkit suggests a set of unseen utterance patterns to which each model should be adapted in order to improve performance. However, these candidates may still include many errors because they were generated automatically by the evolution modules. Therefore, the candidates should be verified for correctness before each model is trained. The D3 toolkit provides this verification step using an active learning method [7]. We empirically defined a set of thresholds for the ASR and SLU models: if the confidence score of a candidate exceeds the given threshold, the candidate is automatically added to the training data; otherwise, the candidate is verified by a real user (a sketch of this routing is given at the end of this section). At the verification step, the D3 toolkit shows useful information, such as the most probable transcription and semantic tags, to help the user select which utterances should be added to the new models; the user can also revise a candidate while listening to the recorded speech file. We believe that real users can easily use our toolkit, because the verification step is completed simply by ticking checkboxes according to the user's intuition.
3.5 Training Step
After the verification, the existing corpus is combined with the new corpus. The augmented corpus is used to build the new ASR, SLU and knowledge models. Replacing the old models with the new ones completes the self-evolution process.
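A sketch of the verification step's confidence-gated routing between automatic acceptance and human review; the threshold value and the Candidate fields are illustrative assumptions, not the toolkit's actual interface:

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    transcription: str    # most probable transcription from the evolution step
    semantic_tags: list   # proposed semantic annotation
    confidence: float     # confidence score attached to the candidate

AUTO_ACCEPT = 0.85  # hypothetical, empirically tuned per model (ASR/SLU)

def route_candidates(candidates):
    """Split candidates into auto-accepted training data and a human review queue."""
    training_data, review_queue = [], []
    for c in candidates:
        if c.confidence >= AUTO_ACCEPT:
            training_data.append(c)   # added to the corpus without human effort
        else:
            review_queue.append(c)    # shown to a real user with transcription/tags
    return training_data, review_queue
```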
4 Preliminary Experiments
4.1 Corpus
For the preliminary experiments, we collected about 114 human-machine utterances from 25 dialogs in Korean, based on a set of pre-defined subjects relating to the TV guidance task. We tested our example-based dialog system [13]. Table 1 shows the result types of the collected utterances; the result types correspond to the three cases described for SLU evolution. The number of utterances in Case 2 is small because SLU models are closely tied to ASR models in SDS development. After the evolutionary process, 4 utterance patterns with semantic annotations and 2 knowledge sources were newly added to the existing corpus.
Table 1. Result type of the collected utterances
                  Case 1   Case 2   Case 3
# of utterances     51        9       54
4.2 Dialog System Performance
To evaluate the effectiveness of our toolkit, we compared the dialog system's performance before and after daydreaming. We quantified ASR performance by word error rate (WER) and SLU performance by concept error rate (CER). After updating the models, WER decreased from 48.64% to 46.97% and CER decreased from 32.58% to 29.21%. The WER is high compared with that of conventional SDSs because all the dialogs include many new patterns. For the dialog system evaluation, we calculated the task completion rate (TCR). TCR increased from 48.00% to 56.00%, which means that two more tasks were completed. The experimental results are shown in Table 2. The system performance improved after daydreaming because, with all models updated, the system succeeds in processing the utterances that caused errors before daydreaming.
Table 2. Experiment results
            Before daydreaming   After daydreaming
WER (ASR)        48.64%               46.97%
CER (SLU)        32.58%               29.21%
TCR (DM)         48.00%               56.00%
4.3 Implementation
The D3 toolkit was implemented in a client-server architecture. To make the toolkit easily accessible and provide a familiar environment, the graphical user interface was implemented using web-based technologies. For efficiency and adaptability, the analysis and training parts were developed using the standard C++ library. The screenshot in Fig. 7 shows the interface of the D3 toolkit.
Fig. 7. Screen shot of the D3 Toolkit user interface
5 Conclusions and Future Work
We introduced the daydreaming spoken dialog system, which can be semi-automatically improved by learning from past human-machine dialogs. We also implemented the D3 toolkit to support the self-evolutionary process of the daydreaming dialog system. The D3 toolkit generates a new corpus by analyzing the log file, and the models of the dialog system can be semi-automatically upgraded via the verification step. The main advantage of the D3 toolkit is that little human effort is required to maintain the SDS, since out-of-patterns are fed back into the models. In future work, we will apply the D3 toolkit to different domains and evaluate the updated systems more extensively. In addition, a process for dialog model evolution will be considered at the analysis step.
Acknowledgments. This work was supported by the Industrial Strategic Technology Development Program, 10035252, Development of dialog-based spontaneous speech interface technology on mobile platform, funded by the Ministry of Knowledge Economy (MKE, Korea).
References
1. Sinha, A.K., Klemmer, S.R., Chen, J., Landay, J.A., Chen, C.: SUEDE: Iterative, Informal Prototyping for Speech Interfaces. In: Video poster in Extended Abstracts of Human Factors in Computing Systems: CHI 2001, Seattle, pp. 203–204 (2001)
2. Sutton, S., Cole, R., de Villiers, J., Schalkwyk, J., Vermeulen, P., Macon, M., Yan, Y., Kaiser, E., Rundle, B., Shobaki, K., et al.: Universal Speech Tools: The CSLU Toolkit. In: Proc. Internat. Conf. on Spoken Language Processing (ICSLP), pp. 3221–3224 (1998)
3. Dybkjær, H., Dybkjær, L.: DialogDesigner: Tools support for dialogue model design and evaluation. Lang. Resour. Eval. 40(1), 87–107 (2006)
4. Jung, S., Lee, C., Kim, S., Lee, G.G.: Dialog Studio: A workbench for data-driven spoken dialog system development and management. Speech Communication 50(8-9), 683–697 (2008)
5. Scheffler, K., Young, S.: Corpus-based dialogue simulation for automatic strategy learning and evaluation. In: NAACL Workshop on Adaptation in Dialogue Systems, pp. 64–70 (2001)
6. Mueller, E.T., Dyer, M.G.: Daydreaming in humans and computers. In: Proceedings of the Ninth International Joint Conference on Artificial Intelligence. University of California, Los Angeles (1985)
7. Bonwell, C.C., Eison, J.A.: Active Learning: Creating Excitement in the Classroom. ASHE-ERIC Higher Education Report No. 1. The George Washington University, School of Education and Human Development, Washington (1991)
8. Jiang, H.: Confidence measures for speech recognition. Speech Communication 45(4), 455–470 (2005)
9. Hidden Markov Model Toolkit (HTK), http://htk.eng.cam.ac.uk/
10. Lee, S., Lee, C., Lee, J., Noh, H., Lee, G.G.: Intention-based Corrective Feedback Generation using Context-aware Model. In: Proceedings of the 2nd International Conference on Computer Supported Education (CSEDU 2010), Valencia (2010)
11. Tur, G., Hakkani-Tür, D., Schapire, R.E.: Combining Active and Semi-Supervised Learning for Spoken Language Understanding. Speech Communication 45(2), 171–186 (2005)
12. Hazen, T., Burianek, T., Polifroni, J., Seneff, S.: Recognition Confidence Scoring for Use in Speech Understanding Systems. In: Proceedings of the ISCA ASR 2000 Tutorial and Research Workshop, Paris (2000)
13. Lee, C., Jung, S., Kim, S., Lee, G.G.: Example-based Dialog Modeling for Practical Multi-domain Dialog System. Speech Communication 51(5), 466–484 (2009)
New Technique to Enhance the Performance of Spoken Dialogue Systems by Means of Implicit Recovery of ASR Errors*
Ramón López-Cózar¹, David Griol², and José F. Quesada³
¹ Dept. of Languages and Computer Systems, CITIC-UGR, University of Granada, Spain. [email protected]
² Dept. of Computer Science, Carlos III University of Madrid, Spain. [email protected]
³ Dept. of Artificial Intelligence and Computer Science, University of Seville, Spain. [email protected]
Abstract. This paper proposes a new technique to implicitly correct some of the ASR errors made by spoken dialogue systems, implemented at two levels: statistical and linguistic. The goal of the former level is to use, for the correction, knowledge extracted from the analysis of a training corpus comprised of utterances and their corresponding ASR results. The outcome of the analysis is a set of syntactic-semantic models and a set of lexical models, which are optimally selected during the correction. The goal of the correction at the linguistic level is to repair errors not detected at the statistical level which affect the semantics of the sentences. Experiments carried out with a previously developed spoken dialogue system for the fast food domain indicate that the technique enhances word accuracy, spoken language understanding and task completion by 8.5%, 16.54% and 44.17% absolute, respectively.
1 Introduction
It is well known that user utterances are frequently misheard, misrecognised or misunderstood by spoken dialogue systems (SDSs), mainly due to the current limitations of state-of-the-art automatic speech recognition (ASR). Thus, well-designed error handling strategies are crucial for robust system performance, especially for inexperienced users. Some authors have studied human error recovery strategies in order to apply them, where possible, to SDSs. For example, [1] found that when subjects face speech recognition problems, a common strategy is to ask task-related questions that confirm their hypothesis instead of signalling non-understanding, which leads to better understanding of subsequent sentences. A number of error handling strategies can be found in the literature, mostly working at three levels of the system's architecture: ASR, spoken language understanding
* This research has been funded by the Spanish Ministry of Science and Technology, under project TIN2007-64718 HADA.
(SLU) and dialogue management (DM). These techniques have traditionally been separated into two groups: error detection and error correction. A common method for error detection is to use recognition confidence scores, but the problem is that these measures are not entirely reliable, as they depend on noise conditions and user types. At the DM level, a common method for detecting errors is to use confirmation strategies [2] or re-phrasing, whereas implicit confirmations can be employed for both error detection and correction. [3] proposed a model for error correction comprised of four levels applied interactively: detection, diagnosis, repair plan selection and plan execution. [4] presented an agent-based architecture in which error handling is divided into individual, application-independent components. This architecture makes it possible to construct adaptive and reusable components and entire error-handling toolkits. [5] proposed an example-based error recovery method to detect and correct errors, based on a re-phrase strategy and task guidance that helps novice users re-phrase sentences so that they are easier to recognise and understand. Some error handling techniques try to hide the errors made by components of an SDS, for example, errors made by the ASR. The technique we propose in this paper follows this direction: its goal is to detect ASR errors and correct them before the ASR result becomes the input to the SLU. To do so, it takes the ASR result and carries out two kinds of processing, one statistical and the other linguistic. The statistical process tries to detect and correct errors using knowledge extracted from the analysis of a training corpus comprised of utterances and their corresponding ASR results. The goal of the linguistic process is to repair errors not detected by the statistical process, when these affect the semantics of the sentences. The remainder of the paper is organised as follows. In Section 2 we discuss previous studies concerned with error handling techniques that automatically detect and correct ASR errors. Section 3 presents our technique for implicit recovery from these errors, discussing the necessary elements and explaining how to implement the algorithms for error correction. Section 4 presents the experiments, comparing system performance achieved with and without the proposed technique. Finally, Section 5 presents the conclusions and outlines possibilities for future work.
2 Related Work
Most previous techniques for automatically correcting ASR errors are based on statistical methods that use probabilistic information about the words uttered and the words in the recognition results. For example, following this approach, [6] proposed a method based on two parts: a channel model that represents the errors made by a speech recogniser, and a language model that represents the likelihood of a sequence of words uttered by a speaker. They trained both models with transcriptions of dialogues obtained with the TRAINS-95 dialogue system. Their experimental results showed that the post-processor output contained fewer errors than that of the speech recogniser. Also following this approach, [7] proposed a method that uses statistical features of character co-occurrence, implemented in two consecutive correcting processes. The former detects and corrects errors using a database of erroneous-correct
utterance pairs. The remaining errors are passed to the second process, which uses a string in the corpus that is similar to the string containing the recognition errors. The authors found that this method contributed significantly to the performance of speech translation systems. Also, [8] proposed a method to model the likely contexts of all words in an ASR system's vocabulary by performing a lexical co-occurrence analysis on a large corpus of speech recogniser output. They identified regions in the data that contain likely contexts for a given query word, and then detected words or sequences of words that are likely to appear in that context and that are phonetically similar to the query word. Their experiments showed that this method provides high precision in the detection and correction of errors. Several authors have proposed carrying out error detection and correction at several levels. For example, [9] proposed a two-level schema for detecting recognition errors. The first level applies an utterance classifier to decide whether the speech recognition result is erroneous; if the result is judged incorrect, it is passed to the second level, where a word classifier decides which words are misrecognitions. Following the same approach, [10] proposed an error detection method based on three levels: the first detects whether the input utterance is correct, the second detects incorrect words, and the third detects erroneous characters. The error correction first creates candidate lists of errors and then re-ranks the candidates with a model combining mutual information and trigrams. Statistical methods present several drawbacks. One is that they require large amounts of training data. Another is that their success depends on the size and quality of the speech recognition results, or on the database of collected erroneous strings, since they depend directly on the lexical entries. Hence, a number of authors have proposed combining different types of information sources. For example, [11] presented a method that models the output generated by a set of ASR systems as independent knowledge sources that can be combined to generate an output with a reduced error rate. The outputs of the single ASR systems are combined into a minimal-cost word transition network by means of iterative applications of dynamic programming alignments. The resulting network is then decoded by a rescoring or voting process that selects the output sequence with the lowest score. Employing a different approach, [12] combined lexical information with higher-level knowledge sources via a maximum entropy language model (MELM). Error correction was arranged at two levels, using a different language model at each level. At the first level, a word n-gram was employed to capture local dependencies and to speed up processing. The MELM was used at the second level to capture long-distance dependencies and higher-level linguistic phenomena, and to re-score the N-best hypotheses produced by the first level. Their experiments showed that this approach outperformed previous lexically oriented approaches. The main problem found was that training the MELM required a lot of time and was sometimes infeasible.
3 Proposed Technique for Implicit Recovery from ASR Errors
In this paper we propose a new technique to enhance the performance of SDSs through a method that implicitly recovers from some ASR errors. We say that the recovery is implicit because the user is not made aware of the error; in other words, no dialogue turns are employed for error recovery, which makes the interaction with the system more natural and friendly. The technique is inspired by previous studies based on pattern matching [7] and statistical information [9], and addresses one drawback of the former type of method, namely that the selected pattern may not be optimal. To address this limitation, our technique employs several corpora of previously learnt syntactic-semantic and lexical patterns, as well as a similarity threshold d ∈ [0.0, 1.0] to decide whether a pattern is good enough for error correction. If it is found not to be good enough, the technique searches for a better pattern in the whole set of available patterns, and if a proper pattern is not found there either, the technique makes no correction using the patterns. This procedure is discussed in detail in Section 3.5. In the following sections we describe the elements required to implement our technique: concepts, grammatical rules, syntactic-semantic models and lexical models.
3.1 Concepts
We define a concept as a set of keywords of a given type which are necessary to extract the semantic content of sentences within an application domain. For example, in our experiments in the fast food domain, described in Section 4, we consider, among others, the following concepts: DESIRE = {want, need, …}, FOOD = {sandwich, cake, salad, …}, DRINK = {water, beer, wine, …} and AMOUNT = {one, two, three, …}.
3.2 Grammatical Rules
The general format of a grammatical rule is: ssp → restriction, where ssp denotes a syntactic-semantic pattern, which will be described in the following section, and restriction is a condition that must be satisfied by all the concepts in the pattern. For example, one rule used in our experiments in Spanish is:
NUMBER DRINK SIZE → number(NUMBER) = number(DRINK) and number(DRINK) = number(SIZE) and number(NUMBER) = number(SIZE)
where number is a function that returns either singular or plural for each Spanish word in the concepts it takes as input. The goal of this rule is to check the number agreement of drink orders uttered in Spanish. For example, the sentence dos cervezas grandes (two large beers) satisfies this agreement.
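As an illustration, the following sketch encodes concepts as keyword sets and checks the rule's number agreement; the suffix-based number() heuristic is a crude stand-in for the authors' Spanish morphology function:

```python
# Concepts as keyword sets (abridged; the full inventory is domain-specific).
NUMBER = {"una", "dos", "tres"}
DRINK = {"cerveza", "cervezas", "fanta", "fantas"}
SIZE = {"grande", "grandes"}

def number(word: str) -> str:
    """Crude stand-in for the authors' number() function: final -s marks plural."""
    return "pl" if word.endswith("s") else "sg"

def rule_number_drink_size(n_word: str, d_word: str, s_word: str) -> bool:
    """Restriction of the rule NUMBER DRINK SIZE: the three words must agree."""
    return number(n_word) == number(d_word) == number(s_word)

print(rule_number_drink_size("dos", "cervezas", "grandes"))  # True
print(rule_number_drink_size("una", "cervezas", "grandes"))  # False: violation
```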
3.3 Syntactic-Semantic Models
A syntactic-semantic model is a conceptual representation of the sentences uttered by users of an SDS in a given dialogue state. This state is associated with a prompt type T of the dialogue system, which groups together equivalent prompts used to obtain a particular type of data from the user. To create a syntactic-semantic model for a prompt type T, we transform each sentence uttered by the user in response to that prompt type into what we call a syntactic-semantic pattern (ssp). This pattern is a sequence of concepts obtained by replacing each word in the sentence with the concept(s) the word belongs to. From the analysis of all the sentences uttered in response to each prompt type we create a set of ssp's, from which we remove the redundant ones and associate with each ssp its relative frequency within the set. The outcome of this process is the syntactic-semantic model associated with the prompt type T (SSMT). We call the α model the set of SSMT's created considering the m prompt types of an SDS: α = {SSMTi}, i = 1 ... m.
3.4 Lexical Models
Lexical models contain information about the performance of the speech recogniser of an SDS. We create a lexical model for each prompt type T, which we call LMT. To do so, we consider the sentences uttered in response to the prompt type and their corresponding recognition results. The format of this model is: LMT = {wa, wb, pab}, where wa is a word uttered by a user, wb is the recognised word and pab is the posterior probability of obtaining wb given wa. To create LMT we align each uttered sentence with the recognised sentence using the method described in [13], and compute the probabilities pab for each word pair (wa, wb). We call the β model the set of LMT's created considering the m prompt types of an SDS: β = {LMTi}, i = 1 ... m.
3.5 Algorithms to Implement the Technique
In this section we discuss the two levels of error detection and correction employed by our technique: statistical and linguistic.
3.5.1 Correction at the Statistical Level
The goal of this correction level is to find the words wI in the recognised sentence which belong to incorrect concepts KI. For each such word, we must decide the correct concept KC and select the most appropriate word wC ∈ KC to substitute for wI in the recognised sentence. We implement this procedure in two steps: pattern matching and pattern alignment.
3.5.1.1 Pattern Matching. The procedure for pattern matching is illustrated in Fig. 1. It receives as input the recognised sentence w1, w2, w3, w4, … wn, the syntactic-semantic model associated with the current prompt type T, and the similarity threshold d.
Fig. 1. Pattern matching for correction at the statistical level
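The pattern-matching step of Fig. 1 scores each candidate pattern against sspINPUT, as detailed in the text below; here is a minimal sketch of that score, using a standard dynamic-programming edit distance as a stand-in for the alignment method of [14]:

```python
def edit_distance(a, b):
    """Minimum edit distance between two concept sequences (standard DP)."""
    m, n = len(a), len(b)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[m][n]

def similarity(ssp_input, p):
    """similarity(sspINPUT, p) = (n - med) / n over concept sequences."""
    n = len(ssp_input)
    return (n - edit_distance(ssp_input, p)) / n

ssp_input = ["DESIRE", "AMOUNT", "INGREDIENT", "FOOD"]
p = ["DESIRE", "AMOUNT", "FOOD"]
print(similarity(ssp_input, p))  # 0.75 -> p is an sspSIMILAR whenever d < 0.75
```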
The procedure employs what we call an enriched syntactic-semantic pattern (esspINPUT) obtained from the recognised sentence. This pattern is a sequence of what we call containers: C1, C2, C3, … Cn. The goal of this step is to transform esspINPUT into another pattern-concept sequence called esspBEST, which is initially empty. To do this, we first create a syntactic-semantic pattern called sspINPUT, which contains only the concepts in esspINPUT, for example: sspINPUT = DESIRE AMOUNT INGREDIENT FOOD. Secondly, we determine whether sspINPUT matches any pattern in the syntactic-semantic model associated with the prompt type T (SSMT). If so, we make esspBEST = esspINPUT and proceed with the correction at the linguistic level (discussed in Section 3.5.2). Otherwise, we look for patterns p similar to sspINPUT in SSMT. To do this, we compare sspINPUT with every pattern p in the model and compute a similarity score as follows: similarity(sspINPUT, p) = (n – med) / n, where n is the number of concepts in sspINPUT and med is the minimum edit distance between the two patterns, computed using the method described in [14]. We call sspSIMILAR any pattern p in SSMT such that similarity(sspINPUT, p) > d, where d ∈ [0.0, 1.0] is a similarity threshold whose optimal value must be determined experimentally. We consider three cases depending on the number of sspSIMILAR's in SSMT. The first case is when there is just one: we create a new pattern called sspBEST, make sspBEST = sspSIMILAR, and proceed with the pattern alignment discussed in Section 3.5.1.2. The second case is when there are no sspSIMILAR's in SSMT: we then try to find sspSIMILAR's in the α model (Section 3.3) instead of SSMT, i.e., we employ the same procedure but consider α rather than SSMT. The third case is when there are several sspSIMILAR's in SSMT (or in α). The question then is how to determine the best sspSIMILAR. To make this selection we search for the sspSIMILAR with the greatest similarity to sspINPUT. If there is just one sspSIMILAR satisfying this condition, we make sspBEST = sspSIMILAR and proceed with Step 2 (pattern alignment). If there are several, we select those with the highest frequency in SSMT (or in α): if there is just one, we make sspBEST = sspSIMILAR and proceed with Step 2; if there are several, we make no correction at the statistical level.
3.5.1.2 Pattern Alignment. The goal of pattern alignment is to build esspBEST in case it is still empty. The procedure is illustrated in Fig. 2. It receives as input the pattern-concept sequence esspBEST, the lexical model associated with the current prompt type T (LMT) and the syntactic-semantic patterns sspINPUT and sspBEST. The procedure considers each container Ca in sspINPUT and distinguishes two cases. The first is when the word wa in Ca does not affect the semantics of the sentence, i.e., it is not a keyword (e.g. please). In this case we create a new container D, make D = Ca and add D to esspBEST. The second case is when the word wa in Ca affects the semantics of the sentence, i.e., it is a keyword (e.g. sandwich). We then study whether the word must be corrected: we try to align the container Ca with a container Cb in sspBEST using the method described in [13] and consider two possible occurrences:
Fig. 2. Pattern alignment for correction at the statistical level
Occurrence 1: Ca can be aligned. In this occurrence we assume that the container Ca is correct and make no correction at the statistical level. We create a new container D, make D = Ca and add D to esspBEST.
Occurrence 2: It is not possible to align Ca. This occurrence may arise in two situations. The first is when the container is the result of an insertion recognition error; in this situation we discard Ca, i.e., it is not added to esspBEST. The second is when the container is the result of a substitution recognition error. In this case we must find a correction word from a different concept, wC ∈ CN, store it in a new container D, and add this container to esspBEST. To find wC we consider the lexical model associated with the prompt type T (LMT) and create the set U of words u ∈ CN with which the word wI is confused. If there is only one word u in U, we create a new container D of concept CN, store u in it, and add D to esspBEST. If there are several words, we carry out the same procedure using the word with the highest confusion probability with wI, provided it is unique; if it is not unique, or if there are no words in U, we make no correction at the statistical level.
3.5.2 Correction at the Linguistic Level
The goal of the correction at the linguistic level is to repair errors that are not detected at the statistical level and that affect the semantics of the sentences. To carry out the correction we use the grammatical rules described in Section 3.2. For each rule we carry out the following procedure. The syntactic-semantic pattern ssp of the rule is inserted in a window that slides from left to right over esspBEST, as can be observed in Fig. 3.
Fig. 3. Sliding window over esspBEST (the example shows the utterance "I want one ham sandwich and two small beers" split into containers C1–C9 with word confidence scores)
If the concept sequence in the window is found in esspBEST, then we apply the restriction of the rule to the words in the corresponding containers of esspBEST. If the words satisfy the restriction, we make no correction. Otherwise, we try to find the reason for the violation by searching for an incorrect word wI. To decide on the word wC with which to correct the incorrect word, we consider the lexical model LMT and take into account the set U = {u1, u2, ..., up} comprised of words of the same concept as the word wI. Next, we proceed similarly to the second case of Occurrence 2 (see the previous section), except that the goal is now to replace one word of a concept with another word of the same concept, as sketched below.
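The sketch below combines the rule check with the lexical-model lookup shared by both correction levels; the confusion probabilities and the suffix-based number heuristic are illustrative assumptions, not values from the trained models:

```python
# Toy lexical model: LMT[recognised_word][uttered_word] = confusion probability.
# Real probabilities come from the corpus alignment of Section 3.4 [13].
LMT = {"uno": {"dos": 0.41, "tres": 0.12}}

def gram_number(word: str) -> str:
    """Crude stand-in for the Spanish number() function of Section 3.2."""
    return "pl" if word.endswith("s") else "sg"

def linguistic_correction(n_word: str, d_word: str, s_word: str) -> str:
    """Apply the NUMBER DRINK SIZE restriction; repair n_word via LMT if violated."""
    if gram_number(n_word) == gram_number(d_word) == gram_number(s_word):
        return n_word                         # restriction satisfied: no correction
    candidates = LMT.get(n_word, {})          # same-concept words confused with n_word
    repairing = {u: p for u, p in candidates.items()
                 if gram_number(u) == gram_number(d_word)}
    if not repairing:
        return n_word                         # no suitable candidate: leave as is
    return max(repairing, key=repairing.get)  # most probable confusion wins

# "dos fantas grandes" misrecognised as "uno fantas grandes" (Section 4.1 example):
print(linguistic_correction("uno", "fantas", "grandes"))  # -> "dos"
```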
4 Experiments
The goal of the experiments described in this section is to test the proposed technique with the Saplen spoken dialogue system, which we developed in a previous study to answer fast food queries and orders made in Spanish [15]. The evaluation has been carried out in terms of word accuracy (WA), spoken language understanding (SLU) and task completion (TC), considering two front-ends for ASR: i) the baseline ASR, comprised of the standard HTK-based speech recogniser of the Saplen system, and ii) the enhanced ASR, comprised of the same speech recogniser plus an additional module that implements the proposed technique. We have employed a dialogue corpus collected at our university from students interacting with the Saplen system, which contains around 5,500 utterances and roughly 2,000 different words. The utterance corpus has been divided into two separate corpora, each containing around 50% of the utterances. Using the training corpus we have compiled a word bigram language model that allows recognising sentences of the 18 different types in the corpus. The remaining 50% of the utterances have been used for testing. The experiments have been carried out employing a user simulator developed in a previous study [16]. The interaction between the Saplen system and the simulator is driven by a set of scenarios that represent user goals. We have created two scenario sets: ScenariosA (300 scenarios) and ScenariosB (100 scenarios). Each dialogue generated by the interaction between the Saplen system and the user simulator is stored in a log file for analysis and evaluation. Given that the construction of the syntactic-semantic and lexical models described in Sections 3.3 and 3.4 has been carried out with simulated dialogues, we have run additional experiments to decide the number of dialogues necessary to obtain the maximum amount of syntactic-semantic and lexical knowledge. The results indicate that 900 dialogues is the optimal trade-off.
4.1 Experiments with the Baseline ASR
Employing the user simulator, the Saplen system and ScenariosA, we have generated a corpus of 900 dialogues, which we call DialoguesA1. Table 1 sets out the average results obtained from the analysis of this corpus. The results show the system's problems in correctly recognising and understanding some utterances. Analysis of the log files reveals that in some cases the misrecognised sentences are similar to the uttered sentences. For example, the Spanish sentence dos fantas grandes de limón (two large lemon Fantas) is recognised as uno fantas grandes de limón (one large lemon Fantas) because of the acoustic similarity between dos and uno when uttered by users with strong southern Spanish accents.
Table 1. Results using the baseline ASR (in %)
 WA      SLU     TC
76.12   54.71   24.51
We have also observed problems with confirmations, which occur because the speech recogniser often substitutes the word seis (six) for the word si (yes) when the latter is uttered by strongly accented speakers. In other cases, the recognised sentences are heavily distorted by ASR errors. For example, the sentence quiero una fanta de naranja grande (I want one big orange Fanta) is sometimes recognised as queso de manzana tercera (cheese of apple third).
4.2 Experiments with the Enhanced ASR
As the concepts required by the technique (Section 3.1), we have employed a set of 21 concepts created in a previous study [15]. Following Section 3.2 we have created a set of grammatical rules to check the number agreement of food and drink orders. To create the syntactic-semantic and lexical models of Sections 3.3 and 3.4, we have analysed DialoguesA1, thus obtaining α = {SSMTi} and β = {LMTi}, with i = 1 ... 43, given that the Saplen system can generate 43 different prompt types. To decide the optimal value of the similarity threshold d (Section 3.5.1) we have carried out experiments considering values in the range [0.1, 0.9]. Employing the user simulator and ScenariosB, we have generated a corpus of 300 dialogues for each value, using the proposed technique in all cases. Analysis of the outcomes of these experiments reveals that the best results are obtained when d = 0.5. Using this optimal value, we have again employed ScenariosA to generate another corpus of 900 dialogues, which we call DialoguesA2. Table 2 shows the average results obtained from the analysis of this corpus.
Table 2. Results using the enhanced ASR (in %)
 WA      SLU     TC
84.62   71.25   68.32
Analysis of the log files shows that the technique succeeds in correcting some incorrectly recognised sentences. For example, the incorrectly recognised drink order one large lemon fantas is corrected by making no changes at the syntactic-semantic level and replacing one with two at the lexical level. In other product orders the correction is carried out at the syntactic-semantic level. For example, one curry salad is sometimes recognised as one error curry salad; in this case the correction is carried out by removing the ERROR concept at the syntactic-semantic level. The technique is also useful in correcting the errors with confirmations discussed in the previous section. To do this, it replaces the NUMBER concept with the CONFIRMATION concept, and then selects the most likely word in CONFIRMATION. The enhanced ASR also enables the correction of some misrecognised telephone numbers (in the Spanish format). For example, nine five eight twenty-one fourteen eighteen is sometimes recognised as gimme five eight twenty-one fourteen eighteen because of the acoustic similarity between nine and gimme in Spanish. The technique corrects the error by replacing the DESIRE concept with the NUMBER concept and selecting the most likely word in NUMBER given the word gimme at the lexical level.
The technique is also useful for correcting some misrecognised postal codes. For example, eighteen zero zero one is sometimes recognised as eighteen zero zero turkey. This error is corrected by replacing the INGREDIENT concept with the NUMBER concept and selecting the most likely word in NUMBER given the word turkey. Our proposal is also successful in correcting some incorrectly recognised addresses (in the Spanish format). For example, almona del boquerón street number five second floor letter h is sometimes recognised as almona del boquerón street error five second floor letter zero. This error is handled by a double correction: first, replacement of the ERROR concept with the NUMBER_ID concept and selection of the most likely word in NUMBER_ID given the word error; second, replacement of the NUMBER concept with the LETTER concept and selection of the most likely word in LETTER given the word zero. There are cases in which the technique fails to detect errors, and thus to correct them. This happens when words in the uttered sentence are substituted by other words and the result is valid in the application domain. For example, this occurs when the sentence two green salads is recognised as twelve green salads, given that there is no conflict in terms of concepts and the words agree in number.
4.2.1 Advantage of Using SSMT's, α and d
In this experiment we have checked whether using the SSMT's or α, taking into account d, is preferable to the two following alternative strategies: i) use α only, without first checking the SSMT's; and ii) use the SSMT's, but if the pattern sspINPUT is not found in them, use α without considering the similarity threshold d. The α model is the one created from DialoguesA1 and d is set to the optimal value, i.e., d = 0.5. We have implemented strategy i) and used ScenariosA to generate a corpus of 900 dialogues, which we call DialoguesA3. Next, we have implemented strategy ii) and, again using ScenariosA, generated another corpus of 900 dialogues, which we call DialoguesA4. Therefore, DialoguesA1, DialoguesA3 and DialoguesA4 have been created using the same scenarios and are comprised of the same number of dialogues, the only difference being the strategy used to select the correction model. Table 3 shows the average results obtained from the analysis of DialoguesA3 and DialoguesA4.
Table 3. Results employing strategies to select the syntactic-semantic correction model (in %)
Corpus         WA      SLU     TC
DialoguesA3   80.15   61.67   39.78
DialoguesA4   82.26   66.84   55.35
Analysis of the log files shows that error correction in confirmations is strongly affected by the strategy employed to select the correction model (either SSMT or α). If we always use SSMT to correct errors in confirmations, the correction is successful in many cases; on the other hand, if we always use α, the correction is mostly incorrect.
4.2.2 Advantage of Using LMT's, β and d
The goal of this experiment has been to check whether using the LMT's or β, taking into account d, is preferable to using β regardless of d. To carry out the experiment we have used the β model created with DialoguesA1. We have again employed ScenariosA and generated a corpus of 900 dialogues, which we call DialoguesA5. Therefore, DialoguesA1 and DialoguesA5 have been obtained using the same scenarios and are comprised of the same number of dialogues, the only difference being the use of β. Table 4 shows the average results obtained from the analysis of DialoguesA5. The experiment shows that the confusion probabilities of words are not the same in the LMT's as in β. For example, considering the β model, the highest probability of confusing the word error with a word in the NUMBER concept is 0.0370, and this word is dieciseis (sixteen). However, considering LMT=PRODUCT-ORDER, this probability is 0.0090 and the word is una (one). Therefore, the correction word is dieciseis if we consider β, and una if we take into account LMT=PRODUCT-ORDER, which in some cases is decisive for making the proper correction.
Table 4. Results employing an alternative strategy to select the lexical model (in %)
Corpus         WA      SLU     TC
DialoguesA5   81.40   65.61   60.89
5 Conclusions and Future Work
Comparing the results set out in Tables 1 and 2, we observe that the proposed technique enhances the performance of the Saplen system in terms of WA, SLU and TC by 8.5%, 16.54% and 44.17% absolute, respectively. These enhancements are mostly achieved because, by applying the proposed threshold to the similarity scores between patterns, the technique decides whether to use the correction models associated with the current prompt type T (SSMT and LMT) or the general correction models for the application domain (α and β). This novel contribution optimises the error recovery procedure, as can be observed by comparing the results set out in Tables 2, 3 and 4; these results show that our method for selecting the correction models is preferable to the other possible selection strategies. In particular, we have observed that the benefit of the proposed method is especially noticeable in the correction of misrecognised confirmations. Future work includes considering additional information sources, such as domain-dependent knowledge, to correct errors that cannot be detected in the current implementation. For example, in our application domain we could use this kind of information to deduce that the sentence twelve green salads, although syntactically correct, is likely to be incorrectly recognised, given that customers of fast food restaurants do not usually order such a large quantity of a product. We also plan to study the performance of the technique with prompt-dependent similarity thresholds.
References
1. Skantze, G.: Exploring human error recovery strategies: Implications for spoken dialogue systems. Speech Communication 45, 325–341 (2005)
2. McTear, M., O'Neill, I., Hanna, P., Liu, X.: Handling errors and determining confirmation strategies – An object-based approach. Speech Communication 45, 249–269 (2005)
3. Duff, D., Gates, B., Luperfoy, S.: An architecture for spoken dialogue management. In: Proc. of ICSLP, pp. 1025–1028 (1996)
4. Turunen, M., Hakulinen, J.: Agent-based error handling in spoken dialogue systems. In: Proc. of Eurospeech, pp. 2189–2192 (2001)
5. Lee, C., Jung, S., Lee, D., Lee, G.G.: Example-based error recovery strategy for spoken dialog system. In: Proc. of ASRU, pp. 538–543 (2007)
6. Ringger, E.K., Allen, J.F.: A fertility model for post correction of continuous speech recognition. In: Proc. of ICSLP, pp. 897–900 (1996)
7. Kaki, S., Sumita, E., Iida, H.: A method for correcting errors in speech recognition using the statistical features of character co-occurrence. In: Proc. of COLING-ACL, pp. 653–657 (1998)
8. Sarma, A., Palmer, D.D.: Context-based speech recognition error detection and correction. In: Proc. of HLT-NAACL, pp. 85–88 (2004)
9. Zhou, Z., Meng, H.: A two-level schema for detecting recognition errors. In: Proc. of ICSLP, pp. 449–452 (2004)
10. Zhou, Z., Meng, H., Lo, W.K.: A multi-pass error detection and correction framework for Mandarin LVCSR. In: Proc. of ICSLP, pp. 1646–1649 (2006)
11. Fiscus, J.G.: A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In: Proc. of IEEE ASRU Workshop, pp. 347–352 (1997)
12. Jeong, M., Jung, S., Lee, G.G.: Speech recognition error correction using maximum entropy language model. In: Proc. of Interspeech, pp. 2137–2140 (2004)
13. Fisher, W.M., Fiscus, J.G.: Better alignment procedures for speech recognition evaluation. In: Proc. of ICASSP, pp. 59–62 (1993)
14. Crestani, F.: Word recognition errors and relevance feedback in spoken query processing. In: Proc. Conf. on Flexible Query Answering Systems, pp. 267–281 (2000)
15. López-Cózar, R., Callejas, Z.: Combining language models in the input interface of a spoken dialogue system. Computer Speech and Language 20, 420–440 (2006)
16. López-Cózar, R., de la Torre, A., Segura, J.C., Rubio, A.J., Sánchez, V.: Assessment of dialogue systems by means of a new simulation technique. Speech Communication 40(3), 387–407 (2003)
Simulation of the Grounding Process in Spoken Dialog Systems with Bayesian Networks
Stéphane Rossignol, Olivier Pietquin, and Michel Ianotto
IMS Research Group, Supélec – Metz Campus, France
[email protected]
Abstract. User simulation has become an important research trend in the field of spoken dialog systems because collecting and annotating real man-machine interactions with users is often expensive and time consuming. Yet such data are generally required for designing and assessing efficient dialog systems. The general problem of user simulation is thus to produce as many natural, varied and consistent interactions as necessary from as little data as possible. In this paper, a user simulation method based on Bayesian Networks (BN) is proposed that is able to produce interactions which are consistent in terms of user goal and dialog history, and also to simulate the grounding process that often appears in human-human interactions. The BN is trained on a database of 1234 human-machine dialogs in the TownInfo domain (a tourist information application). Experiments with a state-of-the-art dialog system (REALL-DUDE/DIPPER/OAA) have been carried out and promising results are presented.
1 Introduction
Spoken dialog systems are now widespread and are in use in many domains (from flight booking to troubleshooting services). Designing such a speech-based interface is usually an iterative process involving several cycles of prototyping, testing and validation. Testing and validation require interactions between the released system and real human users, which are very expensive and time consuming. For this reason, user simulation has become an important research trend over the last decade. User simulation can be used for performance assessment [4,9] or for optimisation purposes [8,13,19], for example when optimizing dialog strategies with reinforcement learning. User simulation should not be confused with user modelling. User modelling is generally used by a dialog system for internal purposes such as knowledge representation and user goal inference [5,10] or natural language understanding simulation [14]. The role of user simulation is to generate a large amount of simulated interactions with a dialog system; the simulated user is therefore external to the system. Dialog simulation can occur at several levels of description. In this work, simulated interactions take place at the intention level (as in [4,8,13,19]) and not at the signal level as proposed in [9]. An intention is here defined as the minimal unit of information that a dialog participant can express independently; it is modeled as dialog acts. Indeed, intention-based communication allows error modelling for every part of the system, including speech recognition and understanding [15,14,18]. Pragmatically, it is easier to automatically generate intentions or dialog acts than speech signals, as a large number of utterances can express the same intention.
The Simulated User (SU) presented in this paper is based on Bayesian Networks (BN). This model has been chosen for several reasons. First, BN are generative models and can therefore be used for inference as well as for data generation, which is of course required for simulation. Second, they constitute a statistical framework and can thus generate a wide range of different dialogs that are statistically consistent with each other. Third, BN parameters can either be set by experts or trained on data; given that data are often difficult to collect, the introduction of expert knowledge can be very helpful. Finally, many efficient tools for BN inference and training are freely available. This paper builds on previous work [11,13,16] in which BN are used for simulation purposes, and emphasizes two novel contributions. First, the model has been modified to generate grounding behaviours [3]. The grounding process is considered here as the process used by dialog participants to ensure that they share the background knowledge necessary for understanding what will be said later in the dialog. In practice, this means that the simulated user automatically reacts by providing the correct information again if a problem in the information transmission is detected; the ultimate goal of this work is to train dialog policies that handle such grounding behaviours [12]. Second, the model is trained on actual human-machine dialog data and tested with a state-of-the-art dialog system (the REALL-DUDE/DIPPER/OAA environment [7,1,2]). The considered domain is the TownInfo domain, a tourist information task. The task consists in retrieving information about restaurants in a given city. This can be considered a slot-filling task in which three different slots ("food", "price range" and "area") are considered. These slots can take 3, 3 and 5 values, respectively (see Table 1).
Table 1. Slots in the task and corresponding possible values
Food:         "italian", "indian", "chinese"
Price range:  "moderate", "expensive", "cheap"
Area:         "central", "north", "south", "west", "east"
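As a data structure, the task can be written down directly; the dictionary below merely restates Table 1, and the sample goal illustrates the attribute-value form used for the user goal in Section 2 (the "don't care" value is introduced there):

```python
# TownInfo slots and their possible values (Table 1), plus the extra
# "don't care" value used only in user goals (see Section 2.1).
SLOTS = {
    "food":        ["italian", "indian", "chinese"],
    "price range": ["moderate", "expensive", "cheap"],
    "area":        ["central", "north", "south", "west", "east"],
}
DONT_CARE = "don't care"

# A user goal assigns one value (or "don't care") to every slot.
sample_goal = {"food": "italian", "price range": "cheap", "area": DONT_CARE}
```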
The system acts in use are:
<hello>, <request>, <confirm>, <implicitconfirm request>, and <closingDialog>. The user acts in use include <inform>, <confirm>, and <closingDialog>. The rest of this paper is organized as follows. In Section 2, the proposed model is described in detail. In Section 3, experiments are presented. Finally, a conclusion and future work are provided in Section 4.
2 Description of the Model
2.1 Bayesian Network Model
The user simulation model described in this contribution is based on the probabilistic model of a man-machine dialogue proposed in [11,13]. The interaction between the user and the dialogue manager is considered as a sequential transfer of intentions, organized in turns denoted t, by means of dialog acts. At each turn t the dialogue manager selects a
system dialog act at conditioned on its internal state st. The user answers with a user act ut which is conditioned on the goal gt s/he is pursuing and the knowledge kt s/he has about the dialogue (what has been exchanged before reaching turn t). So, at a given turn, the information exchange can be modeled by the joint probability p(a, s, u, g, k) of all these variables. This joint probability can be factored as: p(a, s, u, g, k) = p(u|g, k, a, s) p(g|k, a, s) p(k|s, a) p(a|s) p(s). Given that:
– since the user does not have access to the SDS state, u, g and k cannot depend on s,
– the user's goal can only be modified according to his/her knowledge of the dialogue,
this expression can be simplified to:
Goal Modif.
p(a, s, u, g, k) = p(u|g, k, a) p(g|k) User act
p(k|a)
p(a|s) p(s)
Know. update
This can be expressed by the Bayesian network depicted in Figure 1-a).
Fig. 1. Bayesian Network-based Simulated User
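To make the generative use of this factorization concrete, the following sketch samples one user turn from hand-written CPT dictionaries; all domains and probability values are illustrative assumptions, not the trained TownInfo parameters:

```python
import random

def sample(dist):
    """Draw one value from a {value: probability} dictionary."""
    values, probs = zip(*dist.items())
    return random.choices(values, weights=probs, k=1)[0]

# Toy CPTs for a single slot; the real CPTs condition on all parents of Fig. 1-a).
P_K_GIVEN_A = {  # knowledge update p(k|a)
    ("low", "<request>"):    {"low": 0.2, "medium": 0.7, "high": 0.1},
    ("medium", "<request>"): {"medium": 0.3, "high": 0.7},
}
P_U_GIVEN_GKA = {  # user act model p(u|g,k,a)
    ("italian", "medium", "<request>"):
        {"<inform>": 0.8, "<confirm>": 0.15, "<closingDialog>": 0.05},
}
FLAT_U = {"<inform>": 0.7, "<confirm>": 0.2, "<closingDialog>": 0.1}

def user_turn(goal, knowledge, system_act):
    """One simulated turn: update the knowledge, then sample the user act."""
    knowledge = sample(P_K_GIVEN_A[(knowledge, system_act)])
    # fall back to a flat distribution for configurations left out of this toy CPT
    user_act = sample(P_U_GIVEN_GKA.get((goal, knowledge, system_act), FLAT_U))
    return knowledge, user_act

print(user_turn("italian", "low", "<request>"))
```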
As explained in [13,12], the practical use of this kind of Bayesian network requires a tractable representation of the stochastic variables {a, s, u, g, k}. Therefore, the variable a is split into two subvariables {AS, Sys}: the dialog act type AS (<hello>, <request>, <confirm>, <implicitconfirm request>, <closingDialog>) and the slots Sys to which it applies (3 in this case, noted Sys-SLOT(i)). The user act u is also divided into two subsets: the dialog act type (e.g. <inform>, <confirm>, <closingDialog>) and the values v of the slots to which they apply (e.g., u = <inform>, v = "food type = Italian"). A special variable C is added to simulate the closing of the dialogue by the user. The knowledge k and the goal g are represented as sets of attribute-value pairs. There is a knowledge value for each of the slots in the task (3 in this case,
noted k-SLOT(i)). Three levels of knowledge (thus 3 possible values) are considered: low, medium and high. These values represent the knowledge the user has of having previously provided the information about the corresponding slot to the system. For example, if the user has once provided information about the type of food s/he wants, the knowledge about this slot goes from low to medium. If this information has been provided several times, the knowledge becomes high and it is more likely that the user will close the dialogue if this slot is asked for one more time. The knowledge thus corresponds to the SU's estimate of the dialog state. The user goal contains one value for each slot in the task; this ensures that the SU's behaviour will be consistent with a given goal. One extra possible value, "don't care", is added to indicate that the user may not be interested in the value of a specific slot.
2.2 Grounding
Notice that if the Dialog Manager (DM) asks for confirmation of a slot for which the SU has a low knowledge value (i.e. a slot for which the SU has never provided information), a grounding problem has likely occurred: the DM and the user do not agree on the slots exchanged before reaching turn t. The SU described so far is designed to confirm (or not) the DM information it receives for this slot with a certain probability, which can be low (or possibly to close the dialog). Yet the knowledge value can be used to infer the occurrence of a grounding problem. More generally, the Bayesian network of Figure 1-a) shows that one can infer the most probable dialogue state ŝt at turn t given the observed DM act at and the user's knowledge kt: ŝt = argmax_s p(s|at, kt) (diagnostic inference). The user's knowledge is required since it keeps track of the dialogue history from the user's point of view. If this dialogue state estimate ŝt is very different from the user's knowledge kt, a grounding problem is likely to have occurred. Instead of computing this state estimate and comparing it to the user's knowledge, we preferred to directly add a decision node to the network, as shown in Figure 1-b). This node can only take boolean values, the true value meaning that a grounding problem has occurred. In practice, a grounding value can be obtained for each slot. In this implementation, if a grounding problem is detected for slot i, the SU is forced to provide the information concerning this slot again. More sophisticated grounding strategies could be investigated, but we restrict this paper to this simple one.
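A sketch of this per-slot grounding rule; the act names and knowledge encoding follow the toy conventions above, not the actual decision-node CPT:

```python
def grounding_problem(system_act, confirmed_slot, knowledge):
    """True if the DM asks to confirm a slot the SU believes it never provided."""
    return system_act == "<confirm>" and knowledge[confirmed_slot] == "low"

def grounded_user_turn(system_act, confirmed_slot, knowledge, goal):
    """Force a re-inform of the misunderstood slot when grounding fails."""
    if grounding_problem(system_act, confirmed_slot, knowledge):
        return ("<inform>", confirmed_slot, goal[confirmed_slot])
    return None  # otherwise, fall back to the sampled BN behaviour

knowledge = {"food": "low", "area": "medium", "price range": "high"}
goal = {"food": "italian", "area": "central", "price range": "cheap"}
print(grounded_user_turn("<confirm>", "food", knowledge, goal))
# -> ('<inform>', 'food', 'italian'): the SU re-provides the correct value
```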
3 Experiments
3.1 Learning BN Parameters
The whole set of probabilities in the conditional probability tables (CPTs) cannot all be learned, as there are thousands of them (2,151, more precisely, in this paper). For most of the probabilities it would in any case not be possible to get a good estimate of their values, since only a small number of similar situations can be found in the database. Tables 2, 3, 4, 5 and 6 summarize the probabilities that we assume generalize well over the whole set of probabilities. For instance, the slots are considered equivalent, and it is assumed that their actual values do not influence the user's answer to
confirmations. Furthermore, when considering the probability of the user closing the dialogue (the C variable mentioned before), we consider the four system acts "hello", "request", "confirm" and "implicitconfirm request" to be equivalent; only the system act "closingDialog" has been treated differently. Finally, the variables coming from the AS node and from the k nodes are assumed independent. Therefore, for the probability of closing the dialogue, it is assumed that providing 2 + 3 = 5 probabilities is enough to generalize well the parameters of the corresponding variable C. The same kind of reasoning has been performed for each node, leading to a set of only 25 probabilities that need to be learned or heuristically fixed. Notice that the aim of the expert was to obtain dialogs that are as short as possible: the expert probabilities were thus fixed accordingly.

Table 2. Expert versus trained probabilities – closing probability variable C

                                                                        expert  learned
from AS: the system act is not "closingDialog"                           0.90    0.997
from AS: the system act is "closingDialog"                               0.99    0.944
from k-SLOT(i): the knowledge is low, thus the SU rather does
  not close the dialog (low value)                                       0.01    0.016
from k-SLOT(i): the knowledge is medium, thus the SU rather
  closes the dialog                                                      0.90    0.104
from k-SLOT(i): the knowledge is high, thus the SU rather closes
  the dialog (probability higher than above)                             1.00    0.773
The heuristically determined values for these probabilities and their corresponding learned values are provided in the five tables (Tables 2, 3, 4, 5 and 6). For instance, the first probability in Table 2 corresponds to the probability that the Simulated User decides to close the dialog if the system act coming from the DM is something other than a "closingDialog" act. Furthermore, in Table 2, the fact that the trained value for the probability of closing the dialog when the knowledge has a high value is quite low (0.773) compared to the corresponding expert probability value (1.00) indicates that human users tend to avoid closing the dialog even when they should think that they have already provided all the requested pieces of information. Therefore, dialogs generated with the heuristically parameterized BN-based simulator would probably tend to be shorter than those obtained with human users or with the trained BN-based Simulated User. The values in the seventh row and in the last row of Table 3 point in the same direction. This indicates as well that the heuristic BN has been especially designed to allow the full system to reach the end of the dialog as fast as possible, that is to say in as few turns as possible.

3.2 Interaction with REALL-DUDE/DIPPER/OAA

The Simulated User has been interfaced with the spoken dialog system provided in the REALL-DUDE/DIPPER/OAA environment (see [7], [1] and [2]). This environment originally aims at training policies by reinforcement learning. Yet, the dialog policy used for our experiments has been trained independently of the SU presented here and is used for testing purposes only. Dialogs similar to the examples provided in Tables 7 and
Table 3. Expert versus trained probabilities – {u, v} = INFORM_SLOT(i)

from AS: the system greets, or asks for a slot, thus the SU decides to enter the slot(i): expert 1.00, learned 0.32
from AS: the system asks for a confirmation, thus the SU decides rather not to enter the slot(i) (low value): expert 0.01, learned 0.0314
from AS: the system closes the dialog, thus the SU decides rather to enter the slot(i) (low value): expert 0.35, learned 9.6e−5
from Sys-SLOT(i): the slot(i) is concerned by the system act, thus the SU decides to enter the slot(i): expert 1.00, learned 0.848
from Sys-SLOT(i): the slot(i) is not concerned by the system act, but the SU decides to enter the slot(i) anyway: expert 0.95, learned 0.727
from Sys-SLOT(i): the slot(i) is concerned by the system act, and the SU decides to enter the slot(j): expert 0.95, learned 0.47
from k-SLOT(i): the knowledge for the slot(i) is low, thus the SU decides to enter the slot(i): expert 1.00, learned 0.968
from k-SLOT(i): the knowledge for the slot(i) is medium, thus the SU hesitates (probability smaller than above) to enter the slot(i): expert 0.8, learned 0.768
from k-SLOT(i): the knowledge for the slot(i) is high, thus the SU decides rather not to enter the slot(i) (low value): expert 0.001, learned 0.361
Table 4. Expert versus trained probabilities – {u, v} = CONFIRM_SLOT(i)

from AS: the system greets or asks for a slot, but the SU decides to confirm the slot(i) (low probability): expert 0.01, learned 0.117
from AS: the system asks for a confirmation for the slot(i), thus the SU decides to confirm the slot(i): expert 1.00, learned 0.709
from Sys-SLOT(i): the slot(i) is concerned by the system act, but the SU decides to confirm the slot(i): expert 0.99, learned 0.873
from Sys-SLOT(i): the slot(i) is not concerned by the system act, thus the SU decides to confirm the slot(i) (low value): expert 0.001, learned 0.0832
from k-SLOT(i): the slot(i) is concerned by the system act and the knowledge for the slot(i) is low, and the SU decides to confirm the slot(i) (low value?): expert 0.01, learned 0.906
from k-SLOT(i): the slot(i) is concerned by the system act and the knowledge for the slot(i) is medium, thus the SU rather decides to confirm the slot(i): expert 0.98, learned 0.915
from k-SLOT(i): the slot(i) is concerned by the system act and the knowledge for the slot(i) is high, thus the SU rather decides to confirm the slot(i): expert 0.99, learned 0.833
Table 5. Expert versus trained probabilities – {u, v} = INFORM_VALUE_SLOT(i)

from INFORM_SLOT(i): INFORM_SLOT(i) not selected (the system decided not to provide information about the slot(i)), thus the SU decides to enter the value for the slot(i) (low value): expert 0.02, learned 0.0213
from VALUE_GOAL_SLOT(i): the SU decides to enter the value for the slot(i) it has in its goal: expert 0.99, learned 0.979
Table 6. Expert versus trained probabilities – {u, v} = CONFIRM_VALUE_SLOT(i)

from u = confirm, v = SLOT(i): CONFIRM_SLOT(i) not selected (the system decided not to confirm the slot(i)), but the SU decides to confirm the value for the slot(i) (low value): expert 0.05, learned 0.00443
from VALUE_GOAL_SLOT(i): the SU decides to confirm the value for the slot(i) it has in its goal: expert 0.99, learned 0.980
8 are obtained. The SU accompanies each hypothesis it sends to the DM with a probability simulating the confidence score which would be provided, among other things, by the Automatic Speech Recogniser (ASR). The DM can decide to ask for a confirmation if this probability is too low. This explains the "<syst act> confirm(Food)" and "<user act> confirm(food=yes)" turns in the second dialog example (Table 8). It can be noticed that the SU answered this system act correctly.

Table 7. First dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment

User goal: Food: indian    rPrice: cheap    Area: west

<syst act> hello(Food)               Hello. What kind of food would you like?
<user act> inform(slot_1='indian')   I'm looking for an Indian restaurant.
<syst act> request(Area)             What part of town do you have in mind?
<user act> inform(slot_3='west')     I want a place in the west part of town.
<syst act> request(rPrice)           What kind of price would you like?
<user act> inform(slot_2='cheap')    I'd prefer something cheap please.
<syst act> close                     The ... is a nice place. Goodbye.
3.3 Heuristic versus Trained BN-Based SU – Statistics

This section presents statistics computed on dialogs obtained using two versions of the SU presented in this paper and the REALL-DUDE/DIPPER/OAA environment. The first version of the SU uses heuristically determined values for the BN parameters. The second version uses trained parameters. The database used for training contains 1234 dialogs. It provides much more complex dialogs than the proposed task. For instance, twelve slots in total are considered. The values for some of these slots are requested by the DM, such as "type", "food", etc., thus calling for a "request" system act and an "inform" user act; others are requested by the user, such as "address", "phone", etc.; the "price range" can be requested by both sides. Furthermore, more than ten different system and user acts are used (see [23] for an exhaustive list). In the current task, fewer acts are considered, as noted in the introduction. Finally, more than one slot and more than one act can be presented during each turn, for both the DM turns and the SU turns. The database is described in more detail in [22] and [20], where it has been used for training dialog management strategies. Table 9 presents the results obtained using the heuristic BN (h-BN) and the trained BN (t-BN). A thousand dialogs have been simulated for each of them.
Figure 2 presents the histogram of the number of turns required to reach the end of the dialogs. Considering the h-BN, it can be seen that most of the dialogs are indeed completed in four turns, as expected (93.2 %). This is due to the fact that it has been designed to obtain dialogs as short as possible.

Table 8. Second dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment

User goal: Food: italian    rPrice: expensive    Area: central

<syst act> hello(Food)                  Hello. What kind of food would you like?
<user act> inform(slot_1='italian')     I'm looking for an Italian restaurant.
<syst act> confirm(Food)                You are looking for an Italian restaurant, right?
<user act> confirm(food=yes)            Yes, that's right.
<syst act> request(rPrice)              What kind of price would you like?
<user act> inform(slot_2='expensive')   Err, expensive please.
<syst act> request(Area)                What part of town do you have in mind?
<user act> inform(slot_3='central')     I want a place in the central part of town.
<syst act> close                        I can recommend the .... Goodbye.
Table 9. Mean number of turns required to reach the end of the dialogs; max number of turns; min number of turns; percentage of dialogs for which the end is reached in 4 turns; percentage of dialogs for which the end is reached in less than 9 turns

        mean    max   min   4 turns    < 9 turns
h-BN    4.969   21    4     93.20 %    93.40 %
t-BN    4.577   9     4     58.00 %    99.90 %
Using the t-BN, the dialogs are somewhat longer than when using the h-BN. This cannot be seen from the average number of turns, but from the percentage of dialogs which needed exactly four turns to reach their end: this percentage drops from 93.2 % to 58.0 %. However, very promisingly, the mean number of turns and the percentage of dialogs which needed less than nine turns to reach their end are both better using the t-BN. Furthermore, it can be noticed that the very long dialogs (more than nine turns), indicating some deep misunderstanding between the DM and the SU, have completely disappeared. This indicates that the dialogs obtained with the t-BN are much more natural, at least from a DM point of view, than the dialogs obtained with the h-BN. The distribution is also more in agreement with the data, which again shows the naturalness of the simulated dialogs. Table 10 presents the number of turns needed per slot, respectively when the h-BN is used and the longest dialogs are kept, when the h-BN is used and the longest dialogs are discarded (they only reflect the DM stopping condition), when the t-BN is used, and for the database. Clearly, the t-BN provides more realistic dialogs in terms of the number of turns needed for each slot.
[Figure 2: two histograms of the number of dialogues over the number of turns needed to reach the end of the dialogue – top panel: heuristic BN; bottom panel: trained BN]

Fig. 2. Histogram of the number of turns required to reach the end of the dialog – top: the heuristic BN is used – bottom: the trained BN is used
Table 10. Number of turns needed per slot: h-BN with the longest dialogs kept; h-BN with the longest dialogs discarded; t-BN; and the database

h-BN      h-BN without long dialogs   t-BN      database
1.6563    1.3986                      1.5257    1.5661
3.4 Grounding Process – Dialog Examples and Statistics

On 1000 simulated dialogs, 496 grounding problems have been detected. In Tables 11 and 12, dialog examples with a grounding problem are shown. In the first case, the SU detected the grounding problem and solved it; in the second case, a grounding problem occurred, but the SU did not act accordingly and the dialog deteriorated afterwards. When the DM asks for confirmation of the third slot, the SU answers it even though it has never sent information about this slot before. The SU is confused, and the DM is likely to be confused afterwards as well. Statistics computed on dialogs obtained using the grounding-enabled SU are presented as well, with and without the grounding problem solver. As the task features three slots and as the SU is configured to send information about no more than a single slot per turn, the minimum number of turns necessary to reach the end of the dialog is four: one per slot, plus the "closingDialog" turn. Notice that a turn is defined here as a pair <syst act>/<user act> (except for the "closingDialog" act, as it is the DM which is actually able to stop a dialog). Considering only the dialogs with a grounding problem, and not the whole set of dialogs, the results in Table 13 are obtained. They are quite promising. The proportion of dialogs longer than 5 turns drops by 30.6 %.
Table 11. Dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment; a grounding problem is detected and solved (the grounding problem solver is used)

User goal: Food: italian    rPrice: cheap    Area: central

<syst act> hello(Food)               Hello. What kind of food would you like?
<user act> inform(slot_1='italian')  I'm looking for an Italian restaurant.
<syst act> confirm(Area)             You are looking for a restaurant in the central part of town, right?
=> grounding error detected for the slot_1 (the DM understood something wrong); problem solved by the SU
<user act> inform(slot_1='italian')  Erm, an Italian restaurant please.
<syst act> request(rPrice)           What kind of price would you like?
<user act> inform(slot_2='cheap')    Could I have a cheap restaurant?
...
Table 12. Dialog example, obtained using the Simulated User integrated within the REALL-DUDE/DIPPER/OAA spoken dialog system environment; a grounding problem occurs and it is not solved by the SU (the grounding problem solver is not used)

User goal: Food: italian    rPrice: cheap    Area: east

<syst act> hello(Food)               Hello. What kind of food would you like?
<user act> inform(slot_1='italian')  I'm looking for an Italian restaurant.
<syst act> request(rPrice)           What kind of price would you like?
<user act> inform(slot_2='cheap')    I'd prefer something cheap please.
<syst act> confirm(Area)             Did you say you are looking for a restaurant in the central part of town?
=> grounding error detected for the slot_2 (the DM understood something wrong); the SU does not respond correctly to this problem
<user act> confirm(area=no)          No, sorry.
...
Table 13. Mean number of turns required to reach the end of the dialogs, and percentage of dialogs for which the end is reached in more than 5 turns – the BN without grounding problem solver versus the BN with a grounding problem solver

                              mean     > 5 turns
no grounding problem solver   5.254    29.22 %
grounding problem solver      5.069    20.29 %
4 Conclusion and Future Work

In this paper, a user simulation model based on Bayesian Networks is proposed to simulate realistic human-machine dialogs at the intention level, including grounding
behaviours. Our goal was first to show the benefit of training the parameters of the Bayesian Networks using a database of actual human-machine dialogs, and second to show the benefit of simulating the grounding process that often occurs in human-human dialogs. This is done by comparing the number of turns required to reach the end of a dialog using different configurations of the proposed simulation framework. Several directions for future work are envisioned. First, one of the goals of developing this Simulated User is the training of dialog management policies within a reinforcement learning paradigm. The developed Simulated User will allow training the policies used in the POMDP engines implemented in the REALL-DUDE/DIPPER/OAA environment. Second, some preliminary results concerning the interaction of the presented Simulated User with an independently developed Dialog Manager have been shown. It is planned in the near future to interact with the spoken dialog system provided in the REALL-DUDE/DIPPER/OAA environment in a much more extensive and systematic way. This will allow comparing the number of turns obtained with the BN-based Simulated Users to the number of turns obtained with human users, analysing the corresponding task completion scores, etc. Furthermore, since considering only the dialog length is definitely not enough for the evaluation of User Simulations, refined metrics ([21], [17], [6]) are under study. Third, we would like to use the ability of Bayesian Networks to learn their parameters online, so as to improve the naturalness of the simulated dialogs while users are actually interacting with the system, and to be able to retrain policies between real interactions.
Acknowledgements

This work is supported by the CLASSIC (Computational Learning in Adaptive Systems for Spoken Conversation) European project, Project Number 216594, funded under EC FP7, Call 1 (http://www.classic-project.org/). The authors also want to thank Oliver Lemon and Paul Crook for their help in using the REALL-DUDE/DIPPER/OAA framework.
References

1. Bos, J., Klein, E., Lemon, O., Oka, T.: DIPPER: Description and Formalisation of an Information-State Update Dialogue System Architecture. In: Proceedings of the 4th SIGDIAL Workshop on Discourse and Dialogue, pp. 115–124 (2003)
2. Cheyer, A., Martin, D.: The Open Agent Architecture. Journal of Autonomous Agents and Multi-Agent Systems (4), 143–148 (2001)
3. Clark, H., Schaefer, E.: Contributing to discourse. Cognitive Science 13, 259–294 (1989)
4. Eckert, W., Levin, E., Pieraccini, R.: User Modeling for Spoken Dialogue System Evaluation. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 80–87 (1997)
5. Horvitz, E., Breese, J., Heckerman, D., Hovel, D., Rommelse, K.: The Lumiere Project: Bayesian User Modeling for Inferring the Goals and Needs of Software Users. In: Proc. of the 14th Conference on Uncertainty in Artificial Intelligence (July 1998)
6. Janarthanam, S., Lemon, O.: A Two-Tier User Simulation Model for Reinforcement Learning of Adaptive Referring Expression Generation Policies. In: Proceedings of the 10th SIGDIAL, pp. 120–123 (2009)
7. Lemon, O., Liu, X., Shapiro, D., Tollander, C.: Hierarchical Reinforcement Learning of Dialogue Policies in a Development Environment for Dialogue Systems: REALL-DUDE. In: 10th SemDial Workshop on the Semantics and Pragmatics of Dialogue, BRANDIAL 2006 (2006)
8. Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Transactions on Speech and Audio Processing 8, 11–23 (2000)
9. López-Cózar, R., Callejas, Z., McTear, M.F.: Testing the performance of spoken dialogue systems by means of an artificially simulated user. Artificial Intelligence Review 26(4), 291–323 (2006)
10. Meng, H., Wai, C., Pieraccini, R.: The Use of Belief Networks for Mixed-Initiative Dialog Modeling. In: Proceedings of the 8th International Conference on Spoken Language Processing (ICSLP) (October 2000)
11. Pietquin, O.: A Probabilistic Description of Man-Machine Spoken Communication. In: Proceedings of the IEEE International Conference on Multimedia and Expo (ICME) (July 2005)
12. Pietquin, O.: Learning to Ground in Spoken Dialogue Systems. In: IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 4, pp. 165–168 (2007)
13. Pietquin, O., Dutoit, T.: A Probabilistic Framework for Dialog Simulation and Optimal Strategy Learning. IEEE Transactions on Audio, Speech, and Language Processing 14, 589–599 (2006)
14. Pietquin, O., Dutoit, T.: Dynamic Bayesian Networks for NLU Simulation with Applications to Dialog Optimal Strategy Learning. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (May 2006)
15. Pietquin, O., Renals, S.: ASR System Modeling for Automatic Evaluation and Optimization of Dialogue Systems. In: Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Orlando, FL, USA (May 2002)
16. Pietquin, O., Rossignol, S., Ianotto, M.: Training Bayesian Networks for Realistic Man-Machine Spoken Dialogue Simulation. In: Proceedings of the 1st International Workshop on Spoken Dialogue Systems Technology (December 2009)
17. Schatzmann, J., Georgila, K., Young, S.: Quantitative Evaluation of User Simulation Techniques for Spoken Dialogue Systems. In: Proceedings of the 6th SIGDIAL, pp. 45–54 (2005)
18. Schatzmann, J., Thomson, B., Young, S.: Error Simulation for Training Statistical Dialogue Systems. In: Proceedings of the International Workshop on Automatic Speech Recognition and Understanding (ASRU), Kyoto, Japan (2007)
19. Schatzmann, J., Weilhammer, K., Stuttle, M., Young, S.: A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. Knowledge Engineering Review 21(2), 97–126 (2007)
20. Thomson, B., Gašić, M., Keizer, S., Mairesse, F., Schatzmann, J., Yu, K., Young, S.: User Study of the Bayesian Update of Dialogue State Approach to Dialogue Management. In: Proceedings of Interspeech (2008)
21. Vuurpijl, L., ten Bosch, L., Rossignol, S., Neumann, A., Pfleger, N., Engel, R.: Evaluation of multimodal dialog systems. In: Proceedings of the LREC Workshop on Multimodal Corpora (2004)
22. Williams, J.D., Young, S.: Partially Observable Markov Decision Processes for Spoken Dialog Systems. Computer Speech and Language 21(2), 393–422 (2007)
23. Young, S.: CUED Standard Dialogue Acts. Technical report, Cambridge University Engineering Dept. (October 2007)
Facing Reality: Simulating Deployment of Anger Recognition in IVR Systems

Alexander Schmitt¹, Tim Polzehl², and Wolfgang Minker¹

¹ Dialogue Systems Group / Institute of Information Technology, University of Ulm, Albert-Einstein-Allee 43, D-89081 Ulm [email protected]
² Quality and Usability Lab der Technischen Universität Berlin / Deutsche Telekom Laboratories, Ernst-Reuter-Platz 7, D-10587 Berlin, Germany [email protected]
Abstract. With the availability of real-life corpora, studies dealing with speech-based emotion recognition have turned towards the recognition of angry users on the turn level. Based on acoustic, linguistic and sometimes contextual features, classifiers yield performance values of 0.7–0.8 f-score when classifying angry vs. non-angry user turns. The effect of deploying anger classifiers in real systems, however, remains an open question and has not been examined so far. Is the current performance of anger detection already adequate for a change in dialogue strategy or even an escalation to an operator? In this study we explore, on specific dialogues, the impact of an anger classifier published in a previous study. We introduce a cost-sensitive classifier that significantly reduces the number of misclassified non-angry user turns.
1 Introduction

An increasing number of studies deal with the detection of emotions in speech corpora stemming from telephone applications. Especially speech-based anger detection in such Interactive Voice Response (IVR) dialogue systems can be used offline to monitor quality of service [15,14]. It can indicate potentially problematic turns or grammar slots so that the carrier can monitor and refine the system. Furthermore, it could serve as a trigger to switch to dialogue strategies that take into account the user's emotional state [7,1]. Classification is reaching a performance level that makes it attractive beyond research. Anger detection also opens up opportunities in deployed systems. With reliable classifiers it will be possible to re-route customers to the assistance of a human operator when they get angry. However, this would require a solid performance of the classifier in spotting angry users. In this study we employ a state-of-the-art anger classifier on specific dialogues. We utilize the new Witchcraft Workbench [10], which has been developed for the simulation of prediction and classification models, in order to analyze the impact of the classifier on dialogue and to evaluate its performance in a real-life application.
The remainder of this paper is organized as follows: In Section 2, related work on classifying anger in IVR systems is presented. For our study we employed a large IVR corpus that is presented in Section 3. Section 4 presents an anger classifier applied to 1.702 calls from our dialogue corpus. The weaknesses of this approach lead us to Section 5, where we analyze the effect of employing a cost-sensitive meta-classifier that reduces the number of false predictions. Section 6 introduces the Witchcraft Workbench as an evaluation framework for anger models along with SDS dialogues. We finally conclude and discuss our findings in Section 7.
2 Related Studies for Detecting User Anger
One of the first studies addressing emotion recognition with special application to call centers was presented by [8] in 1999. Due to the lack of "real" data, the corpus was collected with 18 non-professional actors who were asked to record voice messages of 15–90 seconds at 22 kHz, which were split afterwards into chunks of 1–3 seconds. Raters were asked to label the chunks with "agitation" (including anger) and "calm". Recognition is performed on acoustic cues. Studies on real-life corpora came up in the early 2000s. [4] employed data from a deployed IVR system, but complained about data sparsity. The dataset comprises a balanced set of only 142 short utterances with "negative" and "non-negative" emotions. In later studies [5,3] the employed set contained 1.197 utterances, however, strongly skewed towards non-angry emotions. In addition to acoustic features, Lee et al. make use of linguistic cues that are extracted from ASR transcripts and propose the concept of Emotional Salience as an additional feature. In 2003, [15] stated that real-life corpora are still hard to obtain and employed an acted corpus containing 8 speakers at 22 kHz quality. The utterance length was artificially shortened to simulate conditions typical for Interactive Voice Response systems. Recent studies employ real-life data and some include additional knowledge sources that go beyond the acoustic and linguistic information of the current turn. [6] and [11] used context features such as the performance of the ASR and the barge-in behavior of the user to support anger detection on the turn level. In [13] the knowledge about the emotional state of the user in previous turns has been used to improve anger detection on the turn level. Current approaches yield average scores of 60–80% accuracy when classifying balanced sets of angry and non-angry user turns. At the same time, the detection of non-angry user turns performs better than the detection of angry user turns. In earlier work we compared a German and an English IVR corpus and found the classifier's precision for "non-anger" at 84.9% and 86.0%, while the classifier's precision for "anger" was as low as 72.0% and 67.0% [9].
3 Corpus Description
The dialogue corpus in our studies originates from a US-American IVR portal capable of fixing Internet-related problems jointly with the caller. In previous
work, three labelers annotated the individual user turns in the corpus with the labels angry, annoyed and non-angry. The final label was defined based on majority voting, resulting in 90.2% neutral, 5.1% garbage, 3.4% annoyed and 0.7% angry utterances. 0.6% of the samples in the corpus were sorted out since all three raters had different opinions. While the number of angry and annoyed utterances seems very low, 429 calls (i.e. 22.4% of all dialogues) contained annoyed or angry utterances. For recent studies we collapsed annoyed and angry into angry to be comparable with other studies in this field. Table 1 depicts the details of the corpus.

Table 1. Details of the employed dialog corpus
Domain                                   English Internet Support
Number of Dialogs in Total               1911
Audio Duration in Total                  10h
Average Number of Turns per Dialog       11.88
Number of Raters                         3
Speech Quality                           Narrow-band
Average Duration Anger in Seconds        1.87 ±0.61
Average Duration Non-Anger in Seconds    1.57 ±0.66
Cohen's Extended Kappa                   0.63

4 Cost-Insensitive Classification
A discriminative binary classification task such as anger recognition aims to distinguish between two distinct classes. By definition, one class is designated the "positive" class, the other the "negative" class. When evaluating the performance of a binary classifier with a test set of n samples, four numbers play an important role: the number of correctly classified samples from the positive class (true positives; TP); the number of samples mistakenly classified as belonging to the positive class (false positives; FP); the number of correctly classified samples from the negative class (true negatives; TN); and the number of samples incorrectly classified as belonging to the negative class (false negatives; FN). Consequently, the sum TP + FP + TN + FN equals the number of samples n in the test set. Porting these numbers to the anger detection domain, we can state that TP, FN, TN and FP signify how many utterances that

– have been angry have really been spotted by the classifier (TP)
– have been angry have mistakenly been classified as non-angry (FN)
– have not been angry have been correctly classified as non-angry (TN)
– have not been angry have mistakenly been classified as angry (FP)
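As a quick illustration of how these four counts yield the class-wise precision and recall figures used throughout this paper, here is a minimal helper; the numbers plugged in anticipate the anger-class counts of Table 2:

```python
# Class-wise precision and recall from the confusion counts.

def precision_recall(tp: int, fp: int, fn: int) -> tuple[float, float]:
    precision = tp / (tp + fp)  # share of "angry" predictions that were right
    recall = tp / (tp + fn)     # share of truly angry turns that were found
    return precision, recall

p, r = precision_recall(tp=386, fp=5124, fn=126)
print(f"anger precision = {p:.3f}, recall = {r:.3f}")  # ~0.070 and ~0.754
```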
In this task it is vital to yield a low FN rate. For reasonable classification it is important that each class is represented equally in the training set. Otherwise, most classifiers would tend to always predict the predominant class (in this task, "non-angry") if a sufficient
success rate is guaranteed this way. The test is typically carried out with a test set where the classes are also equally spread, to see whether the learned model separates the classes appropriately. However, when considering the later application of a trained model on realistic test data, the distribution of classes is not at all balanced and is frequently very skewed towards "non-angry". Furthermore, the classifier has to deal with noisy data such as coughing, laughing or off-talk. We evaluate the anger classifier designed as described in [12]. In order to stay speaker-independent, the training set is defined as follows: we randomly select 50% of the calls that contain anger and use all their angry user turns for training. These 459 angry user turns stem from 209 calls and are replenished with 459 non-angry turns from the same callers. To assess how the classifier would perform under real-life conditions, we use the remaining 1.702 calls from the corpus as test set. It contains the full bandwidth of speech samples that typically occur in IVR systems: 472 angry, 17.269 non-angry and 935 garbage utterances. For training and testing, turns labeled as "garbage" have been collapsed with the class "non-angry". The corpus-wide results are depicted in Table 2.

Table 2. Speaker-independent classification result on real-life data when evaluating the presented classifier
                   true non-angry   true angry   class precision
pred. non-angry    13.040 (TN)      126 (FN)     99.0%
pred. angry        5.124 (FP)       386 (TP)     7.0%
class recall       71.8%            75.3%
At first sight, the performance appears rather satisfying. 75.3% of all "angry" utterances in the corpus could be identified as angry, which can be seen from the recall value of the "angry" class. On the other hand, the precision of the anger class is very low at 7%: 5.124 non-angry turns have been falsely identified as angry. Imagine a scenario where the classifier's prediction is employed to trigger escalation to an operator. The result shows 13 times more "false alarms" than "true alarms". The classifier in its current state would lead to a large number of false escalations.
5 Cost-Sensitive Classification
In most work it is assumed that both classes are equally important to detect and that the costs of misclassification are equal in both cases. Provided that anger detection is deployed in a real telephone system, where the recognition of the classifier has an impact on the further dialogue flow, it is easy to see that a high number of FN is not as severe as a high number of FP: if a user turn is classified as non-angry although the user is angry (FN), the
126
A. Schmitt, T. Polzehl, and W. Minker
dialogue system would not behave differently than when no anger detection is deployed. A completely different scenario occurs when a non-angry user is mistakenly classified as angry (FP): the system would transfer the caller to an operator although there is no need to do so. Thus we can state that not all numbers are of equal importance to the task. To make the classifier cost-sensitive, we apply the MetaCost algorithm [2] to our classifier. MetaCost uses the confidence scores of the underlying classifier in order to choose the class with the minimal risk. In this context the conditional risk R(i|x) is defined as

    R(i|x) = Σ_j P(j|x) · C(i, j)
The risk is the expected cost of classifying sample x as belonging to class i. The probability P(j|x) is the confidence of the underlying classifier that sample x belongs to class j. C(i, j) is the cost matrix that defines the costs for correct and wrong predictions. A cost matrix contains the penalty values for each decision a classifier can make when predicting the class of an utterance. Normally, each correct decision is weighted with 0 (no penalty, since it was a correct decision) and each misclassification is weighted with 1. Altering these weights causes the classifier to favor one class at the expense of the other. The true costs of misclassifying non-angry callers as angry ones can hardly be quantified and depend on a variety of factors such as costs for operators, availability of operators, per-minute costs for the IVR system, etc. We therefore empirically determine a sensible value for the cost matrix that yields reasonable results. We increase the cost for misclassifying non-angry callers over the factors 1, 2, ..., 15. The results are depicted in Figure 1. With increasing costs for misclassifying non-angry turns, the classifier favors the class non-angry. The precision for angry increases; on the other hand, the recall for angry decreases. Looking at Figure 1 we can see that the optimal cost for misclassifying non-angry turns is a cost value of 7. It reduces the number of misclassified non-angry turns (FP) by a factor of 16.9, namely from 5.124 (cost=1) to 303, while the number of correctly classified angry turns (TP) is only reduced by a factor of 3.7, from 386 to 103.

[Figure 1: results of iteratively increasing the cost for misclassifying a non-angry user turn over cost values 1–15 – panel (a): class-wise precision and recall (pos_precision, pos_recall, neg_precision, neg_recall); panel (b): number of TP, FP, FN samples]

Fig. 1. Results of iteratively increasing the cost for misclassifying a non-angry user turn: class-wise precision and recall values (a) and number of TP, FP, FN (b). The number of TN is not depicted here due to visualization. It slightly increases from 19.000 to 20.000 samples.
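To illustrate the decision rule behind these numbers, the sketch below applies the conditional-risk formula with the best cost setting found above (cost 7 for misclassifying a non-angry turn). Note that the full MetaCost algorithm additionally relabels the training data via bagging and retrains the base classifier; only the risk-minimising decision step is shown here, with invented posterior scores.

```python
# Minimal sketch of the risk-minimising decision rule:
# predict the class i with minimal R(i|x) = sum_j P(j|x) * C(i, j).

CLASSES = ["non-angry", "angry"]
# COST[i][j]: cost of predicting class i when the true class is j.
# Misclassifying a non-angry turn as angry costs 7 (the paper's best setting).
COST = {"non-angry": {"non-angry": 0.0, "angry": 1.0},
        "angry":     {"non-angry": 7.0, "angry": 0.0}}

def min_risk_class(posterior: dict[str, float]) -> str:
    """posterior: confidence scores P(j|x) of the underlying classifier."""
    risk = {i: sum(posterior[j] * COST[i][j] for j in CLASSES) for i in CLASSES}
    return min(risk, key=risk.get)

# An utterance the base classifier finds mildly angry is still routed to
# "non-angry" once false alarms are made expensive:
print(min_risk_class({"non-angry": 0.4, "angry": 0.6}))  # -> 'non-angry'
# Only very confident anger predictions (P(angry|x) > 7/8) trigger "angry":
print(min_risk_class({"non-angry": 0.1, "angry": 0.9}))  # -> 'angry'
```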
6 The Effect on Dialogue
What effect would the deployment of the cost-insensitive classifier from Section 4 and the cost-sensitive classifier from Section 5 have on specific dialogues? We apply both classification models within the Witchcraft Workbench. Our Workbench for Intelligent exploraTion of Human ComputeR conversaTions (witchcraftwb.sourceforge.net) is a new platform-independent open-source workbench designed for the analysis, mining and management of large spoken dialogue system corpora. What makes Witchcraft unique is its ability to visualize the effect of classification and prediction models on ongoing system-user interactions. Witchcraft is able to handle
predictions from binary and multi-class discriminative classifiers and regression models. Hence, the workbench allows us to apply the results of an anger classification model to specific dialogues. Witchcraft operates on the turn level, requesting the classifier to deliver a prediction based on the information available at the currently processed dialogue turn of a specific call. Witchcraft reads in the predictions from the classifier that have been determined on the turn level (e.g., the current emotional state of the caller). Then it
creates statistics on the call level. It counts the numbers of TP, FP, TN and FN within the call and calculates class-wise precision and recall values as well as accuracy, f-score, etc. The integrated SQL-based search mechanism allows searching for calls that fulfill a specific criterion. By that, calls with a high number of false positives, low f1-scores for the class anger, etc. can be spotted. An example of such a call under analysis in the Witchcraft Workbench is depicted in Figure 2.
Fig. 2. Screenshot of the Witchcraft Workbench in the Analysis Perspective. It consists of two Call Selection Views (top-left), a Call Detail view, the central Dialogue View enabling navigation within the conversation, Classification Tables, Chart Views and the audio player. Turns that have been correctly identified as anger (TP) are depicted with green background color, false predictions are depicted with red background color (FN+FP).
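The call-level bookkeeping just described can be pictured with a small sketch; the data layout and call IDs are hypothetical, and Witchcraft itself performs the equivalent aggregation and exposes it through its SQL-based search:

```python
# Illustrative sketch of turn-level predictions aggregated into call-level
# statistics, so calls with many false alarms can be retrieved.

def call_statistics(turns):
    """turns: list of (true_label, predicted_label), labels 'angry'/'non-angry'."""
    stats = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for true, pred in turns:
        if true == "angry":
            stats["TP" if pred == "angry" else "FN"] += 1
        else:
            stats["FP" if pred == "angry" else "TN"] += 1
    return stats

calls = {"call_001": [("non-angry", "angry"), ("non-angry", "non-angry"),
                      ("angry", "angry")]}
# Retrieve calls with at least one false alarm, mimicking the SQL-based search:
false_alarm_calls = [cid for cid, t in calls.items()
                     if call_statistics(t)["FP"] >= 1]
print(false_alarm_calls)  # -> ['call_001']
```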
Our interest lies in finding out why and to what extent misclassification of non-angry turns happens. An auditory analysis of the misclassified samples in Witchcraft revealed that the anger model frequently misclassifies distorted and garbage turns that contain higher frequencies and high loudness values as "angry". When using the cost-insensitive classifier, Witchcraft identifies 1.196 calls that contain no user anger (according to the raters) but at least one user turn mistakenly classified as angry (according to the classifier). 313 calls even contain 5 or more such misclassifications. The adjusted cost-sensitive classifier, with the cost value for misclassifying non-angry turns set to 7, produces a different picture. The numbers drop to only 134 and 4 calls, respectively. Two representative dialogues are depicted in Figure 3.
[Figure 3: two example calls – (a) Non-angry Caller, (b) Angry Caller]

Fig. 3. Typical calls from a non-angry caller and an angry caller, predicted by the cost-insensitive and the cost-sensitive learner (left and right side, respectively). Again, green turns symbolize correctly spotted angry turns (TP), red turns symbolize misclassified turns (FN+FP).
The cost-insensitive classifier generally causes a large number of false alarms. Figure 3(a) contains a call from a non-angry user. The cost-insensitive classifier (left side) misclassifies a high number of non-angry turns as angry ones. The cost-sensitive classifier (right side) ignores the predictions of the underlying classifier and decides in favor of "non-angry". Figure 3(b) contains a call from an angry caller. The cost-insensitive classifier predicts nearly all angry turns correctly, while the cost-sensitive one behaves more conservatively and misses some of the truly angry turns.
7 Conclusion and Discussion
This study has examined the impact of deploying anger recognition in an IVR system. The presented cost-insensitive classifier generates a very high number of false positives. This first setup does not seem suitable for an IVR system that relies on anger detection for escalating to operators. The employment of the presented cost-sensitive learner decreases the number of misclassified non-angry user turns. A cost value of 7 turned out to be the best compromise for gaining low FP values while keeping the TP value comparably high. Notwithstanding these improvements, the false positives for the detection of angry users remain almost three times as high as the true positives (303 vs. 103). The fact that the classes "annoyed" and "angry" have previously been merged into one single class could be an explanation for this behavior. In effect, the patterns between really angry and non-angry turns are blurred and the classifier loses performance.
For classifiers that are intended to predict hot anger online, it is thus advisable to adjust the rating process and include only those samples that contain hot anger. Further, it turns out to be questionable whether the detection of a single angry user turn should serve as the basis for escalation to operators. Pattern matching that takes several user turns into account would be better suited to detecting callers that are in a rage. The current performance seems appropriate for an adaptation of the dialogue strategy based on the classifier's prediction.
Acknowledgment

The research leading to these results has received funding from the Transregional Collaborative Research Centre SFB/TRR 62 "Companion-Technology for Cognitive Technical Systems" funded by the German Research Foundation (DFG). We would like to thank the reviewers for their valuable comments.
References

1. Burkhardt, F., van Ballegooy, M., Huber, R.: An emotion-aware voice portal. In: Proceedings of Electronic Speech Signal Processing ESSP (2005)
2. Domingos, P.: MetaCost: A general method for making classifiers cost-sensitive. In: Proceedings of the Fifth International Conference on Knowledge Discovery and Data Mining, pp. 155–164. ACM Press, New York (1999)
3. Lee, C.M., Narayanan, S.S.: Toward detecting emotions in spoken dialogs. IEEE Transactions on Speech and Audio Processing 13(2), 293–303 (2005)
4. Lee, C.M., Narayanan, S.S., Pieraccini, R.: Recognition of negative emotions from the speech signal. In: IEEE Workshop on Automatic Speech Recognition and Understanding, ASRU 2001, pp. 240–243 (2001)
5. Lee, C.M., Narayanan, S.S., Pieraccini, R.: Combining acoustic and language information for emotion recognition. In: ICSLP 2002 (2002)
6. Liscombe, J., Riccardi, G., Hakkani-Tür, D.: Using context to improve emotion detection in spoken dialog systems. In: Interspeech, pp. 1845–1848 (2005)
7. Metze, F., Englert, R., Bub, U., Burkhardt, F., Stegmann, J.: Getting closer: tailored human-computer speech dialog. Universal Access in the Information Society (2008)
8. Petrushin, V.A.: Emotion in speech: Recognition and application to call centers. In: Artificial Neural Networks in Engineering (ANNIE 1999), St. Louis, pp. 7–10 (1999)
9. Polzehl, T., Schmitt, A., Metze, F.: Salient Features for Anger Recognition in German and English IVR Portals. In: Spoken Dialogue Systems Technology and Design. Springer, Boston, USA (August 2010)
10. Schmitt, A., Bertrand, G., Heinroth, T., Liscombe, J.: Witchcraft: A workbench for intelligent exploration of human computer conversations. In: International Conference on Language Resources and Evaluation (LREC), Valetta, Malta (May 2010)
11. Schmitt, A., Heinroth, T., Liscombe, J.: On NoMatchs, NoInputs and BargeIns: Do non-acoustic features support anger detection? In: Proceedings of the 10th Annual SIGDIAL Meeting on Discourse and Dialogue, SIGDIAL Conference 2009, London (UK). Association for Computational Linguistics (September 2009)
12. Schmitt, A., Pieraccini, R., Polzehl, T.: 'For Heaven's sake, gimme a live person!' Designing Emotion-Detection Customer Care Voice Applications in Automated Call Centers. In: Advances in Speech Recognition: Mobile Environments, Call Centers and Clinics. Springer, Heidelberg (September 2010)
13. Schmitt, A., Polzehl, T.: Modeling a-priori likelihoods for angry user turns with hidden Markov models. In: Proc. of the Fifth International Conference on Speech Prosody 2010 (May 2010)
14. Shafran, I., Riley, M., Mohri, M.: Voice signatures. In: 2003 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2003), pp. 31–36 (2003)
15. Yacoub, S., Simske, S., Lin, X., Burns, J.: Recognition of emotions in interactive voice response systems. In: Proc. Eurospeech, Geneva, pp. 1–4 (2003)
A Discourse and Dialogue Infrastructure for Industrial Dissemination

Daniel Sonntag, Norbert Reithinger, Gerd Herzog, and Tilman Becker

German Research Center for AI (DFKI), Stuhlsatzenhausweg 3, 66123 Saarbruecken, Germany
Abstract. We think that modern speech dialogue systems need a prior usability analysis to identify the requirements of industrial applications. In addition, work from the area of the Semantic Web should be integrated. These requirements can then be met by multimodal semantic processing, semantic navigation, interactive semantic mediation, user adaptation/personalisation, interactive service composition, and semantic output representation, which we explain in this paper. We also describe the discourse and dialogue infrastructure that these components constitute and provide two examples of disseminated industrial prototypes.
1 Introduction

Dialogue system construction is a difficult task, since many individual natural language processing components have to be combined into a complex AI system. Theoretical aspects of and perspectives on communicative intention have to meet the practical demands of a user-machine interface: the speech-based communication must be natural for humans, otherwise they will never accept dialogue as a proper means of communication with machines. Over the last several years, the market for speech technology has seen significant developments [14] and powerful commercial off-the-shelf solutions for speech recognition (ASR) or speech synthesis (TTS). Even entire voice user interface (VUI) platforms have become available. However, these discourse and dialogue infrastructures have had only moderate success so far in the entertainment or industrial sector. This is the case because a dialogue system, as a complex AI system, cannot easily be constructed. Additionally, dialogue engineering requires extensive customisation work for specific applications. We implemented a new discourse and dialogue infrastructure for semantic access to structured and unstructured information repositories for industrial applications. In this paper, we provide two new contributions. First, we give architectural recommendations of how new dialogue infrastructures may be designed. Second, we discuss which components are needed to meet the requirements of dialogical interaction in multiple use case scenarios and with multiple interaction devices. To meet these objectives, we implemented a distributed, ontology-based dialogue system architecture where every major component can be run on a different host, increasing the scalability of the overall
system. Thereby, the dialogue system acts as the middleware between the clients and the backend services and hides complexity from the user by presenting aggregated ontological data. We also implemented the attached dialogue components within the architecture. This paper is structured as follows. First we discuss related work and the basic dialogue architecture. This is followed by a discussion of the individual dialogue tasks and components. Finally, we discuss two industrial dissemination prototypes and provide a conclusion.

Related Work. The dialogue engineering task is to provide dialogue-based access to the domain of interest, e.g., for answering domain-specific questions about an industrial process in order to complete a business process. Prominent examples of integration platforms include OOA [11], TRIPS [1], and Galaxy Communicator [19]; these infrastructures mainly address the interconnection of heterogeneous software components. The W3C consortium also proposes inter-module communication standards like the Voice Extensible Markup Language VoiceXML¹ or the Extensible MultiModal Annotation markup language EMMA², with products from industry supporting these standards³. In addition, many systems are available that translate natural language input into structured ontological representations (e.g., AquaLog [10]), port the language to specific domains (e.g., ORAKEL [3]), or use reformulated semantic structures (NLION [16]). AquaLog, for example, presents a solution for a rapid customisation of the system for a particular ontology; with ORAKEL a system engineer can adapt the natural language understanding (NLU) component [4] in several cycles, thereby customising the interface to a certain knowledge domain. The system NLION uses shallow natural language processing techniques (i.e., spell checking, stemming, and compound detection) to instantiate an ontology object. Systems which integrate such sub-task software modules from academia or industry can be used off the shelf. However, if one looks closer at actual industrial projects' requirements, this idealistic vision begins to blur, mainly because of software infrastructure or usability issues. These issues are explained in the context of our basic architecture approach for industrial dissemination.

Basic Architecture. We learned some lessons which we use as guidelines in the development of basic architectures and software infrastructures for multimodal dialogue systems. In earlier projects [27, 17] we integrated different subcomponents into multimodal interaction systems. Thereby, hub-and-spoke dialogue frameworks played a major role [18]. We also learned lessons which we use as guidelines in the development of semantic dialogue systems [12, 23]; over the last years, we have adhered strictly to the rule we developed: "No presentation without representation." The idea is to implement a generic, semantic dialogue shell that can be configured for and applied to domain-specific dialogue applications.

¹ http://www.w3.org/TR/voicexml20/
² http://www.w3.org/TR/emma/
³ http://www.voicexml.org
All messages transferred between internal and external components are based on RDF data structures which are modelled in a discourse ontology (also cf. [6, 8, 21]). Our systems for industrial dissemination have four main properties: (1) multimodality of user interaction, (2) ontological representation of interaction structures and queries, (3) semantic representation of the interface, and (4) encapsulation of the dialogue proper from the rest of the application.⁴ These architectural decisions are based partly on usability issues that arise when dealing with end-to-end dialogue-based interaction systems for industrial dissemination (they correspond to the use case requirements). Intelligent AI systems that involve intelligent algorithms for dialogue processing and interaction management must be judged for their suitability in industrial environments. Our major concern, observed in the development process for industrial applications over the last years, is that the incorporation of AI technologies such as complex natural language understanding components (e.g., HPSG-based speech understanding) and open-domain question answering functionality can unintentionally diminish a dialogue system's usability. This is because negative side-effects such as diminished predictability of what the system is doing at the moment and lost controllability of the internal dialogue processes (e.g., a question answering process) occur more often when AI components are involved. This tendency creates new usability requirements to account for the special demands introduced by the use of AI. For the identification of these usability issues, we adopted the binocular view of interactive intelligent systems (discussed in detail in [9]). The main architectural challenges we encountered in implementing a new dialogue application for a new domain can be summarised as follows: first, providing a common basis for task-specific processing; second, accessing the entire application backend via a layered approach. In our experience, these challenges can best be solved by implementing the core of a dialogue runtime environment, an ontology dialogue platform (ODP) framework and its platform API (the DFKI spin-off company SemVox, see www.semvox.de, offers a commercial version), as well as by providing configurable adaptor components. These translate between conventional answer data structures and ontology-based representations (in the case of, e.g., a SPARQL backend repository) or Web Services (WS), ranging from simple HTTP-based REST services to Semantic Web Services driven by declarative specifications [24]. Our ODP workbench (Figure 1) builds upon the industry standard Eclipse and also integrates other established open source software development tools to support dialogue application development, automated testing, and interactive debugging. A distinguishing feature of the toolbox is the built-in support for eTFS (extended Typed Feature Structures), the optimised ODP-internal data representation for knowledge structures. This enables ontology-aware tools for the knowledge engineer and application developer. A detailed description of the rapid dialogue engineering process, which is possible thanks to the Eclipse plugins and templates, can be found in [25].

⁴ A comprehensive overview of ontology-based dialogue processing and the systematic realisation of these properties can be found in [21], pp. 71–131.
Fig. 1. Overall design of the discourse and dialogue infrastructure: ontology-based dialogue processing framework and workbench
2 Dialogue Tasks and Components
In addition to automatic speech recognition (ASR), dialogue tasks include the interpretation of the speech signal and other input modalities, the context-based generation of multimedia presentations, and the modelling of discourse structures. These topic areas are all part of the general research and development agenda within the area of discourse and dialogue, with an emphasis on dialogue systems. According to the utility issues and industrial user requirements we identified (system robustness/usability and processing transparency play the major roles), we distinguished five dialogue tasks for these topics and built five corresponding dialogue components. These components, which allow for a semantic and pragmatic/application-based modelling of discourse and dialogue, are presented in more detail in the rest of this section.

Multimodal Semantic Processing and Navigation. This task provides the rule-based fusion of different input modalities such as text, speech, and pointing gestures. We use a production-rule-based fusion and discourse engine which follows the implementation in [13]. Within the dialogue infrastructure, this component plays the major role since it provides basic and configurable dialogue processing capabilities that can be adapted to specific industrial application scenarios. Additional processing robustness is achieved through a special robust parsing feature which operates on the RDF graphs resulting from the input parsing process. When the user only utters catchwords instead of complete utterances, the semantic relationship between the catchwords can be guessed (following [26]) according to the ontological domain model of the industrial application domain.
The semantic navigation and interaction module builds the connection to the backend services (the tasks Interactive Semantic Mediation and Web Service Composition) and the presentation module (the task Semantic Output Representation) and allows for a graph-like presentation of incremental results (also cf. [2]). The spoken dialogue input is used to generate SPARQL queries on ontology instances (using a Sesame repository, see www.openrdf.org). Users are then presented with a perspective on the result RDF graph, with navigation possibilities.

Interactive Semantic Mediation. Interactive semantic mediation has two aspects: (1) the mediation with the processing backend, the so-called dynamic knowledge base layer (i.e., the heterogeneous data repositories), and (2) the mediation with the user (advanced user interface functionality for the adaptation to new industrial use case scenarios). We developed a semantic mediation component to mediate between the query interpretation created in the dialogue shell and the semantic backend services prevalent in the industrial application context. The mediator can be used, e.g., to collect answers from external information sources. Since we deal with ontology-based information in heterogeneous terminology, ontology matching has become one of the major requirements. The ontology matching task can be addressed by several string-based or structure-based techniques (cf. [5], p. 341, for example). As a new contribution in the context of large-scale speech-based AI systems, we think of ontology matching as a dialogue-based interactive mediation process (cognitive support frameworks for ontology mapping involve users). The overall approach is depicted in Figure 2. A dialogue-based approach can make more use of partial, uncertain mappings. Furthermore, it increases the usability in dialogue scenarios where the primary task is different from the matching task itself (cf. the industrial usability requirements). So as not to annoy the user, he/she is presented only with the difficult cases for disambiguation feedback; thus we use the dialogue shell basically to confirm or reject pre-considered alignments [20].

User Adaptation and Personalisation. Industrial usability requirements clearly call for new AI systems to work with user models and personalised information. An additional, related concern is given by the industrial requirement to provide user privacy protection and implement data security guidelines. This makes a user adaptation and personalisation task an indispensable asset for many advanced AI systems. Our discourse and dialogue infrastructure should benefit from user preferences, communities, their ontological modelling, the privacy issues of anonymity, data security, as well as data control by the user (also cf. processing transparency and transaction security for processes that involve sensitive data like medical patient records). Figure 3 illustrates the close interrelationships between the adaptation and personalisation blocks. In the context of new dialogue system infrastructures, persistent and personalised annotation of personal data (e.g., personal image annotations) should be provided.
Fig. 2. Interactive semantic mediation for industry adaptation
topics of interest include interaction histories with dialogue session, task, and personalised interaction hierarchies. For example, our use case implementation tries to guess a predefined user group according to the interaction history.
Fig. 3. Industry-relevant forms of user adaptation and personalisation
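To make the user-group guessing mentioned above concrete, the following deliberately simplified sketch derives a coarse user group from interaction-history statistics. It is illustrative only and not the project's actual implementation; the feature names, groups, and thresholds are invented for the example.

# Illustrative only: guess a coarse user group from an interaction history.
# Feature names, group labels, and thresholds are hypothetical.
def guess_user_group(history):
    """history: list of turns, each a dict with 'modality' and 'help_requested'."""
    n = max(len(history), 1)
    help_rate = sum(t["help_requested"] for t in history) / n
    touch_rate = sum(t["modality"] == "touch" for t in history) / n
    if help_rate > 0.3:
        return "novice"                      # frequent help requests
    if touch_rate > 0.7:
        return "touch-oriented expert"       # prefers pointing over speech
    return "speech-oriented expert"

A persistent, ontology-based interaction history would allow such a guess to be carried across sessions and devices.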
Interactive Service Composition. The service composer module takes input from the multimodal semantic base module. First, the structured representation of the query is mapped to a formal query. The formal representation no longer uses concepts of verbs, nouns, etc., but rather uses the custom knowledge representation in RDF. If only one query-answering backend exists, the knowledge representation of this backend can be used. Otherwise, an interim RDF-based representation is generated. The formal query is analysed
and mapped to one or more services that can answer (parts of) the query. This step typically involves several substeps, including decomposing the query into smaller parts, finding suitable services (service discovery), mapping the query to other representations, planning a series of service executions, and possibly initiating clarification or disambiguation requests. The different workflows form different routes in a pre-specified execution plan which includes the possibility to request clarifying information from the user if needed (hence interactive service composition). Figure 4 outlines the hard-wired execution plan to dynamically address and compose SOAP/WSDL-based and REST-based services.
Fig. 4. Execution plan for (interactive) service composition
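The flow of such an execution plan can be sketched as follows. This is not the project's code; the helper callables (decompose, discover, execute, ask_user) are hypothetical placeholders standing in for the query decomposition, service discovery, service invocation, and clarification steps named above.

# Hypothetical sketch of one interactive service-composition pass:
# split the formal query, discover services, and ask the user for
# clarification when a sub-query cannot be resolved uniquely.
def decompose(query):
    # stand-in: treat the query as a list of sub-queries
    return query if isinstance(query, list) else [query]

def compose_and_execute(query, discover, execute, ask_user):
    """discover(sub) -> candidate services; execute(service, sub) -> result."""
    results = []
    for sub in decompose(query):
        candidates = discover(sub)             # service discovery step
        if not candidates:
            continue                           # no service answers this part
        if len(candidates) > 1:                # ambiguity -> clarification request
            candidates = [ask_user("Which service should be used?", candidates)]
        results.append(execute(candidates[0], sub))
    return results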
Semantic Output Representation. We implemented a semantic output representation module which realises an abstract container concept called Semantic Interface Elements (SIEs) for the representation and activation of multimedia elements visible on, e.g., a touchscreen user interface [22]. The semantic output representation architecture comprises several GUI-related submodules (such as the Semantic Interface Elements Manager or the Event Manager), dialogue-engine-related modules (such as the Interaction Manager or the natural language Parser), and the Presentation Manager sub-module (i.e., the GUI Model Manager). The most important part of the architecture is the Display Manager, which observes the behaviour of the currently displayed SIE. The display manager dispatches XML-based messages to the dialogue system with the help of a message
Decoder/Encoder. This display manager has to be customised for every new application, whereas all other modules are generic. The Multimodal Dialogue Engine then processes the requests in the Interaction and Interpretation Manager modules. A new system action or reaction is then initiated and sent to the Presentation Manager. The GUI Model Manager builds up a presentation message in eTFS notation (internal typed feature structure format) before the complete message is sent to the Display Manager as a PreML message.
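To make the message path concrete, the sketch below shows how a display manager could encode an SIE event as an XML message for the dialogue system. The XML vocabulary here is invented for illustration; the actual PreML and eTFS formats are internal to the described system and are not reproduced.

# Hypothetical illustration of the Display Manager -> dialogue system
# message path; the XML element names are invented, not PreML/eTFS.
import xml.etree.ElementTree as ET

def encode_sie_event(sie_id, event_type, payload):
    msg = ET.Element("sieEvent", attrib={"sie": sie_id, "type": event_type})
    for key, value in payload.items():
        ET.SubElement(msg, "slot", attrib={"name": key}).text = str(value)
    return ET.tostring(msg, encoding="unicode")

# e.g., a touch on a displayed medical image element:
print(encode_sie_event("imageSIE-3", "touch", {"region": "lymphNode-12"}))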
3 Industrial Dissemination
In the following discussion of industrial dissemination, the first example shows a medical prototype application for a radiologist, while the second shows a business-to-business mobile application. Both dialogue systems build upon the multimodal speech-based discourse and dialogue infrastructure described in this paper. The prototypes put emphasis on various and combined input forms on different interaction devices, i.e., multitouch screens and iPhones.
Radiology Dialogue System. In the MEDICO use case, we work on the direct industrial dissemination of a medical dialogue system prototype. Clinical care and research increasingly rely on digitised patient information. There is a growing need to store and organise all patient data, including health records, laboratory reports, and medical images. Effective retrieval of images builds on the semantic annotation of image contents. At the same time, it is crucial that clinicians have access to a coherent view of these data within their particular diagnosis or treatment context. This means that with traditional user interfaces, users may browse or explore visualised patient data, but little or no help is given when it comes to the interpretation of what is being displayed. Semantic annotations should provide the necessary image information, and a semantic dialogue shell should be used to ask questions about the image annotations and refine them while engaging the clinician in a natural speech dialogue at the same time. The process of reading the images has to be highly efficient. Recently, structured reporting was introduced, which allows radiologists to use predefined standardised forms for a limited but growing number of specific examinations. However, radiologists feel restricted by these standardised forms and fear a decrease in focus and eye dwell time on the images [7, 28]. As a result, the acceptance of structured reporting is still low among radiologists, while referring physicians and hospital administrative staff are generally supportive of structured standardised reporting, since it eases the communication with the radiologists and can be used more easily for further processing. We strive to overcome the limitations of structured reporting by allowing content-based information to be automatically extracted from medical images and (in combination with dialogue-based reporting) eliminating the radiologists' need to fill out forms, enabling them to focus on the image while either dictating the image annotations of the reports to the dialogue system or refining existing annotations.
The domain-specific dialogue application [24], which uses a touchscreen (Figure 5, left) to display the medical SIE windows, is able to process the following medical user-system dialogue:
1 U: "Show me the CTs, last examination, patient XY."
2 S: Shows corresponding patient CT studies as DICOM picture series and MR videos.
3 U: "Show me the internal organs: lungs, liver, then spleen and colon."
4 S: Shows corresponding patient image data according to referral record.
5 U: "This lymph node here (+ pointing gesture) is enlarged; so lymphoblastic. Are there any comparative cases in the hospital?"
6 S: "The search obtained this list of patients with similar lesions."
7 U: "Ah okay." (Our system switches to the comparative records to help the radiologist in the differential diagnosis of the suspicious case, before the next organ (liver) is examined.)
8 U: "Find similar liver lesions with the characteristics: hyper-intense and/or coarse texture ..."
9 S: Our system again displays the search results ranked by the similarity and matching of the medical ontology terms that constrain the semantic search.
Fig. 5. Left: Multimodal touchscreen interface (reprinted from [24]). The clinician can touch the items and ask questions about them. Right: Mobile speech client for the business expert.
Currently, the prototype application is being tested in a clinical environment (University Hospitals Erlangen). Furthermore, the question of how to integrate this information and image knowledge with other types of data, such as patient data, is paramount. In a further step, individual, speech-based findings should be organised according to a specific body region and the respective textual patient data.
Mobile Business-to-Business Dialogue System. In the TEXO use case, we try to assist an employee and his superior in a production pipeline [15]. Our mobile business scenario is as follows: searching on a service platform, an employee of a company has found a suitable service which he needs for only
a short period of time for his current work. Since he is not allowed to carry out the purchase, he formally requests the service by writing a ticket in the company-internal Enterprise Resource Planning (ERP) system. In the defined business process, only his superior can approve the request and buy the service. But first, the person in charge has to check for alternative services on the service platform which might be more suitable for the company in terms of quality or cost standards. The person in charge is currently away on business, but he carries his mobile device with him, which allows him to carry out the transaction on the go. The interaction is speech-based and employs a distributed version and instance of the dialogue system infrastructure. The mobile client (Figure 5, left part of right half) streams the speech and click input to the dialogue server where the input fusion and reaction tasks are performed. The user can ask for specific services by saying, "Show me alternative services", or by naming different service classes. After the ranked list of services is presented, multiple filter criteria can be specified at once (e.g., "Sort the results according to price and rating."). As a result, the services are displayed in a 2D grid which eases the selection according to multiple sorting criteria (Figure 5, right).
We tested the mobile system in 12 business-related subtasks before the industrial dissemination. Eleven participants were recruited from a set of 50 people who responded to our request (only that fraction was found suitable). The selected people were all students (most of them in business or economics studies). From our analysis of the questionnaires we conclude that our mobile B2B system can be valuable for business users. Almost all users successfully completed the subtasks (89% of a total of 132 subtasks). In addition, many of them also provided the following positive feedback: they felt confident about the ticket purchase being successful. Then we introduced the prototype to the industrial partner. According to first user tests with the industrial partner SAP, the mobile speech client enhances the perceived usability of mobile business services and has a positive impact on mobile work productivity. More precisely, our mobile business application reflects the business task but simplifies the selection of services and the commitment of a transaction (internal purchase); it additionally minimises text entry (a limitation on mobile devices) and displays relevant information in such a way that a user is able to grasp it at first glance (3D visualisation).
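As a small illustration of the multi-criteria presentation described above, the fragment below orders a service list by price and rating at once; the field names and example data are hypothetical.

# Hypothetical sketch: order services by two criteria at once, as in
# "Sort the results according to price and rating."
services = [
    {"name": "Service A", "price": 12.0, "rating": 4.5},
    {"name": "Service B", "price": 12.0, "rating": 3.9},
    {"name": "Service C", "price": 9.5,  "rating": 4.1},
]
# ascending price, then descending rating; the two keys could equally
# span the two axes of a 2D grid instead of a single ranked list
ranked = sorted(services, key=lambda s: (s["price"], -s["rating"]))
for s in ranked:
    print(s["name"], s["price"], s["rating"])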
4 Conclusion
Based on an integration platform for off-the-shelf dialogue solutions and internal dialogue modules, we described the parts of a new discourse and dialogue infrastructure. The ontology-based dialogue platform provides a technical solution for the challenge of dissemination into industrial environments. The requirements for an industrial dissemination are met by implementing generic components for the most important tasks that we identified: multimodal semantic processing, semantic navigation, interactive semantic mediation, user adaptation/personalisation, interactive service composition, and semantic output representation.
Semantic (ontology-based) interpretations of dialogue utterances may become the key advancement in semantic search and dialogue-based interaction for industrial applications, thereby mediating and addressing dynamic, business-relevant information sources.
Acknowledgments. Thanks go out to Robert Nesselrath, Yajing Zang, Markus Löckelt, Malte Kiesel, Matthieu Deru, Simon Bergweiler, Eugenie Giesbrecht, Alassane Ndiaye, Norbert Pfleger, Alexander Pfalzgraf, Jan Schehl, Jochen Steigner and Colette Weihrauch for the implementation and evaluation of the dialogue infrastructure. This research has been supported by the THESEUS Programme funded by the German Federal Ministry of Economics and Technology (01MQ07016).
References
1. Allen, J., Byron, D., Dzikovska, M., Ferguson, G., Galescu, L., Stent, A.: An Architecture for a Generic Dialogue Shell. Natural Language Engineering 6(3), 1–16 (2000)
2. Boselli, R., Paoli, F.D.: Semantic navigation through multiple topic ontologies. In: Proceedings of Semantic Web Applications and Perspectives (SWAP), Trento, Italy (2005)
3. Cimiano, P., Haase, P., Heizmann, J.: Porting natural language interfaces between domains: an experimental user study with the ORAKEL system. In: IUI 2007: Proceedings of the 12th International Conference on Intelligent User Interfaces, pp. 180–189. ACM, New York (2007)
4. Engel, R.: SPIN: A Semantic Parser for Spoken Dialog Systems. In: Proceedings of the 5th Slovenian and First International Language Technology Conference, IS-LTC 2006 (2006)
5. Euzenat, J., Shvaiko, P.: Ontology Matching. Springer, Heidelberg (2007)
6. Fensel, D., Hendler, J.A., Lieberman, H., Wahlster, W. (eds.): Spinning the Semantic Web: Bringing the World Wide Web to Its Full Potential. MIT Press, Cambridge (2003)
7. Hall, F.M.: The radiology report of the future. Radiology 251(2), 313–316 (2009)
8. Hitzler, P., Krötzsch, M., Rudolph, S.: Foundations of Semantic Web Technologies. Chapman & Hall/CRC, Boca Raton (August 2009)
9. Jameson, A.D., Spaulding, A., Yorke-Smith, N.: Introduction to the Special Issue on Usable AI. AI Magazine 30(4), 11–16 (2009)
10. Lopez, V., Pasin, M., Motta, E.: AquaLog: An ontology-portable question answering system for the semantic web. In: Gómez-Pérez, A., Euzenat, J. (eds.) ESWC 2005. LNCS, vol. 3532, pp. 546–562. Springer, Heidelberg (2005)
11. Martin, D., Cheyer, A., Moran, D.: The Open Agent Architecture: a framework for building distributed software systems. Applied Artificial Intelligence 13(1/2), 91–128 (1999)
12. Oviatt, S.: Ten myths of multimodal interaction. Communications of the ACM 42(11), 74–81 (1999)
13. Pfleger, N.: FADE – An Integrated Approach to Multimodal Fusion and Discourse Processing. In: Proceedings of the Doctoral Spotlight at ICMI 2005, Trento, Italy (2005)
14. Pieraccini, R., Huerta, J.: Where do we go from here? Research and commercial spoken dialog systems. In: Proceedings of the 6th SIGdial Workshop on Discourse and Dialogue, pp. 1–10 (September 2005)
15. Porta, D., Sonntag, D., Neßelrath, R.: A Multimodal Mobile B2B Dialogue Interface on the iPhone. In: Proceedings of the 4th Workshop on Speech in Mobile and Pervasive Environments (SiMPE 2009) in Conjunction with MobileHCI 2009. ACM, New York (2009)
16. Ramachandran, V.A., Krishnamurthi, I.: NLION: Natural Language Interface for Querying Ontologies. In: COMPUTE 2009: Proceedings of the 2nd Bangalore Annual Compute Conference, pp. 1–4. ACM, New York (2009)
17. Reithinger, N., Fedeler, D., Kumar, A., Lauer, C., Pecourt, E., Romary, L.: MIAMM – A Multimodal Dialogue System Using Haptics. In: van Kuppevelt, J., Dybkjaer, L., Bernsen, N.O. (eds.) Advances in Natural Multimodal Dialogue Systems. Springer, Heidelberg (2005)
18. Reithinger, N., Sonntag, D.: An integration framework for a mobile multimodal dialogue system accessing the Semantic Web. In: Proceedings of INTERSPEECH, Lisbon, Portugal, pp. 841–844 (2005)
19. Seneff, S., Lau, R., Polifroni, J.: Organization, Communication, and Control in the Galaxy-II Conversational System. In: Proceedings of Eurospeech 1999, Budapest, Hungary, pp. 1271–1274 (1999)
20. Sonntag, D.: Towards dialogue-based interactive semantic mediation in the medical domain. In: Proceedings of the Third International Workshop on Ontology Matching (OM), Collocated with the 7th International Semantic Web Conference, ISWC (2008)
21. Sonntag, D.: Ontologies and Adaptivity in Dialogue for Question Answering. AKA and IOS Press, Heidelberg (2010)
22. Sonntag, D., Deru, M., Bergweiler, S.: Design and Implementation of Combined Mobile and Touchscreen-Based Multimodal Web 3.0 Interfaces. In: Proceedings of the International Conference on Artificial Intelligence (ICAI), pp. 974–979 (2009)
23. Sonntag, D., Engel, R., Herzog, G., Pfalzgraf, A., Pfleger, N., Romanelli, M., Reithinger, N.: SmartWeb Handheld—Multimodal Interaction with Ontological Knowledge Bases and Semantic Web Services. In: Huang, T.S., Nijholt, A., Pantic, M., Pentland, A. (eds.) ICMI/IJCAI Workshops 2007. LNCS (LNAI), vol. 4451, pp. 272–295. Springer, Heidelberg (2007)
24. Sonntag, D., Möller, M.: Unifying semantic annotation and querying in biomedical image repositories. In: Proceedings of the International Conference on Knowledge Management and Information Sharing, KMIS (2009)
25. Sonntag, D., Sonnenberg, G., Nesselrath, R., Herzog, G.: Supporting a rapid dialogue engineering process. In: Proceedings of the First International Workshop on Spoken Dialogue Systems Technology, IWSDS (2009)
26. Tran, T., Wang, H., Rudolph, S., Cimiano, P.: Top-k Exploration of Query Candidates for Efficient Keyword Search on Graph-Shaped (RDF) Data. In: ICDE 2009: Proceedings of the 2009 IEEE International Conference on Data Engineering, Washington, DC, USA, pp. 405–416. IEEE Computer Society Press, Los Alamitos (2009)
27. Wahlster, W.: SmartKom: Symmetric Multimodality in an Adaptive and Reusable Dialogue Shell. In: Krahl, R., Günther, D. (eds.) Proceedings of the Human Computer Interaction Status Conference 2003, Berlin, Germany, pp. 47–62. DLR (2003)
28. Weiss, D.L., Langlotz, C.: Structured reporting: Patient care enhancement or productivity nightmare? Radiology 249(3), 739–747 (2008)
Impact of Semantic Web on the Development of Spoken Dialogue Systems Masahiro Araki and Yu Funakura Department of Information Science, Kyoto Institute of Technology, Matsugasaki, Sakyo-ku, 606-8585 Kyoto, Japan [email protected]
Abstract. We examined several possible uses of semantic Web technologies in developing spoken dialogue systems. We report that three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference, have a large impact on various stages in the development of spoken dialogue systems, such as language modeling, semantic analysis, dialogue management, sentence generation, and user modeling. As an example, we implemented a query generation method for semantic search based on semantic Web technology. Keywords: Spoken dialogue system, semantic Web, language model, semantic analysis, dialogue management, sentence generation, user model.
1 Introduction
Semantic Web [1] is expected to be a next-generation Web technology. The semantic Web functions as a "Web of data", as compared to the current HTML-based technology, which is considered a "Web of documents". The advanced features of the semantic Web are the realization of collective intelligence, the smooth integration of distributed knowledge, and automatic inference based on well-established logics. Taking this movement into account, a number of studies on spoken dialogue systems have used semantic Web technology as a knowledge representation (e.g., [2]–[7]). However, a large portion of previous research on the use of the semantic Web in spoken dialogue systems did not make full use of the advantages of the semantic Web. We examined several possible uses of semantic Web technologies in developing spoken dialogue systems. In the present paper, we report that three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference, have a large impact on various stages of development of spoken dialogue systems, such as language modeling, semantic analysis, dialogue management, sentence generation, and user modeling. The remainder of the present paper is organized as follows. Section 2 presents an overview of the basic concept of semantic Web technology. Section 3 surveys previous research on the use of the semantic Web in spoken dialogue systems. Section 4 explains several uses of semantic Web technology in the development of spoken dialogue systems. As an example, Section 5 reports a query generation method for
semantic search based on semantic Web technology. The present paper is concluded in Section 6 with a discussion of future research.
2 Overview of Semantic Web Technology
The goal of semantic Web technology is to realize a "Web of data" [1]. All resources (e.g., people, organizations, documents, events, music, etc.) are represented by Universal Resource Identifiers (URIs). For example, people and organizations can be represented by their Web page URLs. The general concepts (e.g., a location name or a famous event) can be represented by the entry URL of Wikipedia. The web of knowledge is expressed by combining these resources in triplet form, i.e., subject-predicate-object. This knowledge representation method is referred to as the Resource Description Framework (RDF, http://www.w3.org/RDF/). In the RDF, the subject and the predicate are resources. The object can be either a resource or a literal datum. Fig. 1 shows two examples of RDFs, corresponding to the following triples (in the figure, the ellipses indicate resources and the rectangle indicates a literal):

  http://ex.org/rdf.html   foaf:maker  http://ex.org/people#ma .
  http://ex.org/people#ma  foaf:name   "Masahiro Araki" .
Fig. 1. Example of RDF. The upper RDF represents that “a Web page (http://ex.org/rdf.html) is created by a person (http://ex.org/people#ma)”, and the lower RDF represents that “the person’s name is ‘Masahiro Araki’”.
The predicates (foaf:maker and foaf:name) in Fig. 1 come from the Friend Of A Friend (FOAF) vocabulary (http://www.foaf-project.org/), which is an established set of vocabulary (i.e., an ontology) of the semantic Web. This knowledge can be easily embedded in HTML pages. Microformats (http://microformats.org/) and RDFa (http://www.w3.org/TR/xhtml-rdfa-primer/) are promising approaches for standardizing the embedding mechanisms of the RDF in HTML. Therefore, a web crawler can easily recognize these RDFs if such information is embedded in HTML. In addition to FOAF, there are a number of useful ontologies, such as events, reviews, and documents. The simple RDF representation and well-established ontologies are the foundation of the next generation of collective intelligence. The knowledge integration mechanism for distributed sources is another advantage of semantic Web technology. In the most basic case, for example, the two RDFs in Fig. 1 can be merged into a single graph structure by integrating the object node of the upper RDF and the subject node of the lower RDF. Even if these RDFs are distributed to
different knowledge sources, they can be integrated based on the URI identification. In the case of ontology integration, there is no need to unify the predicate name or class hierarchy because several set relations can be represented by the Web Ontology Language (OWL, http://www.w3.org/TR/owl-overview/). For example, the equivalence of two predicates can be represented by the following OWL statement, written in N3 notation (subject, predicate, and object, followed by a period; the OWL statement is itself represented as RDF):

  foaf:maker  owl:equivalentProperty  ex:creator .
This statement indicates that the predicate foaf:maker, which is used in one ontology, is equivalent to the predicate ex:creator, which is used in another ontology. Such URI-based resource identification and ontology-based set relation representation enable easy integration of distributed knowledge. Another advantage of semantic Web technology is automatic inference. OWL is based on description logic. Many OWL data sources provide basic inference functionality, such as set operations and restriction operations. Therefore, the knowledge modeler does not need to write basic inference code in order to implement intelligent systems, such as the back-end system of a spoken dialogue system.
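As a minimal, illustrative sketch (not part of the original paper), the URI-based merge described above can be reproduced with the rdflib library: two graphs from different sources combine into one simply because they mention the same resource URI, and the merged graph then answers a question neither source could answer alone.

# Illustrative sketch using rdflib: merging two distributed RDF graphs.
from rdflib import Graph, URIRef, Literal
from rdflib.namespace import FOAF

g1 = Graph()   # e.g., crawled from one page
g1.add((URIRef("http://ex.org/rdf.html"), FOAF.maker,
        URIRef("http://ex.org/people#ma")))

g2 = Graph()   # e.g., crawled from another source
g2.add((URIRef("http://ex.org/people#ma"), FOAF.name,
        Literal("Masahiro Araki")))

merged = g1 + g2   # union; nodes with identical URIs are the same resource

query = """
    SELECT ?n WHERE {
        ?page <http://xmlns.com/foaf/0.1/maker> ?person .
        ?person <http://xmlns.com/foaf/0.1/name> ?n .
    }
"""
for row in merged.query(query):
    print(row[0])   # -> Masahiro Araki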
3 Previous Research Using Semantic Web in Spoken Dialogue Systems
In this section, we overview previous research using semantic Web technology in spoken dialogue systems. The goal of the Companions project (http://www.companions-project.org/) is to realize intelligent, persistent, and personalized multimodal interfaces to the Internet. The Companions project uses RDF representation and an inference mechanism as back-end knowledge. For example, one system uses family knowledge expressed in the RDF when referencing a user's photo [2]. In [3], Sonntag et al. proposed rapid prototyping of a spoken dialogue system using RDF/OWL as a knowledge representation and SPARQL (Query Language for RDF, http://www.w3.org/TR/rdf-sparql-query/) as a query language. In [4], Heinroth et al. demonstrated how to easily merge domain knowledge written in OWL. A few studies have examined sentence generation from the RDF [5], [6]. It is natural to use a standardized concept representation as the input of the sentence generator of a dialogue system. For dialogue management, we proposed a method of using the frame representation of OWL as knowledge for frame-driven dialogue management [7]. In the proposed method, the dialogue flows from upper concept to lower concept, following the ontology representation, without any domain-specific control rules. In summary, previous studies have revealed several advantages of using the semantic Web in spoken dialogue system development. However, these studies restricted the
application of semantic Web technology to a single element of the dialogue system. As we will show in the next section, there is significant potential to integrate knowledge in spoken dialogue systems using semantic Web.
4 Impact on the Development of Spoken Dialogue Systems
So far, we have described three advanced features of the semantic Web, namely, collective intelligence, smooth integration of knowledge, and automatic inference. These features have a significant impact on various elements of spoken dialogue systems.
4.1 Language Model
There is the possibility of automatically acquiring a word dictionary using the semantic Web. As shown in Section 2, semantic tagging using Microformats has already been attached to several Web pages. Combining the class information, such as foaf:Person, foaf:Group, and foaf:Organization, with the predicate information, e.g., foaf:name, we can easily construct categorical dictionaries in grammar-based language models from scratch (a sketch of this idea is given after Sect. 4.3 below). In addition, the categorical dictionaries can be automatically updated by adding new words.
4.2 Semantic Analysis
The RDF can be read as a subject, predicate, and object (SPO) sentence structure in English. If the user input to the spoken dialogue system is a statement, and if we can map the input entities (e.g., a person's name or an organization name) to the resources represented by a URI and their relation to the predicate resource (e.g., belongs to), then the user input can be represented by the RDF and can easily be stored in the RDF data store as a dialogue history. On the other hand, if the user input is a question, then the input can be converted to SPARQL. An example of converting spoken input to SPARQL is shown in Section 5.
4.3 Dialogue Management
We previously proposed a frame description method using OWL for our frame-driven dialogue system [7]. There are several advantages of using OWL as a knowledge representation for dialogue management. First, the AND-OR task structure can be represented using rdf:Seq and rdf:Alt. Second, OWL can be used as a dialogue flow controller. The dialogue begins at the top-level concept and continues by going down the concept hierarchy, specifying values for lower-level slots. The transition between concepts is made by dialogue patterns. The next possible dialogue act can be obtained from the knowledge of the adjacency pair. If the precondition of a dialogue act is fulfilled, this act becomes a member of the next possible acts of the system. Some of this dialogue pattern knowledge is task-independent. In addition, such basic inference is realized in a general OWL reasoner, so the amount of task-dependent knowledge that the developer must record decreases.
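As an illustration of the dictionary acquisition idea from Sect. 4.1, the sketch below collects the names of all foaf:Person resources in an RDF store into a word class for a grammar-based language model. It is illustrative only; rdflib is used for brevity, and the data file name is a hypothetical placeholder.

# Illustrative sketch for Sect. 4.1: build a categorical ASR dictionary
# ($PERSON class) from semantically tagged Web data.
from rdflib import Graph

g = Graph()
g.parse("crawled_foaf_data.ttl", format="turtle")   # placeholder input

person_class = set()
q = """
    PREFIX foaf: <http://xmlns.com/foaf/0.1/>
    SELECT ?n WHERE { ?x a foaf:Person . ?x foaf:name ?n . }
"""
for row in g.query(q):
    person_class.add(str(row[0]))

# person_class can now populate a $PERSON category in a grammar-based LM;
# re-running the crawl and query updates the dictionary automatically.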
4.4 User Modeling
A number of spoken dialogue systems are assumed to be used by a small number of people per system. For example, a personal robot can be used in a household (ordinarily, one to five people), whereas a mobile phone can be used by only one person. In such situations, the adaptability of the system to each user is very important. In order to realize user adaptability, a representation of the user model information and an update mechanism for the attribute values in the user model are necessary. If the representation and update interface are standardized, then this information can be used across various systems. Generally, in earlier user-adaptable interaction systems, the user model representation and its update procedure were embedded in the system code. Therefore, it was difficult to port one user-modeling module to another system. User modeling based on semantic Web technology is a promising approach for dealing with the portability problem of user information (see, e.g., [8]).
5 Query Analysis on Semantic Search
In information retrieval, a natural language utterance is suitable for directly expressing the intention of a user. However, in order to develop a natural language interface, it is necessary to understand the meaning of words and to distinguish words that should be treated as search conditions from words that should be disregarded. Therefore, in order to perform information retrieval based on semantic analysis, we present a method of generating query commands from natural language utterances. Since we use the RDF as a knowledge source, the target query representation is SPARQL. An example of SPARQL is shown in Fig. 2:

  PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
  PREFIX j.0: <...>
  SELECT ?name WHERE {
    ?x j.0:name ?name .
    ?x j.0:address ?a .
    FILTER regex(?a, "Kyoto") .
    ?x j.0:class ?c .
    FILTER regex(?c, "hotel") .
  }
  LIMIT 10
Fig. 2. Example of SPARQL. The input is “Kyouto (Kyoto) ni aru (located) hoteru (hotel) wo sagashite (Search).” (Search hotels located in Kyoto.).
Specifically, we gathered functional expressions from a large quantity of question sentence data on a Web Q&A site and developed a semi-task-dependent functional expression dictionary (61,429 entries) that maps Japanese functional expressions to search-condition representations in SPARQL. We obtained a SPARQL generation accuracy of 42% in the task of location-related search (among 33 inputs). For example, the dictionary entry for the word "located" (in Fig. 2) is shown in Fig. 3.
(Fig. 3 content: an RDF graph linking the functional expression "ni aru (located)", via role, property, and task-property nodes, to the target property "address" and to the SPARQL rule fragment ?x proN ?place . FILTER regex(?place, "condition") . in the namespace http://ii-lab.is.kit.jp/relationship#.)
Fig. 3. Example of a semantic dictionary (functional words: “ni aru (located)”)
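A highly simplified sketch of this generation step is given below: the functional expression selects the property to constrain and takes its filler from the preceding content word, while other content words contribute their own conditions. The dictionary contents here are invented stand-ins for the 61,429-entry resource described above, not its actual entries.

# Simplified, hypothetical sketch of dictionary-based SPARQL generation.
DICT = {
    "located": '?x j.0:address ?a . FILTER regex(?a, "{val}") .',
    "hotel":   '?x j.0:class ?c . FILTER regex(?c, "hotel") .',
}

def generate_sparql(tokens):
    clauses = ["?x j.0:name ?name ."]
    for i, tok in enumerate(tokens):
        if tok == "located" and i > 0:          # previous token fills the slot
            clauses.append(DICT["located"].format(val=tokens[i - 1]))
        elif tok in DICT and tok != "located":
            clauses.append(DICT[tok])
    return "SELECT ?name WHERE { " + " ".join(clauses) + " } LIMIT 10"

# reproduces the structure of the query in Fig. 2:
print(generate_sparql(["Kyoto", "located", "hotel", "search"]))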
6 Conclusion
We examined several possible uses of semantic Web technologies and reported that semantic Web technology has a significant impact on various stages in the development of spoken dialogue systems. As an example, we implemented a query generation method for semantic search based on semantic Web technology. In the future, we intend to implement a fully integrated semantic-Web-based spoken dialogue system and examine its efficiency in the development of intelligent systems.
References
1. Berners-Lee, T., Hendler, J., Lassila, O.: The Semantic Web. Scientific American (May 2001)
2. Field, D., Catizone, R., Cheng, W., Dingli, A., Worgan, S., Ye, L., Wilks, Y.: The Senior Companion: a Semantic Web Dialogue System. In: Proceedings of the 8th International Conference on Autonomous Agents and Multiagent Systems, vol. 2, pp. 1383–1384 (2009)
3. Sonntag, D., Neßelrath, R., Sonnenberg, G., Herzog, G.: Supporting a Rapid Dialogue System Engineering Process. In: Proceedings of the 1st IWSDS (2009)
4. Heinroth, T., Denich, D., Bertrand, G.: Ontology-based Spoken Dialogue Modelling. In: Proceedings of the 1st IWSDS (2009)
5. Sun, X., Mellish, C.: Domain Independent Sentence Generation from RDF Representations for the Semantic Web. In: Proceedings of the Eleventh European Workshop on Natural Language Generation, pp. 105–108 (2007)
6. Wilcock, G.: Talking OWLs: Towards an Ontology Verbalizer. In: Fensel, D., Sycara, K., Mylopoulos, J. (eds.) ISWC 2003. LNCS, vol. 2870, pp. 109–112. Springer, Heidelberg (2003)
7. Araki, M.: OWL-based Frame Description for Spoken Dialog Systems. In: Proc. International Semantic Web Foundations and Application Technology (SWFAT) Workshop (2003)
8. Fonseca, J.M.C., et al.: Model-Based UI XG Final Report (2010), http://www.w3.org/2005/Incubator/model-based-ui/XGR-mbui/
A User Model to Predict User Satisfaction with Spoken Dialog Systems Klaus-Peter Engelbrecht and Sebastian Möller Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany {Klaus-Peter.Engelbrecht,Sebastian.Moeller}@telekom.de
Abstract. In order to predict interactions of users with spoken dialog systems and their ratings of the interaction, we propose to model basic needs of the user which impact her emotional state. In defining the model, we follow the PSI theory by Dörner [1] and identify Competence and Certainty as relevant needs in this context. By analysis of questionnaires, we show that such needs impact the user's overall opinion of the system. Furthermore, relations to interaction parameters are analyzed. Keywords: PARADISE, user model, evaluation, spoken dialog system.
1 Introduction
Automatic evaluation of interactive systems can save time and costs of testing, and thus for the entire development of the system [2]. Alternatively, ongoing research deals with automatically trained systems which adapt their behavior to the users [3][4]. In both cases, user models are used to simulate interactions with the system before real users are confronted with it. While there are pros and cons for either way of designing a system, both approaches are heavily dependent on criteria to determine the quality of the observed interaction as it would be perceived by the user [5][6]. While perceived quality is typically measured using a questionnaire, in simulated interactions this is not possible. However, as proposed by the PARADISE framework [7], we can try to predict the rating a user would assign to an interaction from a description of the interaction in the form of interaction parameters. The accuracy of such predictions can be sufficient to predict differences between mean ratings of systems or system configurations [8]. However, the relation between the interaction parameters and the user judgments can differ between users, leading to much lower accuracy when predicting judgments for individual dialogs [8]. These differences can be attributed to user characteristics such as affinity towards technology or attitude towards SDSs to some degree [9]; however, a substantial part of the variance in the judgments still cannot be explained. Thus, a deeper study of what happens inside a user remains an interesting research question. In [10] we proposed an integrated user model covering the interaction and the prediction of the corresponding judgment, by partly drawing on the same resources, such
as the task goal and the task model. Such a model may work with “dynamic user attributes”, i.e. parameters of the user which change over the course of an interaction, as it has been proposed for the MeMo workbench [11]. In this paper, we propose a set of such attributes drawing on PSI theory [1]. This theory tries to explain motivated and emotional behavior of agents on the basis of needs and psychological parameters. By analysis of questionnaires and interaction data, we evaluate the applicability of this approach to interactions with spoken dialog systems (SDSs). Thus, our work should be seen as exploratory in nature.
2 Proposed Model
According to Dörner [1], the behavior of an agent is modulated by a set of needs. Dörner mentions needs for hunger, thirst, physical well-being, and affiliation, as well as more fundamental needs for Competence and Certainty. Competence can be described as the feeling of being able to manipulate the world (or a device) as desired, while Certainty reflects the reliability of the environment (or some device), i.e., how foreseeable it is. The latter two parameters directly impact the emotional behavior of the agent by modulating behavioral parameters such as arousal, attention (resolution level of perception), flight tendency vs. aggression, and the selection threshold (how easily the agent changes the current goal). The default settings of these parameters are user-specific, so that the same environment can cause different reactions from different agents. In Human-Computer Interaction (HCI), the former needs, describing the agent's basic urges, seem to be of minor importance, as they mainly impact goal selection, and in HCI we typically consider scenarios with known (i.e., fixed) goals. However, Competence and Certainty might be of interest, as they are related to the emotions and may thus help to predict the emotional response to a system. The emotional response can then be expected to impact the user behavior as well as the subjective evaluation of the system and may thus explain some of the inter-rater differences. Figure 1 shows how these needs could be integrated into a prediction model for user satisfaction. Displayed are the relations between different variables in one time slice. We display measurable parameters as white ovals, while latent variables are shaded. The target variable, which is also hidden, is displayed in black. Relations between parameters are indicated by arrows, where thick arrows indicate that the relation is deterministic. Note that "emotions" is a placeholder for a sub-system composed of the above-mentioned parameters. Not all of these parameters might be crucial to the interaction and judgment process. Functions describing the relation between Competence, Certainty, and emotional parameters can be found in [12]. Interaction parameters and audio parameters of the system turns can be measured directly in the interaction. Note that audio parameters of the user utterances can help to determine the current values of the latent variables. In the next time slice (not displayed), the actions of the user and her expectations will be influenced by the current emotional parameters, and thus indirectly by the underlying needs for Competence and Certainty. In the remainder of this paper, we will analyze a database of interactions with SDSs with respect to these relations.
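Before turning to the data, the sketch below illustrates, in deliberately simplified form, how Competence and Certainty could be maintained as dynamic user attributes over the dialogue. The update rules and constants are invented for illustration and are not Dörner's actual functions; see [12] for those.

# Illustrative only: two dynamic user attributes updated per dialogue turn.
class UserState:
    def __init__(self, competence=0.5, certainty=0.5):
        self.competence = competence   # "can I manipulate the system as desired?"
        self.certainty = certainty     # "does the system behave foreseeably?"

    def update(self, understood, expected):
        # success raises Competence, failure lowers it
        self.competence += 0.1 if understood else -0.15
        # expectation-conform system behavior raises Certainty
        self.certainty += 0.1 if expected else -0.2
        self.competence = min(max(self.competence, 0.0), 1.0)
        self.certainty = min(max(self.certainty, 0.0), 1.0)

state = UserState()
state.update(understood=False, expected=False)   # e.g., an incorrectly parsed turn
print(state.competence, state.certainty)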
Fig. 1. Proposed model with Competence and Certainty as the central parameters modulating user actions and the emotional apparatus of the user
3 Database
The database contains dialogs with 3 commercial SDSs over a telephone line. All three systems provide information on public transport connections, however, for different areas of Germany. The systems differ considerably in their dialog strategy and the voices used for the prompts. One of the systems did not allow barge-in, while the others did most of the time. Nine young (M=22.4, SD=2.8, 4 males) and 7 older (M=53.7, SD=7.8, 4 males) users took part in the experiment. Each user performed a dialog with each of the 3 systems, and judged each system on a full questionnaire designed in accordance with ITU-T Rec. P.851 [13]. The calls were recorded and transcribed. Interaction parameters could be read from the transcriptions or were annotated afterwards on the basis of the transcriptions. They cover:
• Efficiency (#Turns, #Turns till solution, #constraints/utterance, words per system turn)
• Understanding errors (parsing correct, partially correct, failed, or incorrect (PA:CO, PA:PA, PA:FA, PA:IC))
• Interaction style (user speech acts (e.g., negate, no_input, rephrase), words per user turn, confirmation type (explicit, implicit, none), system help, successful and failed barge-in attempts, time-outs)
• Task success
4 Relation of Competence and Certainty to User Satisfaction
The proposed model can only make sense if we can observe some variance of Competence and Certainty within our dialog corpus. This can be analyzed most easily using the questionnaires the users filled out after the interaction. While we did not ask for Competence or Certainty directly, the questionnaire, having 37 items, covers a wide variety of aspects of the interaction, including such statements as "The system reacted as expected" or "I felt in control of the interaction". Such items are closely related to the targeted concepts and may thus be indicative of the desired relations. In order to simplify the analysis, we first reduced the dimensionality of the data using factor analysis. Before this, we excluded all items asking for an overall evaluation of the system (e.g., "I am satisfied with the system", "I would use the system again"). The remaining questions all deal with more specific aspects of the interaction. We then performed several factor analyses, using different numbers of resulting factors or the eigenvalues > 1 criterion, in order to obtain a clearly interpretable solution from the small number of cases. The factor solution yielding the best interpretable result has 5 factors, which explain 72% of the variance in the data. One of the factors could not be named as it contained only 2 very different items. The other factors roughly describe the Controllability of the system, the Cognitive Effort involved in the interaction, the Naturalness and the Symmetry of the interaction. Most items load on the first two factors. These factors also include all of the items which are semantically similar to the concepts of Competence and Certainty. Thus, we are particularly interested in these dimensions for this study. We store the factor scores after Varimax rotation for later analysis. In order to obtain a reliable and valid measure of the Overall Quality of the system, we also performed a factor analysis on the items related to the overall evaluation of the system. All items load on the same factor, suggesting that they measure the same, and only one, construct. A scale reliability analysis using Cronbach's alpha supports this assumption (alpha = 0.83). Thus, we feel confident to use the resulting factor score as a measure of the Overall Quality of the system. We then analyzed the impact of the five factors capturing specific quality aspects on the Overall Quality, using linear regression with stepwise inclusion. The stepwise algorithm selected all factors except the unnamed one for inclusion in the model, where Controllability and Cognitive Effort are selected first and show the highest Beta coefficients. The accuracy of the model is characterized by R2 = 0.87 and a standard error of 0.38. These results indicate that Competence and Certainty do vary within the database, and that they have a considerable impact on the overall evaluation of the system. However, it is not yet clear how the concepts Competence and Certainty relate to the factors we extracted from the questionnaires and named Controllability and Cognitive Effort. In a next step, we therefore evaluated the relations of these factors to the interaction parameters and the measured user characteristics, in order to see if they modulate the user behavior in conformance with our assumptions about Competence and Certainty.
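For orientation, the pipeline just described can be sketched as follows. This is not the authors' code: it uses plain least squares over all five factor scores rather than stepwise inclusion (which scikit-learn does not provide), and the item column names are hypothetical placeholders.

# Illustrative sketch of the analysis pipeline: factor analysis over
# questionnaire items, then regression of Overall Quality on factor scores.
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.linear_model import LinearRegression

ratings = pd.read_csv("questionnaire.csv")   # hypothetical data file
aspect_items = [c for c in ratings.columns if not c.startswith("overall_")]
overall_items = [c for c in ratings.columns if c.startswith("overall_")]

# Five factors for the specific aspects, with Varimax rotation as in the paper
# (rotation support requires a recent scikit-learn version).
fa = FactorAnalysis(n_components=5, rotation="varimax")
aspect_scores = fa.fit_transform(ratings[aspect_items])

# One factor for Overall Quality (all overall items load on one construct).
oq = FactorAnalysis(n_components=1).fit_transform(ratings[overall_items])

reg = LinearRegression().fit(aspect_scores, oq.ravel())
print("R^2 =", reg.score(aspect_scores, oq.ravel()))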
5 Relation of Competence and Certainty to Interaction Behavior
Of the 30 interaction parameters measured in the data, 10 are significantly correlated with Overall Quality, 8 with Controllability, and 6 with Cognitive Effort. While this statistic suggests that the correlation with Overall Quality determines how strongly a factor is related to the interaction parameters, there are some parameters which are correlated with just one of the more specific factors, or which are more strongly correlated with the specific factor than with Overall Quality. In the following, we restrict ourselves to an analysis of these parameters. Table 1 shows an overview of these cases. According to the table, the system was perceived as more controllable if there was at least one very long system prompt (WPST_max) or if there were fewer utterances incorrectly understood. The latter is directly linked to the number of negations (the user disconfirmed what the system understood). In addition, Controllability is related to no-inputs and task success. The parameters which are correlated mainly with Cognitive Effort include a number of insignificant "trends". Note that we measure two-tailed significance, as for some parameters the direction of the correlation could not be anticipated, while otherwise 1-tailed significance would be sufficient. Thus, the correlations with #Turns till solution, successful barge-ins, and system help can be considered reliable, as the effect is observed as it was anticipated. The other parameters related to this factor are the number of rephrasings the user did with her utterances, the number of explicit confirmations, and the question whether the user had ever used an SDS. In summary, most of these correlations would also be expected with Competence and Certainty. However, these concepts cannot be straightforwardly assigned to one of the factors. Most importantly, Controllability resembles Certainty given the correlated parameters, but includes task success, which should rather be correlated with Competence. Cognitive Effort, in turn, seems to be related to Competence, but we miss correlations with task success, #constraints/utterance, or the absolute number of barge-in attempts, to name some examples.

Table 1. Correlations (Pearson's r) between interaction parameters and the Overall Quality (OQ), Controllability, and Cognitive Effort scales. Asterisks indicate that a correlation is significant (*) or highly significant (**); two-tailed.

Parameter                  | OQ      | Controllability | Cogn. Eff.
Prior experience with SDSs | 0.27    | 0.16            | -0.43**
#Turns till solution       | -0.15   | -0.18           | 0.26 (p=0.089)
WPST_max                   | 0.43**  | 0.46**          | -0.28
PA:IC                      | -0.25   | -0.32*          | 0.01
Succ. barge-ins            | 0.10    | -0.00           | -0.26 (p=0.094)
USA:negate                 | -0.35*  | -0.37*          | 0.21
USA:no_input               | -0.43** | -0.45**         | 0.32*
USA:rephrase               | -0.09   | -0.10           | 0.29 (p=0.056)
Explicit confirm.          | -0.27   | -0.21           | 0.325*
System help                | 0.12    | -0.03           | -0.27 (p=0.074)
Task success               | 0.46**  | 0.50**          | -0.16
6 Conclusion
We presented a model of a user interacting with and judging an SDS. The model should be capable of explaining emotional behavior. In order to verify that such a model is useful for the simulation of interactions and corresponding judgments, we tried to identify the involved variables in our data, and analyzed how they are related to parameters describing the interactions with the system. By analysis of questionnaires, we found two factors which are related to Competence and Certainty. These factors show the expected correlations with interaction parameters to some degree, however not as much as expected. In particular, Competence and Certainty seem to be limited with respect to the prediction of variants in user actions. Our results allow predictions about no-inputs and rephrasing probability when an utterance is to be repeated. Other correlated parameters seem to be predictors of the factors rather than being influenced by them. Future work will have to analyze such relations on a turn level.
References
1. Dörner, D.: Die Mechanik des Seelenwagens. Eine neuronale Theorie der Handlungsregulation. 1. Auflage. Verlag Hans Huber, Bern (2002)
2. Kieras, D.E.: Model-based Evaluation. In: Jacko, J., Sears, A. (eds.) The Human-Computer Interaction Handbook, pp. 1191–1208. Erlbaum, Mahwah (2003)
3. Levin, E., Pieraccini, R., Eckert, W.: Using Markov decision process for learning dialogue strategies. In: ICASSP 1998, Seattle, pp. 201–204 (1998)
4. Pietquin, O.: A framework for unsupervised learning of dialogue strategies. PhD thesis, Faculty of Engineering, Mons (TCTS Lab), Belgium (2004)
5. Ai, H., Weng, F.: User Simulation as Testing for Spoken Dialog Systems. In: 9th SIGdial Workshop on Discourse and Dialogue, Columbus, OH (2008)
6. Rieser, V., Lemon, O.: Automatic Learning and Evaluation of User-Centered Objective Functions for Dialogue System Optimisation. In: LREC 2008, Marrakech, Morocco, pp. 2356–2361 (2008)
7. Walker, M., Litman, D., Kamm, C., Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In: ACL/EACL 35th Ann. Meeting of the Assoc. for Computational Linguistics, Madrid, pp. 271–280 (1997)
8. Möller, S., Engelbrecht, K.-P., Schleicher, R.: Predicting the Quality and Usability of Spoken Dialogue Services. Speech Communication 50, 730–744 (2008)
9. Engelbrecht, K.-P., Hartard, F., Gödde, F., Möller, S.: A Closer Look at Quality Judgments of Spoken Dialog Systems. In: Interspeech 2009, Brighton (2009)
10. Möller, S., Engelbrecht, K.-P.: Towards a Perception-based Evaluation Model for Spoken Dialogue Systems. In: 4th IEEE Tutorial and Research Workshop PIT, Kloster Irsee (2008)
11. Engelbrecht, K.-P., Quade, M., Möller, S.: Analysis of a New Simulation Approach to Dialogue System Evaluation. Speech Communication 51, 1234–1252 (2009)
12. Dörner, D., Gerdes, J., Mayer, M., Misra, S.: A Simulation of Cognitive and Emotional Effects of Overcrowding. In: ICCM 2006, pp. 92–99 (2006)
13. ITU-T Rec. P.851: Subjective Quality Evaluation of Telephone Services Based on Spoken Dialogue Systems. International Telecommunication Union, Geneva (2003)
Sequence-Based Pronunciation Modeling Using a Noisy-Channel Approach Hansjörg Hofmann1,2, Sakriani Sakti1, Ryosuke Isotani1, Hisashi Kawai1, Satoshi Nakamura1, and Wolfgang Minker2
1 National Institute of Information and Communications Technology, Japan {hansjoerg.hofmann,sakriani.sakti,ryosuke.isotani,hisashi.kawai,satoshi.nakamura}@nict.go.jp 2 University of Ulm, Germany [email protected]
Abstract. Previous approaches to spontaneous speech recognition address the multiple pronunciation problem by modeling the alteration of the pronunciation on a phoneme-to-phoneme level. However, the phonetic transformation effects induced by the pronunciation of the whole sentence have not been considered yet. In this paper we attempt to model the sequence-based pronunciation variation using a noisy-channel approach, where the spontaneous phoneme sequence is considered a "noisy" string and the goal is to recover the "clean" string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes are taken into consideration. Moreover, the system not only learns the phoneme transformation but also the mapping from the phonemes to the words directly. In this preliminary study, the phonemes are first recognized with the present recognition system, and afterwards the pronunciation variation model based on the noisy-channel approach maps from the phoneme to the word level. Our experiments use Switchboard as the spontaneous speech corpus. The results show that the proposed method improves the word accuracy consistently over the conventional recognition system. The best system achieves up to 38.9% relative improvement over the baseline speech recognition. Keywords: Spontaneous speech recognition, pronunciation variation, noisy-channel model, statistical machine translation.
1 Introduction
Nowadays pervasive computing gains more and more importance and is already partly integrated into our daily lives. As speech is the most common and convenient way to communicate among humans, spoken language dialog systems (SLDS) are highly in demand. State-of-the-art automatic speech recognition (ASR) systems perform satisfactorily in closed environments. If an SLDS which is integrated into everyday life shall support the user, it has to handle more relaxed constraints that allow the user to speak spontaneously. Spontaneous or conversational speech differs severely from accurately read speech, and so the recognition
rates decrease [14]. In natural conversations people pronounce differently, tend to combine or even miss words out. Discourse particles (e.g., like) or hesitation sounds (e.g., ahm) are used to structure the sentence and have no semantic meaning. However, Riley et al. [17] consider multiple pronunciation variants as the main problem of spontaneous ASR which has to be faced. Several approaches have been proposed to resolve the multiple pronunciation problem. One attempt is to extend the dictionary manually with further pronunciation variants or to improve it by applying rule-based algorithms. Nevertheless, both approaches are very time-consuming and the latter needs a significant amount of expert knowledge. Other researchers apply data-driven approaches to model the alteration of the pronunciation on a phoneme-to-phoneme level. Decision-tree-based approaches have been applied by Bates et al. [2] and improved the ASR performance. Chen et al. [4] examine the effect of prosody on pronunciation and propose to use artificial neural networks (ANN) to model pronunciation variation. Livescu et al. [11] propose a feature-based pronunciation model based on a dynamic Bayesian network (BN). A report by Sakti et al. [18] also uses a BN technique to model the variation of the base form and the surface form of the phoneme. After applying the Bayesian network, a small performance improvement of the ASR was gained. As the realization of the current phone does not depend only on neighboring phones, the observation window should be extended. Fosler-Lussier [6] proposes to take syllabification into consideration and investigates decision-tree models based on syllables, but the word error rate increases slightly. This may be because the phonetic transformation effects induced by the pronunciation of the whole sentence are not considered yet. In this paper we attempt to model the sequence-based pronunciation variation using a noisy-channel approach where the spontaneous phoneme sequence is considered a "noisy" string and the goal is to recover the "clean" string of the word sequence. Hereby, the whole word sequence and its effect on the alteration of the phonemes will be taken into consideration. Moreover, the system not only learns the phoneme transformation but also the mapping from the phoneme to the word directly. In this preliminary study, first the phonemes will be recognized with the present recognition system and afterwards the proposed pronunciation model will map from the phoneme to the word level.
2 Noisy-Channel Approach
The noisy-channel model used to translate the recognized phonemes into words is adopted from statistical machine translation (SMT). Given a sequence of units in a source language, the SMT translates the units into a specified target language. In this case the source language consists of phonemic units and the output language consists of words. The SMT computes the most probable word sequence w given an input phoneme sequence p by solving the following maximum likelihood estimation, which follows from Bayes' rule since P(w|p) ∝ P(p|w) · P(w):

  ŵ = argmax_w P(p|w) · P(w)    (1)
Here, P(w) represents the probability of the word sequence w provided by the language model (LM) of the target language. P(p|w) denotes the likelihood of the phoneme sequence p given the word sequence w; it represents the transition from the phonemic to the word representation and is computed by the translation model [3]. The structure of the SMT system is shown in Figure 1:
Fig. 1. Basics of SMT framework
The SMT system is trained with parallel matching pairs of text data from the input and the output language. While testing the translation system, the SMT evaluates each proposed hypothesis by assigning a score according to the statistical model probabilities. During the translation process all possible hypotheses are considered, and finally the path with the highest score is chosen as the result. Our SMT system is based on a log-linear framework [12], uses a trigram LM, and applies phrase-based SMT techniques [10]. The statistical models of our system were trained with special toolkits for language modeling [19] and word alignment [13], and a lexicalized distortion model was also applied [1]. The translation process is performed by a tool called CleopATRa [5], which is a multi-stack phrase-based SMT decoder.
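The decision rule of Eq. (1) can be sketched as a simple hypothesis-rescoring loop in log space; the two log-probability functions here are placeholders standing in for the trained translation and language models, not the actual decoder internals.

# Sketch of the noisy-channel decision rule in Eq. (1), in log space.
import math

def decode(phonemes, candidate_words, tm_logprob, lm_logprob):
    """Return the word sequence w maximising log P(p|w) + log P(w)."""
    best, best_score = None, -math.inf
    for w in candidate_words:                  # hypotheses from the decoder stack
        score = tm_logprob(phonemes, w) + lm_logprob(w)
        if score > best_score:
            best, best_score = w, score
    return best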
3 Experimental Evaluation

3.1 Data Corpora
Read and spontaneous speech data corpora were used in this experiment (see Table 1). The Wall Street Journal speech corpus (WSJ0 and WSJ1) [16] contains read speech recorded from English speakers who read newspaper paragraphs. The training set consists of 60 hours of speech, and the so-called WSJ test set consists of 215 utterances over a 5K-word dictionary (Hub 2) [15]. The spontaneous speech data were obtained from two subsets of the Switchboard corpus [7], which consists of spontaneous telephone conversations and contains a significant amount of pronunciation variability [18]. The first subset has been phonetically hand-transcribed by ICSI, consists of 4 hours (5,117 utterances) of
data and is used for modeling the pronunciation variation. The second is SVitchboard 1 (the Small Vocabulary Switchboard Database) [9], which consists of statistically selected utterances from the whole Switchboard corpus. This data set has been segmented into small-vocabulary tasks ranging from 10 words up to 500 words, and each segmentation has been further partitioned into five subsets (A-E). In this preliminary study we use three vocabulary sizes of SVitchboard 1: 50, 250, and 500 words. Some of the words have more than 10 different pronunciation variants; an example of the pronunciation variation of the vocabulary is given below:

and: /ae eh n/, /ae eh n d/, /ae n/, /ae n d/, /ah n/, /ah n d/, etc.

For each vocabulary size, 200 utterances of at least three words were randomly selected from subsets A and B and are used for evaluation.

Table 1. Speech data composition of this experiment

Speech Type   Data Set                       Hours  Words    Usage
Read          WSJ Training Set               60     4989     Baseline ASR Train
              WSJ Test Set                   0.2    4989     Baseline ASR Test
Spontaneous   Hand-Transcribed Switchboard   4      3843     AM Adapt, LM & SMT Train
              SVitchboard1 C&D&E             4      50       AM Adapt, LM Train
              SVitchboard1 A&B               0.2    50-500   ASR Test, SMT Test
3.2 Baseline ASR System
The triphone HMM acoustic model (AM) was trained with the WSJ corpus described above. As the spontaneous speech data have a sampling rate of 8 kHz, the WSJ data were downsampled from 16 kHz to 8 kHz. A 20-ms Hamming window with a frame shift of 10 ms and 25-dimensional feature vectors were used; the 25 features consist of 12th-order MFCCs, delta MFCCs, and log power. Initially, each phoneme is modeled by a 3-state HMM. By applying a successive state splitting (SSS) algorithm based on the minimum description length (MDL), the optimum state-level HMnet is obtained; further information about the MDL-SSS algorithm can be found in [8]. We built four AMs with different Gaussian mixture numbers: 5, 10, 15, and 20 mixtures. Each AM has a total of 1,903 states. The performance of the baseline on read speech and its degradation on spontaneous speech are shown in Table 2.

Table 2. Recognition accuracy (%) of baseline testing

                                5mix  10mix  15mix  20mix
WSJ Test Set                    88.7  89.1   89.2   90.8
SVitchboard Test Set 50 words   37.5  40.7   39.4   41.8
SVitchboard Test Set 250 words  29.5  34.1   30.3   33.2
SVitchboard Test Set 500 words  30.1  32.1   32.1   34.9
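For illustration, a comparable 25-dimensional front-end (12 MFCCs, their deltas, and log frame energy over 20-ms Hamming windows with a 10-ms shift at 8 kHz) could be computed as below; librosa is assumed here purely for convenience and is not the toolkit used in the paper:

```python
import numpy as np
import librosa

def front_end(wav_path):
    """25-dim features per frame: 12 MFCCs + 12 delta MFCCs + log power."""
    y, sr = librosa.load(wav_path, sr=8000)       # 8 kHz, as in the experiments
    n_fft, hop = 160, 80                          # 20-ms window, 10-ms shift
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12, n_fft=n_fft,
                                hop_length=hop, window="hamming")
    delta = librosa.feature.delta(mfcc)           # first-order derivatives
    rms = librosa.feature.rms(y=y, frame_length=n_fft, hop_length=hop)
    log_power = np.log(rms**2 + 1e-10)            # log frame energy
    return np.vstack([mfcc, delta, log_power]).T  # shape: (frames, 25)
```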
3.3 Proposed ASR System
Since the amount of spontaneous speech data in SVitchboard 1 is very limited, we adapt our baseline AM to the conversational speech data described in Table 1 using standard maximum a posteriori (MAP) adaptation. The SMT was trained on Switchboard with phonemes as the source and words as the target; here we use dictionary-based canonical phoneme sequences and hand-labelled surface phoneme sequences, which yield 10k utterances in total. Given the correct phoneme sequences of the test list, our SMT reaches up to 99% accuracy, and the performance decreases only slightly when the word range is increased from 50 to 500 words (see Figure 2a)). Due to time constraints, the 50-word range is used for the further experiments. Next, the proposed model is applied after the ASR is conducted. The ASR outputs the most likely words and the corresponding phoneme sequence, but for further processing only the phoneme strings are used. Given the first-best path of the ASR output (Adapt+SMT (1best)), a relative improvement of 19.5% was achieved (see Figure 2b)). However, only the single best ASR result is considered and translated to the word level here. By keeping the whole lattice of the speech recognition result, the system can be further improved: we apply the SMT to the n-best (10best, 50best) lists generated from the ASR, which contain unique word sequences. Figure 2b) shows the optimum results (Adapt+SMT (10best), Adapt+SMT (50best)) given the SMT output; this yields a relative improvement of 9.0% over "Adapt+SMT (1best)". Higher n-best orders still improve the accuracy but converge to a saturation level. To further improve the performance, we assess the reliability of the ASR output using the generalized utterance posterior probability (GUPP) [20] approach: we enumerate different thresholds and send only utterances with reliability values below the threshold to the SMT.
[Figure 2 contains two panels: (a) word accuracy (%) of the SMT over the word ranges 50, 250, and 500; (b) word accuracy (%) over 5, 10, 15, and 20 Gaussian mixtures for Baseline (WSJ), Adapt + SMT (1best), Adapt + SMT (10best), Adapt + SMT (50best), Adapt + SMT GUPP (50best), and Adapt + SMT UpperBound (50best).]
Fig. 2. a) Performance of SMT given the correct phoneme sequence. b) Improvement of spontaneous accuracy results by applying the proposed model.
Figure 2b) shows only the optimum GUPP results (Adapt+SMT GUPP (50best)); the best system achieves up to 53.6% accuracy. Additionally, Figure 2b) shows the upper bound of our proposed system, obtained when only those utterances that the system can improve are sent to the SMT. As can be seen (Adapt+SMT UpperBound (50best)), a relative improvement of 20.4% over "Adapt+SMT (1best)" and up to 58% accuracy can be achieved. For comparison, using Bayesian network approaches, Sakti et al. [18] achieve 56.3% and Livescu et al. [11] achieve 57.3% word accuracy; however, these results cannot be compared directly, as the tasks differ.
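A minimal sketch of the n-best rescoring with GUPP-based gating might look as follows; `asr`, `smt_decode`, and `gupp` are hypothetical stand-ins for the recognizer, the phoneme-to-word SMT, and the confidence estimator, and the threshold value is purely illustrative:

```python
def recognize(utterance, asr, smt_decode, gupp, n=50, threshold=0.9):
    """Send only low-confidence utterances through the phoneme-to-word SMT."""
    words, phoneme_nbest = asr(utterance, nbest=n)
    if gupp(utterance, words) >= threshold:
        return words                              # reliable: keep ASR hypothesis
    # Translate each n-best phoneme string; smt_decode returns (words, score).
    rescored = [smt_decode(phonemes) for phonemes in phoneme_nbest]
    best_words, _ = max(rescored, key=lambda pair: pair[1])
    return best_words
```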
4 Conclusions
We have proposed a sequence-based pronunciation model for spontaneous ASR that applies a noisy-channel approach based on SMT. First, we adapted our baseline AM with spontaneous speech data. After recognizing the phonemes with our ASR engine, we used the SMT system to map from the phoneme to the word level. The results show that the proposed method improves the word accuracy consistently over the conventional recognition system: the best system achieves up to 38.9% relative improvement over the baseline AM and 21.4% over the proposed model given the first-best path of the ASR output. These results point in a positive direction and open the possibility of increasing the complexity of the experimental topology. Future work includes extending the word range, using a larger data set, and extending the n-gram LM of the SMT.
References

1. Al-Onaizan, Y., Papineni, K.: Distortion models for statistical machine translation. In: Proc. ACL/COLING, pp. 529–536 (2006)
2. Bates, A., Ostendorf, M., Wright, R.: Symbolic phonetic features for modeling of pronunciation variation. Speech Communication 49, 83–97 (2007)
3. Brown, P., Pietra, S., Pietra, V.: The mathematics of statistical machine translation: Parameter estimation. Computational Linguistics 19(2), 263–311 (1993)
4. Chen, K., Hasegawa-Johnson, M.: Modeling pronunciation variation using artificial neural networks for English spontaneous speech. In: Proc. ICSLP, pp. 1461–1464 (2004)
5. Finch, A., Denoual, E., Okuma, H., Paul, M., Yamamoto, H., Yasuda, K., Zhang, R., Sumita, E.: The NICT/ATR speech translation system for IWSLT 2007. In: Proc. IWSLT, pp. 103–110 (2007)
6. Fosler-Lussier, E.: Contextual word and syllable pronunciation models. In: Proc. IEEE ASRU Workshop (1999)
7. Godfrey, J., Holliman, E., McDaniel, J.: SWITCHBOARD: Telephone speech corpus for research and development. In: Proc. ICSLP, pp. 24–27 (1996)
8. Jitsuhiro, T., Matsui, T., Nakamura, S.: Automatic generation of non-uniform HMM topologies based on the MDL criterion. IEICE Trans. Inf. Syst. E87-D(8) (2004)
9. King, S., Bartels, C., Bilmes, J.: Small vocabulary tasks from Switchboard 1. In: Proc. EUROSPEECH, pp. 3385–3388 (2005)
10. Koehn, P., Och, F.J., Marcu, D.: Statistical phrase-based translation. In: Proc. Human Language Technology Conference, pp. 127–133 (2003)
11. Livescu, K., Glass, J.: Feature-based pronunciation modeling for speech recognition. In: Proc. HLT/NAACL (2004)
12. Och, F., Ney, H.: Discriminative training and maximum entropy models for statistical machine translation. In: Proc. ACL, pp. 295–302 (2002)
13. Och, F., Ney, H.: A systematic comparison of various statistical alignment models. Computational Linguistics 29(1), 19–51 (2003)
14. Pallett, D.: A look at NIST's benchmark ASR tests: Past, present, future. In: Proc. ASRU, pp. 483–488 (2003)
15. Pallett, D., Fiscus, J., Fisher, M., Garofolo, J., Lund, B., Przybocki, M.: 1993 benchmark tests for the ARPA spoken language program. In: Proc. Spoken Language Technology Workshop (1994)
16. Paul, D.B., Baker, J.: The design for the Wall Street Journal-based CSR corpus. In: Proc. ICSLP (1992)
17. Riley, M., Byrne, W., Finke, M., Khudanpur, S., Ljolje, A., McDonough, J., Nock, H., Saraclar, M., Wooters, C., Zavaliagkos, G.: Stochastic pronunciation modelling from hand-labelled phonetic corpora. In: Proc. ETRW on Modeling Pronunciation Variation for Automatic Speech Recognition, pp. 109–116 (1998)
18. Sakti, S., Markov, K., Nakamura, S.: Probabilistic pronunciation variation model based on Bayesian networks for conversational speech recognition. In: Second International Symposium on Universal Communication (2008)
19. Stolcke, A.: SRILM - an extensible language modeling toolkit. In: Proc. ICSLP, pp. 901–904 (2002)
20. Lo, W.K., Soong, F.K.: Generalized posterior probability for minimum error verification of recognized sentences. In: Proc. ICASSP, pp. 85–88 (2005)
Rational Communication and Affordable Natural Language Interaction for Ambient Environments Kristiina Jokinen Department of Speech Sciences, University of Helsinki, Finland [email protected]
Abstract. This paper discusses rational interaction as a methodology for designing and implementing dialogue management in ambient environments. It is assumed that natural (multimodal) language communication is the most intuitive way of interaction, and most suitable when the interlocutors are involved in open-ended activities that concern negotiations and planning. The paper discusses aspects that support this hypothesis by focussing especially on how interlocutors build shared context through natural language and create social bonds through affective communication. Following the design guidelines for interactive artefacts, it is proposed that natural language provides human-computer systems with an interface which is affordable: it readily suggests the appropriate ways to use the interface.
1 Introduction

Human-computer interaction can be seen as a communicative situation where the participants have various kinds of intentions and expectations concerning the content and flow of the communication (see e.g. [1,3,8,13]). It is not our claim, however, that the computer is conscious of its acts or that it understands the meaning of linguistic symbols in the same way as human users; rather, we put forward the view that human-computer interactions are perceived as natural and intentional if the computer agent's operation and interaction capability are based on capabilities similar to those used in human-human interactions. Our interest lies in studying the assumptions and processes that affect natural interaction and the human perception of what natural interaction is. Rationality is a crucial aspect of social agency and communication, and it is important to take it into consideration when building the ambient conversational agents that are emerging in our environment [16]. The tasks that the computer is expected to perform have become more complex: search, control, manipulation, and presentation of information are activities where the relation between the input and the output is not necessarily predetermined in a straightforward one-to-one manner; instead, the system appears to reason without any apparent control by the user. Moreover, the ubiquitous computing paradigm and distributed processing add to the unstructured, uncontrolled nature of interaction: the user interacts with many computers instead of one, and the actual computing may be physically performed somewhere other than in the device next to us (Service-Oriented Architectures). Furthermore, new emerging tasks deal with activities which need
socially aware behaviour: affective computing concerns companions, assistants, and virtual friends which aim not only at efficient task completion but also at providing company, entertainment, and friendship. This kind of behaviour requires sensitivity to non-verbal signals and analysis of their function in controlling and coordinating human behaviour. This paper discusses these issues in the framework of rational communication, focussing especially on the interlocutors' cooperation in constructing the shared context. We first discuss the dialogue activities that the interlocutors take part in, including the change in the role of the computer from a tool into a participant in the interaction. We then move on to the construction of the shared context by dialogue participants, and finally consider the concept of affordance in the design and development of interactive systems. The paper is mainly theoretical and follows thoughts and ideas drawn from the wide body of research dealing with rational interaction as well as from common design guidelines for natural and intuitive interfaces.
2 Dialogue Activities

The use of natural language as an interface mode often goes together with an assumption that the computer should also be able to chat like humans. While this may be a fair requirement for virtual companions, it seems an unreasonable and undesirable requirement for all interactive systems. Different types of communication strategies, styles of interaction, roles, and politeness codes apply to different human activities; similarly, human-computer interactions are situated activities where different constraints operate. For instance, a robot companion and a booking system engage their users in different activities with different goals and participant roles, and the requirements for what is natural, efficient, useful, and helpful vary according to the activity types. Consequently, in the design and development of interactive systems it is important to classify the different activities which involve automated interaction and to identify the purpose of the activity as well as the roles and expectations that constrain the participants' behaviour. Activity analysis for communication has been proposed by several authors, and the background of the ideas goes back to linguistic philosophy, psychology, and sociology; see [1] for an overview. The following four parameters can be used to characterize a social activity among human participants [1]:

• Type: purpose, function, procedures
• Roles: competence/obligations/rights
• Instruments: machines/media
• Other physical environment
The first parameter refers to the reason for an activity, its motivation or rationale, which helps to understand, and consequently define, the means and procedures that can be used to pursue the activity. Procedures can also be used to define the activity: an idea of the purpose of an activity helps to define its specific type. Each activity is also associated with standard activity roles, which refer to the tasks related to the purpose of the activity. Each role is performed by one person and is analyzed in terms of competence requirements, obligations, and rights. Roles give rise to expectations of the
participants’ actions and style of behaviour, and of a coherent and consistent topic. The instruments of the activity refer to tools, machines and other devices that are used to pursue the activity. Usually they create their own patterns of communication. Computers are typically regarded as instruments, i.e. tools used to achieve the main activity goal (however, their role has changed, cf. the discussion below). The parameter “other physical environment” concerns environmental conditions which affect the interlocutors’ behaviour and manner of interaction. These include light, background noise, and furniture, which in ambient environments are also important constraints for the system operation. 2.1 The Computer as a Tool and an Agent As noted, the role of the computer in interactions seems to have changed from a tool to a more interactive and intelligent agent. The traditional HCI view has regarded the computer as a transparent tool: it supports human goals for clearly defined tasks, takes the user’s cognitive limitations into account, and provides the user with explicit information about the task and the system's capabilities. The approach goes well with applications intended to support the users’ work and thinking rather than act as a partner in an interactive situation (e.g. web portals, assistive tools for computer-aided design and computer supported collaborative work). In speech applications, the interaction often follows the HMIHY algorithm [7], smoothed through rigorous statistical analysis. The users are expected to behave according to certain assumptions of how the interaction should and could proceed. However, speech interfaces seem to bring in tacit assumptions of fluent human communication and consequently, expectations of the system’s communicative capability become high. For instance, [9] noticed that the same system gets different evaluations depending on the users’ predisposition and prior knowledge of the system: if the users were told that the system had a graphical-tactile interface with a possibility to input spoken language commands, they evaluated the spoken language communication significantly better than those users who thought they were interacting with a speechbased system which also had a multimodal facility. Developments in interaction technology have also led to reduction of human control over the system’s behaviour. Computers are connected in networks of interacting processors, their internal structure and operations are less transparent due to distributed processing, and various machine-learning techniques make the software design less deterministic and controllable. Context-aware computing investigates systems that can observe their environment and learn preferences of individual users and user groups, adapting decisions accordingly. Mobile devices offer frameworks for group interaction and social context can include virtual friends besides other humans. Future communications with the computer are thus envisaged to resemble human communication: they are conducted in natural language using multimodal interaction technology, understanding the effect of social context. The view of the computer as a passive tool under human control is disappearing. 2.2 Rationality The aim of the interaction research and technology is usually described as being related to the need to increase efficiency and naturalness of interactions between
humans and computer agents, i.e. improving the robustness of the system behaviour. According to the view advocated in this paper, the general principles related to activity analysis, cooperative communication, and rational behaviour also function as the basic principles for successful communication in human-machine interactions. Natural communicative behaviour using language-based symbolic communication provides the standards according to which communicators behave and assess their success in interaction. The actual rationality of system actions may differ from that of the human users, but it is the user's perception of the system's rationality that plays the crucial role in the interaction. In other words, the system is not regarded as a simple reactive machine or a tool, but as capable of acting appropriately in situations which are not directly predictable from the previous action. In search of a definition of rational action, we look at communication in general [1,3,5,8,13,15]. Rationality emerges in social activity as the agent's compliance with the requirements of communicative principles; it captures the agent's communicative competence and is effectively a sign of the agent's ability to plan, coordinate, and choose actions so that the behaviour looks competent and fulfils the goal which motivated the action. The observation that rationality is in fact perceived rationality leads us to a straightforward definition of incompetent behaviour: if the agent's interaction with the partner (or with the environment) does not converge to a conclusion, the agent's behaviour is described as incompetent and irrational, lacking the basic characteristics of a motivated and competent agent.
3 Construction of Dialogues

Several authors have described interactive situations as a cooperative activity in which the shared context is gradually built by the participants by exchanging information [5,12,15]. We call this kind of dialogue modelling constructive [8]. Although the interaction itself is complex, it is assumed that communicative actions can be formulated using only a few underlying principles, and recursive application of these principles allows construction of more complex communicative situations. The constructive approach to interaction management has the following properties:

• The speakers are rational agents, engaged in cooperative activity
• The speakers have partial knowledge of the situation, application, and world
• The speakers conduct a dialogue in order to achieve an underlying goal
• The speakers construct shared context
• The speakers exchange new information on a particular topic.
The main challenge in constructive dialogue modelling lies in the grounding of language: updating one's knowledge and constructing the shared context accordingly. The grounded elements are not automatically shared by other agents, let alone by the same agent in different situations; rather, the agents' repeated exposure to the same kind of communicative situations enables learning from interaction: the shared context is the result of the agents' cooperation in similar situations, their alignment [5,12] with each other, rather than something intrinsic to the agents or the environment as such. A related question is the agents' understanding of the relation between language and the physical environment, i.e., the embodiment of linguistic knowledge via action and interaction, discussed
in experimental psychology and neuroscience. The relation between high-level communication and the processing of sensory information is not yet clear, although much interesting research is being conducted on how sensorimotor activity constitutes linguistic representations [2,14]. This line of research will surely affect the design and development of interactive systems with respect to how natural language interaction is enabled in ambient environments and with robotic companions.
4 Affordance

One of the main questions in the design and development of more natural interactive systems is what it actually means to communicate in a natural manner. For instance, speech is not "natural" if privacy issues are involved or if the enablements for speech communication are not fulfilled at all. Following the activity analysis (Section 2), if the participants understand the rationale behind the activity, it is easier for them to comprehend the means and procedures that can be used to pursue it. Concerning interactive systems, the adjective "natural" should not only refer to the system's ability to use natural language, but also to its support for functionality that the user finds intuitive in that it fulfils the rationale behind the interaction. We say that interactive systems should afford natural interaction. The concept of affordance was introduced by [6] in the field of visual perception to denote the properties of the world that actors can act upon, and was brought to product design and HCI by [11]. Depending on whether we talk about the uncountable number of actionable relationships between the actor and the world, or about the action possibilities perceived by the user, we can distinguish real and perceived affordances, respectively. In HCI, the (perceived) affordance relates to the properties of an object that suggest to the user the appropriate ways to use the artefact. The smart technology environment should thus afford natural interaction techniques, i.e. interfaces should lend themselves to natural use without the users needing to reason about how the interaction should take place to get the task completed. Interactive spoken conversations include a wide communicative repertoire of non-verbal signals concerning the speaker's emotions, attitudes, and intentions. Recent research has focussed on collecting large corpora of natural conversations and examining how ordinary people communicate verbally and non-verbally in their daily conversations. This allows the building of experimental models for affordable spoken interactions, which can then be integrated into the design of naturally interacting applications and services [e.g. 4,8]. Such intuitive communication strategies also encourage interdisciplinary research between the human sciences and the technological possibilities of how to design and construct affordable interactive systems.
5 Conclusions

This paper has discussed various issues related to rational and cooperative interaction between humans and intelligent computer agents. The discussion has been mainly theoretical, with the aim of understanding the development of human-computer interactions and exploring the impact of new digital technology on society and human communication in general. Understanding of the communicative activities that
humans get involved in may help to understand the activities where the partner is an intelligent automatic agent. We tend to think of inanimate objects as tools that can be used to perform particular tasks. Computers, however, are not only tools but complex systems whose manipulation requires special skills, and interaction with them is not necessarily a step-wise procedure of commands, but resembles natural language communication. Thus it seems reasonable that the design and development of interactive systems takes into account communicative principles that concern rational activity and cooperation in human interaction.
References

1. Allwood, J.: An Activity Based Approach to Pragmatics. Gothenburg Papers in Theoretical Linguistics 76, Dept. of Linguistics, Göteborg University (2000)
2. Arbib, M.: The evolving mirror system: A neural basis for language readiness. In: Christiansen, M., Kirby, S. (eds.) Language Evolution, pp. 182–200. Oxford UP, Oxford (2003)
3. Bunt, H.C.: A framework for dialogue act specification. In: Fourth Workshop on Multimodal Semantic Representation (ACL-SIGSEM and ISO TC37/SC4), Tilburg (2005)
4. Cassell, J., Sullivan, J., Prevost, S., Churchill, E. (eds.): Embodied Conversational Agents. MIT Press, Cambridge (2003)
5. Clark, H.H., Schaefer, E.F.: Contributing to Discourse. Cognitive Science 13, 259–294 (1989)
6. Gibson, J.: The Ecological Approach to Visual Perception. Houghton Mifflin, Boston (1979)
7. Gorin, A.L., Riccardi, G., Wright, J.H.: How May I Help You? Speech Communication 23(1-2), 113–127 (1997)
8. Jokinen, K.: Constructive Dialogue Modelling - Speech Interaction and Rational Agents. John Wiley & Sons, Chichester (2009a)
9. Jokinen, K.: Gesturing in Alignment and Conversational Activity. In: Proceedings of the Pacific Linguistic Conference, Sapporo, Japan (2009b)
10. Jokinen, K., Hurtig, T.: User Expectations and Real Experience on a Multimodal Interactive System. In: Proceedings of Interspeech 2006, Pittsburgh, US (2006)
11. Norman, D.A.: The Psychology of Everyday Things. Basic Books, New York (1988)
12. Pickering, M., Garrod, S.: Towards a mechanistic psychology of dialogue. Behavioral and Brain Sciences 27, 169–226 (2004)
13. Sadek, D., Bretier, P., Panaget, F.: ARTIMIS: Natural dialogue meets rational agency. In: Proceedings of IJCAI 1997, pp. 1030–1035 (1997)
14. Tomasello, M.: First Verbs: A Case Study of Early Grammatical Development. Cambridge University Press, Cambridge (1992)
15. Traum, D.: Computational models of grounding in collaborative systems. In: Working Papers of the AAAI Fall Symposium on Psychological Models of Communication in Collaborative Systems, pp. 124–131. AAAI, Menlo Park (1999)
16. Weiser, M.: The Computer for the Twenty-First Century. Scientific American, 94–104 (1991)
Construction and Experiment of a Spoken Consulting Dialogue System Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura MASTAR Project, NICT, Kyoto, Japan http://mastar.jp/index-e.html
Abstract. This paper addresses a spoken dialogue framework that helps users make decisions. Various decision criteria are involved when we select an alternative from a given set of alternatives. When adopting a spoken dialogue interface, users have little idea of the kinds of criteria that the system can handle. We therefore consider a recommendation function that proactively presents information the user would be interested in. We implemented a sightseeing guidance system with this recommendation function and conducted a user experiment, providing an initial analysis of the framework in terms of system prompts and user behavior, as well as of user behavior and user knowledge.
1 Introduction
Over the years, a great number of spoken dialogue systems have been developed. Typical task domains include airline information (ATIS & DARPA Communicator) [1] and railway information (MASK) [2]. Dialogue systems are, in most cases, used in the fields of database (DB) retrieval and transaction processing, and dialogue strategies are optimized so as to minimize the cost of information access. Meanwhile, in many situations where spoken dialogue interfaces are installed, information access by the user is not a goal in itself, but a means for decision making [3]. For example, in using a restaurant retrieval system, the user's goal may not be the extraction of price information but to make a decision based on the retrieved information about candidate restaurants. Only a few studies have addressed spoken dialogue systems that help users make decisions. In this paper, we present our model of consulting dialogue systems with speech interfaces, which we realize in the implementation of a sightseeing guidance system for Kyoto city, and we provide a preliminary analysis of users' experience while engaging with this system.
2 Dialog Model for Consulting
A sightseeing guidance system of the type that we are constructing is regarded as a kind of decision support system. That is, the user selects an alternative from a given set of alternatives based on some criteria.
[Figure 1 depicts the hierarchy: the goal "Choose the best spot" at the top; criteria such as cherry blossoms, Japanese garden, and easy access in the middle; and alternatives such as Kinkakuji temple, Ryoanji temple, and Nanzenji temple at the bottom.]
Fig. 1. Hierarchy structure for sightseeing guidance dialogue
There have been many previous studies of decision support systems in the operations research field, where the typical method employed is the Analytic Hierarchy Process (AHP) [4]. In the AHP, the problem is modeled as a hierarchy that consists of the decision goal, the alternatives for reaching it, and the criteria for evaluating these alternatives. In the case of our sightseeing guidance system, the goal is to decide on an optimal spot that agrees with the user's preferences. The alternatives are all sightseeing spots that can be proposed and explained by the system. As criteria, we adopt the determinants that we have defined in our tagging scheme of the Kyoto sightseeing guidance dialogue corpus [5]. The determinants include various factors that are used to plan sightseeing activities, such as "cherry blossoms", "Japanese garden", etc. An example hierarchy using these criteria is shown in Fig. 1. When adopting such a hierarchy structure, the problem of deciding on the optimal alternative can be solved by estimating weights for the criteria. These are often obtained through pairwise comparisons, followed by weight tuning based on the results of such comparisons [4]. However, this methodology cannot be applied directly to spoken dialogue systems: the system knowledge is usually not fully observable to users at the beginning of a dialogue and is revealed only through interaction with the system. In addition, spoken dialogue systems usually handle quite a few candidates and criteria, which makes pairwise comparison costly. Although several studies have dealt with decision making through a spoken dialogue interface [3], these works assume that the users know all the criteria that the users/system can use for making a decision. In this work, we assume a situation where users are unaware not only of what kind of information the system can provide but also of their own preferences or the factors they should emphasize. We thus consider a spoken dialogue system that provides users with information via system-initiative recommendations. We assume that the number of alternatives is relatively small, and that all alternatives are known to the users. This is highly likely in real-world situations; for example, when a user selects one restaurant from a list of candidates presented by a car navigation system.
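For illustration, once criterion weights are available, choosing the optimal alternative over binary determinant annotations of the kind described in Sect. 3.2 reduces to a weighted sum; a minimal sketch, with invented spots, determinants, and weights:

```python
# Each spot carries binary determinant evaluations; the best spot maximizes
# the weighted sum of the determinants that apply to it.
SPOTS = {
    "Kinkakuji": {"cherry blossoms": 1, "Japanese garden": 1, "easy access": 0},
    "Ryoanji":   {"cherry blossoms": 0, "Japanese garden": 1, "easy access": 1},
    "Nanzenji":  {"cherry blossoms": 1, "Japanese garden": 0, "easy access": 1},
}

def best_spot(weights):
    """Return the alternative maximizing sum_c weight(c) * eval(spot, c)."""
    def score(spot):
        return sum(w * SPOTS[spot].get(c, 0) for c, w in weights.items())
    return max(SPOTS, key=score)

print(best_spot({"cherry blossoms": 0.5, "easy access": 0.3}))  # -> "Nanzenji"
```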
3 Decision Support System with Spoken Dialogue Interface

3.1 System Overview
The dialogue system we constructed has two functions: answering users' requests and recommending information. With the answering function, the system can explain the sightseeing spots in terms of every determinant, unlike conventional systems that can only give a pre-set abstract of a given spot. With the recommendation function, the system provides information about what it can explain, since novice users are unlikely to know the capabilities of the system (e.g., the system, as part of its recommendation, names determinants that the user might be interested in). The system flow based on these strategies is summarized below. The system:

1. Recognizes the user's utterance,
2. Detects the spot and determinant in the user's utterance,
3. Presents information based on this understanding,
4. Recommends information related to the current topic.

3.2 Knowledge Base
Our back-end DB consists of 15 sightseeing spots as alternatives, with 10 determinants described for each spot. The number of alternatives is small compared to systems dealing with information retrieval; this work focuses on the process of comparing and evaluating candidates that meet an "essential condition" such as "famous temple around Kyoto station". We selected determinants that frequently appear in our dialogue corpus [5]; they are listed in Table 4. Normally, these determinants are related and dependent on one another, but in practice they are assumed to be independent and to have a parallel structure. The spots are annotated in terms of these determinants: the evaluation value is "1" when the determinant applies to the spot and "0" when it does not. The explanatory text is generated by retrieving appropriate reasons from the Web. An example of the DB is shown in Table 1.

Table 1. Example of the database (translation of Japanese)

Spot name        Determinant      Eval.  Text
Kiyomizu temple  Cherry blossoms  1      There are about 1,000 cherry trees in the temple ground. Best of all, the vistas from the main temple are amazing.
                 Vista            1      The temple stage is built on the slope, and the views of the town from here are breathtaking.
                 Not Crowded      0      This temple is very famous and popular, and is thus constantly crowded.
                 ...              ...    ...
3.3 Speech Understanding and Response Generation
Our speech understanding process tries to detect sightseeing spots and determinant information in the automatic speech recognition (ASR) results. We thus prepared two modules, one for the spots and one for the determinants. In order to facilitate flexible understanding, we adopted an example-based understanding method based on vector space models. That is, the ASR results are matched against a set of documents written about the target spots (all sourced from Wikipedia), and the spots with the highest matching scores are used as understanding results. The ASR results are also matched against a set of sample query sentences to detect determinants. In addition, we concatenate contextual information on the spot or determinant under current focus if the ASR results include either a spot or a determinant. The system then generates a response by selecting one of the appropriate responses in the DB and presents it through synthesized speech.
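A minimal sketch of such vector-space matching, assuming scikit-learn and a pair of invented toy documents (the paper's actual documents are sourced from Wikipedia):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy documents, one per sightseeing spot (invented for the example).
spot_docs = {
    "Kinkakuji": "golden pavilion zen temple garden pond",
    "Kiyomizu": "wooden stage hillside vista cherry blossoms",
}

names = list(spot_docs)
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(spot_docs[n] for n in names)

def detect_spot(asr_result):
    """Return the spot whose document is most similar to the ASR output."""
    sims = cosine_similarity(vectorizer.transform([asr_result]), doc_matrix)
    return names[sims.argmax()]

print(detect_spot("I want to see the golden pavilion"))  # -> "Kinkakuji"
```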
3.4 System Initiative Recommendation
In information retrieval systems, users often have difficulty formulating queries. This is particularly the case when they are unsure of what information the system possesses. In addition, it is important to raise users' awareness of their potential preferences through the dialogue. We thus designed a system-initiative recommendation, which is appended after each system response. The content of the recommendation is determined by one of the following three methods:

1. Recommendation based on the currently focused spot. This method is based on the user's current focus on a particular spot: the system selects three determinants of the current spot whose evaluation is "1" and presents them to the user.
2. Recommendation based on the currently focused determinant. This method is based on the focus on a specific determinant: the system selects three spots whose evaluation of the current determinant is "1".
3. Open prompt. The system does not make a recommendation and presents an open prompt. Once users are familiar with the domain and the system knowledge, this option may be preferable, since users may become irritated by repetitive recommendations.

Although various recommendation methods could be used, such as the collaborative filtering often used in recommender systems [6], in this work we use these simple methods, because we do not estimate user preferences and the system might otherwise present irrelevant information. An example dialogue with the system is shown in Table 2. Note that the system was set to choose among the possible recommendations randomly in the following experiment.
Table 2. Example dialogue with the system (translation of Japanese)

S1: Hello, this is the Kyoto tourist information system. How may I help you?
U1: Please show me the cherry blossom of Ninnaji.
S2-Ans.: The cherry blossoms of Ninnaji are specially referred to as "omuro no sakura". These cherry trees are quite unique because they are lower than the others.
S2-Rec.: Japanese garden, events and fall foliage are some of the areas in which information is available on Ninnaji; do you want me to explain? (Method 1)
...
U11: Does it have a lovely landscape?
S12-Ans.: From the top of the famous gate, you can see Kyoto city spread out below.
S12-Rec.: I can name Kiyomizu-temple, Fushimiinari and Kurama-temple as famous spots that grant great vistas. (Method 2)
U12: Kiyomizu temple.
S13-Ans.: Kiyomizu temple is built on the side of the hill, and you are treated to a view of Kyoto city center.
S13-Rec.: Do you have any other thing you want to know? (Method 3)
U13: I decided to go to Nanzenji temple.
S14: The item, Nanzenji temple, has been chosen. Are there any questions?
...
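The three recommendation methods can be sketched as follows; the `db` structure mirrors the binary annotations of Table 1, the method is chosen randomly among those applicable (as in the experiment), and all names are hypothetical:

```python
import random

def recommend(db, focus_spot=None, focus_det=None):
    """Generate a recommendation for the current dialogue focus (Sect. 3.4)."""
    methods = ["open"]                       # Method 3 is always available
    if focus_spot is not None:
        methods.append("spot")               # Method 1
    if focus_det is not None:
        methods.append("det")                # Method 2
    method = random.choice(methods)
    if method == "spot":
        dets = [d for d, v in db[focus_spot].items() if v == 1][:3]
        return f"I can tell you about {', '.join(dets)} for {focus_spot}."
    if method == "det":
        spots = [s for s, anns in db.items() if anns.get(focus_det) == 1][:3]
        return f"{', '.join(spots)} are famous for {focus_det}."
    return "Do you have any other thing you want to know?"
```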
Table 3. Percentages of user utterance types for system prompts

                                Opening (%)  Method 1 (%)  Method 2 (%)  Method 3 (%)
Acceptance                      -            71.5          30.0          -
Determinant                     15.8         7.0           25.0          38.6
Spot name                       50.0         16.8          16.7          24.8
Determinant and spot name       0            1.4           6.7           2.0
Others (Commitment, OOS, etc.)  34.2         3.3           21.6          34.6

4 System Experiment and Analysis

4.1 User Experiment
We collected test data from 72 subjects who had not used our system before. Subjects were requested to use the system to choose one sightseeing spot out of the 15 alternatives, based on the information obtained from the system. No instructions on the system knowledge were given apart from example utterances such as "Could you please tell me the spots famous for XXX?" and "Tell me about XXX temple". We asked the subjects not to use their own knowledge and experience while arriving at a decision. Further, they were requested to utter the phrase "I'll go to XXX," signifying commitment, once they had reached a decision. Only one dialogue session was collected per subject, since a first session would very likely have altered the level of user knowledge. The average length of dialogue before a user communicated his/her commitment was 16.3 turns, with a standard deviation of 7.0 turns.

4.2 Analysis of Collected Dialogue Sessions
We transcribed a total of 1,752 utterances and labeled their correct dialogue acts (spots and determinants) by hand. The percentage of user utterances that the
system could handle was 89.0%, of which the system could respond correctly to 72.4%.

Analysis of user utterances. First, we analyzed the relationship between system prompts and user utterances. The percentages of user utterance types for the system prompts are shown in Table 3. "Acceptance" refers to cases where the user accepts the recommendation: Method 1 is regarded as accepted when the user requests one of the recommended determinants, and Method 2 when the user requests one of the recommended spots. The tendency of user utterances varies according to the recommendation type. Many users make queries the system cannot handle (out of system; OOS) in the opening and after the open prompt (Method 3). Meanwhile, many users can make in-domain queries when the system presents its knowledge through recommendations.

Analysis of user preference and domain knowledge. We analyzed the sessions in terms of the preferences and domain knowledge of the subjects. Table 4 lists the percentages of subjects who emphasize each determinant when selecting sightseeing spots, based on questionnaires conducted after the dialogue session (multiple selections were allowed).

Table 4. Analysis of user preference and knowledge

Determinant      Value it (%)  Uttered it (%)  Uttered before system recom. (%)
Japanese garden  34.7          47.2            22.2
Not crowded      19.4          41.7            1.4
World heritage   48.6          50.0            2.7
Vista            48.6          22.2            1.4
Easy access      16.7          19.4            19.4
Fall foliage     37.5          47.2            18.1
Cherry flower    33.3          51.4            13.9
History          43.1          31.9            12.5
Stroll           45.8          38.9            1.4
Event            29.2          36.1            8.3

Since subjects were asked to select determinants from the list of all determinants, their selections are considered to be their preferences given full knowledge of the system. However, when the subjects start the dialogue sessions, some of these preferences are only potential preferences, owing to the limited nature of the users' knowledge about the system. In order to analyze the knowledge reflected in user utterances, we measured the percentage of utterances that included each determinant before the system recommended it; the result is also shown in Table 4. Several determinants were seldom uttered before the system made its recommendations, even though they were important for many users. For example, "World heritage" and "Stroll" were seldom uttered before the system's recommendation, despite the fact that around half of the users had emphasized them. These results show that some of the users' actual preferences
remained potential preferences before the system made its recommendations, or, at the very least, the users were not aware that the system was able to explain those determinants. It is thus important to make users notice their potential preferences through system-initiative recommendations.

4.3 Analysis of Users' Decisions
Finally, we analyzed the relationship between user preferences and the decided spot by counting the attributes on which the user's preferences and the chosen spot agreed. The average number of agreements was 2.20, higher than the expected value under random selection (1.96). However, if the users had known about their potential preferences and about the system knowledge, and had then selected the optimal spot according to their preferences, the average number of agreements would have been 3.34. This result indicates that an improved recommendation strategy can help users make a better choice.
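A sketch of this evaluation over the binary annotations, with the random-selection expectation computed as the mean agreement over all spots (data structures as in the earlier sketches):

```python
def agreements(prefs, spot_annotations):
    """Count determinants both valued by the user and true of the spot."""
    return sum(1 for d in prefs if spot_annotations.get(d) == 1)

def random_expectation(prefs, db):
    """Expected agreement count for a spot chosen uniformly at random."""
    return sum(agreements(prefs, anns) for anns in db.values()) / len(db)
```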
5 Conclusion
In this paper, we addressed a spoken dialogue framework that helps users select an alternative from a list of alternatives. Through an experimental evaluation, we confirmed that user utterances are largely affected by system recommendations; moreover, we found that users can be helped to make better decisions by improving the dialogue strategies. We will therefore extend the framework of this research to estimate users' preferences from their utterances. In addition, the system is expected to handle more complex planning of natural language generation in recommendations, such as that discussed in [7]. We also plan to optimize the selection of responses and recommendations based on users' preferences and state of knowledge.
References

1. Levin, E., Pieraccini, R., Eckert, W.: A Stochastic Model of Human-Machine Interaction for Learning Dialog Strategies. IEEE Trans. on Speech and Audio Processing 8, 11–23 (2000)
2. Lamel, L., Bennacef, S., Gauvain, J.L., Dartigues, H., Temem, J.N.: User Evaluation of the MASK Kiosk. Speech Communication 38(1) (2002)
3. Polifroni, J., Walker, M.: Intensional Summaries as Cooperative Responses in Dialogue: Automation and Evaluation. In: Proc. ACL/HLT, pp. 479–487 (2008)
4. Saaty, T.: The Analytic Hierarchy Process: Planning, Priority Setting, Resource Allocation. McGraw-Hill, New York (1980)
5. Ohtake, K., Misu, T., Hori, C., Kashioka, H., Nakamura, S.: Annotating Dialogue Acts to Construct Dialogue Systems for Consulting. In: Proc. 7th Workshop on Asian Language Resources, pp. 32–39 (2009)
6. Breese, J., Heckerman, D., Kadie, C.: Empirical Analysis of Predictive Algorithms for Collaborative Filtering. In: Proc. 14th Annual Conference on Uncertainty in Artificial Intelligence, pp. 43–52 (1998)
7. Rieser, V., Lemon, O.: Natural Language Generation as Planning Under Uncertainty for Spoken Dialogue Systems. In: Proc. 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL) (2009)
A Study Toward an Evaluation Method for Spoken Dialogue Systems Considering User Criteria Etsuo Mizukami, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura National Institute of Information and Communications Technology 3-5, Hikaridai, Seikacho, Sorakugun, 619-0289, Kyoto, Japan [email protected]
Abstract. In the development cycle of a spoken dialogue system (SDS), it is important to know how users actually behave and talk and what they expect of the SDS. We are developing SDSs which realize natural communication between users and systems. To collect users' real data, a wide-scale experiment was carried out with a smart-phone prototype SDS. In this brief paper, we report on the experiment's results and make a tentative analysis of cases in which there were gaps between system performance and user judgment. This requires both an adequate experimental design and an evaluation methodology that considers users' judgment criteria.
1 Introduction

We are developing spoken dialogue systems (SDSs) which realize natural communication between users and systems. To this end, we adopt a mobile-type SDS as a prototype system. A mobile-type system could enable users to access information easily at any place and any time. Though the performance of each system module has progressively improved, systems cannot at present behave ideally, as a human would, if users talk to them as they do to humans without proper usage methods. We must develop systems with consideration of how users behave with them and what they expect of them. In this brief report, we introduce a smart-phone prototype system, report some results of a wide-scale experiment with human monitors and of an analysis of the interaction between users and the system, and discuss a problem regarding the relation between users' expectations and the system's performance.
2 Methods

We constructed systems with data and dialogue scenarios for sightseeing guidance in Kyoto. The use case of the mobile-type system is a situation in which users want to decide on a place to visit and obtain the necessary information about it. In this section we describe the outline of the mobile system and the procedures of the experiment.
2.1 Outline of System

The mobile-type, server-client system is mounted on a smart-phone (iPhone, ©Apple Inc.). The user inputs a request in natural spoken language by touching the microphone icon on the display panel and completes the request by touching the end icon. During input, the speech wave data is divided sequentially into chunks of a certain size and sent to the speech recognition server wirelessly (3G or WiFi), and the recognition process starts when the last chunk arrives. The result of the speech recognition is sent to the dialogue management server (WFSTDM - Weighted Finite State Transducer Dialog Manager (Hori et al., 2009)), and a corresponding action is called through language understanding and the scenario WFSTDM. The relevant backend module (map retrieval, bus timetable information, or the Kyoto tour guide database) is then called, and the extracted information is sent to the client together with the wave data of the system speech generated by the speech synthesis server, to be played on the speaker. The user can scroll and resize the display via the iPhone's touch interface.

2.2 Experiment Procedures

The experiment was carried out over 10 days in a Kyoto office room in Dec. 2009. Since 10 participants took part per day, the total number of participants was 80, including 10 who participated three times, for a total of 100 sessions. The participants consisted of 46 females and 34 males, aged from their 20s to their 70s, none of whom had previous experience using spoken dialogue systems. First, participants answered a questionnaire about their background (a modified and translated ITU-T Rec. P.851 questionnaire (URL: http://www.itu.int/itu-t/recommendations/)) and received an explanation of the experiment and of how to use the system. If the participant had no questions about the procedures and usage of the system, the experimental tasks were conducted as follows.

1) Practical scenario task. In order to acclimate participants to the system, all were instructed to input a typical scenario composed of seven user turns, as follows.

User: Hello.
System: Welcome to Kyoto. This is the Kyoto tour guide system. Do you have any particular place you'd like to go?
User: I'd like to go somewhere with a beautiful garden.
System: Fifty-nine sites were found as recommended sites for gardens, such as Tenryuji, Ginkakuji, and Kyoto Imperial Palace. Which are you interested in?
User: Do you have a recommendation?
System: How about Kyoto Imperial Palace? Kyoto Imperial Palace is very much recommended for the beauty of its gardens. It has a lovely and famous Japanese-style garden called Oikeniwa, which was built in the early Edo period. Is there any other information that you'd like to know?
User: How do I get there?
System: Access to Kyoto Imperial Palace is shown here. Is there any other information that you are interested in regarding Kyoto Imperial Palace?
User: Show me a map of the area.
System: Here is a map of the area around Kyoto Imperial Palace. Is there any other information that you are interested in regarding Kyoto Imperial Palace?
User: No, I've decided to go there.
System: Are you sure that Kyoto Imperial Palace is your goal?
User: Yes, thanks.
System: Kyoto Imperial Palace is decided as the goal. Thank you very much for using the system.
2) Free scenario task. One aim of the experiment was to collect data on users' natural utterances and expressions, and this task was designed with that aim. Participants were instructed to use the system freely, without concern for the wording of their utterances as in the practical scenario task, with the goal of deciding on a destination, repeating the task until the time limit. The time limit was set at 15 min. including the practical scenario task. Half of the one-time participants used the system while referring to phrases acceptable to the system (Presented), while the others did not refer to them (Non-presented). After both tasks, the participants judged the system by means of a subjective evaluation using a modified and translated ITU-T Rec. P.851 questionnaire: they responded to 20 statements in terms of "Disagree," "Somewhat disagree," "Somewhat agree," and "Agree." Table 1 shows the questionnaire statements. Participants were also asked to share any comments or needs. During the experiments, the participants' usage of the system was recorded on video (SONY DSR45) and audio (MOTU828 and Digital Performer).

Table 1. Questionnaire statements for subjective evaluation of the system

Q1: My overall impression of the system was good.
Q2: The system provided the desired information.
Q3: The provided information was complete.
Q4: The information was clear.
Q5: I expected more help from the system.
Q6: I felt well-understood by the system.
Q7: The system's voice was clear.
Q8: I understood what the system expected from me.
Q9: The system's behaviour was always as expected.
Q10: The system reacted naturally.
Q11: I was able to control the dialogue in the way I wanted.
Q12: The system reacted fast enough.
Q13: The system's voice was natural.
Q14: Overall, I was satisfied with the dialogue.
Q15: I enjoyed the dialogue.
Q16: During the dialogue, I felt relaxed.
Q17: I prefer a human guide.
Q18: I can see this possibility for obtaining information as helpful.
Q19: The handling of the system was easy.
Q20: In the future, I would use the system again.
3 Results

The obtained data were system logs, wave data saved on the speech recognition server and the audio recorder, video data, and the subjective judgments and users' requirements from the questionnaires. The data of two sessions were omitted from the analysis, since those sessions were carried out with an alternative system due to system trouble. The wave data the remaining participants input were transcribed as text data for the evaluation (N = 98 sessions).

3.1 Interaction Parameters

Table 2 shows results from the experiment as interaction parameters (Möller, 2007) with respect to the system performance. Since our system speaks in reply to the user's request, the number of system turns should be the same as the number of user turns; the difference between them came mainly from cases where noise was detected although the participant, despite having touched the input icon, did not speak. WER is the word error rate for user utterances, defined as (S+D+I)/N, where N is the number of words in the user's sentences, C the number of correct words in the recognition result, S the number of substitution errors, I the number of insertion errors, and D the number of deletion errors. S.ERR is the sentence error rate, i.e., the rate of recognized sentences that do not match the user's sentences. CR is the correct response of the system, including incomplete information such as "Restaurant information is not available now." in response to a user request such as "Is there any place to have lunch around Kinkakuji?" Some values indexing system performance tended to be much worse for older participants: the S.ERR and rate of CR for those in their 50s were 51.8% and 46.8%, respectively, versus 36.1% and 60.9% for those in their 20s. There were no significant differences among age groups regarding WER.

Table 2. Statistical data and interaction parameters obtained from the recorded data. See the text for more details.
Interaction parameter: amount
Total user turns: 4409
Total system turns: 4441
WER (%): 19.9
S.ERR (%): 38.6
Rate of CR of task 1 (%): 88.3
Rate of CR of task 2 (%): 63.3
Av. system-user delay (s): 4.90
Av. user-system delay (s): 5.51
Av. user-utterance duration (s): 1.64
Av. system-utterance duration (s): 7.02
Total # of tasks: 315
Av. # of tasks per session: 3.21
Total # of task successes: 228
Rate of task successes (%): 72.4
Max user turns per task: 42
Av. user turns per task: 11.5
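To make the WER definition in the text concrete, here is a minimal sketch (not the authors' scoring tool) that counts substitutions, deletions, and insertions via a standard edit-distance alignment; the reference/hypothesis pair is hypothetical:

```python
# Minimal WER sketch (not the authors' scoring tool): count substitutions
# (S), deletions (D) and insertions (I) against a reference via
# edit-distance alignment, then return (S + D + I) / N.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = (cost, S, D, I) for aligning ref[:i] with hyp[:j]
    dp = [[(0, 0, 0, 0)] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(1, len(ref) + 1):
        dp[i][0] = (i, 0, i, 0)                    # delete everything
    for j in range(1, len(hyp) + 1):
        dp[0][j] = (j, 0, 0, j)                    # insert everything
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            if ref[i - 1] == hyp[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]        # exact match, no cost
            else:
                c, s, d, ins = dp[i - 1][j - 1]
                sub = (c + 1, s + 1, d, ins)       # substitution
                c, s, d, ins = dp[i - 1][j]
                dele = (c + 1, s, d + 1, ins)      # deletion
                c, s, d, ins = dp[i][j - 1]
                insr = (c + 1, s, d, ins + 1)      # insertion
                dp[i][j] = min(sub, dele, insr)    # cheapest edit path
    _, s, d, ins = dp[len(ref)][len(hyp)]
    return (s + d + ins) / len(ref)                # (S + D + I) / N

print(wer("is there any place to have lunch", "is there a place have lunch"))
```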
Task success was defined as occurring only when the participant decided on a destination and closed the dialogue. Task success in the free scenario task was difficult to define, because participants could request or query information about one spot after another without deciding on any destination. For the same reason it is difficult to calculate the Kappa coefficient adopted in PARADISE (Walker et al., 1997), because we cannot define the necessary "keys (slots)." Cases in which participants cancelled by inputting "Restart," or in which the system crashed due to some form of trouble, were counted as failures. The rate of task success and the other interaction parameters are also shown in Table 2.
3.2 Subjective Evaluation
The values of the subjective evaluations obtained were converted to -1, -0.33, +0.33, and +1. From a factor analysis of these data (maximum likelihood solution, promax rotation; Q5 and Q17 were omitted because of their low communality), four factors were extracted and named "Acceptability" (constructed by Q1, Q2, Q9, Q10, Q11, and Q14), "System Potential" (Q3, Q4, Q18, and Q20), "System Transparency" (Q7, Q8, Q12, and Q13), and "User Comfort" (Q15, Q16, and Q19). The average values of the subjective evaluations of the statements composing each factor with high factor loading (> 0.35) were Acceptability = -.22, Potential = +.03, Transparency = +.17, and Comfort = -.08. The correlation between the factor score of Acceptability and the sentence error rate was significantly negative (Kendall's tau = -.145, p < .05). There was also a significantly positive correlation (Kendall's tau = .240, p < .01) between the Acceptability score and the rate of correct responses. Thus, participants who were recognized and responded to correctly by the system tended to judge the Acceptability statements, such as Q1 and Q2, comparatively high. On the other hand, not all participants who were correctly responded to in over 70% of their requests (av. CR > 89.4%) judged Acceptability high; e.g., the average judgment on Q14, "Overall, I was satisfied with the dialogue.", was 0.0 (N = 19), while the average CR of the participants who judged Q14 as "Agree (1)" or "Somewhat agree (0.33)" was 60.0%. There were no significant differences between the two experimental groups (Presented and Non-presented) in either the interaction parameters or the subjective evaluations.
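The rank-correlation analysis above can be reproduced in outline as follows; this is a sketch with hypothetical judgments and per-user CR rates, using the same -1/-0.33/+0.33/+1 score mapping:

```python
# Sketch of the rank-correlation analysis: 4-point questionnaire judgments
# are mapped to numeric scores and correlated (Kendall's tau) with
# per-user correct-response (CR) rates. All data here are hypothetical.
from scipy.stats import kendalltau

SCORE = {"Disagree": -1.0, "Somewhat disagree": -0.33,
         "Somewhat agree": 0.33, "Agree": 1.0}

judgments = ["Agree", "Somewhat agree", "Disagree",        # hypothetical Q1 answers
             "Somewhat disagree", "Agree", "Somewhat agree"]
cr_rates  = [0.72, 0.65, 0.31, 0.44, 0.80, 0.58]           # hypothetical CR per user

scores = [SCORE[j] for j in judgments]
tau, p = kendalltau(scores, cr_rates)
print(f"Kendall's tau = {tau:.3f}, p = {p:.3f}")
```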
4 Discussion and Conclusion
Why did some users rate the system poorly despite its high performance? Several factors that could influence the evaluation were extracted from the analysis of the interaction between the users and the system and from the users' comments and needs given in the questionnaires. 1) The user's stance toward the system: strict users did not repeat the same request once the system failed to respond, and tended to perform the free scenario task like routine work; that is, they repeated typical interactions in a similar way to the practical scenario task. For such users, the most important thing may be the speech-recognition ability, i.e., the credibility of the SDS as an interface. 2) Generation, or habituation to machines: older participants tended to be misrecognized by the system comparatively often. There seemed to be problems in how they used the system, such as the timing of touching the icon to input requests, or the production of substandard speech, e.g., speaking too loudly, in a broken manner, or too quietly.
For such users, the most important thing may be the understandability of how to use the system, before its other abilities. These results indicate the importance of experimental design for adequate evaluation of SDSs, including clarification of the target users, task design, and instructions that take priming effects into account. Ideally, however, an evaluation method would also consider users' backgrounds and communication styles (Mizukami et al., 2009). The above problems may stem from the fact that all users have their own criteria for judging systems, and differences in users' criteria or expectations can hinder a stable evaluation. A method like PARADISE (Walker et al., 1997) addresses this problem by adopting values that correlate highly with user satisfaction as efficiency measures, but judgments that deviate from this correlation, as in the cases above, are presumably ignored. One approach to this problem is to normalize the value of the subjective evaluation by the rate of correct responses. For instance, a normalized subjective evaluation NSE may be defined as NSE_i(Q#) = SE_i(Q#) * cr_i, where Q# is the number of a statement that has a significantly positive correlation with the rate of correct responses, i is the user ID, and cr_i is the rate of correct responses when SE > 0 and the rate of failed responses when SE < 0. After applying this normalization, the factor values are adjusted to Acceptability = -.11, Potential = +.05, Transparency = +.17, and Comfort = -.08. We are now improving the system based on the experimental results, along with the methodology of the experiment and the evaluation framework.
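As a worked illustration of the NSE normalization just defined (a minimal sketch; the SE values and per-user CR rates are hypothetical):

```python
# Worked sketch of the NSE normalization: for user i,
# NSE_i(Q#) = SE_i(Q#) * cr_i, where cr_i is the user's correct-response
# rate when SE > 0 and the failed-response rate when SE < 0.
# All SE values and CR rates below are hypothetical.
def normalize(se, correct_rate):
    if se > 0:
        return se * correct_rate           # weight praise by success rate
    if se < 0:
        return se * (1.0 - correct_rate)   # weight criticism by failure rate
    return 0.0

users = [(1.0, 0.60), (0.33, 0.75), (-0.33, 0.89), (-1.0, 0.40)]  # (SE, CR)
for se, cr in users:
    print(f"SE = {se:+.2f}, CR = {cr:.2f} -> NSE = {normalize(se, cr):+.3f}")
```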
References
1. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical dialog management applied to WFST-based dialog systems. In: Proceedings of the 2009 IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 4793–4796 (2009)
2. Mizukami, E., Kashioka, H., Kawai, H., Nakamura, S.: An Exploratory Analysis on Users' Communication Styles Affecting Subjective Evaluation of Spoken Dialogue Systems. In: Proceedings of the 1st IWSDS (2009)
3. Möller, S.: Evaluating Interactions with Spoken Dialogue Telephone Services. In: Recent Trends in Discourse and Dialogue, pp. 69–100. Springer, Heidelberg (2007)
4. Walker, M.A., Litman, D.J., Kamm, C.A., Abella, A.: PARADISE: A Framework for Evaluating Spoken Dialogue Agents. In: Proceedings of the Annual Meeting of the Association for Computational Linguistics, pp. 271–280 (1997)
A Classifier-Based Approach to Supporting the Augmentation of the Question-Answer Database for Spoken Dialogue Systems
Hiromi Narimatsu1, Mikio Nakano2, and Kotaro Funakoshi2
1 The University of Electro-Communications, Japan
2 Honda Research Institute Japan Co., Ltd., Japan
Abstract. Dealing with a variety of user questions in question-answer spoken dialogue systems requires preparing as many question-answer patterns as possible. This paper proposes a method for supporting the augmentation of the question-answer database. It uses user questions collected with an initial question-answer system and detects questions that need to be added to the database. It uses two language models: one built from the database and the other a large-vocabulary domain-independent model. Experimental results suggest the proposed method is effective in reducing the effort required to augment the database, compared to a baseline method that uses only the initial database.
1 Introduction
When humans engage in a dialogue, they use knowledge of the dialogue topic. Without such knowledge, they cannot understand what other humans say and cannot talk about the topic. Spoken dialogue systems also need to have knowledge of the topic of the dialogue. Technically speaking, such knowledge is a knowledge base consisting of speech understanding models, dialogue management models, and dialogue content. Constructing a knowledge base is, however, a time-consuming and expertise-demanding task. It is therefore crucial to find ways to facilitate constructing the knowledge base. This paper concerns a kind of spoken dialogue system that answers user questions by retrieving answers from a database consisting of a set of question-answer pairs. We call such systems Question-Answering Spoken Dialogue Systems (QASDSs) and the databases Question-Answer Databases (QADBs). In each example question, keyphrases are indicated by braces. These keyphrases are used for matching a speech recognition result for a user utterance with example questions, as is done in [4]. If the speech recognition result contains the same set of keyphrases as one of the example questions in a QA pair, the answer in the pair is selected. Fig. 1 illustrates this. A statistical language model for speech recognition is trained on the set of example questions in the database. Although QASDSs are simpler than other kinds of systems, such as ones that perform frame-based dialogue management, they have an advantage in that they are easy to design for people without expertise in spoken language processing, because the system behaviors are more predictable than those of more complicated systems.
Fig. 1. Answer selection based on the Question-Answer Database: the speech recognition result for an input utterance ("are there any parks in Sengawa") is matched by keyphrases ("parks", "Sengawa") against example questions such as "tell me about {parks} in {Sengawa}" and "are there any {squares} in {Sengawa}", and the paired answer ("There is Saneatsu park in Sengawa ...") is returned.
Much work on QASDSs has been done (e.g., [3], [9], and [7]), but these studies assume that a large number of real user utterances are available as training data. Unlike that work, we are concerned with how to bootstrap a new system with a small amount of training data, because obtaining a lot of data requires a considerable amount of time and effort, making system development difficult. One of the most crucial problems with a QASDS is that it cannot handle out-of-database (OODB) questions. Since there are no appropriate answers to OODB questions, the system cannot answer them. In addition, since the language model is built from the example questions in the database, OODB questions tend to be misrecognized, resulting in the selection of an answer that is not desired by the user. Since it is not possible to list all possible questions before system deployment, database augmentation is required based on the user questions obtained by deploying the system. However, augmenting the database requires a lot of effort, since it requires listening to all user questions to find OODB questions. This paper proposes a classifier-based approach to supporting QADB augmentation. It tries to find questions that are highly likely to be OODB questions and asks the developer to determine whether those questions are really OODB or not. This enables the developer to augment the QADB more efficiently than by randomly listening to user questions. From the system's point of view, it automatically selects questions whose transcription is more effective in augmenting the system's database. This can be regarded as a kind of active learning [6,2]. To better estimate the scores, the classifier uses various features obtained from the results of speech recognition using not only the language model built from the initial QADB but also a large-vocabulary domain-independent language model.
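The keyphrase-based answer selection of Fig. 1 can be sketched as follows; the data structures and QADB content are illustrative, not the authors' implementation:

```python
# Minimal sketch of keyphrase-based answer selection (illustrative data
# structures, not the authors' implementation).
QADB = [
    {"keyphrase_sets": [{"parks", "Sengawa"}, {"squares", "Sengawa"}],
     "answer": "There is Saneatsu park in Sengawa ..."},
]

def select_answer(asr_result):
    words = set(asr_result.split())
    for pair in QADB:
        for keyphrases in pair["keyphrase_sets"]:
            # select the pair whose example question's keyphrase set is
            # contained in the recognition result
            if keyphrases <= words:
                return pair["answer"]
    return None   # no match: treat as out-of-database (OODB)

print(select_answer("are there any parks in Sengawa"))
```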
2 Proposed Method
Our method uses a classifier that classifies user questions into OODB and in-database (IDB) questions, estimating a score that indicates how likely each question is to be OODB. The classifier uses various features obtained from the results of speech
recognition using both the language model built from the initial database and a large-vocabulary domain-independent language model. Features concerning the confidence of the recognized keyphrases should be effective in indicating how likely the question is to match the example question having the same keyphrases. The results of speech recognition with the large-vocabulary language model are used for estimating the correctness of the results of speech recognition with the database-derived language model; this is similar to utterance verification techniques [5]. They can also be used for investigating whether the question includes noun phrases other than keyphrases, since the existence of such noun phrases indicates that the question might be OODB.
3 Experiments
3.1 Data
We used data collected with a QASDS that provides town information. The initial QADB contains 232 question-answer pairs, with 890 example questions in total. The vocabulary size of the language model built from the database was 460 words. When the system answers user questions, corresponding slides are shown at the same time. 25 people (12 males and 13 females) each engaged in dialogues with the system twice, for about 14 minutes each time. In total, we collected 4076 questions in the experiment. Among them, 594 were non-question utterances, such as utterances consisting of just fillers and fragments resulting from end-point detection errors. These were excluded from the experimental data, as we plan to incorporate a separate method for detecting such utterances.
3.2 Classifier Training
We used the following two language models for speech recognition:
– LMdb: a trigram model trained on the 890 example questions in the QADB.
– LMlv: a domain-independent large-vocabulary trigram model trained on Web texts [1], with a vocabulary size of 60,250 words.
We used Julius (http://julius.sourceforge.jp/) as the speech recognizer, and Palmkit (a language model toolkit compatible with the CMU-Cambridge Toolkit, developed at Tohoku University; http://palmkit.sourceforge.net/) for training the language models. From the speech recognition results, we extracted 35 features. Due to a lack of space we do not list all of them. Sixteen features were obtained from the result of speech recognition with LMdb. They include the acoustic score, the language model score, the number of words in the top recognition result, the average, minimum, and maximum of the confidence scores of the keyphrases used for answer selection, the ratio of nouns in the top recognition result, and whether the top speech recognition result was classified as OODB or not. Nine similar
features were obtained from the result of speech recognition with LMlv, but some of the answer-selection-related features were not used. Ten features were obtained by comparing features from the LMdb-based speech recognition results with those from the LMlv-based speech recognition results. We used logistic regression in the Weka data mining toolkit [8] as the classifier (a logistic regression model with a ridge estimator, using Weka's default values). We used the first 25 questions of 5 users as a training data set, and 50 questions (the first 25 questions of each dialogue session) of the remaining 20 people as the test data set. The non-question utterances were then removed from these sets. The average number of utterances for training in each data set is about 100, and the average number of utterances for testing is about 851. We limited the amount of training data so that the effort required to label it could be reduced. We performed feature selection to avoid overfitting, using backward stepwise selection so that the average F-measure of OODB detection (with a threshold of 0.5) over 10-fold cross-validations on the five training sets was maximized. Ten features remained, achieving an F-measure of 0.74 (recall 0.77, precision 0.70). We examined which of the remaining features are crucial by investigating how much the F-measure decreases when each feature is removed. The top five crucial features are as follows:
1. max_i (the number of occurrences of keyphrase i used for answer selection in SRRdb,all) / (the number of words in SRRdb,all)
2. the number of words in SRRdb,1
3. (the number of nouns in SRRdb,1 / the number of words in SRRdb,1) − (the number of nouns in SRRlv,1 / the number of words in SRRlv,1)
4. (the number of nouns in SRRdb,1 / the number of words in SRRdb,1) − (the number of nouns in SRRdb,all / the number of words in SRRdb,all)
5. (the number of nouns in SRRlv,1 / the number of words in SRRlv,1) − (the number of nouns in SRRlv,all / the number of words in SRRlv,all)
Here SRRdb,1 is the top result of LMdb-based speech recognition and SRRdb,all is the set of all its results; SRRlv,1 and SRRlv,all are the corresponding results of LMlv-based speech recognition. We think these features are effective for the following reasons. If Feature 1 is small, the possibility that a keyphrase is a misrecognition is high, and the question is possibly OODB. If Feature 2 is large, the utterance is long, and it might be the case that keyphrases were misrecognized as short words. Feature 3, the difference in the ratios of nouns, represents the possibility that the recognition results of LMlv include words outside LMdb; if this value is close to zero, the recognition result of LMdb is likely to be correct. Features 4 and 5 represent the confidence of the recognition result; if these values are large, erroneously recognized nouns are likely to exist, and the question may be OODB.
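As an illustration, the following sketch computes Features 1-3 from hypothetical recognition results; noun tagging and keyphrase identification are assumed to be given by the recognizer and the QADB:

```python
# Sketch computing Features 1-3 from hypothetical recognition results.
# srr_db_1 / srr_lv_1 stand for the top hypotheses of the QADB-based and
# large-vocabulary recognizers; noun tags and keyphrases are assumed given.
def noun_ratio(words, nouns):
    return sum(w in nouns for w in words) / len(words)

srr_db_1 = "are there any parks in Sengawa".split()
srr_lv_1 = "are there any parking lots in Sengawa".split()
srr_db_all = srr_db_1 * 3                  # stand-in for all N-best words
nouns = {"parks", "Sengawa", "parking", "lots"}
keyphrases = ["parks", "Sengawa"]          # keyphrases used for answer selection

# Feature 1: max keyphrase frequency over all results / total word count
feature1 = max(srr_db_all.count(k) for k in keyphrases) / len(srr_db_all)
# Feature 2: number of words in the top QADB-based result
feature2 = len(srr_db_1)
# Feature 3: difference in noun ratios between the two recognizers' top results
feature3 = noun_ratio(srr_db_1, nouns) - noun_ratio(srr_lv_1, nouns)
print(feature1, feature2, feature3)
```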
3.3 Evaluation Results
We evaluated our method by estimating how much it can reduce the cost of listening to or transcribing user questions in order to augment the database. We compared
the number of OODB questions among n questions extracted by the proposed and baseline methods, for a given n. We compared the following methods:
– Proposed Method: extract the top n questions in the order of the scores assigned by the classifier described above.
– Baseline 1 (Random): extract n questions randomly.
– Baseline 2 (Initial-DB): extract n questions randomly from among the questions classified as OODB; a question is classified as OODB if the system using the initial DB cannot select an answer to it. If n is larger than the number of questions classified as OODB using the initial database, the rest are extracted randomly from the remaining questions. In this condition, 5,000 frequent words were added to the language model and treated as unknown-word-class words, which prevents out-of-vocabulary words from being misrecognized as keyphrases.
Figure 2 compares the above methods. The performance of the proposed method is close to that of the initial-QADB-based method when the number of extracted questions is small; this is because the number of questions that the initial-QADB-based method classifies as OODB is small, so its precision is high. The proposed method outperforms the initial-QADB-based method when the number of extracted questions is large.
Fig. 2. The number of out-of-database questions among the extracted questions (y-axis: the number of OODB questions in the extracted questions, 0–600; x-axis: the number of extracted questions, 0–900; curves: Proposed Method, Baseline 1 (Random), Baseline 2 (Initial-DB))
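The comparison in Fig. 2 can be sketched as follows; the classifier scores and OODB labels are synthetic stand-ins for the experimental data:

```python
# Sketch of the Fig. 2 comparison: count OODB questions among the top-n
# questions ranked by classifier score, versus a random baseline.
# Scores and labels are synthetic; a real run would use the test set.
import random

def make_question():
    is_oodb = random.random() < 0.4                 # ~40% OODB, as a stand-in
    score = 0.5 * is_oodb + 0.5 * random.random()   # toy classifier: OODB scores higher
    return score, is_oodb

questions = [make_question() for _ in range(851)]   # roughly one test set

def oodb_in_top_n(ranked, n):
    return sum(is_oodb for _, is_oodb in ranked[:n])

n = 200
by_score = sorted(questions, key=lambda q: q[0], reverse=True)
shuffled = random.sample(questions, len(questions))
print("proposed:", oodb_in_top_n(by_score, n))
print("random:  ", oodb_in_top_n(shuffled, n))
```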
4 Summary and Ongoing Work
This paper presented a novel framework for supporting the augmentation of a QADB. It estimates the probability of user questions being OODB, using a language model built from the initial QADB and a large-vocabulary statistical language model. Although the improvement achieved by the proposed method is limited, the experimental results suggest the potential of the framework.
Future work includes finding effective features other than those used in the experiment. In addition, we plan to investigate ways to build language models that are more effective in detecting OODB questions. In this experiment, we assumed that the database and the classifier used to extract OODB question candidates are fixed. In real settings, however, it is possible to incrementally update the database and the classifier; future work includes conducting an experiment in such a setting. Since our method requires some amount of real user questions to train the classifier, we will also try to find a way to train the classifier using user questions from other domains.
References
1. Kawahara, T., Lee, A., Takeda, K., Itou, K., Shikano, K.: Recent progress of open-source LVCSR engine Julius and Japanese model repository. In: Proc. Interspeech 2004 (ICSLP), pp. 3069–3072 (2004)
2. Nakano, M., Hazen, T.J.: Using untranscribed user utterances for improving language models based on confidence scoring. In: Proc. Eurospeech 2003, pp. 417–420 (2003)
3. Nisimura, R., Lee, A., Yamada, M., Shikano, K.: Operating a public spoken guidance system in real environment. In: Proc. Interspeech 2005, pp. 845–848 (2005)
4. Nisimura, R., Uchida, T., Lee, A., Saruwatari, H., Shikano, K., Matsumoto, Y.: ASKA: Receptionist robot with speech dialogue system. In: Proc. IROS 2002, pp. 1314–1317 (2002)
5. Rahim, M.G., Lee, C.H., Juang, B.H.: Discriminative utterance verification for connected digits recognition. IEEE Transactions on Speech and Audio Processing 5(3), 266–277 (1997)
6. Riccardi, G., Hakkani-Tür, D.: Active and unsupervised learning for automatic speech recognition. In: Proc. Eurospeech 2003, pp. 1825–1828 (2003)
7. Takeuchi, S., Cincarek, T., Kawanami, H., Saruwatari, H., Shikano, K.: Question and answer database optimization using speech recognition results. In: Proc. Interspeech 2008 (ICSLP), pp. 451–454 (2008)
8. Witten, I.H., Frank, E.: Data Mining: Practical machine learning tools and techniques, 2nd edn. Morgan Kaufmann, San Francisco (2005)
9. Yoshimi, Y., Kakitsuba, R., Nankaku, Y., Lee, A., Tokuda, K.: Probabilistic answer selection based on conditional random fields for spoken dialog system. In: Proc. Interspeech 2008 (ICSLP), pp. 215–218 (2008)
The Influence of the Usage Mode on Subjectively Perceived Quality Ina Wechsung, Anja Naumann, and Sebastian Möller Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin, Ernst-Reuter-Platz 7, 10587 Berlin, Germany {Ina.Wechsung,Anja.Naumann,Sebastian.Moeller}@telekom.de
Abstract. The current paper presents an evaluation study of a multimodal mobile entertainment system. The aim of the study was to investigate the effect of the usage mode (explorative vs. task-oriented) on perceived quality. In one condition the participants were asked to perform specific tasks (task-oriented mode) and in another to do "whatever they want to do with the device" (explorative mode). It was shown that the explorative test setting resulted in better ratings than the task-oriented one. Keywords: Multimodal Interaction, Evaluation, Usability, User Experience.
1 Introduction
Nowadays usability testing is more or less obligatory when presenting new interface techniques or interaction paradigms. The dependent variables of such studies are typically the factors described in the widespread ISO 9241 standard: effectiveness, efficiency and satisfaction. In a meta-analysis by [1] reviewing current practice in usability evaluation, all studies measured at least one of these factors. According to [1], effectiveness and efficiency are most frequently measured via error rate and task completion time respectively, data often labeled as "objective." To assess such data, predefined tasks are obviously necessary at least to some extent. Although these "objective" measures may be in line with the concept of usability, they are not sufficient for assessing another key concept in HCI, namely User eXperience (UX). With the attention of the HCI community shifting from usability to UX, the view on how to humanize technology has widened [2]. As described in [2], the usability perspective implies that technology usage is primarily motivated by accomplishing tasks as efficiently and effectively as possible, to gain time for the real pleasurable activities not related to technology. Hassenzahl [2] questions this position and argues that humans use technology for its own sake, as technology usage can be a source of a positive, enjoyable experience. Thus technology usage is linked to different goals: do-goals (e.g., buying tickets at a web shop) and be-goals (e.g., being competent). Do-goals are linked to pragmatic qualities and thus a system's usability; be-goals are associated with the non-instrumental aspects referred to as hedonic qualities.
Since experience can only be subjective, the term UX sets the focus on "the subjective side of product use" [2]. Thus the so-called "objective" parameters used for measuring usability might not be meaningful for UX evaluation.
2 Related Work
Several evaluation biases associated with task-oriented usability testing have been documented. Evidence for the high importance of the tasks is provided by [3]: the same website was tested by nine usability expert teams, and the results of the evaluation were hardly affected by the number of participants, but strongly by the number of tasks. With more tasks, more usability problems were discovered. It was concluded that giving a large variety of different task sets to a small number of users is preferable to presenting a small set of tasks to many users. Cordes [4] pointed out that typically only tasks that are supported by the product are selected; domain-relevant tasks the system is not capable of are not presented. This is a rather unnatural situation, since discovering a product's capabilities is an essential aspect of being confronted with a new device [4]. Cordes [4] showed that if users are told that the given tasks may not be solvable, they tend to terminate tasks earlier and terminate more tasks than users receiving exactly the same instruction without the hint that the tasks may not be solvable. Thus, it is likely that success rates in traditional usability tests are higher than in natural settings. Another crucial issue is the wrong perspective when selecting tasks with respect to the product's capabilities: the product is evaluated not according to users' needs but according to its functionalities. A direct comparison between task-oriented and explorative instructions is provided by [5] and [6]. In the first study [5], the participants either received the task of finding specific information on a website or were instructed to just have fun with the website. Retrospective judgments of overall appeal and of pragmatic and hedonic qualities were assessed. It was shown that with a task-oriented instruction, the website's usability, i.e., its ability to support the given tasks, had a stronger influence on the judgments of experienced overall appeal than for the explorative group; for the explorative group, a correlation between usability and appeal was not observed. In the second study [6], the participants interacted with a story-telling platform. Again they were either given the task of finding specific information or asked to interact freely with the system. Besides the retrospective measures of the first study, mental effort and affect were measured throughout the usage of the system, and experienced spontaneity was assessed after the interaction. It was shown that with a task-oriented instruction, spontaneity was related to perceived effort, negative affect and reduced appeal; in the explorative group, spontaneity was linked to positive affect and led to higher appeal. Based on these results the authors concluded that different instructions trigger different usage modes and evaluation criteria: depending on the usage mode (task or non-task), the importance of usability for a system's appeal differs, whereas hedonic qualities are equally important in both modes. However, the authors also point out that generalization of their results is difficult and that more research using different systems is necessary [6].
So far only unimodal, non-mobile systems have been investigated. When considering multimodal applications, modality preference and availability should influence user experience. If, in a task-oriented setting, all tasks have to be performed, then for tasks where the preferred modality is not offered the user has to switch to a less-liked modality, which may result in a negative experience. If no tasks are given, the user is likely to stick with the most preferred modality. Thus we formed the following hypotheses: mental workload should be lower with an explorative instruction, since touch (the more familiar modality) is expected to be used more often than, e.g., speech. The experienced identification with the system should be higher, since the usage of the system is determined not by the experimenter but by the user's decision. According to [5, 6], overall appeal should be determined by pragmatic qualities for the task-oriented group.
3 Method
3.1 Participants
30 German-speaking individuals (15 m, 15 f, Ø 28 yrs.) took part in the study. All of them were paid for their participation. The majority (70%) was familiar with touch input; voice control was considerably less known (30%).
3.2 Material
The tested application is called mobile multimodal information cockpit and offers the functionality of a remote control, a mobile TV and video player, video-on-demand services, and games. The application was implemented on an ultra-mobile personal computer, the Samsung Q1 Ultra (cf. Fig. 1). The tested system is controllable via a graphical user interface with touch screen and speech input. The output is given via the graphical user interface and audio feedback. For some tasks only one of the modalities was available. To assess ratings of hedonic and pragmatic qualities, the AttrakDiff questionnaire [7] was employed. The AttrakDiff consists of four scales measuring hedonic as well as pragmatic attributes. The scale Hedonic Quality-Stimulation (HQ-S) measures the extent to which a product can provide stimulation. The scale Hedonic Quality-Identification (HQ-I) measures a product's ability to express the owner's self. The scale Pragmatic Quality (PQ) covers a product's functionality and the access to that functionality, and thus more or less matches the traditional concept of usability. Additionally, the perceived global quality is measured with the scale Attractiveness (ATT). The entire questionnaire comprises 28 items on a 7-point semantic differential. Furthermore, the SEA scale [9], which is the German version of the Subjective Mental Effort Questionnaire (SMEQ, also known as the Rating Scale Mental Effort) [8], was employed as a measure of perceived mental effort. The SEA scale is a one-dimensional measure with a range between 0 and 220; along this range seven verbal anchors (hardly effortful - extremely effortful) are given.
Fig. 1. Tested application
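The scale scoring used by such semantic-differential questionnaires can be sketched as follows; the item-to-scale assignment here is hypothetical, not the actual AttrakDiff scoring key:

```python
# Sketch of scoring a semantic-differential questionnaire such as the
# AttrakDiff: 7-point item responses (1..7) are centered to -3..+3 and
# averaged per scale. The item-to-scale assignment below is hypothetical.
SCALES = {
    "PQ":   [0, 1, 2],    # pragmatic quality (hypothetical item indices)
    "HQ-I": [3, 4],       # hedonic quality - identification
    "HQ-S": [5, 6],       # hedonic quality - stimulation
    "ATT":  [7, 8],       # attractiveness
}

def scale_means(responses):
    centered = [r - 4 for r in responses]           # map 1..7 onto -3..+3
    return {name: sum(centered[i] for i in items) / len(items)
            for name, items in SCALES.items()}

print(scale_means([6, 5, 7, 4, 3, 6, 6, 5, 4]))
```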
3.3 Procedure
The experiment consisted of two blocks: one task-oriented and one explorative. Half of the participants started with the task-oriented block followed by the explorative block; for the other half the order was reversed (within-subject design). They were either instructed to perform 16 given tasks (e.g., logging in to the system, switching the channel, searching for a certain movie, a certain TV show, or a certain actor, increasing and decreasing the volume, playing the quiz, switching between the different categories) or to use the next 15 minutes to do whatever they wanted to do with the device. The duration was set to 15 minutes since pretests showed that this was the average time needed to accomplish all tasks. In both test blocks the participants were free to choose the input modality; it was possible at any time to switch or combine modalities. In order to rate the previously tested condition, the SEA scale [9] and the AttrakDiff [7] had to be filled in after each test block. To analyze which modality was used by the participants, the modality used to perform every interaction step was logged; this way, the percentages of modality usage were computed.
4 Results
4.1 Subjective Data
SEA scale: No differences could be observed for perceived mental effort. AttrakDiff: The AttrakDiff (cf. Figure 2) showed differences on the scale Attractiveness (Wilcoxon Z = 2.20, p = .013) and on the scale Hedonic Quality-Identification (Wilcoxon Z = 1.89, p = .029). In contrast to the results reported in [5, 6], high correlations between pragmatic qualities and overall attractiveness were observed in both blocks (Pearson's Rexp = .796, p < .01; Pearson's Rtask = .817, p < .01). However, as expected, the correlation was higher in the task-oriented block.
Fig. 2. Ratings on AttrakDiff scales and subscales by usage mode
4.2 Performance Data
Modality preferences: As expected, with the explorative instruction touch was used more frequently than with the task-oriented instruction (Mexp = 89.41, SDexp = 13.22, Mtask = 83.48, SDtask = 16.33, Wilcoxon Z = 1.802, p = .029). However, in both conditions touch was by far the dominant modality and speech was rarely used. An explorative analysis of the data from the task-oriented block showed that speech was used for search tasks only. The familiarity of the input modality had an influence in the task-oriented block: participants with prior experience used speech marginally more often (Mwith = 17.71, SDwith = 9.24, Mwithout = 15.22, SDwithout = 18.71, Mann-Whitney U = 63.5, p = .081). For the explorative block no such effect was shown.
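A within-subject comparison of this kind can be sketched with SciPy as follows; the paired touch-usage percentages per participant are hypothetical:

```python
# Sketch of a within-subject comparison with the Wilcoxon signed-rank
# test; the paired touch-usage percentages below are hypothetical.
from scipy.stats import wilcoxon

touch_explorative   = [92.1, 88.4, 95.0, 81.3, 90.6, 87.2, 93.8, 85.9]
touch_task_oriented = [85.3, 80.1, 90.2, 78.8, 84.0, 82.5, 88.1, 79.4]

stat, p = wilcoxon(touch_explorative, touch_task_oriented)
print(f"Wilcoxon statistic = {stat}, p = {p:.3f}")
```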
5 Discussion
The results show that task-oriented instructions reduce the experienced identification with the system as well as the perceived overall attractiveness. However, in contradiction to [5, 6], and therefore contrary to our hypothesis, pragmatic qualities were strongly related to overall attractiveness in both usage modes; this relation was only slightly stronger for the task-oriented group. Moreover, mental workload was not higher with the task-oriented instruction but the same in both modes. An explanation might be the kind of tasks: in the studies by [5, 6], knowledge-acquisition tasks were given, whereas in our study only specific actions (e.g., switching the channel) had to be performed. It is plausible that searching for information and keeping it active in working memory is mentally more demanding than just performing given actions that do not require memorizing much information. In addition, touch was the modality most often chosen by the participants, both in the task-oriented and in the explorative test block. Thus, the chosen modality might have a stronger influence on both mental workload and the attractiveness of an application than the usage mode. This is in line with previous research on modality usage
[e.g., 10]. It also shows the importance of the task itself, since users' choice of modality has been found to be task-dependent [e.g., 11]. Thus, in a future study, different task types with a variety of tasks, including more complex tasks that also provoke the usage of different modalities, will be compared to an explorative setting.
References
1. Hornbæk, K., Law, E.L.: Meta-analysis of correlations among usability measures. In: Proceedings of CHI 2007, pp. 617–626. ACM Press, New York (2007)
2. Hassenzahl, M.: User Experience (UX): Towards an experiential perspective on product quality. In: Proceedings of the 20th French-Speaking Conference on Human-Computer Interaction, pp. 11–15 (2008)
3. Lindgaard, G., Chattratichart, J.: Usability Testing: What Have We Overlooked? In: Proc. CHI 2007, pp. 1415–1424. ACM Press, New York (2007)
4. Cordes, E.R.: Task-selection bias: a case for user-defined tasks. International Journal of Human-Computer Interaction 13, 411–420 (2002)
5. Hassenzahl, M., Kekez, R., Burmester, M.: The importance of a software's pragmatic quality depends on usage modes. In: Luczak, H., Cakir, A.E., Cakir, G. (eds.) Proceedings of the 6th International Conference on Work With Display Units (WWDU), pp. 275–276. ERGONOMIC, Berlin (2002)
6. Hassenzahl, M., Ullrich, D.: To do or not to do: Differences in user experience and retrospective judgments depending on the presence or absence of instrumental goals. Interacting with Computers 19, 429–437 (2007)
7. Hassenzahl, M., Burmester, M., Koller, F.: AttrakDiff: Ein Fragebogen zur Messung wahrgenommener hedonischer und pragmatischer Qualität [A questionnaire for measuring perceived hedonic and pragmatic quality]. In: Ziegler, J., Szwillus, G. (eds.) Mensch & Computer 2003: Interaktion in Bewegung, pp. 187–196. B.G. Teubner, Stuttgart (2003)
8. Zijlstra, F.R.H.: Efficiency in work behavior: A design approach for modern tools. PhD thesis, Delft University of Technology. Delft University Press, Delft (1993)
9. Eilers, K., Nachreiner, F., Hänecke, K.: Entwicklung und Überprüfung einer Skala zur Erfassung subjektiv erlebter Anstrengung [Development and evaluation of a scale to assess subjectively perceived effort]. Zeitschrift für Arbeitswissenschaft 40, 215–224 (1986)
10. Naumann, A.B., Wechsung, I., Möller, S.: Factors Influencing Modality Choice in Multimodal Applications. In: André, E., Dybkjær, L., Minker, W., Neumann, H., Pieraccini, R., Weber, M. (eds.) PIT 2008. LNCS (LNAI), vol. 5078, pp. 37–43. Springer, Heidelberg (2008)
11. Wechsung, I., Hurtienne, J., Naumann, A.: Multimodale Interaktion: Intuitiv, robust, bevorzugt und altersgerecht? [Multimodal interaction: intuitive, robust, preferred and age-appropriate?] In: Wandke, H., Struwe, D., Kain, S. (eds.) Mensch und Computer 2009: 9. fachübergreifende Konferenz für interaktive und kooperative Medien – Grenzenlos frei, pp. 213–222. Oldenbourg, Berlin (2009)
Sightseeing Guidance Systems Based on WFST-Based Dialogue Manager Teruhisa Misu, Chiori Hori, Kiyonori Ohtake, Etsuo Mizukami, Akihiro Kobayashi, Kentaro Kayama, Tetsuya Fujii, Hideki Kashioka, Hisashi Kawai, and Satoshi Nakamura MASTAR Project, NICT, Kyoto, Japan http://mastar.jp/index-e.html
Abstract. We are developing spoken dialogue systems that help users through spontaneous interactions in the sightseeing guidance domain. The systems are constructed on our framework of weighted finite-state transducer (WFST) based dialogue management. The demos are our prototype spoken dialogue systems for Kyoto tourist information assistance.
1 WFST-Based Framework for Spoken Dialogue Systems
We have proposed an expandable and portable dialogue scenario description and platform for managing dialogue systems using weighted finite-state transducers (WFSTs) [1]. The role of dialogue management is to transduce the user's utterance (ASR result) into a system action. The framework performs this process using WFSTs for spoken language understanding (SLU) and state transition (the dialogue scenario). Since many types of methodologies used for SLU and management (handcrafted rules, statistical n-grams, etc.) can be transformed into WFSTs, different components can easily be integrated and work on a common platform. These WFSTs can be combined and optimized through standard WFST operations, which contributes to an efficient beam search in the decoding process [2].
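The transduction idea can be illustrated with a toy sketch (plain dictionaries standing in for real WFSTs; all symbols and weights are invented for illustration):

```python
# Toy sketch of dialogue management as WFST transduction: an SLU
# transducer maps recognized words to concepts, and a scenario transducer
# maps (state, concept) to (next state, action); path weights are added
# and the cheapest path wins (tropical semiring). Dictionaries stand in
# for real transducers; symbols and weights are invented.
SLU = {                      # word -> (concept, weight)
    "fee":    ("ASK_FEE",    0.2),
    "access": ("ASK_ACCESS", 0.3),
}
SCENARIO = {                 # (state, concept) -> (next state, action, weight)
    ("INIT", "ASK_FEE"):    ("ANSWERED", "tell_fee",    0.1),
    ("INIT", "ASK_ACCESS"): ("ANSWERED", "tell_access", 0.2),
}

def transduce(state, asr_words):
    """Return (weight, next_state, action) for the best-scoring path."""
    best = None
    for word in asr_words:
        if word not in SLU:
            continue                      # word carries no concept
        concept, w_slu = SLU[word]
        if (state, concept) not in SCENARIO:
            continue                      # no scenario transition
        next_state, action, w_sc = SCENARIO[(state, concept)]
        candidate = (w_slu + w_sc, next_state, action)
        if best is None or candidate < best:
            best = candidate              # keep the lowest total weight
    return best

print(transduce("INIT", ["what", "is", "the", "fee"]))
```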
2 Tourist Guide Dialogue System
The demo systems are prototype applications for the sightseeing guidance domain of Kyoto City. We implemented two systems: one on a mobile phone (Fig. 1) and the other on a PC with a large display and several active cameras (Fig. 2). The systems use an SLU WFST and a scenario WFST that are trained on the statistics of our human-to-human dialogue corpus [3]. They are complemented by handcrafted rules based on the corpus. Both systems can explain more than 700 sightseeing spots in Kyoto. They can also explain each spot in terms of 15 viewpoints used to plan sightseeing activities ("cherry blossoms", "Japanese garden", etc.) as well as basic information ("access", "fee", etc.). Based on this information, users can narrow down and evaluate the spots.
Fig. 1. Mobile phone system
Fig. 2. Large display system
The system with the large display and cameras can detect non-verbal information, such as changes in the user's gaze and face direction and head gestures during the dialogue, and can proactively recommend information that the user would be interested in [4].
3 System Performance
We evaluated the performance of the mobile phone system with 100 subjects. Subjects were asked to assume they were visiting Kyoto and to find a sightseeing spot they were interested in. In total, 390 dialogue sessions with 3,670 utterances were collected, with an average sentence error rate of 70.7%. The system was able to answer 64.7% of the users' requests correctly. The results were not satisfactory, but the utterances that the system was not able to handle can easily be covered by adding appropriate paths to the SLU and scenario WFSTs. This is considered an advantage of our framework.
References
1. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Dialog Management using Weighted Finite-State Transducers. In: Proc. Interspeech, pp. 211–214 (2008)
2. Hori, C., Ohtake, K., Misu, T., Kashioka, H., Nakamura, S.: Statistical Dialog Management Applied to WFST-Based Dialog Systems. In: Proc. ICASSP, pp. 4793–4796 (2009)
3. Ohtake, K., Misu, T., Hori, C., Kashioka, H., Nakamura, S.: Annotating Dialogue Acts to Construct Dialogue Systems for Consulting. In: Proc. 7th Workshop on Asian Language Resources, pp. 32–39 (2009)
4. Kayama, K., Kobayashi, A., Mizukami, E., Misu, T., Kashioka, H., Kawai, H., Nakamura, S.: Spoken Dialog System on Plasma Display Panel Estimating User's Interest by Image Processing. In: Proc. 1st International Workshop on Human-Centric Interfaces for Ambient Intelligence (HCIAmI) (2010)
Spoken Dialogue System Based on Information Extraction from Web Text Koichiro Yoshino and Tatsuya Kawahara School of Informatics, Kyoto University Sakyo-ku Kyoto 606-8501, Japan
We present a novel spoken dialogue system which uses up-to-date information from the web. It is based on information extraction, which is defined by the predicate-argument (P-A) structure and realized by shallow parsing. Based on this information structure, the dialogue system can perform question answering as well as proactive information presentation using the dialogue context and a topic model. To be useful and interactive, the system should not only reply to the user's requests but also make proactive information presentations. Our proposed scheme realizes this function with an information extraction technique that generates only useful information. The useful information structure is dependent on the domain. Conventionally, the templates for information extraction were hand-crafted, but this heuristic process is so costly that it cannot be applied to a variety of domains on the web. Therefore, we introduce a filtering method for the predicate-argument (P-A) structures generated by the parser, which can automatically define the domain-dependent useful information structure. This scheme is applied to the domain of baseball news, and we design a dialogue system that can reply to the user's questions as well as make proactive information presentations according to the dialogue history and a topic model. The system can be viewed as a smart interactive news reader. The architecture of the dialogue system is depicted in Figure 1. First, information extraction is conducted by parsing web texts in advance. A user's query is also parsed to extract the same information structure, and the system matches the extracted information against the web information. If the system finds some information which completely matches the user's query, the system makes a response using the corresponding web text. When the system cannot find exact information, it searches for information which matches partially. For example, when the user asks "Did Ichiro hit?" and the system cannot find the exact information "[Ichiro (agent) hit]", it may find "[Lopez (agent) hit]", which is partially matched and most relevant. This information is used to generate a similar response that the user would expect. In the conventional RDB-based dialogue scheme, the system can hardly make relevant responses if it finds no matched entries, and thus usually replies "There are no matched entries." In the conventional question answering scheme, the same situation often happens: occasionally, a close-matched answer may be found by statistical matching, but the found answer may not be relevant to the user's query. In the proposed scheme, we guarantee that the answer is at least partially matched to the user's query in terms of the information structure.
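The exact-then-partial matching described above can be sketched as follows; the extracted P-A structures and texts are hypothetical:

```python
# Sketch of exact-then-partial P-A matching; the extracted structures
# and texts below are hypothetical.
WEB_PA = [
    {"predicate": "hit", "args": {"agent": "Lopez"},
     "text": "Lopez hit a home run in the 7th inning."},
    {"predicate": "win", "args": {"agent": "Tigers"},
     "text": "The Tigers won the game 3-2 against the Giants."},
]

def respond(query):
    # 1) exact match: same predicate and same arguments
    for pa in WEB_PA:
        if pa["predicate"] == query["predicate"] and pa["args"] == query["args"]:
            return pa["text"]
    # 2) partial match: same predicate, different arguments
    for pa in WEB_PA:
        if pa["predicate"] == query["predicate"]:
            return pa["text"]              # e.g., answer about Lopez instead
    return "There are no matched entries."

# "Did Ichiro hit?" has no exact match, so the Lopez result is returned.
print(respond({"predicate": "hit", "args": {"agent": "Ichiro"}}))
```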
Fig. 1. System architecture: a user query is parsed into a P-A structure for information extraction; the dialogue manager performs question answering or proactive presentation using a topic model and P-A structures extracted in advance from web text (news sites).
We show an example dialogue between a user and the system in Figure 2. The system responds in the question-answering mode when the user's question has a predicate and an argument. If the system cannot find exact information, it makes a proactive presentation by choosing information that has the same predicate but different arguments.
User: Please tell me about today's Tigers.
Sys: The Tigers won the game 3-2 against the Giants.
User: Who was the winning pitcher?
Sys: Nohmi was the winning pitcher.
User: Did Kubota come in to pitch?
Sys: Kubota didn't, but Fujikawa came in to pitch the 9th inning.
(long pause)
Sys: Do you know Fujikawa's ERA?
User: I don't know.
Sys: 0.00
User: Wow.

Fig. 2. Dialogue example
Author Index

Abushariah, Mohammad A.M. 1
Ainon, Raja N. 1
Al-Qatab, Bassam A. 1
Alqudah, Assal A.M. 1
Araki, Masahiro 144
Becker, Tilman 132
Chandramohan, Senthilkumar 13
Choi, Junhwi 85
Dohsaka, Kohji 48
Engelbrecht, Klaus-Peter 150
Fujii, Tetsuya 194
Funakoshi, Kotaro 182
Funakura, Yu 144
Gerl, Franz 25, 36
Griol, David 96
Herbig, Tobias 25, 36
Herzog, Gerd 132
Higashinaka, Ryuichiro 48
Hofmann, Hansjörg 156
Hori, Chiori 61, 169, 194
Ianotto, Michel 110
Isotani, Ryosuke 156
Jokinen, Kristiina 163
Kashioka, Hideki 73, 169, 176, 194
Kawahara, Tatsuya 196
Kawai, Hisashi 61, 73, 156, 169, 176, 194
Kayama, Kentaro 73, 194
Kim, Kyungduk 85
Kimura, Naoto 61
Kobayashi, Akihiro 73, 194
Lee, Cheongjae 85
Lee, Donghyeon 85
Lee, Gary Geunbae 85
López-Cózar, Ramón 96
Meguro, Toyomi 48
Minami, Yasuhiro 48
Minker, Wolfgang 25, 36, 122, 156
Misu, Teruhisa 61, 73, 169, 194
Mizukami, Etsuo 73, 176, 194
Möller, Sebastian 150, 188
Nakamura, Satoshi 61, 73, 156, 169, 176, 194
Nakano, Mikio 182
Narimatsu, Hiromi 182
Naumann, Anja 188
Ohtake, Kiyonori
Pietquin, Olivier 13, 110
Polzehl, Tim 122
Quesada, José F. 96
Reithinger, Norbert 132
Rossignol, Stéphane 110
Sakti, Sakriani 156
Schmitt, Alexander 122
Sonntag, Daniel 132
Wechsung, Ina 188
Yoshino, Koichiro 196
Zainuddin, Roziati 1