Digital Speech Processing, Synthesis, and Recognition
Signal Processing and Communications Series Editor
K. J. Ray Liu, University of Maryland, College Park, Maryland

Editorial Board
Sadaoki Furui, Tokyo Institute of Technology
Yih-Fang Huang, University of Notre Dame
Aggelos K. Katsaggelos, Northwestern University
Mos Kaveh, University of Minnesota
P. K. Raja Rajasekaran, Texas Instruments
John A. Sorenson, Technical University of Denmark
1. Digital Signal Processing for Multimedia Systems, edited by Keshab K. Parhi and Takao Nishitani
2. Multimedia Systems, Standards, and Networks, edited by Atul Puri and Tsuhan Chen
3. Embedded Multiprocessors: Scheduling and Synchronization, Sundararajan Sriram and Shuvra S. Bhattacharyya
4. Signal Processing for Intelligent Sensor Systems, David C. Swanson
5. Compressed Video over Networks, edited by Ming-Ting Sun and Amy R. Reibman
6. Modulated Coding for Intersymbol Interference Channels, Xiang-Gen Xia
7. Digital Speech Processing, Synthesis, and Recognition: Second Edition, Revised and Expanded, Sadaoki Furui
Additional Volumes in Preparation

Modern Digital Halftoning, Daniel L. Lau and Gonzalo R. Arce
Blind Equalization and Identification, Zhi Ding and Ye (Geoffrey) Li
Video Coding for Wireless Communications, King N. Ngan, Chi W. Yap, and Keng T. Tan
Digital Speech Processing, Synthesis, and Recognition
Second Edition, Revised and Expanded
Sadaoki Furui Tokyo Institute of Technology Tokyo, Japan
MARCEL DEKKER, INC.
NEW YORK • BASEL
Library of Congress Cataloging-in-Publication Data

Furui, Sadaoki.
Digital speech processing, synthesis, and recognition / Sadaoki Furui. - 2nd ed., rev. and expanded.
p. cm. - (Signal processing and communications ; 7)
ISBN 0-8247-0452-5 (alk. paper)
1. Speech processing systems. I. Title. II. Series.
TK7882.S65 F87 2000
006.4'54 - dc21
00-060197
This book is printed on acid-free paper.

Headquarters
Marcel Dekker, Inc.
270 Madison Avenue, New York, NY 10016
tel: 212-696-9000; fax: 212-685-4540

Eastern Hemisphere Distribution
Marcel Dekker AG
Hutgasse 4, Postfach 812, CH-4001 Basel, Switzerland
tel: 41-61-261-8482; fax: 41-61-261-8896

World Wide Web
http://www.dekker.com

The publisher offers discounts on this book when ordered in bulk quantities. For more information, write to Special Sales/Professional Marketing at the headquarters address above.

Copyright © 2001 by Marcel Dekker, Inc. All Rights Reserved.

Neither this book nor any part may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, microfilming, and recording, or by any information storage and retrieval system, without permission in writing from the publisher.

Current printing (last digit):
10 9 8 7 6 5 4 3 2 1
PRINTED IN THE UNITED STATES OF AMERICA
Series Introduction
Over the past 50 years, digital signal processing has evolved as a major engineering discipline. The fields of signal processing have grown from the origin of fast Fourier transform and digital filter design to statistical spectral analysis and array processing, and image, audio, and multimedia processing, and have shaped developments in high-performance VLSI signal processor design. Indeed, there are few fields that enjoy so many applications: signal processing is everywhere in our lives. When one uses a cellular phone, the voice is compressed, coded, and modulated using signal processing techniques. As a cruise missile winds along hillsides searching for the target, the signal processor is busy processing the images taken along the way. When we are watching a movie in HDTV, millions of audio and video data are being sent to our homes and received with unbelievable fidelity. When scientists compare DNA samples, fast pattern recognition techniques are being used. On and on, one can see the impact of signal processing in almost every engineering and scientific discipline. Because of the immense importance of signal processing and the fast-growing demands of business and industry, this series on signal processing serves to report up-to-date developments and advances in the field. The topics of interest include but are not limited to the following:
- Signal theory and analysis
- Statistical signal processing
- Speech and audio processing
- Image and video processing
- Multimedia signal processing and technology
- Signal processing for communications
- Signal processing architectures and VLSI design
I hope this series will provide the interested audience with high-quality, state-of-the-art signal processing literature through research monographs, edited books, and rigorously written textbooks by experts in their fields.

K. J. Ray Liu
Preface to the Second Edition
More than a decade has passed since the first edition of Digital Speech Processing, Synthesis, and Recognition was published. The book has been widely used throughout the world as both a textbook and a reference work. The clear need for such a book stems from the fact that speech is the most natural form of communication among humans and that it also plays an ever more salient role in human-machine communication. Realizing any such system of communication necessitates a clear and thorough understanding of the core technologies of speech processing. The field of speech processing, synthesis, and recognition has witnessed significant progress in this past decade, spurred by advances in signal processing, algorithms, architectures, and hardware. These advances include: (1) international standardization of various hybrid speech coding techniques, especially CELP, and their widespread use in many applications, such as cellular phones; (2) waveform unit concatenation-based speech synthesis; (3) large-vocabulary continuous-speech recognition based on a statistical pattern recognition paradigm, e.g., hidden Markov models (HMMs) and stochastic language models; (4) increased robustness of speech recognition systems against speech variation, such as speaker-to-speaker variability, noise, and channel distortion; and (5) speaker recognition methods using the HMM technology.
This second edition includes these significant advances and details important emerging technologies. The newly added sections include Robust and Flexible Speech Coding, Corpus-Based Speech Synthesis, Theory and Implementation of HMM, Large-Vocabulary Continuous-Speech Recognition, Speaker-Independent and Adaptive Recognition, and Robust Algorithms Against Noise and Channel Variations. In an effort to retain brevity, older technologies now rarely used in recent systems have been omitted. The basic technology parts of the book have also been rewritten for easier understanding. It is my hope that users of the first edition, as well as new readers seeking to explore both the fundamental and modern technologies in this increasingly vital field, will benefit from this second edition for many years to come.
Acknowledgments
I am grateful for permission from many organizations and authors to use their copyrighted material in original or adapted form:

- Figure 2.5 contains material which is copyright © Lawrence Erlbaum Associates, 1986. Used with permission. All rights reserved.
- Figure 2.6 contains material which is copyright © Dr. H. Sato, 1975. Reprinted with permission of copyright owner. All rights reserved.
- Figures 2.7, 3.8, 4.9, 7.1, 7.4, 7.6, and 7.7 contain material which respectively is copyright © 1952, 1980, 1967, 1972, 1980, 1987, and 1987 American Institute of Physics. Reproduced with permission. All rights reserved.
- Figures 2.8, 2.9, and 2.10 contain material which is copyright © Dr. H. Irii, 1987. Used with permission. All rights reserved.
- Figure 2.11 contains material which is copyright © Dr. S. Saito, 1958. Reprinted with permission of copyright owner. All rights reserved.
- Figure 3.5 contains material which is copyright © Dr. G. Fant, 1959. Reproduced with permission. All rights reserved.
- Figures 3.6, 3.7, 6.6, 6.33, 6.35, and 7.8 contain material which respectively is copyright © 1972, 1972, 1975, 1986, 1986, and 1986 AT&T. Used with permission. All rights reserved.
- Figures 4.4, 5.4, and 5.5 contain material which is copyright © Dr. Y. Tohkura, 1980. Reprinted with permission. All rights reserved.
- Figures 4.12, 6.1, 6.12, 6.13, 6.18, 6.19, 6.20, 6.24, 6.25, 6.26, 6.27, 6.32, 6.34, 7.9, 8.1, 8.5, 8.14, B.1, C.1, C.2, and C.3 contain material which respectively is copyright © 1966, 1983, 1986, 1986, 1981, 1982, 1981, 1983, 1983, 1983, 1980, 1982, 1982, 1988, 1996, 1978, 1981, 1984, 1987, 1987, and 1987 IEEE. Reproduced with permission. All rights reserved.
- Figures 5.2, 5.3, 5.9, 5.10, 5.11, and 5.18, as well as Tables 4.1, 4.2, 4.3, and 5.1, contain material which respectively is copyright © Dr. F. Itakura, 1970, 1970, 1971, 1971, 1973, 1981, 1978, 1981, 1978, and 1981. Used with permission of copyright owner. All rights reserved.
- Figure 5.19 contains material which is copyright © Dr. T. Nakajima, 1978. Reproduced with permission. All rights reserved.
- Figure 6.36 contains material which is copyright © Dr. T. Moriya, 1986. Used with permission of copyright owner. All rights reserved.
- Figures 6.28 and 6.29 contain material which is copyright © Mr. Y. Shiraki, 1986. Reprinted with permission of copyright owner. All rights reserved.
- Figure 6.38 contains material which is copyright © Mr. T. Watanabe, 1982. Used with permission. All rights reserved.
- Figure 7.5 contains material which is copyright © Dr. Y. Sagisaka, 1998. Reproduced with permission. All rights reserved.
- Table 8.5 contains material which is copyright © Dr. S. Nakagawa, 1983. Reprinted with permission. All rights reserved.
- Figures 8.12, 8.13, and 8.20 contain material which is copyright © Prentice Hall, 1993. Used with permission. All rights reserved.
- Figures 8.15, 8.16, and 8.21 contain material which respectively is copyright © 1996, 1996, and 1997 Kluwer Academic Publishers. Reproduced with permission. All rights reserved.
- Figures 8.22 and 8.23 contain material which is copyright © DARPA, 1999. Used with permission. All rights reserved.
Preface to the First Edition
Research in speech processing has recently witnessed remarkable progress. Such progress has ensured the wide use of speech recognizers and synthesizers in a great many fields, such as banking services and data input during quality control inspections. Although the level and range of applications remain somewhat restricted, this technological progress has transpired through an efficient and effective combination of the long and continuing history of speech research with the latest remarkable advances in digital signal processing (DSP) technologies. In particular, these DSP technologies, including fast Fourier transform, linear predictive coding, and cepstrum representation, have been developed principally to solve several of the more complicated problems in speech processing. The aim of this book is, therefore, to introduce the reader to the most fundamental and important speech processing technologies derived from the level of technological progress reached in speech production, coding, analysis, synthesis, and recognition, as well as in speaker recognition. Although the structure of this book is based on my book in Japanese entitled Digital Speech Processing (Tokai University Press, Tokyo, 1985), I have revised and updated almost all chapters in line with the latest progress. The present book also includes several important speech processing technologies developed in Japan, which, for the most part, are somewhat unfamiliar to researchers from Western nations. Nevertheless, I have made every effort to remain as objective as possible in presenting the state of the art of speech processing.

This book has been designed primarily to serve as a text for an advanced undergraduate- or first-year graduate-level course. It has also been designed as a reference book with the speech researcher in mind. The reader is expected to have an introductory understanding of linear systems and digital signal processing.

Several people have had a significant impact, both directly and indirectly, on the material presented in this book. My biggest debt of gratitude goes to Drs. Shuzo Saito and Fumitada Itakura, both former heads of the Fourth Research Section of the Electrical Communications Laboratories (ECLs), Nippon Telegraph and Telephone Corporation (NTT). For many years they have provided me with invaluable insight into the conducting and reporting of my research. In addition, I had the privilege of working as a visiting researcher from 1978 to 1979 in AT&T Bell Laboratories' Acoustics Research Department under Dr. James L. Flanagan. During that period, I profited immeasurably from his views and opinions. Doctors Saito, Itakura, and Flanagan have not only had a profound effect on my personal life and professional career but have also had a direct influence in many ways on the information presented in this book. I also wish to thank the many members of NTT's ECLs for providing me with the necessary support and stimulating environment in which many of the ideas outlined in this book could be developed. Dr. Frank K. Soong of AT&T Bell Laboratories deserves a note of gratitude for his valuable comments and criticism on Chapter 6 during his stay at the ECLs as a visiting researcher. Additionally, I would like to extend my sincere thanks to Patrick Fulmer of Nexus International Corporation, Tokyo, for his careful technical review of the manuscript.
Finally, I would like to express my deep and endearing appreciation to my wife and family for their patience and for the time they sacrificed on my behalf throughout the book's preparation.

Sadaoki Furui
Contents

Series Introduction (K. J. Ray Liu)
Preface to the Second Edition
Acknowledgments
Preface to the First Edition

1. INTRODUCTION

2. PRINCIPAL CHARACTERISTICS OF SPEECH
  2.1 Linguistic Information
  2.2 Speech and Hearing
  2.3 Speech Production Mechanism
  2.4 Acoustic Characteristics of Speech
  2.5 Statistical Characteristics of Speech
    2.5.1 Distribution of amplitude level
    2.5.2 Long-time averaged spectrum
    2.5.3 Variation in fundamental frequency
    2.5.4 Speech ratio

3. SPEECH PRODUCTION MODELS
  3.1 Acoustical Theory of Speech Production
  3.2 Linear Separable Equivalent Circuit Model
  3.3 Vocal Tract Transmission Model
    3.3.1 Progressing wave model
    3.3.2 Resonance model
  3.4 Vocal Cord Model

4. SPEECH ANALYSIS AND ANALYSIS-SYNTHESIS SYSTEMS
  4.1 Digitization
    4.1.1 Sampling
    4.1.2 Quantization and coding
    4.1.3 A/D and D/A conversion
  4.2 Spectral Analysis
    4.2.1 Spectral structure of speech
    4.2.2 Autocorrelation and Fourier transform
    4.2.3 Window function
    4.2.4 Sound spectrogram
  4.3 Cepstrum
    4.3.1 Cepstrum and its application
    4.3.2 Homomorphic analysis and LPC cepstrum
  4.4 Filter Bank and Zero-Crossing Analysis
    4.4.1 Digital filter bank
    4.4.2 Zero-crossing analysis
  4.5 Analysis-by-Synthesis
  4.6 Analysis-Synthesis Systems
    4.6.1 Analysis-synthesis system structure
    4.6.2 Examples of analysis-synthesis systems
  4.7 Pitch Extraction

5. LINEAR PREDICTIVE CODING (LPC) ANALYSIS
  5.1 Principles of LPC Analysis
  5.2 LPC Analysis Procedure
  5.3 Maximum Likelihood Spectral Estimation
    5.3.1 Formulation of maximum likelihood spectral estimation
    5.3.2 Physical meaning of maximum likelihood spectral estimation
  5.4 Source Parameter Estimation from Residual Signals
  5.5 Speech Analysis-Synthesis System by LPC
  5.6 PARCOR Analysis
    5.6.1 Formulation of PARCOR analysis
    5.6.2 Relationship between PARCOR and LPC coefficients
    5.6.3 PARCOR synthesis filter
    5.6.4 Vocal tract area estimation based on PARCOR analysis
  5.7 Line Spectrum Pair (LSP) Analysis
    5.7.1 Principle of LSP analysis
    5.7.2 Solution of LSP analysis
    5.7.3 LSP synthesis filter
    5.7.4 Coding of LSP parameters
    5.7.5 Composite sinusoidal model
    5.7.6 Mutual relationships between LPC parameters
  5.8 Pole-Zero Analysis

6. SPEECH CODING
  6.1 Principal Techniques for Speech Coding
    6.1.1 Reversible coding
    6.1.2 Irreversible coding and information rate distortion theory
    6.1.3 Waveform coding and analysis-synthesis systems
    6.1.4 Basic techniques for waveform coding methods
  6.2 Coding in Time Domain
    6.2.1 Pulse code modulation (PCM)
    6.2.2 Adaptive quantization
    6.2.3 Predictive coding
    6.2.4 Delta modulation
    6.2.5 Adaptive differential PCM (ADPCM)
    6.2.6 Adaptive predictive coding (APC)
    6.2.7 Noise shaping
  6.3 Coding in Frequency Domain
    6.3.1 Subband coding (SBC)
    6.3.2 Adaptive transform coding (ATC)
    6.3.3 APC with adaptive bit allocation (APC-AB)
    6.3.4 Time-domain harmonic scaling (TDHS) algorithm
  6.4 Vector Quantization
    6.4.1 Multipath search coding
    6.4.2 Principles of vector quantization
    6.4.3 Tree search and multistage processing
    6.4.4 Vector quantization for linear predictor parameters
    6.4.5 Matrix quantization and finite-state vector quantization
  6.5 Hybrid Coding
    6.5.1 Residual- or speech-excited linear predictive coding
    6.5.2 Multipulse-excited linear predictive coding (MPC)
    6.5.3 Code-excited linear predictive coding (CELP)
    6.5.4 Coding by phase equalization and variable-rate tree coding
  6.6 Evaluation and Standardization of Coding Methods
    6.6.1 Evaluation factors of speech coding systems
    6.6.2 Speech coding standards
  6.7 Robust and Flexible Speech Coding

7. SPEECH SYNTHESIS
  7.1 Principles of Speech Synthesis
  7.2 Synthesis Based on Waveform Coding
  7.3 Synthesis Based on Analysis-Synthesis Method
  7.4 Synthesis Based on Speech Production Mechanism
    7.4.1 Vocal tract analog method
    7.4.2 Terminal analog method
  7.5 Synthesis by Rule
    7.5.1 Principles of synthesis by rule
    7.5.2 Control of prosodic features
  7.6 Text-to-Speech Conversion
  7.7 Corpus-Based Speech Synthesis

8. SPEECH RECOGNITION
  8.1 Principles of Speech Recognition
    8.1.1 Advantages of speech recognition
    8.1.2 Difficulties in speech recognition
    8.1.3 Classification of speech recognition
  8.2 Speech Period Detection
  8.3 Spectral Distance Measures
    8.3.1 Distance measures used in speech recognition
    8.3.2 Distances based on nonparametric spectral analysis
    8.3.3 Distances based on LPC
    8.3.4 Peak-weighted distances based on LPC analysis
    8.3.5 Weighted cepstral distance
    8.3.6 Transitional cepstral distance
    8.3.7 Prosody
  8.4 Structure of Word Recognition Systems
  8.5 Dynamic Time Warping (DTW)
    8.5.1 DP matching
    8.5.2 Variations in DP matching
    8.5.3 Staggered array DP matching
  8.6 Word Recognition Using Phoneme Units
    8.6.1 Principal structure
    8.6.2 SPLIT method
  8.7 Theory and Implementation of HMM
    8.7.1 Fundamentals of HMM
    8.7.2 Three basic problems for HMMs
    8.7.3 Solution to Problem 1: probability evaluation
    8.7.4 Solution to Problem 2: optimal state sequence
    8.7.5 Solution to Problem 3: parameter estimation
    8.7.6 Continuous observation densities in HMMs
    8.7.7 Tied-mixture HMM
    8.7.8 MMI and MCE/GPD training of HMM
    8.7.9 HMM system for word recognition
  8.8 Connected Word Recognition
    8.8.1 Two-level DP matching and its modifications
    8.8.2 Word spotting
  8.9 Large-Vocabulary Continuous-Speech Recognition
    8.9.1 Three principal structural models
    8.9.2 Other system constructing factors
    8.9.3 Statistical theory of continuous-speech recognition
    8.9.4 Statistical language modeling
    8.9.5 Typical structure of large-vocabulary continuous-speech recognition systems
    8.9.6 Methods for evaluating recognition systems
  8.10 Examples of Large-Vocabulary Continuous-Speech Recognition Systems
    8.10.1 DARPA speech recognition projects
    8.10.2 English speech recognition system at LIMSI Laboratory
    8.10.3 English speech recognition system at IBM Laboratory
    8.10.4 A Japanese speech recognition system
  8.11 Speaker-Independent and Adaptive Recognition
    8.11.1 Multi-template method
    8.11.2 Statistical method
    8.11.3 Speaker normalization method
    8.11.4 Speaker adaptation methods
    8.11.5 Unsupervised speaker adaptation method
  8.12 Robust Algorithms Against Noise and Channel Variations
    8.12.1 HMM composition/PMC
    8.12.2 Detection-based approach for spontaneous speech recognition

9. SPEAKER RECOGNITION
  9.1 Principles of Speaker Recognition
    9.1.1 Human and computer speaker recognition
    9.1.2 Individual characteristics
  9.2 Speaker Recognition Methods
    9.2.1 Classification of speaker recognition methods
    9.2.2 Structure of speaker recognition systems
    9.2.3 Relationship between error rate and number of speakers
    9.2.4 Intra-speaker variation and evaluation of feature parameters
    9.2.5 Likelihood (distance) normalization
  9.3 Examples of Speaker Recognition Systems
    9.3.1 Text-dependent speaker recognition systems
    9.3.2 Text-independent speaker recognition systems
    9.3.3 Text-prompted speaker recognition systems

10. FUTURE DIRECTIONS OF SPEECH INFORMATION PROCESSING
  10.1 Overview
  10.2 Analysis and Description of Dynamic Features
  10.3 Extraction and Normalization of Voice Individuality
  10.4 Adaptation to Environmental Variation
  10.5 Basic Units for Speech Processing
  10.6 Advanced Knowledge Processing
  10.7 Clarification of Speech Production Mechanism
  10.8 Clarification of Speech Perception Mechanism
  10.9 Evaluation Methods for Speech Processing Technologies
  10.10 Use of LSI for Speech Processing

APPENDICES
  A. Convolution and z-Transform
    A.1 Convolution
    A.2 z-Transform
    A.3 Stability
  B. Vector Quantization Algorithm
    B.1 VQ (Vector Quantization) Technique Formulation
    B.2 Lloyd's Algorithm (k-Means Algorithm)
    B.3 LBG Algorithm
  C. Neural Nets

Bibliography
Index
Introduction
Speech communication is one of the basic and most essential capabilities possessed by human beings. Speech can be said to be the single most important method through which people can readily convey information without the need for any 'carry-along' tool. Although we passively receive more stimuli from outside through the eyes than through the ears, mutually communicating visually is almost totally ineffective compared to what is possible through speech communication. The speech wave itself conveys linguistic information, the speaker's vocal characteristics, and the speaker's emotion. Information exchange by speech clearly plays a very significant role in our lives. The acoustical and linguistic structures of speech have been confirmed to be intricately related to our intellectual ability, and are, moreover, closely intertwined with our cultural and social development. Interestingly, the most culturally developed areas in the world correspond to those areas in which the telephone network is the most highly developed.

One evening in early 1875, Alexander Graham Bell was speaking with his assistant T. A. Watson (Fagen, 1975). He had just conceived the idea of a mechanism based on the structure of the human ear during the course of his research into fabricating a telegraph machine for conveying music. He said, 'Watson, I have another idea I haven't told you about that I think will surprise you.
If I can get a mechanism which will make a current of electricity vary in its intensity as the air varies in density when a sound is passing through it, I can telegraph any sound, even the sound of speech.' This, as we know, became the central concept coming to fruition as the telephone in the following year.

The invention of the telephone constitutes not only the most important epoch in the history of communications, but it also represents the first step in which speech began to be dealt with as an engineering target. The history of speech research actually started, however, long before the invention of the telephone. Initial speech research began with the development of mechanical speech synthesizers toward the end of the 18th century and continued with studies of vocal vibration and hearing mechanisms in the mid-19th century. Before the invention of pulse code modulation (PCM) in 1938, however, the speech wave had been dealt with by analog processing techniques. The invention of PCM and the development of digital circuits and electronic computers have made possible the digital processing of speech and have brought about remarkable progress in speech information processing, especially after 1960.

The two most important papers to appear since 1960 were presented at the 6th International Congress on Acoustics held in Tokyo, Japan, in 1968: the paper on a speech analysis-synthesis system based on the maximum likelihood method presented by NTT's Electrical Communications Laboratories, and the paper on predictive coding presented by Bell Laboratories. These papers essentially produced the greatest thrust to progress in speech information processing technology; in other words, they opened the way to digital speech processing technology. Specifically, both papers deal with the information compression technique using the linear prediction of speech waves and are based on mathematical techniques for stochastic processes.
These techniques gave rise to linear predictive coding (LPC), which has led to the creation of a new academic field. Various other complementary digital speech processing techniques have also been developed. In combination, these techniques have facilitated the realization of a wide range of systems operating on the principles of speech coding, speech analysis-synthesis, speech synthesis, speech recognition, and speaker recognition.

Books on speech information processing have already been published, and each has its own special features (Flanagan, 1972; Markel and Gray, 1976; Rabiner and Schafer, 1978; Saito and Nakata, 1985; Furui and Sondhi, 1992; Schroeder, 1999). The purpose of the present book is to explain the technologies essential to the speech researcher and to clarify and hopefully widen his or her understanding of speech by focusing on the most recent of the digital processing technologies. I hope that those readers planning to study and conduct research in the area of speech information processing will find this book useful as a reference or text. To those readers already extensively involved in speech research, I hope it will serve as a guidebook for sorting through the increasingly sophisticated knowledge base forming around the technology and for gaining insight into expected future progress. I have tried to cite wherever possible the most important aspects of the speech information processing field, including the precise development of equations, while omitting what is now considered classic information. In such instances, I have recommended well-known reference books. Since understanding the intricate relationships between various aspects of digital speech processing technology is essential to speech researchers, I have attempted to maintain a sense of descriptive unity and to sufficiently describe the mutual relationships between the techniques involved. I have also tried to refer to as many notable papers as permissible to further broaden the reader's perspective. Due to space restrictions, however, several important research areas, such as noise reduction and echo cancellation, unfortunately could not be included in this book.

Chapters 2, 3, and 4 explore the fundamental and principal elements of digital speech processing technology.
Chapters 5 through 9 present the more important techniques as well as applications of LPC analysis, speech waveform coding, speech synthesis, speech recognition, and speaker recognition. The final chapter discusses future research problems. Several important concepts, terms, and mathematical relationships are precisely explained in the appendixes. Since the design of this book relates the digital speech processing techniques to each other in developmental and precise terms as mentioned, the reader is urged to read each chapter of this book in the order presented.
Principal Characteristics of Speech
2.1 LINGUISTIC INFORMATION

The speech wave conveys several kinds of information, which consist principally of linguistic information that indicates the meaning the speaker wishes to impart, individual information representing who is speaking, and emotional information depicting the emotion of the speaker. Needless to say, the first informational type is the most important.

Undeniably, the ability to acquire and produce language and to actually make and use tools are the two principal features that distinguish humans from other animals. Furthermore, language and cultural development are inseparable. Although written language is effective for exchanging knowledge and lasts longer than spoken language if properly preserved, the amount of information exchanged by speech is considerably larger. In more simplified terms, books, magazines, and the like are effective as one-way information transmission media, but are wholly unsuited to two-way communication.

Human speech production begins with the initial conceptualization of an idea which the speaker wants to convey to a listener.
The speaker subsequently converts that idea into a linguistic structure by selecting the appropriate words or phrases which distinctly represent it, and then ordering them according to loose or rigid grammatical rules depending upon the speaker-listener relationship. Following these processes, the human brain produces motor nerve commands which move the various muscles of the vocal organs. This process is essentially divisible into two subprocesses: the physiological process involving nerves and muscles, and the physical process through which the speech wave is produced and propagated. The speech characteristics as physical phenomena are continuous, although language conveyed by speech is essentially composed of discretely coded units.

A sentence is constructed using basic word units, with each word being composed of syllables, and each syllable being composed of phonemes, which, in turn, can be classified as vowels or consonants. Although the syllable itself is not well defined, one syllable is generally formed by the concatenation of one vowel and one to several consonants. The number of vowels and consonants varies, depending on the classification method and language involved. Roughly speaking, English has 12 vowels and 24 consonants, whereas Japanese has 5 vowels and 20 consonants. The number of phonemes in a language rarely exceeds 50. Since there are combination rules for building phonemes into syllables, the number of syllables in each language comprises only a fraction of all possible phoneme combinations.

In contrast with the phoneme, which is the smallest speech unit from the linguistic or phonemic point of view, the physical unit of actual speech is referred to as the phone. The phoneme and phone are respectively indicated by phonemic and phonetic symbols, such as /a/ and [a]. As another example, the phones [ɛ] and [e], which correspond to the distinct phonemes /ɛ/ and /e/ in French, correspond to the same phoneme /e/ in Japanese.
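The combinatorial point above can be made concrete with a back-of-the-envelope sketch. The snippet below uses the rough phoneme counts cited above and a deliberately simplified syllable template (a syllable is either a lone vowel, V, or a consonant-vowel pair, CV); the function name and the template are illustrative assumptions, not a phonological analysis from the text.

```python
# Illustrative combinatorics of syllable inventories.
# Simplifying assumption: a syllable is either V or CV.

def possible_syllables(vowels: int, consonants: int) -> int:
    """Count all V and CV combinations for a given phoneme inventory."""
    return vowels + consonants * vowels

# Approximate phoneme counts cited in the text:
japanese = possible_syllables(vowels=5, consonants=20)
english = possible_syllables(vowels=12, consonants=24)

print(japanese)  # 105 possible V/CV combinations
print(english)   # 300 possible V/CV combinations
```

Even under this toy template the combinatorial space is large, and real languages use only a subset of it because phonotactic rules forbid many sequences; English, which additionally permits consonant clusters, has a far larger space of conceivable syllables than this V/CV count suggests.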
Although the number of words in each language is very large and new words are constantly added, the total number is much smaller than all of the syllable or phoneme combinations possible. It has been claimed that the number of frequently used words is between 2000 and 3000, and that the number of words used by the average person lies between 5000 and 10,000. Stress and intonation also play critical roles in indicating the location of important words, in making interrogative sentences, and in conveying the emotion of the speaker.
2.2 SPEECH AND HEARING
Speech is uttered for the purpose of being, and on the assumption that it actually is, received and understood by the intended listeners. This obviously means that speech production is intrinsically related to hearing ability. The speech wave produced by the vocal organs is transmitted through the air to the ears of the listeners, as shown in Fig. 2.1. At the ear, it activates the hearing organs to produce nerve impulses which are transmitted to the listener's brain through the auditory nerve system. This permits the linguistic information which the speaker intends to convey to be readily understood by the listener.
FIG. 2.1 Speech chain. The discrete linguistic process in the speaker is converted through a physiological process into a continuous physical (acoustic) process, and back again through physiological and linguistic processes in the listener.
The same speech wave is naturally transmitted to the speaker's ears as well, allowing him to continuously control his vocal organs by receiving his own speech as feedback. The critical importance of this feedback mechanism is clearly apparent with people whose hearing has become disabled for more than a year or two. It is also evident in the fact that it is very hard to speak when our own speech is fed back to our ears with a certain amount of time delay (delayed feedback effect).

The intrinsic connection between speech production and hearing is called the speech chain (Denes and Pinson, 1963). In terms of production, the speech chain consists of the linguistic, physiological, and physical (acoustical) stages, the order of which is reversed for hearing.

The human hearing mechanism constitutes such a sophisticated capability that, at this point in time anyway, it cannot be closely imitated by artificial/computational means. One advantage of this hearing capability is selective listening, which permits the listener to hear only one voice even when several people are speaking simultaneously, and even when the voice a person wants to hear is spoken indistinctly, with a strong dialectal accent, or with strong voice individuality. On the other hand, the human hearing mechanism exhibits very low capability in some respects. One example of its inherent disadvantages is that the ear cannot separate two tones that are similar in frequency or that have a very short time interval between them. Another negative aspect is that when two tones exist at the same time, one cannot be heard since it is masked by the other.

The sophisticated hearing capability noted is supported by the complex language understanding mechanism controlled by the brain, which employs various context information in executing the mental processes concerned. The interrelationships between these mechanisms thus allow people to effectively communicate with each other.
Although research into speech processing has thus far been undertaken without a detailed consideration of the concept of hearing, it is vital to connect any future speech research to the hearing mechanism, inclusive of the realm of language perception.
2.3 SPEECH PRODUCTION MECHANISM
The speech production process involves three subprocesses: source generation, articulation, and radiation. The human vocal organ complex consists of the lungs, trachea, larynx, pharynx, and nasal and oral cavities. Together these form a connected tube, as indicated in Fig. 2.2. The upper portion beginning with the larynx is called the vocal tract, which is changeable into various shapes by moving the jaw, tongue, lips, and other internal parts. The nasal
FIG. 2.2 Schematic diagram of the human vocal mechanism (soft palate, pharynx, vocal tract, larynx, esophagus).
cavity is separated from the pharynx and oral cavity by raising the velum, or soft palate.

When the abdominal muscles force the diaphragm up, air is pushed up and out from the lungs, with the airflow passing through the trachea and glottis into the larynx. The glottis, or the gap between the left and right vocal cords, which is usually open during breathing, becomes narrower when the speaker intends to produce sound. The airflow through the glottis is then periodically interrupted by opening and closing the gap in accordance with the interaction between the airflow and the vocal cords. This intermittent flow, called the glottal source or the source of speech, can be simulated by asymmetrical triangular waves.

The mechanism of vocal cord vibration is actually very complicated. In principle, however, the Bernoulli effect associated with the airflow and the stability produced by the elasticity of the muscles draw the vocal cords toward each other. When the vocal cords are strongly strained and the pressure of the air rising from the lungs (subglottal air pressure) is high, the open-and-close period (that is, the vocal cord vibration period) becomes short and the pitch of the sound source becomes high. Conversely, the low-air-pressure condition produces lower-pitched sound. This vocal cord vibration period is called the fundamental period, and its reciprocal is called the fundamental frequency. Accent and intonation result from temporal variation of the fundamental period.

The sound source, consisting of fundamental and harmonic components, is modified by the vocal tract to produce tonal qualities, such as /a/ and /o/, in vowel production. During vowel production, the vocal tract is maintained in a relatively stable configuration throughout the utterance. Two other mechanisms are responsible for changing the airflow from the lungs into speech sound. These are the mechanisms underlying the production of two kinds of consonants: fricatives and plosives.
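The asymmetrical triangular approximation of the glottal source described above can be sketched in a few lines of numpy. The function name and the open-quotient/skew parameters below are illustrative choices, not values from the text:

```python
import numpy as np

def glottal_source(f0, fs, duration, open_quotient=0.6, skew=0.7):
    """Asymmetrical triangular pulse train approximating the glottal volume
    velocity: each fundamental period has a slow rise, a faster fall, and a
    closed phase.  Parameter names and default values are illustrative."""
    period = int(round(fs / f0))             # samples per fundamental period
    open_len = int(period * open_quotient)   # samples with the glottis open
    rise = int(open_len * skew)              # slow opening phase
    fall = open_len - rise                   # faster closing phase
    pulse = np.concatenate([
        np.linspace(0.0, 1.0, rise, endpoint=False),
        np.linspace(1.0, 0.0, fall, endpoint=False),
        np.zeros(period - open_len),         # closed phase
    ])
    n_periods = int(duration * fs / period) + 1
    return np.tile(pulse, n_periods)[: int(duration * fs)]

# 100 ms of source at a 125 Hz fundamental frequency, 8 kHz sampling.
source = glottal_source(f0=125.0, fs=8000, duration=0.1)
```

Raising `f0` shortens the open-and-close period, just as higher subglottal pressure and stronger vocal cord tension do in the physical mechanism.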
Fricatives, such as /s/, /f/, and /ʃ/, are noiselike sounds produced by turbulent flow which occurs when the airflow passes through a constriction in the vocal tract made by the tongue or lips. The tonal difference of each fricative corresponds to a fairly precisely located constriction and vocal tract shape. Plosives (stop consonants), such as /p/, /t/, and /k/, are impulsive sounds which occur with the sudden release of high-pressure air produced by checking the airflow in the vocal tract, again by using the tongue or lips. The tonal difference corresponds to the difference between the checking position and the vocal tract shape. The production of these consonants is wholly independent of vocal cord vibration.

Consonants which are accompanied by vocal cord vibration are known as voiced consonants, and those which are not accompanied by this vibration are called unvoiced consonants. The sounds emitted with vocal cord vibration are referred to as voiced sounds, and those without are named unvoiced sounds. Aspiration or whispering is produced when a turbulent flow is made at the glottis by slightly opening the vocal cords so that vocal cord vibration is not produced.

Semivowel, nasal, and affricate sounds are also included in the family of consonants. Semivowels are produced in a similar way as vowels, but their physical properties gradually change without a steady utterance period. Although semivowels are included in consonants, they are accompanied by neither turbulent airflow nor pulselike sound, since the vocal tract constriction is loose and vocal organ movement is relatively slow.

In the production of nasal sounds, the nasal cavity becomes an extended branch of the oral cavity, with the airflow being supplied to the nasal cavity by lowering the velum and arresting the airflow at some particular place in the oral cavity. When the nasal cavity forms a part of the vocal tract together with the oral cavity during vowel production, the vowel quality acquires nasalization and produces the nasalized vowel. Affricates are produced by the succession of plosive and fricative sounds while maintaining a close constriction at the same position.

Adjusting the vocal tract shape to produce various linguistic sounds is called articulation, while the movement of each part in the vocal tract is known as articulatory movement.
The parts of the vocal tract used for articulation are called articulatory organs, and those which can actively move, such as the tongue, lips, and velum, are named articulators.
The difference between articulatory methods for producing fricatives, plosives, nasals, and so on, is termed the manner of articulation. The constriction place in the vocal tract produced by articulatory movement is designated as the place of articulation. Various tone qualities are produced by varying the vocal tract shape, which changes the transmission characteristics (that is, the resonance characteristics) of the vocal tract.

Speech sounds can be classified according to the combination of source and vocal tract (articulatory organ) resonance characteristics based on the production mechanism described above. The consonants and vowels of English are classified in Table 2.1 and Fig. 2.3, respectively.

TABLE 2.1 Consonants (each cell: voiced / unvoiced)

Articulation manner   Labial    Dental    Alveolar    Palatal    Glottal
Fricatives            v / f     ð / θ     z / s       ʒ / ʃ      - / h
Affricates            -         -         dz / ts     dʒ / tʃ    -
Plosives              b / p     -         d / t       g / k      -
Semivowels            w         -         l           j, r       -
Nasals                m         -         n           ŋ          -

The horizontal lines in Fig. 2.3 indicate the approximate location of the vocal tract constriction in the representation: the more to the left it is, the closer to the front (near the lips) is the constriction. The vertical lines indicate the degree of constriction, which corresponds to the jaw opening position; the lowest line in the figure indicates maximum jaw opening. These two conditions, in conjunction with lip rounding, represent the basic characteristics of vowel articulation. Each of the vowel pairs located side by side in the figure indicates a pair in which only the articulation of the lips is different: the left one does not involve lip rounding, whereas the right one is produced by rounding the lips.

FIG. 2.3 Vowel classification from approximate vocal organ representation (horizontal axis: tongue hump position, from front through central to back; vertical axis: tongue height, from high to low).

This lip rounding rarely happens for vowels produced by extended jaw opening. The phoneme [ə] is called a neutral vowel, since the tongue and lips for producing this vowel are in the most neutral position; hence, the vocal tract shape is similar to a homogeneous tube having a constant cross section. Relatively simple vowel structures, such as that of the Japanese language, are constructed of those vowels located along the exterior of the figure. These exterior vowels consist of [i, e, ɛ, a, ɑ, ɒ, ɔ, o, u, ɯ]. This means that the back tongue vowels tend to feature lip rounding, while the front tongue vowels exhibit no such tendency.

Gliding monosyllabic speech sounds produced by varying the vocal tract smoothly between vowel or semivowel configurations are referred to as diphthongs. There are six diphthongs in American English, /ey/, /ow/, /ay/, /aw/, /oy/, and /ju/, but there are none in Japanese.

The articulated speech wave with linguistic information is radiated from the lips into the air and diffused. In nasalized sound, the speech wave is also radiated from the nostrils.
2.4 ACOUSTIC CHARACTERISTICS OF SPEECH
Figure 2.4 represents the speech wave, short-time averaged energy, short-time spectral variation (Furui, 1986), fundamental frequency (modified correlation functions; see Sec. 5.4), and sound spectrogram for the Japanese phrase /tʃo:seN naNbuni/, or 'in the southern part of Korea,' uttered by a male speaker. The sound spectrogram, the details of which will be described in Sec. 4.2.4, visually presents the light and dark time pattern of the frequency spectrum. The dark parts indicate the spectral components having high energy, and the vertical stripes correspond to the fundamental period.

This figure shows that the speech wave and spectrum vary as nonstationary processes in periods of 1/2 s or longer. In appropriately divided periods of 20-40 ms, however, the speech wave and spectrum can be regarded as having constant characteristics. The vertical lines in Fig. 2.4 indicate these boundaries. The segmentation was done automatically based on the amount of short-time spectral variation. During the periods of /tʃ/ or /s/ unvoiced consonant production, the speech waves show random waves with small amplitudes, and the spectra show random patterns. On the other hand, during the production periods of voiced sounds, such as those with /i/, /e/, /a/, /o/, /u/, /N/, the speech waves present periodic waves having large amplitudes, with the spectra indicating relatively global iterations of light and dark patterns. The dynamic range of the speech wave amplitude is so large that the amplitude difference between the unvoiced sounds having smaller amplitudes and the voiced sounds having larger amplitudes sometimes exceeds 30 dB.

The dominant frequency components which characterize the phonemes, corresponding to the resonant frequency components of the vowels, generally have three formants, which are called the first, second, and third formants, beginning with the lowest-frequency component. They are usually written as F1, F2, and F3.
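The 20-40 ms short-time analysis described above can be sketched as a frame-by-frame energy computation. The frame and hop lengths here are common illustrative choices, and the two test signals merely mimic the voiced/unvoiced amplitude contrast:

```python
import numpy as np

def short_time_energy(signal, fs, frame_ms=25.0, hop_ms=10.0):
    """Short-time averaged energy (dB) over successive frames; within a
    20-40 ms frame the speech wave can be treated as quasi-stationary."""
    frame = int(fs * frame_ms / 1000)   # samples per frame
    hop = int(fs * hop_ms / 1000)       # frame shift
    window = np.hamming(frame)
    energies = []
    for start in range(0, len(signal) - frame + 1, hop):
        x = signal[start:start + frame] * window
        energies.append(10.0 * np.log10(np.mean(x ** 2) + 1e-12))
    return np.array(energies)

rng = np.random.default_rng(0)
fs = 8000
t = np.arange(fs // 2) / fs
# First half: a periodic, large-amplitude "voiced-like" signal;
# second half: a small-amplitude, noiselike "unvoiced-like" signal.
voiced_like = np.sin(2 * np.pi * 200 * t)
unvoiced_like = 0.01 * rng.standard_normal(fs // 2)
energy_db = short_time_energy(np.concatenate([voiced_like, unvoiced_like]), fs)
```

The large frame-to-frame drop in `energy_db` at the midpoint mirrors the 30 dB voiced/unvoiced amplitude differences noted above.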
Even for the same phoneme, however, these formant frequencies largely vary, depending on the speaker.

FIG. 2.4 Speech wave, short-time averaged energy, short-time spectral variation, fundamental frequency, and sound spectrogram (from top to bottom) for the Japanese sentence /tʃo:seN naNbuni/.

Furthermore, the formant frequencies vary, depending on the adjacent phonemes in continuously spoken utterances, such as those emitted during conversation. The overlapping of phonetic features from phoneme to phoneme is termed coarticulation. Each phoneme can be considered as a target at which the vocal organs aim but never reach. As soon as the target has been approached nearly enough to be intelligible to the listener, the organs change their destinations and start to head for a new target. This is done to minimize the effort expended in speaking and makes for greater fluency. The phenomenon of coarticulation adds to the problems of speech synthesis and recognition. Since speech in which coarticulation does not occur sounds unnatural to our ears, for high-quality synthesis we must include an appropriate degree of coarticulation. In recognition, coarticulation means that the features of isolated phonemes are never found in connected syllables; hence any recognition system based on identifying phonemes must necessarily correct for contextual influences.

Examples of the relationship between vocal tract shapes and vowel spectral envelopes are presented in Fig. 2.5 (Stevens et al., 1986). Fronting or backing of the tongue body while maintaining approximately the same tongue height causes a raising or lowering of F2, with the effect on the overall spectral shape accordingly produced as shown. As is clear, F2 approaches F1 for back vowels and F3 for front vowels. A further lowering of F2 can be achieved by rounding the lips, as illustrated in Fig. 2.5(c).

The basic acoustic characteristics of vowel formants can be characterized by F1 and F2. Figure 2.6 is a scatter diagram of formant frequencies of the isolatedly spoken five Japanese vowels on the F1-F2 plane, the horizontal and vertical axes of which correspond to the first- and second-formant frequencies, F1 and F2, respectively (Sato, 1975). This figure indicates the distributions for 30 male and 30 female speakers as well as the mean and standard deviation values for these speakers.
The five vowels are typically distributed in a triangular shape as shown in this figure, which is sometimes called the vowel triangle. For comparative purposes, Fig. 2.7 presents the scatter diagram of formant frequencies of 10 English vowels uttered by 76 speakers (33 adult males, 28 adult females, and 15 children) on the F1-F2 plane (Peterson and Barney, 1952). The distribution of the vowels extracted from continuous speech generally indicates an overlap between different vowels. The variation owing to the speakers and their ages, however, can be approximated by a parallel shift in the logarithmic frequency plane, in other words, by a proportional change in linear frequency, which can be seen in the male and female voice comparison in Fig. 2.6. Hence, this overlapping of different vowels can be considerably reduced when the distribution is examined in a three-dimensional space formed by adding the third formant, which characterizes the individuality of the voice. The higher-order formants indicate a smaller variation, depending on the vowels uttered. Therefore, the higher-order formant has a peculiar value for each speaker corresponding to his or her vocal tract length.

Although difficult, measuring formant bandwidths has been attempted by many researchers. The extracted values range from 30 to 120 Hz (mean 50 Hz) for F1, 30 to 200 Hz (mean 60 Hz) for F2, and 40 to 300 Hz (mean 115 Hz) for F3. Variation in bandwidth has little influence on the quality of speech heard.

FIG. 2.5 Examples of the relationship between vocal tract shapes and vowel spectral envelopes: (a) schematization of mid-sagittal section of vocal tract for a neutral vowel (solid contour), and for back and front tongue-body positions; (b) idealized spectral envelopes corresponding to the three tongue-body configurations in (a); (c) approximate effect of lip rounding on the spectral envelope for a back vowel.

FIG. 2.6 Scatter diagram of formant frequencies of five Japanese vowels uttered by 60 speakers (30 males and 30 females) in the F1-F2 plane.

FIG. 2.7 Scatter diagram of formant frequencies of 10 English vowels uttered by 76 speakers (33 adult males, 28 adult females, and 15 children) in the F1-F2 plane.
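As an illustration of how the F1-F2 plane separates vowels, here is a nearest-centroid sketch. The centroid values are rough, illustrative ballpark figures for adult male speech, not data taken from Fig. 2.6:

```python
import math

# Rough, illustrative F1/F2 centroids (Hz) for the five Japanese vowels;
# ballpark values for adult male speech -- not measured data.
VOWEL_CENTROIDS = {
    "a": (800.0, 1300.0),
    "i": (300.0, 2300.0),
    "u": (350.0, 1250.0),
    "e": (500.0, 1900.0),
    "o": (500.0, 900.0),
}

def classify_vowel(f1, f2):
    """Nearest-centroid classification on the F1-F2 plane, with distance
    measured on a logarithmic frequency scale (matching the observation
    that speaker variation approximates a parallel shift in log frequency)."""
    def log_dist(centroid):
        c1, c2 = centroid
        return math.hypot(math.log(f1 / c1), math.log(f2 / c2))
    return min(VOWEL_CENTROIDS, key=lambda v: log_dist(VOWEL_CENTROIDS[v]))
```

Measuring distance in log frequency means a speaker whose formants are all scaled up proportionally (e.g., a female or child voice) is displaced equally for every vowel, which keeps the triangle's shape and reduces misclassification.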
Consonants are classified by the periodicity of waves (voiced/unvoiced), frequency spectrum, duration, and temporal variation. The acoustic characteristics of the consonants largely vary as the result of coarticulation with vowels, since the consonants originally have no stable or steady-state period. Especially with rapid speech, articulation of the phoneme which follows, that is, tongue and lip movement toward the articulation place of the following phoneme, starts before completion of articulation of the phoneme being presently uttered.

Coarticulation sometimes affects phonemes located beyond adjacent phonemes. Furthermore, since various articulatory organs participate in actual speech production, and since each organ has its own time constant of movement, the acoustic phenomena resulting from these movements are highly complicated. Hence, it is very difficult to obtain a one-to-one correspondence between phonemic symbols and acoustic characteristics.

Under these circumstances, the focus has been on examining ways to specify each phoneme by combining relatively simple features instead of on determining the specific acoustic features of each phoneme (Jakobson et al., 1963). These features thus far formalized, which are called distinctive features, consist of the binary representation of nine descriptive pairs: vocalic/nonvocalic, consonantal/nonconsonantal, compact/diffuse, grave/acute, flat/plain, nasal/oral, tense/lax, continuant/interrupted, and strident/mellow. Since the selection of these features has been based mainly on auditory rather than articulatory characteristics, many of them are qualitative, having weak correspondence to physical characteristics. Therefore, considerable room still remains in their final clarification.
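The binary distinctive-feature representation can be sketched as a fixed-order bit vector. The feature assignments for /s/ and /z/ below are illustrative, not a definitive coding:

```python
# The nine distinctive-feature pairs listed above, in a fixed order; a
# phoneme is then a binary vector (1 = first member of the pair applies).
FEATURES = ["vocalic", "consonantal", "compact", "grave", "flat",
            "nasal", "tense", "continuant", "strident"]

def feature_vector(assignments):
    """Pack a {feature-name: bool} mapping into an ordered tuple of 0/1;
    unmentioned features default to 0 (second member of the pair)."""
    return tuple(int(assignments.get(f, False)) for f in FEATURES)

def feature_distance(v1, v2):
    """Number of distinctive features in which two phonemes differ."""
    return sum(a != b for a, b in zip(v1, v2))

# Illustrative (not definitive) entries: /s/ and /z/ share all features
# except tense/lax, so they differ in exactly one feature.
s_vec = feature_vector({"consonantal": True, "continuant": True,
                        "strident": True, "tense": True})
z_vec = feature_vector({"consonantal": True, "continuant": True,
                        "strident": True})
```

The Hamming-style distance makes the appeal of the scheme concrete: a confusion between phonemes differing in one feature is "smaller" than one differing in several.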
2.5 STATISTICAL CHARACTERISTICS OF SPEECH

2.5.1 Distribution of Amplitude Level
Figure 2.8 shows accumulated distributions of the speech amplitude level calculated for utterances by 80 speakers (4 speakers × 20 languages) having a duration of roughly 37 minutes (Irii et al., 1987). The horizontal axis, specifically the amplitude level, is normalized by the long-term effective value, or root mean square (rms) value. The vertical axis indicates the frequency of amplitude accumulated from large values, in other words, the frequency of amplitude values larger than the indicated value. These results clearly confirm that the dynamic range of speech amplitude exceeds 50 dB.

FIG. 2.8 Accumulated distribution of speech amplitude level calculated for utterances made by 80 speakers having a duration of roughly 37 min.

The difference between the amplitude level at which the accumulated value amounts to 1% and the long-term effective value is called the peak factor, because it relates to the sharpness of the wave. The speech and sinusoidal wave peak factors are about 12 dB and 3 dB, respectively, indicating that the speech wave is much higher in sharpness.

The derivative of the accumulated distribution curve corresponds to the amplitude density distribution function. The results
derived from Fig. 2.8 are presented in Fig. 2.9 (Irii et al., 1987). The distribution can be approximated by an exponential distribution:

p(x) = (1/(√2 σ)) exp(−√2 |x| / σ)

Here, σ is the effective value (σ² corresponds to the mean energy).

FIG. 2.9 Amplitude density distribution function derived from Fig. 2.8.

Distribution of the long-term effective speech level over many speakers is regarded as being the normal distribution for both
males and females. The standard deviation for these distributions is roughly 3.8 dB, and the mean value for male voices is roughly 4.5 dB higher than that for female voices. The long-term effective value under the high-noise-level condition is usually raised according to that noise level.

FIG. 2.10 Long-time averaged speech spectrum calculated for utterances made by 80 speakers.

2.5.2 Long-Time Averaged Spectrum
Figure 2.10 shows the long-time averaged speech spectra extracted using 20 channels of one-third-octave bandpass filters which cover the 0-9 kHz frequency range (Irii et al., 1987). These results were also obtained using the utterances made by 80 speakers of 20 languages. As is clear, only a slight difference exists between male and female speakers, except for the low-frequency range, where the spectrum is affected by the variation in fundamental frequency. The difference is also noticeably very small between languages. Based on these results, the typical speech spectrum shape is represented by the combination of a flat spectrum and a spectrum having a slope of -10 dB/octave (oct). The former is applied to the frequency range lower than 500 Hz, while the latter is applied to that higher than 500 Hz. Although the long-time averaged spectra calculated through the above-mentioned method demonstrate only slight differences between speakers, those calculated with high frequency resolution definitely feature individual differences (Furui, 1972).

2.5.3 Variation in Fundamental Frequency
Statistical analysis of temporal variation in fundamental frequency during conversational speech for every speaker indicates that the mean and standard deviation for female voices are roughly twice those for male voices, as shown in Fig. 2.11 (Saito et al., 1958). The fundamental frequency distributed over speakers on a logarithmic frequency scale can be approximated by two normal distribution functions which correspond to male and female voices, respectively, as shown in Fig. 2.12. The mean and standard deviation for male voices are 125 and 20.5 Hz, respectively, whereas those for female voices are two times larger. Intraspeaker variation is roughly 20% smaller than interspeaker variation.

Analysis of the temporal transition distribution in the fundamental frequency indicates that roughly 18% of these are ascending and roughly 50% are descending. Frequency analysis of the temporal pattern of the fundamental frequency, in which the silent period is smoothly connected, shows that the frequency of the temporal variation is less than 10 Hz. This implies that the speed of the temporal variation in the fundamental frequency is relatively slow.
FIG. 2.11 Mean and standard deviation of temporal variation in fundamental frequency during conversational speech for various speakers.

FIG. 2.12 Fundamental frequency distribution over speakers.
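A minimal sketch of the two-component fundamental-frequency distribution, using the means and standard deviations quoted above. For simplicity the normal approximation is applied here directly on the linear frequency scale, whereas the text's approximation is on a logarithmic scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Statistics quoted in the text: mean 125 Hz and standard deviation 20.5 Hz
# for male voices; female values roughly two times larger.
MALE_MEAN, MALE_SD = 125.0, 20.5
FEMALE_MEAN, FEMALE_SD = 250.0, 41.0

def sample_f0(n, mean, sd):
    """Draw n fundamental-frequency values from a normal approximation
    (applied on the linear scale here as a simplification)."""
    return rng.normal(mean, sd, size=n)

male_f0 = sample_f0(10000, MALE_MEAN, MALE_SD)
female_f0 = sample_f0(10000, FEMALE_MEAN, FEMALE_SD)
```

Plotting histograms of `male_f0` and `female_f0` together reproduces the bimodal over-speaker distribution of Fig. 2.12, with only a small overlap region between roughly 170 and 200 Hz.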
2.5.4 Speech Ratio
Conversational speech includes speech as well as pause periods, and the proportion of actual speech periods is referred to as the speech ratio. In conversational speech, the speech ratio for each speaker changes, of course, as a function of the speech rate. An experiment which increased and decreased the speech rate by 30-40% indicated that the expansion or contraction at pause periods becomes 65-69%, although during the speech period it is 13-19% (Saito, 1961). This means that the variation in the speech rate is mainly accomplished by changing the pause periods. Moreover, expansion or contraction during vowel periods is generally larger than that during consonant periods.
3 Speech Production Models
3.1 ACOUSTICAL THEORY OF SPEECH PRODUCTION
As described in Sec. 2.3, the speech wave production mechanism can be divided into three stages: sound source production, articulation by the vocal tract, and radiation from the lips and/or nostrils (Fant, 1960). These stages can be further characterized by electrical equivalent circuits based on the relationship between electrical and acoustical systems.

Specifically, sound sources are either voiced or unvoiced. A voiced sound source can be modeled by a generator of pulses or asymmetrical triangular waves which are repeated at every fundamental period. The peak value of the source wave corresponds to the loudness of the voice. An unvoiced sound source, on the other hand, can be modeled by a white noise generator, the mean energy of which corresponds to the loudness of voice. Articulation can be modeled by the cascade or parallel connection of several single-resonance or antiresonance circuits, which can be realized through a multistage digital filter. Finally, radiation can be modeled as arising from a piston sound source attached to an infinite, plane baffle. Here, the radiation impedance is represented by an L-r cascade circuit, where r is the energy loss occurring through the radiation.

TABLE 3.1 Speech Production Process Models

Type                        Speech production model                    System function
Vowel type                  Voiced source → vocal tract → radiation    Resonance only (all-pole model)
Consonant type              Unvoiced source → vocal tract → radiation  Resonance and antiresonance (pole-zero model)
Nasal type (nasal &         Voiced source → vocal tract → radiation    Resonance and antiresonance (pole-zero model)
nasalized vowel)

The speech production process can accordingly be characterized by combining these electrical equivalent circuits, as indicated in Table 3.1. The resonance characteristics depend on the vocal tract shape only, and not on the location of the sound source, during both vowel-type and consonant-type production. Conversely, the antiresonance characteristics during consonant-type production depend primarily on the antiresonance characteristics of the vocal tract between the glottis and the sound source position. The resonance and antiresonance effects are usually canceled in the low-frequency range, since these locations almost exactly coincide.

Resonance characteristics for the branched vocal tract, such as those for nasal-type production, are conditioned by the oral
FIG. 3.1 An example of spectral change caused by the nasalization of vowel /a/. It is characterized by pole-zero pairs at 300-400 Hz and at around 2500 Hz. F1, F2, F3 are formants.
cavity characteristics forward and backward from the velum and by the nasal tract characteristics from the velum to the nostrils. The antiresonance characteristics of nasalized consonants (nasal sounds) are determined by the forward characteristics of the oral cavity starting from the velum. On the other hand, the antiresonance characteristics of nasalized vowels depend on the nasal tract characteristics starting from the velum. Figure 3.1 exemplifies the spectral change caused by the nasalization of the vowel /a/.
When the radiation characteristics are approximated by the above-mentioned model, the normalized radiation impedance for the unit plane's free vibration can be represented by

z_r = (ka)²/2 + j · 8ka/(3π)    (ka ≪ 1)    (3.1)

where a is the radius of the vibration plane, k = ω/c, ω is the angular frequency, and c is the sound velocity (Flanagan, 1972). This equation is obtained for small values of ka. The first component in Eq. (3.1) represents the energy loss associated with the radiation of the speech wave. The second component indicates that the vocal tract is equivalently extended by 8a/(3π), having a cross section which is equal to the opening section. The radiation characteristics are usually approximated by considering the 6-dB/oct differentiation characteristics only, and not the phase characteristics.

Radiation impedance decreases all resonance frequencies with a constant ratio but increases their bandwidths. The fact that the glottal source impedance is finite increases all resonance frequencies and bandwidths. These effects for high-frequency resonances, however, can be neglected.
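Eq. (3.1) can be evaluated numerically to see how small both components are at speech frequencies; the lip-opening radius below is an illustrative value:

```python
import math

def radiation_impedance(freq_hz, radius_m, c=340.0):
    """Normalized radiation impedance of Eq. (3.1), valid for ka << 1:
    real part (ka)^2 / 2 (radiation loss), imaginary part 8ka / (3*pi)
    (equivalent lengthening of the vocal tract by 8a / (3*pi))."""
    k = 2.0 * math.pi * freq_hz / c      # wavenumber k = omega / c
    ka = k * radius_m
    return complex(ka ** 2 / 2.0, 8.0 * ka / (3.0 * math.pi))

# Illustrative case: a lip opening of radius 1 cm at 1 kHz.
z = radiation_impedance(1000.0, 0.01)
```

For this case ka ≈ 0.18, so the reactive (mass-like) component dominates the resistive (loss) component, consistent with the small-ka assumption of Eq. (3.1).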
3.2 LINEAR SEPARABLE EQUIVALENT CIRCUIT MODEL
Present speech information processing techniques are based on the linear separable equivalent circuit model of the speech production mechanism detailed in Fig. 3.2. This model is constructed by simplifying the model outlined in the previous section. Specifically, this involves completely separating the source G(ω) from the articulation (resonance and antiresonance) H(ω) and representing the production model for the speech wave S(ω) as the cascade connection of each electrical equivalent circuit without mutual interaction, such that

S(ω) = G(ω)H(ω)    (3.2)
FIG. 3.2 Linear separable equivalent circuit model of the speech production mechanism: the source G(ω), controlled by the fundamental period, the voiced/unvoiced distinction, and the amplitude, drives the articulation filter H(ω), controlled by spectral envelope parameters, to produce the speech wave S(ω).
The sound source is approximated by pulse and white noise sources, and the vocal tract articulation is represented by the filter characteristics of the all-pole model or the pole-zero model. The overall spectral characteristics of the glottal wave are included in the vocal tract filter characteristics together with the radiation characteristics. Consequently, the spectral characteristic of G(ω) is flat, and H(ω) is a digital filter having time-variable coefficients, which includes the source spectral envelope and radiation characteristics in addition to the vocal tract filter characteristics. Since the temporal variation of the vocal tract shape during the utterance of continuous speech is relatively slow, the transmission characteristics of the time-variable parameter digital filter can be regarded as having nearly constant characteristics in short periods, such as those 10-30 ms in length.
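The cascade model S(ω) = G(ω)H(ω) can be sketched directly: a flat-spectrum impulse train as the source G(ω), and a cascade of two-pole resonator sections as the all-pole articulation filter H(ω). The formant frequencies and bandwidths below are illustrative /a/-like values, not parameters from the text:

```python
import numpy as np

def resonator(x, freq, bw, fs):
    """One formant as a two-pole (all-pole) digital resonator section."""
    r = np.exp(-np.pi * bw / fs)       # pole radius set by the bandwidth
    theta = 2.0 * np.pi * freq / fs    # pole angle set by the center frequency
    a1, a2 = 2.0 * r * np.cos(theta), -r * r
    y = np.zeros_like(x)
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] += a1 * y[n - 1]
        if n >= 2:
            y[n] += a2 * y[n - 2]
    return y

fs = 8000
# G(w): flat-spectrum source -- an impulse train at a 125 Hz fundamental.
excitation = np.zeros(fs // 4)
excitation[:: fs // 125] = 1.0
# H(w): cascade of all-pole formant sections (illustrative /a/-like values).
speech = excitation
for formant_hz, bandwidth_hz in [(800.0, 80.0), (1200.0, 90.0), (2500.0, 120.0)]:
    speech = resonator(speech, formant_hz, bandwidth_hz, fs)
```

Because the source and filter are fully separated, pitch can be changed by respacing the impulses while the spectral envelope, and hence the vowel quality, is left untouched.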
3.3 VOCAL TRACT TRANSMISSION MODEL
From the perspective of determining features as a linguistic sound, the most important of the three speech wave production mechanism subprocesses is vocal tract articulation. The vocal tract length of adults is roughly 15-17 cm, and the wavelengths λ of the speech wave in the vocal tract are roughly 35 cm and 7 cm at 1 kHz and 5 kHz, respectively. Furthermore, the equivalent radius of the vocal tract is less than 2 cm when the vocal tract cross section approximates a circle. Therefore, in the frequency range of less than 4-5 kHz, λ/4 is larger than the equivalent radius of the vocal tract. The vocal tract is thus appropriately analyzed as being a distributed parameter system of the one-dimensional acoustic tube whose cross section is continuously changing. This means that the transmission of the speech wave can be regarded as that of the plane wave.

Although the nasal tract actually exists as a part of the vocal tract, it is omitted from the present discussion of the principal vocal tract characteristics for simplicity purposes. Heat conduction losses, viscous losses, and leaky losses, which accompany sound wave transmission, are small enough to be neglected under normal conditions. These losses are therefore usually disregarded in the modeling. The vocal tract characteristics can be more precisely represented by equivalently locating these losses at the glottis and lips.

3.3.1 Progressing Wave Model
Sound wave transmission along the axis in a lossless one-dimensional sound tube featuring a nonuniform cross section can be represented by two simultaneous partial differential equations, which consist of the momentum equation and the mass conservation equation (Rabiner and Schafer, 1978):
$$-\frac{\partial P}{\partial x} = \frac{\rho}{A(x)}\,\frac{\partial U}{\partial t}$$

and

$$-\frac{\partial P}{\partial t} = \frac{\rho c^2}{A(x)}\,\frac{\partial U}{\partial x} \qquad (3.3)$$
Here, x is the distance from the glottis along the axis, U is volume velocity, P is sound pressure, A(x) is the area function of the vocal tract cross section, ρ is air density, and c is sound velocity. To accurately portray the vocal tract characteristics, let us now divide the vocal tract every Δx and approximate each segment by a small acoustic tube having a constant cross-sectional area (in other words, by a distributed parameter system). The length of Δx is determined by the frequency bandwidth F [Hz] of the speech wave with which we are concerned. Approximating the segment having a length of Δx by a set of distributed parameters requires that Δx be less than a quarter of the wavelength of the sound wave; that is, it is necessary that Δx ≤ c/4F. If F = 4 kHz, for example, Δx must be less than roughly 1 cm. The solution of the one-dimensional wave equation for volume velocity and sound pressure can then be represented by the linear combination of the forward propagation wave from the glottis toward the lips and the backward propagation wave. When the forward and backward waves and the size of the cross section of the nth (n ≥ 1) small segment from the lips are represented by f_n(t), b_n(t), and A_n, respectively, and when the propagation time for half of one section is represented by Δt (Δt = Δx/2c), the volume velocity and sound pressure can be given by
$$U_n(x,t) = f_n(t - x/c) - b_n(t + x/c)$$

and

$$P_n(x,t) = \frac{\rho c}{A_n}\left\{f_n(t - x/c) + b_n(t + x/c)\right\} \qquad (3.4)$$

The volume velocity relationship is expressed in Fig. 3.3. The equation for the sound pressure is obtained by putting the volume velocity equation into Eq. (3.3).
FIG. 3.3 Definition of forward and backward waves with respect to volume velocity at the nth cross section, and continuity condition for the volume velocity at the boundary between the (n - 1)th and nth sections.
Owing to the continuity of both volume velocity and sound pressure at the boundary, two equations can then be obtained:

$$f_n(t - \Delta t) - b_n(t + \Delta t) = f_{n-1}(t + \Delta t) - b_{n-1}(t - \Delta t)$$

and

$$\frac{1}{A_n}\left\{f_n(t - \Delta t) + b_n(t + \Delta t)\right\} = \frac{1}{A_{n-1}}\left\{f_{n-1}(t + \Delta t) + b_{n-1}(t - \Delta t)\right\} \qquad (3.5)$$
The continuity condition for the volume velocity is also indicated in Fig. 3.3. When the two equations expressed as Eq. (3.5) are combined, we get

$$b_n(t + \Delta t) = \frac{A_n - A_{n-1}}{A_n + A_{n-1}}\,f_n(t - \Delta t) + \frac{2A_n}{A_n + A_{n-1}}\,b_{n-1}(t - \Delta t) \qquad (3.6)$$

Since b_n(t + Δt) can be regarded as the reflection of f_n(t - Δt) at the boundary indicated in Fig. 3.3,

$$k_n = \frac{A_n - A_{n-1}}{A_n + A_{n-1}} \qquad (3.7)$$

is defined as the reflection coefficient. The reflection coefficient satisfies -1 ≤ k_n ≤ 1, since the area function has positive values. By modifying Eq. (3.7), we can then obtain

$$\frac{A_{n-1}}{A_n} = \frac{1 - k_n}{1 + k_n} \qquad (3.8)$$

When areas A_n and A_{n-1} at the two segments are equal, k_n = 0 and no reflection occurs. Let us now calculate the sum and difference of Eqs. (3.5) and divide them by 2. When Eq. (3.8) is applied to these results, we are left with

$$f_n(t - \Delta t) = \frac{1}{1 - k_n}\left\{f_{n-1}(t + \Delta t) + k_n b_{n-1}(t - \Delta t)\right\}$$

and

$$b_n(t + \Delta t) = \frac{1}{1 - k_n}\left\{k_n f_{n-1}(t + \Delta t) + b_{n-1}(t - \Delta t)\right\} \qquad (3.9)$$
When these equations are solved for f_{n-1}(t + Δt) and b_n(t + Δt), two fundamental equations are obtained:

$$f_{n-1}(t + \Delta t) = (1 - k_n)\,f_n(t - \Delta t) - k_n\,b_{n-1}(t - \Delta t)$$

and

$$b_n(t + \Delta t) = k_n\,f_n(t - \Delta t) + (1 + k_n)\,b_{n-1}(t - \Delta t) \qquad (3.10)$$

FIG. 3.4 Transmission model of acoustic waves in the vocal tract (D = time delay of 2Δt): (a) transmission with respect to volume velocity; (b) transmission with respect to sound pressure.
Figure 3.4(a) depicts the signal flow graph of these fundamental equations. When sound pressure is used as the fundamental quantity instead of volume velocity, the continuity equations become

$$f_n(t - \Delta t) + b_n(t + \Delta t) = f_{n-1}(t + \Delta t) + b_{n-1}(t - \Delta t)$$

and

$$A_n\left\{f_n(t - \Delta t) - b_n(t + \Delta t)\right\} = A_{n-1}\left\{f_{n-1}(t + \Delta t) - b_{n-1}(t - \Delta t)\right\} \qquad (3.11)$$

The fundamental equations are then expressed as

$$f_{n-1}(t + \Delta t) = (1 + k_n)\,f_n(t - \Delta t) - k_n\,b_{n-1}(t - \Delta t)$$

and

$$b_n(t + \Delta t) = k_n\,f_n(t - \Delta t) + (1 - k_n)\,b_{n-1}(t - \Delta t) \qquad (3.12)$$
The signal flow graph for these equations is indicated in Fig. 3.4(b). In both figures, the state at t + Δt depends only on the states of the sections adjacent to both sides at t - Δt. Therefore, the new state can be obtained by calculating these equations successively for each segment, and then by substituting these newly calculated values for all previous values. The calculation can be done every 2Δt. If Δx ≤ c/4F is satisfied as mentioned before, 2Δt = Δx/c ≤ 1/4F is satisfied, which means that the sampling theorem holds. This indicates that the sound wave propagation in the vocal tract can be completely described by area ratios or by equivalent reflection coefficients. This model is called Kelly's speech production model (Kelly, 1962).
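The junction computation above can be sketched in a few lines. The following is a minimal illustration, not the book's program: areas are listed from glottis to lips (an assumption of this sketch), and `scatter` applies the junction update with reflection coefficient k = (A_upstream − A_downstream)/(A_upstream + A_downstream); a uniform tube gives k = 0 everywhere, i.e., no reflection.

```python
import numpy as np

def reflection_coefficients(areas):
    """Reflection coefficients at the junctions between adjacent tube
    sections; k = 0 when neighbouring areas are equal (no reflection)."""
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])

def scatter(f_in, b_in, k):
    """One scattering step at a junction: the incident forward wave f_in
    and incident backward wave b_in produce the transmitted forward wave
    and the reflected backward wave (volume-velocity formulation)."""
    f_out = (1.0 - k) * f_in - k * b_in
    b_out = k * f_in + (1.0 + k) * b_in
    return f_out, b_out

# illustrative area function in cm^2 (values are hypothetical)
areas = [2.6, 2.6, 1.3, 0.65]
ks = reflection_coefficients(areas)
```

Running the scattering step once per junction every 2Δt, with appropriate terminations at glottis and lips, realizes the lattice of Fig. 3.4.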
3.3.2 Resonance Model

When parameter P is eliminated from Eq. (3.3), what is known as Webster's horn equation is obtained:

$$A(x)\frac{\partial}{\partial x}\left\{\frac{1}{A(x)}\frac{\partial U}{\partial x}\right\} = \frac{1}{c^2}\frac{\partial^2 U}{\partial t^2} \qquad (3.13)$$

Since the variables of this partial differential equation are separable, the following total differential equation can be formulated, assuming that U(x, t) = U(x)e^{jωt}:

$$A(x)\frac{d}{dx}\left\{\frac{1}{A(x)}\frac{dU(x)}{dx}\right\} + \frac{\omega^2}{c^2}U(x) = 0 \qquad (3.14)$$

Moreover, because the area A(x) is positive, we can derive the equation

$$\frac{d}{dx}\left\{\frac{1}{A(x)}\frac{dU(x)}{dx}\right\} + \frac{\lambda}{c^2 A(x)}U(x) = 0, \qquad \lambda = \omega^2 \qquad (3.15)$$

which is the Sturm-Liouville derivative equation. The transmission function of the vocal tract can be calculated when this equation is solved for U under a given area function A(x) and an appropriate boundary condition. The vocal tract transmission function can then be calculated by H(ω) = U(l, t)/U(0, t)|_ω based on the U values at x = 0 (glottis) and x = l (lips). The eigenvalue λ corresponds to the resonance angle frequency ω of the system. There are two methods for obtaining eigenvalues. One is the algebraic method, which solves the derivative equation by converting it into a difference equation. The other focuses on the calculus of variations. The resonance frequencies of the vocal tract are the formant frequencies described in Sec. 2.4. If the angle frequency of the nth formant and its frequency bandwidth are represented by ω_n and b_n,
FIG. 3.5 Contribution of each formant to the amplitude spectrum.
respectively, the amplitude spectral characteristics |H(ω)| can be written as

$$|H(\omega)| = \prod_{n} |V_n(\omega)| \qquad (3.16)$$

Here, |V_n(ω)| is the amplitude spectrum of the nth formant, which has three specific characteristics, as indicated in Fig. 3.5 (Fant, 1959):
1. The spectrum is almost flat at ω < ω_n.
2. It has a resonance peak at ω ≈ ω_n, the level of which is decided by ω_n/b_n.
3. It decreases at high frequency with an inclination of -12 dB/oct at ω > ω_n.
As is clear, the ω_n value controls not only the resonance position but also the spectral level of the high-frequency region. On the other hand, b_n primarily influences the spectral shape near ω_n. In the above-mentioned model, an impedance-matching connection with the sound source part is assumed, and the losses in the vocal tract are taken into account only equivalently by the backward propagation wave into the sound source part. The actual vocal tract wall is not completely rigid, however, but has a finite mass and resistance. This effect increases the resonance frequency and bandwidth, especially for the lower-order formants.
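The three characteristics of |V_n(ω)| can be verified with a single complex-conjugate pole pair. The sketch below is illustrative only (`formant_amp` and its normalization to unit gain at ω = 0 are our choices, not the book's): the pole is placed at -πb_n ± jω_n, which yields a flat low-frequency response, a peak near the formant whose height grows with ω_n/b_n, and a -12 dB/oct roll-off above it.

```python
import numpy as np

def formant_amp(f, fn, bn):
    """Amplitude |V_n| of one formant modeled as a complex-conjugate
    pole pair at frequency fn [Hz] with bandwidth bn [Hz], normalized
    to unity at f = 0."""
    s = 2j * np.pi * np.asarray(f, dtype=float)
    sigma = np.pi * bn            # magnitude of the pole's real part
    wn = 2 * np.pi * fn           # pole's imaginary part
    num = sigma**2 + wn**2
    den = (s + sigma)**2 + wn**2
    return np.abs(num / den)
```

Well above the resonance the amplitude falls as 1/f², so the ratio between amplitudes one octave apart approaches 1/4, i.e., -12 dB/oct.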
3.4 VOCAL CORD MODEL
The vocal cord sound source is comprised of five principal physical characteristics (Stevens, 1977):

1. The fundamental frequency fluctuates both rapidly and slowly.
2. The volume velocity variation in the fundamental period is almost exactly proportional to the temporal variation of the open area function at the glottis, and can be approximated by asymmetrical triangular waves.
3. For a strong voice, the glottal-closed interval increases and the triangular wave becomes sharper.
4. The frequency spectral envelope of the glottal wave has an inclination of -12 to -18 dB/oct.
5. Interaction with the vocal tract cannot be neglected in the frequency region below 500 Hz, and it influences the waveform at the onset of vocal cord vibration.
A two-mass model was investigated as a vocal cord vibration model which successfully expresses the actual vibration of human vocal cords (Ishizaka and Flanagan, 1972; Flanagan et al., 1975). In
FIG. 3.6 Configuration of two-mass model; cross section of glottis (Ag1 = area at d1 section; Ag0 = area in the neutral state at d1 section).
this model, the vocal cord is separated into two parts which are connected to each other in terms of stiffness k_c, as indicated in Fig. 3.6. The vocal cords are assumed to move in only the vertical direction for simplicity's sake. Several physiological conditions and actually measured values were used in the simulation experiment based on this model. Additionally, an equivalent electrical circuit including the coupling to the vocal tract was introduced to achieve a high level of simulation. The simulation experiment made clear the conditions for the occurrence of vocal cord vibration (oscillation condition), vibration modes, temporal variation of the glottal area and volume velocity, and the vibration frequency of the vocal cords. The results indicate that the variation rate of vocal cord vibration frequency according to the variation of subglottal pressure is 2-3 Hz per 1 cm H2O, and that it is only slightly influenced by the vocal tract shape. In addition, a strong correlation is observable between the vocal tract shape (resonance characteristics) and the vocal cord waveform. Furthermore, the phase difference between the vibration modes for the upper and lower
FIG. 3.7 Simulation of speech production for vowel /a/ using the two-mass model.
parts of the vocal cord is found to be between 0° and 60°. Finally, the model shows that vocal cord vibration can be determined by the subglottal pressure, vocal cord tension, glottal opening area during the neutral state, and the vocal tract shape. Figure 3.7 indicates the glottal area, vocal cord vibration waveform (glottal volume velocity), and sound pressure at the lips for the vowel /a/ produced by this model. These results correspond well with our knowledge of vocal cord vibration. A speech production physical model which combines vocal cord and vocal tract characteristics based on the above-mentioned model is outlined in Fig. 3.8 (Flanagan et al., 1980). During experimentation, each control parameter of this speech production model was estimated by the analysis-by-synthesis method (A-b-S method; see Sec. 4.5) so that the analyzed results of an actual speech wave and synthesized voice using this model fit as closely as possible in the logarithmic spectral and cepstral domains. This model is the first system capable of completely taking into account the effect of vocal
tract loss on the vocal cord vibration as well as the terminal effect of the glottis on the vocal tract characteristics. Consequently, the model is expected to fully contribute to the improvement of synthesized speech quality and to the progress of continuous speech recognition. As for the noise source, models of turbulent flow production which incorporate the interaction with the vocal tract have been investigated (Stevens, 1971; Flanagan et al., 1975). These models should be able to realize an increase in the consonant resonance bandwidth accompanying the increased loss resulting from turbulent sound source production. Such turbulent flow production upon constriction in the vocal tract and the diminishing of the turbulent flow caused by the release of this constriction are nonlinear hysteresis phenomena mediated by the Reynolds number. These phenomena will consequently require a highly complicated analysis.
Speech Analysis and Analysis-Synthesis Systems
4.1 DIGITIZATION

The speech signal, or speech wave, can be changed into a processible object by converting it into an electrical signal using a microphone. The electrical signal is usually transformed from an analog into a digital signal prior to almost all speech processing for two reasons (Oppenheim and Schafer, 1975). First, digital techniques facilitate highly sophisticated signal processing that cannot otherwise be realized by analog techniques. Second, digital processing is far more reliable and can be accomplished by using a compact circuit. The rapid development of computers and integrated circuits, in conjunction with the growth of digital communications networks, has encouraged the application of digital processing techniques to speech processing. Analog-to-digital conversion, commonly referred to as digitization, consists of the sampling, quantizing, and coding processes. Sampling is the process for depicting a continuously varying signal as a periodic sequence of values. Quantization involves approximately representing a waveform value by one of a
finite set of values. Coding concerns assigning an actual number to each value. For such a task, binary coding, which uses binary number representation, is usually used. These processes thus enable a continuous analog signal to be converted into a sequence of codes selected from a finite set.

4.1.1 Sampling

In the sampling process, an analog signal x(t) is converted into a sequence (sampled sequence) of values {x_i} = {x(iT)} at periodic times t_i = iT (i is an integer), as plotted in Fig. 4.1. Here, T [s] is called the sampling period, and its reciprocal, S = 1/T [Hz], is termed the sampling frequency. If T is too large, the original signal cannot be reproduced from the sampled sequence; conversely, if T is too small, useless samples for the original signal reproduction are included in the sampled sequence. Along these lines, Shannon-
FIG. 4.1 Sampling in the time domain.
Someya's sampling theorem for the relationship between the frequency bandwidth of the analog signal to be sampled and the sampling period was proposed as a means for resolving this problem (Shannon and Weaver, 1949). This sampling theorem says that when the analog signal x(t) is band-restricted between 0 and W [Hz] and when x(t) is sampled at every T = 1/2W [s], the original signal can be completely reproduced by

$$x(t) = \sum_{i=-\infty}^{\infty} x\!\left(\frac{i}{2W}\right)\frac{\sin\{2\pi W(t - i/2W)\}}{2\pi W(t - i/2W)} \qquad (4.1)$$

Here, x(i/2W) is a sampled value of x(t) at t_i = i/2W (i is an integer). Furthermore, 1/T = 2W [Hz] is called the Nyquist rate. For example, a regular telephone signal can be sampled every T = 1/8000 [s], since its bandwidth W is restricted under 4 kHz. The sampling frequency for digitally processing speech signals is usually set between 6 and 16 kHz. Even for several special consonants, setting the sampling frequency at 20 kHz is sufficient. For those signals whose frequency bandwidths are not known, a low-pass filter is used to restrict the bandwidths before sampling. When a signal is sampled contrary to the sampling theorem, aliasing distortion occurs, which distorts the high-frequency components of the signal, as shown in Fig. 4.2. The sampled signal, which is discontinuous in the time domain but still continuous in the amplitude domain, is called a discrete signal.

4.1.2 Quantization and Coding

During quantization, the entire continuous amplitude range is divided into finite subranges, and waveforms whose amplitudes are in the same subrange are assigned the same amplitude values. Figure 4.3 exemplifies the input-output characteristics of an eight-level (3-bit) quantizer, where Δ is the quantization step size. In
FIG. 4.2 Sampling in the frequency domain: (a) correct sampling (S ≥ 2W); (b) incorrect sampling (S < 2W).
this example, each code is assigned so that it directly represents the amplitude value. The quantization characteristics depend on both the number of levels and the quantization step size Δ. When the signal is assumed to be quantized by B [bit], the number of levels is usually set to 2^B to ensure the most efficient use of the binary code words. Δ and B must be selected together to properly cover the range of the signal. If we assume that |x_i| ≤ x_max, then we should set

$$2x_{max} = \Delta \cdot 2^B \qquad (4.2)$$

The difference between the sampled value after quantization x̂_i and the original analog value x_i, e_i = x̂_i - x_i, is called the
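A uniform quantizer obeying Eq. (4.2) takes only a few lines. This is an illustrative sketch (the mid-riser rounding rule, matching the characteristic of Fig. 4.3, is our assumption):

```python
import numpy as np

def quantize(x, b, x_max):
    """Uniform quantizer with 2**b levels covering [-x_max, x_max];
    the step size follows Eq. (4.2): 2*x_max = delta * 2**b."""
    delta = 2.0 * x_max / 2**b
    # index of the subrange, clipped to the available levels
    idx = np.clip(np.floor(np.asarray(x, dtype=float) / delta),
                  -2**(b - 1), 2**(b - 1) - 1)
    return (idx + 0.5) * delta   # reproduction value at the subrange centre
```

For in-range inputs the reproduction value lies at the centre of the subrange, so the error never exceeds Δ/2 in magnitude.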
FIG.4.3 An example of the input-output characteristics of eight-level (3-bit) quantization.
quantization error, quantization distortion, or quantization noise. It can be seen in Fig. 4.3 that the quantization noise satisfies

$$-\frac{\Delta}{2} < e_i \le \frac{\Delta}{2} \qquad (4.3)$$

when Δ and B are set to satisfy Eq. (4.2). A statistical model incorporating three characteristics can be assumed to serve as the quantization noise (Rabiner and Schafer, 1975). The first characteristic is that the quantization noise is a stationary white noise process. The second is that the quantization
noise is uncorrelated with the input signal. The third is that the distribution of quantization errors is uniform over each quantization interval, and that the following equation is satisfied since all quantization intervals have the same length:

$$p(e_i) = \begin{cases} \dfrac{1}{\Delta}, & -\dfrac{\Delta}{2} < e_i \le \dfrac{\Delta}{2} \\ 0, & \text{otherwise} \end{cases} \qquad (4.4)$$
Whentheabove-mentioned satisfied,
assumptionsandEq.
(4.2) are
Therefore. SNR
=
3 x 22B (-7cnzux/Ox)
or, when represented in the dB scale, SNR [dB] = 10 loglo
(3)
When the quantization range is set to xmaX= 40,, SNR [dB] = 6B
-
7.2
(4.7)
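The 6B − 7.2 dB rule can be checked empirically. The sketch below is ours (synthetic Gaussian input clipped to the 4σ_x range, B = 10 bits; all names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
b, x_max = 10, 1.0
sigma_x = x_max / 4.0                      # quantization range x_max = 4*sigma_x
x = rng.normal(0.0, sigma_x, 200_000)
x = np.clip(x, -x_max, x_max * 0.999999)   # keep samples inside the range

delta = 2.0 * x_max / 2**b                 # step size from Eq. (4.2)
xq = (np.floor(x / delta) + 0.5) * delta   # uniform quantization
snr_db = 10 * np.log10(np.mean(x**2) / np.mean((xq - x)**2))
# Eq. (4.7) predicts roughly 6*10 - 7.2 = 52.8 dB
```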
4.1.3 A/D and D/A Conversion
Conversion from analog to digital signals is called A/D conversion, and, conversely, the opposite process is known as D/A conversion. The low-pass filtering necessary before A/D conversion is also necessary after D/A conversion to remove the distortion present in the higher harmonic components. The relationship between the low-pass filter characteristics and the D/A conversion frequency must satisfy the same requirement as that fundamental to the sampling process. In speech signal processing, preemphasis, namely, the compression of the signal dynamic range by flattening the spectral tilt, is effective in raising the SNR. This is usually done by emphasizing the higher-frequency components by roughly 6 dB/oct prior to low-pass filtering for A/D conversion. Preemphasis can also be accomplished after A/D conversion through differential calculation or through application of the first-order digital filtering

$$H(z) = 1 - \alpha z^{-1} \qquad (4.10)$$
where α is set to a value close to 1. Maximizing the SNR as much as possible, however, necessitates that preemphasis be applied prior to A/D conversion. The process of adding a tilt of 6 dB/oct to reproduce the original spectral tilt is called deemphasis. Since the dynamic range of the speech wave is larger than 50 dB, 10 bits or more are necessary for A/D conversion. However, when block normalization is applied at every short period to normalize the amplitude variation by multiplying the speech wave by a constant value assigned to the short period, a sufficient quantization resolution can be obtained even at a bit rate of 6 to 7 bits. Since the peak factor of speech is 12 dB, the permissible maximum level of an A/D converter must be set 12 dB higher than the effective level of the input speech signal.
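Preemphasis by the filter of Eq. (4.10) and its inverse, deemphasis, each reduce to a one-line recurrence. A minimal numpy sketch (α = 0.97 is a typical choice of ours, not prescribed by the text):

```python
import numpy as np

def preemphasis(x, alpha=0.97):
    """First-order preemphasis H(z) = 1 - alpha*z^{-1}: boosts the
    high-frequency components by roughly 6 dB/oct."""
    x = np.asarray(x, dtype=float)
    return np.append(x[0], x[1:] - alpha * x[:-1])

def deemphasis(y, alpha=0.97):
    """Inverse filter 1/(1 - alpha*z^{-1}), restoring the original
    spectral tilt."""
    out = np.empty(len(y), dtype=float)
    acc = 0.0
    for i, v in enumerate(y):
        acc = v + alpha * acc
        out[i] = acc
    return out
```

Applying `deemphasis(preemphasis(x))` recovers `x` exactly, which is a convenient sanity check of the filter pair.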
4.2 SPECTRAL ANALYSIS

4.2.1 Spectral Structure of Speech
As discussed in Sec. 2.4, the speech wave is usually analyzed using spectral features, such as the frequency spectrum and autocorrelation function, instead of directly using the waveform. There are two important reasons for this. One is that the speech wave is considered to be reproducible by summing sinusoidal waves, the amplitude and phase of which change slowly. The other is that the critical features for perceiving speech by the human ear are mainly included in the spectral information, with the phase information not usually playing a key role. The power spectral density in a short interval, that is, the short-time spectrum of speech, can be regarded as the product of two elements: the spectral envelope, which changes slowly as a function of frequency, and the spectral fine structure, which changes rapidly. The spectral fine structure produces periodic patterns for voiced sounds but not for unvoiced sounds, as shown in Fig. 4.4 (Tohkura, 1980). The spectral envelope, or the overall spectral feature, reflects not only the resonance and antiresonance characteristics of the articulatory organs, but also the overall shape of the glottal source spectrum and the radiation characteristics at the lips and nostrils. On the other hand, the spectral fine structure corresponds to the periodicity of the sound source. Methods for spectral envelope extraction can be divided into parametric analysis (PA) and nonparametric analysis (NPA). In PA, a model which fits the objective signal is selected and applied to the signal by adjusting the feature parameters representing the model. On the other hand, NPA methods can generally be applied to various signals since they do not model the signals. If the model thoroughly fits the objective signal, PA methods can represent the features of the signal more effectively than can NPA methods. The major methods for analyzing the speech spectrum and spectral features are shown in Table 4.1 (Itakura and Tohkura, 1978).
Of these, linear predictive coding analysis will be described precisely in Chap. 5.
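The product structure of the short-time spectrum (envelope × fine structure) can be illustrated with a synthetic example. The sketch below is ours and purely illustrative: an impulse-train "source" is passed through a crude one-pole "vocal tract" impulse response, and the spectrum of the result is the product of the two component spectra.

```python
import numpy as np

n = 256
g = np.zeros(n)
g[::32] = 1.0                    # pseudo-periodic source: fine structure
h = 0.9 ** np.arange(n)          # crude one-pole envelope filter
# response = circular convolution of source and impulse response,
# so the spectra multiply: X(w) = G(w) H(w)
x = np.real(np.fft.ifft(np.fft.fft(g) * np.fft.fft(h)))
```

The magnitude spectrum of `x` shows harmonic lines at the source harmonics, with their heights shaped by the slowly varying envelope |H(ω)|.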
FIG. 4.4 Structure of short-time speech spectra for male voices when uttering vowel /a/ and consonant /tʃ/.
4.2.2 Autocorrelation and Fourier Transform

When a sampled time sequence is written as x(n) (n is an integer), its autocorrelation function φ(m) is defined as

$$\phi(m) = \frac{1}{N}\sum_{n=0}^{N-1-m} x(n)\,x(n+m) \qquad (4.11)$$
FIG. 4.4 (Continued)
where N is the number of samples in the short-time analysis interval. The length of the interval, NT (T is the sampling period), is usually set at around 30 ms. Specifically, intervals of around 20 and 40 ms often bring good results for female and male voices, respectively. The short-time spectrum S(ω) and φ(m) constitute the Fourier transform pair (Wiener-Khintchine theorem)

$$S(\omega) = \sum_{m=-\infty}^{\infty} \phi(m)\,e^{-j\omega m}$$
TABLE 4.1 Major Methods for Analyzing Speech Spectra and Their Principal Features

Type: NPA
  (i) Short-time autocorrelation; parameter: φ(m). Spectral envelope and fine structure are convoluted.
  (ii) Short-time spectrum; parameter: S(ω). Spectral envelope and fine structure are multiplied. Fast algorithm can be realized by FFT.
  (iii) Cepstrum; parameter: c(τ). Spectral envelope and fine structure can be separated in quefrency domain. Two FFTs and log transform are necessary.
  (iv) Band-pass filter bank; parameter: rms of filter output. Global spectral envelope can be obtained.
  (v) Zero-crossing analysis; parameter: zero-crossing rate. Formant freq. can be obtained by combination with (iv). Realized by simple hardware.

Type: PA
  (i) Analysis-by-synthesis; parameters: formant, bandwidth, etc. Precise modeling is possible. Accurate formant freq. can be obtained. Complicated iteration is necessary.
  (ii) Linear predictive coding. Simple all-pole spectrum modeling. Parameters can be estimated from autocorr. or covariance without iteration.
    (ii-a) Maximum likelihood method; parameter: a_i. Stability of synthesis filter is guaranteed. Time window is necessary. Number of calculations ∝ p².
    (ii-b) Covariance method; parameter: a_i. Stability of synthesis filter is not guaranteed. Suitable for short-time analysis. Number of calculations ∝ p³.
    (ii-c) PARCOR method; parameter: k_i. Normal equation can be solved by lattice filter. Equivalent to (a) and (b). Number of calculations ∝ p².
    (ii-d) LSP method; parameter: ω_i. Quantization and interpolation characteristics are good. Similar to formant. Number of calculations is slightly larger than for PARCOR.

NPA = nonparametric analysis; PA = parametric analysis; rms = root mean square; p = order of linear predictive coding model.
and

$$\phi(m) = \frac{1}{2\pi}\int_{-\pi}^{\pi} S(\omega)\,e^{j\omega m}\,d\omega \qquad (4.12)$$
where ω is a normalized angle frequency which can be represented by ω = 2πfT (f is a real frequency). S(ω) is usually computed directly from the speech wave using the discrete Fourier transform (DFT) facilitated by the fast Fourier transform (FFT) algorithm:

$$S(\omega_k) = \left|\sum_{n=0}^{N-1} x(n)\,e^{-j(2\pi kn/N)}\right|^2, \qquad \omega_k = \frac{2\pi k}{N} \qquad (4.13)$$

The autocorrelation function can also be calculated more simply by using the DFT (FFT), compared with the conventional correlation calculation method, when higher-order correlation elements are needed. With this method, the autocorrelation function is obtained as the inverse Fourier transform of the short-time spectrum, which is calculated by using Eq. (4.13).
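The FFT route to φ(m) amounts to a forward FFT, magnitude squaring, and an inverse FFT. A minimal sketch (zero-padding to 2N is our choice; it prevents the circular convolution from wrapping around):

```python
import numpy as np

def autocorr_fft(x, n_fft=None):
    """Short-time autocorrelation via the Wiener-Khintchine relation:
    inverse FFT of the power spectrum of the zero-padded frame."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    if n_fft is None:
        n_fft = 2 * n                             # avoids wrap-around
    spec = np.abs(np.fft.rfft(x, n_fft)) ** 2     # |X(w)|^2, Eq. (4.13)
    return np.fft.irfft(spec)[:n] / n             # (1/N) sum x(n) x(n+m)
```

The result matches the direct evaluation of Eq. (4.11) for lags 0 to N-1.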
Since these two requirements are actually contrary to each other, and because it is impossible to satisfy both, several compromise window functions have been proposed. Among these, the Hamming window W_H(n), defined as

$$W_H(n) = 0.54 - 0.46\cos\left(\frac{2\pi n}{N-1}\right), \qquad 0 \le n \le N-1 \qquad (4.14)$$
is usually used as the window function for speech analysis. The Hamming window is advantageous in that its resolution in the frequency domain is relatively high and its spectral leak is small, since the attenuation of the side lobe is more than 43 dB. On the other hand, a rectangular window, W_R(n) = 1 (0 ≤ n ≤ N - 1), which corresponds to the simple extraction of N sample points of the speech wave, has the largest frequency resolution, whereas the attenuation of its first side lobe is only 13 dB. The rectangular window thus is not suited to the analysis of a speech wave having a large dynamic range of spectral components. Another window, called the Hanning window,

$$W_N(n) = 0.5 - 0.5\cos\left(\frac{2\pi n}{N-1}\right) \qquad (4.15)$$

is also employed. Although the advantage of this window is that its higher-order side lobes are lower than those of the Hamming window, the attenuation of the first side lobe is only roughly 30 dB. The shapes of these windows and the spectra for 10 periods of 1-kHz sinusoidal waves extracted by using these windows are shown in Fig. 4.5. The relationship between the sampling period T [s], the number of samples for analysis N, and the nominal frequency resolution of the calculated spectrum Δf [Hz] is expressed as

$$\Delta f = \frac{1}{TN} \qquad (4.16)$$
FIG. 4.5 Major window functions (a) and the spectrum for the 10 periods of a 1-kHz sinusoidal wave extracted using each of the windows (b).
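The windows of Eqs. (4.14) and (4.15) and the leakage comparison of Fig. 4.5 can be reproduced numerically. A sketch under the same conditions as the figure (10 periods of a 1-kHz sine at 8-kHz sampling; the zero-padded FFT length is our choice):

```python
import numpy as np

def hamming(n_len):
    """Hamming window, Eq. (4.14)."""
    n = np.arange(n_len)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_len - 1))

def hanning(n_len):
    """Hanning window, Eq. (4.15)."""
    n = np.arange(n_len)
    return 0.5 - 0.5 * np.cos(2 * np.pi * n / (n_len - 1))

# spectral leakage: 10 periods of a 1-kHz sine sampled at 8 kHz
fs, f0, n_len = 8000, 1000, 80
x = np.sin(2 * np.pi * f0 * np.arange(n_len) / fs)
spec_rect = np.abs(np.fft.rfft(x, 4096))                 # rectangular window
spec_ham = np.abs(np.fft.rfft(x * hamming(n_len), 4096))  # Hamming window
```

Far from the 1-kHz peak, the Hamming spectrum lies well below the rectangular one, reflecting its greater side-lobe attenuation.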
From this, it is clear that the resolution increases in proportion to the length of the speech interval for analysis. For example, when T = 0.125 ms (8-kHz sampling) and N = 256 (32-ms duration),

$$\Delta f = \frac{10^3}{0.125 \times 256} = 31\;[\mathrm{Hz}] \qquad (4.17)$$
When the analysis window length increases, the frequency resolution increases as the time resolution decreases. On the other hand, when the analysis window length shortens, the time resolution increases as the frequency resolution decreases. These relationships can be easily understood from the fact that the multiplication of the waveform by a window function corresponds to the moving average of the spectrum in the frequency domain. Furthermore, when the waveform is multiplied by either the Hamming or the Hanning window, the effective analysis interval length becomes approximately 40% shorter, since the waveforms near both ends of the window are compressed, as indicated in Fig. 4.5. This results in a consequent 40% decrease in the frequency resolution. Hence, the multiplication of the speech wave by an appropriate window reduces the spectral fluctuation due to the variation of the pitch excitation position within the analysis interval. This is effective in producing stable spectra during the analysis of voiced sounds featuring clear pitch periodicity. Since multiplication by the window function decreases the effective analysis interval length, the analysis interval should be overlappingly shifted along the speech wave to facilitate tracing the time-varying spectra. The short-time analysis interval multiplied by a window function and extracted from the speech wave is called a frame. The length of the frame is referred to as the frame length, and the frame shifting interval is termed the frame interval. A block diagram of a typical speech analysis procedure is shown in Fig. 4.6. Also indicated at each stage are typical parameter values and examples of speech waves.

4.2.4 Sound Spectrogram
Sound spectrogram analysis is a method for plotting the time function of the speech spectrum using density plots. The special device used for measuring and plotting the sound spectrogram is called the sound spectrograph. Figure 4.7 is an example of sound
FIG. 4.6 Block diagram of a typical speech analysis procedure: speech wave → low-pass filter (cutoff frequency = 8 kHz) → A/D conversion (sampling and quantization; sampling frequency = 16 kHz, quantization bit rate = 16 bit) → windowing (Hamming, Hanning, etc.; frame length = 30 ms, frame interval = 10 ms, window length = frame length) → feature extraction → parametric representation (excitation parameters, vocal tract parameters). Typical parameter values and examples of speech waves at each stage are also indicated.
spectrograms for the Japanese word /ko:geN/, or 'plateau,' uttered by a male speaker. As indicated, the sound spectrogram provides two types of representations: light-and-dark and contour. Light-and-dark representations illustrate the magnitude of the frequency component by darkness; in other words, the darker areas reveal higher-intensity frequency components. With contour representations, as with contour maps, the magnitude is roughly quantized, and the area where the magnitude is in the same quantization level is produced with the same shade of darkness. Usually the bandwidth of the band-pass filter (see Sec. 4.4.1) for the frequency analysis, i.e., the frequency resolution, is either 300 Hz or 45 Hz, depending on the purpose of the analysis. When the frequency resolution is 300 Hz, the effective length of the speech analysis interval is roughly 3 ms, and when the resolution is 45 Hz, the length becomes 22 ms. Since this trade-off occurs between the frequency and time resolutions, the pitch structure of speech is indicated by a vertically striped fine repetitive pattern along the time axis in the case of the 300-Hz frequency resolution, and by a horizontally striped equally fine repetitive pattern along the frequency axis in the case of the 45-Hz resolution, as shown in Fig. 4.7. Many of the sound spectrograms originally produced by analog technology using the sound spectrograph are now produced by digital technology through computers and their printers. The digital method is particularly beneficial in that it permits easy adjustment of various conditions, and in that the spectrograms can be produced sequentially and automatically with good reproducibility.
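A digital spectrogram is simply a sequence of windowed short-time spectra. The sketch below is illustrative (frame length, step, and FFT size are our choices, not the spectrograph's fixed filters):

```python
import numpy as np

def spectrogram(x, frame_len, frame_step, n_fft=512):
    """Log-magnitude spectrogram from overlapping Hamming-windowed
    frames (a digital counterpart of the sound spectrograph)."""
    win = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // frame_step
    frames = np.stack([x[i * frame_step: i * frame_step + frame_len] * win
                       for i in range(n_frames)])
    mag = np.abs(np.fft.rfft(frames, n_fft, axis=1))
    return 20 * np.log10(mag + 1e-10)   # rows: frames, columns: freq bins
```

A short frame (wide-band analysis) resolves the pitch pulses in time; a long frame (narrow-band analysis) resolves the individual harmonics in frequency, reproducing the trade-off described above.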
4.3 CEPSTRUM

4.3.1 Cepstrum and Its Application
The cepstrum, or cepstral coefficient, c(τ) is defined as the inverse Fourier transform of the short-time logarithmic amplitude spectrum
Speech Analysis and Analysis-Synthesis Systems
FIG. 4.7 Examples of sound spectrograms for a male voice uttering the Japanese word /ko:geN/: (a) wide-band, light-and-shade; (b) wide-band, contour; (c) narrow-band, light-and-shade; (d) narrow-band, contour.
|X(ω)| (Bogert et al., 1963; Noll, 1964; Noll, 1967). The term cepstrum is essentially a coined word which includes the meaning of the inverse transform of the spectrum. The independent parameter for the cepstrum is called quefrency, which is obviously formed from the word frequency. Since the cepstrum is the inverse transform of the frequency-domain function, the quefrency becomes the time-domain parameter. The special feature of the cepstrum is that it allows for the separate representation of the spectral envelope and fine structure.

Based on the linear separable equivalent circuit model described in Sec. 3.2, voiced speech x(t) can be regarded as the response of the vocal tract articulation equivalent filter driven by a pseudoperiodic source g(t). Then x(t) can be given by the convolution of g(t) and the vocal tract impulse response h(t) as

x(t) = ∫₀ᵗ g(τ) h(t − τ) dτ   (4.17)

which is equivalent to

X(ω) = G(ω)H(ω)   (4.18)
where X(ω), G(ω), and H(ω) are the Fourier transforms of x(t), g(t), and h(t), respectively. If g(t) is a periodic function, |X(ω)| is represented by line spectra, the frequency intervals of which are the reciprocal of the fundamental period of g(t). Therefore, when |X(ω)| is calculated by the Fourier transform of a sampled time sequence for a short speech wave period, it exhibits sharp peaks with equal intervals along the frequency axis. Its logarithm log |X(ω)| is

log |X(ω)| = log |G(ω)| + log |H(ω)|   (4.19)
The cepstrum, which is the inverse Fourier transform of log |X(ω)|, is

c(τ) = F⁻¹[ log |G(ω)| ] + F⁻¹[ log |H(ω)| ]   (4.20)
where F is the Fourier transform. The first and second terms on the right side of Eq. (4.19) correspond to the spectral fine structure and the spectral envelope, respectively. The former is the periodic pattern, and the latter is the global pattern along the frequency axis. Accordingly, large differences occur between the inverse Fourier transform functions of both elements indicated in Eq. (4.20). Principally, the first function on the right side of Eq. (4.20) indicates the formation of a peak in the high-quefrency region, and the second function represents a concentration in the low-quefrency region from 0 to 2 or 4 ms. The fundamental period of the source g(t) can then be extracted from the peak at the high-quefrency region. On the other hand, the Fourier transform of the low-quefrency elements produces the logarithmic spectral envelope, from which the linear spectral envelope can be obtained through the exponential transform. The maximum order of low-quefrency elements used for the transform determines the smoothness of the spectral envelope. The process of separating the cepstral elements into these two factors is called liftering, which is derived from filtering.

When the cepstrum value is calculated by the DFT, it is necessary to set the base value of the transform, N, large enough to eliminate the aliasing similar to that produced during waveform sampling. The cepstrum then becomes

c(n) = (1/N) Σ_{k=0}^{N−1} log |X(k)| e^{j(2π/N)kn}   (0 ≤ n ≤ N − 1)   (4.21)
The process steps for extracting the fundamental period and spectral envelope using the cepstral method are given in Fig. 4.8, with examples of the extracted results shown in Fig. 4.9 (Noll,
FIG. 4.8 Block diagram of cepstrum analysis for extracting spectral envelope and fundamental period.
1967). The cepstrum values indicated in the latter figure are the squared values of the cepstrum c_n defined above.

4.3.2 Homomorphic Analysis and LPC Cepstrum
Cepstral analysis, which is the process of separating two convolutionally related properties by transforming the relationship into a summation, is a kind of homomorphic analysis or filtering (Oppenheim and Schafer, 1968). In general, homomorphic analysis implies signal processing which decomposes the nonlinear
FIG. 4.9 Examples of short-time spectra (left) and cepstra (right) for a male voice uttering '(r)azor.' Sampling frequency 10 kHz; Hamming window length 40 ms; frame interval 10 ms.
(non-additive) system into independent factors, similar to the filtering which differentiates linearly added signals. Homomorphic analysis makes use of several special methods to transform the relationship into an additive one.
Let us consider the cepstrum in a special case in which X(ω) = H(z)|_{z=exp(jωΔT)}. Here, H(z) is the z-transform of the impulse response of an all-pole speech production system estimated by the linear predictive coding (LPC) analysis method (see Chap. 5). Accordingly,

H(z) = 1 / ( 1 + Σ_{i=1}^{p} a_i z⁻ⁱ )   (4.22)
The definition and properties of the z-transform are described in Appendix A. Equation (4.22) means that the all-pole spectrum H(z) is used for the spectral density of the speech signal. This is accomplished by expanding the cepstrum into a complex form by replacing the DFT, logarithmic transform, and inverse discrete Fourier transform (IDFT) in Fig. 4.8 with a dual z-transform, complex logarithmic transform, and inverse dual z-transform, respectively (Atal, 1974). When this complex cepstrum for a time sequence x(n) is represented by ĉ_n, and the dual z-transforms of x(n) and ĉ_n are indicated by X(z) and Ĉ(z), respectively,

Ĉ(z) = log [ X(z) ]   (4.23)
If we now differentiate both sides of this equation with respect to z⁻¹ and then multiply by X(z), we have

X(z) Ĉ′(z) = X′(z)   (4.24)

where the primes denote differentiation with respect to z⁻¹. This equation permits the following recursive equations to be obtained:

ĉ₁ = −a₁
ĉ_n = −a_n − Σ_{k=1}^{n−1} (k/n) ĉ_k a_{n−k}   (1 < n ≤ p)   (4.25)
ĉ_n = −Σ_{k=n−p}^{n−1} (k/n) ĉ_k a_{n−k}   (p < n)
This cepstrum is referred to as the LPC cepstrum, since it is derived through the LPC model. The original cepstrum is sometimes called the FFT cepstrum to distinguish it from the LPC cepstrum.

Figure 4.10 compares the spectral envelope calculated using the cepstrum directly extracted from the waveform with that calculated using the LPC cepstrum (Furui, 1981). In this figure, the short-time spectrum and the spectral envelope extracted by LPC
FIG.4.10 Comparison of spectral envelopes by LPC, LPC cepstrum, and FFT cepstrum methods.
(maximum likelihood method) are also shown for reference. The spectral envelope derived from the LPC cepstrum clearly tends to follow the spectral peaks more strictly than does the spectral envelope obtained through the FFT cepstrum.
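The recursion that converts the predictor coefficients {a_i} into LPC cepstral coefficients can be sketched as follows (a minimal NumPy version under the sign convention A(z) = 1 + Σ a_i z⁻ⁱ used in this chapter; the example coefficients are hypothetical, and the cross-check against an FFT-based computation is an illustration of consistency, not a step of the book's procedure):

```python
import numpy as np

def lpc_to_cepstrum(a, n_ceps):
    """Recursively convert predictor coefficients {a_i} (A(z) = 1 + sum a_i z^-i)
    to the cepstrum of the all-pole model H(z) = 1/A(z)."""
    p = len(a)
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = -acc
    return c[1:]

a = np.array([-0.9, 0.64])     # hypothetical stable predictor (poles at |z| = 0.8)
c_rec = lpc_to_cepstrum(a, 8)

# Cross-check against the FFT cepstrum of log|H|; the factor 2 relates the
# one-sided (minimum-phase) cepstrum to the symmetric real cepstrum.
nfft = 4096
A = np.fft.fft(np.r_[1.0, a], nfft)
c_fft = 2.0 * np.fft.ifft(-np.log(np.abs(A))).real[1:9]
print(np.allclose(c_rec, c_fft, atol=1e-3))
```

The recursion needs only O(p·n) operations and no FFT, which is why the LPC cepstrum is widely used as a feature parameter.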
4.4 FILTER BANK AND ZERO-CROSSING ANALYSIS

4.4.1 Digital Filter Bank
The digital filter bank, more specifically, a set of band-pass filters, is one of the NPA techniques mentioned in Sec. 4.2.1. The filter bank requires a relatively small amount of calculation and is therefore quite suitable for hardware implementation. Since there is a definite trade-off between the time and frequency resolution of each band-pass filter, as indicated in Sec. 4.2.3, it is necessary to design the various parameters according to the purposes intended. Generally, the band-pass filters are arranged so that the center frequencies are distributed at equal intervals on the logarithmic frequency scale, taking human auditory characteristics into account, and so that the 3-dB attenuation points of adjacent filters coincide. The output of each band-pass filter is rectified, smoothed by rms (root-mean-square) value calculation, and sampled every 5 to 20 ms to obtain values which represent the spectral envelope.

The spectral analysis part of the sound spectrogram analysis described in Sec. 4.2.4 is usually performed using a single band-pass filter whose center frequency is continuously changed. There the recorded speech wave is iteratively played back and analyzed by the filter.
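A minimal software sketch of such a filter bank follows (SciPy Butterworth filters; the band count, frequency range, rectify-then-smooth chain, and the 1-kHz test tone are all illustrative assumptions — a hardware bank would fix these differently):

```python
import numpy as np
from scipy.signal import butter, lfilter

def filter_bank_envelopes(x, fs, n_bands=16, f_lo=200.0, f_hi=4000.0, frame_ms=10.0):
    """Log-spaced band-pass bank: filter, rectify, smooth, sample every frame_ms."""
    edges = np.geomspace(f_lo, f_hi, n_bands + 1)       # adjacent bands share edges
    hop = int(fs * frame_ms / 1000)
    sb, sa = butter(2, 50.0, btype='low', fs=fs)        # smoothing low-pass filter
    env = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        b, a = butter(2, [lo, hi], btype='bandpass', fs=fs)
        y = np.abs(lfilter(b, a, x))                    # rectification
        env.append(lfilter(sb, sa, y)[::hop])           # smooth and down-sample
    return np.array(env)                                # shape: (n_bands, n_frames)

fs = 10000
t = np.arange(fs // 2) / fs
x = np.sin(2 * np.pi * 1000 * t)                        # 1-kHz test tone
E = filter_bank_envelopes(x, fs)
print(E.shape, int(np.argmax(E.mean(axis=1))))          # the band around 1 kHz dominates
```

Each row of the result is one band's envelope sampled every 10 ms, i.e., a coarse spectral-envelope representation.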
4.4.2 Zero-Crossing Analysis

The zero-crossing number of the speech wave in a predetermined time interval, which is counted as the number of times when adjacent sample points have different positive and negative signs,
approximately corresponds to the frequency of the major spectral component. Based on this principle, formant frequencies can be estimated by zero-crossing analysis as follows. First, the speech wave is passed through a set of four or five octave band-pass filters, and the power and zero-crossing number of the rectified and smoothed output of each filter are measured at short intervals, such as 10 ms. When the power of a filter exceeds a predetermined threshold, this frequency range is regarded as having a formant, with the formant frequency being estimated by the zero-crossing rate. This zero-crossing rate can also be used to detect the periodicity of the sound source as well as to estimate the fundamental period. Although the zero-crossing analysis method is well suited to hardware implementation, its drawback is that it is sensitive to additive noise.
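The sign-change count itself is straightforward; a sketch (NumPy, hypothetical 500-Hz test tone — a pure tone at f Hz crosses zero about 2f times per second):

```python
import numpy as np

def zero_crossing_rate(x, fs, frame_ms=10.0):
    """Zero crossings per second in each frame (sign changes between neighbors)."""
    n = int(fs * frame_ms / 1000)
    rates = []
    for i in range(0, len(x) - n + 1, n):
        frame = x[i:i + n]
        crossings = np.count_nonzero(np.signbit(frame[:-1]) != np.signbit(frame[1:]))
        rates.append(crossings * fs / n)
    return np.array(rates)

fs = 10000
t = np.arange(fs) / fs
x = np.sin(2 * np.pi * 500 * t)     # a 500-Hz tone crosses zero ~1000 times/s
rates = zero_crossing_rate(x, fs)
print(rates.mean())
```

Applied to the rectified, smoothed output of each octave filter, half this rate approximates the dominant frequency in that band.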
4.5 ANALYSIS-BY-SYNTHESIS
Analysis-by-synthesis (A-b-S), presented in Fig. 4.11, is the process of determining the parameters which characterize a system based on an assumed signal production model (Bell et al., 1961). The model parameters are adjusted in the course of iterative feedback control so that the error between the observed value and that produced by the model is minimized. Important in A-b-S are selection of the assumed production model, the initial parameter values, the error evaluation measure, and the minimization algorithm. A-b-S is useful not only for speech parameter extraction but also for many applications in which a production model can be used.

During formant frequency extraction based on the A-b-S technique, the following parameters are adjusted: the first through the third or fourth formant frequencies and bandwidths, the fundamental frequency as well as the spectral envelope of the voice source, and the overall spectral compensation characteristics including the higher-order formant characteristics. The mean square error between the logarithmic power spectra of the modeled and observed speech is typically used as the error evaluating
FIG. 4.11 Principle of A-b-S method.
measure. Formant frequency extraction resolutions of ±10 Hz and ±20 Hz were respectively obtained experimentally for the first and second formants.

Although the A-b-S method is better than any other in principle, it is problematic in that considerable computation is required. Specifically, it needs a large number of iterations of feedback control during actual speech analysis because of the mutual interactions between the various parameter effects on spectral envelope production.
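The A-b-S loop can be illustrated with a deliberately tiny model: a single digital resonator (one formant) and an exhaustive parameter search in place of a true minimization algorithm. Everything here — the resonator form, the grid, the "observed" spectrum — is a hypothetical sketch, not the book's system:

```python
import numpy as np

def formant_spectrum(freqs, f_c, bw, fs):
    """Log power spectrum of one digital resonator (a single-formant model)."""
    r = np.exp(-np.pi * bw / fs)
    theta = 2.0 * np.pi * f_c / fs
    zinv = np.exp(-2j * np.pi * freqs / fs)
    H = 1.0 / ((1 - r * np.exp(1j * theta) * zinv) *
               (1 - r * np.exp(-1j * theta) * zinv))
    return np.log(np.abs(H) ** 2)

fs = 10000
freqs = np.linspace(0.0, 4000.0, 200)
target = formant_spectrum(freqs, 700.0, 80.0, fs)   # the "observed" spectrum

# A-b-S loop: adjust model parameters to minimize the log-spectral squared error
best = min((np.mean((formant_spectrum(freqs, f, b, fs) - target) ** 2), f, b)
           for f in range(500, 901, 10) for b in range(40, 161, 10))
print(best[1], best[2])
```

With several interacting formants the error surface is no longer separable per parameter, which is exactly why practical A-b-S needs many feedback iterations.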
4.6 ANALYSIS-SYNTHESIS SYSTEMS

4.6.1 Analysis-Synthesis System Structure
Analysis-synthesis is the process in which the speech wave is reproduced (synthesized) using voice source and articulation parameters. The parameters are extracted based on the linearly separable equivalent circuit for the speech production mechanism described in Sec. 3.2. These parameters designate four types of information:

1. Distinction between voiced sound (pulse source) and unvoiced sound (noise source)
2. Fundamental period or fundamental frequency of voiced sounds
3. Source amplitude
4. Linear filter (resonance) characteristics
The first three provide source information, whereas the last parameter set gives spectral envelope (articulation) information.

A careful investigation into the three principal procedures of speech analysis-synthesis systems is essential to ensure improved synthesized speech quality. The first is extracting those parameters which precisely convey only the important auditory information by neglecting the redundant information included in speech waves. The second is coding the feature parameters efficiently. The third is reproducing the original speech as precisely, clearly, and naturally as possible by using the coded feature parameters.
4.6.2 Examples of Analysis-Synthesis Systems
Major examples of speech analysis-synthesis systems are summarized in Table 4.2 (Itakura, 1981). As indicated, the prototype of the speech analysis-synthesis system is the vocoder, invented in 1939 by H. Dudley of Bell Laboratories (Dudley, 1939). The term vocoder is
FIG. 4.12 Structure of the (channel) vocoder.
an abbreviation for voice coder. The structure of the vocoder is diagrammed in Fig. 4.12, in which spectral analysis is applied to the speech wave through a band-pass filter bank at the analysis part (transmitter) (Schroeder, 1966). At the same time, the presence of periodicity and the fundamental period for the periodic signals are analyzed. These signals are then transmitted to the synthesis part (receiver), where source signals are produced by a pulse or noise generator, depending on the presence of periodicity. The source signals are amplitude-controlled at each frequency band and passed through band-pass filters similar to those of the transmitter. The output signals of the band-pass filters are then summed to reproduce the original speech. The word vocoder is widely used today to represent all speech analysis-synthesis systems. The original vocoder, which uses a band-pass filter bank for spectral analysis, is now referred to as the channel vocoder (Gold and Rader, 1967). Although the channel vocoder has been improved in quality through increasing the
number of channels, it is limited in its ability to reproduce natural speech. The formant vocoder is problematic in its accurate extraction of formant frequencies, and the correlation vocoder has difficulty in accurately reproducing the spectrum. With the pattern matching vocoder, phonemes in the speech wave are identified based on the time-frequency pattern of the band-pass filter output, with the phoneme symbols being transmitted (Smith, 1969). Although this technique realizes the highest compression rate, it presents several unsolved problems. One is how to extract the phonemes from continuous speech. Another is how to measure the similarity between input speech and reference patterns. Still another is how to synthesize natural speech based on the phoneme symbol sequence.

With the homomorphic vocoder, the spectral envelope is represented by the cepstral coefficients of lower-order quefrencies (for example, 30 elements) by using the method described in Sec. 4.3. In addition, the pitch estimation and voiced/unvoiced decision are made based on higher-order quefrency elements. At the synthesizer, an approximate value for the impulse response is produced by using the transmitted low-quefrency elements. Simultaneously, the excitation function (impulse sequence or random noise), which is produced based on pitch, voiced/unvoiced, and amplitude information, is convoluted with the impulse response. When the DFT of the lower-order quefrency elements is exponentially and inverse Fourier transformed, the zero-phase impulse response is obtained. If the lower-order quefrency elements are multiplied by the following lifter, the minimum phase impulse response is obtained:
l(n) = 1   (n = 0)
l(n) = 2   (0 < n < n₀)   (4.26)
l(n) = 0   (other n)
Experimental results indicate that the best-quality speech can be synthesized under the minimum phase condition, which is close to natural speech (Oppenheim, 1969).
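The lifter of Eq. (4.26) and the resulting minimum-phase impulse response can be sketched as follows (NumPy; the smooth cosine log-magnitude target is a hypothetical stand-in for a transmitted spectral envelope, and `n0 = 30` mirrors the 30-element example above):

```python
import numpy as np

def minimum_phase_lifter(n, n0):
    """Lifter of Eq. (4.26): l(0) = 1, l(n) = 2 for 0 < n < n0, 0 elsewhere."""
    l = np.zeros(n)
    l[0] = 1.0
    l[1:n0] = 2.0
    return l

def minimum_phase_response(log_mag, n0):
    """Impulse response whose log magnitude matches the liftered log spectrum."""
    ceps = np.fft.ifft(log_mag).real                       # real (symmetric) cepstrum
    ceps_min = ceps * minimum_phase_lifter(len(ceps), n0)  # fold onto causal part
    H = np.exp(np.fft.fft(ceps_min))                       # minimum-phase spectrum
    return np.fft.ifft(H).real

# Toy target: a smooth, symmetric log-magnitude curve (hypothetical envelope)
n_fft = 512
w = np.linspace(0.0, 2.0 * np.pi, n_fft, endpoint=False)
log_mag = 1.0 + np.cos(w)
h = minimum_phase_response(log_mag, n0=30)
err = np.max(np.abs(np.log(np.abs(np.fft.fft(h))) - log_mag))
print(err)
```

Doubling the quefrencies 0 < n < n₀ moves the anticausal half of the symmetric cepstrum onto the causal side, which is what makes the reconstructed response minimum-phase.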
Another speech synthesis method based on the homomorphic vocoder has surfaced, which uses a filter set directly approximating logarithmic amplitude characteristics (Imai and Kitamura, 1978). The synthesis filter set in this method is constructed through the cascade connection of several filters having the system function H_n(z). The synthesized voice is directly produced without transforming the cepstrum into an impulse response. The logarithmic amplitude characteristic of the filter constructed by cascade connection of these (n₀ + 1)-stage filters is

Σ_{n=0}^{n₀} c(n) cos nλ
Since the cepstrum multiplied by the lifter l(n) of Eq. (4.26) is used as c(n) in this equation, the synthesis filter features minimum-phase characteristics. It has been ascertained that high-quality synthesized voice can be obtained by using this method at a relatively low bit rate.

The analysis-synthesis systems based on the LPC method (linear predictive vocoder, maximum likelihood vocoder, PARCOR vocoder, LSP vocoder, etc.) offer a considerable number of advantages, which will be precisely described in Chap. 5.
4.7 PITCH EXTRACTION
In speech analysis-synthesis systems, it is necessary to extract source parameters in parallel with spectral envelope parameter extraction. The source parameters include the presence of vocal cord vibration (voiced/unvoiced), the fundamental frequency for voiced sound, and
source amplitude. Although the accurate extraction of the fundamental frequency (pitch extraction) has been one of the most important study concerns since the beginning of speech analysis research, no definite approach has yet been established.

This difficulty with pitch extraction stems from three factors. First, vocal cord vibration does not necessarily have complete periodicity, especially at the beginning and end of voiced sounds. Second, it is difficult to extract the vocal cord source signal from the speech wave separately from the vocal tract effects. Third, the dynamic range of the fundamental frequency is very large.

With these in mind, recent pitch extraction research has been undertaken from three viewpoints. One is how to reliably extract the periodicity of quasi-periodic signals. Another is how to correct the pitch extraction error owing to the disturbance of periodicity. The other is how to remove the vocal tract (formant) effects.

Major errors in pitch extraction are classified into double-pitch and half-pitch errors. The former are those errors occurring when extracting a frequency which is twice as large as the actual value. The latter are errors arising when extracting the half-value of the actual fundamental frequency. Which error is most apt to occur depends on the extraction method employed.

The major pitch extraction methods are outlined in Table 4.3 (Itakura, 1978). They can generally be grouped into waveform processing (I), correlation processing (II), and spectral processing (III). Group I is composed of methods for detecting the periodic peaks in the waveform. Group II methods are those most widely used in digital signal processing of speech, since the correlation processing is unaffected by phase distortion in the waveform, and since it can be realized by a relatively simple hardware configuration. Among the methods in Group III, the principle of pitch extraction using cepstral analysis has already been described in Sec. 4.3.
The modified correlation method and simplified inverse filter tracking (SIFT) algorithm (Markel, 1972), which are correlation methods, and the cepstral method are generally the most efficient since they explicitly remove the vocal tract effects. The modified correlation method will be described in detail in Sec. 5.4.
TABLE 4.3 Classification of Major Pitch Extraction Methods and Their Principal Features

I. Waveform processing
- Parallel processing method: Uses majority rule for pitch periods extracted by many kinds of simple waveform peak detectors.
- Data reduction method: Removes superfluous waveform data based on various logical processing and leaves only pitch pulses.
- Zero-crossing count method: Utilizes iterative patterns in waveform zero-crossing rate.

II. Correlation processing
- Autocorrelation method: Employs autocorrelation function of waveform. Applies center and peak clipping for spectrum flattening and computation simplification.
- Modified correlation method: Utilizes autocorrelation function for residual signal of LPC analysis. Computation is simplified by LPF and polarization.
- SIFT (simplified inverse filter tracking) algorithm: Applies LPC analysis for spectrum flattening after down-sampling of speech wave. Time resolution is recovered by interpolation.
- AMDF method: Uses average magnitude differential function (AMDF) for speech or residual signal for periodicity detection.

III. Spectrum processing
- Cepstrum method: Separates spectral envelope and fine structure by inverse Fourier transform of log-power spectrum.
- Period histogram method: Utilizes histogram for harmonic components in spectral domain. Pitch is decided as the common divisor for harmonic components.
The voiced/unvoiced decision is usually made using a method for pitch extraction since, for the sake of simplicity, the cues for the periodic/unperiodic decision are usually regarded as those for voiced/unvoiced decisions. The peak values of the autocorrelation or modified autocorrelation functions are generally implemented in the decision. Because these methods do not work effectively for unperiodic voiced sounds, improvement in decision accuracy has been attempted by employing several other parameters as additional cues (Atal and Rabiner, 1976). These parameters include the speech energy, zero-crossing rate, first-order autocorrelation function, first-order linear predictor coefficient, and energy of the residual signal.
5 Linear Predictive Coding (LPC) Analysis
5.1 PRINCIPLES OF LPC ANALYSIS
Since the term linear prediction was first coined by N. Wiener (Wiener, 1966), the technique has become popularly employed in a wide range of applications based on a number of formulations. This technique, first used for speech analysis and synthesis by Itakura and Saito (Itakura and Saito, 1968) and Atal and Schroeder (Atal and Schroeder, 1968), has produced a very large impact on every aspect of speech research (Markel and Gray, 1976). The importance of linear prediction stems from the fact that the speech wave and spectrum characteristics can be efficiently and precisely represented using a very small number of parameters. Additionally, these parameters are obtained by relatively simple calculation.

Let us express the discrete speech signal sampled at every ΔT [s] by {x_t} (t is an integer). When the frequency range of the speech signal is 0–W [Hz], ΔT must satisfy ΔT ≤ 1/2W [s]. Let us then
assume the following first-order linear combination between the present sample value x_t and the previous p samples:

x_t + a₁x_{t−1} + ... + a_p x_{t−p} = ε_t   (5.1)
where {ε_t} is an uncorrelated statistical variable having a mean value of 0 and a variance of σ². This linear difference equation means that the present sample value x_t can be linearly predicted using the previous sample values. That is, if the linearly predicted value x̂_t for x_t is represented by

x̂_t = −Σ_{i=1}^{p} a_i x_{t−i}   (5.2)

the following equation can be obtained from Eqs. (5.1) and (5.2):

ε_t = x_t − x̂_t   (5.3)

We thus consider Eq. (5.1) to be the linear prediction model having linear predictor coefficients {a_i}. ε_t is designated as the residual error. Let us now define the linear predictor filter as

F(z) = −Σ_{i=1}^{p} a_i z⁻ⁱ   (5.4)

and define X̂(z) <−> x̂_t and X(z) <−> x_t as the pairs of z-transforms and their sample values. The z-transform of Eq. (5.2) is then expressed by

X̂(z) = F(z)X(z)   (5.5)

Based on Eqs. (5.2) and (5.3), the linear prediction model in z-transform notation can be given by

X(z)(1 − F(z)) = E(z)   (5.6)
or

X(z)A(z) = E(z)   (5.7)

where

A(z) = 1 + Σ_{i=1}^{p} a_i z⁻ⁱ = 1 − F(z)   (5.8)
A(z) is called the inverse filter (Markel, 1972), and E(z) <−> ε_t. Based on these definitions, the linear predictive model using the linear predictor filter F(z) and inverse filter A(z) can be block diagrammed as in Fig. 5.1. The LPC analysis, that is, the process of applying the linear predictive model to the speech wave, minimizes the output σ² by adjusting the coefficients {a_i} of either the linear predictor filter or the inverse filter.

Based on the linear separable equivalent circuit model of the speech production mechanism (Sec. 3.2), the speech wave is regarded as the output of the vocal tract articulation equivalent filter excited by a vocal source impulse. The characteristics of the equivalent filter, which include the overall spectral characteristics of the vocal cords as well as the radiation characteristics, can be assumed to be passive and linear. The speech wave is then considered to be the impulse response of the equivalent filter, and,
FIG. 5.1 Linear prediction model block diagram.
therefore, the equivalent filter characteristics can be theoretically obtained as the solution of the linear differential equation. Accordingly, the speech wave can be predicted, and the speech spectral characteristics can be extracted by the linear predictor coefficients. Although linear predictive analysis is based on these assumptions, they actually vary and do not hold completely. This is because the vocal tract shape is temporally changing slowly, of course, and because the vocal source is not a single impulse but rather the iteration of impulses or triangular waves accompanied by noise sources.

5.2 LPC ANALYSIS PROCEDURE
Let us here consider the method for estimating the linear predictor coefficients {a_i} by applying the least mean square error method to Eq. (5.3). Specifically, let us determine the coefficients {a_i}_{i=1}^{p} so that the squared sum of the error ε_t between the sample values of x_t and the linearly predicted values x̂_t over a predetermined period of [t₀, t₁] is minimized. The total squared error β is

β = Σ_{t=t₀}^{t₁} ε_t² = Σ_{t=t₀}^{t₁} ( Σ_{i=0}^{p} a_i x_{t−i} )²   (5.9)

where a₀ = 1. Defining

c_{ij} = Σ_{t=t₀}^{t₁} x_{t−i} x_{t−j}   (5.10)
β can then be equivalently written as

β = Σ_{j=0}^{p} Σ_{i=0}^{p} a_i a_j c_{ij}   (5.11)
Minimization of β is obtained by setting to zero the partial derivative of β with respect to a_j (j = 1, 2, ..., p) and solving. Therefore, from Eq. (5.11),

Σ_{i=1}^{p} c_{ij} a_i = −c_{0j}   (j = 1, 2, ..., p)   (5.12)
The predictor coefficients {a_i} can be obtained by solving this set of p linear simultaneous equations. The known parameters c_{ij} (i = 0, 1, 2, ..., p; j = 1, 2, ..., p) are defined from the sample data by Eq. (5.10), which shows that the samples from t₀ − p to t₁ are essential to the solution. For the actual solution based on a sequence of N speech samples, {x_t} = {x₀, x₁, ..., x_{N−1}}, two specific cases have been investigated in detail. These are referred to as the covariance method and the autocorrelation method. The covariance method is defined by setting t₀ = p and t₁ = N − 1 so that the error is minimized only over the interval [p, N − 1], whereas all the N speech samples are used in calculating the covariance matrix elements c_{ij} (Atal and Hanauer, 1971). Accordingly, Eq. (5.12) is solved using

c_{ij} = Σ_{t=p}^{N−1} x_{t−i} x_{t−j}   (5.13)
The covariance method draws its name from the fact that c_{ij} represents the row i, column j element of a covariance matrix. The autocorrelation method is defined by setting t₀ = −∞ and t₁ = ∞, and by letting x_t = 0 for t < 0 and t ≥ N (Markel, 1972). These limits allow c_{ij} to be simplified as
c_{ij} = Σ_{t=−∞}^{∞} x_{t−i} x_{t−j} = r_{|i−j|}   (5.14)

Thus, a_i is obtained by solving

Σ_{i=1}^{p} r_{|i−j|} a_i = −r_j   (j = 1, 2, ..., p)   (5.15)

where

r_τ = Σ_{t=0}^{N−1−τ} x_t x_{t+τ}   (5.16)
Although the error ε_t is minimized over an infinite interval, equivalent results are obtained by minimizing it only over [0, N − 1]. This is because x_t is truncated to zero for t < 0 and t ≥ N by being multiplied by a finite-length window, such as a Hamming window. The autocorrelation method is so named from the fact that for the conditions stated, c_{ij} reduces to the definition of the short-term autocorrelation r_τ at the delay τ = |i − j|. Equation (5.15) can be expressed in matrix representation as

| r₀       r₁      ...  r_{p−1} |  | a₁  |       | r₁  |
| r₁       r₀      ...  r_{p−2} |  | a₂  |  = −  | r₂  |   (5.17)
| ...                    ...    |  | ... |       | ... |
| r_{p−1}  r_{p−2} ...  r₀      |  | a_p |       | r_p |
The p × p correlation matrix of the left term has the form of a Toeplitz matrix, which is symmetrical and has the same values along the lines parallel to the diagonal. This type of equation is called a normal equation or a Yule-Walker equation. Since the positive definiteness of the correlation matrix is guaranteed by the definition of the correlation function, an inverse matrix exists for the correlation matrix. Solving the equation then permits {a_i} to be obtained. On the other hand, the positive definiteness of the coefficient matrix is not necessarily guaranteed in the covariance method.

The equations for the covariance and correlation methods can be efficiently solved by the Cholesky decomposition method and by Durbin's recursive solution method, respectively. Durbin's method is equivalent to the PARCOR (partial autocorrelation) coefficient extraction process which will be presented later in Sec. 5.6. Although the covariance and autocorrelation methods give almost the same results when {x_t} is long (N >> 1) and stationary, their results differ when {x_t} is short and has temporal variations. The numbers of multiplications and divisions in Durbin's method are p² and p, whereas the numbers of multiplications, divisions, and square root calculations in the Cholesky decomposition are (p³ + 9p² + 2p)/6, p, and p. Assuming that p = 10, computationally, the former method is three times more efficient than the latter method.

In linear system identification in modern control theory, the process exemplified by Eq. (5.1) is called the autoregressive (AR) process, in which ε_t and x_t are the system input and output, respectively. This system is also referred to as the all-pole model since it has an all-pole system function.
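Durbin's recursive solution of the normal equation can be sketched as follows (a minimal NumPy implementation; the colored-noise test signal and all variable names are illustrative assumptions, and the direct Toeplitz solve is included only as a consistency check). Note that the reflection coefficients k produced along the way are the PARCOR coefficients mentioned above:

```python
import numpy as np

def levinson_durbin(r, p):
    """Durbin's recursion for Eq. (5.15); also yields PARCOR coefficients k."""
    a = np.zeros(p + 1)
    a[0] = 1.0
    e_p = r[0]                                 # prediction error power
    k = np.zeros(p)
    for i in range(1, p + 1):
        acc = r[i] + np.dot(a[1:i], r[i - 1:0:-1])
        k[i - 1] = -acc / e_p                  # reflection (PARCOR) coefficient
        a_new = a.copy()
        a_new[1:i + 1] = a[1:i + 1] + k[i - 1] * a[i - 1::-1][:i]
        a, e_p = a_new, e_p * (1.0 - k[i - 1] ** 2)
    return a[1:], k, e_p

# Cross-check against directly solving the normal equation
rng = np.random.default_rng(1)
x = np.convolve(rng.standard_normal(5000), [1.0, 0.5, 0.25])   # colored noise
r = np.array([np.dot(x[:len(x) - t], x[t:]) for t in range(5)])  # Eq. (5.16)
a_ld, k, e_p = levinson_durbin(r, 4)
a_direct = np.linalg.solve(
    np.array([[r[abs(i - j)] for j in range(4)] for i in range(4)]), -r[1:5])
print(np.allclose(a_ld, a_direct))
```

The recursion exploits the Toeplitz structure to reach the solution in O(p²) operations, which is the source of the efficiency advantage quoted above.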
5.3 MAXIMUM LIKELIHOOD SPECTRAL ESTIMATION

5.3.1 Formulation of Maximum Likelihood Spectral Estimation

Maximum likelihood estimation is the method used to estimate parameters which maximize the likelihood based on the observed
values. Here, the likelihood is the probability of occurrence of the actual observations (the speech samples) under the presumed parameter condition. The maximum likelihood method is better than any other estimation method in the sense that the variance of the estimated value is minimized when the sample size is sufficiently large. In order to accomplish maximum likelihood spectral estimation, let us make two assumptions for the speech wave (Itakura and Saito, 1968):

1. The sample value x_t can be regarded as a sample derived from a stationary Gaussian process characterized by the power spectral density f(λ). (Here, λ = ωΔT is the normalized angle frequency; i.e., λ = ±π corresponds to the frequency ±W.)
2. The spectral density f(λ) is represented by an all-pole polynomial spectral density function of the form
f(λ) = (σ²/2π) · 1 / ( A₀ + 2 Σ_{i=1}^{p} A_i cos iλ ) = (σ²/2π) · 1 / | Π_{i=1}^{p} (1 − s_i e^{−jλ}) |²   (5.18)

where s_i is a root of

z^p + a₁z^{p−1} + ... + a_p = 0   (5.19)
and A_i is defined as

A_i = Σ_{j=0}^{p−i} a_j a_{j+i}   (a₀ = 1)   (5.20)
Furthermore, σ² is the scaling factor for the magnitude of the spectral density, and p is the number of poles necessary for approximating the actual spectral density. Here, a pair of conjugate poles is counted as two separate poles.

Although assumption 1 is easily approved for unvoiced consonants, it is not readily so for voiced sounds having a pitch-harmonic structure. In actual speech, however, the glottal source usually features temporal variation and fluctuation, and, therefore, harmonic components are broadened in the spectral domain. Hence, assumption 1 can be accepted for the spectral envelope characteristics of both voiced and unvoiced sounds. Assumption 2 corresponds to the AR process described in the previous section. That is, the signal {x_t}, exhibiting the spectral density of Eq. (5.18), satisfies the relationship of Eq. (5.1) in the time domain. This correspondence can be understood if one traces back from Eq. (5.8) to Eq. (5.1).

Zeros are not included in the hypothesized spectral density for two reasons. First, the human auditory organs are sensitive to poles and insensitive to steep spectral valleys, such as those represented only by zeros (Matsuda, 1966). Second, removing zeros simplifies as well as facilitates the mathematical process and the parameter extraction procedure.

When {ε_t} is Gaussian, the logarithmic likelihood L(X|w) for the N sample sequence X = (x₀, x₁, ..., x_{N−1}) can be approximated by

L(X|w) ≅ −(N/2) { log σ² + (1/σ²) Σ_{τ=−p}^{p} A_τ v̂_τ }   (5.21)
92
Chapter 5
where W indicates the parameter set (a2,al, a2, . . ., aP)in Eq. (5.18). f(A) and i j T , which respectively express theshort-term spectral density (periodogram)andshort-termautocorrelation function for {x,}, are defined as
(5.22) and
(5.23) and rT in Eq. (5.14) are related by i j T = rJN. Equation (5.21) shows thatthelogarithmic likelihood fora given X can be p time delay approximatelyrepresented using only thefirst elements of the short-term autocorrelation function, { i j T } T = p . Let us maximize L ( X I w > with respect to a2 first. From aL(Xlw)/aa2 = 0, we obtain
ijT
a2 = J(Q1)Q 2 ) . . r= -p
Then (5.25) Therefore, the maximization of L ( X IW)with respect to {ai}+I p is attained by the minimization of J ( q , a2, . . ., Q J . Since J(Q1)Q,) . . .
)
Qp)
=
+ P
(5.26) ij=O
Analysis Predictive (LPC) Linear Coding
93
(ai>+f can be derived by solving thelinearsimultaneous equations
From Eqs. (5.24) and (5.26), P
2
(5.28)
= r=O
Since Eqs. (5.27) and (5.15) are equivalent, the {a_i}_{i=1}^{p} values obtained by Eq. (5.27) are equal to the values derived by the autocorrelation method. This means that linear predictive analysis employing the autocorrelation method and maximum likelihood spectral estimation solve the same passive linear system (the acoustic characteristics of the vocal tract, including the source and radiation characteristics) in the time domain and in the frequency domain, respectively. The maximum likelihood spectral estimation method is equivalent to the process of adjusting the coefficients to minimize the output power σ² when the input signal is passed through an adjustable pth-order inverse filter. Hence, this method is often referred to as the inverse filtering method (Markel, 1972).

5.3.2 Physical Meaning of Maximum Likelihood Spectral Estimation
The function f̂(λ) in Eq. (5.21) is restricted in that it takes on the form of Eq. (5.18). Without such a restriction, the f̂(λ) which maximizes Eq. (5.21) for a given f(λ) is equal to f(λ) (−π ≤ λ ≤ π). The maximum value of L is

L_max = −(N/2) { log 2π + (1/2π) ∫_{−π}^{π} log f(λ) dλ + 1 }    (5.29)

Therefore,

E_1(f, f̂) = ∫_{−π}^{π} 2 { f(λ)/f̂(λ) − log [f(λ)/f̂(λ)] − 1 } dλ    (5.30)

which is defined by 8π(L_max − L(X|ω̂))/N, becomes zero only when f̂(λ) = f(λ) (−π ≤ λ ≤ π); otherwise it has a positive value. Accordingly, E_1(f, f̂) can be regarded as a matching error measure when the short-term spectral density is substituted by a hypothetical spectral density f̂(λ). This means that the estimation of spectral information based on the maximum likelihood method corresponds to spectrum matching which minimizes this matching error measure, in the same way as the A-b-S method.

If the integrand of E_1 is represented as a function of d = log [f̂(λ)/f(λ)], it becomes

G_1(d) = 2 (e^{−d} + d − 1)    (5.31)

which is shown by the solid curve in Fig. 5.2. On the other hand, in the conventional A-b-S method, G_2(d) = d² has usually been used as the integrand for measuring the spectral matching error. In the region |d| < 1, G_1(d) and G_2(d) are almost the same. When d > 1 and d < −1, however, G_1(d) increases linearly and exponentially, respectively, as a function of d. G_2(d) is symmetrical around d = 0, whereas G_1(d) is unsymmetrical. This means that in spectral matching using the maximum likelihood method, the matching error for neglecting a local valley in f(λ) is evaluated as being smaller than that for neglecting a local peak having the same shape. The nonuniform weighting of the maximum likelihood method is preferred over uniform weighting, since the peaks play a dominant role in the perception of voiced speech.

The poles of the spectral envelope, z_i (i = 1, 2, . . ., p), can be obtained as the roots of the equation
FIG. 5.2 Comparison of the matching error measure in the maximum likelihood method, G_1(d), with that in the analysis-by-synthesis (A-b-S) method, G_2(d). d = log {f̂(λ)/f(λ)}; f̂(λ) = model spectrum; f(λ) = short-term spectrum.
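The asymmetry of the two error measures can be checked numerically. The following sketch is not from the original text; the function names are ours, and d is taken in natural-log units (the figure's dB axis would only rescale the argument):

```python
import math

def G1(d):
    # Maximum likelihood (Itakura-Saito) matching error integrand, Eq. (5.31),
    # with d = log(model spectrum / short-term spectrum)
    return 2.0 * (math.exp(-d) + d - 1.0)

def G2(d):
    # Conventional A-b-S squared-error integrand
    return d * d

# Near d = 0 the two measures nearly coincide; for d < -1 (a neglected peak)
# G1 grows exponentially, while for d > 1 (a neglected valley) it grows only
# linearly, approaching 2(d - 1).
```

This makes the text's point concrete: a spectral valley that the model fills in (d > 0) is penalized far less than a spectral peak that the model misses (d < 0).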
1 + Σ_{i=1}^{p} a_i z^{−i} = 0    (5.32)
in which complex poles correspond to quadratic resonances. Their resonance frequencies F_i and bandwidths B_i are given by the equations

F_i = (1/2πΔT) tan^{−1} (Im z_i / Re z_i)

and

B_i = −(1/πΔT) log |z_i|    (5.33)

where ΔT is the sampling period. The formants can be extracted by selecting the poles whose bandwidth-to-frequency ratios are relatively small.

Figure 5.3 compares the short-term spectral densities and the spectral envelopes estimated by the maximum likelihood method for the male and female vowel /a/ when the number of poles is
FIG. 5.3 Comparison of (a) short-term spectra and (b) spectral envelopes obtained by the maximum likelihood method.
varied between 6 and 12. It is evident that the major peaks in the short-term spectrum can be almost completely represented by f̂(λ) when the speech wave is band-limited between 0 and 4 kHz and p is set larger than or equal to 10.

Figure 5.4 exemplifies the time function of spectral envelopes for the Japanese test sentence beginning /bakuoNga/, or 'A whir is . . .,' uttered by a male speaker (Tohkura, 1980). Here, the Hamming window length is 30 ms, the frame period is 5 ms, and p is set at 12.
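The whole chain discussed so far — autocorrelations (5.23), the normal equations (5.27), the residual power (5.28), and formant extraction from the poles via (5.32)-(5.33) — fits in a few lines of modern numerical code. This NumPy-based sketch is ours, not the book's; the function names are assumptions:

```python
import numpy as np

def lpc_autocorrelation(x, p):
    """{a_i} and sigma^2 by the autocorrelation method (Eqs. 5.23, 5.27, 5.28)."""
    N = len(x)
    # short-term autocorrelation v_tau, Eq. (5.23)
    v = np.array([np.dot(x[:N - tau], x[tau:]) / N for tau in range(p + 1)])
    # simultaneous equations sum_j v_|i-j| a_j = -v_i, Eq. (5.27)
    R = np.array([[v[abs(i - j)] for j in range(p)] for i in range(p)])
    a = np.concatenate(([1.0], np.linalg.solve(R, -v[1:])))
    sigma2 = float(np.dot(a, v))          # residual power, Eq. (5.28)
    return a, sigma2

def formants(a, delta_t):
    """Resonance frequencies/bandwidths from the poles of 1/A(z), Eqs. (5.32), (5.33)."""
    z = np.roots(a)                       # roots of 1 + sum a_i z^-i = 0
    z = z[np.imag(z) > 0]                 # keep one pole of each conjugate pair
    F = np.arctan2(np.imag(z), np.real(z)) / (2 * np.pi * delta_t)
    B = -np.log(np.abs(z)) / (np.pi * delta_t)
    idx = np.argsort(F)
    return F[idx], B[idx]
```

For a single resonator with a pole at radius 0.95 and 500 Hz (8-kHz sampling), `formants` recovers the 500-Hz frequency and a bandwidth of roughly 130 Hz.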
FIG. 5.4 Time function of spectral envelopes for the Japanese phrase /bakuoNga/ uttered by a male speaker.
5.4 SOURCE PARAMETER ESTIMATION FROM RESIDUAL SIGNALS
Let us consider the spectral fine structure of the residual signal

ε_t = Σ_{i=0}^{p} a_i x_{t−i}    (a_0 = 1)    (5.34)

Since the fine structure is obtained by normalizing the short-term spectrum of the input speech, f(λ), by the spectral envelope f̂(λ), it is almost flat along the frequency axis and exhibits a harmonic structure for periodic speech. Therefore, the autocorrelation function of the residual signal, called the modified autocorrelation function, produces large correlation values at delays that are integer multiples of the fundamental period for voiced speech, whereas no specific correlation is demonstrated for unvoiced speech (Itakura and Saito, 1968). In this way, the vocal source parameters can be obtained using the modified autocorrelation function regardless of the spectral envelope shape.

The modified autocorrelation function w_τ can easily be calculated by the Fourier transform of f(λ)/f̂(λ) as follows:

w_τ = (1/2π) ∫_{−π}^{π} [f(λ)/f̂(λ)] cos τλ dλ = (1/σ²) Σ_{s=−p}^{p} A_s v̂_{τ−s}    (5.35)

where A_s is the correlation function of the linear predictor coefficients previously defined by Eq. (5.20). Equation (5.35) means that w_τ can be calculated by the convolution of the short-term autocorrelation function and {A_s}_{s=−p}^{p} for the input speech, followed by normalization by σ². w_τ can also be obtained by directly calculating the correlation function of ε_t using Eq. (5.34).
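As a minimal numerical sketch of this idea (ours, not the book's): invert-filter the speech with the predictor of Eq. (5.34), autocorrelate the residual, and pick the strongest peak in a plausible lag range as the pitch period.

```python
import numpy as np

def residual(x, a):
    """Residual signal of Eq. (5.34): eps_t = sum_{i=0}^{p} a_i x_{t-i}, a_0 = 1."""
    p = len(a) - 1
    eps = np.zeros(len(x))
    for t in range(len(x)):
        for i in range(min(p, t) + 1):
            eps[t] += a[i] * x[t - i]
    return eps

def modified_autocorrelation(eps):
    """Normalized autocorrelation of the residual (time-domain counterpart of Eq. 5.35)."""
    N = len(eps)
    w = np.array([np.dot(eps[:N - tau], eps[tau:]) for tau in range(N)])
    return w / w[0]

def pitch_period(eps, tau_min, tau_max):
    """Pick the delay with the largest residual correlation as the pitch period."""
    w = modified_autocorrelation(eps)
    return tau_min + int(np.argmax(w[tau_min:tau_max]))
```

Driving an all-pole filter with an 80-sample impulse train, the residual correlation peaks at exactly 80 samples, independent of the formant structure — which is the point of working on ε_t rather than on x_t.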
Since actual speech often features characteristics intermediate between the periodic and the aperiodic, the source characteristic function V(w_τ) is defined so that it expresses not merely voiced or unvoiced sound but also the intermediate characteristics between these sounds.

In the course of pitch extraction, low-pass filtering is widely applied to speech waves or residual signals to improve the resolution of the extracted pitch period. Low-pass filtering is effective for removing the influence of high-order formants and for compensating for the insufficiency of the time resolution arising in the autocorrelation function. The latter effect is especially important for pitch extraction using the modified autocorrelation function. The double-period pitch error due to insufficient time resolution can be considerably reduced by employing low-pass filtering.

Figure 5.5 exemplifies waveforms, autocorrelation functions, and short-term spectra for speech waves, residual signals, and their low-passed signals for the vowel /a/ uttered by a male speaker (Tohkura, 1980). The cutoff frequency of the low-pass filter is 900 Hz. Comparison of the correlation functions for the speech waves and for the residual signals shows that the latter, specifically, the modified autocorrelation function, is the more advantageous. When the correlation function for the speech waves is used, formant-related components, which become large when the harmonic components of the fundamental frequency and the formant frequencies are close together, cause errors in maximum value selection. On the other hand, when the correlation function for the residual signals is used, peaks are observed only at the fundamental period and at its integer multiples and are not affected by formants.

5.5 SPEECH ANALYSIS-SYNTHESIS SYSTEM BY LPC
The original speech wave can be reproduced based on the relationship x_t = x̂_t + ε_t, or X(z) = E(z)/A(z), using the speech synthesis circuit indicated in Fig. 5.6 and the residual signal ε_t as the sound source. For the purpose of reducing information, pulses
FIG.5.5 Waveforms, autocorrelation functions, and short-term spectra for a speech wave, a residual signal, and their low-pass filtered signals for the vowel /a/ uttered by a male speaker.
and white noise are utilized as sound sources to drive the speech synthesis circuit instead of employing ε_t itself. The pulses and white noise are controlled based on the source periodicity information extracted from ε_t. The control parameters of the speech synthesis circuit are thus the linear predictor coefficients {a_i}_{i=1}^{p}, the pulse amplitude A_v, and the fundamental period T for the voiced source. A_v
FIG.5.6 Speech synthesis circuit based on linear predictive analysis method.
and T are replaced with the noise amplitude A_n for the unvoiced source (Itakura and Saito, 1968).

The stability of the above-mentioned synthesis filter 1/A(z) must be carefully maintained since it has a feedback loop. Stability here, meaning that the output of the system for a finite input is itself finite, corresponds to the condition that the difference equation (5.1) has a stationary solution (see Appendix A.3). If the linear predictor coefficients are obtained through the correlation method of linear predictive analysis or through the maximum likelihood method, the stability of the synthesis filter is theoretically guaranteed. The reason for this is that the spectral density function f̂(λ) always becomes positive definite when the short-term autocorrelation function {v̂_τ}_{τ=0}^{p} is a positive definite sequence. During actual parameter transmission or storage, however, stability is not always guaranteed because of the quantization error. In such situations, there is no practical, clear criterion for the range of {a_i}_{i=1}^{p} which secures stability. This is one of the difficulties of using LPC or the maximum likelihood method in speech analysis-synthesis systems.

In order to minimize this problem, the spectral dynamic range, namely, the difference between the maximum and minimum values
(peaks and valleys) in the spectrum, should be reduced as much as possible. Effective for this purpose is the application of a 6-dB/oct high-emphasis filter or a spectral equalizer adapted to the overall spectral inclination. The stability problem, however, has finally been solved, theoretically as well as practically, by the PARCOR analysis-synthesis method described in the following section.

5.6 PARCOR ANALYSIS

5.6.1 Formulation of PARCOR Analysis
The same two assumptions made for the maximum likelihood estimation (see 5.3.1) are also made for the speech wave. When the prediction errors for the linear prediction of x_t and x_{t−m}, using the sampled values {x_{t−i}}_{i=1}^{m−1}, are written as

ε_{ft}^{(m−1)} = Σ_{i=0}^{m−1} a_i^{(m−1)} x_{t−i}    (a_0^{(m−1)} = 1)

and

ε_{bt}^{(m−1)} = Σ_{i=1}^{m} b_i^{(m−1)} x_{t−i}    (b_m^{(m−1)} = 1)    (5.36)

the PARCOR (partial autocorrelation) coefficient k_m between x_t and x_{t−m} is defined by

k_m = E{ε_{ft}^{(m−1)} ε_{bt}^{(m−1)}} / [E{(ε_{ft}^{(m−1)})²} E{(ε_{bt}^{(m−1)})²}]^{1/2}    (5.37)

This equation means that the PARCOR coefficient is the correlation between the forward prediction error ε_{ft}^{(m−1)} and the backward prediction error ε_{bt}^{(m−1)} (Itakura and Saito, 1971). The definitional concept behind the PARCOR coefficient is presented in block diagram form in Fig. 5.7. Since the prediction errors ε_{ft}^{(m−1)} and ε_{bt}^{(m−1)} are obtained after removing the linear effect of the sample values between x_{t−1} and x_{t−m+1} from x_t and x_{t−m}, k_m represents the pure or partial correlation between x_t and x_{t−m}.

When Eq. (5.36) is put into Eq. (5.37), the PARCOR coefficient sequence k_m (m = 1, 2, . . ., p) can be written as

k_m = [Σ_{i=0}^{m−1} a_i^{(m−1)} v_{m−i}] / [v_0 ∏_{j=1}^{m−1} (1 − k_j²)]    (5.38)
FIG. 5.7 Definition of PARCOR coefficients.
where v_i is the short-term autocorrelation function for the speech wave. Although this autocorrelation function should be written as v̂_i in line with the notation made thus far, it is written as v_i for simplicity's sake. k_1 is equal to v_1/v_0, i.e., to the first-order autocorrelation coefficient. This is also clear from the definition of k_m.

Using Eq. (5.38) and the fact that the prediction coefficients {a_i^{(m−1)}}_{i=1}^{m−1} and {b_j^{(m−1)}}_{j=1}^{m−1} constitute the solutions of the simultaneous equations

Σ_{i=0}^{m−1} a_i^{(m−1)} v_{j−i} = 0    (j = 1, 2, . . ., m−1)

and

Σ_{i=1}^{m} b_i^{(m−1)} v_{j−i} = 0    (j = 1, 2, . . ., m−1)    (5.39)

the following recursive equations can be obtained (m = 1, 2, . . ., p):

a_m^{(m)} = −k_m,    a_i^{(m)} = a_i^{(m−1)} − k_m a_{m−i}^{(m−1)}    (i = 1, 2, . . ., m−1)    (5.40)

Additionally, the following equation is obtained from Eq. (5.39):

b_i^{(m−1)} = a_{m−i}^{(m−1)}    (i = 1, 2, . . ., m−1)    (5.41)

Based on these results, the PARCOR coefficients {k_m}_{m=1}^{p} and the linear predictor coefficients {a_m}_{m=1}^{p} are obtained from {v_τ}_{τ=0}^{p} through the flowchart in Fig. 5.8 by using Eqs. (5.38) and (5.40). This iterative method is equivalent to Durbin's recursive solution for simultaneous linear equations. The numbers of multiplications, summations, and divisions necessary for this computation are roughly p(p + 1), p(p + 1), and p, respectively. When these computations are done using a short word length, the truncation error in the computation accumulates as the analysis
FIG. 5.8 Flowchart for calculating {k_m}_{m=1}^{p} and {a_m}_{m=1}^{p} from {v_τ}_{τ=0}^{p}.
progresses. In the iteration process, each k_m (m = 1, 2, . . ., p) is obtained one by one, whereas the a_i^{(m)} values change at every iteration. Finally, the a_m values are obtained as

a_m = a_m^{(p)}    (1 ≤ m ≤ p)    (5.42)
Since the normalized mean square error o2 is equal to up/vo fromitsdefinition, o2 can be calculated using PARCOR coefficients, instead of linear predictor coefficients, from
n P
o2 =
112=
(1
-
kt,?)
(5.43)
1
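The flowchart of Fig. 5.8 — Eqs. (5.38), (5.40), and (5.43) — can be sketched as follows. This NumPy-based code is our illustration, not the book's; the function name is an assumption:

```python
import numpy as np

def durbin(v):
    """PARCOR coefficients {k_m}, predictor {a_i^(p)}, and sigma^2 from v_0..v_p."""
    p = len(v) - 1
    a = np.zeros(p + 1)
    a[0] = 1.0
    k = np.zeros(p + 1)          # k[1..p] are used
    U = v[0]                     # residual power U_{m-1}, initially v_0
    for m in range(1, p + 1):
        # Eq. (5.38): k_m = (1/U_{m-1}) sum_{i=0}^{m-1} a_i^(m-1) v_{m-i}
        k[m] = np.dot(a[:m], v[m:0:-1]) / U
        # Eq. (5.40): a_i^(m) = a_i^(m-1) - k_m a_{m-i}^(m-1); a_m^(m) = -k_m
        a[1:m + 1] = a[1:m + 1] - k[m] * a[m - 1::-1]
        U *= (1.0 - k[m] ** 2)
    sigma2 = U / v[0]            # Eq. (5.43): sigma^2 = prod(1 - k_m^2)
    return k[1:], a, sigma2
```

Note that, as the text states, the first step reduces to k_1 = v_1/v_0, and the final normalized error is the product of the (1 − k_m²) factors.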
In order to derive {k_m}_{m=1}^{p} directly from the signal {x_t}, let us define the forward and backward prediction error operators A_m(D) and B_m(D) as

A_m(D) = Σ_{i=0}^{m} a_i^{(m)} D^i

and

B_m(D) = Σ_{i=1}^{m+1} b_i^{(m)} D^i    (5.44)

where D is the delay operator such that D^i x_t = x_{t−i}. Equations (5.36) can then be written as

ε_{ft}^{(m)} = A_m(D) x_t

and

ε_{bt}^{(m)} = B_m(D) x_t    (5.45)
From Eq. (5.40), we can arrive at the recursive equations

A_m(D) = A_{m−1}(D) − k_m B_{m−1}(D)

and

B_m(D) = D [B_{m−1}(D) − k_m A_{m−1}(D)]    (5.46)

where A_0(D) = 1 and B_0(D) = D.
Based on Eqs. (5.38), (5.45), and (5.46), the PARCOR coefficients {k_m} can subsequently be produced directly from the speech wave using a cascade connection of variable-parameter digital filters (partial correlators), each of which includes a correlator, as indicated in Fig. 5.9(a). Since E{(ε_{ft}^{(m−1)})²} = E{(ε_{bt}^{(m−1)})²}, the correlator can be realized by the structure indicated in Fig. 5.9(b), which consists of square, addition, subtraction, and division circuits and low-pass filters.

The process of extracting PARCOR coefficients using the partial correlators involves successively extracting and removing the correlations between adjacent samples. This is an inverse filtering process which flattens the spectral envelope successively. Therefore, when the number of partial correlators p is large enough, the correlation between adjacent samples, which corresponds to the overall spectral envelope information, is almost completely removed by passing the speech wave through the partial correlators. Consequently, the output of the final stage, namely, the residual signal, includes only the correlation between the distant samples which relates to the source (pitch) information. Hence, the source parameters can be extracted from the autocorrelation function of the residual signal, in other words, from the modified autocorrelation function.

The definition of the PARCOR coefficients confirms that |k_m| ≤ 1 is always satisfied. Furthermore, if |k_m| < 1, the roots of A_p(z) = 0 have also been verified to exist inside the unit circle, and, therefore, the stability of the synthesis filter is guaranteed (Itakura and Saito, 1971).
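The cascade of partial correlators can be sketched directly in code. This is our illustration, using a sample-estimate of the normalized correlation of Eq. (5.37) together with the lattice recursion (5.46); the function name is an assumption:

```python
import numpy as np

def parcor_lattice(x, p):
    """Extract PARCOR coefficients by a cascade of partial correlators."""
    ef = np.asarray(x, dtype=float).copy()        # eps_f^(0)_t = x_t
    eb = np.concatenate(([0.0], ef[:-1]))         # eps_b^(0)_t = x_{t-1}  (B_0 = D)
    k = []
    for _ in range(p):
        # Eq. (5.37), sample estimate: normalized correlation of the two errors
        km = np.dot(ef, eb) / np.sqrt(np.dot(ef, ef) * np.dot(eb, eb))
        k.append(km)
        # Eq. (5.46): A_m = A_{m-1} - k_m B_{m-1};  B_m = D(B_{m-1} - k_m A_{m-1})
        ef_new = ef - km * eb
        eb_new = np.concatenate(([0.0], (eb - km * ef)[:-1]))
        ef, eb = ef_new, eb_new
    return np.array(k)
```

By the Cauchy-Schwarz inequality the estimates satisfy |k_m| ≤ 1, mirroring the stability property stated above; for a first-order signal (x_t ≈ 0.9 x_{t−1}) the first stage extracts k_1 ≈ 0.9 and the later stages are nearly zero.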
FIG. 5.9 (a) PARCOR coefficient extraction circuit constructed by cascade connection of partial autocorrelators and (b) construction of each partial autocorrelator.
5.6.2 Relationship between PARCOR and LPC Coefficients
If either one of the sets {k_m}_{m=1}^{p} or {a_m}_{m=1}^{p} is given, the other can be obtained by iterative computation. For example, when {k_m}_{m=1}^{p} are given, {a_m}_{m=1}^{p} are derived by the iterative computations (m = 1, 2, . . ., p) using a part of Durbin's solution:

a_m^{(m)} = −k_m,    a_i^{(m)} = a_i^{(m−1)} − k_m a_{m−i}^{(m−1)}    (i = 1, 2, . . ., m−1)    (5.47)

On the other hand, {k_m}_{m=1}^{p} can be drawn from {a_m}_{m=1}^{p} using the iterative computations in the opposite direction (m = p, p−1, . . ., 2, 1) as indicated below, where the initial condition is a_i^{(p)} = a_i (1 ≤ i ≤ p):

k_m = −a_m^{(m)},    a_i^{(m−1)} = [a_i^{(m)} + k_m a_{m−i}^{(m)}] / (1 − k_m²)    (i = 1, 2, . . ., m−1)    (5.48)
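The two recursions (5.47) and (5.48) are exact inverses of each other, which a short sketch makes easy to verify. The code below is our illustration (function names assumed):

```python
import numpy as np

def parcor_to_lpc(k):
    """{k_m} -> {a_m} by the forward recursion, Eq. (5.47)."""
    a = np.array([1.0])
    for km in k:
        a_ext = np.concatenate((a, [0.0]))
        # a_i^(m) = a_i^(m-1) - k_m a_{m-i}^(m-1);  the reversal supplies a_{m-i}
        a = a_ext - km * a_ext[::-1]
    return a

def lpc_to_parcor(a):
    """{a_m} -> {k_m} by the backward recursion, Eq. (5.48)."""
    a = np.array(a, dtype=float)
    p = len(a) - 1
    k = np.zeros(p)
    for m in range(p, 0, -1):
        k[m - 1] = -a[m]                                   # k_m = -a_m^(m)
        prev = (a[1:m] + k[m - 1] * a[m - 1:0:-1]) / (1.0 - k[m - 1] ** 2)
        a = np.concatenate(([1.0], prev))
    return k
```

A round trip through both functions reproduces the original PARCOR set to machine precision.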
5.6.3 PARCOR Synthesis Filter

A digital filter which synthesizes the speech waveform employing PARCOR coefficients can be realized by the inverse process of the speech analysis incorporating partial autocorrelators. In other words, in the PARCOR synthesis process the correlation between sample values is successively restored to the residual signal; that is, resonance is added to the flat spectrum of the residual signal. More specifically, the synthesis filter features the inverse characteristics, 1/A_m(D), of the analysis filter A_m(D). Reversing the signal propagation direction for A in the recursive equation (5.46) produces the relationships

A_{m−1}(D) = A_m(D) + k_m B_{m−1}(D)

and

B_m(D) = D [B_{m−1}(D) − k_m A_{m−1}(D)]    (5.49)

Let us assume that the synthesis filters having the transmission characteristics of 1/A_m(D) and B_m(D) are already realized, as shown within the solid rectangle in Fig. 5.10. In order to attain a synthesized output y_t at the final output terminal, a signal A_m(D)y_t must be input to terminal a_m. This permits a signal B_m(D)y_t to appear at terminal b_m. Let us next construct a lattice filter based on Eq. (5.49), as indicated within the dashed rectangle, and connect it to the circuit within the solid rectangle. If these
FIG. 5.10 Principal construction features of synthesis filter using PARCOR coefficients.
combined circuits are viewed from terminal a_{m+1}, they exhibit an input-output relation of 1/A_{m+1}(D), since they produce output y_t for the input signal A_{m+1}(D)y_t. At the same time, the signal B_{m+1}(D)y_t appears at terminal b_{m+1}. Therefore, the structure indicated in Fig. 5.10 realizes one section (stage) of the PARCOR synthesis filter. Several equivalent transformations exist for this lattice filter, as indicated in Fig. 5.11.

A structural example of a speech analysis-synthesis system using PARCOR coefficients is presented in Fig. 5.12. Here, partial autocorrelators are used for the analysis. For comparison, Fig. 5.13 offers an example in which a recursive computation-based method is employed for the same purpose. When the synthesis parameters of the PARCOR analysis-synthesis system are renewed at time intervals (frame intervals) different from the analysis intervals, the speaking rate is modified without an accompanying change in the pitch (fundamental frequency).
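The stage-by-stage synthesis lattice described above can be sketched sample by sample. This code is our illustration of the inverse of the analysis recursion (5.46); the function name and state layout are assumptions:

```python
import numpy as np

def lattice_synthesis(e, k):
    """Synthesize y_t from the residual e_t with a PARCOR lattice filter."""
    p = len(k)
    b = np.zeros(p + 1)              # b[m] holds eps_b^(m) at the current time
    y = np.zeros(len(e))
    for t in range(len(e)):
        f = np.zeros(p + 1)          # f[m] = eps_f^(m) at time t
        f[p] = e[t]                  # the residual drives the top of the lattice
        # downward pass, Eq. (5.49): eps_f^(m-1) = eps_f^(m) + k_m eps_b^(m-1)
        for m in range(p, 0, -1):
            f[m - 1] = f[m] + k[m - 1] * b[m - 1]
        y[t] = f[0]                  # eps_f^(0) = synthesized sample
        # update backward errors (the delay D is built into the state update)
        for m in range(p, 0, -1):
            b[m] = b[m - 1] - k[m - 1] * f[m - 1]
        b[0] = y[t]                  # eps_b^(0)_{t+1} = y_t
    return y
```

For k = (0.6, −0.4), Eq. (5.47) gives A(z) = 1 − 0.84 z^{−1} + 0.4 z^{−2}, and the lattice impulse response indeed matches the direct-form filter 1/A(z) (1, 0.84, 0.3056, ...).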
5.6.4 Vocal Tract Area Estimation Based on PARCOR Analysis
The signal flow graph of Kelly’s speech synthesis model (Fig. 3.4(b) in Sec. 3.3.1) formally coincides with the speech synthesis digital
FIG.5.11 Equivalent transformations for lattice-type digital filter.
filter used in PARCOR analysis-synthesis systems (Fig. 5.12). In other words, the PARCOR coefficient k_m corresponds to the reflection coefficient κ_m. Also, the PARCOR lattice filter is regarded as an equivalent circuit for the vocal tract acoustic filter simulating the cascade connection of p equal-length acoustic tubes having different areas. Since a relationship exists between the reflection coefficient and the area function, as described in Sec. 3.3.1, the area function can be expected to be estimated from the PARCOR coefficients. However, several problems exist with this assumption.

From the speech production mechanism, the total system function S(z) for the speech production system is represented by
FIG. 5.12 Structural example of a speech analysis-synthesis system using PARCOR coefficients (partial autocorrelators used for the analysis).

FIG. 5.13 Structural example of a PARCOR analysis-synthesis system employing a recursive computation-based method.
the product of the system functions for source generation G(z), vocal tract resonance V(z), and radiation R(z) as

S(z) = G(z) V(z) R(z)    (5.50)
With PARCOR analysis, which is based on the linear separable equivalent circuit model described in Sec. 3.2, the vocal tract system function is obtained by assuming that the sound source consists of an impulse or random noise having a uniform spectral density. Therefore, the overall characteristics of S(z), including the source and radiation characteristics, are derived instead of V(z). Consequently, when the area function is calculated according to the formal coincidence between the PARCOR coefficient k_m and the reflection coefficient κ_m, the result differs widely from the actual area function.

Properly estimating the area function thus requires the removal of the effects of G(z) and R(z) from the speech wave prior to the PARCOR analysis, which is called inverse filtering or spectral equalization. Two specific methods have been investigated for inverse filtering.
1. First-order differential processing. As is well known, the frequency characteristics of the sound source G(z) and the radiation R(z) can be roughly approximated as −12 dB/oct and 6 dB/oct, respectively. Based on this approximation, the sound source and radiation characteristics can be canceled by 6-dB/oct spectral emphasis (Wakita, 1973). This is actually done by analog differential processing of the input speech wave {x_t} or by digital processing of the digitized speech. The latter is accomplished by calculating y_t = x_t − x_{t−1}, which corresponds to the filter processing of F(z) = 1 − z^{−1}.

2. Adaptive inverse filtering. On the assumption that the overall vocal tract frequency characteristics are almost flat and have hardly any spectral tilt, the spectral tilt in the input signal is adaptively removed at every analysis frame using the lower order correlation coefficients (Nakajima et al., 1978). When a first-order inverse filter is applied, the first-order correlation
coefficient, that is, the first-order PARCOR coefficient (k_1 = r_1 = v_1/v_0), is used to construct the filter F(z) = 1 − k_1 z^{−1}. This is achieved by the computation of y_t = x_t − k_1 x_{t−1} or by the convolution of the correlation coefficients. Using the convolution method, inverse filtering can easily be done even for second- or third-order critical damping inverse filtering.

Appropriate boundary conditions at the lips and the glottis must also be established for properly estimating the area function. For this purpose, two cases have been considered for vowel-type speech production in which the sound source is located at the glottis and no connection exists with the nasal cavity.
1. Case 1

Lips: The vocal tract is open to a field having an infinite area (that is, κ = 1), such that the forward propagation wave is completely reflected and the circuit is shorted (the terminating impedance ρc/A becomes 0 as the area A becomes infinite).

Glottis: The vocal tract is terminated by the characteristic impedance ρc/A_g. The backward propagation wave flows out to the trachea without reflection and causes a loss. The input signal is supplied to the vocal tract through this characteristic impedance (Wakita, 1973; Nakajima et al., 1978).

2. Case 2

Lips: The vocal tract is terminated by the characteristic impedance ρc/A_l. The forward propagation wave is emitted to the field without reflection and results in a loss.

Glottis: The vocal tract is completely closed (in other words, κ_{p+1} = −1), such that the backward propagation wave is completely reflected, and the input signal is supplied to the glottis as a constant flow source (Atal, 1970).

The vocal tract area ratio is successively determined from the lips in Case 1 and from the glottis in Case 2. Linear predictive analysis and PARCOR analysis correspond to Case 1. Comparing the results of
these two cases, which are usually quite different from each other, Case 1 seems to give the most reasonable results. For the final transformation from the area ratio to the area function, it is necessary to define the glottal area A_g so that the final results become similar to the actual values determined through x-ray photography and other techniques. The relationship between the area function and the PARCOR coefficients in Case 1 is shown in Fig. 5.14. The vocal tract area function estimated from the actual speech wave based on the above method has been confirmed to globally coincide with the results observed in x-ray photographs. Figure 5.15 compares spectral envelopes and area functions (unit interval length is 1.4 cm) estimated by applying adaptive inverse filtering for the five Japanese vowels uttered by a male speaker (Nakajima et al., 1978).

If the vocal tract area could ever be estimated automatically and precisely from the speech wave, the estimation method achieving this would certainly become a fundamental speech analysis method. Furthermore, this method would be extremely useful for analyzing the speech production process and for improving speech recognition and synthesis systems. Several problems remain, however, in achieving the necessary precision of the estimated area function. These warrant further investigation into properly modeling the source characteristics in the estimation algorithm.
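The pre-processing and the tube-model interpretation above can be sketched as follows. This is our illustration, not the book's procedure: the sign convention relating k_m to the section area ratio depends on the chosen boundary conditions, and we assume the common form A_{m+1}/A_m = (1 − k_m)/(1 + k_m); the function names are also assumptions:

```python
import numpy as np

def preemphasize(x):
    """6-dB/oct high emphasis y_t = x_t - x_{t-1} (first-order differencing),
    a rough cancellation of the -12 dB/oct source and +6 dB/oct radiation."""
    x = np.asarray(x, dtype=float)
    return np.concatenate(([x[0]], x[1:] - x[:-1]))

def area_function(k, a_start=1.0):
    """Successive acoustic-tube areas from PARCOR/reflection coefficients.

    Assumed convention (hypothetical in this sketch):
        A_{m+1} / A_m = (1 - k_m) / (1 + k_m)
    starting from an arbitrary reference area a_start, so only area *ratios*
    are meaningful, as the text notes.
    """
    areas = [a_start]
    for km in k:
        areas.append(areas[-1] * (1.0 - km) / (1.0 + km))
    return np.array(areas)
```

With all k_m = 0 the tube is uniform (all areas equal), and a single k_1 = 0.5 produces a 3:1 constriction — illustrating how the reflection coefficients encode the area profile up to an overall scale.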
5.7 LINE SPECTRUM PAIR (LSP) ANALYSIS

5.7.1 Principle of LSP Analysis

Although the PARCOR analysis-synthesis method is superior to other previously developed methods, its bit rate has a lower limit of roughly 2400 bps. If the bit rate falls below this value, the synthesized voice rapidly becomes unclear and unnatural. The LSP method was thus investigated to maintain voice quality at smaller bit rates (Itakura, 1975). The PARCOR coefficients are essentially
FIG. 5.14 Relationship between the area function and the PARCOR coefficients in Case 1.
FIG. 5.15 Examples of spectral envelopes and estimated area functions for five vowels: (a) overall spectral envelope for inverse filtering (source and radiation characteristics); (b) spectral envelope after inverse filtering (vocal tract characteristics).

parameters operating in the time domain, as are the autocorrelation coefficients, whereas the LSPs are parameters functioning in the frequency domain. Therefore, the LSP parameters are advantageous in that the distortion they produce is smaller than that of the PARCOR coefficients, even when they are roughly quantized and linearly interpolated.

As with PARCOR analysis, LSP analysis is based on the all-pole model. The polynomial expression for z, which is the
denominator of the all-pole model, satisfies the following recursive equations, as previously demonstrated in Eq. (5.46):

A_m(z) = A_{m−1}(z) − k_m B_{m−1}(z)

and

B_m(z) = z^{−1} [B_{m−1}(z) − k_m A_{m−1}(z)]    (5.51)

where A_0(z) = 1 and B_0(z) = z^{−1} (initial conditions).

Let us assume that A_p(z) is given, and represent the two A_{p+1}(z) types, P(z) and Q(z), under the conditions k_{p+1} = 1 and k_{p+1} = −1, respectively. The condition |k_{p+1}| = 1 corresponds to the case where the airflow is completely reflected at the glottis in the (pseudo) vocal tract model represented by the PARCOR coefficients. In other words, this condition corresponds to the completely open or completely closed termination condition. The actual boundary condition at the glottis is, however, the iteration of opening and closing as a function of vocal cord vibration. Since the boundary condition at the lips in the PARCOR analysis is a free field (k_0 = −1), as mentioned in the previous section, the present boundary condition sets the absolute values of the reflection coefficients to 1 at both ends of the vocal tract. This means that the vocal tract acoustic system becomes a lossless system which completely shuts in the energy. The Q value at every resonance mode in the acoustic tube thus becomes infinite, and delta function-like resonance characteristics (pairs of line spectra) which correspond to each boundary condition at the glottis are obtained. The number of resonances is 2p.

5.7.2 Solution of LSP Analysis

P(z) and Q(z) are given by

P(z) = A_p(z) − B_p(z)

and

Q(z) = A_p(z) + B_p(z)    (5.52)

Although P(z) and Q(z) are both (p+1)st-order polynomial expressions, P(z) has inversely symmetrical coefficients whereas Q(z) has symmetrical coefficients. Using Eq. (5.52), we get

A_p(z) = [P(z) + Q(z)] / 2    (5.53)

On the other hand, from the recursive equations of (5.51),

B_p(z) = z^{−(p+1)} A_p(z^{−1})    (5.54)

Continuing this transformation, we can derive the general equations

P(z) = 1 + (a_1 − a_p) z^{−1} + (a_2 − a_{p−1}) z^{−2} + ⋯ + (a_p − a_1) z^{−p} − z^{−(p+1)}

and

Q(z) = 1 + (a_1 + a_p) z^{−1} + (a_2 + a_{p−1}) z^{−2} + ⋯ + (a_p + a_1) z^{−p} + z^{−(p+1)}    (5.55)

If p is assumed to be even, P(z) and Q(z) are factorized as
P(z) = (1 − z^{−1}) ∏_{i=2,4,...,p} (1 − 2 z^{−1} cos ω_i + z^{−2})

and

Q(z) = (1 + z^{−1}) ∏_{i=1,3,...,p−1} (1 − 2 z^{−1} cos ω_i + z^{−2})    (5.56)
The factors 1 − z^{−1} and 1 + z^{−1} are found by calculating P(1) and Q(−1) after putting Eq. (5.55) into Eq. (5.52). The coefficients {ω_i} which appear in the factorization of Eq. (5.56) are referred to as the LSP parameters. The {ω_i} are ordered as

0 < ω_1 < ω_2 < ⋯ < ω_{p−1} < ω_p < π    (5.57)
The even-suffixed {ω_i} are proved to separate the elements of the odd-suffixed {ω_i}, and vice versa. In other words, the even-suffixed and odd-suffixed {ω_i} are interlaced. Furthermore, this interlacing is proved to correspond to the necessary and sufficient condition for the stability of the all-pole model H(z) = 1/A_p(z). Under the condition that p is odd, the LSP is obtained in the same way.

Using Eq. (5.53), the power transmission function for H(z) can be represented as

|H(e^{jω})|² = 2^{−p} { sin²(ω/2) ∏_{i=2,4,...,p} (cos ω − cos ω_i)² + cos²(ω/2) ∏_{i=1,3,...,p−1} (cos ω − cos ω_i)² }^{−1}    (5.58)

The first term in the braces approaches 0 when ω approaches 0 or one of the {ω_i} (i = 2, 4, . . ., p), and the second term approaches 0 when ω approaches π or one of the {ω_i} (i = 1, 3, . . ., p−1). Therefore,
when two LSP parameters, ω_i and ω_j, are close together and ω approaches both of them, the gain of H(z) becomes large and resonance occurs. Strong resonance occurs at a frequency ω when two or more ω_i's are concentrated near ω. That is, the LSP method represents the speech spectral envelope through the distribution density of p discrete frequencies {ω_i}.

Either of the following methods can be used to obtain the zeros of P(z) and Q(z) after deriving the coefficients of A_p(z), that is, the linear predictor coefficients {a_i}.
1. Root finding in algebraic equations. Equation (5.56) can be transformed into

P(z)/(1 − z^{−1}) = ∏_{i=2,4,...,p} (1 − 2 z^{−1} cos ω_i + z^{−2})    (5.59)

and similarly for Q(z)/(1 + z^{−1}). Then, by replacing (z + z^{−1})/2 = cos ω with x, the equations P(z)/(1 − z^{−1}) = 0 and Q(z)/(1 + z^{−1}) = 0 can be solved as a pair of (p/2)th-order algebraic equations with respect to x using the Newton iteration method.

2. DFT of the coefficients of the equations. The values of P(z) and Q(z) at z_n = e^{−j2πn/N} (n = 0, . . ., N) are first obtained through the DFT using the coefficients of P(z) and Q(z). Zeros can then be estimated by the interpolation of two points which produce a zero between them. The procedure for searching for the zeros is largely reduced by using the relationship 0 < ω_1 < ω_2 < ⋯ < ω_p < π. A value between 64 and 128 is considered large enough for N.
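For illustration (this sketch is ours), a general-purpose polynomial root finder can stand in for the Newton/DFT searches described above: build P(z) and Q(z) per Eqs. (5.52) and (5.54) and keep the root angles in (0, π).

```python
import numpy as np

def lsp_frequencies(a):
    """LSP parameters {omega_i} from LPC coefficients a = [1, a_1, ..., a_p].

    P(z) = A_p(z) - B_p(z) and Q(z) = A_p(z) + B_p(z), with
    B_p(z) = z^-(p+1) A_p(1/z), i.e., the reversed coefficient vector
    shifted by one delay. Roots of P include z = 1 and those of Q include
    z = -1; the remaining roots lie on the unit circle in conjugate pairs,
    whose positive angles are the LSPs.
    """
    a = np.asarray(a, dtype=float)
    P = np.concatenate((a, [0.0])) - np.concatenate(([0.0], a[::-1]))
    Q = np.concatenate((a, [0.0])) + np.concatenate(([0.0], a[::-1]))
    omegas = []
    for poly in (P, Q):
        for z in np.roots(poly):
            w = np.angle(z)
            if 1e-8 < w < np.pi - 1e-8:      # drop z = 1, z = -1, and negatives
                omegas.append(w)
    return np.sort(np.array(omegas))
```

For the stable second-order predictor A(z) = 1 − 1.6 z^{−1} + 0.8 z^{−2}, this yields the interlaced pair ω_1 = arccos 0.9 (from Q) and ω_2 = arccos 0.7 (from P), ordered as Eq. (5.57) requires.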
5.7.3 LSP Synthesis Filter

In LSP speech synthesis, a digital filter which corresponds to H(z) is constructed based on the LSP parameters (ω_1, ω_2, . . ., ω_p).
Linear Predictive Coding (LPC) Analysis
Since H(z) = 1/A_p(z), this transfer function can be realized by inserting a filter having a transfer function of A_p(z) − 1 into a negative feedback path in the signal flow graph, in the same way as in the LPC analysis-synthesis system (Itakura and Sugamura, 1979). Based on Eqs. (5.53) and (5.56), when p is even, we then have
A_p(z) − 1 = (1/2) [ (P(z) − 1) + (Q(z) − 1) ]
  = (1/2) [ (1 − z^{-1}) ∏_{i=2,4,…,p} (1 + c_i z^{-1} + z^{-2}) + (1 + z^{-1}) ∏_{i=1,3,…,p−1} (1 + c_i z^{-1} + z^{-2}) − 2 ]    (5.60)

Here,

c_i = −2 cos ω_i    (i = 1, 2, …, p)    (5.61)
A_p(z) − 1 can thus be constructed by a pair of trunk circuits which respectively correspond to odd and even values of i, as shown in Fig. 5.16(a). Each trunk circuit is a p/2-stage cascade connection of quadratic antiresonance circuits: 1 − 2 cos ω_i z^{-1} + z^{-2}. The outputs at the middle of each stage on each trunk are successively summed up, and the outputs at the final stage are added to or subtracted from the former value. The synthesis filter for odd p, represented in Fig. 5.16(b), is realized in the same way. The numbers of computations necessary for synthesizing one sample of speech using this synthesis filter are p multiplications and
FIG.5.16 Signal flow graph of LSP synthesis filter: (a) p = even; (b) p = odd.
3p + 1 additions or subtractions. Although the number of multiplications is roughly half that of the two-multiplication-type PARCOR synthesis filter, the number of registers for delay is roughly twice that of the latter. An example of an LSP analysis result for the complete Japanese test sentence /bakuoNga giNsekaino ko:geNni hirogaru/, or 'A whir is spreading over the plateau covered with snow,' is presented in Fig. 5.17. This figure indicates that LSPs are concentrated at the places where the speech spectrum is strong and that they resemble the movement of formants.
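The synthesis side must rebuild A_p(z) from the transmitted LSP parameters before filtering the excitation. The sketch below (illustrative names; it uses a plain direct-form all-pole filter rather than the antiresonance-trunk structure of Fig. 5.16) reconstructs A_p(z) = [P(z) + Q(z)]/2 from the ordered frequencies and then realizes H(z) = 1/A_p(z):

```python
import numpy as np
from numpy.polynomial import polynomial as npoly

def lpc_from_lsp(w):
    """Rebuild A_p(z) (coefficients of z^0 ... z^-p) from ordered LSP
    frequencies 0 < w_1 < ... < w_p < pi, p even. Odd-numbered frequencies
    belong to Q(z), even-numbered ones to P(z)."""
    w = np.sort(np.asarray(w, dtype=float))
    P = np.array([1.0, -1.0])                 # (1 - z^-1) factor of P(z)
    for wi in w[1::2]:                        # w_2, w_4, ..., w_p
        P = npoly.polymul(P, [1.0, -2.0 * np.cos(wi), 1.0])
    Q = np.array([1.0, 1.0])                  # (1 + z^-1) factor of Q(z)
    for wi in w[0::2]:                        # w_1, w_3, ..., w_{p-1}
        Q = npoly.polymul(Q, [1.0, -2.0 * np.cos(wi), 1.0])
    A = 0.5 * (P + Q)                         # A_p(z) = [P(z) + Q(z)] / 2
    return A[:-1]                             # the z^-(p+1) terms cancel exactly

def lsp_synthesize(a, excitation):
    """All-pole filtering y_t = x_t - sum_i a_i y_{t-i}, i.e. H(z) = 1/A_p(z)."""
    y = np.zeros(len(excitation))
    for t in range(len(excitation)):
        acc = excitation[t]
        for i in range(1, len(a)):
            if t - i >= 0:
                acc -= a[i] * y[t - i]
        y[t] = acc
    return y
```

Because the LSP ordering is the stability condition, any properly ordered frequency set yields a stable synthesis filter here.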
FIG.5.17 LSP analysis result in which time functions of power, spectrum and LSP parameters are given for the spoken Japanese sentence indicated.
5.7.4
Coding of LSP Parameters
Experimental studies on quantization characteristics (Sugamura and Itakura, 1981) have confirmed that if the distribution range of LSP parameters is considered in the quantization, the same spectral distortion can be realized with roughly 80% of the quantization bit rate of the PARCOR system. As for the interpolation characteristics, the interpolation distortion has been demonstrated to be maintainable even if the parameter renewal rate is roughly 75% of the rate for PARCOR parameters. As a result of the combination of these two effects, the LSP method produces the same synthesized sound quality using only roughly 60% of the bit rate needed by the PARCOR method. The advantages and disadvantages of the PARCOR and LSP methods are compared in summarized form in Table 5.1.

TABLE 5.1 Comparison of PARCOR and LSP Methods

PARCOR {k_i}
  Advantages:
    a. |k_i| < 1 ensures stability
    b. Directly extracted by lattice-type analysis
    c. k_i values are independent of analysis order
  Disadvantages:
    a. Poor interpolation characteristics
    b. Large spectral resolution variation over parameters
    c. Indirect correspondence to spectrum

LSP {ω_i}
  Advantages:
    a. ω_1 < ω_2 < … < ω_p ensures stability
    b. Good quantization and interpolation characteristics
    c. Close correspondence to formants
  Disadvantages:
    a. Computation amount for parameter extraction is slightly increased
    b. ω_i values depend on analysis order
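The good interpolation behavior listed for the LSP method rests on the ordering condition: a convex combination of two ordered LSP vectors is again ordered, so frame-to-frame interpolation can never produce an unstable synthesis filter. A minimal sketch (function name illustrative):

```python
import numpy as np

def interpolate_lsp(w_start, w_end, alpha):
    """Frame interpolation in the LSP domain (0 <= alpha <= 1). Because a
    convex combination of two increasing sequences is increasing, the
    ordering w_1 < w_2 < ... < w_p -- and hence the stability of the
    synthesis filter -- is preserved at every interpolation point."""
    return (1.0 - alpha) * np.asarray(w_start) + alpha * np.asarray(w_end)
```

The same property does not hold for, e.g., direct interpolation of predictor coefficients, which is one reason LSP parameters are preferred for coding.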
5.7.5
Composite Sinusoidal Model
The composite sinusoidal model (CSM) method is a speech analysis method closely related to the LSP method (Sagayama and Itakura, 1981). In the CSM method, the autocorrelation of the signal r_τ is represented by a linear combination of p/2 cosine waves as

r_τ = Σ_{i=1}^{p/2} m_i cos(λ_i τ)    (τ = 0, 1, …, p − 1)    (5.62)

where m_i is a nonnegative coefficient called the CSM magnitude, and λ_i is termed the CSM frequency. The parameter set {m_i, λ_i} is uniquely determined from r_τ (τ = 0, 1, …, p − 1).

5.7.6
Mutual Relationships between LPC Parameters
The mutual relationships between the parameters obtained based on all-pole spectral modeling (LPC modeling) are indicated in Fig. 5.18 (Itakura, 1981). For reference, the relationships between the LPC cepstral coefficients and the linear predictor coefficients {a_i} (i = 1, …, p) were described in Sec. 4.3.2. The relationship existing between the autocorrelation function for the impulse response of the all-pole system, r̃_τ, and {a_i} (i = 0, …, p) can be expressed as

Σ_{i=0}^{p} a_i r̃_{|τ−i|} = 0    (τ = 1, 2, …; a_0 = 1)    (5.63)

r̃_τ, which is often called the LPC correlation function, agrees with the autocorrelation function for the signal r_τ in the range of τ = 0 to p.
FIG. 5.18 Mutual relationships between the LPC parameters.
5.8
POLE-ZERO ANALYSIS
Although speech analysis based on the all-pole model has numerous advantages, the actual speech production models for nasal and consonant sounds are of the pole-zero type, having formants and antiformants. The glottal source wave is also considered to have zeros in its spectrum. Therefore, it is more realistic to represent the speech production system function using both poles and zeros as

H(z) = B(z)/A(z) = [Σ_{j=0}^{q} b_j z^{-j}] / [1 + Σ_{i=1}^{p} a_i z^{-i}]
     = b_0 ∏_{i=1}^{q/2} (1 − v_i z^{-1})(1 − v_i* z^{-1}) / ∏_{i=1}^{p/2} (1 − u_i z^{-1})(1 − u_i* z^{-1})    (5.64)

Here, v_i and v_i* are conjugate zeros on the z-plane, u_i and u_i* are conjugate poles, and q and p are the numbers of zeros and poles, respectively, except for those at the origin and at infinite positions. A relationship then results between the input {x_t} and output {y_t}:

y_t = −Σ_{i=1}^{p} a_i y_{t−i} + Σ_{j=0}^{q} b_j x_{t−j}    (5.65)

Considering this relationship, the digital filter which synthesizes the speech wave based on Eq. (5.64) using a pulse train can be constructed as indicated in Fig. 5.19. The lower half of this figure is identical to the synthesis circuit based on the all-pole model (Sec. 5.5, Fig. 5.6). There are a number of variations in constructing H(z), such as the cascade or parallel connection of quadratic systems, each of which has a complex conjugate pole-zero pair.
FIG. 5.19 Speech synthesis circuit derived from pole-zero modeling.
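A direct time-domain realization of the input-output relationship of Eq. (5.65) can be sketched as follows (illustrative code, not the book's; coefficient conventions as in Eq. (5.64)):

```python
import numpy as np

def pole_zero_synthesize(x, a, b):
    """Direct realization of the pole-zero difference equation
    y_t = -sum_{i=1..p} a_i y_{t-i} + sum_{j=0..q} b_j x_{t-j},
    i.e. H(z) = B(z)/A(z) with A(z) = 1 + sum a_i z^-i, B(z) = sum b_j z^-j."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        acc = sum(b[j] * x[t - j] for j in range(len(b)) if t - j >= 0)
        acc -= sum(a[i - 1] * y[t - i] for i in range(1, len(a) + 1) if t - i >= 0)
        y[t] = acc
    return y
```

Setting b = [1] and an empty numerator beyond the constant recovers the all-pole synthesis of Sec. 5.5, while the zeros of B(z) introduce the antiresonances discussed above.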
Five principal methods have been proposed for parameter estimation based on the pole-zero model:

1. Homomorphic prediction (Oppenheim et al., 1976)
2. Iterative computation using inverse filtering (correlation matching) (Fukabayashi and Suzuki, 1977)
3. Iterative computation for poles and zeros using the inverse spectrum (Ishizaki, 1977)
4. Expansion of the Yule-Walker equation (application of the singular factorization method) (Morikawa and Fujisaki, 1984)
5. Maximum likelihood estimation for pole-zero parameters (Sagayama and Furui, 1977)
The pole-zero model is characteristic in that poles and zeros cancel each other. Furthermore, it is theoretically much more difficult to solve than the all-pole model, because nonlinear equations arise for the numerator terms even in the simplest case, when minimum square estimation is directly performed. These equations can be solved only by iterative methods, and convergence to the globally optimum values is not guaranteed. Although both inputs and outputs of the system are usually given in general linear system identification problems, in speech analysis the input wave (sound source) cannot be directly observed, and only the output wave is given. Therefore, an acceptable solution to the pole-zero model which can be reliably applied to actual speech has not yet been established.
Speech Coding
6.1 PRINCIPAL TECHNIQUES FOR SPEECH CODING

6.1.1 Reversible Coding
Principal coding techniques can be classified into reversible coding, which is not accompanied by information loss, and irreversible coding, which is. Reversible coding is based on Shannon's information source coding theory, which states that the coding efficiency is limited by the entropy of the information source (Jayant and Noll, 1984). This means that when the occurrence probability of each code is not homogeneous, the bit rate can be reduced by variable-length coding. In variable-length coding, therefore, the bit length of each code varies according to its entropy or occurrence probability: a short code is used for codes with a high occurrence probability, whereas a long code is used for those with a low occurrence probability. This coding, also referred to as entropy coding, is effective in raising the signal-to-noise ratio (SNR), especially when it is combined with uniform quantization. Shannon-Fano coding and Huffman coding are examples of entropy coding. Huffman coding has been ascertained to be the
optimum (compact) coding, since it achieves the minimum average code length when the occurrence probabilities are given. When a speech waveform of a relatively long period (block) is coded using Huffman coding, the approximate limit of the information source coding theory is realized. With Huffman coding, however, the complexity of both coding and decoding increases exponentially with the block length. In order to cope with this difficulty, an arithmetic coding method has been proposed in which the complexity of coding and decoding increases linearly with the block length. Arithmetic coding can also realize the approximate limit of the information source coding theory. With these coding methods, the probability distribution of the information source is assumed to be known. On the other hand, a universal coding method has been proposed to devise a coding method which approaches the limit without the need to know the probability distribution. However, this method is disadvantageous in that the block length must be large in order to achieve the proper compression effect. Variable-length coding, which is usually combined with one of several predictive coding methods, requires a time delay (buffering), making frame synchronization difficult.

6.1.2 Irreversible Coding and Information Rate Distortion Theory
Although no information is lost with reversible coding, a certain amount of distortion is usually permitted in speech coding as long as auditory comprehensibility is not impaired. Irreversible coding, which accompanies signal distortion, is based on the following rate distortion theory (Jayant and Noll, 1984). When a certain information source is coded so that the distortion is less than a certain value D, the average code length L for each information source exhibits the lower limit of L ≥ R(D). On the other hand, when the information rate R is given, the lower limit of quantization distortion D(R) exists. The lower limits
R(D) and D(R) are referred to as the rate distortion function and distortion rate function, respectively. The rate distortion theory has no practical application, however, for two reasons. First, R(D) and D(R) are very difficult to calculate except for very simple cases. Second, actual coding methods cannot be derived from this theory. The speech signal exhibits large redundancies owing to the physical mechanism of vocal-tract speech production and to the characteristics of the linguistic structure. The dynamic and frequency ranges of our hearing are restricted because of the physical mechanism of our auditory organs. As mentioned in Sec. 2.2, for example, an auditory masking phenomenon is involved in which low-frequency, high-level sound prevents the listener from hearing high-frequency sound existing simultaneously with the former. As a result of this phenomenon, low-level noise or distortion under the noise threshold, which is related to the spectral envelope of speech as shown in Fig. 6.1, cannot be heard (Crochiere and Flanagan, 1983). A strong formant tends to mask the noise in its frequency locality as long as the noise is about 15 dB below the signal. Using these redundancies and restrictions in both speech production and perception, the information for representing speech signals can be reduced to achieve highly efficient transmission or low-capacity storage.

6.1.3 Waveform Coding and Analysis-Synthesis Systems
Irreversible coding methods for speech signals can be divided into the waveform coding method and the analysis-synthesis method. In the waveform coding method, the waveform is represented as precisely as possible by the decreased amount of information. In the analysis-synthesis method, the speech wave is transformed into a set of parameters based on the speech production model. A brief comparison between these two methods is given in Table 6.1. The table also includes the hybrid coding, in which the waveform coding and the analysis-synthesis methods are combined.
TABLE 6.1 Comparison of the waveform coding, analysis-synthesis, and hybrid coding methods.
Figure 6.2 diagrams the relationship between the coding bit rate and speech quality for major coding methods. When telephone-bandwidth speech is quantized (coded by PCM) based on its amplitude variation characteristics, high-quality speech can be obtained at roughly 64 kbps. When the correlation characteristics of the waveform are used along with the spectral characteristics, the bit rate can be reduced to 32 or even 24 kbps. The bit rate can be further reduced to 9.6 kbps if we take into account the harmonic structure and apply noise shaping, which is a technique for controlling the distortion so that it remains below the noise threshold in all frequency bands. When the bit rate is reduced even further by using the waveform coding technique, the quality of coded speech rapidly decreases. On the other hand, although the analysis-synthesis method can reduce the bit rate to less than 1 kbps, it limits the quality possible, even if the amount of information is increased. A matrix (segment) quantization-based approach for very low bit rate coding at 200 to 300 bps has been investigated (see Sec. 6.4.5). From 2 to 16 kbps, hybrid methods combining the advantages of the waveform coding and analysis-synthesis methods have been investigated. These include the residual-, speech-, multi-pulse-, and code-excited LPC methods.

6.1.4
Basic Techniques for Waveform Coding Methods
The basic techniques for waveform coding methods are as follows.

(1) Nonlinear quantization. The amplitude is compressed by nonlinear transformation (logarithmic transformation, etc.), based on the statistical characteristics of the speech amplitude. Figure 6.3 shows examples of linear and nonlinear quantization characteristics.

(2) Adaptive quantization. The step width of the quantizer is varied according to the amplitude variation in order to cope with the nonstationarity of the speech amplitude dynamics.
FIG. 6.3 Input-output characteristics of linear and nonlinear quantization: (a) linear quantization; (b) nonlinear quantization.
(3) Predictive coding. The transmission bit rate can be compressed by utilizing the correlation between adjacent samples as well as distant samples in a speech wave. The difference between adjacent samples or the difference between predicted and actual values (prediction residual) is encoded. In the latter case, the predicted value is calculated based on the correlation throughout the sample sequence of a certain period.

(4) Time and frequency division. Speech information is divided into several time periods or several frequency bands, with the larger amount of information being allocated to large-amplitude periods or perceptually more important frequency bands.

(5) Transform coding. A speech wave of a certain period, such as 20 ms, which can be regarded as a stationary signal, is orthogonally transformed into the frequency domain by a method such as the DCT and then encoded. This method is based on the perceptual redundancy of a speech wave in the frequency domain.

(6) Vector quantization. Instead of coding individual samples, the information source sample sequence, group, or block is coded (quantized) all at once as a vector. The average code length per information source can be reduced to the approximate lower limit of R(D) using this technique.

Each of the above techniques, including coding system examples, will be explained in the following sections.
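As a minimal illustration of technique (6), the sketch below quantizes each input vector to the index of its nearest codeword; codebook design itself (e.g., by clustering of training vectors) is a separate step not shown, and the function names are illustrative:

```python
import numpy as np

def vq_encode(frames, codebook):
    """Nearest-neighbor vector quantization: each frame (row of a 2-D array)
    is replaced by the index of the closest codeword, so only
    log2(len(codebook)) bits per frame need to be transmitted."""
    d = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def vq_decode(indices, codebook):
    """The receiver simply looks the codewords up by index."""
    return codebook[indices]
```

The same encode/decode structure underlies the matrix (segment) quantization mentioned above, where whole parameter sequences form the vectors.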
6.2 CODING IN TIME DOMAIN

6.2.1 Pulse Code Modulation (PCM)
The simplest waveform coding method is linear pulse code modulation (PCM). In this method, analog signals are quantized
in homogeneous steps, similar to the usual A/D conversion. This method does not compress the information rate, since it uses no speech-specific characteristics. When the quantization step size and the range of the signal amplitude are indicated by Δ and L, respectively, the number of quantization bits B must satisfy Δ·2^B ≥ L, or B ≥ log₂(L/Δ) (see Sec. 4.1.2). Since the SNR of a PCM signal quantized by B bits is roughly 6B − 7.2 [dB] (Eq. (4.9)), the number of bits B must be decided so that the SNR of the quantized signal is larger than that of the signal before quantization. For example, a bit rate of roughly 100 kbps, in other words, 8-kHz sampling and 13-bit quantization, is necessary for quantizing 4-kHz-bandwidth telephone speech by linear PCM without producing detectable distortion arising from the quantization noise. PCM used in the ordinary telephone system is called log PCM because the amplitude is compressed by logarithmic transformation before linear quantization and coding. This transformation is based on the statistical characteristics of speech amplitude. Since the amplitude of a speech signal has an exponential distribution, the occurrence probability for each bit is equalized by the logarithmic transformation. Therefore, the distortion can be minimized as suggested by information theory. At the decoding stage, the amplitude is exponentially expanded. Two kinds of transformation formulae, μ-law and A-law, are usually used in actual PCM systems, which produce high-quality speech at 56 or 64 kbps. The actual difference between the two formulae is small. The μ-law compression formula can be written as

F(x_t) = x_max · [ln(1 + μ|x_t|/x_max) / ln(1 + μ)] · sgn(x_t)    (6.1)

Here, x_t is a sample value of the speech wave, x_max is the maximum permissible input level, and μ is the parameter controlling the amount of compression (Rabiner and Schafer, 1978). The larger μ becomes, the larger the amount of compression becomes. Typically, values between 100 and 500 are used for μ.
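A sketch of the μ-law companding pair of Eq. (6.1) and its inverse (illustrative code; μ = 255, the value used in North American and Japanese log PCM systems, is taken as a default here even though any value in the 100-500 range cited above behaves similarly):

```python
import numpy as np

def mu_law_compress(x, x_max=1.0, mu=255.0):
    """Eq. (6.1): logarithmic compression applied before uniform quantization."""
    return x_max * np.log1p(mu * np.abs(x) / x_max) / np.log1p(mu) * np.sign(x)

def mu_law_expand(y, x_max=1.0, mu=255.0):
    """Inverse transformation (exponential expansion) used at the decoder."""
    return x_max / mu * np.expm1(np.abs(y) / x_max * np.log1p(mu)) * np.sign(y)
```

Uniformly quantizing the compressed value y instead of x itself gives small amplitudes finer effective steps, matching the exponential amplitude distribution of speech.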
6.2.2 Adaptive Quantization
In order to utilize the nonstationarity of the dynamic characteristics of speech amplitude for improving the SNR of quantized speech, the quantization step size is varied according to the rms value of the amplitude. This method is called adaptive PCM (APCM) (Jayant, 1973, 1974; Schafer and Rabiner, 1975). Since the speech signal can be considered to be stationary for a short period, the step size can be varied relatively slowly. There are two methods for varying the step size. In the first method, which is called the forward (feedforward) adaptation method, the step size is changed at every block. In the second method, known as the backward (feedback) adaptation method, the step size is changed on a sample-by-sample basis according to the decoded samples. The principles of these two methods are indicated in Fig. 6.4. In the forward adaptation method, the optimum step size is decided according to the rms value calculated for every block, and is then transmitted to the receiver as side information. In the backward adaptation method, the step size does not need to be transmitted, since it can be automatically generated sample by sample by using reconstructed samples at both ends. Although the adaptation might be more effective for the forward adaptation method, this method has a higher bit rate because of the side information. Various algorithms for renewing the step size have been proposed for the latter method and used in combination with several coding methods, such as ADPCM and SBC, which will be explained later. The former method is usually used in combination with APC, which will also be described later.
6.2.3
Predictive Coding
A speech signal has a correlation between adjacent samples as well as between distant samples. Therefore, information compression
can be achieved by coding the difference between adjacent samples or the difference between the actual sample value and the predicted value calculated using the correlation (prediction residual). Since the difference and prediction residual have a smaller range of variation and smaller mean energy than the original signal, the quantization bits can be reduced. The method based on this principle is referred to as predictive coding, the actual structure of which is indicated in Fig. 6.5. When the prediction is performed according to linear prediction, described in Chap. 5, the prediction residual

d_t = x_t + Σ_{i=1}^{p} a_i x_{t−i}    (6.2)

is quantized and transmitted. In the simplest case of first-order linear prediction, the equation becomes d_t = x_t + a_1 x_{t−1}. If the predictor coefficient is simply set as a_1 = −1, the system merely transmits the difference between adjacent samples. This system is called differential PCM (DPCM). In order to cope with the problem of accumulated encoder error and to achieve the maximum prediction gain, the method indicated in Fig. 6.5 is used. In this method, a local decoder which is identical to the decoder of the receiver is located in the transmitter, and the difference between the input signal and the output of the local decoder is encoded instead of simply encoding the difference between the input signal and a linear combination of past samples. The reason for the improvement in SNR by predictive coding can be explained as follows. Since the output of the local decoder is equal to the encoder output at the receiver when no additive noise or transmission error is present, the equation

x̂_t = x_t + e_t    (6.3)

is obtained, where e_t is the quantization error of the residual signal. Therefore, the SNR of this system becomes
SNR = E[x_t²]/E[e_t²] = (E[x_t²]/E[d_t²]) · (E[d_t²]/E[e_t²])    (6.4)

where E[·] indicates the expectation value. This equation indicates that the SNR, which originally has the SNR value of the quantizer q = E[d_t²]/E[e_t²], is increased by the prediction gain G = E[x_t²]/E[d_t²]. The smaller the prediction residual becomes, the larger the prediction gain becomes. The prediction gain when the pth-order linear prediction indicated by Eq. (6.2) is used is represented as

G = [ 1 + Σ_{i=1}^{p} a_i r_i ]^{−1}    (6.5)

where r_i is the autocorrelation coefficient for x_t (r_0 = 1). The prediction gains for an actual speech signal with fixed prediction and adaptive prediction are shown in Fig. 6.6 (Noll, 1975). In the
FIG. 6.6 Prediction gain for actual speech signals, as a function of the number of predictor coefficients, for fixed and adaptive prediction.
former case, the predictor coefficients are set at fixed values. In the latter case, they are changed in accordance with the variation in the speech signal. Prediction gains of around 10 dB and 14 dB can be expected using fixed prediction and adaptive prediction, respectively. When p = 1, Eq. (6.5) becomes

G = (1 − r_1²)^{−1}    (6.6)

This means that G > 1 when 0 < |r_1| ≤ 1. In the case of DPCM, the equation

d_t = x_t − x_{t−1}    (6.7)

can be obtained, where

E[d_t²] = 2(1 − r_1) E[x_t²]    (6.8)

Therefore,

G = E[x_t²]/E[d_t²] = 1/(2(1 − r_1))    (6.9)

This means that G > 1, or, more specifically, that differential coding is effective when r_1 > 0.5. The above-mentioned prediction methods are specifically called spectral envelope or short-term prediction methods, since the prediction is based only on the adjacent 4 to 20 samples. On the other hand, the prediction between speech samples at pitch-period intervals, which will be described later, is called pitch prediction or long-term prediction. DPCM with adaptive prediction and/or adaptive quantization is referred to as adaptive differential PCM (ADPCM).
Similar to adaptive quantization, adaptive prediction is of two types: the forward type and the backward type. The former, in which optimum prediction is performed for every block of the speech signal, is specifically called adaptive predictive coding (APC). APC in the narrowest sense designates a coding system involving pitch prediction and two-level quantization of the prediction residual (Atal and Schroeder, 1970). Backward adaptive prediction, in which the predictor coefficients are modified sample by sample to reduce the prediction residual, is sometimes called adaptive predictive DPCM (AP-DPCM).
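The first-order prediction gain of Eq. (6.9) can be checked numerically. The sketch below is an illustration under the stated assumption that the input is a synthetic first-order autoregressive process with lag-one correlation r_1 = 0.9, standing in for real speech; it measures G = E[x_t²]/E[d_t²] for simple differencing:

```python
import numpy as np

rng = np.random.default_rng(0)
r1 = 0.9                                  # lag-one correlation of the test signal
n = 100_000
e = rng.standard_normal(n)
x = np.zeros(n)
for t in range(1, n):                     # AR(1) process: correlation r1 between neighbors
    x[t] = r1 * x[t - 1] + e[t]

d = x[1:] - x[:-1]                        # DPCM residual: difference of adjacent samples
G_measured = x.var() / d.var()            # prediction gain E[x^2]/E[d^2]
G_theory = 1.0 / (2.0 * (1.0 - r1))       # Eq. (6.9)
print(G_measured, G_theory)               # both approximately 5
```

Since r_1 = 0.9 > 0.5, differencing pays off here; repeating the experiment with r_1 < 0.5 would give G < 1, i.e., the residual is then noisier than the signal itself.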
6.2.4
Delta Modulation
Delta modulation (DM or ΔM) is an extreme method of differential quantization, in which the sampling frequency is raised so high that the difference between adjacent samples can be approximated by a 1-bit representation. This method, advantageous in its simple structure, is based on the fact that the correlation between adjacent samples increases as a function of the sampling frequency, except for uncorrelated signals. As the correlation increases, the prediction residual decreases. Therefore, a coarse quantization can be used when signals are sampled at a high frequency. A high prediction gain can thus be obtained by such a differential coding structure. In the decoding of a delta-modulated signal, a value Δ is simply added to or subtracted from the previous sample according to the 1-bit (positive or negative) signal. The method in which Δ is fixed is sometimes called linear delta modulation (LDM). In this method, when the speech amplitude becomes too large or changes too rapidly, the reconstructed sample does not exactly follow the original signal. Distortion in this case is referred to as slope overload distortion. On the other hand, when there is no speech, that is, during a period of silence, or when the speech wave changes only slightly and very slowly, the quantization output alternates between 0 and 1. The
encoded waveform thus indicates an alternating increase and decrease with the stepping of Δ. This type of distortion, which is referred to as granular noise, produces a harsh noise. The mechanism producing these two kinds of distortion is shown in Fig. 6.7(a).

FIG. 6.7 Illustration of delta modulation: (a) LDM, showing slope overload distortion and granular noise; (b) ADM.
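The LDM loop of Fig. 6.7(a) can be sketched in a few lines (illustrative code; the encoder contains the same 1-bit local decoder as the receiver). A ramp whose slope is below Δ per sample is tracked to within about one step, while a steeper ramp produces slope overload:

```python
import numpy as np

def ldm_encode(x, delta):
    """Linear delta modulation: 1 bit per sample, fixed step `delta`."""
    bits = np.zeros(len(x), dtype=int)
    approx = 0.0                               # state of the local decoder
    for t, sample in enumerate(x):
        bits[t] = 1 if sample >= approx else 0
        approx += delta if bits[t] else -delta  # step up or down by delta
    return bits

def ldm_decode(bits, delta):
    """Receiver: accumulate +delta / -delta according to each bit."""
    steps = np.where(bits == 1, delta, -delta)
    return np.cumsum(steps)
```

On a constant input the bit stream alternates 0 and 1 and the output toggles by ±Δ, which is exactly the granular noise described above.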
The slope overload distortion can be reduced by increasing the step Δ. When Δ is too large, however, the granular noise increases. Therefore, Δ is usually set at a compromise value which minimizes the mean square quantization error. In order to maintain a high-quality speech wave having a telephone bandwidth using this method, however, the sampling frequency must be as high as 100 to 200 kHz. This problem has been handled more effectively by a coding method called adaptive delta modulation (ADM), in which Δ is changed adaptively with respect to the input speech waveform (Jayant, 1970). Most of the various ADM methods are based on the backward adaptation (feedback adaptation) technique, in which the minimum step size is adjusted according to the output code sequence. The general structure of this method is shown in Fig. 6.8. The step size is typically decided by the procedure

Δ_t = M · Δ_{t−1}^α    (Δ_min ≤ Δ_t ≤ Δ_max)    (6.10)

where Δ_min and Δ_max are the predetermined lower and upper limits of the step size. When the output codes c_t and c_{t−1} in Fig. 6.8 are the same, a constant value P > 1 is given to M, whereas when c_t and c_{t−1} are different, a constant value Q < 1 is assigned. Experiments confirmed the optimum condition to be P × Q ≈ 1. Figure 6.7(b) is an example of the quantization when α = 1, P = 2, and Q = 1/2. Since the step size is varied exponentially in ADM, the slope overload distortion and the granular noise in ADM are smaller than those in LDM. Experimental evaluation indicates that an ADM with a sampling frequency of 56 kHz has almost the same quality as a 7-bit log PCM with 8-kHz sampling.

6.2.5 Adaptive Differential PCM (ADPCM)
ADPCM is a type of DPCM which includes backward adaptive quantization and/or backward adaptive prediction. This method is
advantageous in that only the residual signals must be transmitted (Cummiskey et al., 1973). The method using backward (feedback) adaptive quantization and fixed prediction produces high quality in spite of its simple structure. In this method, the quantization step for the first-order difference using a fixed coefficient is controlled to adapt to input speech. This method produces an SNR of roughly 22 dB at 32 kbps (8-kHz sampling and 4-bit quantization), which is roughly 8 dB higher than that of log PCM at the same data rate. Subjective evaluation by the preference method indicates that the quality of 4-bit ADPCM is between that of 6-bit and 7-bit log PCM. This means that ADPCM can achieve an improvement of roughly 2.5 bits. There are two reasons for this improvement by ADPCM. One is that ADPCM can cover a wider amplitude range than log PCM with the same bit rate. The other is that its power spectral distribution for the quantization noise is not homogeneous but is concentrated in the lower-frequency range. Therefore, the speech spectrum can easily mask the quantization noise. A block diagram of an advanced ADPCM system incorporating both adaptive quantization and adaptive prediction capabilities is shown in Fig. 6.9. The dashed lines indicate the additional transmission information, Δ and {a_i}, with forward-type quantization and prediction. A backward-type system at 32 kbps has been reported to produce an SNR of roughly 30 dB, and one at 16 kbps (6.67-kHz sampling and 4th-order linear prediction) has the same intelligibility as 8-kHz sampling, 5-bit log PCM.
6.2.6 Adaptive Predictive Coding (APC)

Adaptive predictive coding is forward-type AP-DPCM in the broadest sense of the definition. In the narrowest sense, it is a system which includes pitch prediction as well, as shown in Fig. 6.10 (Atal and Schroeder, 1970). The prediction model in the latter case is
x̂_t = P x_{t−M} + Σ_{k=1}^{p} a_k (x_{t−k} − P x_{t−M−k})    (6.11)
The speech signal is analyzed block by block to obtain the pitch period M, the predictor coefficients {a_i}, and the amplitude of the pitch component P. This information, together with the quantization step width q for the residual signal (collectively called side information), is transmitted along with the residual signal. The residual signal is quantized and coded with 1 bit (two levels). Since linear prediction is performed using all samples in each block, unlike ADM and ADPCM, a large prediction gain can be obtained. Subjective evaluation experiments indicated that when the sampling frequency is 6.67 kHz (the transmission bit rate for the residual signal is 6.67 kbps, and a small amount of side information is additionally transmitted), the quality of coded speech is slightly lower than with 6-bit log PCM (SNR = 27 dB). The side information is not quantized in these experiments. Although the predictor structure presented in Fig. 6.10(b) corresponds well to Eq. (6.11), it is redundant. Therefore, a different structure is usually used in which P1 and P2 are separated into different feedback loops.

6.2.7 Noise Shaping
Figure 6.11 is a block diagram of an APC system capable of noise shaping. Noise shaping is the process of decreasing the audible quantization noise using the auditory masking effect described in Sec. 6.1.2. This is accomplished by modifying the flat quantization noise spectrum into a spectrum which resembles that of speech (Makhoul and Berouti, 1979; Atal and Schroeder, 1979). This is done by feeding back the quantization noise.
The transmission function of the feedback filter F is

F(z) = Σ_{i=1}^{p} b_i z^{-i}    (6.12)

The filter coefficients are set at {b_i} = {γ^i a_i}, where {a_i} are the coefficients of predictor P2, and the parameter γ is set at 0 < γ < 1. Accordingly, when γ approaches 0, the quantization noise spectrum approaches the speech spectrum, and when γ approaches 1, the noise spectrum flattens. For practical purposes, γ is set at an appropriate value derived from the results of hearing tests. Although the coding algorithm becomes slightly complicated by introducing this method, the subjective SNR is improved by 12 dB when γ = 0.8. When pitch prediction and sophisticated quantization algorithms are introduced, a quality equivalent to that of 7-bit log PCM can be obtained at 16 kbps. An adaptive noise-shaping postfilter has also been proposed as a postprocessor for enhancing speech quality (Jayant and Ramamoorthy, 1986). The philosophy of the postfiltering technique can be explained using the simple illustration in Fig. 6.12. Part (a) of the figure shows a signal spectrum with two narrowband components in the frequency regions W1 and W2, and a flat noise spectrum that is 15 dB below the first signal component but 5 dB above the second signal component. Part (b) presents the spectra of the postfiltered signal and noise when the postfilter transfer function is identical to the signal spectrum in (a). Although the resulting SNRs in regions W1 and W2 are the same as those before postfiltering, the noise in the rest of the frequency range is much lower than the signal levels. This postfiltering operation perceptually enhances the signal. Figure 6.13 is a block diagram of the adaptive postfiltering applied to the ADPCM decoder output. The coefficients of the postfilter are scaled versions of the coefficients of the adaptive predictor in ADPCM. The speech distortion inevitable in postfiltering can be mitigated by adapting the degree of postfiltering according to the ADPCM performance.
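With {b_i} = {γ^i a_i} as in Eq. (6.12), the feedback filter satisfies 1 + F(z) = A(z/γ), and under the standard noise-feedback analysis the output quantization noise is spectrally shaped by |A(e^{jω}/γ)| / |A(e^{jω})|. The sketch below (illustrative names; a second-order example predictor is assumed) evaluates this shape and confirms the two limiting cases stated above:

```python
import numpy as np

def noise_shape(a, gamma, n_freq=512):
    """|A(e^{jw}/gamma)| / |A(e^{jw})|: spectral shape of the output
    quantization noise when the feedback filter is F(z) = A(z/gamma) - 1,
    i.e. b_i = gamma**i * a_i as in Eq. (6.12)."""
    a = np.asarray(a, dtype=float)
    i = np.arange(1, len(a) + 1)
    w = np.linspace(0.0, np.pi, n_freq, endpoint=False)
    z = np.exp(-1j * np.outer(w, i))        # z^-i evaluated on the unit circle
    A = 1 + z @ a                           # A(e^{jw})
    A_g = 1 + z @ (gamma**i * a)            # A(e^{jw}/gamma)
    return np.abs(A_g) / np.abs(A)

# gamma = 1 gives a flat (unshaped) noise spectrum;
# gamma = 0 gives 1/|A|, i.e. noise shaped like the speech envelope.
```

Intermediate values of γ trade the two extremes off, which is why a perceptually tuned value such as γ = 0.8 is used in practice.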
FIG. 6.12 An idealized explanation of the effects of postfiltering: (a) signal and noise spectra at the postfilter input; and (b) postfiltered spectra.
FIG. 6.13 Block diagram indicating adaptive postfiltering of the output of an ADPCM decoder.

6.3 CODING IN FREQUENCY DOMAIN

6.3.1 Subband Coding (SBC)

The coding method, in which a speech band is divided into several contiguous bands by a bank of band-pass filters (BPFs), and a
specific coding strategy is employed for each band signal, is called subband coding (SBC, Crochiere et al., 1976). As shown in Fig. 6.14, the speech signal passing through each BPF is transformed into a baseband signal by low-frequency conversion, down-sampled at the Nyquist rate, and coded by an adaptive coding method such as ADPCM. The inverse procedures reproduce the original signal.

This method is advantageous for two reasons. One is that processing concerning human auditory characteristics, such as noise shaping, can easily be applied. The other is that a higher bit rate can be allocated to those bands in which higher speech energy is concentrated or to those bands which are subjectively more important. Therefore, this method can produce less perceptible quantization noise at the same or even at a lower bit rate. This method is also beneficial in that the quantization noise produced in one band does not influence any other band; that is, low-level speech input will not be corrupted by quantization noise in another
band. Since a short-time frequency analysis of input signals is performed in the human auditory system, the method for controlling the quantization noise in the frequency domain is both effective and naturally appealing (Tribolet and Crochiere, 1979; Krasner, 1979).

The BPF bank necessary for this method is realized by general digital filters or by a charge-coupled device (CCD) filter which handles analog sampled values. The most reasonable way of dividing the frequency band is to equalize the contributions to the articulation index from all subbands. Since this method complicates low-frequency conversion, however, a far more practical and simpler way is through integer band sampling. Figure 6.15 indicates the sampling process for each band. Assuming that m is an integer, the frequency range of each subband is set at [mf, (m + 1)f], and the output signal is sampled at 2f. The output is then coded and transmitted. At the receiver, the original signal in each subband is reproduced by passing the decoded signal through a BPF having a frequency range of [mf, (m + 1)f]. When the flexibility of the band division is restricted to a ratio of 2:1, a quadrature mirror filter (QMF) can be used (Esteban and Galand,
FIG. 6.15 Illustration of integer band sampling: (a) subband spectrum; (b) sampling clock; (c) spectrum after sampling.
1977). Since a QMF advantageously provides for very simple processing and for the automatic cancellation of aliasing distortion, this method is frequently used for realizing SBC systems.

Experimental evaluation indicates that although the SNR for 16-kbps SBC is 11.1 dB, which is almost the same as 16-kbps ADPCM, the subjective quality is almost the same as that of 22-kbps ADPCM (Crochiere et al., 1976). Although SBC is classified as a type of frequency-domain coding in this book, it can also be defined as a time-domain coding method where input signals are subdivided into frequency bands and quantized.
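The mirror relation and the aliasing cancellation can be sketched with the shortest possible QMF pair, the 2-tap Haar filters. Real SBC systems use much longer filter designs; this is only a toy illustration.

```python
import math

R2 = math.sqrt(2.0)

def qmf_analysis(x):
    """Split x (even length) into low- and high-band signals at half rate.
    Uses h0 = [1, 1]/sqrt(2) and its mirror h1 = [1, -1]/sqrt(2)."""
    low = [(x[2 * n] + x[2 * n + 1]) / R2 for n in range(len(x) // 2)]
    high = [(x[2 * n] - x[2 * n + 1]) / R2 for n in range(len(x) // 2)]
    return low, high

def qmf_synthesis(low, high):
    """Recombine the subband signals; the aliasing introduced by the
    down-sampling cancels exactly, giving perfect reconstruction."""
    x = []
    for l, h in zip(low, high):
        x.append((l + h) / R2)
        x.append((l - h) / R2)
    return x

x = [0.3, -0.1, 0.8, 0.5, -0.4, 0.2, 0.0, 0.6]
low, high = qmf_analysis(x)
y = qmf_synthesis(low, high)
assert all(abs(a - b) < 1e-12 for a, b in zip(x, y))
```

In a full SBC coder the `low`/`high` signals would themselves be quantized (e.g., by ADPCM) before synthesis.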
6.3.2 Adaptive Transform Coding (ATC)
Adaptive transform coding (ATC) is a method in which a speech signal is divided into several frequency bands in a way similar to that with SBC. However, ATC is more flexible than SBC. In this method, a speech wave of around 20 ms, which can be considered stationary, is extracted as a block or a frame. The speech wave of every block is first orthogonally transformed into frequency-domain components, which are subsequently processed by adaptive quantization. At the decoder stage, the speech wave is reproduced by concatenating the inverse-transformed block waveforms (Zelinski and Noll, 1977; Tribolet and Crochiere, 1979). Although various kinds of orthogonal transforms are used, such as the DFT, discrete cosine transform (DCT), and Karhunen-Loeve transform (KLT), ATC usually refers to a system in which DCT and adaptive bit allocation are employed. The DCT for a block consisting of an M-sample speech signal, {x_t, t = 0, 1, ..., M - 1}, is defined as

X_k = c_k Σ_{t=0}^{M-1} x_t cos[(2t + 1)πk / 2M]    (k = 0, 1, 2, ..., M - 1)    (6.13)

where

c_k = 1/√2 (k = 0);    c_k = 1 (k = 1, 2, ..., M - 1)
The inverse DCT is defined as
x_t = (2/M) Σ_{k=0}^{M-1} c_k X_k cos[(2t + 1)πk / 2M]    (t = 0, 1, 2, ..., M - 1)    (6.14)

There are four advantages of using the DCT:

1. Unlike the KLT, where input signals are transformed into their principal components, the DCT corresponds to conversion into the conventional frequency domain. It thus facilitates the use of processes based on the auditory function of frequency analysis and the control of quantization noise in the frequency domain.
2. The DCT takes a relatively small amount of calculation, since an N-point DCT for each frame can be performed using a symmetric 2N-point FFT. Additionally, in contrast with the KLT, it is not necessary to transmit the fixed base vectors.
3. The base vector of the DCT is statistically closer to that of the KLT, which is the optimum orthogonal transformation, than other well-known orthogonal transformations such as the DFT and the Walsh-Hadamard transformation. Accordingly, the DCT is statistically more efficient than the DFT in terms of coding performance.
4. The DCT is less sensitive to the edge effect in waveform extraction than the DFT (Tribolet and Crochiere, 1979). The distortion in the frequency domain is therefore naturally smaller with the DCT.
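A direct transcription of Eqs. (6.13) and (6.14) is given below as an O(M²) sketch; a real ATC coder would use the FFT-based computation mentioned in advantage 2. The test frame values are hypothetical.

```python
import math

def dct(x):
    """Forward DCT of Eq. (6.13): X_k = c_k * sum_t x_t cos[(2t+1) pi k / 2M]."""
    M = len(x)
    X = []
    for k in range(M):
        ck = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        X.append(ck * sum(x[t] * math.cos((2 * t + 1) * math.pi * k / (2 * M))
                          for t in range(M)))
    return X

def idct(X):
    """Inverse DCT of Eq. (6.14): x_t = (2/M) sum_k c_k X_k cos[(2t+1) pi k / 2M]."""
    M = len(X)
    return [(2.0 / M) * sum((1.0 / math.sqrt(2.0) if k == 0 else 1.0) * X[k]
                            * math.cos((2 * t + 1) * math.pi * k / (2 * M))
                            for k in range(M))
            for t in range(M)]

frame = [0.2, 1.0, -0.5, 0.3, 0.0, -0.8, 0.6, 0.1]   # hypothetical 8-sample block
coeffs = dct(frame)
restored = idct(coeffs)
assert all(abs(a - b) < 1e-12 for a, b in zip(frame, restored))
```

The round-trip assertion verifies that the two equations are exact inverses of each other.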
The basic structure of ATC is shown in Fig. 6.16. The speech wave in each block is transformed by DCT into the frequency domain, and the resultant DCT coefficients are roughly divided into 20 subbands. The mean energy of each subband is calculated and coded as side information. A spectral envelope is obtained by interpolating the mean energy values of adjacent subbands (linear interpolation on a logarithmic scale). The number of quantization bits and corresponding quantization steps are optimally allocated to each DCT coefficient based on the spectral envelope values so as to maximize the resultant SNR.
The bit allocation maximizing the SNR corresponds to the allocation minimizing the squared sum of the quantization distortion for each spectral coefficient. This allocation produces a uniform level of quantization noise along the frequency axis. A system including noise shaping in the frequency domain, similar to the noise shaping used in APC, has also been proposed (Flanagan et al., 1979).

Accordingly, output signals from the coder are quantized DCT coefficients and side information which represents the spectral envelope. Roughly 2 kbps is necessary for the transmission of the side information. An SNR improvement of 17 to 23 dB over that possible using log PCM can be achieved by ATC at 16 to 32 kbps (Tribolet and Crochiere, 1979). Vocoder-driven ATC, in which LPC (PARCOR) coefficients and pitch information are used as side information, has also been proposed (Tribolet and Crochiere, 1978). Vocoder-driven ATC at 16 kbps has been reported to be able to produce high-quality speech which is almost transparent, i.e., perceptually equivalent, to the original speech.
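One generic way to realize such an optimum allocation is a greedy marginal-return rule: since each added bit cuts a coefficient's quantization noise power by roughly a factor of 4 (6 dB), repeatedly granting one bit to the band with the largest remaining distortion minimizes the total distortion. This is a sketch, not necessarily the algorithm of the systems cited above, and the subband energies are hypothetical.

```python
def allocate_bits(energies, total_bits):
    """Greedy bit allocation: distortion of band i is modeled as
    energies[i] / 4**bits[i]; each pass gives one bit to the worst band."""
    bits = [0] * len(energies)
    for _ in range(total_bits):
        worst = max(range(len(energies)),
                    key=lambda j: energies[j] / 4.0 ** bits[j])
        bits[worst] += 1
    return bits

env = [100.0, 25.0, 4.0, 1.0]      # hypothetical subband mean energies
b = allocate_bits(env, 8)
assert sum(b) == 8                 # budget is exactly spent
assert b == [4, 3, 1, 0]           # high-energy bands receive more bits
```

The resulting per-band noise levels (energy / 4**bits) end up roughly equalized, which is the uniform-noise property described above.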
6.3.3 APC with Adaptive Bit Allocation (APC-AB)
APC-AB, the principal structure of which is indicated in Fig. 6.17 (Honda and Itakura, 1984), is based on a combination of SBC and APC. A speech signal is divided into subbands, and each subband signal is down-sampled to a baseband. APC which includes pth-order linear predictive analysis is then applied to each baseband signal. Both short-term and long-term prediction (pitch prediction) are performed in APC.

For the residual signal, bit allocation for each subband is performed based on the energy distribution. Additionally, each pitch period is divided into L time intervals, and dynamic bit allocation is performed to adapt to the energy of each subinterval. Adaptive bit allocation is basically a process of minimizing the mean waveform distortion over all subintervals. In this system, predictor coefficients, residual energy, and the parameters for the
subintervals (interval length corresponding to the pitch period, and the relative position of the first pitch epoch in an analysis interval) are transmitted as side information. Signals are reproduced by procedures which are inverse to the coding.

The SNR gain G_FT resulting from both the frequency-domain and time-domain bit allocation can be represented by the summation of the SNR gain in the frequency domain, G_F, and that in the time domain, G_T, as

G_FT = G_F + G_T    (6.15)

where G_F is equal to the ratio of the algebraic mean to the geometric mean of the spectrum f(λ), such that

G_F = [(1/2π) ∫ f(λ) dλ] / exp[(1/2π) ∫ ln f(λ) dλ]    (6.16)

Here, f(λ) is represented by the combination of subband spectra, each of which is modeled by an all-pole spectrum. Therefore, if f(λ) ≈ f'(λ), where f'(λ) is the all-pole spectrum representing the spectrum for the entire frequency range, G_F becomes equal to the prediction gain for the entire frequency range.

The subjective quality of the speech coded by APC-AB at 9.6, 16, and 24 kbps is equal to that of 6-, 7-, and 8-bit log PCM, respectively. In this experimental system, the LSP parameters (see Sec. 5.7) are used for transmitting the LPC side information.

6.3.4 Time-Domain Harmonic Scaling (TDHS) Algorithm
The TDHS algorithm is a method for compressing or expanding a harmonic structure by a ratio of between 1/3 and 3 by processing it in the time domain. Mechanisms and procedures for harmonic scaling by TDHS are given in Figs. 6.18 and 6.19 (Malah et al., 1981; Crochiere et al., 1982). In this method, waveforms of adjacent pitch periods are mixed after being multiplied by an appropriate weighting factor. The weighting factor is set as a
FIG. 6.18 Illustration of harmonic scaling by TDHS (Δf = spectral width of each harmonic component).
function of the location in time in order to produce a speech waveform without discontinuities.

With a compression to 1/2, adjacent pitch segments which have a length of P are multiplied by a triangular-shaped weighting factor and added together as shown in Fig. 6.19(a). If the output segments are concatenated and played out at the original sampling period, a time-compressed signal results. If each segment is expanded to the length of 2P and sampled at a sampling period twice the length of the original period, a frequency-compressed
FIG. 6.19 Procedure of (a) compression and (b) expansion of harmonic structure by TDHS. s(n) = original speech wave; s_c(n) = compressed wave; ŝ_c(n) = compressed wave after coding and decoding; ŝ(n) = reproduced wave after expansion; W(m) = weighting function.
signal results. Since the original waveforms at both ends of the 2P-sample-long period are preserved after compression, no waveform discontinuity will occur. The computation load is low because the actual computation process consists of only two multiplications and one addition for each output sample. Also, the sampling period after compression is longer than the original sampling period.

With double expansion, on the other hand, overlapping 2P-sample-long waveforms are multiplied by weighting factors and added together as shown in Fig. 6.19(b). If the original sampling period is retained, a time-expanded signal is obtained. And if this wave is compressed to a P-sample-long waveform and sampled
at a period half that of the original sampling period, a frequency-expanded signal is obtained.

Although the input signal is not necessarily reproduced as the output signal, this algorithm is advantageous in that it produces natural sound even for noisy speech. This robustness stems from the fact that the algorithm does not perform voiced/unvoiced decisions. Thus, the same algorithm is used for both voiced and unvoiced periods. Pitch detection errors deteriorate the synthesized speech quality only fairly gradually in this algorithm, and it works well even with two or more speakers competing at the same time.

When this algorithm is combined with various waveform coding methods as shown in Fig. 6.20, the bit rates of these coding
methods can be reduced even further. For example, the SBC/HS method, which is a combination of SBC and TDHS, realizes at 9.6 kbps a quality equivalent to 16-kbps SBC. The ATC/HS method, a combination of ATC and TDHS, achieves at 7.2 kbps a quality equivalent to ATC at a bit rate roughly 4 kbps higher. Cepstrum analysis is used in the TDHS parts of both methods for pitch extraction. Since the ATC method already uses pitch information for time-domain bit allocation, TDHS does not improve its quality as much as it does for SBC. Additionally, the SBC/HS method is superior to ATC in that it can realize higher quality with simpler hardware.
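The 2:1 compression of Fig. 6.19(a) can be sketched as follows. This is an idealized version that assumes a known, constant pitch period P; a real TDHS coder tracks P with a pitch extractor.

```python
def tdhs_compress(s, P):
    """2:1 TDHS compression sketch: each pair of adjacent P-sample
    (pitch-length) segments is merged into one P-sample segment by a
    triangular cross-fade, so every output sample costs exactly two
    multiplications and one addition."""
    out = []
    n = 0
    while n + 2 * P <= len(s):
        for m in range(P):
            w = 1.0 - m / P                       # triangular weight: 1 -> 0
            out.append(w * s[n + m] + (1.0 - w) * s[n + m + P])
        n += 2 * P
    return out

# For an exactly pitch-periodic input the cross-fade is transparent:
period = [0.0, 1.0, 0.5, -0.5]                    # hypothetical pitch cycle, P = 4
s = period * 4                                    # 16 samples
c = tdhs_compress(s, 4)
assert len(c) == len(s) // 2
assert all(abs(a - b) < 1e-12 for a, b in zip(c, s))
```

Because each output segment starts with the unweighted sample s(n), segment boundaries remain continuous when the compressed segments are concatenated.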
6.4 VECTOR QUANTIZATION

6.4.1 Multipath Search Coding
Since the backward adaptive quantization method determines the present quantization step using only the information concerning past signals, it does not necessarily produce the minimum quantization distortion over several samples. On the other hand, tree coding can further reduce the quantization distortion by using a delayed decision strategy, in which decisions are deferred until more future samples are received. The method utilized to search for the optimum code maximizing the overall SNR, on the assumption that some delay is involved, is called multipath search coding or delayed decision encoding. This method includes tree (search) coding (Anderson and Bodie, 1975; Schroeder and Atal, 1982), trellis coding (Stewart et al., 1982), and vector quantization.

The structure of multipath search coding is presented in Fig. 6.21. Although error energy minimization is usually used as the criterion for selecting the optimum sequence y_k for the input signal vector x, a criterion including frequency weighting has also been proposed. The M-L method (Jelinek and Anderson, 1971) and a method based on dynamic programming (DP) have been investigated as algorithms for deriving the optimum sequence.
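The benefit of delayed decisions can be illustrated by comparing a single-path (greedy) encoder with an exhaustive block search over a toy 1-bit quantizer. The signal values and step size here are hypothetical, and the exhaustive search stands in for the pruned M-L or DP searches used in practice.

```python
from itertools import product

def reconstruct(bits, step):
    """Decode a +/-step bit sequence with a unit-delay accumulator
    (a toy delta-modulation style decoder)."""
    y, out = 0.0, []
    for b in bits:
        y = y + step if b else y - step
        out.append(y)
    return out

def sse(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

def greedy_encode(x, step):
    """Single-path search: each bit minimizes only the current sample's error."""
    y, bits = 0.0, []
    for target in x:
        b = abs(y + step - target) <= abs(y - step - target)
        bits.append(b)
        y = y + step if b else y - step
    return bits

def tree_encode(x, step):
    """Multipath (delayed-decision) search: examine every bit sequence for
    the whole block and keep the minimum total-error path. Exponential in
    block length; practical coders prune the tree instead."""
    return list(min(product([False, True], repeat=len(x)),
                    key=lambda bits: sse(x, reconstruct(bits, step))))

x = [0.4, 0.9, 0.6, 0.1, -0.3, -0.6]     # hypothetical block of samples
e_greedy = sse(x, reconstruct(greedy_encode(x, 0.5), 0.5))
e_tree = sse(x, reconstruct(tree_encode(x, 0.5), 0.5))
assert e_tree <= e_greedy                # delayed decision can never do worse
```

Because the exhaustive search considers the greedy path among all candidates, its block error is never larger than the greedy result.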
Application of the tree coding method to ADPCM realizes an SNR of roughly 20 dB at 16 kbps.

In contrast to multipath search coding, coding methods based only on present and/or past sample values, such as ΔM and DPCM, are referred to as types of single-path search coding. As can be expected, the circuit structure of multipath search coding is usually more complicated than that of single-path search coding.

6.4.2 Principles of Vector Quantization
Vector quantization (VQ) is a quantization method in which waveforms or spectral envelope parameters are not quantized on a sample-by-sample basis; instead, a set of scalars composing a vector is represented by a single code in waveform coding or in the analysis-synthesis method (Gersho and Cuperman, 1983; Gersho and Gray, 1992). Conversely, the one-dimensional coding methods described so far are generally classified as forms of scalar quantization. VQ was first proposed as a highly efficient quantization method for LPC parameters (Linde et al., 1980), and was later applied to waveform coding. Figure 6.22 indicates the principle of VQ.
FIG.6.22 Block diagram of VQ.
In VQ waveform coding (vector PCM, VPCM), a certain period of the sampled waveform is extracted, and the waveform pattern in this period is represented by a single code. This procedure is accomplished by storing typical waveform patterns (code vectors or templates), and giving a code to each pattern. The table indicating the correspondence between patterns and codes is termed a codebook. An input waveform is compared with each pattern at every predetermined interval, and the waveform of each period is delineated by a code indicating the pattern having the largest similarity to the waveform.

The codebook should thus provide an appropriate set of patterns which minimizes the overall distortion when various types of waveforms are depicted by a limited number of patterns. Solving a nonlinear optimization problem usually facilitates the construction of a set of these patterns based on the original pattern distribution. Since finding a globally optimal solution is usually computationally prohibitive, a locally optimal solution is generally obtained iteratively.

For this purpose, two codebook generation methods, the random learning and clustering-based methods, have been proposed. Random learning enables vectors to be randomly selected from training data and stored as code vectors. This method is used when the amount of training data is comparable to the number of code vectors. The clustering method is normally based on Lloyd's algorithm (K-means algorithm) (Lloyd, 1957; Max, 1960). In this method, training data are clustered into nonoverlapping groups, and the corresponding centroids which minimize the average distortion are computed. These centroids are then stored as code vectors. Although the global optimality of this method is not guaranteed, the average distortion can be monotonically decreased by the iteration of codebook renewal, and a locally optimal solution can thus be obtained. The centroid depends on the distortion measure (distance measure, similarity) selected.
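A minimal sketch of Lloyd's algorithm with random-learning initialization follows. The training data, vector dimension, and stopping rule are hypothetical; the precise algorithms are given in Appendix B.

```python
import random

def dist2(a, b):
    """Squared Euclidean distortion between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def centroid(vs):
    """Distortion-minimizing representative for squared error: the mean."""
    k = len(vs[0])
    return tuple(sum(v[d] for v in vs) / len(vs) for d in range(k))

def train_codebook(data, n_codes, iters=20, seed=1):
    """Lloyd's (K-means) codebook training: random-learning initialization,
    then alternate nearest-neighbour partitioning and centroid replacement.
    The average distortion is non-increasing, so the result is a locally
    optimal codebook."""
    codes = random.Random(seed).sample(data, n_codes)
    for _ in range(iters):
        clusters = [[] for _ in codes]
        for v in data:
            nearest = min(range(n_codes), key=lambda j: dist2(v, codes[j]))
            clusters[nearest].append(v)
        codes = [centroid(c) if c else codes[j] for j, c in enumerate(clusters)]
    return codes

# Hypothetical training set: two tight clusters of 2-dimensional "frames".
data = [(0.01 * i, 0.1 * i) for i in range(5)] + \
       [(10.0 + 0.01 * i, 10.0 - 0.1 * i) for i in range(5)]
book = train_codebook(data, 2)
avg = sum(min(dist2(v, c) for c in book) for v in data) / len(data)
assert len(book) == 2 and avg < 1.0       # each cluster gets its own code vector
```

Swapping `dist2` for another distortion measure changes the centroid rule accordingly, as noted in the text.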
As an expansion of Lloyd's algorithm, the cluster-splitting method (LBG algorithm) was proposed (Linde et al., 1980). In this method, code vectors are obtained by Lloyd's algorithm, and the number of clusters is doubled at each stage of codebook
renewal by adding new code vectors in the vicinity of the previous vectors. This procedure is iterated starting with the one-cluster condition. The vector quantization algorithms are precisely explained in Appendix B.

Regarding VQ waveform coding, the following equation relates the number N of patterns in the codebook, the dimension k of each vector (number of samples in each period), and the bit rate (bit/sample) r:

N = 2^{kr}    (6.20)

When N is large enough,

SNR [dB] = (6/k) log_2 N + C_k    (6.21)

where C_k is a k-dependent parameter (Gersho and Cuperman, 1983). Therefore, when the codebook doubles its size, the SNR increases by 6/k [dB]. The condition k = 1 corresponds to scalar quantization. When r is given, the quantization distortion can generally be reduced, by increasing k, to approach the minimum quantization distortion D(R) derived from the information rate distortion theory.

VQ serves as a highly efficient coding method because it utilizes the statistical occurrence or the probability distribution function of the source, no matter how varied it is. VQ also employs the smooth continuity arising from the correlation or nonlinear dependency existing in a certain period of speech samples. Therefore, C_k becomes larger than that possible with scalar quantization, and the bit rate can be reduced. For example, experimental results show that C_8 is 7 dB larger than C_1. More formal results indicate that when the sampling frequency is 8 kHz, an SNR of 12 to 14 dB is obtained under the conditions of r = 2, k = 4, and N = 256, and an SNR of 8 to 10 dB can be derived when the conditions become r = 1, k = 8, and N = 256 (Abut et al., 1982).
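The numerical relations of Eqs. (6.20) and (6.21) can be checked directly:

```python
def codebook_size(k, r):
    """Eq. (6.20): an r bit/sample, k-dimensional VQ uses N = 2**(k*r) patterns."""
    return 2 ** (k * r)

def snr_per_doubling(k):
    """From Eq. (6.21), SNR = (6/k) log2 N + C_k: each doubling of the
    codebook buys 6/k dB; k = 1 recovers the scalar 6 dB/bit rule."""
    return 6.0 / k

assert codebook_size(4, 2) == 256   # the r = 2, k = 4 condition cited above
assert codebook_size(8, 1) == 256   # the r = 1, k = 8 condition cited above
assert snr_per_doubling(1) == 6.0
```

Both experimental conditions quoted from Abut et al. (1982) thus use the same codebook size N = 256, differing only in how the 8 bits per block are split between dimension and rate.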
FIG. 6.23 Principle of BTC.
6.4.3 Tree Search and Multistage Processing
Selecting the most appropriate pattern from the codebook as efficiently as possible is one of the important issues in VQ. One of the most practical selection methods is binary tree coding (BTC), indicated in Fig. 6.23 (Gersho and Cuperman, 1983). In this method, codebook patterns are stored in a binary tree structure, and patterns are sought by tracing the binary tree. When the size of the binary tree codebook is N = 2^{kr}, for example, the number of binary decisions is log_2 N = kr. That is, the vector space is successively divided, and the input vector is compared with only two code vectors at each decision stage down to the last one.

Another selection method is full search coding (FSC), in which the input vector is compared with all N patterns stored in the codebook to calculate the similarities before selecting the closest pattern. Therefore, when the number of patterns is fixed, FSC can achieve a lower distortion than BTC. FSC is disadvantageous, however, in that its amount of similarity calculation is larger than that in BTC. On the other hand, since the codebook has a binary tree structure in BTC, it must have a memory capacity
FIG.6.24 Block diagram of multistage VQ.
corresponding to the number of nodes in the tree. Thus, BTC requires roughly twice the memory capacity of FSC.

A multistage VQ has been proposed to reduce the memory size and to simplify the coding process (Juang and Gray, 1982). Figure 6.24 expresses the principle of this method. The quantization errors of each stage are transmitted to the next stage, and the final code is constructed by the sequence of the codes obtained at each stage. When the bit rate r is fixed and the dimension k is increased in FSC, the number of calculations and the memory size increase exponentially, as a function of N = 2^{kr} and Nk = k2^{kr}, respectively. However, when the number of stages S is proportional to k, the amount of processing increases only in proportion to k². Using a tree-structure codebook at each stage of quantization further reduces the increase in the amount of processing. Figures 6.25 and 6.26 indicate the amount of processing and the memory capacity as a function of the number of vector dimensions under the condition of 1 bit/sample (Gersho and Cuperman, 1983). These results show that the increase in the amount of processing can be suppressed by combining multistage processing and binary tree coding, even when the number of dimensions increases.

A vector-scalar quantization has also been proposed in which a VQ using a small codebook is followed by a scalar quantization around each code vector. Adaptive transform coding with VQ (ATC-VQ) is a modification of ATC in which residual signals are
FIG.6.25 Amount of VQ processing as a function of vector dimension.
transformed by DFT and vector-scalar quantized with adaptive bit allocation (Moriya and Honda, 1986). Adaptive vector predictive coding (AVPC) is a modification of APC in which residual signals are vector quantized (Cuperman and Gersho, 1982). Experimental evaluation of this method indicates that an 18- to 20-dB SNR can be obtained when residual signals are vector-quantized under the conditions that the sampling frequency is 8 kHz, k = 5, and r = 2 (16 kbps). A modification of SBC in which VQ is applied to each subband signal has also been investigated.

6.4.4 Vector Quantization for Linear Predictor Parameters
In the VQ of LPC parameters, the 1st- through pth-order LPC parameters (PARCOR coefficients, LSP parameters, etc.) are dealt
FIG.6.26 Memory capacity for VQ as a function of vector dimension.
with as a vector (pattern) and represented by a code. Using this method, a very-low-bit-rate coding which still maintains intelligibility can be realized at 150 to 800 bps, although the naturalness is inferior to that of high-bit-rate waveform coding (Smith, 1969; Buzo et al., 1980; Roucos et al., 1982a).

Figure 6.27 shows the spectral distortion for speech coded by VQ or scalar quantization as a function of the amount of spectral envelope information per frame (Buzo et al., 1980). A spectral distortion of 1.8 dB is realized by FSC at 10 bits/frame, which is 27 bits/frame (73%) smaller than the bit rate of scalar quantization producing the same distortion. Additionally, the distortion by BTC at 10 bits/frame is 0.6 dB larger than that of FSC, and is equal to the distortion by FSC at 8 bits/frame. It has been reported that a VQ at 800 bps can realize almost the same quality produced by a 2.4-kbps LPC vocoder (Wong et al., 1982).
FIG. 6.27 Spectral distortion as a function of bit rate for spectral envelope parameters; comparison between VQ and scalar quantization.
6.4.5 Matrix Quantization and Finite-State Vector Quantization
Matrix quantization (MQ) and finite-state VQ (FSVQ) have been investigated for the purpose of realizing a very-low-bit-rate coding. MQ is an expansion of VQ in which a set of spectral parameters over multiple frames (a spectral segment) is expressed by a single code (Wong et al., 1983). The speech spectral parameter sequence is depicted as the concatenation of spectral segments. On the other hand, FSVQ utilizes the transitional characteristics of the vector codes (Foster et al., 1985).

The principal MQ procedure consists of two processes: dividing the parameter sequence into multiple-frame segments (segmentation process), and matching each segment with code segments in the dictionary (quantization process), as shown in Fig. 6.28. MQ methods can be divided into two types, depending
on whether the segment length (number of frames) is fixed (Roucos et al., 1982b) or variable (Shiraki and Honda, 1986). Variable-length MQ can further be divided into two types according to whether segmentation and quantization are performed separately or jointly. The latter method generally outperforms the former.

The phonocode method is a typical example of the variable-length MQ method, in which the segmentation and quantization processes are performed jointly based on the spectral distortion minimization criterion. The structure of the phonocode method is shown in Fig. 6.29. Both the sequence of code segments and the segment lengths are efficiently determined using a dynamic programming procedure. Speech quality with a phoneme identification score of 78% was obtained by this method at roughly 200 bps using a 10-bit matrix codebook.

FSVQ, the structure of which is shown in Fig. 6.30, is a kind of VQ in which the codebook is adapted at every frame based on the transitional characteristics of vector codes. FSVQ can be considered a VQ version of the backward adaptive quantization technique used in waveform coding. In FSVQ, a code minimizing the distortion is selected from the codebook, depending on the state, and, at the same time, a transition to the next state occurs according to the selected code. Since the state transition depends only on the initial state and the code sequence, the decoder can reproduce the vector codes by reconstructing the state transitions based on the received code sequence.

Trellis coding has been proposed as a modification of FSVQ (Stewart et al., 1982; Juang, 1986). In this coding method, the vector code of each frame is determined to minimize the sum of the distortion over several frames. Hidden Markov model (HMM) coding has also been suggested to delineate the dynamic characteristics of vector sequences (Farges and Clements, 1986).
An HMM is a stochastic finite-state model represented by the occurrence probability of each code vector at each state and the interstate transition probabilities. The principle and procedures of HMMs are precisely explained in Section 8.7. The amount of information necessary to depict the spectral envelope parameters can be reduced 30% below that needed by simple VQ using FSVQ, trellis, or HMM coding.
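The encoder/decoder state synchronization that makes FSVQ work can be sketched as follows. The two-state codebooks and the "next state = last code index" transition rule are arbitrary illustrations, not taken from the references.

```python
def fsvq_encode(vectors, codebooks, next_state, state=0):
    """FSVQ sketch: the codebook searched at each frame depends on the
    current state, and the selected code index drives the state transition."""
    codes = []
    for v in vectors:
        book = codebooks[state]
        i = min(range(len(book)),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(v, book[j])))
        codes.append(i)
        state = next_state(state, i)
    return codes

def fsvq_decode(codes, codebooks, next_state, state=0):
    """The decoder starts in the same initial state and applies the same
    transition rule, so the state sequence is recovered from the codes
    alone and no state information is transmitted."""
    out = []
    for i in codes:
        out.append(codebooks[state][i])
        state = next_state(state, i)
    return out

codebooks = {0: [(0.0, 0.0), (1.0, 1.0)], 1: [(2.0, 2.0), (3.0, 3.0)]}
step = lambda s, i: i                                  # hypothetical rule
frames = [(0.9, 1.1), (2.1, 1.9), (0.2, -0.1)]          # hypothetical input
codes = fsvq_encode(frames, codebooks, step)
decoded = fsvq_decode(codes, codebooks, step)
assert codes == [1, 0, 0]
assert decoded == [(1.0, 1.0), (2.0, 2.0), (0.0, 0.0)]
```

Note that the same index (0) selects different code vectors in different states, which is how a small per-state codebook can cover a larger effective vocabulary.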
6.5 HYBRID CODING

6.5.1 Residual- or Speech-Excited Linear Predictive Coding
Hybrid coding methods have been derived to serve as intermediate methods between the analysis-synthesis (vocoding) system, which uses pulses and noise as sound sources, and the waveform coding system. In these methods, the low-frequency components of either an input speech wave or an inverse-filtered input speech wave are coded as sound sources and transmitted with spectral information. At the decoder, the high-frequency components of the source signal are regenerated using the low-frequency components. All these components are then modified by spectral envelope information to reproduce the speech signal.

These methods offer four advantages:

1. The system is free from quality degradation due to source modeling.
2. A low-frequency waveform is exactly reproduced within the limit of the quantization error.
3. Spectral information for the entire frequency range is efficiently represented by an analysis-synthesis method such as the LPC analysis method.
4. Since pitch period estimation and the voiced/unvoiced decision are not necessary, the system is free from both pitch estimation errors and voiced/unvoiced decision errors.

The first experimental system based on these methods was proposed as a modification of the channel vocoder, and was named the voice-excited vocoder or baseband vocoder (David et al., 1962). Subsequently, the residual-excited LPC vocoder (RELP) and voice-excited LPC vocoder (VELP) have been investigated by many researchers as systems which effectively modify the LPC method. In both systems, the spectral envelope is extracted by LPC analysis. The LPC spectral parameters, such as the PARCOR coefficients, are encoded for transmission. At the same time, the low-frequency
components of either speech signals or residual signals are down-sampled and encoded. Usually, baseband signals lower than roughly 800 Hz are encoded by log PCM, ADPCM, APCM, ADM, APC, or SBC. At the decoder, high-frequency components are reproduced by combining nonlinear processing, such as rectification and clipping, with high-frequency emphasis, spectral smoothing, and noise addition. The bit rate for these systems is 4.8 to 9.6 kbps, of which roughly 2 kbps is used for the transmission of the LPC parameters.

Block diagrams of three of these systems are presented in Fig. 6.31. RELP, outlined in Fig. 6.31(a), is a system in which the low-frequency components of an LPC residual signal are encoded and transmitted. A system which is also regarded as a modification of vocoder-driven ATC has been proposed as a modification of RELP (Tribolet and Crochiere, 1980). In this system, the low-frequency components of a residual signal are coded by DCT, and the high-frequency components are reproduced by shifting the low-frequency DCT spectrum. This system is favorable in that it enables the realization of a system with any bit rate between that of a 2.5-kbps LPC vocoder and 16-kbps ATC.

In VELP, shown in Fig. 6.31(b), the low-frequency components of a speech signal are encoded and transmitted. VELP-reproduced speech is said to be somewhat smoother than that produced by RELP. The problem with VELP, however, is that the LPC synthesizer excited by the speech signal produces an energy discrepancy in the low-frequency region due to resonances which occur at formant frequencies.

Figure 6.31(c) illustrates a system in which the low-frequency components are processed by waveform coding and the high-frequency components are processed by an LPC vocoder. Although this method is beneficial in that low-frequency waveforms are preserved, it also requires pitch extraction and the voiced/unvoiced decision. Additionally, waveform coding along with vocoding in this system results in an overlap of low-frequency components.
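The harmonic-regeneration idea can be demonstrated with full-wave rectification alone. This is a deliberately minimal sketch; the signal and analysis sizes are hypothetical, and a real decoder would follow the rectifier with the emphasis, smoothing, and noise-addition steps mentioned above.

```python
import math

def bin_power(x, k):
    """Power in DFT bin k of a real signal (naive DFT; fine for a sketch)."""
    N = len(x)
    re = sum(x[n] * math.cos(2 * math.pi * k * n / N) for n in range(N))
    im = sum(x[n] * math.sin(2 * math.pi * k * n / N) for n in range(N))
    return re * re + im * im

N, f = 64, 4
baseband = [math.sin(2 * math.pi * f * n / N) for n in range(N)]
rectified = [abs(s) for s in baseband]        # full-wave rectification

# The pure baseband tone has no energy at 2f; rectification, being nonlinear,
# creates harmonics there that the decoder can shape into a high band.
assert bin_power(baseband, 2 * f) < 1e-6
assert bin_power(rectified, 2 * f) > 1.0
```

The same mechanism applies when the baseband is a sum of pitch harmonics: the nonlinearity generates components at multiples of the harmonics already present.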
FIG. 6.31 Block diagrams of linear predictivecodersexcited by residual or speech: (a) RELP, (b) VELP, (c) combination of vocoder and waveform coder.
6.5.2 Multipulse-Excited Linear Predictive Coding (MPC)
Multipulse-excited LPC (MPC) is a system in which an LPC synthesis filter is excited by multiple pulses, regardless of whether the source is voiced or unvoiced, without modeling the source using pulses and noise (Atal and Remde, 1982). A schematic diagram of
Chapter 6
FIG. 6.31 (Continued).
the synthesis filter for this method is presented in Fig. 6.32. The strong point of this method is that it is free from quality degradation due to source modeling and source parameter estimation errors such as pitch period estimation error. In this aspect, MPC is similar to RELP and VELP. The problem with this method is the difficulty in determining optimum amplitudes and locations of the multiple pulses. A method based on the A-b-S technique, as shown in Fig. 6.33, is usually used for this purpose. First, a speech wave roughly 20 ms long is extracted as a frame (block) approximately every 10 ms, and
FIG. 6.32 Block diagram of MPC.
FIG. 6.33 Procedure for determining optimum excitation in MPC.
the spectral envelope is estimated using LPC analysis for this frame period. Next, multiple pulses are determined as a driving source function using the algorithm indicated in the figure at every 5 or 10 ms. Here, we assume that the amplitudes and locations of a certain number of pulses are already determined. Accordingly, the multiple pulses {v_n} are transformed into synthesized speech {s_n}
through the LPC synthesis filter corresponding to the estimated spectral envelope. The amplitude and location of a new pulse are determined to minimize the mean square error between the synthesized speech and original speech using perceptually based weighting. The synthesized speech produced by the pulses determined thus far is then subtracted from the speech wave. The procedure for determining a new pulse and subtracting the synthesized speech from the previous speech is repeated until the error becomes less than a certain predetermined threshold or the number of pulses reaches a predetermined number. The synthesized speech samples of the current frame are used as the initial condition for the next frame. Perceptually based weighting is performed in the same way as the noise shaping described in Sec. 6.2.7. The transfer function of the weighting filter is

W(z) = \frac{1 + \sum_{i=1}^{p} a_i z^{-i}}{1 + \sum_{i=1}^{p} \gamma^i a_i z^{-i}}    (6.17)
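As a sketch, the weighting of Eq. (6.17) can be applied by filtering with numerator coefficients A(z) and denominator coefficients A(z/γ); the value γ = 0.8 below is an assumed typical choice, not one given in the text, and the function name is hypothetical.

```python
import numpy as np
from scipy.signal import lfilter

def perceptual_weighting(signal, a, gamma=0.8):
    """Apply the error-weighting filter of Eq. (6.17),
    W(z) = A(z) / A(z/gamma), where A(z) = 1 + sum_i a_i z^-i
    and `a` holds the prediction coefficients a_1..a_p."""
    a = np.asarray(a, dtype=float)
    num = np.concatenate(([1.0], a))                                      # A(z)
    den = np.concatenate(([1.0], a * gamma ** np.arange(1, len(a) + 1)))  # A(z/gamma)
    return lfilter(num, den, signal)
```

With γ = 1 the numerator and denominator coincide and the filter reduces to the identity; with γ < 1 the denominator poles move toward the origin, de-emphasizing error near the formants.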
where {a_i}, i = 1, ..., p, are the linear prediction coefficients, and γ is the weighting factor (0 < γ < 1), which is empirically determined. The autocorrelation function of the impulse response for the synthesis filter and the cross-correlation between this impulse response and the original speech signal can be used for sequentially determining the impulse amplitude and position as follows. The energy of the error between the speech wave synthesized by K pulses and the original speech wave is

E = \sum_{n=1}^{N} \left[ s_n - \sum_{i=1}^{K} g_i h_{n - m_i} \right]^2    (6.18)

where h_n is the impulse response of the synthesis filter, N is the frame length, and g_i and m_i are the amplitude and position of the ith pulse in the frame, respectively. The weighting process is
omitted in this equation for simplicity. To put it more precisely, the values after convolution with the impulse response of the weighting filter must be used instead of s_n and h_n. The pulse amplitude and position minimizing Eq. (6.18) can be obtained by maximizing the following expression, which is derived by setting the partial derivative of Eq. (6.18) with respect to the pulse amplitude to zero:

\frac{\left[ \phi_{hs}(m_K) - \sum_{i=1}^{K-1} g_i R_{hh}(m_K - m_i) \right]^2}{R_{hh}(0)}    (6.19)
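The correlations appearing in Eq. (6.19) lend themselves to a direct implementation of the sequential search; the sketch below (function name hypothetical) omits the perceptual weighting, as Eq. (6.18) does.

```python
import numpy as np

def multipulse_search(s, h, n_pulses):
    """Sequential amplitude/position search of Eqs. (6.18)-(6.19).
    `s` is the target speech frame, `h` the synthesis-filter impulse
    response truncated to the frame length."""
    N = len(s)
    # cross-correlation phi_hs(m) = sum_n s[n] h[n-m]
    phi = np.array([np.dot(s[m:], h[:N - m]) for m in range(N)])
    # autocorrelation R_hh(k) of the impulse response
    R = np.array([np.dot(h[:N - k], h[k:N]) for k in range(N)])
    positions, amps = [], []
    for _ in range(n_pulses):
        # numerator of Eq. (6.19) for every candidate position,
        # removing the contribution of the pulses already placed
        num = phi.copy()
        for g, m in zip(amps, positions):
            num -= g * np.array([R[abs(k - m)] for k in range(N)])
        m_new = int(np.argmax(num ** 2 / R[0]))
        positions.append(m_new)
        amps.append(num[m_new] / R[0])
    return positions, amps
```

If the target frame is exactly one scaled impulse response, the first iteration recovers its position and amplitude.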
Here, R_hh is the autocorrelation of the impulse response of the synthesis filter, and φ_hs is the cross-correlation between the impulse response and the original speech. Experimental results indicate that eight pulses for every 10-ms period produce a speech quality in which little distortion can be perceived. Examples of original speech, synthesized speech, multipulses, and residual signals for a 100-ms period are shown in Fig. 6.34 (Atal and Remde, 1982). In this case, the number of poles (p) is 16, the frame length is 20 ms, the frame period is 10 ms, and multipulses are determined for every 5-ms period. This figure confirms that the waveform is accurately reproduced even for a transitional signal between voiced and unvoiced periods. A subjective evaluation test shows that quality equivalent to that of 6.4-bit log PCM can be obtained at 9.6 kbps (16 pulses for a 20-ms period; Ozawa et al., 1982). Pitch information has been confirmed to be useful for improving quality when the number of pulses is small (Ozawa and Araseki, 1986).

6.5.3 Code-Excited Linear Predictive Coding (CELP)
Code-excited LPC (CELP), or stochastically excited LPC, is a method in which the residual signal is vector quantized using a stochastic or random sequence of pulses. The residual signal is produced by both
FIG. 6.34 Examples of original speech, synthesized speech, multipulses, and residual signals (Atal and Remde, 1982).
the long-term prediction based on the long-term periodicity of the source and the short-term prediction based on the correlation between adjacent samples (Atal and Schroeder, 1984). This method can be regarded as a modification of MPC in which the multipulses are replaced by vector-quantized random pulse sequences. Since each codevector is a random noise vector, L kinds of N-sample vectors can be stored as a single (L + N)-sample noise wave instead of storing them separately. The different codevectors having N samples are then extracted from the single vector by shifting the starting position sample by sample. Each vector code is thus represented by the position in the (L + N)-sample vector from which the N-sample sequence is extracted. Selection of the optimum N-sample vector is performed so as to minimize the perceptually weighted sum of the squared error between the synthetic speech wave and the original speech wave as shown in Fig. 6.35 (Atal and Rabiner, 1986). The same (L + N)-sample vector is stored in the decoder, and the N-sample vector at the position indicated by the transmitted
FIG. 6.35 Search procedure for determining best excitation code in CELP.
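The overlapping-codebook idea described above can be sketched as follows. This is a simplified illustration (hypothetical names and sizes) whose search ignores the synthesis filter and perceptual weighting that a real CELP coder applies.

```python
import numpy as np

def make_overlapping_codebook(L, N, seed=0):
    """L codevectors of N samples, all drawn from one (L+N)-sample
    Gaussian noise wave by shifting the start position sample by
    sample, so only L+N samples need to be stored."""
    rng = np.random.default_rng(seed)
    base = rng.standard_normal(L + N)       # single stored noise wave
    codebook = np.stack([base[k:k + N] for k in range(L)])
    return base, codebook

def best_codevector(target, codebook):
    """Pick the index minimizing squared error after an optimal gain
    (real CELP minimizes perceptually weighted error instead)."""
    gains = codebook @ target / np.einsum('ij,ij->i', codebook, codebook)
    errors = ((gains[:, None] * codebook - target) ** 2).sum(axis=1)
    k = int(np.argmin(errors))
    return k, float(gains[k])
```

Because codevector k is just samples k through k+N-1 of the stored wave, the decoder needs only the transmitted index (and gain) to regenerate the same excitation.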
signal is extracted from the (L + N)-sample vector as the excitation signal. High-quality speech with a mean SNR of roughly 15 dB was reported to be obtained under the conditions of N = 40 (5 ms) and a bit rate of 0.25 bit/sample (10 bits/40 samples) (Schroeder and Atal, 1985). MPC and CELP are analysis-by-synthesis coders, which are essentially waveform-approximating coders because they produce an output waveform that closely follows the original waveform. (The minimization of the mean square error in the perceptual space via perceptual weighting causes a slight modification to the waveform-approximation principle.) This eliminates the old vocoder problem of having to classify a speech segment as voiced or unvoiced. Such a decision can never be made flawlessly, and many speech segments have both voiced and unvoiced properties. Recent vocoders have also found ways to eliminate the need for making the voiced/unvoiced decision. The multiband excitation (MBE) (Griffin and Lim, 1988) and sinusoidal transform coders (STC) (McAulay and Quatieri, 1986), also known as harmonic coders, divide the spectrum into a set of harmonic bands. Individual bands can be declared voiced or unvoiced. This allows the coder to produce a mixed signal: partially voiced and partially unvoiced. Mixed-excitation LPC (MELP) (Supplee et al., 1997) and waveform interpolation (WI) (Kleijn and Haagen, 1994) produce excitation signals that are a combination of periodic and noise-like components. These modern vocoders produce excellent-quality speech compared to their predecessors, the channel vocoder and the LPC vocoder. However, they are still less robust than higher-bit-rate waveform coders. Moreover, they are more affected by background noise and cannot code music well.

6.5.4 Coding by Phase Equalization and Variable-Rate Tree Coding
Speech coding methods can be classified into waveform coding and analysis-synthesis, the difference between them being whether the sound source is modeled or not. For example, the excitation
information is compressed by quantizing the LPC residual in APC, whereas it is modeled dichotomously using either a periodic pulse train or a random noise source in an LPC vocoder. The residual waveform representation in an LPC vocoder is considered to be a process of both whitening the short-time power spectrum of the prediction residual and modifying the short-time phase into the zero phase or random phase. A phase modification process utilizing human perceptual insensitivity to the short-time phase change is highly effective for bit rate reduction. Also with waveform coding, if the LPC residual can be modified into a pulselike wave, speech energy will be temporally localized, and, hence, coding efficiency can be increased by time-domain bit allocation. This is similar to the effectiveness of energy localization in the frequency domain, which increases the prediction gain. For this purpose, a highly efficient speech coding method has been proposed combining phase equalization in the time domain with variable-rate (time-domain bit allocation) tree coding (Moriya and Honda, 1986). Figure 6.36 shows a block diagram of this system. The phase equalization is realised through the matched filter principle. The characteristics of the phase equalization filter are determined to
FIG. 6.36 Block diagram of coder based on phase equalization of prediction residual waveform and variable-rate tree coding.
minimize the mean square error between the pseudoperiodic pulse train and the filter output for the residual signal. The impulse response of the phase equalization filter can be approximated as a time-reversed residual waveform under the assumption that adjacent samples are uncorrelated. The output residual of this filter is approximately zero-phased over a short period, and it becomes an impulse-train-like signal. The matched filter principle implies that the phase equalization filter corresponds to the filter which maximizes the amplitude at the pulse position under the fixed gain condition. Examples of phase-equalized waveforms, shown in Fig. 6.37, clearly indicate that the residual signal is modified to an impulse-train-like signal by phase equalization.
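Under the stated uncorrelated-samples assumption, the phase equalizer is simply a matched filter whose impulse response is the time-reversed residual. A minimal sketch (hypothetical function name and frame length):

```python
import numpy as np

def phase_equalize(residual, frame_len=160):
    """Matched-filter phase equalization sketch: filtering the frame
    with its own time reversal yields the frame autocorrelation,
    which is symmetric (approximately zero-phase) with its energy
    concentrated at the pulse position."""
    frame = residual[:frame_len]
    h = frame[::-1]                      # time-reversed residual
    return np.convolve(frame, h)         # = autocorrelation sequence
```

The output is symmetric about its center, and the center sample equals the frame energy, illustrating how the energy is concentrated into an impulse-like peak.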
FIG. 6.37 Examples of phase-equalization processing for original speech and residual signal by a female speaker.
In this method, the phase-equalized residual signal is coded by variable-rate tree coding (VTRC). Variable-rate coding is effective for signals with temporally localized energy. Tree coding is a method in which a tree structure of signals is used to search for the optimum excitation source signal sequence minimizing the error between the input speech signal and the coded output (Anderson and Bodie, 1975; Fehn and Noll, 1982). The tree coder in this system is constructed by a code generator having a variable-rate tree structure and an all-pole prediction filter. Each code in the code sequence minimizing the error between the phase-equalized speech wave and the coded output over several sample values is successively determined using a method similar to the A-b-S procedure. The number of bits R(n) and the quantization step size for each branch of the tree are allocated according to the temporal localization of the residual energy. Decoding is performed by the excitation of the all-pole filter using a residual signal which is phase-equalized and tree-coded. Since the decoded speech waveform is processed by phase equalization, it is generally different from the original waveform. The coding method based on phase equalization not only provides an efficient method of speech waveform representation, but also makes possible a unified modeling of the speech waveform in the same framework including waveform coding and the analysis-synthesis method. The latter capability is similar to that possible in the multipulse coding (MPC) method.
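The time-domain bit allocation R(n) mentioned above can be illustrated with the classic log-energy allocation rule. This sketch is an assumption about the general approach, not the specific rule of Moriya and Honda, and all names are hypothetical.

```python
import numpy as np

def allocate_bits(residual, n_branches, total_bits):
    """Allocate bits per tree branch following the log of the local
    residual energy, so the temporally localized pulses produced by
    phase equalization receive more bits than low-energy regions."""
    segs = np.array_split(np.asarray(residual, dtype=float), n_branches)
    log_e = np.array([np.log(np.dot(s, s) + 1e-12) for s in segs])
    # equal share plus a log-energy correction (in bits, hence /log 2)
    raw = total_bits / n_branches + 0.5 * (log_e - log_e.mean()) / np.log(2)
    return np.maximum(np.round(raw), 0).astype(int)
```

A high-energy segment (e.g., around an equalized pulse) receives more bits than a quiet one; practical schemes would also renormalize so the rounded allocation meets the exact bit budget.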
6.6 EVALUATION AND STANDARDIZATION OF CODING METHODS

6.6.1 Evaluation Factors of Speech Coding Systems
Speech coding has found a diverse range of applications such as cellular telephony, voice mail, multimedia messaging, digital answering machines, packet telephony, audio-visual teleconferencing, and, of course, many other applications in the Internet arena.
Evaluation factors for speech coding systems include bit rate (amount of information in coded speech), coded speech quality (including robustness against noise and coding errors), complexity of the coder and decoder (usually a coder is more complex than a decoder), and coding delay. The cost of coding systems generally increases with their complexity. For most applications, speech coders are implemented either on special-purpose devices (such as DSP chips) or on general-purpose computers (such as a PC for Internet telephony). In either case, the important quantities are the number of (millions of) instructions per second needed to operate in real time and the amount of memory used. Coding delays can be objectionable in two-way telephone conversations, especially when they are added to the existing delays in the transmission network and combined with uncanceled echoes. The practical limit of round-trip delays for telephony is about 300 ms. One component of the delay is due to the algorithm and the other to the computation time. Individual-sample coders have the lowest delay, while coders that work on a block or frame of samples have greater delay. Techniques for evaluating the quality of coded speech can be divided into subjective evaluation and objective evaluation techniques. Subjective evaluation includes opinion tests, pair comparison (sometimes called A-B) tests, and intelligibility tests. The former two methods measure the subjective quality, including naturalness and ease of listening, whereas the latter method measures how accurately phonetic information can be transmitted. In the opinion tests, quality is measured by subjective scores (usually on five levels: 5 is excellent, 4 good, 3 fair, 2 poor, and 1 bad). The mean opinion score (MOS) is then calculated as the mean value over the many listeners.
Since the MOS indicates only the relative quality in a set of test utterances, the opinion-equivalent SNR value has also been proposed to ensure that the MOS is properly related to the objective measures (Richards, 1973). This value indicates the signal-to-amplitude-correlated noise ratio of the reference signal which results in the same MOS as that for each test utterance. Amplitude-correlated noise is white noise which has been modified by the speech signal
amplitude in order to give it the same characteristics as the quantization noise. The energy ratio of the original signal to the modified noise is called the signal-to-amplitude-correlated noise ratio. In the pair comparison test, each test utterance is compared with various other utterances, and the probability that the test utterance is judged to be better than the other utterances is calculated as the preference score. Intelligibility is measured using the correct identification scores for sentences, words, syllables, or phonemes (vowels and consonants). Analyzing the relationship between syllable and sentence intelligibility indicates that when the syllable identification (articulation) score exceeds 75%, the sentence intelligibility score approaches 100%. The intelligibility is often indicated by the AEN (articulation-equivalent loss), the calculation of which is based on the identification (articulation) score (Richards, 1973). The AEN is the difference in transmission losses between the system to be measured and the reference system when the phoneme identification scores for both systems are 80%. In the calculation, the reference system is adjusted to reproduce the acoustic transmission characteristics between two people facing each other at a distance of 1 m in a free field. Importantly, the AEN values are more stable than the raw identification scores. Although definitive evaluation of coding methods should be performed by human listeners, the subjective tests require a great deal of labor and time. Therefore, it is practical to build objective evaluation methods producing evaluation results which correspond well with the subjective evaluation results. Among the various objective measures proposed, the most fundamental is the SNR. Similar to this measure is the segmental SNR (SNRseg), which is the SNR measured in dB over short periods such as 30 ms and averaged over a long speech interval.
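The segmental SNR just defined can be computed directly. A minimal sketch, assuming 30-ms segments at an 8-kHz sampling rate (real implementations usually also exclude silent segments):

```python
import numpy as np

def segmental_snr(original, coded, fs=8000, seg_ms=30):
    """Average of short-time SNRs in dB over non-overlapping
    segments of the utterance."""
    seg = int(fs * seg_ms / 1000)
    snrs = []
    for i in range(0, len(original) - seg + 1, seg):
        s = original[i:i + seg]
        e = s - coded[i:i + seg]             # coding error in this segment
        snrs.append(10 * np.log10(np.dot(s, s) / np.dot(e, e)))
    return float(np.mean(snrs))
```

Averaging the per-segment dB values, rather than the raw energies, is what lets low-amplitude periods contribute on an equal footing.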
The SNRseg corresponds better with the subjective values than does the SNR, since the short-term SNRs of even relatively small-amplitude periods contribute to this value. In addition to this time-domain evaluation, spectral-domain evaluation methods have also been proposed. These methods are based on spectral distortion measured using various parameters
such as the spectrum, predictor coefficients, autocorrelation function, and cepstrum. The most typical method uses the cepstral distance measure defined as

CD = D_b \sqrt{2 \sum_{i=1}^{p} \left( c_i^{(\mathrm{in})} - c_i^{(\mathrm{out})} \right)^2}    (6.22)
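Eq. (6.22) computes directly from the two cepstral coefficient vectors; a minimal sketch (function name hypothetical):

```python
import numpy as np

def cepstral_distance(c_in, c_out):
    """Cepstral distance of Eq. (6.22) in dB: the Euclidean distance
    between the two cepstral vectors (off-zero terms doubled) scaled
    by D_b = 10 / ln 10."""
    Db = 10.0 / np.log(10.0)
    d = np.asarray(c_in, dtype=float) - np.asarray(c_out, dtype=float)
    return Db * np.sqrt(2.0 * np.dot(d, d))
```

Identical input and output cepstra give a distance of 0 dB, and the measure grows with the log-spectral deviation between coder input and output.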
where c_i^(in) and c_i^(out) are the cepstral or LPC cepstral coefficients for the input and output signals of the coder, and D_b is the constant for transforming the distance value into a dB value (D_b = 10/ln 10) (Kitawaki et al., 1982). Subjective evaluation results using the MOS for various coding methods verify that the CD has a better correspondence to the subjective measure than does the SNRseg. The relationship between the CD and MOS is demonstrated in Fig. 6.38, in which the regression equation between them obtained from the experiments is indicated by a quadratic curve. The standard deviation for the evaluation values from the
FIG. 6.38 Relationship between CD and MOS.
regression curve is 0.18. These results indicate that quality equivalent to that of 7-bit log PCM can be obtained by 32-kbps ADPCM or 16-kbps APC-AB and ATC. The objective and subjective measures do not correspond well in several cases, such as in systems incorporating noise shaping. A universal objective measure which can be applied to all kinds of coding systems has not yet been established. Table 6.2 compares the trade-offs incurred in using representative types of speech coding algorithms. The algorithms must be evaluated based on a total measure constructed from an appropriately weighted combination of these factors. A broader range of coding methods, from high-quality coding to very-low-bit-rate coding, is now being investigated in order to meet the expected demands. Digital network telephony generally operates at 64 kbps, cellular systems run from 5.6 to 13 kbps, and secure telephony functions at 2.4 and 4.8 kbps. High-quality coding transmits not only speech but also wideband signals such as music at a rate of 64 kbps. Very-low-bit-rate coding under investigation fully utilizes the speech characteristics to transmit speech signals at 200 to 300 bps. The evaluation methods for these coding techniques, specifically the weighting factors for combining evaluation factors, must be determined depending on their bit rates and application purposes. A crucial future problem is how best to measure the individuality and naturalness of coded speech.
6.6.2 Speech Coding Standards
For speech coding to be useful in telecommunication applications, it has to be standardized (i.e., it must conform to the same algorithm and bit format) to ensure universal interoperability. Speech-coding standards are established by various standards organizations: for example, ITU-T (International Telecommunication Union, Telecommunication Standardization Sector, formerly CCITT), TIA (Telecommunications Industry Association), RCR
TABLE 6.2 Comparison of trade-offs among representative types of speech coding algorithms.
(Research and Development Center for Radio Systems) in Japan, ETSI (European Telecommunications Standards Institute), and other government agencies (Childers et al., 1998). Figure 6.39 summarizes the trend in standardization at ITU-T, as well as gives examples of standardized coding for digital cellular phones. The figure also exemplifies analysis-synthesis systems. Since CELP can achieve relatively high coding quality at bit rates ranging from 4 to 16 kbps, CELP-based coders have been adopted in a wide range of recent standardization. The LD-CELP (low-delay CELP), CS-ACELP (conjugate structure algebraic CELP), VSELP (vector sum excited linear prediction), and PSI-CELP (pitch synchronous innovation CELP) in the figure are CELP-based coders. The principal points of each are summarized as follows.

LD-CELP
LD-CELP was standardized by the ITU-T for use in integrated services digital networks (ISDN). Figure 6.40 shows the search procedure for determining the best excitation code in LD-CELP (Chen et al., 1990). The key feature of this coding system is its short system delay (2 ms), which is achieved by using a short block length for the speech and the backward prediction technique instead of the forward prediction used in conventional CELP. The order of prediction is around 50, covering the pitch period range, which is five times longer than that in conventional CELP.

CS-ACELP
The key features of the CS-ACELP system are its conjugate codebook structure in the excitation source generator and its shorter system delay (the round-trip delay is less than 32 ms) than with conventional CELP (Kataoka et al., 1993). The conjugate structure reduces memory requirements and enhances robustness against transmission errors. The shorter system delay is achieved by using
backward prediction, similar to LD-CELP. The excitation source is efficiently represented by an algebraic coding structure. 8-kbps CS-ACELP has coded speech quality equivalent to 32-kbps ADPCM, and has been used in personal handyphone systems (PHS) in Japan.

VSELP
In VSELP, as shown in Fig. 6.41, the excitation source is generated by a linear combination of several fixed basis vectors; this enhances robustness against channel errors (Gerson and Jasiuk, 1990). Although a one-bit transmission error in the excitation source vector index produces a completely different vector in conventional CELP, only the inversion of one basis vector occurs in VSELP, and its effect is much smaller. In addition, an efficient multi-stage vector quantization technique is employed to speed up the codebook search. Complexity and memory requirements are significantly reduced by VSELP. VSELP has been standardized for the full-bit-rate (11.2 kbps in Japan and 13 kbps in North America, including error protection bits) system for digital cellular and portable telephone systems.

PSI-CELP
The PSI-CELP algorithm, shown in Fig. 6.42, has two important features: 1) the random excitation vectors in the excitation source generator are given pitch periodicity for voiced speech by pitch synchronization, and 2) the codebook has a two-channel conjugate structure (Miki et al., 1993). The pitch synchronization algorithm using an adaptive codebook reduces quantization noise without losing naturalness at low bit rates. In particular, this significantly improves voiced speech quality. The two-channel conjugate structure and a fixed codebook for transient speech signals have been proposed to reduce memory requirements and improve robustness against channel errors. This conjugate structure is made
by selecting the best combination of code vectors from well-organized codebooks to minimize the distortion resulting from summing the two codebooks. PSI-CELP has been adopted as the digital cellular standard in Japan for a half-rate (3.45 kbps for speech + error protection = 5.6 kbps) digital cellular mobile radio system. Its quality at the half bit rate nearly equals or is better than that of VSELP at the full bit rate. However, the amount of processing and the codec system delay for the former are about twice those of the latter.

6.7 ROBUST AND FLEXIBLE SPEECH CODING
Most of the low-bit-rate speech coders designed in the past implicitly assume that the signal is generated by a speaker without much interference. These coders often demonstrate degradation in quality when used in an environment in which there is competing speech or background noise, including music. A recent research challenge is to make coders perform robustly under a wide range of conditions, including noisy automobile environments (Childers et al., 1998). From the application point of view, it is useful if a common coder performs well for both speech and music. Another challenge is the coder's resistance to transmission errors, which are particularly critical in cellular and packet communication applications. Methods that combine source and channel coding schemes or that conceal errors are important in enhancing the usefulness of the coding system. As packet networking is becoming more and more prevalent, a new breed of speech coders is emerging. These coders need to take into account and negotiate for the available network resources (unlike the existing digital telephony hierarchy in which a constant bit rate per channel is guaranteed) in order to determine the right coder to use. They also have to be able to deal with packet losses (severe at times). For this reason, the idea of embedded and scaleable (in terms of bit rates) coders is being investigated with considerable interest (Elder, 1997).
Speech Synthesis
7.1 PRINCIPLES OF SPEECH SYNTHESIS

Speech synthesis is a process which artificially produces speech for various applications, diminishing the dependence on using a person's recorded voice. Speech synthesis methods enable a machine to pass on instructions or information to the user through 'speaking.' The applications include information supply services over the telephone, such as banking services and directory services; various reservation services; public announcements, such as those at train stations; reading out manuscripts for collation; reading e-mails, faxes, and web pages over the telephone; voice output in automatic translation systems; and special equipment for handicapped people, such as word processors with reading-out capability, book-reading aids for visually handicapped people, and speaking aids for vocally handicapped people. As already mentioned, progress in LSI/computer technology and LPC techniques has collectively helped to advance speech synthesis research. Moreover, information supply services are now available in a wider range of application fields. Speech synthesis
research is closely related to research into deriving the basic units of information carried in speech waves and into the speech production mechanism. Voice response technology designed to convey messages via synthesized speech presents several advantages for information transmission:

Anybody can easily understand the message without training or intense concentration;
The message can be received even when the listener is involved in other activities, such as walking, handling an object, or looking at something;
The conventional telephone network can be used to realize easy, remote access to information; and
This form of messaging is essentially a paper-free communication form.

The last 'advantage' also means, however, that the lack of a hard copy of the messages makes them difficult to scan. Thus, synthesized speech is sometimes inappropriate for conveying a large amount of complicated information to many people. History's first speech synthesizer is said to have been constructed in 1779, more than 200 years ago. Figure 7.1 shows the structure of the speech synthesizer subsequently produced by von Kempelen in 1791 (Flanagan, 1972). This synthesizer, the first of its kind capable of producing both vowels and consonants, was intended to simulate the human articulatory organs. Sounds originating through the vibration of reeds were modulated by the resonance of a leather tube and radiated as a speech wave. Fricative sounds were produced through the 'S' and 'SH' whistles. This synthesizer is purported to have been able to produce words consisting of up to 19 consonants and 5 vowels. Early mechanically structured speech synthesizers, of course, could not generate high-quality synthesized speech, since it was difficult to continuously and rapidly change the vocal tract shape.
FIG. 7.1 Structure of the speech synthesizer produced by von Kempelen (Flanagan, 1972).
The first synthesizer incorporating an electric structure was made in 1922 by J. Q. Stewart. Two coupled resonant electric circuits were excited by a current interrupted at a rate analogous to the voice pitch. By carefully tuning the circuits, sustained vowels could be produced by this synthesizer. The first synthesizer which actually succeeded in generating continuous speech was the voder, constructed by H. Dudley in 1939. It produced continuous speech by controlling the fundamental period and band-pass filter characteristics, respectively, using a foot pedal and 10 finger keys. The voder, which later served as the prototype of the speech synthesizer for the vocoder introduced in Sec. 4.6.2, became a principal foundation block for recent speech synthesis research. The voder structure, based on the linear separable equivalent circuit model, is still used in present speech synthesizers. Present speech synthesis methods can be divided into three types:

1) Synthesis based on waveform coding, in which speech waves of recorded human voice, stored after waveform coding or immediately after recording, are used to produce desired messages;
2) Synthesis based on the analysis-synthesis method, in which speech waves of recorded human voice are transformed into parameter sequences by the analysis-synthesis method and stored, with a speech synthesizer being driven by concatenated parameters to produce messages; and
3) Synthesis by rule, in which speech is produced based on phonetic and linguistic rules from letter sequences or sequences of phoneme symbols and prosodic features.

The principles of these three methods and a comparison of their features are presented in Fig. 7.2 and Table 7.1, respectively. Synthesis systems based on the waveform coding method are simple and provide high-quality speech, but they also exhibit low versatility; that is, the messages can only be used in the form recorded. At the other extreme, synthesis-by-rule systems feature
FIG. 7.2 Basic principles of three speech synthesis methods.
great versatility but are also highly complex and, as yet, of limited quality. In practical cases, it is desirable to select the method most appropriate for the objectives, fully taking the performance and properties of each method into consideration. The details of each method will be discussed in the following.
7.2 SYNTHESIS BASED ON WAVEFORM CODING
As mentioned, synthesis based on waveform coding is the method by which short segmental units of human voice, typically words or
218
E S
n v)
v)
a. a m I
b
m
0
.-c
E
v)
I
b
T
0
0
a,>
v,
-k Y ,
Chapter 7
Speech Synthesis
219
phrases, are stored, and the desired sentence speech is synthesized by selecting and connecting the appropriate units. In this method, the quality of synthesized sentence speech is generally influenced by the continuity of acoustic features at the connections between units. Acoustic features include the spectral envelope, amplitude, fundamental frequency, and speaking rate. If large units such as phrases or sentences are stored and used, the quality (intelligibility and naturalness) of synthesized speech is better, although the variety of words or sentences which can be synthesized is restricted. On the other hand, when small units such as syllables or phonemes are used, a wide range of words and sentences can be synthesized but the speech quality is largely degraded. In practical systems typically available at present, words and phrases are stored, and words are inserted or connected with phrases to produce the desired sentence speech. Since the pitch pattern of each word changes according to its position in differing sentences, it is necessary to store variations of the same words with rising, flat, and falling inflections. The inflection selected also depends on whether the sentence represents a question, statement, or exclamation. Two major problems exist in simply concatenating words to produce sentences (Klatt, 1987). First, a spoken sentence is very different from a sequence of words uttered in isolation. In a sentence, words are as short as half their duration when spoken in isolation, making concatenated speech seem painfully slow. Second, the sentence stress pattern, rhythm, and intonation, which are dependent on syntactic and semantic factors, are disruptively unnatural when words are simply concatenated, even if several variations of the same word are stored. In order to resolve such problems, synthesis methods concatenating phoneme units have recently been widely employed.
The acceleration of computer processing and the reduction of memory prices are advancing these methods. In these methods, a large number of phoneme units or sub-phoneme (shorter than phoneme) units corresponding to allophones and pitch variations are stored, and the most appropriate units are selected based on rules and evaluation measures and are concatenated to synthesize speech. Several methods have been developed of overlapping and
adding pitch-length speech waves according to the pitch period of the speech being synthesized, as well as various methods of controlling prosodic features by iterating or thinning out the pitch waveforms. These methods can synthesize unrestricted sentences even though the units are stored as speech waveforms. Typical examples of such methods include TD-PSOLA and HNM, described in the following. In order to reduce memory requirements, the units are sometimes compressed by waveform coding methods such as ADPCM, rather than simply stored as analog or digital speech waves. Synthesis derived from the analysis-synthesis method, which will be discussed in Section 7.3, is considered to be an advanced form of this method from the viewpoint of its information reduction and controllability.

TD-PSOLA
The TD-PSOLA (Time-Domain Pitch-Synchronous OverLap-Add) method (Moulines and Charpentier, 1990) is currently one of the most popular pitch-synchronous waveform concatenation methods. This method relies on the speech production model described by the sinusoidal framework. The 'analysis' part consists of extracting short-time analysis signals by multiplying the speech waveform by a sequence of time-translated analysis windows. The analysis windows are located around glottal closure instants, and their length is proportional to the local pitch period. During unvoiced frames the analysis time instants are set at a constant rate. During the 'synthesis' process, a mapping between the synthesis time instants and analysis time instants is determined according to the desired prosodic modifications. This process specifies which of the short-time analysis signals will be eliminated or duplicated in order to form the final synthetic signal.

HNM

The HNM (Harmonic plus Noise Model) method (Laroche et al., 1993) is based on a pitch-synchronous harmonic-plus-noise representation
of the speech signal. The spectrum is divided into two bands, with the low band being represented solely by harmonically related sinewaves having slowly varying amplitudes and frequencies. Here,

h(t) = Σ_{k=1}^{K(t)} A_k(t) cos(kθ(t) + φ_k(t))
with θ(t) = ∫_0^t ω_0(l) dl. A_k(t) and φ_k(t) are the amplitude and phase at time t of the kth harmonic, ω_0(t) is the fundamental frequency, and K(t) is the time-varying number of harmonics included in the harmonic part. The frequency content of the high band is modeled by a time-varying AR model; its time-domain structure is represented by a piecewise linear energy-envelope function. The noise part, n(t), is therefore assumed to have been obtained by filtering white Gaussian noise b(t) by a time-varying, normalized all-pole filter h(τ, t) and multiplying the result by an energy envelope function w(t), such that

n(t) = w(t) [h(τ, t) * b(t)]
A time-varying parameter referred to as the maximum voiced frequency determines the limit between the two bands. During unvoiced frames the maximum voiced frequency is set to zero. At synthesis time, HNM frames are concatenated, and the prosody of the units is altered according to the desired prosody.
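To make the harmonic part concrete, the sum of harmonics above can be synthesized directly; the following is a minimal numpy sketch (the function name, the zero initial phases, and the toy amplitudes are illustrative assumptions, not details of the HNM papers):

```python
import numpy as np

def synthesize_harmonic_part(f0, amps, fs=16000):
    """Harmonic part h(t) = sum_k A_k(t) cos(k*theta(t) + phi_k), with
    theta(t) the running integral of the fundamental frequency omega_0(t).
    f0:   per-sample fundamental frequency values [Hz]
    amps: (K, N) array of slowly varying harmonic amplitudes A_k(t)"""
    n_harm, n = amps.shape
    # theta(t) = integral of omega_0; a cumulative sum approximates the integral
    theta = 2 * np.pi * np.cumsum(f0) / fs
    h = np.zeros(n)
    for k in range(1, n_harm + 1):
        h += amps[k - 1] * np.cos(k * theta)   # phases phi_k taken as zero here
    return h

# toy usage: a 100-Hz tone with three harmonics of decreasing amplitude
fs, n = 16000, 1600
f0 = np.full(n, 100.0)
amps = np.stack([np.full(n, 1.0 / k) for k in range(1, 4)])
h = synthesize_harmonic_part(f0, amps, fs)
```

In a real HNM synthesizer the amplitudes, phases, and the number of harmonics K(t) all change per frame and are interpolated pitch-synchronously; this sketch only illustrates the additive-harmonic equation itself.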
7.3 SYNTHESIS BASED ON ANALYSIS-SYNTHESIS METHOD
In synthesis derived from the analysis-synthesis method, words or phrases of human speech are analyzed based on the speech production model and stored as time sequences of feature parameters. Parameter sequences of appropriate units are connected
and supplied to a speech synthesizer to produce the desired spoken message. Since the units are stored as source and spectral envelope parameters, the amount of information is much less than with the previous method of storing waveforms, although the naturalness of synthesized speech is slightly degraded. Additionally, this method is advantageous in that changing the speaking rate and smoothing the pitch and spectral change at connections can be performed by controlling the parameters. Channel vocoders and speech synthesizers based on LPC analysis methods, such as the LSP and PARCOR methods, or on cepstral analysis methods, are used for this purpose. Phoneme-based speech synthesis can also be implemented by the analysis-synthesis method, in which the feature parameter vector sequence of each allophone is stored or produced by a model. A method has recently been developed using HMMs (hidden Markov models) to model the feature parameter production process for each allophone. In this method, a parameter vector sequence consisting of cepstra and delta-cepstra for a desired sentence is automatically produced by a concatenation of allophone HMMs based on the likelihood maximization criterion. Since delta-cepstra are taken into account in the likelihood maximization process, a smooth parameter sequence is obtained (Tokuda et al., 1995).
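The smoothing effect of including delta features in the likelihood maximization can be illustrated with a toy one-dimensional version of the idea (the delta definition, weights, and matrix setup here are illustrative, not the exact formulation of Tokuda et al.): given per-frame target means and variances for static and delta features, the maximum-likelihood static trajectory is the solution of a small linear system.

```python
import numpy as np

def smooth_trajectory(mu_static, mu_delta, var_static, var_delta):
    """Maximum-likelihood static trajectory c under static and delta targets.
    The delta at frame t is modeled as (c[t+1] - c[t-1]) / 2 (clamped at the
    edges); solves (W' S^-1 W) c = W' S^-1 mu for c."""
    n = len(mu_static)
    D = np.zeros((n, n))
    for t in range(n):
        lo, hi = max(t - 1, 0), min(t + 1, n - 1)
        D[t, hi] += 0.5
        D[t, lo] -= 0.5
    W = np.vstack([np.eye(n), D])                 # stacked static and delta rows
    mu = np.concatenate([mu_static, mu_delta])
    prec = np.concatenate([1.0 / var_static, 1.0 / var_delta])
    A = W.T @ (prec[:, None] * W)                 # W' S^-1 W
    b = W.T @ (prec * mu)                         # W' S^-1 mu
    return np.linalg.solve(A, b)

# a hard step in the static means is smoothed when deltas are told to be small
mu_s = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])
c = smooth_trajectory(mu_s, np.zeros(6), np.full(6, 1.0), np.full(6, 0.1))
```

Because the delta targets (zero, with small variance) penalize abrupt changes, the resulting trajectory ramps gradually through the step instead of jumping, which is exactly the smoothing behavior the text attributes to the delta-cepstra constraint.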
7.4 SYNTHESIS BASED ON SPEECH PRODUCTION MECHANISM
Two methods are capable of producing speech by electroacoustically replicating the speech production mechanism. One is the vocal tract analog method, which simulates the acoustic wave propagation in the vocal tract. The other is the terminal analog method, which simulates the frequency spectrum structure, that is, the resonance and antiresonance characteristics, and thereby reproduces articulation as a result. Although in the early years these methods were realized by analog processing using analog computers or variable resonance circuits, most recent systems use digital
processing owing to advances in digital circuits and computers and to their ease of control.
7.4.1 Vocal Tract Analog Method
The vocal tract analog method is based on the principle described in Sec. 3.3. More specifically, the vocal tract is represented by a cascade connection of straight tubes with various cross-sectional areas, each of which has a short length Δx. The acoustic waves in the tubes are separated into forward and backward waves. Acoustic wave propagation in the vocal tract is represented by the integration of reflection and penetration of forward and backward waves at each boundary between adjacent tubes. The amount of reflection and penetration at the boundary is determined by the reflection coefficient, which indicates the amount of mismatch in acoustic impedance. The signal processing for speech synthesis based on this principle was detailed previously in Fig. 3.4. A method has also been investigated in which vocal tract characteristics are simulated by a cascade connection of π-type four-terminal circuits, each of which consists of L- and C-elements. The circuit is terminated by another circuit having a series of L- and R-elements, which is equivalent to the radiation impedance at the lips. The vocal tract model is excited by a pulse generator at the input terminal of the π-type circuit for voiced sounds, and by a white noise generator connected to a four-terminal circuit where turbulent noise is produced for consonants. Rather than remaining with the modeling of the vocal tract area function, it would be better to take the next, more difficult step, and directly formulate a model based on the structure of the articulatory organs. In such a modeling system, which is called the articulatory model, the locations and shapes of the articulatory organs are used as control parameters for speech synthesis. In this method, synthesis rules are expected to be much clearer, since the articulatory movements of the organs can be directly described
and controlled. In an example speech synthesis system based on this method (Coker et al., 1978), the glottal area, the gap between the velum and pharynx, the tongue location, the shape of the tongue tip, the jaw opening, and the amount of narrowing and protrusion of the lips are controlled to produce speech. The speech synthesizer based on the vocal tract analog method is considered to be particularly effective in synthesizing transitional sounds such as consonants, since it can precisely simulate the dynamic manner of articulation in the vocal tract. Additionally, this method is considered to be easily related to the phonetic information conveyed by the speech wave. High-quality synthesized speech has not yet been obtained, however, since the movement of the articulatory organs has not been sufficiently clarified to offer suitable control rules.

7.4.2 Terminal Analog Method
The terminal analog method simulates the speech production mechanism using an electrical structure consisting of the cascade or parallel connection of several resonance (formant) and antiresonance (antiformant) circuits. The resonance or antiresonance frequency and bandwidth of each circuit are variable. This method is also called the formant-type synthesis method. As indicated in Sec. 3.3.2 (resonance model), the complex frequency characteristics (Laplace transformation) of a resonance (pole) circuit can be represented as

T(s) = (σ_n² + ω_n²) / ((s − s_n)(s − s_n*))

where the pole is

s_n = −σ_n + jω_n
Digital simulation of this circuit can be represented through its z-transformation

T(z) = K / (1 − B z⁻¹ − C z⁻²),  B = 2 e^{−σ_n T} cos(ω_n T),  C = −e^{−2σ_n T}

where T is the sampling period, K is a gain constant, and the coefficients are obtained via the residues (Res[·]) of the partial-fraction expansion of T(s). These equations imply that the digital simulation circuit can be represented as shown in Fig. 7.3(a). When the resonance frequency f_n = ω_n/2π [Hz] and bandwidth b_n = σ_n/π [Hz] are given, the circuit parameters can be obtained. The antiresonance (zero) circuit indicated in Fig. 7.3(b) can be easily obtained from the resonance circuit, based on the inverse circuit relationships. Here, k_n = ω_n²/(σ_n² + ω_n²). The cascade connection of resonance and antiresonance circuits is advantageous in that the mutual amplitude ratios between formants and antiformants are automatically determined. This is feasible because the vocal tract transmission characteristics can be directly represented by this method. On the other hand, parallel connection is advantageous in that the final spectral shape can be precisely simulated. Such precise simulation is made possible by the fact that the amplitude of each formant and antiformant can be represented independently, even though this method does not directly indicate the vocal tract transmission characteristics. Therefore, cascade connection is suitable for vowel speech having a clear spectral structure, and parallel connection is best intended
[Figure 7.3 shows block diagrams built from unit delays z⁻¹, a gain K·A, and feedback coefficients B and C.]

FIG. 7.3 Digital simulation of resonance and antiresonance circuits: (a) resonance (pole) circuit; (b) antiresonance (zero) circuit.
for nasal and fricative sounds, which feature such a complicated spectral structure that their pole and zero structures cannot be easily extracted. Figure 7.4 shows a typical example of the structure of a synthesizer constructed based on these considerations (Klatt, 1980).
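As a concrete illustration of such a resonance circuit, the following sketch implements a single two-pole digital resonator as a difference equation (the coefficient formulas follow the common formant-synthesis convention with unity gain at zero frequency; the variable names are illustrative, not taken from the book's figure):

```python
import math

def resonator_coeffs(f, bw, fs):
    """Two-pole digital resonator y[n] = A*x[n] + B*y[n-1] + C*y[n-2]
    for centre frequency f [Hz] and bandwidth bw [Hz] at sample rate fs."""
    C = -math.exp(-2 * math.pi * bw / fs)
    B = 2 * math.exp(-math.pi * bw / fs) * math.cos(2 * math.pi * f / fs)
    A = 1 - B - C                       # normalizes the gain at 0 Hz to unity
    return A, B, C

def resonate(x, f, bw, fs):
    A, B, C = resonator_coeffs(f, bw, fs)
    y, y1, y2 = [], 0.0, 0.0
    for s in x:
        out = A * s + B * y1 + C * y2   # the difference equation of Fig. 7.3(a)
        y.append(out)
        y1, y2 = out, y1
    return y

# impulse response of a 500-Hz formant with 60-Hz bandwidth at 10 kHz
h = resonate([1.0] + [0.0] * 199, 500.0, 60.0, 10000)
```

A cascade synthesizer chains several such resonators in series (so relative formant amplitudes come out automatically), while a parallel synthesizer feeds the same input to each resonator, scales each output, and sums them — matching the cascade/parallel tradeoff described above.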
7.5 SYNTHESIS BY RULE

7.5.1 Principles of Synthesis by Rule

Synthesis by rule is a method for producing any words or sentences based on sequences of phonetic/syllabic symbols or letters. In this
method, feature parameters for small fundamental units of speech, such as syllables, phonemes, or one-pitch-period segments, are stored and connected by rules. At the same time, prosodic features such as pitch and amplitude are also controlled by rules. The quality of the fundamental units for synthesis, as well as the control rules (control information and control mechanisms) for acoustic parameters, play crucially important roles in this method, and they must be based on the phonetic and linguistic characteristics of natural speech. Furthermore, to produce natural and distinct speech, temporal transitions of pitch, stress, and spectrum must be smooth, and other features such as pause locations and durations must be appropriate. Vocal tract analog, terminal analog, and LPC speech synthesizers used to be widely employed for speech production. As described in Section 7.2, waveform-based methods have recently become very popular. Feature parameters for fundamental units are extracted from natural speech or artificially created. When phonemes are taken as the fundamental units for speech production, the memory capacity can be greatly reduced, since the number of phonemes is generally between 30 and 50. However, the rules for connecting phonemes are so complicated that high-quality speech is hard to obtain. Therefore, units larger than phonemes, or allophone (context-dependent phoneme) units, are frequently used. In the latter case, thousands or tens of thousands of units are necessary for synthesizing high-quality speech. For the Japanese language, 100 CV syllables (C is a consonant, V is a vowel) corresponding to symbols in the Japanese 'kana' syllabary are often used as these units. CVC units have also been employed to obtain high-quality speech (Sato, 1984a). The number of CVC syllables appearing in Japanese is very large, being somewhere between 5000 and 6000. Thus, combinations of roughly 1000 CVC syllables frequently appearing in Japanese, along with roughly 200 CV/VC syllables, have been used to synthesize Japanese sentences.
Combinations of between 700 and 800 VCV units have also been attempted (Sato, 1978).
For example, the Japanese word 'sakura,' or cherry blossom, can be represented by the concatenation of these units as:

CV units:  sa + ku + ra
CVC units: sak + kur + ra
VCV units: sa + aku + ura
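The 'sakura' decomposition above can be reproduced mechanically for romanized input; the following sketch is purely illustrative (real Japanese systems operate on kana text, and the vowel inventory here is simply the five-vowel romaji set):

```python
import re

VOWEL = r'[aeiou]'

def cv_units(word):
    # each unit is an optional consonant cluster followed by one vowel
    return re.findall(r'[^aeiou]*' + VOWEL, word)

def vcv_units(word):
    # units run from one vowel centre to the next, plus an initial (C)V unit
    v = [m.start() for m in re.finditer(VOWEL, word)]
    return [word[:v[0] + 1]] + [word[v[i]:v[i + 1] + 1]
                                for i in range(len(v) - 1)]

print(cv_units("sakura"))   # ['sa', 'ku', 'ra']
print(vcv_units("sakura"))  # ['sa', 'aku', 'ura']
```

Note how the VCV units overlap at the vowels: that is precisely why they are connected at vowel steady parts, where spectra change slowly and joins are least audible.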
CVC units are connected at consonants, and VCV units at vowel steady parts. Each method presents its own advantages in ease of connection. In contrast, the English language has more than 3500 syllables, which expand to roughly 10,000 when allophones (phonological variations) are taken into consideration. Therefore, syllables are usually decomposed into smaller units, such as dyads or diphones (both have roughly 400 to 1000 units; Dixon and Maxey, 1968), or demisyllables (roughly 1000 units; Lovins et al., 1979). These units basically consist of individual phonemes and transitions between neighboring phonemes. Although demisyllables are slightly larger than the other two units, all units are composed in such a way that they may be concatenated using simple rules. In phoneme-based systems (Klatt, 1987), synthesis begins by selecting targets for each control parameter for each phonetic segment. Targets are sometimes modified by rules that take into account features of neighboring segments. Transitions between targets are then computed according to rules that range in complexity from simple smoothing to a fairly complicated implementation of the locus theory. Most smoothing interactions involve segments adjacent to one another, but the rules also provide for articulatory/acoustic interaction effects that span more than the adjacent segment. Since these rules are still very difficult to build, synthesis methods concatenating context-dependent phoneme units are now widely used, as described in Secs. 7.2 and 7.3. Control parameters for intonation, accent, stress, pause, and duration used to be manually input into the system in order to synthesize high-quality sentence speech. Because of the difficulty of
inputting these parameters, however, text-to-speech conversion, in which these control parameters are automatically produced based on letter sequences, has been introduced. Such a system can realize the human ability of reading written texts, that is, converting unrestricted text to speech. This is essentially the ultimate goal of speech synthesis. Building such a text-to-speech conversion system, though, necessitates clarifying how people understand sentences using knowledge of syntax and semantics. To be totally effective, this process of understanding must then be converted into computer programs. The principles of text-to-speech conversion are described in Sec. 7.6.

7.5.2 Control of Prosodic Features
Among prosodic features, intonation and accent are the most important in improving the quality of synthesized speech. Fundamental frequency, loudness, and duration are related to these features. In the period of speech between pauses, that is, the period of speech uttered in one breath, pitch frequency is usually high at the onset and gradually decreases toward the end due to the decrease in subglottal pressure. This characteristic is called the basic intonation component. The pitch pattern of each sentence is produced by adding the accent components of the pitch pattern to this basic intonation component. The accent components are determined by the accent position for each word or syllable. Figure 7.5 shows an example of the pitch pattern production mechanism for a spoken Japanese sentence, in which the pitch pattern is expressed by the superposition of phrase components and accent components (Sagisaka, 1998). The accent component for each phrase is finally determined according to the syntactic relationships existing between phrases. In a successful speech synthesis system for English (Klatt, 1987), the pitch pattern is modeled in terms of impulse and step commands fed to a linear smoothing filter. A step rise is placed near the start of the first stressed vowel, in accordance with the 'hat theory' of intonation. A step fall is placed near the start of the final stressed vowel. These rises and falls set off syntactic units. Stress is
also manifested in this rule system by causing an additional local rise on stressed vowels using the impulse commands. The amount of rise is greatest for the first stressed vowel of a syntactic unit, and smaller thereafter. Finally, small local influences of phonetic segments are added by positioning commands to simulate the rises for voiceless consonants and high vowels. A gradual declination line (the basic intonation component) is also included in the inputs to the smoothing filter. The top portion of Fig. 7.6 shows three typical clause-final intonation patterns, and the bottom portion exemplifies a pitch 'hat pattern' of rises and falls between the brim and top of the hat for a two-clause sentence. An example of the step and impulse commands for an English sentence, as well as the pitch pattern generated by these commands and the rules, are given in Fig. 7.7.
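A toy version of this command-plus-smoothing scheme can be sketched as follows (the one-pole smoothing filter, command values, and baseline numbers are illustrative assumptions, not Klatt's actual rule values):

```python
import numpy as np

def pitch_contour(n, steps, impulses, decline=20.0, alpha=0.05):
    """Generate an F0 contour [Hz] by smoothing step and impulse commands
    and adding a declination line.
    steps:    list of (frame, delta_hz) step commands
    impulses: list of (frame, delta_hz) impulse commands
    alpha:    coefficient of the one-pole smoothing filter"""
    cmd = np.zeros(n)
    for t, d in steps:
        cmd[t:] += d                      # a step raises/lowers all later frames
    for t, d in impulses:
        cmd[t] += d / alpha               # an impulse gives a brief local bump
    f0 = np.empty(n)
    s = 0.0
    for i in range(n):                    # one-pole low-pass smoothing filter
        s += alpha * (cmd[i] - s)
        f0[i] = s
    f0 += np.linspace(120.0 + decline, 120.0, n)   # declination from the onset
    return f0

# a 'hat pattern': step rise at frame 20, step fall at 150, stress bump at 60
f0 = pitch_contour(200, steps=[(20, 30.0), (150, -30.0)],
                   impulses=[(60, 2.0)])
```

The step pair produces the brim-to-top-to-brim hat shape, the impulse adds a local stress rise, and the linear term supplies the gradual declination, mirroring the three command types described above.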
[Figure: schematic pitch contours over time, labeled 'Final fall,' 'Question rise,' and 'Fall-rise continuum.']

FIG. 7.6 Three typical clause-final intonation patterns (top), and an example of a pitch "hat pattern" of rises and falls (bottom).
Duration control for each phoneme is also an important issue in synthesizing high-quality speech. The duration of each phoneme in continuous speech is determined by many factors, such as the characteristics peculiar to each phoneme, the influence of adjacent phonemes, and the number of phonemes as well as their location in the word (Sagisaka and Tohkura, 1984). The duration of each phoneme also changes as a function of the sentence context. Specifically, the final vowel of the sentence is lengthened, as are the stressed vowels and the consonants that precede them in the same syllable, whereas the vowels before voiceless consonants are shortened (Klatt, 1987).
7.6 TEXT-TO-SPEECH CONVERSION

Text-to-speech conversion is an ambitious objective and continues to be the focus of intensive research. A text-to-speech system would find a wide range of applications in a number of fields, ranging from accessing emails and various kinds of databases by voice over the telephone to reading machines for the blind. Figure 7.8 presents the chief elements of text-to-speech conversion (Crochiere and Flanagan, 1986). Input text often includes abbreviations, Roman numerals, dates, times, formulas, and punctuation marks. The system must be capable of first converting these into some reasonable, standard form and then translating them into a broad phonetic transcription. This is done by using a large pronouncing dictionary supplemented by appropriate letter-to-sound rules. In the MITalk-79 system, one of the major pioneering English text-to-speech conversion systems, 12,000 morphs, covering 98% of ordinary English sentences, are used as basic acoustic segments (Allen et al., 1979). Morphs, which are smaller than words, are minimum units of letter strings having linguistic meaning. They consist of stems, prefixes, and suffixes. The word 'changeable,' for example, is decomposed into the morphs 'change' and 'able.' The morph dictionary stores the
spelling and pronunciation for each morph, rules for connecting with other morphs, and rules for syntax-dependent variations. Phoneme sequences for low-frequency words are produced by letter-to-sound rules, instead of preparing morphs for them. This is based on the fact that irregular letter-to-sound conversions generally occur for frequent words, though the pronunciation of infrequent words tends to follow regular rules in English. The MITalk-79 system converts word strings into morph strings by a left-to-right recursive process using the morph dictionary. Each word is then transformed into a sequence of phonemes. Additionally, stress in each word is decided according to the effects of prefixes, suffixes, word compounding, and the part of speech. Sentence-level prosodic features are added according to syntactic and semantic analysis, and sentence speech is finally synthesized using the terminal analog speech synthesizer introduced in Sec. 7.4.2 (Fig. 7.4). The quality of the speech synthesized by the MITalk-79 system was evaluated by phoneme intelligibility in isolated words, word intelligibility in sentence speech, and sentence comprehensibility. Experimental results confirmed that the error rate for the phoneme intelligibility test was 6.9%, and that word intelligibility scores were, respectively, 93.2% and 78.7% in normal sentences and meaningless sentences. The DECtalk system, the most successful commercialized text-to-speech conversion system, is based on refinements of the technology used in the MITalk-79 system (Klatt, 1987). Text-to-speech conversion systems for several other languages have also been investigated (Hirose et al., 1986). In a Japanese text-to-speech conversion system (Sato, 1984b), input text, which is written in a combination of Chinese characters (kanji) and the Japanese kana syllabary, is analyzed by depth-first searching for the longest match using a 58,000-word dictionary and a word transition table.
The transition table provides candidates for the following word. Compound and phrase accent and sentence prosodic characteristics are next determined by reconstruction of phrases on the basis of local syntactic dependency analysis. A continuous speech signal is finally synthesized by concatenating CV speech units.
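The longest-match dictionary search described above can be sketched as a depth-first, backtracking segmentation (the tiny dictionary and romanized input are purely illustrative, and the word transition table that prunes candidates is omitted):

```python
def segment(text, dictionary):
    """Depth-first search for a segmentation of `text`, always trying the
    longest dictionary match first and backtracking on dead ends."""
    if not text:
        return []
    # candidate prefixes, longest first
    for length in range(min(len(text), 8), 0, -1):
        word = text[:length]
        if word in dictionary:
            rest = segment(text[length:], dictionary)
            if rest is not None:
                return [word] + rest
    return None   # no segmentation found

dictionary = {"sakura", "ga", "saku", "sakuraga"}  # hypothetical entries
print(segment("sakuragasaku", dictionary))         # ['sakuraga', 'saku']
```

Because Japanese text has no spaces, this word segmentation step is what makes the subsequent accent and dependency analysis possible; a real system would additionally use the word transition table to reject grammatically implausible continuations.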
7.7 CORPUS-BASED SPEECH SYNTHESIS
As described in Section 7.2, speech synthesis methods relying on a large number of short waveform units covering the preceding and succeeding phonetic context and pitch are now widely used. The waveform units are usually made by using a large speech database (corpus) and stored. The most appropriate units — those that have the closest phonetic context and pitch frequency to the desired speech and that yield the smallest concatenation distortion between adjacent units — are selected based on rules and evaluation measures and concatenated (Hirokawa et al., 1992). The units are either directly connected or interpolated at the boundary. If the number of units is large enough and the selection rule is appropriate, smooth synthesized speech can be obtained without applying interpolation. Instead of storing units of a unified length, such as phonemes, methods using variable-length units according to the amount of data and the kinds of speech to be synthesized have also been investigated (Sagisaka, 1988). The major factors determining synthesized speech quality in these methods consist of: 1) the speech database, 2) methods for extracting the basic units, 3) evaluation measures for selecting the most appropriate units, and 4) efficient methods for searching the basic units.

COC Method
The COC (Context-Oriented Clustering) speech synthesis method pioneered the use of hierarchical decision-tree clustering in unit selection for speech synthesis. The method was first proposed for Japanese (Nakajima and Hamada, 1988) and was later extended to English (Nakajima, 1993). In this approach, all the instances of a given phoneme in a single-speaker continuous-speech
database are clustered into equivalence classes according to their preceding and succeeding phoneme contexts. The decision trees which perform the clustering are constructed automatically so as to maximize the acoustic similarity within the equivalence classes. Figure 7.9 shows an example of decision-tree clustering for the phoneme /a/. This approach is similar to that used in modern speech recognition systems to generate hidden Markov models in different phonetic contexts (see Subsection 8.9.5). In the synthesis systems, parameters or segments are then extracted from the database to represent each leaf in the tree. During synthesis, the trees are used to obtain the unit sequence required to produce the desired sentence. A key feature of this method is that the tree construction automatically determines which context effects are most important in terms of their effect upon the acoustic properties of the speech, and thus enables the automatic identification of a leaf containing segments or parameters most suitable for synthesizing a given context, even when the required context is not seen in training. It was confirmed that, by concatenating the phoneme-context-dependent phoneme units, smooth speech can be synthesized. The COC method was extended to use a set of cross-word, decision-tree state-clustered, context-dependent hidden Markov models and to define a set of subphone units to be used in a concatenation synthesizer (Donovan and Woodland, 1999). During synthesis, the required utterance, specified as a string of words of known phonetic pronunciation, was generated as a sequence of these clustered states using a TD-PSOLA waveform concatenation synthesizer. A method of using HMM likelihood scores for selecting the most appropriate basic units has also been investigated (Huang et al., 1996).

CHATR
CHATR is a corpus-based method for producing speech by selecting appropriate speech segments according to a labeling which annotates prosodic as well as phonemic influences on the
speech waveform (Black and Campbell, 1995; Deng and Campbell, 1997). The labeling of speech variation in the natural data has enabled a generic approach to synthesis which easily adapts to new languages and to new speakers with little change to the basic algorithm. Figure 7.10 summarizes the data flow in CHATR. It shows that processing (illustrated in the form of pipes) occurs at two main stages: in the initial (off-line) database analysis and encoding stage, to provide index tables and prosodic knowledge bases, and in the subsequent (online) synthesis stage, for prosody prediction and unit selection. Waveform concatenation is currently the simplest part of CHATR, as the raw waveform segments to which the index points for the selected candidates are simply concatenated. Irrespective of recent progress in speech synthesis, many research issues still remain, including:
1) Improvement of naturalness, especially that of prosody, in synthesized speech;
2) Control of speaking style, such as reading or dialogue style, and speech quality; and
3) Improvement of the accuracy of text analysis.
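The unit selection described in Sec. 7.7 — balancing closeness to the desired context against concatenation distortion between adjacent units — is commonly cast as a dynamic-programming search. The following is a minimal sketch with illustrative cost functions (real systems use spectral and prosodic distances between stored waveform units, not scalar differences):

```python
def select_units(targets, candidates, target_cost, concat_cost):
    """Dynamic-programming (Viterbi) unit selection: for each target position
    choose one candidate unit so that the summed target costs plus the
    concatenation costs between adjacent chosen units is minimal."""
    n = len(targets)
    # best[i][j] = (accumulated cost, backpointer) for candidate j at position i
    best = [[(target_cost(targets[0], c), -1) for c in candidates[0]]]
    for i in range(1, n):
        row = []
        for c in candidates[i]:
            cost, back = min(
                (best[i - 1][j][0] + concat_cost(p, c), j)
                for j, p in enumerate(candidates[i - 1]))
            row.append((cost + target_cost(targets[i], c), back))
        best.append(row)
    # trace back the cheapest path
    j = min(range(len(best[-1])), key=lambda k: best[-1][k][0])
    path = []
    for i in range(n - 1, -1, -1):
        path.append(candidates[i][j])
        j = best[i][j][1]
    return path[::-1]

# toy example: units are numbers; the target cost is the distance to the
# target, and the concatenation cost penalizes jumps between adjacent units
targets = [1.0, 2.0, 3.0]
candidates = [[0.9, 5.0], [2.2, 1.1], [2.9, 9.0]]
path = select_units(targets, candidates,
                    target_cost=lambda t, c: abs(t - c),
                    concat_cost=lambda a, b: 0.1 * abs(a - b))
```

The joint optimization matters: a unit that is individually closest to its target can lose to one that joins more smoothly with its neighbors, which is exactly the tradeoff the evaluation measures in corpus-based synthesis are designed to capture.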
Speech Recognition
8.1 PRINCIPLES OF SPEECH RECOGNITION

8.1.1 Advantages of Speech Recognition
Speech recognition is the process of automatically extracting and determining the linguistic information conveyed by a speech wave using computers or electronic circuits. Linguistic information, the most important information in a speech wave, is also called phonetic information. In the broadest sense of the word, speech recognition includes speaker recognition, which involves extracting individual information indicating who is speaking. The term 'speech recognition' will be used from here on, however, to mean the recognition of linguistic information only. Automatic speech recognition methods have been investigated for many years, aimed principally at realizing transcription and human-computer interaction systems. The first technical paper on speech recognition was published in 1952; it described Bell Labs' spoken digit recognizer Audrey (Davis et al., 1952). Research on speech recognition has since intensified, and speech recognizers for communicating with machines through speech have recently been constructed, although they remain of limited use.
Conversation with machines can be actualized by the combination of a speech recognizer and a speech synthesizer. This combination is expected to be particularly efficient and effective for human-computer interaction, since errors can be confirmed by hearing and then corrected promptly. Interest is growing in viewing speech not just as a means for accessing information, but also in itself as a source of information. Important attributes that would make speech more useful in this respect include random access, sorting (e.g., by speaker, by topic, by urgency), scanning, and editing. Similar to speech synthesis, speech recognition features four specific advantages:

1) Speech input is easy to perform because it does not require a specialized skill, as does typing or pushbutton operation;
2) Speech can be used to input information three to four times faster than typewriters and eight to ten times faster than handwriting;
3) Information can be input even when the user is moving or doing other activities involving the hands, legs, eyes, or ears; and
4) Since a microphone or telephone can be used as an input terminal, inputting information is economical, with remote inputting capable of being accomplished over existing telephone networks and the Internet.

Regardless of these positive points, however, speech recognition also has the same disadvantages as speech synthesis. For instance, the input or conversation is not printed, and noise canceling or adaptation is necessary when used in a noisy environment. In typical speech recognition systems, the input speech is compared with stored units (models or reference templates) of phonemes or words, and the most likely (similar) sequence of units is selected as a candidate sequence of phonemes or words of the input speech. Since speech waveforms are too complicated to compare,
and since phase components, which vary according to transmission and recording systems, have little effect on human speech perception, the phase components are desirably removed from the speech wave. Thus, short-time spectral density is usually extracted at short intervals and used for comparison with the units.

8.1.2 Difficulties in Speech Recognition
The difficulties in speech recognition can be summarized as follows.
1) Coarticulation and reduction problems
The spectrum of a phoneme in a word or sentence is influenced by neighboring phonemes as a consequence of coarticulation. Such a spectrum is very different from those of isolated phonemes or syllables, since the articulatory organs do not move as much in continuous speech as in isolated utterances. Although this problem can be avoided in the case of isolated word recognition by using words as units, how best to contend with it is very important in continuous-speech recognition. With continuous speech, the difficulty is compounded by elision, where the speaker runs words together and 'swallows' most of the syllables.
2) Difficulties in segmentation
Spectra continuously change from phoneme to phoneme due to their mutual interaction. Since the spectral sequence of speech can essentially be compared to a string of handwritten letters, it is very difficult to precisely determine the phoneme boundaries that segment the time function of spectral envelopes. Although unvoiced consonants can be segmented relatively easily, based on the amount of spectral variation and the onset and offset of periodicity, attempting to segment a succession of voiced sounds is particularly burdensome. Furthermore, it is almost impossible to segment a sentence of speech into words based merely on acoustic features.
3) Individuality and other variation problems
Acoustic features vary from speaker to speaker, even when the same words are uttered, according to differences in manner of speaking and articulatory organs. To complicate matters, different phonemes spoken by different speakers often have the same spectrum. Transmission systems and noise also affect the physical characteristics of speech.
4) Insufficient linguistic knowledge
The physical features of speech do not always convey enough phonetic information in and of themselves. Sentence speech is usually uttered with an unconscious use of linguistic knowledge, such as syntactic and semantic constraints, and is perceived in a similar way. The listener can usually predict the next word according to several linguistic constraints, and incomplete phonetic information is compensated for by such linguistic knowledge. However, what we know about the linguistic structure of spoken utterances is much more limited than what we know about written language, and it is very difficult to model the mechanism of using linguistic constraints in human speech perception.

8.1.3 Classification of Speech Recognition
Speech recognition can be classified into isolated word recognition, in which words uttered in isolation are recognized, and continuous-speech recognition, in which continuously uttered sentences are recognized. Continuous-speech recognition can be further classified into transcription and understanding. The former aims at recognizing each word correctly. The latter, also called conversational speech recognition, focuses on understanding the meaning of sentences rather than recognizing each word. In continuous-speech recognition, it is very important to use sophisticated linguistic knowledge. Applying rules of grammar, which govern the sequence of words in a sentence, is but one example of this. Speech recognition can also be classified, from a different point of view, into speaker-independent recognition and speaker-dependent
recognition. The former type of system can recognize speech uttered by any speaker, whereas, in the latter case, reference templates/models must be modified every time the speaker changes. Although speaker-independent recognition is much more difficult than speaker-dependent recognition, it is of particular importance to develop speaker-independent recognition methods in order to broaden the range of possible uses. Various units of reference templates/models, from phonemes to words, have been studied. When words are used as units, the digitized input signal is compared with each of the system's stored units, i.e., statistical models or sequences of values corresponding to the spectral pattern of a word, until one is found that matches. Conversely, phoneme-based algorithms analyze the input into a string of sounds that they convert to words through a pronunciation-based dictionary. When words are used as units, word recognition can be expected to be highly accurate, since the coarticulation problem within words can be avoided. A larger vocabulary requires a larger memory and more computation, however, making training troublesome. Additionally, word units cannot solve the coarticulation problem arising between words in continuous-speech recognition. Using phonemes as units, on the other hand, does not greatly increase memory size requirements or the amount of computation as a function of vocabulary size. Furthermore, training can be performed efficiently. Moreover, coarticulation within and between words can be adequately taken into consideration. Since coarticulation rules have not yet been established, however, context-dependent multiple-phoneme units are necessary. The most appropriate units for enabling recognition success depend on the type of recognition, that is, on whether it is isolated word recognition or continuous-speech recognition, and on the size of the vocabulary.
Along these lines, medium-size units between words and phonemes, such as CV syllables, VCV syllables, diphones, dyads, and demisyllables, have also been explored in order to overcome the disadvantages of using either words or phonemes.
With these subword (smaller-than-word) units, it is desirable to select more than one candidate in the unit recognition stage to form lattices, and to transfer these candidates with their similarity values to the next stage of the recognition system. This method helps minimize the occurrence of serious errors at higher stages due to matching errors with these units and segmentation errors involved in the lower stages. In most of the current advanced continuous-speech recognition systems, the recognition process is performed top-down, that is, driven by linguistic knowledge, and the system predicts sentence hypotheses, each of which is represented as a sequence of words. Each sequence is then converted into a sequence of phoneme models, and the likelihood (probability) of producing the spectral sequence of input speech given the phoneme sequence is calculated. Thus, the matching and segmentation errors of phonemes are avoided (see Subsection 8.9.5).
8.2 SPEECH PERIOD DETECTION

Detection of the speech period is the first stage of speech recognition. This is a particularly important stage, because it is difficult to detect the speech period correctly in noisy surroundings and because a detection error usually results in a serious recognition error. Consonants at the beginning or end of a speech period and low-energy vowels are especially difficult to detect. Additional noise, such as breath noise at the end of a speech period, must also be ignored. A speech period is usually detected by the fact that the short-time averaged energy level exceeds a threshold for longer than a predetermined period. The beginning point of a speech period is often determined as a position which is a certain period prior to the position detected by the energy threshold. The energy level is often compared with two kinds of thresholds to make a reliable detection decision. In addition to the energy level, the zero-crossing number or the spectral difference between the
input signal and a reference noise spectrum is often used for speech period detection. Along with stationary noise, which can be distinguished from the speech period using the above-mentioned methods, nonspeech sounds, such as coughing, the sound of turning pages, and even sounds uttered subconsciously when thinking or suddenly adjusting a sentence in midspeech, should be distinguishable from the actual speech. When the vocabulary is large and the system must work speaker-independently, it is very troublesome to distinguish between speech and nonspeech sounds. Because this distinction is itself considered to be a speech recognition process, it is almost impossible to develop a perfect algorithm for determining it. Research on word spotting, specifically, the automatic detection of predetermined words from arbitrary continuous sentence speech, is expected to open the door to solving this problem. Besides speech period detection, the voiced/unvoiced decision is also important. Although ascertaining the presence of vocal cord vibration, that is, the existence of a periodic wave, is most reliable, this method requires a large amount of computation. Therefore, the energy ratio of high- to low-frequency ranges, such as the range higher than 3 kHz and that lower than 1 kHz, and similar measures are often used. When these methods are employed, it is necessary to normalize the effects of individuality and transmission characteristics to arrive at a reliable decision. Along these lines, a pattern recognition approach combining various parameters, such as autocorrelation coefficients, has also been attempted, as previously mentioned (see Sec. 4.7).
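As a concrete illustration of the energy-threshold scheme described above, the following sketch detects a speech period from short-time energy using two thresholds and a minimum-duration requirement. The function name, frame length, and threshold values are illustrative assumptions, not from the text; a real system would add a hangover period, a backward extension of the beginning point, zero-crossing checks, and noise adaptation.

```python
import numpy as np

def detect_speech_period(x, frame_len=200, low_thresh=0.01,
                         high_thresh=0.05, min_frames=3):
    """Two-threshold energy endpoint detection (illustrative sketch).

    A frame is a speech candidate when its short-time energy exceeds
    low_thresh; a candidate run is accepted only if it lasts at least
    min_frames and contains at least one frame above high_thresh."""
    n_frames = len(x) // frame_len
    energy = np.array([np.mean(x[i * frame_len:(i + 1) * frame_len] ** 2)
                       for i in range(n_frames)])
    active = energy > low_thresh
    runs, start = [], None
    for i, a in enumerate(active):
        if a and start is None:
            start = i
        elif not a and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, n_frames))
    # Keep only runs that are long enough and clearly exceed the high threshold.
    runs = [(b, e) for (b, e) in runs
            if e - b >= min_frames and energy[b:e].max() > high_thresh]
    if not runs:
        return None
    b, e = runs[0]
    return b * frame_len, e * frame_len

# A synthetic signal: silence, then a tone burst, then silence again.
x = np.zeros(8000)
x[2000:6000] = 0.5 * np.sin(2 * np.pi * np.arange(4000) / 16.0)
assert detect_speech_period(x) == (2000, 6000)
```

The two thresholds implement the "two kinds of thresholds" decision mentioned in the text: the low one delimits the candidate period, while the high one confirms that genuine speech energy is present inside it.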
8.3 SPECTRAL DISTANCE MEASURES

8.3.1 Distance Measures Used in Speech Recognition
As previously described, in almost all speech recognition systems, short-time spectral distances or similarities between input speech and stored units (models or reference templates) are calculated as
the basis for the recognition decision. Spectral analysis is usually performed with one of five methods (see Sec. 4.2):

Using band-pass filter outputs for 10 to 30 channels,
Calculating the spectrum directly from the speech wave using FFT,
Employing cepstral coefficients,
Utilizing an autocorrelation function, and
Deriving a spectral envelope from LPC analysis (maximum likelihood estimation).

Various distance (similarity) measures can be defined based on multivariate vectors representing short-time spectra obtained through these spectral analysis techniques. The distance measure d(x, y) between two vectors x and y should desirably satisfy the following conditions for effective use in speech recognition:
(a) Symmetry:

d(x, y) = d(y, x)   (8.1)

(b) Positive definiteness:

d(x, y) > 0,  x ≠ y
d(x, y) = 0,  x = y   (8.2)
If d(x, y) is a distance in the mathematical sense of the word, it should also satisfy the triangle inequality. This condition is not necessary in speech recognition, however, and it is more important to formulate algorithms for calculating d(x, y) efficiently. Although the simple Euclidean distance is used in many cases for d(x, y), several modifications have also been attempted. Among these are weighted distances based on auditory sensitivity and distances in reduced multidimensional spaces obtained through statistical analyses such as discriminant analysis or principal component analysis. Formant frequencies, which are important features for representing speech characteristics, have rarely been used in the most recent spectral-distance-based speech recognition because they are very difficult to extract automatically.
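To make conditions (a) and (b) concrete, the sketch below implements a weighted Euclidean distance between short-time spectral feature vectors and checks the symmetry and positive-definiteness requirements numerically. The function name and the uniform default weights are illustrative assumptions; in practice the weights might come from auditory sensitivity or from the statistical analyses mentioned above.

```python
import numpy as np

def weighted_euclidean(x, y, w=None):
    """Weighted Euclidean distance between two feature vectors x and y.
    w holds nonnegative per-dimension weights (default: unweighted)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = np.ones_like(x) if w is None else np.asarray(w, float)
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# Numerical check of the requirements of Eqs. (8.1) and (8.2):
rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
assert weighted_euclidean(x, y) == weighted_euclidean(y, x)  # symmetry
assert weighted_euclidean(x, y) > 0.0                        # d > 0 for x != y
assert weighted_euclidean(x, x) == 0.0                       # d = 0 for x = y
```

Note that the triangle inequality is neither checked nor needed here, in line with the discussion above.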
8.3.2 Distances Based on Nonparametric Spectral Analysis
The following methods have been specifically investigated for obtaining spectral distances based on general spectral analysis techniques which do not incorporate modeling of speech production mechanisms.
1) Band-pass filter bank method
Band-pass filter banks have been used for many years and are still being employed because of the ease with which hardware for real-time analysis purposes can be realized. Center frequencies of band-pass filters are usually set with equal spacing along the logarithmic frequency scale. Differences of logarithmic output for each band-pass filter between the reference and input speech are averaged (summed) over all frequency ranges, or averaged for their squared values, to produce the overall distance.
2) FFT method
Although it is possible to directly calculate the distance between spectra obtained by FFT, spectral patterns smoothed by cepstral coefficients or by window functions in the autocorrelation domain are usually used. This is because the spectral fine structure varies according to pitch, voice individuality, and many other factors. The spectral values obtained at equal intervals on a linear frequency axis are usually resampled with equal spacing on a logarithmic frequency scale, taking the auditory characteristics into consideration. Equal-space resampling on a Bark-scale or a Mel-scale frequency axis has also been introduced in an effort to simulate the auditory characteristics more precisely. The Bark scale, which is based on the auditory critical bandwidth, corresponds to the frequency scale on the basilar membrane in the peripheral auditory system. This scale is defined as

B = 13 arctan(0.76f) + 3.5 arctan[(f/7.5)²]   (8.3)

where B and f represent the Bark scale and frequency in kilohertz, respectively.
The Mel scale corresponds to the auditory sensation of tone height. The relationship between frequency f in kilohertz and the Mel scale is usually approximated by the equation

Mel = 1000 log₂(1 + f)   (8.4)

The Bark and Mel scales are nearly proportional to the logarithmic frequency scale in the frequency range above 1 kHz.
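Equations (8.3) and (8.4) can be coded directly. The sketch below (function names are ours; frequencies are in kilohertz as in the text) converts a frequency to the Bark and Mel scales:

```python
import math

def hz_to_bark(f_khz):
    """Bark scale of Eq. (8.3); f_khz is frequency in kilohertz."""
    return 13.0 * math.atan(0.76 * f_khz) + 3.5 * math.atan((f_khz / 7.5) ** 2)

def hz_to_mel(f_khz):
    """Mel scale of Eq. (8.4); f_khz is frequency in kilohertz."""
    return 1000.0 * math.log2(1.0 + f_khz)

# At 1 kHz the Mel value is exactly 1000 by construction of Eq. (8.4).
assert hz_to_mel(1.0) == 1000.0
```

Both functions grow roughly logarithmically above 1 kHz, matching the proportionality noted in the text.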
3) Cepstrum method
It is clear from the definition of cepstral coefficients that the Euclidean distance between vectors consisting of lower-order cepstral coefficients corresponds to the distance between smoothed logarithmic spectra. Mel-frequency cepstral coefficients (MFCCs), transformed from the logarithmic spectrum resampled at Mel-scale frequencies as shown in Fig. 8.1, have also been used for this distance (Young, 1996). Δ and Δ² are transitional cepstral coefficients, which are described in Subsection 8.3.6.
4) Autocorrelation function method
The distance between vectors consisting of the autocorrelation function multiplied by the lag window corresponds to the distance between smoothed spectra.

8.3.3 Distances Based on LPC
Since LPC analysis has proven itself to be an excellent speech analysis method, as mentioned in Chap. 5, it is also being widely used in speech recognition. Notations for various LPC analysis-related parameters are indicated in Table 8.1, where f(λ) and g(λ) represent the spectral envelopes based on the LPC model for a reference template and for input speech, respectively. These are given as

f(λ) = (σ_f²/2π) · 1/|Σ_{i=0}^{p} a_i e^{−jiλ}|²,  a₀ = 1   (8.5)

and

g(λ) = (σ_g²/2π) · 1/|Σ_{i=0}^{p} b_i e^{−jiλ}|²,  b₀ = 1

TABLE 8.1 Notations for LPC Analysis-Related Parameters

Parameter                       Reference template    Input speech
Spectral envelope               f(λ)                  g(λ)
Energy                          σ_f²                  σ_g²
Autocorrelation coeff.          r_f(j)                r_g(j)
Predictor coeff.                a_i                   b_i
Maximum likelihood parameter    R_f(j)                R_g(j)
Normalized residual             U_f                   U_g
Cepstral coeff.                 c_n^(f)               c_n^(g)

(i = 1, ..., p; j = −p, ..., p; p = order of the LPC model; n = −n₀, ..., n₀)
The following various distance measures using LPC analysis-related parameters have been proposed for determining the distance between f(λ) and g(λ).
1. Maximum likelihood spectral distance (Itakura-Saito distance)
The maximum likelihood spectral distance was introduced as an evaluation function for spectral envelope estimation from the short-time spectral density using the maximum likelihood method. This distance is represented by the equation (see Sec. 5.3.2)

E = (1/2π) ∫_{−π}^{π} [g(λ)/f(λ) + log f(λ) − log g(λ) − 1] dλ   (8.6)

This distance is also called the Itakura-Saito distance (distortion). As described in Sec. 5.3.2, by defining d(λ) = log f(λ) − log g(λ) for examining the relationship between this distance and the logarithmic spectral distance, we obtain the equation

E = (1/2π) ∫_{−π}^{π} [e^{−d(λ)} + d(λ) − 1] dλ   (8.7)

When the integrand of this equation is expanded in a Taylor series for d(λ) at the region around 0,

E ≈ (1/4π) ∫_{−π}^{π} d(λ)² dλ   (8.8)

is derived. This means that when |d(λ)| is small, the distance E is close to the squared logarithmic spectral distance. Equation (8.7) indicates that the integrand of this distance is in proportion to d(λ) when d(λ) ≫ 0 and in proportion to e^{−d(λ)} when d(λ) ≪ 0.
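The behavior of E is easy to verify numerically. The sketch below (our own helper, approximating the integral by the mean over a discrete frequency grid) computes the Itakura-Saito distance between two sampled spectral envelopes, confirms that for small d(λ) it approaches half the mean squared log spectral difference, and shows that the measure is asymmetric in f and g.

```python
import numpy as np

def itakura_saito(f, g):
    """Eq. (8.6) evaluated on a discrete frequency grid:
    mean of g/f + log f - log g - 1 over the sampled envelopes."""
    f, g = np.asarray(f, float), np.asarray(g, float)
    return float(np.mean(g / f + np.log(f / g) - 1.0))

# Two nearby spectral envelopes: d(lambda) = log f - log g is small.
lam = np.linspace(-np.pi, np.pi, 512, endpoint=False)
f = 1.0 + 0.5 * np.cos(lam) ** 2
d = 0.01 * np.cos(2 * lam)
g = f * np.exp(-d)                      # so that log f - log g = d

E = itakura_saito(f, g)
assert abs(E - 0.5 * np.mean(d ** 2)) < 1e-6   # close to (1/2) mean d^2
assert itakura_saito(f, 2 * f) != itakura_saito(2 * f, f)  # asymmetry
```

The asymmetry demonstrated in the last line is exactly what the cosh measure, described below, was designed to remove.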
2. Log likelihood ratio distance
The log likelihood ratio distance is defined as the logarithm of the ratio of output residual energy values for input speech passed through two kinds of inverse filters. The transmission functions of these filters respectively correspond to the inverse characteristics of the spectral envelopes of the reference template and of the input speech itself. The residual energy passed through the latter inverse filter is known as the normalized residual energy or the minimum residual energy. The distance is represented by

(8.9)

This equation is also obtained by minimizing the maximum likelihood spectral distance E as a function of the gain term σ_f² and by removing the constant.
3. Prediction residual
The prediction residual is obtained from the log likelihood ratio distance by removing the term related only to the input speech. This is represented as
(8.10)
4. Cosh measure
The cosh measure was devised in order to remove the asymmetry associated with the weighting for the spectral difference in the maximum likelihood spectral distance E (Gray, Jr. and Markel, 1976). This measure, indicated in Eq. (8.11), is obtained by summing Eq. (8.6) and its modification in which f(λ) and g(λ) are interchanged:

D = (1/2π) ∫_{−π}^{π} 2{cosh(log f(λ) − log g(λ)) − 1} dλ   (8.11)

where, by definition, cosh(x) = (eˣ + e⁻ˣ)/2, and d(λ) = log f(λ) − log g(λ). Using d(λ), D is represented by

D = (1/2π) ∫_{−π}^{π} (e^{d(λ)} + e^{−d(λ)} − 2) dλ   (8.12)

When the integrand of this equation is expanded in a Taylor series for d(λ) at the region around 0,

D ≈ (1/2π) ∫_{−π}^{π} d(λ)² dλ   (8.13)

is derived. This equation indicates that the distance D is very close to a squared logarithmic spectral distance when |d(λ)| is small, and that its integrand is proportional to the exponential function when |d(λ)| ≫ 0.
5. LPC cepstral distance
The LPC cepstral distance is the distance between spectral envelopes represented by the LPC cepstral coefficients. It can be expressed as

L² = Σ_{n=−∞}^{∞} (c_n^(f) − c_n^(g))²   (8.14)

When this distance is actually used, the summation is truncated at n = n₀, such that it corresponds to the distance between spectral envelopes smoothed by the lower-order cepstral coefficients. As for the relationship between the truncation order n₀ and the LPC analysis order p, n₀ ≥ p is necessary. If n₀ < p, it is probable that the distance value becomes zero even between different spectra and that the positive definite characteristic of the distance measure cannot be maintained. The LPC cepstral distance is a useful distance measure for three major reasons. First, it can be easily calculated from linear predictor coefficients, as described in Sec. 4.3.2. Second, it directly corresponds to the logarithmic distance between LPC spectral envelopes. Third, it satisfies the requirements for symmetry and the positive definite characteristic. The weightings for d(λ) in the distance measures E, D, and L² are compared in Fig. 8.2.

8.3.4 Peak-Weighted Distances Based on LPC Analysis
The peak-weighted distance measures based on LPC analysis techniques are produced by modifying the various LPC-based distance measures, thereby emphasizing the spectral differences at the peaks (Sugiyama and Shikano, 1981). That is, these distance measures are sensitive to discrepancies in spectral peaks such as formants, where important information for speech recognition exists. This modification is accomplished by multiplying the integrand U(λ) of the original distance measure by a weighting function w(λ) emphasizing the spectral peaks before integration over all frequency ranges, as

d = (1/2π) ∫_{−π}^{π} U(λ) w(λ) dλ   (8.15)

Experimental evaluation of various combinations of U and w, in terms of ease of computation, amount of weighting, and accuracy of recognizing phonemes in continuous speech, revealed that using the WLR (weighted likelihood ratio) is better than any other measure (Sugiyama and Shikano, 1981). The WLR is calculated by
FIG. 8.2 Weighting factors for logarithmic spectral distance d(λ) in the maximum likelihood spectral distance E, the cosh measure D, and the cepstral distance L².
d_WLR = Σ_{n=1}^{n₀} (c_n^(f) − c_n^(g)) (r_f(n) − r_g(n))   (8.16)
In this measure, the integrand of the maximum likelihood spectral distance E is used as U(λ), f(λ)/σ_f² and g(λ)/σ_g² are used as w(λ), and the equation is modified so that the LPC parameters can be used directly. Weighting around the spectral peaks necessitates that the spectral tilt in f(λ) and g(λ) be removed beforehand. Summation in Eq. (8.16) is truncated at the appropriate order n₀. Concerning the relationship between n₀ and the LPC analysis order p, n₀ ≥ p must be satisfied. The LPC correlation coefficients obtained using recursive equations for linear predictor coefficients based on Eq. (5.63) in Sec. 5.7.6 are used for orders larger than p. Although both correlation coefficients and LPC cepstral coefficients are necessary for calculating the WLR, the total amount of computation is almost the same as for various conventional distance measures based on LPC analysis. Weighting functions along the frequency axis can also be included in w(λ) of Eq. (8.15). Recognition experiments for vowels in continuous speech confirmed that second-order filters with a peak around 1 kHz are effective in improving accuracy (Sugiyama and Shikano, 1982).

8.3.5 Weighted Cepstral Distance
A weighted cepstral distance measure was proposed and tested in a speaker-independent isolated word recognition system using word-based reference templates and a standard dynamic time warping (DTW) technique, described in Sec. 8.5.1 (Tohkura, 1986). The measure is a statistically weighted distance measure with weights equal to the inverse variances of the cepstral coefficients, such that

d = Σ_{i=1}^{p} w_i (c_i^(f) − c_i^(g))²   (8.17)
where w_i is the inverse variance of the i-th cepstral coefficient. Figure 8.3 presents experimentally observed cepstral coefficient variances and inverse variances.
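A sketch of this statistical weighting follows; the function names and the toy numbers are ours. The weights are estimated as inverse variances of each cepstral coefficient over training frames and then applied in a weighted squared distance, so that coefficients with larger spread count for less.

```python
import numpy as np

def inverse_variance_weights(training_cepstra):
    """w_i = 1 / var(c_i), estimated over a (frames x coeffs) array."""
    return 1.0 / np.var(np.asarray(training_cepstra, float), axis=0)

def weighted_cepstral_distance(c_ref, c_in, w):
    """Weighted squared cepstral distance with per-coefficient weights w."""
    c_ref, c_in = np.asarray(c_ref, float), np.asarray(c_in, float)
    return float(np.sum(w * (c_ref - c_in) ** 2))

# Toy training data: coefficient 1 varies more than coefficient 2,
# so it receives the smaller weight.
training = [[1.0, 0.5], [2.0, 1.5], [3.0, 1.0]]
w = inverse_variance_weights(training)          # approximately [1.5, 6.0]
assert abs(weighted_cepstral_distance([0.0, 0.0], [1.0, 1.0], w) - 7.5) < 1e-9
```

This mirrors the deweighting effect described below: rather than boosting higher-order coefficients, the inverse-variance weights mainly suppress the highly variable lower-order ones.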
FIG. 8.3 Cepstral coefficient variances and inverse variances used as weighting in a weighted cepstral distance measure (shown for male utterances, female utterances, and the total, versus cepstral coefficient index). The weighting for the quefrency-weighted cepstral distance measure is also indicated.
Experimental results indicate that the weighted cepstral distance measure works substantially better than both the Euclidean cepstral distance and log likelihood ratio distance measures across two different databases, namely a 10-digit database and a 129-word airline vocabulary. The most significant performance characteristic of the weighted cepstral distance is that it tends to equalize the performance of the recognizer across different talkers. Improvement due to weighting can be attributed to the fact that it deweights the lower-order cepstral coefficients rather than weighting the higher-order cepstral coefficients. The results also demonstrate that when the number of cepstral coefficients is larger than 8, it is necessary to use some of the band-pass lifters
(see Sec. 4.3.1) to reduce the weighting for higher-order cepstral coefficients. The quefrency-weighted cepstral distance measure, which is another form of the weighted distance measure, has also been proposed (Paliwal, 1982). In this measure, w_n = n², namely, each cepstral coefficient is multiplied by its respective quefrency. Figure 8.3 also shows the weighting factor for this measure. Clearly, we find some similarity between n² and the inverse variance. The quefrency-weighted cepstral distance measure works well, and the error rate using this measure is only slightly larger than that obtained by using the inverse variance-weighted cepstral distance measure. The quefrency-weighted cepstral distance is equal to the weighted slope metric (Klatt, 1982), as follows:

Σ_{n=1}^{∞} n² (c_n^(f) − c_n^(g))² = (1/2π) ∫_{−π}^{π} [(d/dλ)(log f(λ) − log g(λ))]² dλ   (8.18)
Summation in this equation is also truncated at the appropriate order, and some of the band-pass lifters are applied to the higher-order cepstral coefficients.

8.3.6 Transitional Cepstral Distance
Dynamic spectral features (spectral transitions) as well as instantaneous spectral features are believed to play an important role in human speech perception (Furui, 1986b). Based on this knowledge, a transitional cepstral distance measure was proposed (Furui, 1981, 1986a). Initially, spoken utterances are represented by time sequences of cepstral coefficients and logarithmic energy. Regression coefficients (or lower-order polynomial expansion coefficients) for these time functions are extracted for every frame t over an approximately 50-ms period ((t − K)-th frame to (t + K)-th frame). The regression coefficient for each cepstral coefficient, called 'delta-cepstrum,' which gives a reliable estimation of the time derivative of the cepstrum time series (more specifically, the spectral slope in time), is represented as

Δc_n(t) = Σ_{k=−K}^{K} k h_k c_n(t + k) / Σ_{k=−K}^{K} k² h_k   (8.19)

Here, h_k is a window (usually symmetric) of length 2K + 1 and is sometimes set to a unit value for simplicity. A weighted Euclidean distance between two given transitional spectra is defined as

d_ΔCEP = Σ_{n=1}^{p} w_n (Δc_n^(t) − Δc_n^(g))²   (8.20)

where the weighting coefficient w_n is inversely proportional to the pooled variance of Δc_n; w_n is sometimes also set to a unit value for simplicity. The transitional logarithmic energy, Δe, and its distance are defined in the same way. The transitional and instantaneous distances are usually linearly combined as

d_CEP+ΔCEP+ΔENERGY = w₁ Σ_{n=1}^{p} (c_n^(t) − c_n^(g))² + w₂ Σ_{n=1}^{p} (Δc_n^(t) − Δc_n^(g))² + w₃ (Δe^(t) − Δe^(g))²   (8.21)

where w₁, w₂, and w₃ are weighting coefficients. The second-order derivative of the cepstrum time series, called 'delta-delta-cepstrum,' which can be easily calculated from a time series of delta-cepstrum, has also been combined. The effectiveness of the transitional distance measure was confirmed by speaker-independent isolated word recognition (Furui, 1986a) and speaker verification (Furui, 1981). The error rate for recognizing 100 Japanese city names was reduced from 6.2 to 2.4% by using the transitional cepstrum and energy in addition to the instantaneous cepstrum. This measure is advantageous in that its performance capability is resistant to transmission channel variations.

8.3.7 Prosody
Prosody can be defined as information in speech that is not localized to a specific sound segment, or information that does not change the identity of speech segments (Childers et al., 1998). Such information includes pitch, duration, energy, stress, and other suprasegmental attributes. The segmentation (or grouping) function of prosody may be related more to syntax (with some relation to semantics), while the saliency or prominence function may play a larger role in semantics and pragmatics than in syntax. Making maximum use of the potential of prosody will likely require a well-integrated system, since prosody is related to linguistic units not just at and below the word level, but also to abstract units in syntax, semantics, discourse, and pragmatics. Present speech recognition systems make quite limited (or no) use of prosody, mainly because of the difficulty of automatically extracting and modeling it.
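As one concrete example of a prosodic feature, the fundamental frequency (pitch) contour can be roughly estimated frame by frame with a short-time autocorrelation method. The sketch below is a deliberately crude illustration (the function name and search range are our assumptions), far simpler than the robust extraction the text notes is still difficult.

```python
import numpy as np

def estimate_f0(frame, fs, f0_min=60.0, f0_max=400.0):
    """Crude autocorrelation F0 estimate for one voiced frame.
    Picks the lag with the largest autocorrelation inside the
    plausible pitch-period range and converts it to hertz."""
    frame = np.asarray(frame, float) - np.mean(frame)
    ac = np.correlate(frame, frame, mode='full')[len(frame) - 1:]
    lag_lo, lag_hi = int(fs / f0_max), int(fs / f0_min)
    lag = lag_lo + int(np.argmax(ac[lag_lo:lag_hi]))
    return fs / lag

# A 100-Hz sinusoid sampled at 8 kHz should come out as roughly 100 Hz.
fs = 8000
frame = np.sin(2 * np.pi * 100.0 * np.arange(800) / fs)
assert abs(estimate_f0(frame, fs) - 100.0) < 5.0
```

A sequence of such per-frame estimates, together with frame energies and durations, forms the kind of suprasegmental contour a prosody-aware recognizer would model.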
8.4 STRUCTURE OF WORD RECOGNITION SYSTEMS
The structures of isolated word recognition systems can be classified into two types, as shown in Fig. 8.4: systems using words as units (models or templates) (a), and systems using subword units, that is, units smaller than words, such as phonemes or syllables, together with a word dictionary (b). The word dictionary represents each word by a concatenation of the subword units. With a word-unit structure, input speech is compared with each word model or reference template, and the word unit with the smallest distance from the input speech is selected. With a subword-unit structure, on the other hand, short periods of input speech are
compared with the subword units to calculate the distances. The distances and word dictionary are then combined to make the decision. With structure (b), therefore, the amount of distance calculation does not depend on the size of the vocabulary, and the memory size for storing the subword units and word dictionary and the amount of computation increase less than with structure (a) as the vocabulary increases. Structure (b) additionally features two other advantages. One is that the vocabulary can be easily increased or changed by rewriting the word dictionary. The other is that several types of pronunciation variation, such as vowel devocalization, can be manually added to the word dictionary based on the spelling of each word. Representing each word by a concatenation of subword units corresponds to a very rough quantization of spectral space, and it produces a large information loss. The system should thus incorporate a structure in which recognition errors in some subword units are prevented from causing serious word recognition errors, as described in Subsection 8.1.3. Generally, structure (a) is better suited to a smaller vocabulary and (b) to a larger vocabulary. The reference templates or models are created in a training phase using one or more speech segments corresponding to speech sounds of the same class. The resulting unit can be an exemplar or template, derived from some type of averaging technique, or it can be a model that characterizes the statistics of the features of the unit. To effectively reduce the memory size and the amount of computation necessary with structure (a), nonuniform sampling has been attempted, in which the spectral transition is precisely sampled and the stationary part is roughly sampled.
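The subword-unit structure (b) can be sketched in miniature. Everything below is a toy assumption of ours: a two-word dictionary of phoneme strings and a table of per-segment distances to each unit. A real system would replace the naive one-unit-per-segment alignment with DTW or statistical-model matching and would keep multiple candidates per segment, as described in Subsection 8.1.3.

```python
# Toy word dictionary: each word is a concatenation of phoneme units.
DICTIONARY = {"sun": ["s", "ah", "n"], "see": ["s", "iy"]}

def score_word(segment_distances, pronunciation):
    """Sum unit distances along a word's pronunciation, assuming one
    detected segment per phoneme; None if the lengths do not match."""
    if len(segment_distances) != len(pronunciation):
        return None
    return sum(d[u] for d, u in zip(segment_distances, pronunciation))

def recognize(segment_distances):
    """Return the dictionary word with the smallest summed distance."""
    scores = {w: score_word(segment_distances, p)
              for w, p in DICTIONARY.items()}
    valid = {w: s for w, s in scores.items() if s is not None}
    return min(valid, key=valid.get)

# Three segments whose distances favor /s/, /ah/, /n/ in turn.
segments = [{"s": 0.1, "ah": 0.9, "iy": 0.8, "n": 0.7},
            {"s": 0.9, "ah": 0.2, "iy": 0.6, "n": 0.8},
            {"s": 0.8, "ah": 0.7, "iy": 0.9, "n": 0.1}]
assert recognize(segments) == "sun"
```

Note how changing the vocabulary only requires editing DICTIONARY, while the unit distances are computed once per segment regardless of vocabulary size; these are exactly the two advantages of structure (b) described above.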
8.5 DYNAMIC TIME WARPING (DTW)

8.5.1 DP Matching
Even if the same speaker utters the same word, the duration changes every time, with nonlinear expansion and contraction.
Therefore, with both structures (a) and (b) outlined in Sec. 8.4, DTW is essential at the word recognition stage. The DTW process nonlinearly expands or contracts the time axis to match the same phoneme positions between the input speech and reference templates. This process can be efficiently accomplished by using the dynamic programming (DP) technique (Bellman, 1957), which will be described later. The DP technique was first applied to the DTW of speech by Slutsker (1968), Vintsyuk (1968), and Velichko and Zagoruyko (1970) of the USSR. A parallel investigation of this technique was conducted independently by Sakoe and Chiba (1971) of Japan. Results of these studies were published at almost the same time. This technique has had a very large impact on speech recognition, actually becoming an essential and widely applicable technique. In exploring the DP technique, let us assume two time sequences of feature vectors which should be compared:

A = a₁, a₂, ..., a_I
B = b₁, b₂, ..., b_J   (8.22)
When we consider a plane spanned by A and B, as shown in Fig. 8.5, the time warping function indicating the correspondence between the time axes of the A and B sequences can be represented by a sequence of lattice points on the plane, c_k = (i_k, j_k), as

F = c₁, c₂, ..., c_K   (8.23)

When the spectral distance between two feature vectors a_i and b_j is represented by d(c_k) = d(i_k, j_k), the weighted sum of the distances from beginning to end of the sequences along F can be represented by
D(A, B) = Σ_{k=1}^{K} d(c_k) w_k / Σ_{k=1}^{K} w_k   (8.24)
FIG. 8.5 DTW between two time sequences, A and B (the band around the diagonal indicates the adjustment window).
The smaller this value is, the better the match between A and B. Here, w_k is a positive weighting function related to F. Let us minimize Eq. (8.24) with respect to F under the following conditions.
1. Monotony and continuity condition

0 ≤ i_k − i_{k−1} ≤ 1,  0 ≤ j_k − j_{k−1} ≤ 1   (8.25)
2. Boundary condition

i₁ = j₁ = 1,  i_K = I,  j_K = J   (8.26)
3. Adjustment window condition

|i_k − j_k| ≤ r,  r = constant   (8.27)
Condition 3 is applied to prevent extreme expansion and contraction. Defining w_k so that the denominator of Eq. (8.24) becomes constant, independent of F, simplifies the equation. For example, if w_k = (i_k − i_{k−1}) + (j_k − j_{k−1}) (i₀ = j₀ = 0), w_k corresponds to the city block distance, and

Σ_{k=1}^{K} w_k = I + J   (8.28)

Equation (8.24) then becomes

D(A, B) = (1/(I + J)) min_F Σ_{k=1}^{K} d(c_k) w_k   (8.29)
Since the objective function to be minimized becomes additive, the minimization can be efficiently solved without exhaustively examining all possibilities for F. The minimum of the partial sum over a partial sequence c₁, c₂, ..., c_k (c_k = (i, j)) is

g(c_k) = g(i, j) = min_{c₁, ..., c_{k−1}} Σ_{m=1}^{k} d(c_m) w_m   (8.30)
The above expresses the derivation of DP. Using all three conditions for F and the above-mentioned formulation of w_k, Eq. (8.30) can be rewritten as

g(i, j) = min [ g(i, j-1) + d(i, j),  g(i-1, j-1) + 2d(i, j),  g(i-1, j) + d(i, j) ]  (8.31)
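As a concrete illustration, the symmetric recursion of Eq. (8.31) together with the adjustment window of Eq. (8.27) can be sketched in Python (a minimal sketch; the Euclidean local distance and the window width r are illustrative assumptions, not the book's choices):

```python
import numpy as np

def dp_matching(A, B, r=3):
    """Symmetric DP matching (DTW) following Eqs. (8.27)-(8.31).

    A, B: sequences of feature vectors (I and J frames).
    r: adjustment window width, |i - j| <= r (Eq. 8.27); it must be at
       least |I - J| for the endpoint to be reachable.
    Returns the normalized distance g(I, J) / (I + J) of Eq. (8.29).
    """
    A, B = np.asarray(A, float), np.asarray(B, float)
    I, J = len(A), len(B)
    g = np.full((I, J), np.inf)
    for j in range(J):
        for i in range(I):
            if abs(i - j) > r:                     # adjustment window condition
                continue
            d = np.linalg.norm(A[i] - B[j])        # local spectral distance d(i, j)
            if i == 0 and j == 0:
                g[i, j] = 2 * d                    # initial condition g(1, 1) = 2 d(1, 1)
                continue
            cands = []
            if j > 0:
                cands.append(g[i, j - 1] + d)          # horizontal step, weight 1
            if i > 0 and j > 0:
                cands.append(g[i - 1, j - 1] + 2 * d)  # diagonal step, weight 2
            if i > 0:
                cands.append(g[i - 1, j] + d)          # vertical step, weight 1
            g[i, j] = min(cands)
    return g[I - 1, J - 1] / (I + J)
```

Identical sequences give distance zero, and a time-warped copy (a repeated frame) also matches at zero cost, which is exactly the point of nonlinear warping.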
Therefore, the distance between the two time sequences A and B after DTW can be obtained as follows. First, set the initial conditions to g(1, 1) = 2d(1, 1) and j = 1, and calculate Eq. (8.31) by varying i within the adjustment window. This calculation is iterated by increasing j until j = J. The overall distance between the two sequences is then obtained as g(I, J)/(I + J). This method is called DP matching, meaning DTW employing the DP technique. The warping function F is sometimes called the DP path. When similarity instead of distance is used as d, it becomes a maximization problem which can be solved by the same formulation.

8.5.2 Variations in DP Matching
Various restrictions for the warping function F and various formulations of w_k have been proposed and evaluated by recognition experiments. Good performance was confirmed for F and w_k both symmetrical to the two time sequences, and for the slope constraint indicated in Fig. 8.6(a), which restricts the local slope to between 1/2 and 2 (Sakoe and Chiba, 1978). Speaker-dependent word recognition experiments for 50 Japanese city names uttered by four male and female speakers indicated that the error rate under the above-mentioned conditions for DP matching was 0.8%, whereas that for linear warping was 5.9%. Asymmetrical DP matching is advantageous in that the number of summations depends only on the input or reference time sequence, and in that the number of summations is almost half that
FIG. 8.6 Slope constraints on the warping function: (a) symmetrical form; (b) asymmetrical form.
of the symmetrical method. Hence, the slope constraint indicated in Fig. 8.6(b), restricting the slope to between 1/2 and 2, which is similar to (a), is also frequently used (Itakura, 1975). Other modifications of DP matching include unconstrained endpoint DP matching, which was proposed as a means of coping with the variation in detected endpoint positions, and staggered array DP matching, capable of performing unconstrained endpoint matching with reduced calculations. The staggered array DP matching method will be described in the next subsection. The unconstrained endpoint method removes the boundary condition that the beginnings and endings of the two time sequences must be matched together, and allows for matching within a certain endpoint region. This method is free from speech period detection error and, conversely, makes it possible for the true speech period to be determined according to the DP matching results. Although either the input or reference time sequence (A in Fig. 8.6(b)) can be matched to any part of the other time sequence in the asymmetrical method, unconstrained endpoint matching can principally be performed only at the final position of both time sequences in the symmetrical method.

8.5.3 Staggered Array DP Matching
The staggered array DP matching method realizes complete symmetrical unconstrained endpoint matching and reduces the amount of computation by thinning out the lattice points in the plane spanned by the two time sequences (Shikano and Aikawa, 1982). In this method, iterative computation for DP matching is only performed at every third point along the diagonal axis, as indicated by the symbol o in Fig. 8.7(a), and the warping function is constrained as shown in Fig. 8.7(b). The amount of computation thus necessary is one-third of that using the method outlined in Fig. 8.6(a). Accumulated distance values are compensated for by the distance values at neighboring points (indicated by the symbol - in the figure). Hence, precision is maintained in spite of thinning out the accumulation points.
FIG. 8.7 DTW function (a) and its slope constraint (b) in staggered array DP matching.
Actual iteration is performed at the points (i, j) which satisfy

i + j = 3m + 2  (m = 0, 1, 2, ..., m_max)  (8.32)

within the allowable region of the warping function path for successive values of m. Here, int[s] is the integral number calculation. The intermediate accumulated value g(i, j) is stored in a register, R(k) = R(i - j), as indicated in Fig. 8.7(a). When slope constraining and distance compensation are performed as shown in Fig. 8.7(b), the DP matching calculation is performed in the following way:
R(k) = min[...]  (8.33)
Since the iterative process is performed by renewing the contents of the register R(k), the memory capacity for DP matching is also less than with conventional symmetrical methods. The unconstrained endpoint condition at both the beginning and end of the utterance is provided by using the spectral values before and after the speech period, that is, the frames before a_1 and b_1 and those after a_I and b_J. The overall distance accumulated along the optimum warping function F is obtained by
(8.34)
Word recognition experiments indicated that a higher recognition accuracy than is possible with conventional pseudo-unconstrained endpoint methods can be obtained through this method. Various modifications of this method involving changing the thinned-out points or the points included in distance accumulation have also been investigated (Shikano and Aikawa, 1982).

8.6 WORD RECOGNITION USING PHONEME UNITS

8.6.1 Principal Structure
A typical example of the phoneme-based word recognition system derived from the method indicated in Fig. 8.4(b) is shown in Fig. 8.8 (Kohda et al., 1972; Furui, 1980). In this system, phonemes are not determined at the phoneme recognition stage; instead, similarity or distance values between each frame of input speech and each phoneme reference template are used for matching with the word dictionary. When the number of phoneme reference templates is increased so that various modifications, such as context-dependent variations, are included as different templates, this method approaches that indicated in Fig. 8.4(a) using word templates. Importantly, this means that the method using phoneme templates has a wide range of variations. In the first stage of constructing the word recognition system, phoneme reference templates are created according to the size and content of the vocabulary. Each word is then represented by a sequence of phoneme reference templates and stored in the word dictionary. The number of basic phoneme reference templates used in the systems of various languages is around 40 to 50, including vowels, consonants, and several transitional templates. Plural templates are sometimes prepared for several phonemes to ensure that the variation due to coarticulation and devocalization can be adequately handled. Along with the sequence of phoneme labels, the upper and lower limits for the duration of each phoneme, and
FIG. 8.8 Block diagram of phoneme-based word recognition system using phoneme reference templates and word dictionary: (a) spectral analysis; (b) computation of log likelihood matrix; (c) DTW and computation of total likelihood between each candidate word and input speech; (d) word identification.
the presence and location of periods of silence in the word are stored for each word in the word dictionary. When an unknown utterance is input into the system, the similarity between input speech and each phoneme reference template is calculated at every frame period. All similarity values except those for silent periods are stored as a similarity matrix. The similarity matrix, word dictionary, and existence and location of the silence periods are subsequently used for word recognition. This is performed by DP matching between input speech and the phoneme reference template sequence of each word. Accumulated similarity between input speech and each word can be easily calculated using elements of the similarity matrix. In the word dictionary, plural phoneme sequences are prepared for several words in order to cope with spectral variation due to devocalization and individual differences in the manner of pronunciation. Although the word dictionary is generally speaker-independent, phoneme reference templates need to be adapted to each speaker using adaptation utterances. Unlike recognition systems using word templates, however, the adaptation utterances do not need to include all vocabulary words. In fact, spectral patterns averaged over all speech periods of each phoneme in adaptation utterances are calculated and stored as a phoneme reference template. Each of the phoneme periods in the adaptation utterances can be automatically determined by the DP matching method.

8.6.2 SPLIT Method
To apply the phoneme-based word recognition system effectively to large-vocabulary word recognition, the number of phonemes was increased so that spectral variation could be sufficiently covered. These reference templates, which do not necessarily correspond to individual phonemes, are called phonemelike templates or pseudophonemes. Hence, this recognition method is called the SPLIT (strings of phonemelike templates) method (Sugamura and Furui, 1982; Sugamura et al., 1983), and is essentially a mixture of a conventional phoneme-based or word-based recognition system
and the vector quantization (VQ) method used in speech coding (see Sec. 6.4). Phonemelike templates are produced speaker-independently or speaker-dependently by clustering a set of short-time spectral patterns extracted from a large number of speech samples. This is the same technique as that used in producing a codebook in VQ. Since these templates are produced simply according to the distribution of spectral patterns, that is, according to distance relationships between patterns having no relation to linguistic knowledge, the correspondence between each template and phoneme is not clear. It is therefore impossible to produce a word dictionary directly based on orthographic knowledge. This means that this system is language-independent; specifically, it can be applied to any language. A word dictionary is thus constructed for each word by assigning the nearest phonemelike template to each training utterance frame. A sequence of symbols indicating the templates is subsequently stored for each word. The SPLIT method is effective for reducing system complexity compared with the word-based method while still maintaining performance. As a modification of the SPLIT method, the double-SPLIT method, in which input speech and word reference templates are both vector quantized, was subsequently proposed (Shikano, 1982). Using this method in conjunction with adopting an efficient VQ technique for input speech, the amount of spectral distance calculation can be reduced since the distance value can simply be retrieved from the distance matrix. The distance matrix comprising the distances for every pair of phonemelike templates is stored prior to recognition.
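The codebook-style clustering that produces phonemelike templates, and the assignment of templates to training frames, can be sketched as a plain k-means procedure (a minimal sketch; the function names, the deterministic initialization, and the Euclidean distance on raw spectral frames are illustrative assumptions):

```python
import numpy as np

def train_codebook(frames, n_templates, iters=20):
    """Cluster short-time spectral frames into phonemelike templates with
    plain k-means (the same procedure as VQ codebook design); a deterministic
    spread-out initialization is used here for simplicity."""
    frames = np.asarray(frames, float)
    idx = np.linspace(0, len(frames) - 1, n_templates).astype(int)
    centers = frames[idx].copy()
    for _ in range(iters):
        # assign each frame to the nearest template (Euclidean distance)
        labels = ((frames[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
        for k in range(n_templates):
            if np.any(labels == k):
                centers[k] = frames[labels == k].mean(axis=0)
    return centers, labels

def word_to_template_string(word_frames, centers):
    """Represent a training utterance as a string of phonemelike template
    indices (the word-dictionary entry of the SPLIT method)."""
    word_frames = np.asarray(word_frames, float)
    return ((word_frames[:, None] - centers[None]) ** 2).sum(-1).argmin(axis=1)
```

The word dictionary then simply stores, for each vocabulary word, the index string returned by the second function for its training utterance.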
8.7 THEORY AND IMPLEMENTATION OF HMM

8.7.1 Fundamentals of HMM
The hidden Markov model (HMM) is a well-known and widely used statistical method of characterizing the spectral properties
of the frames of a pattern. These models are also referred to as Markov sources or probabilistic functions of Markov chains in the communications literature. The underlying assumption of the HMM is that the speech signal can be well characterized as a parametric random process, and that the parameters of the stochastic process can be determined (estimated) in a precise, well-defined manner. The HMM method provides a natural and highly reliable way of recognizing speech for a wide range of applications (Baker, 1975; Bahl and Jelinek, 1975; Jelinek, 1976; Ferguson, 1980; Rabiner et al., 1983; Huang et al., 1990; Rabiner and Juang, 1993; Jelinek, 1997; Knill and Young, 1997).

Figure 8.9 shows typical structures of HMM used in speech recognition. Model (a) is called an ergodic or fully connected model, in which every state of the model can be reached (in a single step) from every other state of the model. On the other hand, model (b) is called a left-to-right model or a Bakis model because the underlying state sequence associated with the model has the property that, as time increases, the state index increases; that is, the system states proceed from left to right. Clearly the left-to-right model exhibits the desirable property of being readily able to model speech whose properties change over time in a successive manner.

The HMMs can be classified into discrete models or continuous models according to whether the observable events assigned to each state (or transition) are discrete, such as codewords after vector quantization, or continuous. Either way, the observation is probabilistic; that is, the model is a doubly embedded stochastic process with an underlying stochastic process that is not directly observable (it is hidden) but can be seen only through another set of stochastic processes that produce the sequence of observations.

An HMM for discrete symbol observations is characterized by the following:

O = {O_1, O_2, ..., O_T} = observation sequence (input utterance)
T = length (duration) of observation sequence
FIG. 8.9 Typical structures of HMM used in speech recognition: (a) ergodic model; (b) left-to-right model (a_ij: transition probability, b_j(k): observation probability).
Q = {q_1, q_2, ..., q_N} = (hidden) states in the model
N = number of states
V = {v_1, v_2, ..., v_M} = discrete set of possible symbol observations (VQ codebook)
M = number of observation symbols (VQ codebook size)
A = {a_ij}, a_ij = Prob(q_j at t + 1 | q_i at t) = state transition probability distribution. For the ergodic model, a_ij > 0 for all i, j. For the left-to-right model, a_ij = 0 for j < i.
B = {b_j(k)}, b_j(k) = Prob(v_k at t | q_j at t) = observation symbol probability distribution in state j
pi = {pi_i}, pi_i = Prob(q_i at t = 1) = initial state distribution
The compact notation lambda = (A, B, pi) is used to represent an HMM. Specifying an HMM involves choosing the number of states, N, as well as the number of discrete symbols, M, and specifying the three probability distributions A, B, and pi. This parameter set is calculated using the training data, and it defines a probability measure for O = (O_1 O_2 ... O_T), i.e., Prob(O|lambda), where each observation O_t is one of the symbols from V. An observation sequence O is generated as follows:

Step 1: Set t = 1.
Step 2: Choose an initial state i according to the initial state distribution pi.
Step 3: Choose O_t according to b_i(k), the symbol probability distribution in state i.
Step 4: Choose the next state j according to {a_ij} (j = 1, 2, ..., N), the state transition probability distribution for state i.
Step 5: Set t <- t + 1. Return to step 3 if t <= T; otherwise terminate the procedure.
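Steps 1-5 can be sketched as follows (a minimal sketch; the function name, 0-indexed states and symbols, and NumPy-based sampling are illustrative choices):

```python
import numpy as np

def generate_observations(A, B, pi, T, seed=0):
    """Generate an observation sequence from a discrete HMM lambda = (A, B, pi)
    following steps 1-5 above (states and symbols are 0-indexed here)."""
    rng = np.random.default_rng(seed)
    N, M = B.shape
    state = int(rng.choice(N, p=pi))                # step 2: initial state from pi
    obs = []
    for _ in range(T):                              # steps 3-5
        obs.append(int(rng.choice(M, p=B[state])))  # symbol from b_state(k)
        state = int(rng.choice(N, p=A[state]))      # next state from a_ij
    return obs
```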
In the training phase, when 100 training utterances,

O^(n) = {O_t^(n)},  t = 1, 2, ..., T_n,  T_n = number of frames  (8.35)
are obtained (n = 1, 2, ..., 100), lambda*, which satisfies

lambda* = argmax_lambda Prod_{n=1}^{100} Prob(O^(n) | lambda)  (8.36)
is determined using the Baum-Welch algorithm (Baum, 1972). Here, Prob(O^(n)|lambda) indicates the conditional probability. In the recognition phase for the unknown input, the probability (likelihood) that the observed sequence is generated from each HMM is computed, and the model with the highest accumulated probability is selected as the correct identification. A pair of a model lambda* and a state sequence q*, (lambda*, q*), which satisfies

(lambda*, q*) = argmax_{lambda_m, q} Prob(O, q | lambda_m)  (8.37)
is determined using the Viterbi algorithm, where lambda_m is the mth model (m = 1, 2, ..., M; M = vocabulary size), O = O_1 O_2 ... O_T is input speech (T = number of frames), and q is a state sequence (Viterbi, 1967). Prob(O, q|lambda_m) can be efficiently calculated using a forward-backward algorithm. These algorithms are precisely explained in the following subsections.

8.7.2 Three Basic Problems for HMMs
There are three key problems that must be solved when utilizing the HMM.

Problem 1: Evaluation Problem
Given the observation sequence O = {O_1, O_2, ..., O_T} and the model lambda = (A, B, pi), how can the observation sequence probability Prob(O|lambda) be computed?
Problem 2: Hidden State Sequence Uncovering Problem
Given the observation sequence O = {O_1, O_2, ..., O_T}, how can a state sequence I = {i_1, i_2, ..., i_T}, which is optimal in some meaningful sense, be chosen?

Problem 3: Training Problem
How can the model parameters lambda = (A, B, pi) be adjusted to maximize Prob(O|lambda)?
The principal structure of spoken word recognition systems based on the HMM is detailed in Fig. 8.10. This structure requires the derivation of solutions to these three problems for particular use. The solution to Problem 1 is utilized to score each word model based on the given test observation sequence for recognizing an unknown word. The solution to Problem 2 is used to develop an understanding of the physical meaning of the model states. The solution to Problem 3 is employed to optimally obtain model parameters for each word model using training utterances.

8.7.3 Solution to Problem 1-Probability Evaluation

Prob(O|lambda) can be represented as

Prob(O | lambda) = Sum_{all q} pi_{q_1} b_{q_1}(O_1) a_{q_1 q_2} b_{q_2}(O_2) ... a_{q_{T-1} q_T} b_{q_T}(O_T)  (8.38)

The summation in this equation is efficiently computed by the forward-backward procedure. Consider the forward variable alpha_t(i) defined as

alpha_t(i) = Prob(O_1 O_2 ... O_t, q_i at t | lambda)  (8.39)

This indicates the probability of the partial observation sequence (until time t) and state q_i at time t, given model lambda. We can solve for alpha_t(i) recursively as follows:
FIG. 8.10 Principal structure of word recognizer based on HMM: spectral analysis of the speech wave (feature vector sequence), vector quantization (VQ) (symbol sequence), likelihood calculation against the HMM for each word (trained in the training phase), and word identification producing the recognition results.
Step 1: alpha_1(i) = pi_i b_i(O_1)  (1 <= i <= N)  (8.40)

Step 2: For t = 1, 2, ..., T - 1, 1 <= j <= N,

alpha_{t+1}(j) = [Sum_{i=1}^{N} alpha_t(i) a_ij] b_j(O_{t+1})  (8.41)

Step 3: Then

Prob(O | lambda) = Sum_{i=1}^{N} alpha_T(i)  (8.42)
This algorithm can be easily derived by transforming the HMM into a trellis or lattice diagram as shown in Fig. 8.11. In a similar manner, a backward variable beta_t(i) is defined as

beta_t(i) = Prob(O_{t+1} O_{t+2} ... O_T | q_i at t, lambda)  (8.43)
FIG. 8.11 Trellis or lattice diagram representing an HMM.
This demonstrates the probability of the partial observation sequence from t + 1 to the end, given state q_i at time t and model lambda. Again we can solve for beta_t(i) recursively as follows:

Step 1: beta_T(i) = 1  (1 <= i <= N)  (8.44)

Step 2: For t = T - 1, T - 2, ..., 1, 1 <= i <= N,

beta_t(i) = Sum_{j=1}^{N} a_ij b_j(O_{t+1}) beta_{t+1}(j)  (8.45)

Step 3: Then,

Prob(O | lambda) = Sum_{i=1}^{N} pi_i b_i(O_1) beta_1(i)  (8.46)
8.7.4 Solution to Problem 2-Optimal State Sequence
Problem 2 can be solved using the Viterbi algorithm. This algorithm is similar to the forward-backward procedure, except that a maximization over previous states is used in place of the summing procedure. The Viterbi algorithm is given as follows:

Step 1: Initialization

delta_1(i) = pi_i b_i(O_1)  (1 <= i <= N)  (8.47)

psi_1(i) = 0  (8.48)

Step 2: Recursion

For 2 <= t <= T, 1 <= j <= N,

delta_t(j) = max_{1<=i<=N} [delta_{t-1}(i) a_ij] b_j(O_t)  (8.49)

psi_t(j) = argmax_{1<=i<=N} [delta_{t-1}(i) a_ij]  (8.50)

Step 3: Termination
P* = max_{1<=i<=N} [delta_T(i)]  (8.51)

i_T* = argmax_{1<=i<=N} [delta_T(i)]  (8.52)

Step 4: State sequence backtracking

For t = T - 1, T - 2, ..., 1,

i_t* = psi_{t+1}(i_{t+1}*)  (8.53)
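Steps 1-4 can be sketched as follows, computed directly in the log domain as described below in connection with Eqs. (8.54)-(8.55) (a minimal sketch; it assumes strictly positive model parameters, otherwise a probability floor is needed before taking logarithms):

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm, Eqs. (8.47)-(8.53), computed with logarithmic values;
    assumes strictly positive A, B, and pi."""
    logA, logB, logpi = np.log(A), np.log(B), np.log(pi)
    T, N = len(obs), len(pi)
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = logpi + logB[:, obs[0]]                     # step 1: initialization
    for t in range(1, T):                                  # step 2: recursion
        scores = delta[t - 1][:, None] + logA              # log delta_{t-1}(i) + log a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, obs[t]]
    path = [int(delta[-1].argmax())]                       # step 3: termination
    for t in range(T - 1, 0, -1):                          # step 4: backtracking
        path.append(int(psi[t][path[-1]]))
    return float(delta[-1].max()), path[::-1]              # log P*, optimum states
```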
Here, P* is the maximum likelihood, and {i_t*} indicates the maximum likelihood state sequence. If one only wishes to compute P*, the psi values need not be maintained. The Viterbi algorithm is a form of the well-known dynamic programming method. In the Viterbi algorithm, the observation probability (likelihood) at each state is usually converted to a logarithmic value. Then the accumulated probability can be quickly calculated by using the DP method with only maximum selection and summation calculations. That is, for 1 <= t <= T, 1 <= j <= N,

log delta_t(j) = max_{1<=i<=N} [log delta_{t-1}(i) + log a_ij] + log b_j(O_t)  (8.54)
is calculated, and finally the log-likelihood

log P* = max_{1<=i<=N} [log delta_T(i)]  (8.55)

is obtained. Since logarithmic values are used, the dynamic range of the accumulated values becomes small, and therefore there is no need to be concerned about the underflow problem. Along with the development of the HMM, the fundamental DP technique is now often called the Viterbi algorithm.

8.7.5 Solution to Problem 3-Parameter Estimation
An iterative procedure, such as the Baum-Welch method, or a gradient technique for optimization is used for solving this problem. With the Baum-Welch algorithm, xi_t(i, j) is first defined as

xi_t(i, j) = Prob(i_t = q_i, i_{t+1} = q_j | O, lambda)  (8.56)

This denotes the probability of a path being in state q_i at time t and making a transition to state q_j at time t + 1, given observation sequence O and model lambda. xi_t(i, j) can be written as

xi_t(i, j) = alpha_t(i) a_ij b_j(O_{t+1}) beta_{t+1}(j) / Prob(O | lambda)  (8.57)
In the above equation, alpha_t(i) accounts for the first t observations, ending in state q_i at time t. The term a_ij b_j(O_{t+1}) accounts for the transition to state q_j at time t + 1 with the occurrence of symbol O_{t+1}. The term beta_{t+1}(j) accounts for the remainder of the observation sequence. Prob(O|lambda) is the normalization factor. Next, gamma_t(i) is defined as
gamma_t(i) = Prob(i_t = q_i | O, lambda)  (8.58)
This represents the probability of being in state q_i at time t, given observation sequence O and model lambda. gamma_t(i) can be expressed as

gamma_t(i) = alpha_t(i) beta_t(i) / Prob(O | lambda)  (8.59)

gamma_t(i) can be related to xi_t(i, j) by summing xi_t(i, j) over j, giving

gamma_t(i) = Sum_{j=1}^{N} xi_t(i, j)  (8.60)

If gamma_t(i) and xi_t(i, j) are each summed over the time index t (from t = 1 to t = T - 1), quantities are obtained which can be interpreted as
Sum_{t=1}^{T-1} gamma_t(i) = expected number of transitions made from q_i

and

Sum_{t=1}^{T-1} xi_t(i, j) = expected number of transitions from state q_i to state q_j

Using these quantities, the HMM parameter values can be reestimated such that

new pi_i = gamma_1(i)  (8.61)

new a_ij = [Sum_{t=1}^{T-1} xi_t(i, j)] / [Sum_{t=1}^{T-1} gamma_t(i)]  (8.62)

new b_j(k) = [Sum_{t such that O_t = v_k} gamma_t(j)] / [Sum_{t=1}^{T} gamma_t(j)]  (8.63)
The reestimation formula for pi_i corresponds to the probability of being in state q_i at t = 1. The reestimation formula for a_ij represents the ratio of the expected number of transitions from state q_i to q_j divided by the expected number of transitions out of state q_i. Finally, the reestimation formula for b_j(k) is equal to the ratio of the expected number of times of being in state j and observing symbol v_k divided by the expected number of times of being in state j. It can be verified that Prob(O|new lambda) >= Prob(O|lambda), where new lambda = (new A, new B, new pi). Therefore, if the new lambda is iteratively used in place of lambda and the above reestimation calculation is repeated, the probability of O being observed from the model can be improved until some limiting point is reached. The above reestimation algorithm is generally called the EM algorithm, since it consists of iterations of expectation value calculation and likelihood maximization.
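One reestimation iteration, Eqs. (8.56)-(8.63), can be sketched as follows (a minimal sketch for a single training sequence without scaling; long sequences would need the usual scaling of alpha and beta to avoid underflow):

```python
import numpy as np

def baum_welch_step(A, B, pi, obs):
    """One Baum-Welch reestimation step, Eqs. (8.56)-(8.63), for a single
    observation sequence. Returns the reestimated (A, B, pi) and the
    probability Prob(O|lambda) under the input model."""
    T, N = len(obs), len(pi)
    obs = np.asarray(obs)
    alpha = np.zeros((T, N))
    beta = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # forward pass
    for t in range(T - 1):
        alpha[t + 1] = (alpha[t] @ A) * B[:, obs[t + 1]]
    beta[-1] = 1.0                                    # backward pass
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    prob = alpha[-1].sum()                            # Prob(O | lambda)
    gamma = alpha * beta / prob                       # Eq. (8.59)
    xi = (alpha[:-1, :, None] * A[None, :, :] *       # Eq. (8.57): xi[t, i, j]
          (B[:, obs[1:]].T * beta[1:])[:, None, :]) / prob
    new_pi = gamma[0]                                 # Eq. (8.61)
    new_A = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]   # Eq. (8.62)
    new_B = np.zeros_like(B)
    for k in range(B.shape[1]):                       # Eq. (8.63)
        new_B[:, k] = gamma[obs == k].sum(axis=0) / gamma.sum(axis=0)
    return new_A, new_B, new_pi, prob
```

Iterating the step and watching the returned probability illustrates the monotone improvement guaranteed by the EM algorithm.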
8.7.6 Continuous Observation Densities in HMMs
All of the discussion thus far has considered only the case where the observations were characterized as discrete symbols chosen from a finite alphabet, so that a discrete probability density within each state of the model could be used. However, the observations are usually originally continuous signals or vectors, and serious degradation may be associated with this discretization. Hence it would be advantageous to be able to use HMMs with continuous observation densities to model continuous signal representations directly. The most general representation of the model probability density function (pdf), for which a reestimation procedure has been formulated, is a finite mixture of the form

b_j(O) = Sum_{k=1}^{M} c_jk N(O, mu_jk, U_jk)  (8.64)

where O is the observation vector being modeled, c_jk is the mixture coefficient for the kth mixture in state j, and N is any log-concave or
elliptically symmetric density (e.g., Gaussian). Usually a Gaussian density with mean vector mu_jk and covariance matrix U_jk for the kth mixture component in state j is used as N. The mixture gains c_jk satisfy the stochastic constraint

Sum_{k=1}^{M} c_jk = 1  (1 <= j <= N),  c_jk >= 0  (8.65)

so that the pdf is properly normalized, i.e.,

Integral b_j(O) dO = 1  (1 <= j <= N)  (8.66)

It can be shown that the reestimation formulas for the coefficients of the mixture density are of the form

new c_jk = [Sum_{t=1}^{T} gamma_t(j, k)] / [Sum_{t=1}^{T} Sum_{k=1}^{M} gamma_t(j, k)]  (8.67)

new mu_jk = [Sum_{t=1}^{T} gamma_t(j, k) O_t] / [Sum_{t=1}^{T} gamma_t(j, k)]  (8.68)

new U_jk = [Sum_{t=1}^{T} gamma_t(j, k)(O_t - mu_jk)(O_t - mu_jk)'] / [Sum_{t=1}^{T} gamma_t(j, k)]  (8.69)
where prime denotes vector transpose and where gamma_t(j, k) is the probability of being in state j at time t with the kth mixture component accounting for O_t, i.e.,

gamma_t(j, k) = [alpha_t(j) beta_t(j) / Sum_{j=1}^{N} alpha_t(j) beta_t(j)] [c_jk N(O_t, mu_jk, U_jk) / Sum_{m=1}^{M} c_jm N(O_t, mu_jm, U_jm)]  (8.71)
8.7.7 Tied-Mixture HMM
Tied-mixture HMM, also called semicontinuous HMM, is a compromise between discrete and continuous HMMs, in which a type of continuous density codebook, that is, a set of independent Gaussian densities, is designed to cover the entire acoustic space (Huang and Jack, 1989). The Gaussian densities are derived in much the same way as the discrete VQ codebook, with the resulting set of means and covariances stored in a codebook. This method differs from the discrete HMM in the way the probability of an observation vector is computed; instead of assigning a fixed probability to any observation vector that falls within an isolated region, it actually determines the probability according to the closeness of the observation vector to the mean vectors, that is, the exponent of the Gaussian distributions. For each state of each word or subword unit, the density is assumed to be a mixture of the fixed codebook densities. Hence, even though each state is characterized by a continuous mixture density, one need only estimate the set of mixture gains to specify the continuous density completely.

8.7.8 MMI and MCE/GPD Training of HMM
Instead of maximizing the likelihood of observing both the given acoustic data and the transcription, the MMI estimation procedure maximizes the mutual information between the given acoustic data
and the corresponding word or transcription (Bahl et al., 1986; Normandin, 1996). As opposed to maximum likelihood (ML) estimation, which uses only class-specific data to train the classifier for the particular class, MMI estimation takes into account information from data in competing classes. One new direction for speech recognition is discriminative training, which designs a recognizer that minimizes the error rate on task-specific testing data (Juang and Katagiri, 1992; Juang et al., 1996). Similar to MMI, discriminative training takes into account the models of other competing categories and formulates the optimization criterion so that category separation is enhanced. The optimization solution is obtained using a generalized probabilistic descent algorithm. This method is therefore called the MCE (minimum classification error)/GPD (generalized probabilistic descent) method. Unlike the Bayesian framework, this method does not require estimating the probability distributions, which usually cannot be reliably obtained. This method has been applied in various experimental studies for both speech and speaker recognition with good results.

8.7.9 HMM System for Word Recognition
Figure 8.12 (Rabiner et al., 1985) shows a block diagram of an isolated word HMM recognizer, where each word is modeled by a distinct HMM, and V is the vocabulary size. To perform isolated word speech recognition, we must do the following:

1. For each word v in the vocabulary, we must estimate the HMM model parameters lambda_v that maximize the likelihood of the training set observation vectors. It is important to limit the parameter estimates to prevent them from becoming too small. The observation probabilities, the mixture gains, and the diagonal covariance coefficients are usually constrained to be greater than or equal to some minimum values even if related conditions never occurred in the training observation set.
FIG. 8.12 Block diagram of an isolated word HMM recognizer.
2. For each unknown word, the processing shown in Fig. 8.12 (Rabiner and Juang, 1993) must be carried out, namely, measurement of the observation sequence O = {O_1, O_2, ..., O_T} via a feature analysis; followed by calculation of model likelihoods for all possible models, P(O|lambda_v), 1 <= v <= V; followed by selection of the word whose model likelihood is highest, specifically,
v* = argmax_{1<=v<=V} [P(O | lambda_v)]  (8.72)
The likelihood calculation step is generally performed using the Viterbi algorithm (i.e., the maximum likelihood path is used). The segmental k-means training procedure shown in Fig. 8.13 (Rabiner and Juang, 1993) is widely used to estimate parameter values, in which good initial estimates of the parameters of the b_j(O_t) densities are essential for rapid and proper convergence of the reestimation formulas. Following model initialization, the set of training observation sequences is segmented into states, based on the current model lambda. This segmentation is achieved by finding the optimum state sequence, via the Viterbi algorithm, and then backtracking along the optimal path. The result of segmenting each of the training sequences is a maximum likelihood estimate of the set of the observations that occur within each state according to the current model. Based on this segmentation, the model parameter set is updated. The resulting model is then compared to the previous model. If the model distance score exceeds a threshold, the old model is replaced by the new (reestimated) model, and the overall training loop is repeated. If model convergence is assumed, the final model parameters are saved.
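The segment-and-reestimate core of this loop can be sketched for a discrete HMM as follows (a minimal sketch: only the observation probabilities are updated from the Viterbi segmentation, and the probability floor stands in for the parameter limiting mentioned in item 1 above; a full implementation would also update transition and mixture parameters and test model convergence):

```python
import numpy as np

def viterbi_path(A, B, pi, obs):
    """Optimum state sequence used for segmentation (Viterbi, log domain)."""
    T = len(obs)
    delta = np.log(pi) + np.log(B[:, obs[0]])
    psi = np.zeros((T, len(pi)), dtype=int)
    for t in range(1, T):
        scores = delta[:, None] + np.log(A)
        psi[t] = scores.argmax(axis=0)
        delta = scores.max(axis=0) + np.log(B[:, obs[t]])
    path = [int(delta.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

def segmental_update(A, B, pi, sequences, floor=1e-3):
    """One segmental pass: segment each training sequence into states with the
    current model, then reestimate b_j(k) by counting the observations falling
    in each state; the floor keeps estimates away from zero."""
    counts = np.zeros_like(B)
    for obs in sequences:
        for state, symbol in zip(viterbi_path(A, B, pi, obs), obs):
            counts[state, symbol] += 1.0
    counts = np.maximum(counts, floor)            # constrain small probabilities
    return counts / counts.sum(axis=1, keepdims=True)
```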
FIG. 8.13 The segmental k-means training procedure.

8.8 CONNECTED WORD RECOGNITION

8.8.1 Two-Level DP Matching and Its Modifications
The DP matching technique used in isolated word recognition can be expanded into a technique which is applicable to connected
word recognition (Ney and Aubert, 1996). The basic process involved in this expansion is to perform DP matching between input speech and all possible concatenations of reference word templates to ensure selecting the best sequence having the smallest accumulated distance. Several problems persist, however, in finding the optimal matching sequence of reference templates. One is that the number of words in the input speech is generally unknown. Another is that the locations, in time, of the boundaries between words are unknown. The boundaries are usually unclear because the end of one word may merge smoothly with the beginning of the next word. Still another is that the amount of calculation becomes too large when all possible sequences and input speech are exhaustively matched using the method described in Sec. 8.5. This is because the number of ways of concatenating X words selected from the N-word vocabulary is N^X. It is thus very important to create an efficient means for ascertaining the optimal sequence. Fortunately, several methods have been devised that optimally solve the matching problem without giving rise to an exponential growth in the amount of calculation as the vocabulary or length of the word sequence grows. Specifically worth mentioning are four principal methods having different computation algorithms, but producing identical accumulated distance results.

1. Two-level DP matching
Since DP matching is performed on two levels in this method, it is called two-level DP matching (Sakoe, 1979). On the first level, semiunconstrained endpoint DP matching is performed between every short period of input speech and each word reference template. The starting position of the warping function is shifted frame by frame in input speech. The meaning of the semiunconstrained endpoint is that only the final position of the warping function is unconstrained. On the second level, the accumulated distance for the word sequence is calculated again using the DP matching method based on the results derived at the first level.
In exploring the method, let us assume that first-level DP matching has already been performed between partial periods of the input utterance starting from every position and each reference template. The word with the minimum distance from the input utterance between positions s and t is written as w(s, t), and its distance is written as D(s, t). w(s, t) and D(s, t) are obtained and stored for every partial period of input speech, more precisely, for every combination of s and t (1 <= s < t <= T, T = input speech length). These values are then used for second-level DP matching for obtaining the word sequence minimizing the accumulated distance over the entire input speech. That is, the recognition result is the word sequence w(1, m_1), w(m_1 + 1, m_2), ..., w(m_k + 1, T) satisfying the following equation under the condition 1 <= m_1 < m_2 < ... < m_k < T:
min_{k, m_1, ..., m_k} [D(1, m_1) + D(m_1 + 1, m_2) + ... + D(m_k + 1, T)]  (8.73)

Since this equation can be rewritten into the recursive form

D_0 = 0
D_n = min_{m = 1, ..., n} {D_{m-1} + D(m, n)}  (8.74)
it can be efficiently solved by the DP technique.
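The recursion of Eq. (8.74) can be sketched as follows (a minimal sketch; the dictionaries D and W, standing in for the first-level results D(s, t) and w(s, t), are hypothetical inputs):

```python
def best_word_sequence(D, W, T):
    """Second-level DP of Eq. (8.74). D[(s, t)] is the minimum first-level
    distance for the partial period s..t, and W[(s, t)] is the corresponding
    word. Returns the accumulated distance and the word sequence covering
    frames 1..T."""
    INF = float("inf")
    acc = [0.0] + [INF] * T          # acc[n] = D_n of Eq. (8.74), acc[0] = 0
    back = [None] * (T + 1)          # backpointer to the word's start frame
    for n in range(1, T + 1):
        for m in range(1, n + 1):    # candidate word spanning frames m..n
            if (m, n) in D and acc[m - 1] + D[(m, n)] < acc[n]:
                acc[n] = acc[m - 1] + D[(m, n)]
                back[n] = m
    words, n = [], T                 # backtrack the optimum segmentation
    while n > 0 and back[n] is not None:
        m = back[n]
        words.append(W[(m, n)])
        n = m - 1
    return acc[T], words[::-1]
```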
2. LB (level building) method

In the level building method, the number of connected words is assumed to be one for the first condition and increased successively. Distances between input speech and connected word sequence candidates are calculated to select the optimum word sequence, namely, the best matching words, for each condition (level) of the number of connected words. Figure 8.14 illustrates the LB method (Myers and Rabiner, 1981).
FIG. 8.14 Illustration of warping path regions in four-level DTW matching using the LB method (search regions for the longest and the shortest reference at each level).
For the first level, specifically, for the first word in the sequence, unconstrained endpoint DP matching is performed between input speech and each word reference template under the condition that the warping function must start from the
beginning position of the input speech. For the second and later levels, unconstrained endpoint DP matching is done using the optimum accumulated distances obtained at each previous level in the end area (m1(l)-m2(l)) in Fig. 8.14 as initial values. This procedure is repeated until the allowed maximum number of words (word string length) is reached. The word sequence with the smallest accumulated distance at the end of the input speech is finally selected as being the recognition result. The LB method is particularly beneficial in that unconstrained endpoint DP matching can be performed at every level, whereas the first level of the two-level DP matching consists of semiunconstrained endpoint matching. Consequently, since the LB method can solve the optimization problem through one-level DP matching, the amount of computation it requires is less than two-level DP matching. The LB method in the original form is unsuited to frame-synchronous, real-time processing, however, since scanning and matching with reference templates must be performed throughout the input string of speech at every level until the number of levels equals the allowed maximum number of words. Frame-synchronous processing of the LB method has been realized by the clockwise DP method described below, using an additional memory for intermediate calculation results.
3. CW (Clockwise) DP method

In contrast with the LB method, in which the assumed number of connected words (level) is increased successively and the best matching word string is selected for each level, the clockwise DP (CWDP) method performs this procedure through parallel matching synchronized to the input speech frame (Sakoe and Watari, 1981). This makes CWDP suitable for real-time processing. The number of parallel matchings corresponds to the allowable maximum number of words in the string. In the DP matching between a certain period of input speech and each word reference template, the result of optimum matching,
in particular, the optimum accumulated distance, for the speech input before this period is used as an initial condition for the recursive calculation. The repetition of the same spectral distance calculation occurring on every level of the LB method is removed in the CWDP method. Thus, CWDP requires fewer calculations than the LB method. Memory capacity increases in the CWDP method, however, since intermediate results of recursive calculations for DP matching must be stored for each figure number and for each word reference template.

4. OS (one-stage) DP method or O(n) (order n) DP method
As opposed to the two-level DP or CWDP methods, in which DP recursive calculations are performed for all the possible conditions on the number of figures at every frame, only the optimum condition is considered at every frame in the one-stage (OS) DP or order n DP method (Vintsyuk, 1971; Bridle and Brown, 1979; Nakagawa, 1983). Although investigated independently, the OS DP and O(n) DP methods are actually the same algorithm. Since this method does not involve the repetition of recursive DP calculations, it requires fewer calculations and a smaller memory. Specifically, the number of calculations necessary to calculate the distance between input and reference frames and for distance accumulation in this method does not depend on the number of figures in the input speech. If the length of the speech input and the mean length of reference templates are both constant, the number of calculations is proportional only to the size of the vocabulary, n. For this reason, this method is called the O(n) DP method. Since the intermediate results for each stage of the figure are not maintained, it is impossible to obtain the recognition results when the number of figures is specified. For the same reason, automaton control is also impossible with this method. Table 8.2 compares the number of calculations and the memory size for each of the four methods described.
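The key property of the one-stage method, a single trellis per reference template with only the best word-end score carried forward at each frame, can be sketched as follows. The 1-D "features", templates and absolute-difference frame distance are illustrative assumptions, not from the text.

```python
# A minimal one-stage (O(n)) DP sketch for connected-word matching.
# Only one accumulated-distance trellis is kept per reference template,
# so the cost per input frame grows with the total number of reference
# frames (proportional to vocabulary size for fixed template length),
# not with the number of words in the utterance.

def one_stage_dp(inp, templates):
    INF = float("inf")
    # acc[w][j] = best accumulated distance ending in frame j of word w
    acc = {w: [INF] * len(r) for w, r in templates.items()}
    best_end = 0.0                      # best word-end score so far
    for x in inp:
        new_end = INF
        for w, ref in templates.items():
            prev = acc[w]
            cur = [INF] * len(ref)
            for j, y in enumerate(ref):
                d = abs(x - y)          # frame distance (illustrative)
                if j == 0:              # word start: continue from any word end
                    cur[j] = d + min(prev[0], best_end)
                else:                   # within-word path: diagonal or horizontal
                    cur[j] = d + min(prev[j], prev[j - 1])
            acc[w] = cur
            new_end = min(new_end, cur[-1])
        best_end = new_end              # only the optimum condition is kept
    return best_end

templates = {"up": [1, 3], "down": [3, 1]}
print(one_stage_dp([1, 3, 3, 1], templates))   # -> 0.0 ("up" then "down")
```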
TABLE 8.2 Comparison of the number of calculations and the memory size for the four connected-word recognition methods.
8.8.2 Word Spotting
The term word spotting describes a variety of speech recognition applications where it is necessary to spot utterances that are of interest to the system and to reject irrelevant sounds (Rose, 1996; Rohlicek, 1995). Irrelevant sounds can include out-of-domain speech utterances, background acoustic noise, and background speech. Word spotting techniques have been applied to a wide range of problems that can suffer from unexpected speech input. These include human-machine interactions where it is difficult to constrain users' utterances to be within the domain of the system. Most word spotting systems consist of a mechanism for generating hypothesized vocabulary words or phrases from a continuous utterance, along with some sort of hypothesis testing mechanism for verifying the word occurrence. Hypothesized keywords are generated by incorporating models of out-of-vocabulary utterances and non-speech sounds that compete in a search procedure with models of the keywords. Hypothesis testing is performed by deriving measures of confidence for hypothesized words or phrases and applying a decision rule to this measure for disambiguating correctly detected words from false alarms. Word spotting was first attempted using a dynamic programming technique for template matching (Bridle, 1973). Non-linear warping of the time scale for a stored reference template for a word was performed in order to minimize an accumulated distance from the input utterance. In this system, a distance was computed by performing a dynamic programming alignment for every reference template beginning at each time instant of a continuous running input utterance. Each dynamic programming path was treated as a hypothesized keyword occurrence, requiring a second-stage decision rule for disambiguating the correctly decoded keywords from false alarms. Recently, hidden Markov model (HMM)-based approaches have been used for word spotting.
The reference template and the distance are replaced by an HMM word model and the likelihood, respectively. In these systems, the likelihood for an acoustic background or "filler" speech model is used as part of a likelihood
ratio scoring procedure in a decision rule that is applied as a second stage to the word spotter. The filler speech model represents the alternate hypothesis, that is, out-of-vocabulary or 'non-keyword' speech. Figure 8.15 (Rose, 1996) shows a basic structure of an HMM-based word spotter, in which filler models compete with the models for keywords in a finite state network. The output of the system is a continuous stream of keywords and fillers, and the occurrence of a keyword in this output stream is interpreted as a hypothesized event that is to be verified by a second-stage decision rule. The specification of grammars for constraining and weighting the possible word transitions can be incorporated into the likelihood calculation. A variety of filler structures has been used successfully. They include:

A simple one-state HMM (a Gaussian mixture);
A network of unsupervised units such as an ergodic HMM, or a parallel loop of clustered sequences;
A parallel network loop of subnetworks corresponding to keyword pieces, phonetic models, or even models of whole words, such as the most common words, a single pooled 'other' word, and unsupervised clustering of the other words; and
An explicit network characterizing typical word sequences.

Word spotting performance measures are derived using the Neyman-Pearson hypothesis testing formulation. Given a T-length sequence of observation vectors Y_k = y_1k, . . ., y_Tk corresponding to a possible occurrence of a keyword, a word spotter may generate a score S_k representing the degree of acceptance confidence for that keyword. The null hypothesis H_0 corresponds to the case where the input utterance is the correct keyword, and the alternate hypothesis H_1 corresponds to an imposter (false) utterance. A hypothesis test can be formulated by defining a decision rule S(·) such that
S(Y_k) = 0,  S_k > r  (accept H_0)
         1,  S_k ≤ r  (accept H_1)    (8.75)
where r is a constant decision threshold. We can define the type I error as rejecting H_0 when the keyword is in fact present, and the type II error as accepting H_0 when the keyword is not present. Since there is a trade-off between the two types of error, usually a bound on the type I error is specified and the type II error is minimized within this constraint. Figure 8.16 (Rose, 1996) shows a simple looping network which consists of N keywords W_k1, . . ., W_kN and M fillers W_f1, . . ., W_fM. Word insertion penalties C_ki and C_fj can be associated with the ith keyword and jth filler, respectively, and they can be adjusted to affect a trade-off between type I and type II errors, similar to adjusting r in Eq. (8.75). Suppose, for example, the network in Fig. 8.16 contained only a single keyword and a single filler. Then at each time t, the Viterbi algorithm propagates the path extending from the keyword W_k, represented by HMM λ_k, or the filler W_f, represented by HMM λ_f, to the network node according to

(8.76)

This corresponds to a decision rule at each time t:

(8.77)
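The decision rule of Eq. (8.75) can be sketched as follows, with the confidence score taken as a keyword-versus-filler log-likelihood ratio. The log-likelihood values below are illustrative placeholders rather than the output of a real HMM.

```python
# Sketch of the word-spotting decision rule of Eq. (8.75): a keyword
# hypothesis is accepted when its confidence score S_k exceeds the
# threshold r.

def spotting_decision(loglik_keyword, loglik_filler, r=0.0):
    s = loglik_keyword - loglik_filler      # confidence score S_k
    return ("accept H0", s) if s > r else ("accept H1", s)

# Raising r lowers the type II error (false alarms accepted) at the
# cost of a higher type I error (true keywords rejected).
print(spotting_decision(-42.0, -50.0, r=5.0))   # -> ('accept H0', 8.0)
print(spotting_decision(-42.0, -45.0, r=5.0))   # -> ('accept H1', 3.0)
```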
8.9 LARGE-VOCABULARY CONTINUOUS-SPEECH RECOGNITION
In large-vocabulary continuous-speech recognition, input speech is recognized using various kinds of information including a lexicon,
syntax, semantics, pragmatics, context, and prosodics. The lexicon indicates the phonemic structure of words, syntax expresses the grammatical structure, semantics defines the relationship between words as well as the attributes of each word, pragmatics expresses general knowledge concerning the present topics of conversation, context concerns the contextual information, such as that obtained through human-machine conversation, and prosodics represents accent and intonation. Various algorithms and databases used in these processes are referred to as knowledge sources for continuous-speech recognition. The keys determining system performance lie with the kinds of knowledge sources used and how they are combined as quickly as possible to produce the most probable recognition. Specifically, the focus involves how best to control the process of searching through possibilities. There are three principal issues in solving these problems: the order in which these knowledge sources should be used, the direction of processes in the input speech period, and the procedures for evaluating and selecting the most probable hypotheses.

8.9.1 Three Principal Structural Models
There are three principal models for combining and using the knowledge sources: the hierarchy model, the blackboard model, and the network model.

1. Hierarchy model
The hierarchy model distributes knowledge sources in multiple hierarchical subsystems. Results of processes are transferred between adjacent subsystems in the bottom-up direction for task-independent processes and in the top-down direction for task-dependent processes. The fundamental structure of the hierarchy model is presented in Fig. 8.17(a). Acoustic features are extracted in the acoustic processor from input speech and converted into a phoneme sequence (lattice) by
FIG. 8.17 Three principal structural models of continuous-speech recognition: (a) hierarchy model; (b) blackboard model; (c) network model.
means of segmentation and phoneme recognition. In the next step, word or word-sequence candidates are produced from the phoneme sequence, which usually includes recognition errors. A word dictionary as well as phonological rules representing phoneme
modification rules associated with coarticulation are used for the word or word-sequence recognition. In the linguistic processor, a sentence is produced by removing incorrect candidate words according to linguistic knowledge such as syntax, semantics, and context information. On the other hand, restrictions on word candidates are provided in the top-down direction from the linguistic processor to the acoustic processor. Acoustic and linguistic processes are sometimes combined at a level below the word level. Actual acoustic and linguistic processors are further divided into multiple subsystems.

2. Blackboard model
In the blackboard model, as in the hierarchy model, the recognition system is divided into multiple subsystems. A special feature of this system, however, is that each subsystem gains access to a common database independently to verify various hypotheses, as shown in Fig. 8.17(b). The process in each subsystem can be performed in parallel without synchronization. The Hearsay II system is a successful example of the blackboard model (Lesser et al., 1975). Systems based on hierarchy and blackboard models are characterized by flexibility. This is because various knowledge sources are classified and systematically combined to achieve the recognition and understanding of sentences while preserving their independence.

3. Network model
The network model embeds all knowledge except the system control mechanism in one network, with every process being performed in this network, as shown in Fig. 8.17(c). Sentence recognition based on this model corresponds to the process of searching for a path in the network which matches the input speech. The process is thus similar to connected word recognition. Although the number of calculations
is relatively large, information loss on each level as well as information loss propagation can be prevented. In addition, all processes can be controlled homogeneously, and all knowledge sources can be handled uniformly. The Harpy system is a successful application of this model (Lowerre, 1976). The problem with the network model is that it is not as flexible in its application as the two previous models. Most of the recent large-vocabulary continuous-speech recognition systems have been built based on the network model.
8.9.2 Other System-Constructing Factors
The directions which processes take are exemplified by the left-to-right and island-driven methods. In the former method, input speech is successively processed from beginning to end. In the latter method, the most reliable candidate word is first detected in the input speech, which is then processed from this word to both ends. Although both methods have advantages and disadvantages, the left-to-right method is more frequently used. This is because important words tend to be nearer the beginning of sentences, and the left-to-right method is much easier to control. Quantitative evaluation and selection of hypotheses are carried out by a variety of tree search algorithms. The depth-first method processes the longest word string first, and if this search fails, the system backtracks to the previous node. In the breadth-first method, all word strings of the same length are processed in parallel, with the process proceeding from short to long strings. With the best-first method, the word string having the largest evaluation value is selected at every node. The stack algorithm (Bahl et al., 1983) is widely used to find the best path first. These methods differ only in their search orders, exhibiting no essential difference in search capability. Reducing the search cost while maintaining the search efficiency, however, is very important for practical applications. The beam search method (Lowerre, 1976) is a modification of the breadth-first method, in which word strings with relatively
large evaluation values are selected and processed in parallel. New algorithms such as the tree-trellis algorithm (Soong and Huang, 1991), which combines a Viterbi forward search and an A* (Paul, 1991) backward search, are very efficient in generating N-best results (see Subsection 8.9.5). Various other trials have also been examined, including pruning until only reliable candidates remain. Syntactic information, that is, syntactic rules and task-dependent knowledge, is usually represented using statistical language modeling or context-free grammar (CFG). When more sophisticated control is required, they are represented by generation rules (rewriting rules) or by an augmented transition network (ATN) in which semantic information is embedded. Semantic information is represented in various ways. These include being represented by:

(1) A combination of semantic markers which indicate fundamental concepts necessary for classifying the meaning of words;
(2) Embedding the restriction of semantic word classes in the syntactic description as described above;
(3) A semantic net which indicates the semantic relationship between word classes using a graph with nodes and branches; and
(4) A case frame in which all words, mainly verbs, are qualified by words or phrases in a semantic class which coexist with the word.

Procedural knowledge representation, predicate logic, and a production system have also been used for semantic information representation.

8.9.3 Statistical Theory of Continuous-Speech Recognition
In the state-of-the-art approach, speech production as well as the recognition process is modeled through four stages: text generation,
FIG. 8.18 Structure of the state-of-the-art continuous-speech recognition system.
speech production, acoustic processing, and linguistic decoding, as shown in Fig. 8.18. A speaker is assumed to be a transducer that transforms into speech the text of thoughts he/she intends to communicate (the information source). Based on information transmission theory, the sequence of processes is compared to an information transmission system, in which a word sequence W is converted into an acoustic observation sequence Y, with probability P(W, Y), through a noisy transmission channel, which is then decoded to an estimated sequence Ŵ. The goal of recognition is then to decode the word string, based on the acoustic observation sequence, so that the decoded string has the maximum a posteriori (MAP) probability (Rabiner and Juang, 1993; Young, 1996), i.e.,
Ŵ = argmax_W P(W | Y).    (8.78)
Using Bayes' rule, Eq. (8.78) can be written as

Ŵ = argmax_W P(Y | W) P(W) / P(Y).    (8.79)
Since P(Y) is independent of W, the MAP decoding rule of Eq. (8.79) is

Ŵ = argmax_W P(Y | W) P(W).    (8.80)

The first term in Eq. (8.80), P(Y | W), is generally called the acoustic model, as it estimates the probability of a sequence of acoustic observations conditioned on the word string. The second term is generally called the language model, since it describes the probability associated with a postulated sequence of words. Such language models can incorporate both syntactic and semantic constraints of the language and the recognition task. Often, when only syntactic constraints are used, the language model is called a grammar. When the language model is represented in a finite state network, it can be integrated into the acoustic model in a straightforward manner. HMMs and statistical language models are typically used as the acoustic and language models, respectively. Figure 8.19 diagrams the computation of the probability P(W | Y) of word sequence W given the parameterized acoustic signal Y. The likelihood of the acoustic data, P(Y | W), is computed using a composite hidden Markov model representing W, constructed from simple HMM phoneme models joined in sequence according to word pronunciations stored in a dictionary.
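In the log domain, the MAP rule of Eq. (8.80) reduces to picking the hypothesis with the largest sum of acoustic and language-model scores. A minimal sketch, with placeholder scores standing in for real HMM and N-gram values:

```python
# Sketch of the MAP decoding rule of Eq. (8.80): among candidate word
# strings, pick the one maximizing P(Y|W) P(W), i.e. the sum of
# acoustic and language-model log-probabilities.
import math

def map_decode(hypotheses):
    # hypotheses: list of (word_string, log P(Y|W), log P(W))
    return max(hypotheses, key=lambda h: h[1] + h[2])[0]

# Illustrative scores: the second hypothesis fits the acoustics slightly
# better, but its language-model probability is far lower.
cands = [
    ("recognize speech",   -120.0, math.log(1e-4)),
    ("wreck a nice beach", -118.0, math.log(1e-7)),
]
print(map_decode(cands))   # prints "recognize speech"
```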
8.9.4 Statistical Language Modeling

The statistical language model P(W) for word sequences W = w_1 w_2 . . . w_k is estimated from a given large text (training) corpus (Jelinek, 1997; Ney et al., 1997). Using the definition of conditional probabilities, we obtain the decomposition

P(W) = P(w_1) P(w_2 | w_1) . . . P(w_k | w_1 . . . w_{k-1})    (8.81)

     = Π_{i=1}^{k} P(w_i | w_1 . . . w_{i-1}).    (8.82)
For large-vocabulary speech recognition, these conditional probabilities are typically used in the following way. The dependence of the conditional probability of observing a word w_i at a position i is assumed to be restricted to its immediate N−1 predecessor words w_{i−N+1} . . . w_{i−1}. The resulting model is that of a Markov chain and is referred to as an N-gram language model (N = 1: unigram; N = 2: bigram; and N = 3: trigram). The conditional probabilities P(w_i | w_{i−N+1} . . . w_{i−1}) can be estimated by the simple relative frequency

P(w_i | w_{i−N+1} . . . w_{i−1}) = C(w_{i−N+1} . . . w_i) / C(w_{i−N+1} . . . w_{i−1})    (8.83)

in which C is the number of occurrences of the string in its argument in the given training corpus. In order for the estimate in Eq. (8.83) to be reliable, C has to be substantial in the given corpus. However, if the vocabulary size is 2000 and N = 4, the possible number of different word sequences is 16 trillion (2000^4), and, therefore, even if a considerably large training corpus is given, C = 0 for many possible word sequences. One way to circumvent this problem is to smooth the N-gram frequencies by using the deleted interpolation method (Jelinek, 1997). In the case of N = 3, the trigram model, the smoothing is done by interpolating trigram, bigram, and unigram values
P(w_i | w_{i−2} w_{i−1}) = λ_1 f(w_i | w_{i−2} w_{i−1}) + λ_2 f(w_i | w_{i−1}) + λ_3 f(w_i)    (8.84)
where f denotes the relative frequencies of Eq. (8.83) and the nonnegative weights satisfy λ_1 + λ_2 + λ_3 = 1. The weights can be obtained by applying the principle of cross-validation and the EM algorithm. This method has a disadvantage in that it needs a huge number of computations if the vocabulary size is large. In order to estimate the values of N-grams that do not occur in the training corpus from N−1-gram values, Katz's backoff smoothing (Katz, 1987; Ney et al., 1997), based on the Good-Turing estimation theory, is widely used. In this method, the number of occurrences of an N-gram with few occurrences is further reduced, and the left-over probability is distributed among the unobserved N-grams in proportion to their N−1-gram probabilities. The N-gram reducing ratio is called the discounting ratio. Even with these methods, it is practically almost impossible to obtain N-grams with N larger than 3 for a large vocabulary. Therefore, the word 4-grams are often approximated by class 4-grams using word classes (groups), such as parts of speech, as units, as follows:

P(w_i | w_{i−3} w_{i−2} w_{i−1}) ≈ P(w_i | c_i) P(c_i | c_{i−3} c_{i−2} c_{i−1})    (8.85)
where c_i indicates the ith word class. A method using word cooccurrences as statistics over a wider range than adjacent words has also been explored. Language model adaptation for specific tasks and users has also been investigated. Introducing statistics into conventional grammar, as well as bigrams and trigrams of phonemes instead of words, have also been tried. Statistical language modeling is the method incorporating both syntactic and semantic information simultaneously. One of the important issues in training statistical language models from Japanese text is that there is no spacing between words in the written form. Even no clear definition of words exists. Therefore, morphemes instead of words are used as units, and morphological analysis is applied to training text for splitting sentences into morphemes and producing their bigrams and trigrams.
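The deleted-interpolation trigram described above, together with its evaluation by test-set perplexity, can be sketched on a toy corpus as follows. The interpolation weights are fixed here, whereas the text obtains them by cross-validation and the EM algorithm.

```python
# A minimal sketch of a smoothed (interpolated) trigram and its
# test-set perplexity. Counts come from a toy corpus.
import math
from collections import Counter

train = "a b a b c a b".split()
uni, bi, tri = Counter(), Counter(), Counter()
for i, w in enumerate(train):
    uni[w] += 1
    if i >= 1: bi[tuple(train[i-1:i+1])] += 1
    if i >= 2: tri[tuple(train[i-2:i+1])] += 1

def p_interp(w, u, v, lams=(0.6, 0.3, 0.1)):
    # P(w | u v) = l1*f(w | u v) + l2*f(w | v) + l3*f(w)
    f3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    f2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
    f1 = uni[w] / len(train)
    l1, l2, l3 = lams
    return l1 * f3 + l2 * f2 + l3 * f1

# Test-set log-probability per word, then perplexity 2^H.
test = "a b c".split()
logp = sum(math.log2(p_interp(test[i], test[i-2], test[i-1]))
           for i in range(2, len(test)))
print(2 ** (-logp / (len(test) - 2)))   # -> ~3.18 on this toy test set
```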
8.9.5 Typical Structure of Large-Vocabulary Continuous-Speech Recognition Systems
The structure of a typical large-vocabulary continuous-speech recognition system currently under study is shown in Fig. 8.20 (Rabiner and Juang, 1993). In this system, a speech wave is first converted into a time series of feature parameters, such as cepstra and delta-cepstra, in the feature extraction part. The system predicts a sentence hypothesis that is likely to be spoken by the user, based on the current topic, the meaning of words, and language grammar, and represents the sentence as a sequence of words. This sequence is then converted into a sequence of phoneme models which were created beforehand in a training stage. Each phoneme model is typically represented by an HMM. The likelihood (probability) of producing the time series of feature parameters from the sequence of the phoneme models is calculated, and combined with the linguistic likelihood of the hypothesized sequence to calculate the overall likelihood that the sentence was uttered by the speaker. The (overall) likelihood is calculated for other sentence hypotheses, and the sentence with the highest likelihood score is chosen as the recognition result. Thus, in most of the current advanced systems, the recognition process is performed top-down, that is, driven by linguistic knowledge. For state-of-the-art systems, stochastic N-grams are extensively used. The use of a context-free language in recognition is still limited, mainly due to the increase in computation and the difficulty in stochastic modeling. In order to incorporate linguistic context within a speech subword unit, triphones and generalized triphones are now widely used. It has been shown that the recognition accuracy of a task can be increased when linguistic context dependency is properly incorporated to reduce the acoustic variability of the speech units being modeled. When triphones are used, they result in a system that has too many parameters to train. The problem of too many parameters and too little training data is crucial in the design of a statistical speech recognizer. Therefore, tied-mixture models and
state-tying have been proposed. Figure 8.21 (Knill and Young, 1997) shows a procedure for building tied-state Gaussian-mixture triphone HMMs. In this method, similar HMM states of the allophonic variants of each basic phone are tied together in order to maximize the amount of data available to train each state. The choice of which states to tie is made based on clustering using a phonetic decision tree, where phonetic questions, such as 'Is the left context a nasal?', are used to partition the present set into subsets in a way that maximizes the likelihood of the training data. The leaf nodes of each tree determine the sets of state tyings for each of the allophonic variants. In fluent continuous speech it has also been shown that interword units take into account cross-word coarticulation and therefore provide more accurate modeling of speech units than intraword units. Word-dependent units have also been used to model poorly articulated speech sounds such as function words like a, the, in, and, etc. Since a full search of the hypotheses is very expensive in terms of processing time and storage requirements, suboptimal search strategies are commonly used. As opposed to the traditional left-to-right, one-pass search strategies, multi-pass algorithms perform a search in a way that the first pass typically prepares partial theories and additional passes finalize the complete theory in a progressive manner. Multi-pass algorithms are usually designed to provide the N-best string hypotheses. To improve flexibility, simpler acoustic and language models are often used in the first pass as a rough match to introduce a word lattice. Detailed models and detailed matches are applied in later passes to combine partial theories into the recognized sentence.
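The decision-tree state clustering described above can be sketched in simplified form: each allophone state is summarized by single-Gaussian statistics, and a phonetic question is kept if splitting on it increases the training-data log-likelihood. The phone contexts, the question, and the 1-D statistics below are illustrative; real systems use multivariate statistics and large question sets.

```python
# A simplified sketch of tied-state clustering with a phonetic
# decision tree.
import math

def loglik(states):
    # Log-likelihood of the pooled data under one shared 1-D Gaussian.
    n = sum(s["n"] for s in states)
    mean = sum(s["n"] * s["mean"] for s in states) / n
    var = sum(s["n"] * (s["var"] + (s["mean"] - mean) ** 2) for s in states) / n
    return -0.5 * n * (math.log(2 * math.pi * var) + 1)

def split_gain(states, question):
    # Likelihood gain from splitting the state pool on a yes/no question.
    yes = [s for s in states if question(s)]
    no = [s for s in states if not question(s)]
    if not yes or not no:
        return -math.inf, yes, no
    return loglik(yes) + loglik(no) - loglik(states), yes, no

# Centre states of one phone in four left contexts; nasal left
# contexts shift the (illustrative) mean feature value.
states = [
    {"left": "m", "mean": 1.9,  "var": 0.2, "n": 40},
    {"left": "n", "mean": 2.1,  "var": 0.2, "n": 35},
    {"left": "s", "mean": 0.1,  "var": 0.2, "n": 50},
    {"left": "t", "mean": -0.1, "var": 0.2, "n": 45},
]
is_left_nasal = lambda s: s["left"] in {"m", "n"}
gain, yes, no = split_gain(states, is_left_nasal)
print(gain > 0)   # the nasal question cleanly separates the two clusters
```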
8.9.6 Methods for Evaluating Recognition Systems
Three measures for representing the syntactic complexity of recognition tasks have thus far been proposed to facilitate the evaluation of the difficulty of speech recognition tasks. The
average branching factor indicates the average number of words which can be predicted, that is, the words that can follow at each position of syntactic analysis (Goodman, 1976). Equivalent vocabulary size is a modification of the average branching factor in which the acoustic similarity between words is taken into consideration (Goodman, 1976). Finally, perplexity is defined by 2^H, where H is the entropy of a word string in sentence speech (Bahl et al., 1982). Entropy H is given by the equation

H = − (1/N) Σ P(w_1 . . . w_N) log_2 P(w_1 . . . w_N)    (8.86)
where P(w_1 . . . w_N) is the probability of observing the word sequence. However, the language model perplexity calculated by using a training text corpus does not necessarily indicate the uncertainty of the texts which appear in speech recognition, since the text database is limited in its size and does not necessarily represent the whole natural language. Therefore, the following test-set perplexity PP, or log perplexity log PP, is frequently used for evaluating the difficulty of the recognition task:
log PP = − (1/N) log P(w_1 . . . w_N)    (8.87)
This indicates the observation probability of the evaluation (recognition) text per word, measured using the trained language model. Although each measure offers its own benefits, a perfect measure has not yet been proposed. The performance of recognition systems is usually measured by the following %correct or accuracy:
"
1_1_
. ...
f
Speech Recognition
323
Yocorrect =
accuracy =
N
N
-
- sub
N
- del '
100
sub - del - ins - 100 N
(8.88)
(8.89)
When words are used as measuring units, they are called word %correct and word accuracy. N is the number of words in the speech for evaluation, and sub, del and ins are the numbers of substitution errors, deletion errors and insertion errors, respectively. The accuracy, which includes insertion errors, is more strict than %correct, which does not. The number calculated by subtracting the accuracy from 100 is called the error rate. Actual systems should be evaluated by the combination of task difficulty and recognition performance.
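The %correct and accuracy of Eqs. (8.88) and (8.89) can be computed from a standard Levenshtein alignment of the reference and recognized word strings, which yields the sub, del and ins counts:

```python
# Sketch of word %correct (Eq. 8.88) and accuracy (Eq. 8.89), with
# error counts obtained from a Levenshtein alignment.

def align_counts(ref, hyp):
    # dp[i][j] = (edit distance, sub, del, ins) for ref[:i] vs hyp[:j]
    R, H = len(ref), len(hyp)
    dp = [[None] * (H + 1) for _ in range(R + 1)]
    dp[0][0] = (0, 0, 0, 0)
    for i in range(1, R + 1):
        dp[i][0] = (i, 0, i, 0)                  # all deletions
    for j in range(1, H + 1):
        dp[0][j] = (j, 0, 0, j)                  # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            if ref[i - 1] == hyp[j - 1]:
                cand = [(dp[i - 1][j - 1][0], dp[i - 1][j - 1], (0, 0, 0))]
            else:
                cand = [(dp[i - 1][j - 1][0] + 1, dp[i - 1][j - 1], (1, 0, 0))]
            cand.append((dp[i - 1][j][0] + 1, dp[i - 1][j], (0, 1, 0)))
            cand.append((dp[i][j - 1][0] + 1, dp[i][j - 1], (0, 0, 1)))
            cost, prev, (s, d, n) = min(cand, key=lambda c: c[0])
            dp[i][j] = (cost, prev[1] + s, prev[2] + d, prev[3] + n)
    _, sub, dele, ins = dp[R][H]
    N = R
    correct = 100.0 * (N - sub - dele) / N          # Eq. (8.88)
    accuracy = 100.0 * (N - sub - dele - ins) / N   # Eq. (8.89)
    return correct, accuracy

ref = "show me all flights to boston".split()
hyp = "show me the flights to the boston".split()
# 1 substitution and 1 insertion against a 6-word reference:
# %correct ≈ 83.3, accuracy ≈ 66.7.
print(align_counts(ref, hyp))
```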
8.10 EXAMPLES OF LARGE-VOCABULARY CONTINUOUS-SPEECH RECOGNITION SYSTEMS

8.10.1 DARPA Speech Recognition Projects
Applications of speech recognition technology can be classified into the two main areas of transcription and human-computer dialogue systems. A series of DARPA projects have been a major driving force of the recent progress in research on large-vocabulary, continuous-speech recognition. Specifically, transcription of speech reading newspapers, such as North America business (NAB) newspapers including the Wall Street Journal (WSJ), and conversational speech recognition using an Air Travel Information System (ATIS) task were actively investigated. Recently, broadcast news (BN) transcription and natural conversational speech recognition using Switchboard and Call Home tasks have been investigated as
major DARPA programs. Research on human-computer dialogue systems, named the Communicator Program, has also started. The broadcast news transcription technology has recently been integrated with information extraction and retrieval technology, and many application systems, such as automatic voice document indexing and retrieval systems, are under development. These systems integrate various diverse speech and language technologies including speech recognition, speaker change detection, speaker identification, name extraction, topic classification and information retrieval. In the human-computer interaction domain, a variety of experimental systems for information retrieval through spoken dialogue are investigated.

8.10.2 English Speech Recognition System at LIMSI Laboratory
The structure of a typical large-vocabulary continuous-speech recognition system developed at the LIMSI Laboratory in France for recognizing English broadcast-news speech is outlined as follows (Gauvain et al., 1999). The system uses continuous density HMMs with Gaussian mixtures for acoustic modeling and backoff N-gram statistics estimated on large text corpora for language modeling. For acoustic modeling, 39 cepstral parameters, consisting of 12 cepstral coefficients and the log energy, along with their first and second order derivatives, are derived from a Mel frequency spectrum estimated on the 0-8 kHz band (0-3.5 kHz for telephone speech models) every 10 ms. The pronunciations are based on a 48-phone set (three of them are used for silence, filler words, and breath noises). Each cross-word context-dependent phone model is a tied-state left-to-right HMM with Gaussian mixture observation densities (about 32 components), where the tied states are obtained by means of a decision tree. The acoustic models were trained on about 150 hours of Broadcast News data. Language models were trained on different data sets: BN transcripts, NAB newspapers and AP Wordstream
texts. The recognition vocabulary contains 65,122 words (72,788 phone transcriptions) and has a lexical coverage of over 99% on the evaluation test data. Prior to word decoding, a maximum likelihood partitioning algorithm using Gaussian mixture models (GMMs) segments the data into homogeneous regions and assigns gender, bandwidth, and cluster labels to the speech segments. Details of the segmentation and labeling procedure are shown in Fig. 8.22. A criterion similar to the BIC (Bayesian Information Criterion) (Schwarz, 1978) or MDL (Minimum Description Length) (Rissanen, 1984) criterion is used to decide the number of segments. The word decoding procedure is shown in Fig. 8.23. The cepstral coefficients are normalized on a segment-cluster basis using cepstral mean normalization and variance normalization. Each resulting cepstral coefficient for each segment has a zero mean and unit variance. Prior to decoding, segments longer than 30 s are chopped into smaller pieces so as to limit the memory required for the trigram decoding pass. Word recognition is performed in three steps: 1) initial hypothesis generation, 2) word graph generation, and 3) final hypothesis generation, each with two passes. The initial hypotheses are used in cluster-based acoustic model adaptation using the MLLR technique prior to word graph generation and in all subsequent decoding passes. The final hypothesis is generated using a 4-gram interpolated with a category trigram model with 270 automatically generated word classes. The overall word transcription error on the November 1998 evaluation data was 13.6%.

8.10.3 English Speech Recognition System at IBM Laboratory
The IBM system uses acoustic models for sub-phonetic units with context-dependent tying (Chen et al., 1999). The instances of context-dependent sub-phone classes are identified by growing a decision tree from the available training data and specifying the terminal nodes of the tree as the relevant instances of these classes.
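The routing of a context-dependent unit through binary phonetic questions to a tied terminal node can be sketched as follows; the questions, phone sets, and leaf names here are invented for illustration and are not those of the IBM system:

```python
# Hypothetical sketch of decision-tree state tying: a context-dependent
# instance of a sub-phone class is routed through binary phonetic
# questions, and each terminal node (leaf) defines one tied state.

NASALS = {"m", "n", "ng"}
VOWELS = {"a", "e", "i", "o", "u"}

def tied_state(left_phone, right_phone):
    """Return a leaf (tied-state) label for a triphone context."""
    if left_phone in NASALS:           # question 1: is the left context a nasal?
        return "leaf_nasal_left"
    if right_phone in VOWELS:          # question 2: is the right context a vowel?
        return "leaf_vowel_right"
    return "leaf_other"

# Contexts falling in the same leaf share one set of mixture parameters,
# so even unseen triphone contexts map to a trained distribution.
print(tied_state("m", "a"))   # -> leaf_nasal_left
print(tied_state("t", "a"))   # -> leaf_vowel_right
```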
FIG. 8.22 Segmentation and labeling procedure of the LIMSI system. [Figure: Viterbi segmentation with GMMs (speech/music/background) → chop into small segments → train a GMM for each segment → Viterbi segmentation and re-estimation → GMM clustering, iterated until there are fewer clusters and no further change → Viterbi segmentation with energy constraint → bandwidth and gender identification.]
FIG. 8.23 Word decoding procedure of the LIMSI system. [Figure: cepstral mean and variance normalization → chop into segments smaller than 30 s → generate initial hypotheses → MLLR adaptation and word graph generation.]
The acoustic feature vectors that characterize the training data at the leaves are modeled by a mixture of Gaussian or Gaussian-like pdfs with diagonal covariance matrices. The HMM used to model
each leaf is a simple one-state model, with a self-loop and a forward transition. The total number of Gaussians is 289 k. The BIC is used as a model selection criterion in segmentation, in clustering for unsupervised adaptation, and in choosing the number of Gaussians in Gaussian mixture modeling. The IBM system shows almost the same word error rate as the LIMSI system.

8.10.4 A Japanese Speech Recognition System
A large-vocabulary continuous-speech recognition system for Japanese broadcast-news speech transcription has been developed at Tokyo Institute of Technology in Japan (Ohtsuki et al., 1999). This is part of a joint research project with a broadcast company whose goal is the closed-captioning of TV programs. The broadcast-news manuscripts that were used for constructing the language models were taken from a period of roughly four years, and comprised approximately 500 k sentences and 22 M words. To calculate word N-gram language models, the broadcast-news manuscripts were segmented into words by using a morphological analyzer, since Japanese sentences are written without spaces between words. A word-frequency list was derived for the news manuscripts, and the 20 k most frequently used words were selected as vocabulary words. This 20 k vocabulary covers about 98% of the words in the broadcast-news manuscripts. Bigrams and trigrams were calculated, and unseen N-grams were estimated using Katz's back-off smoothing method. As shown in Fig. 8.24, a two-pass search algorithm was used, in which bigrams were utilized in the first pass and trigrams were employed in the second pass to rescore the N-best hypotheses obtained as the result of the first pass. Japanese text is written with a mixture of three kinds of characters: Chinese characters (Kanji) and two kinds of Japanese characters (Hiragana and Katakana). Each Kanji has multiple readings, and correct readings can only be decided according to context. Therefore, a language model that depends on the readings of words was constructed in order to take into account the
FIG. 8.24 Two-pass search structure used in the Japanese broadcast-news transcription system. [Figure: speech → acoustic analysis → beam-search decoder (first pass) → N-best hypotheses with acoustic scores → rescoring (second pass) → recognition results; acoustic model training and trigram language model training supply the models.]
frequency and context-dependency of the readings. Broadcast-news speech includes filled pauses at the beginning and in the middle of sentences, which cause recognition errors in language models that use news manuscripts written prior to broadcasting. To cope with this problem, filled-pause modeling was introduced into the language model. After applying online, unsupervised, incremental speaker adaptation using the MLLR-MAP (see Subsection 8.11.4) and VFS (vector-field smoothing) (Ohkura et al., 1992) methods, a word error rate of 11.9%, on average over male and female speakers, was obtained for clean speech with no background noise. Summarizing transcribed news speech is useful for retrieving or indexing broadcast news. A method has been investigated for extracting topic words from nouns in the speech recognition results on the basis of a significance measure. The extracted topic words were compared with 'true' topic words, which were given by three human subjects. The results showed that, when the top five topic words were chosen (recall = 13%), 87% of them were correct on average. Based on these topic words, summarizing sentences were created by reconstructing compound words and inserting verbs and postpositional particles.
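Such topic-word extraction can be sketched with a TF-IDF-like significance measure; the measure, the variable names, and the corpus statistics below are assumptions for illustration, not the system's actual formulation:

```python
import math

# Illustrative sketch of topic-word extraction from recognized nouns.
# `doc_freq` maps a word to its document frequency in a reference corpus
# of `num_docs` documents (assumed statistics).

def topic_words(recognized_nouns, doc_freq, num_docs, top_n=5):
    scores = {}
    for w in recognized_nouns:
        tf = recognized_nouns.count(w)              # frequency in the transcript
        df = doc_freq.get(w, 1)                     # unseen words get df = 1
        scores[w] = tf * math.log(num_docs / df)    # significance score
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

nouns = ["election", "vote", "election", "weather", "vote", "election"]
df = {"election": 20, "vote": 50, "weather": 400}
print(topic_words(nouns, df, num_docs=1000, top_n=2))   # -> ['election', 'vote']
```

Frequent words that are rare in the general corpus score highest, which matches the intuition that they carry the topic of the news item.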
8.11 SPEAKER-INDEPENDENT AND ADAPTIVE RECOGNITION

Speaker-dependent variations in speech spectra are very complicated, and, as indicated in Subsection 8.1.2, there is no evidence that common physical features exist in the same words uttered by different speakers, even if the words can be clearly recognized by humans. A statistical analysis of the relationship between phonetic and individual information revealed that there is significant interaction between them (Furui, 1978). It is thus very difficult for a system to accurately recognize spoken words or sentences uttered by many speakers, even if the vocabulary is as small as 10 digits. Only with a small vocabulary and no similar word pairs in the spectral domain can high accuracy be
achieved using a reference template or a model obtained by averaging the spectral patterns of many speakers for each word. Although looking for phonetic invariants, principally physical features commonly existing for all speakers for each phoneme, is important as basic research, it seems too ambitious an undertaking. The present effective methods for coping with the problem of speaker variability can be classified into two types. One constitutes methods in which reference templates or statistical word/subword models are designed so that the range of individual variation is covered for each word, whereas the ranges of different words do not overlap. The other includes those in which the recognition system is provided with a training mechanism for automatically adapting to each new speaker. The need for effectively handling individual variations in speech has resulted in the latter type of method, that is, introducing normalization or adaptation mechanisms into a speech recognizer. Such a method is based on the voice characteristics of each speaker observed using utterances of a small number of words or short sentences. In the normalization method, spectral variation is normalized or removed from the input speech, whereas in the adaptation method, the recognizer templates or models are adapted to each speaker. Normalization or adaptation mechanisms are essential for very-large-vocabulary speaker-independent word recognition. Since it is almost impossible to conduct training involving every word in a large vocabulary, training using a short string of speech serves as a useful and realistic way of coping with the individuality problem. Unsupervised (online) adaptation has also been attempted, wherein the recognition system is automatically adapted to the speaker through the repetition of the recognition process, without the need for utterances of predetermined words or sentences. Humans have also been found to possess a similar adaptation mechanism.
Specifically, although the first several words uttered by a speaker new to the listener may be unintelligible, the listener quickly becomes accustomed to the speaker's voice. Thus, the intelligibility of the speaker's voice increases, particularly after the listener hears several words and utterances (Kato and Kawahara, 1984).
This section will focus on: 1) the multi-template method, in which multiple templates are created for each vocabulary word by clustering individual variations; 2) the statistical method, in which individual variations are represented by the statistical parameters in HMMs; and 3) the speaker normalization and adaptation methods, in which speaker variability of input speech is automatically normalized or speaker-independent models are adapted to each new speaker.

8.11.1 Multi-template Method
A spoken word recognizer based on the multi-template method clusters the speech data uttered by many speakers, and the speech sample at the center of each cluster, or the mean value of the speech data associated with each cluster, is stored as a reference template. Several algorithms are used in combination for clustering (Rabiner et al., 1979a, b). In the recognition phase, distances (or similarities) between the input speech and all reference templates of all vocabulary words are calculated based on DP matching, and the word with the smallest distance is selected as the word spoken. In order to increase reliability, the KNN (K-nearest neighbor) method is often used for the decision. Here, the K reference templates with the smallest distances from the input speech are selected from the multiple-reference template set for each word, with the mean value over these K templates being calculated for each word. The word with the smallest mean value is then selected as the recognition result. Experiments revealed that with 12 templates for each word, the recognition accuracy for K = 2 to 3 is higher than for K = 1. Speaker-independent connected-digit recognition experiments were performed combining the LB and multi-template methods. This method is disadvantageous, however, in that when the number of reference templates for each word increases, the recognition task becomes equivalent to large-vocabulary word recognition, increasing the number of calculations and the memory size. These problems have been resolved through the investigation of two methods based on the structure shown in Fig. 8.4(b), in which phoneme templates and a word dictionary are utilized. In the first trial, the same word dictionary was used for all speakers, and multiple sets of phoneme templates were prepared to cover variations in individual speakers (Nakatsu et al., 1983). In the second instance, the SPLIT method (see Subsection 8.6.2) was modified to use multiple word templates (pseudo-phoneme sequences) for each word to cover speaker variations, whereas the set of pseudo-phoneme templates remains common to all speakers (Sugamura and Furui, 1984). This method was found to be able to reduce the number of calculations and the memory size to roughly one-tenth of those of the method using word-based templates, while maintaining recognition accuracy. In this method, pseudo-phonemes and the multiple sequences in the word dictionary are produced by the same clustering algorithm. A VQ-based preprocessor has been combined with the modified SPLIT method for large-vocabulary speaker-independent isolated word recognition (Furui, 1987). Here, a speech wave is analyzed by time functions of instantaneous cepstral coefficients and short-time regression coefficients for both cepstral coefficients and logarithmic energy. Regression coefficients represent spectral dynamics in every short period, as described in Sec. 8.3.6. A universal VQ codebook for these time functions is constructed based on a multispeaker, multiword database. Next, a separate codebook is designed as a subset of the universal codebook for each word in the vocabulary. These word-specific codebooks are used for front-end processing to eliminate word candidates with large-distance (distortion) scores. The SPLIT method subsequently resolves the choice among the remaining word candidates.

8.11.2 Statistical Method
The HMM method described in Sec. 8.7 is capable of including spectral distributions and variations in transition probabilities for many speakers in the model as a result of statistical parameter estimation. It has been repeatedly shown that, given a large set of training speech, good statistical models can be constructed to achieve high performance for many standardized speech recognition tasks. Recognition experiments demonstrated that this method can achieve better recognition accuracy than the multi-template method. The amount of computation required by the HMM method is also much smaller than that of the multi-template method (Rabiner et al., 1983). A trial was also conducted using HMMs at the word level in the LB method (Rabiner and Levinson, 1985). It is still impossible, however, to accurately recognize the utterances of every speaker. A small percentage of people occasionally cause systems to produce exceptionally low recognition rates because of large mismatches between the models and the input speech. This is an example of the 'sheep and goats' phenomenon.

8.11.3 Speaker Normalization Method
The nature of the speech production mechanism suggests that the vocal cord spectrum and the effects of vocal tract length cause phoneme-independent physical individuality in voiced sounds. Furthermore, the former can be observed in the averaged overall spectrum, that is, in the overall pattern of the long-time average spectrum, and the latter can be seen in the linear expansion or contraction coefficient along the frequency axis of the speech spectrum. Based on this evidence, individuality normalization has been introduced for the phoneme-based word recognition system described in Subsection 8.6.1 (Furui, 1975). Experimental results show that although this method is effective, a gap exists between the recognition accuracies obtained using the method and those achieved after training utilizing all of the vocabulary words for each speaker. This means that a more complicated model is necessary to ensure complete representation of voice individuality.
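The linear expansion or contraction of the frequency axis can be illustrated with a small sketch that resamples a spectrum at warped frequencies; the interpolation scheme and the warping factor here are illustrative assumptions, not the book's exact procedure:

```python
# Minimal sketch of linear frequency warping: the uniformly sampled
# spectrum is re-read at warped bin positions alpha * i, with linear
# interpolation between neighboring bins (clamped at the upper edge).

def warp_spectrum(spectrum, alpha):
    """Linearly warp a uniformly sampled spectrum by factor `alpha`."""
    n = len(spectrum)
    warped = []
    for i in range(n):
        x = alpha * i                      # warped (fractional) bin index
        lo = min(int(x), n - 1)
        hi = min(lo + 1, n - 1)
        frac = x - int(x)
        warped.append((1 - frac) * spectrum[lo] + frac * spectrum[hi])
    return warped

spec = [0.0, 1.0, 2.0, 3.0]
print(warp_spectrum(spec, 0.5))   # compresses the axis: [0.0, 0.5, 1.0, 1.5]
```

A factor alpha < 1 stretches the spectrum (longer effective vocal tract), alpha > 1 compresses it; normalization amounts to choosing alpha per speaker so that warped spectra best match the reference.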
Nonlinear warping of the spectrum along the frequency axis has also been attempted, using the DP technique, for normalizing voice individuality (Matsumoto and Wakita, 1986). Since excessive warping causes the loss of phonetic features, an appropriate limit must be set for the warping function.

8.11.4 Speaker Adaptation Methods
The main adaptation methods currently being investigated are: 1) Bayesian learning, 2) spectral mapping, 3) linear (piecewise-linear) transformation, and 4) speaker cluster selection. Important practical issues in using adaptation techniques include the specification of a priori parameters (information), the availability of supervision information, and the amount of adaptation data needed to achieve effective learning. Since it is unlikely that all the phoneme units will be observed enough times in a small adaptation set, especially in large-vocabulary continuous-speech recognition systems, only a small number of parameters can be effectively adapted. It is therefore desirable to introduce some parameter correlation or tying so that all model parameters can be adjusted at the same time in a consistent manner, even if some units are not included in the adaptation data. The Bayesian learning framework offers a way to incorporate newly acquired application-specific data into existing models and to combine them in an optimal manner. It is therefore an efficient technique for handling the sparse training data problem typically found in model parameter adaptation. This framework has been used to derive MAP (maximum a posteriori) estimates of the parameters of speech models, including HMM parameters (Lee and Gauvain, 1996). The MCE/GPD method described in Subsection 8.7.8 has also been successfully combined with MAP speaker adaptation of HMM parameters (Lin et al., 1994; Matsui et al., 1995). In the spectral mapping method, speaker-adaptive parameters are estimated from speaker-independent parameters based on
mapping rules. The mapping rules are estimated from the relationship between speaker-independent and speaker-dependent parameters (Shikano et al., 1986). If a correlation structure between parameters can be established, and the correlation parameters can be estimated when training the general models, the parameters of unseen units can be adapted accordingly (Furui, 1980; Cox, 1995). To improve adaptation efficiency and effectiveness along this line, several techniques have been proposed, including probabilistic spectral mapping (Schwartz et al., 1987), cepstral normalization (Acero et al., 1990), and spectrum bias and shift transformation (Sankar and Lee, 1996). In addition to clustering and smoothing, a second type of constraint can be given to the model parameters so that all the parameters are adjusted simultaneously according to a predetermined set of transformations, e.g., a transformation based on multiple regression analysis (Furui, 1980). Various methods have recently been proposed in which a linear (affine) transformation between the reference and adaptive speaker-feature vectors is defined and then translated into a bias vector and a scaling matrix, which can be estimated using an EM algorithm (MLLR; Maximum Likelihood Linear Regression method) (Leggetter and Woodland, 1995). The transform parameters can be estimated from adaptation data that form pairs with the training data. In the speaker cluster selection method, it is assumed that speakers can be divided into clusters within which the speakers are similar. From many sets of phoneme model clusters representing speaker variability, the most suitable set for the new speaker is automatically selected. This method is useful for choosing initial models, to which more sophisticated speaker adaptation techniques are applied.

8.11.5 Unsupervised Speaker Adaptation Methods
The most useful adaptation method is unsupervised online instantaneous adaptation. In this approach, adaptation is performed at run time on the input speech in an unsupervised manner.
Therefore, the recognition system does not require training speech to estimate the speaker characteristics; it works as if it were a universal (speaker-independent) system. This method is especially useful when the speakers change frequently. The most important issue in this method is how to perform phoneme-dependent adaptation without knowing the correct model sequence for the input speech. This is especially difficult for speakers whose utterances are error-prone when using universal (speaker-independent) models, that is, for speakers who definitely need adaptation. It is very useful if the online adaptation is performed incrementally, in which case the recognition system continuously adapts to new adaptation data without using previous training data (Matsuoka and Lee, 1993). Hierarchical spectral clustering is an adaptive clustering technique that performs speaker adaptation in an automatic, self-organizing manner. The method was proposed for a matrix-quantization-based speech coder (Shiraki et al., 1990) and a VQ-based word-recognition system (Furui, 1989a, 1989b) in which each word is represented by a set of VQ index sequences. Speaker adaptation is achieved by adapting the codebook entries (spectral vectors) to a particular speaker while keeping the index sequence set intact. The key idea of this method is to hierarchically cluster the spectra in the new adaptation set in correspondence with those in the original VQ codebook. The correspondence between the centroid of a new cluster and the original code word is established by way of a deviation vector. Using deviation vectors, either code words or input frame spectra are shifted so that the corresponding centroids coincide. Continuity between adjacent clusters is maintained by determining the shifting vectors as the weighted sum of the deviation vectors of adjacent clusters. Adaptation is thus performed hierarchically from global to local individuality, as shown in Fig. 8.25.
In the figure, u_m and v_m indicate the centroid of the mth codebook element cluster and that of the corresponding training speech cluster, respectively, p_m is the deviation vector between these two centroids, and c_i is a codebook element. The MLLR method has also been used as a constraint in unsupervised speaker adaptation (Cox and Bridle, 1989; Digalakis and Neumeyer, 1995).
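The shifting of codewords by a weighted sum of deviation vectors can be sketched as follows; the inverse-distance weighting used here is an assumption for illustration, not necessarily the weighting of the original method:

```python
# Rough sketch of codebook adaptation by deviation vectors: for each
# cluster m the deviation p_m = v_m - u_m (adaptation-speech centroid
# minus codebook centroid) is computed, and each codeword is shifted by
# a weighted sum of nearby clusters' deviations so that adjacent
# clusters move smoothly together.

def adapt_codebook(codewords, u, v, eps=1e-6):
    deviations = [[vi - ui for ui, vi in zip(um, vm)] for um, vm in zip(u, v)]
    adapted = []
    for c in codewords:
        # weight each cluster's deviation by inverse distance to its centroid
        dists = [sum((ci - ui) ** 2 for ci, ui in zip(c, um)) ** 0.5 for um in u]
        weights = [1.0 / (d + eps) for d in dists]
        total = sum(weights)
        shift = [sum(w * p[k] for w, p in zip(weights, deviations)) / total
                 for k in range(len(c))]
        adapted.append([ci + si for ci, si in zip(c, shift)])
    return adapted

u = [[0.0, 0.0], [10.0, 0.0]]      # centroids of codebook element clusters
v = [[1.0, 0.0], [10.0, 2.0]]      # centroids of adaptation-speech clusters
code = [[0.0, 0.0]]
print(adapt_codebook(code, u, v))  # codeword moves almost entirely with cluster 1
```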
The N-best-based unsupervised adaptation method (Matsui and Furui, 1996) uses the N most likely word sequences in parallel and iteratively maximizes the joint likelihood of sentence hypotheses and model parameters. The N-best hypotheses are created for each input speech by applying speaker-independent models; speaker adaptation based on constrained Bayesian learning is then applied to each hypothesis. Finally, the hypothesis with the highest likelihood is selected as the most likely sequence. Figure 8.26 shows the overall structure of such a recognition system. Conventional iterative maximization, which sequentially estimates hypotheses and model parameters, can only reach a local maximum, whereas the N-best-based method can find a global maximum if reasonable constraints on the parameters are applied. Without reasonable constraints based on models of inter-speaker variability, an input utterance can be adapted to any hypothesis with a resulting high likelihood. To reduce this problem, constraints should be placed on the transformation so that it maintains a reasonable geometrical shape. Because inter-speaker variability often interacts with other variations, such as allophonic contextual dependency, intra-speaker speech variation, environmental noise, and channel distortion, it is important to create methods that can simultaneously cope with these other variations. Inter-speaker variability is generally more difficult to cope with than noise and channel variability, since the former is non-linear whereas the latter can usually be modeled as a linear transformation in the time, spectral, or cepstral domain. Therefore, the algorithms proposed for speaker adaptation can generally be applied to noise and channel adaptation.
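The linear-transformation idea that underlies MLLR-style adaptation can be illustrated with a toy least-squares fit of a scale and bias mapping reference means onto adaptation-data means; full MLLR estimates a matrix transform with an EM algorithm, so this scalar closed form is only a conceptual sketch with assumed names:

```python
# Hedged sketch: fit obs ~ a * ref + b per dimension by least squares,
# i.e., a one-dimensional stand-in for the bias vector and scaling
# matrix of an affine (MLLR-like) adaptation transform.

def fit_scale_bias(ref, obs):
    """Least-squares fit of obs ~ a * ref + b for paired scalar sequences."""
    n = len(ref)
    mr = sum(ref) / n
    mo = sum(obs) / n
    cov = sum((r - mr) * (o - mo) for r, o in zip(ref, obs))
    var = sum((r - mr) ** 2 for r in ref)
    a = cov / var
    b = mo - a * mr
    return a, b

ref_means = [1.0, 2.0, 3.0]    # reference (speaker-independent) means
obs_means = [2.5, 4.5, 6.5]    # observed adaptation-data means
a, b = fit_scale_bias(ref_means, obs_means)
print(a, b)   # -> 2.0 0.5
```

Because one transform is shared by many distributions, all tied model means move consistently even when some units never occur in the adaptation data.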
8.12 ROBUST ALGORITHMS AGAINST NOISE AND CHANNEL VARIATIONS
The performance of a speech recognizer is well known to often degrade drastically when there are acoustic as well as
linguistic mismatches between the testing and training conditions. In addition to the speaker-to-speaker variability described in the previous section, acoustic mismatches arise from signal discrepancies due to varying environmental and channel conditions, such as telephone, microphone, background noise, room acoustics, and bandwidth limitations of transmission lines, as shown in Fig. 8.27. When people speak in a noisy environment, not only does the loudness (energy) of their speech increase, but the pitch and frequency components also change. These speech variations are called the Lombard effect. The linguistic mismatches arise from different task constraints. There has been a great deal of effort aimed at improving speech recognition robustness under the abovementioned mismatches. Figure 8.28 shows the main methods for reducing mismatches that have been investigated to resolve speech variation problems (Juang, 1991; Furui, 1992b, 1995c), along with the basic sequence of speech recognition processes. These methods can be classified into three levels: signal level, feature level, and model level. Since the speaker normalization and adaptation methods were described in the previous section, this section focuses on environmental and channel mismatch problems. Several methods have been used to deal with additive noise: using special microphones, using auditory models for speech analysis and feature extraction, subtracting noise, using noise masking and adaptive models, using spectral distance measures that are robust against noise, and compensating for spectral deviation. Various methods have also been used to cope with the problems caused by differences in the characteristics of different kinds of microphones and transmission lines. A commonly used method is cepstral mean subtraction (CMS), also called cepstral mean normalization (CMN), in which the long-term cepstral mean is subtracted from the utterance. This method is very simple but very effective in various applications of speech and speaker recognition (Atal, 1974; Furui, 1981).
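Cepstral mean subtraction is simple enough to sketch directly; `frames` is an assumed name for the sequence of cepstral vectors of one utterance:

```python
# Cepstral mean subtraction/normalization: subtract the long-term mean of
# each cepstral coefficient over the utterance. A stationary convolutional
# channel adds a constant offset in the cepstral domain, so removing the
# mean removes the channel effect.

def cepstral_mean_subtraction(frames):
    """frames: list of cepstral vectors (one per analysis frame)."""
    n = len(frames)
    dim = len(frames[0])
    mean = [sum(f[k] for f in frames) / n for k in range(dim)]
    return [[f[k] - mean[k] for k in range(dim)] for f in frames]

frames = [[1.0, 2.0], [3.0, 4.0]]
print(cepstral_mean_subtraction(frames))   # -> [[-1.0, -1.0], [1.0, 1.0]]
```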
FIG. 8.27 Main causes of acoustic variation in speech. [Figure: speaker (voice quality, pitch, gender, dialect, speaking style, stress/emotion, speaking rate, Lombard effect); noise (other speakers, background noise); distortion; microphone (distortion, electrical noise, directional characteristics); task/context (man-machine dialogue, dictation, free conversation, interview, phonetic/prosodic context).]
FIG. 8.28 Main methods for contending with voice variation in speech recognition. [Figure: signal-level methods (close-talking microphone, microphone array, auditory models (EIH, SMC, PLP), noise subtraction, adaptive filtering, comb filtering); feature-level normalization (spectral mapping, cepstral mean normalization); distance/distortion measures (frequency-weighted distance measure, weighted cepstral distance, cepstrum projection measure); model-level normalization/adaptation of robust reference templates/models (noise addition, HMM (de)composition (PMC), model transformation (MLLR), Bayesian adaptive learning); followed by matching, word spotting, utterance verification, and recognition results.]
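As a signal-level illustration, the simple noise subtraction listed among these methods can be sketched in the power-spectral domain with flooring; the flooring factor is an assumed, tunable parameter:

```python
# Sketch of simple noise subtraction: subtract a noise power spectrum
# estimated from non-speech frames, clamping ("flooring") the result so
# that over-subtraction cannot produce negative power.

def spectral_subtraction(power_spectrum, noise_estimate, floor=0.01):
    return [max(s - n, floor * s) for s, n in zip(power_spectrum, noise_estimate)]

noisy = [5.0, 2.0, 0.5]
noise = [1.0, 1.0, 1.0]
print(spectral_subtraction(noisy, noise))   # bins below the noise level are floored
```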
8.12.1 HMM Composition/PMC
The HMM composition/parallel model combination (PMC) method creates a noise-added-speech HMM by combining HMMs that model speech and noise (Gales and Young, 1992; Martin et al., 1993). This method is closely related to the HMM decomposition proposed by Varga and Moore (1990, 1991). In HMM composition, observation probabilities (means and covariances) for a noisy speech HMM are estimated by convolving the observation probabilities in a linear spectral domain. Figures 8.29 and 8.30 show the HMM composition process. Since a noise HMM can usually be trained by using input signals without speech, this method can be considered as an adaptation process in which speech HMMs are adapted on the basis of the noise model. This method can be applied not only to stationary noise but also to time-variant noise, such as another speaker's voice. The effectiveness of this method was confirmed by experiments using speech signals to which noise or other speech had been added. The experimental results showed that this method produces recognition rates similar to those of HMMs trained by using a large noise-added speech database. This method has fairly recently been extended to simultaneously cope with additive noise and convolutional (multiplicative) distortion (Gales and Young, 1993; Minami and Furui, 1995).

8.12.2 Detection-Based Approach for Spontaneous Speech Recognition
One of the most important remaining issues for speech recognition is how to create language models (rules) for spontaneous speech. When recognizing spontaneous speech in dialogues, it is necessary to deal with variations that are not encountered when recognizing speech that is read from texts. These variations include extraneous words, out-of-vocabulary words, ungrammatical sentences, disfluency, partial words, repairs, hesitations, and repetitions. It is
crucial to develop robust and flexible parsing algorithms that match the characteristics of spontaneous speech. How to extract contextual information, predict users' responses, and focus on key words are very important issues. A paradigm shift from the present transcription-based approach to a detection-based approach will be important for resolving such problems. A detection-based system consists of detectors, each of which aims at detecting the presence of a prescribed event, such as a phoneme, a word, a phrase, or a linguistic notion such as an expression of travel destination. The detector uses a model for the event and an anti-model that provides contrast to the event. It follows the Neyman-Pearson lemma in that the likelihood ratio is used as the test statistic against a threshold. Several simple implementations of this paradigm have shown promise in dealing with natural utterances containing many spontaneous speech phenomena (Kawahara et al., 1997). The following issues need to be addressed in this formulation: (1) How to train the models and anti-models? The idea of discriminative training can be applied using the verification error as the optimization criterion. (2) How to choose detection units? Reasonable choices are words and key phrases. (3) How to include language models and event context/constraints, which can help raise the system performance in the integrated search after the detectors propose individual decisions?
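The likelihood-ratio test at the heart of such a detector can be sketched with single univariate Gaussians standing in for the event model and anti-model; the parameters and threshold below are illustrative assumptions:

```python
import math

# Sketch of the detection paradigm: score an observation with the log
# likelihood ratio between an event model and an anti-model, then compare
# it against a threshold (a Neyman-Pearson-style test).

def log_gauss(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def detect(x, model, anti_model, threshold=0.0):
    llr = log_gauss(x, *model) - log_gauss(x, *anti_model)
    return llr > threshold

event = (1.0, 0.5)     # (mean, variance) of the event model
anti = (-1.0, 0.5)     # anti-model providing contrast to the event
print(detect(0.9, event, anti))    # -> True
print(detect(-0.9, event, anti))   # -> False
```

Raising the threshold trades missed detections for fewer false alarms, which is the operating-point choice the Neyman-Pearson formulation makes explicit.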
Speaker Recognition
9.1 PRINCIPLES OF SPEAKER RECOGNITION

9.1.1 Human and Computer Speaker Recognition
A technology closely related to speech recognition is speaker recognition, that is, the automatic recognition of a speaker (talker) through measurements of specifically individual characteristics arising in the speaker's voice signal (Doddington, 1985; Furui, 1986; Furui, 1996; Furui, 1997; O'Shaughnessy, 1986; Rosenberg and Soong, 1991). Speaker recognition research is especially closely intertwined with the principles underlying speaker-independent speech recognition technology. In the broadest sense of the word, speaker recognition research also involves investigating the clues humans use to recognize speakers, either by sound spectrogram (voice print) (Kersta, 1962; Tosi et al., 1972) or by hearing. History notes that as early as 1660 a witness was recorded as having been able to identify a defendant by his voice at one of the trial sessions summoned to determine the circumstances surrounding the death of Charles I (NRC, 1979). Speaker recognition did not become a subject of scientific inquiry until over two centuries later,
however, when telephony made possible speaker recognition independent of distance, in conjunction with sound recording giving rise to speaker recognition independent of time. The use of sound spectrograms in the 1940s also incorporated the sensory capability of vision, along with that of hearing, in performing speaker recognition. Notably, it was not until 1966 that a court of law finally admitted speaker recognition testimony based on spectrograms of speech sounds. In parallel with the aural and visual methods, automated methods of speaker recognition have continued to be developed, and are consequently yielding information strengthening the accuracy of the former methods. The automated methods have recently made remarkable progress, partly owing to the influential advances in computer and pattern recognition technologies. Due to its ever-increasing importance, this chapter will focus exclusively on automatic speaker recognition technology. The actual realization of speaker recognition systems makes use of voice as the key tool for verifying the identity of a speaker for application to an extensive array of customer-demand services. In the near future, these services will include banking transactions and shopping using the telephone network as well as the Internet, voice mail, database acquisition services including personal information access, reservation services, remote access to computers, and security control for protecting confidential areas of concern. Importantly, identity verification using voice is far more convenient than using cards, keys, or other artificial means for identification, and is much safer because a voice can neither be lost nor stolen. In addition, voice recognition does not require the use of hands. Accordingly, several systems are currently being planned for future applications in the rapidly accelerating information-intensive age into which we are entering. Under such circumstances, field trials combining speaker recognition with telephone cards and credit cards (ATM) are already under way.
Another important application of speaker recognition is its use for forensic purposes (Kunzel, 1994). The principal disadvantage of using voice is that its physical characteristics are variable and easily modified by transmission and
microphone characteristics as well as by background noise. If a system is capable of accepting wide variation in the customer's voice, for example, it might also unfortunately accept the voice of a different speaker if sufficiently similar. It is thus absolutely essential to use physical features which are stable and not easily mimicked or affected by transmission characteristics.

9.1.2 Individual Characteristics
Individual information includes voice quality, voice height, loudness, speed, tempo, intonation, accent, and the use of vocabulary. Various physical features interacting in a complicated manner produce these voice characteristics. They arise both from hereditary individual differences in articulatory organs, such as the length of the vocal tract and the vocal cord characteristics, and from acquired differences in the manner of speaking. Voice quality and height, which are the most important types of individual auditory information, are mainly related to the static and temporal characteristics of the spectral envelope and fundamental frequency (pitch).

The temporal characteristics, that is, time functions of the spectral envelope, fundamental frequency, and energy, can be used for speaker recognition in a way similar to those used for speech recognition. However, several considerations and processes designed to emphasize stable individual characteristics are necessary in order to achieve high-performance speaker recognition. The statistical characteristics derived from the time functions of spectral features are also successfully used in speaker recognition. The use of statistical characteristics specifically reduces the dimensions of templates, and consequently cuts down the run-time computation as well as the memory size of reference templates. Similar recognition results on 40-frame words have been obtained either with standard DTW template matching or with a single distance measure involving a 20-dimensional vector employing the statistical features of fundamental frequency and LPC parameters (Furui, 1981a).
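The reduction of a variable-length time function of spectral features to a fixed-length statistical template can be sketched as follows. This is a minimal illustration, not the exact procedure of Furui (1981a); the function name and the choice of per-coefficient mean and standard deviation as the statistics are our assumptions:

```python
import numpy as np

def statistical_features(cepstra):
    """Collapse a (frames x coefficients) cepstral time function into a
    fixed-length vector of per-coefficient means and standard deviations."""
    c = np.asarray(cepstra, dtype=float)
    return np.concatenate([c.mean(axis=0), c.std(axis=0)])

# Utterances of different lengths map to vectors of the same dimension,
# so a single distance computation replaces frame-by-frame matching.
short_word = np.random.default_rng(0).normal(size=(40, 10))   # 40 frames
long_word = np.random.default_rng(1).normal(size=(90, 10))    # 90 frames
a, b = statistical_features(short_word), statistical_features(long_word)
distance = float(np.linalg.norm(a - b))
```

This is the sense in which statistical features "cut down the run-time computation": the DTW alignment over all frames is replaced by a single distance between two fixed-dimension vectors.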
Since speaker recognition systems using temporal patterns of source characteristics only, such as pitch and energy, are not resistant to mimicked voices, they should desirably be combined with vocal tract characteristics, namely, with spectral envelope parameters, to build more robust systems (Rosenberg and Sambur, 1975).
9.2 SPEAKER RECOGNITION METHODS

9.2.1 Classification of Speaker Recognition Methods
Speaker recognition can be principally divided into speaker verification and speaker identification. Speaker verification is the process of accepting or rejecting the identity claim of a speaker by comparing a set of measurements of the speaker's utterances with a reference set of measurements of the utterances of the person whose identity is being claimed. Speaker identification is the process of determining from which of the registered speakers a given utterance comes. The speaker identification process is similar to the spoken word recognition process in that both determine which reference template is most similar to the input speech.

Speaker verification is applicable to various kinds of services which include the use of voice as the key to confirming the identity claim of a speaker. Speaker identification is used in criminal investigations, for example, to determine which of the suspects produced the voice recorded at the scene of the crime. Since the possibility always exists that the actual criminal is not one of the suspects, however, the identification decision must be made through the combined processes of speaker verification and speaker identification.

Speaker recognition methods can also be divided into text-dependent and text-independent methods. The former require the speaker to issue a predetermined utterance, whereas the latter do not rely on a specific text being spoken. In general, because of the
Speaker Recognition
353
higher acoustic-phonetic variability of text-independent input, more training material is necessary to reliably characterize (model) a speaker than with text-dependent methods.

Although several text-dependent methods use features of special phonemes, such as nasals, most text-dependent systems allow words (key words, names, ID numbers, etc.) or sentences to be arbitrarily selected for each speaker. In the latter case, the differences in words or sentences between the speakers improve the accuracy of speaker recognition. When evaluating experimental systems, however, common key words or sentences are usually used for every speaker. Although key words can be fixed for each speaker in many applications of speaker verification, utterances of the same words cannot always be compared in criminal investigations. In such cases, a text-independent method is essential.

Difficulty in speaker recognition varies depending on whether or not the speakers intend for their identities to be verified. During speaker verification use, speakers are usually expected to cooperate without intentionally changing their speaking rate or manner. It is well known, however, and natural from their point of view, that speakers are most often uncooperative in criminal investigations, consequently compounding the difficulty in correctly recognizing their voices.

Both text-dependent and text-independent methods have one serious weakness. That is, these systems can be easily beaten, because anyone who plays back the recorded voice of a registered speaker uttering key words or sentences into the microphone can be accepted as the registered speaker. To contend with this problem, some methods employ a small set of words, such as digits, as key words, and each user is prompted to utter a given sequence of key words that is randomly chosen each time the system is used (Higgins et al., 1991; Rosenberg et al., 1991).
Yet even this method is not sufficiently reliable, since it can be beaten with advanced electronic recording equipment that can readily reproduce key words in any requested order. Therefore, to counter this problem, a text-prompted speaker recognition method has recently been proposed (see Subsection 9.3.3).
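The randomized key-word scheme described above is straightforward to sketch. The digit vocabulary and the sequence length here are illustrative assumptions, not the configuration of any cited system:

```python
import random

DIGITS = ["zero", "one", "two", "three", "four",
          "five", "six", "seven", "eight", "nine"]

def make_prompt(length=4, rng=None):
    """Draw a fresh random digit sequence for each verification attempt.
    With 10 digits and length 4 there are 10**4 possible prompts, so a
    fixed pre-recorded utterance rarely matches the prompted text."""
    rng = rng or random.Random()
    return [rng.choice(DIGITS) for _ in range(length)]

prompt = make_prompt(rng=random.Random(42))
```

The verifier then checks both that the utterance matches the prompted word sequence and that the voice matches the claimed speaker's models.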
9.2.2 Structure of Speaker Recognition Systems
The common structure of speaker recognition systems is shown in Fig. 9.1. Feature parameters extracted from a speech wave are compared with the stored reference templates or models for each registered speaker. The recognition decision is made according to the distance (or similarity) values. For speaker verification, input utterances with distances to the reference template smaller than the threshold are accepted as being utterances of the registered speaker (customer), while input utterances with distances larger than the threshold are rejected as being those of a different speaker (impostor). With speaker identification, the registered speaker whose reference template is nearest to the input utterance among all of the registered speakers is selected as being the speaker of the input utterance.

The receiver operating characteristic (ROC) curve adopted from psychophysics is used for evaluating speaker verification systems. In speaker verification, two conditions concern the input utterance: s, the condition that the utterance belongs to the customer, and n, the opposite condition. Two decision conditions also exist: S, the condition that the utterance is accepted as being that of the customer, and N, the condition that the utterance is rejected. These conditions combine to make up the four conditional probabilities as designated in Table 9.1. Specifically, P(S|s) is the probability of correct acceptance; P(S|n) is the probability of false acceptance (FA), namely, the probability of accepting impostors; P(N|s) is the probability of false rejection (FR), or the probability of mistakenly rejecting the real customer; and P(N|n) is the probability of correct rejection. Since the relationships

P(S|s) + P(N|s) = 1

and

P(S|n) + P(N|n) = 1
TABLE 9.1 Four Conditional Probabilities in Speaker Verification

                        Input utterance condition
  Decision condition    s (customer)    n (impostor)
  S (accept)            P(S|s)          P(S|n)
  N (reject)            P(N|s)          P(N|n)
exist for the four probabilities, speaker verification systems can be evaluated using the two probabilities P(S|s) and P(S|n). If these two values are assigned to the vertical and horizontal axes respectively, and if the decision criterion (threshold) for accepting the speech as being that of the customer is varied, ROC curves as indicated in Fig. 9.2 are obtained. The figure exemplifies the curves for three systems: A, B, and D. Clearly, the performance of curve B is consistently superior to that of curve A, and D corresponds to the limiting case of purely chance performance.

On the other hand, the relationship between the decision criterion and the two kinds of errors is presented in Fig. 9.3. Position a in Figs. 9.2 and 9.3 corresponds to the case in which a strict decision criterion is employed, and position b corresponds to that wherein a lax criterion is used. To set the threshold at the desired level of customer rejection and impostor acceptance, it is necessary to know the distribution of customer and impostor scores as baseline data. The decision criterion in practical applications should be determined according to the effects of decision errors. This criterion can be determined based on a priori probabilities of a match, P(s), on the cost values of the various decision results, and on the slope of the ROC curve. In experimental tests, the criterion is usually set a posteriori for each individual speaker in order to match up the two kinds of error rates, FR and FA, as indicated by c in Fig. 9.3.
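The trade-off of Fig. 9.3 can be reproduced directly from samples of customer and impostor scores. The sketch below is illustrative (function names are ours; scores are assumed to be similarities, so a larger value is more customer-like); it sweeps the threshold and locates the equal-error criterion c:

```python
import numpy as np

def fa_fr_curve(customer_scores, impostor_scores, thresholds):
    """For each threshold t (accept iff score >= t):
    FA = P(S|n), the fraction of impostor scores accepted;
    FR = P(N|s), the fraction of customer scores rejected."""
    cs = np.asarray(customer_scores, dtype=float)
    im = np.asarray(impostor_scores, dtype=float)
    fa = np.array([(im >= t).mean() for t in thresholds])
    fr = np.array([(cs < t).mean() for t in thresholds])
    return fa, fr

def equal_error_rate(customer_scores, impostor_scores):
    """Find the threshold where FA and FR are (approximately) equal,
    i.e., criterion c in Fig. 9.3, and return the error rate there."""
    ts = np.sort(np.concatenate([np.asarray(customer_scores, dtype=float),
                                 np.asarray(impostor_scores, dtype=float)]))
    fa, fr = fa_fr_curve(customer_scores, impostor_scores, ts)
    i = int(np.argmin(np.abs(fa - fr)))
    return (fa[i] + fr[i]) / 2.0
```

Raising the threshold moves the operating point from b toward a: FA falls while FR rises, tracing out one point of the ROC curve per threshold.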
FIG. 9.2 Receiver operating characteristic (ROC) curves; performance examples of three speaker verification systems: A, B, and D.
FIG. 9.3 Relationship between error rate and decision criterion (threshold) in speaker verification.
9.2.3 Relationship Between Error Rate and Number of Speakers
Let us assume that Z_N represents a population of N registered speakers, that X = (x_1, x_2, ..., x_n) is an n-dimensional feature vector representing the speech sample, and that P_i(X) is the probability density function of X for speaker i (i ∈ Z_N). The chance probability density function of X within population Z_N can then be expressed as

P_Z(X) = Σ_{i ∈ Z_N} Pr[i] P_i(X),
where Pr[i] is the a priori chance probability of speaker i (Doddington, 1974).

In the case of speaker verification, the region of X which should be accepted as the voice of customer i is

{X : P_i(X) / P_Z(X) ≥ C_i},
where C_i is chosen to effect the desired balance between FA and FR errors. With Z_N constructed using randomly selected speakers, and with the a priori probability independent of the speaker, Pr[i] = 1/N, P_Z(X) will approach a limiting density function independent of Z_N as N becomes large. Thus, Pr(FA) and Pr(FR) are relatively unaffected by the size of the population, N, when it is large. From a practical perspective, P_Z(X) is assumed to be constant, since it is generally difficult to estimate this value precisely, and

{X : P_i(X) ≥ C_i}

is simply used as the acceptance region.
With speaker identification, the region R_i of X which should be judged as the voice of speaker i is

R_i = {X : P_i(X) > P_j(X) for all j ≠ i}.
The probability of error for speaker i then becomes

Pr(error | i) = 1 − ∫_{R_i} P_i(X) dX.
With Z_N constructed by randomly selected speakers, the equations

E[Pr(correct | i)] ≈ (P_ai)^{N−1},   E[Pr(error | i)] ≈ 1 − (P_ai)^{N−1}
can be obtained, where P_ai is the expected probability of not confusing speaker i with another speaker. Thus, the expected probability of correctly identifying a speaker decreases exponentially with the size of the population. This is a natural outcome of the fact that the distribution of infinite points cannot be separated in a finite parameter space. More specifically, when the population of speakers increases, the probability that the distributions of two or more speakers are very close increases. Therefore, the effectiveness of speaker identification systems must be evaluated according to their limits in population size.

Figure 9.4 indicates this relationship between the size of the population and recognition error rates for speaker identification and verification (Furui, 1978). These results were obtained for a recognition system employing the statistical features of the spectral parameters derived from spoken words.
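The degradation with population size can be observed even in a toy Monte-Carlo simulation. The sketch below is entirely synthetic (the Gaussian speaker model, dimensionality, and spreads are our assumptions, not the experimental conditions behind Fig. 9.4):

```python
import numpy as np

def identification_error_rate(num_speakers, dim=5, trials=2000, seed=0):
    """Each speaker is a Gaussian cloud in feature space; a test vector is
    identified as the speaker with the nearest mean. As the population grows,
    the clouds crowd together and the error rate rises."""
    rng = np.random.default_rng(seed)
    means = 2.0 * rng.normal(size=(num_speakers, dim))  # inter-speaker spread
    errors = 0
    for _ in range(trials):
        i = int(rng.integers(num_speakers))
        x = means[i] + rng.normal(size=dim)             # intra-speaker variation
        if int(np.argmin(np.linalg.norm(means - x, axis=1))) != i:
            errors += 1
    return errors / trials
```

Evaluating this for increasing population sizes reproduces the qualitative trend of Fig. 9.4: identification error grows steadily with the number of registered speakers, whereas a per-speaker verification threshold test is largely unaffected.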
FIG. 9.4 Recognition error rates as a function of population size in speaker identification and verification.
9.2.4 Intra-Speaker Variation and Evaluation of Feature Parameters
One of the most difficult problems in speaker recognition is the intra-speaker variation of feature parameters. The most significant factor affecting speaker recognition performance is variation in feature parameters from trial to trial (intersession variability, or variability over time). Variations arise from the speaker him/herself, from differences in recording and transmission conditions, and from noise. Speakers cannot repeat an utterance precisely the same way from trial to trial. It is well known that tokens of the same utterance recorded in one session correlate much more highly than tokens recorded in separate sessions.
It is important for speaker recognition systems to accommodate these variations, since they affect recognition accuracy more significantly than in the case of speech recognition for two major reasons. First, the reference template for each speaker, which is constructed using training utterances prior to the recognition, is repeatedly used later. Second, individual information in a speech wave is more detailed than phonetic information; that is, the interspeaker variation of physical parameters is much smaller than the interphoneme variation.

A number of methods have been confirmed to be effective in reducing the effects of long-term variation in feature parameters and in obtaining good recognition performance after a long interval (Furui, 1981a). These include:

1. The application of spectral equalization, i.e., the passing of the speech signal through a first- or second-order critical damping inverse filter which represents the overall pattern of the time-averaged spectrum for a word or short sentence of speech. An effect similar to the spectral equalization can be achieved by cepstral mean subtraction (CMS) or cepstral mean normalization (CMN) (Atal, 1974; Furui, 1981b).
2. The selection of stable feature parameters based on statistical evaluation using speech utterances recorded over a long period.
3. The combination of feature parameters extracted from a variety of different words.
4. The construction of reference templates (models) and distance measures based on training utterances recorded over a long period.
5. The renewal of the reference template for each customer at the appropriate time interval.
6. Adaptation of the reference templates (models) as well as the verification threshold for each speaker.

The effectiveness of the spectral equalization process, the so-called 'blind equalization' method, was examined by means of speaker recognition experiments using statistical features extracted
from a spoken word. Results with and without spectral equalization were compared for both short-term and long-term training. The short-term training set comprised utterances recorded over a period of 10 days in three or four sessions at intervals of 2 or 3 days. The long-term training set consisted of utterances recorded over a 10-month period in four sessions at intervals of 3 months. The time interval between the last training utterance and the input utterance ranged from two or three days to five years.

The speaker verification results obtained are exemplified in Fig. 9.5. Although these results clearly confirm that spectral equalization is effective in reducing errors as a function of the time
FIG. 9.5 Results of speaker verification using statistical features extracted from a spoken word, with or without spectral equalization, for short-term and long-term training.
interval for both short-term and long-term training, it is especially effective with short-term training. Concerning the speech production mechanism, the effectiveness of spectral equalization means that vocal tract characteristics are much more stable than the overall patterns of the vocal cord spectrum.

In the CMS (CMN) method, cepstral coefficients are averaged over the duration of an entire utterance, and the averaged values are subtracted from the cepstral coefficients of each frame. This method can compensate fairly well for additive variation in the log spectral domain. However, it unavoidably eliminates some text-dependent and speaker-specific features, so it is especially effective for text-dependent speaker recognition applications using sufficiently long utterances but is inappropriate for short utterances. It was shown that time derivatives of cepstral coefficients (delta-cepstral coefficients) are resistant to linear channel mismatch between training and testing (Furui, 1981b; Soong and Rosenberg, 1986).

In addition to the normalization methods in the parameter domain, those in the distance/similarity domain using the likelihood ratio or a posteriori probability have also been actively investigated (see Subsection 9.2.5). To adapt HMMs for noisy conditions, the HMM composition (PMC; parallel model combination) method has been successfully employed.

In selecting the most effective feature parameters, the following four parameter evaluation methods can be used:

1. Performing recognition experiments based on various combinations of parameters;
2. Measuring the F-ratio (inter- to intra-variance ratio) for each parameter (Furui, 1978);
3. Calculating the divergence, which is an expansion of the F-ratio into a multidimensional space (Atal, 1972); and
4. Using the knockout method based on recognition error rates (Sambur, 1975).

In order to effectively reduce the amount of information, that is, the number of parameters, feature parameter sets are sometimes
_ " " " " I
""
.*""""UL_u
" " " " " I "
364
Chapter 9
projected into a space constructed by discriminant analysis which maximizes the F-ratio.
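Both CMS and the F-ratio are simple to state in code. The sketch below uses our own function names and a per-coefficient, diagonal treatment; it shows that CMS exactly cancels a fixed channel offset in the cepstral domain, and computes the inter- to intra-speaker variance ratio for one scalar feature:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """CMS/CMN: subtract the utterance-level mean from every frame. A fixed
    transmission channel adds a constant vector to each frame's cepstrum,
    so subtracting the utterance mean removes it entirely."""
    c = np.asarray(cepstra, dtype=float)
    return c - c.mean(axis=0, keepdims=True)

def f_ratio(features_by_speaker):
    """Inter- to intra-speaker variance ratio of one scalar feature;
    features_by_speaker is a list of 1-D arrays, one per speaker."""
    means = np.array([np.mean(f) for f in features_by_speaker])
    inter = np.var(means)                                      # between speakers
    intra = np.mean([np.var(f) for f in features_by_speaker])  # within speakers
    return float(inter / intra)
```

The channel invariance of CMS follows directly: `cepstral_mean_subtraction(X + offset)` equals `cepstral_mean_subtraction(X)` for any constant offset vector, which is why a microphone or line change between sessions is largely compensated.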
9.2.5 Likelihood (Distance) Normalization
To contend with the intra-speaker feature parameter variation problems, Higgins et al. (1991) proposed a normalization method for distance (similarity or likelihood) values that uses the likelihood ratio:

log L(X) = log p(X | S) − log p(X | S̄),   (9.8)

where S denotes the claimed speaker and S̄ denotes any other speaker (an impostor). The likelihood ratio is the ratio of the conditional probability of the observed measurements of the utterance given that the claimed identity is correct to the conditional probability of the observed measurements given that the speaker is an impostor. Generally, a positive value of log L indicates a valid claim, whereas a negative value indicates an impostor. The second term on the right-hand side of Eq. (9.8) is called the normalization term.

The density at point X for all speakers other than the true speaker S can be dominated by the density for the nearest reference speaker, if we assume that the set of reference speakers is representative of all speakers. This means that the likelihood ratio normalization approximates the optimal scoring in Bayes' sense. This normalization method is unrealistic, however, because even if only the nearest reference speaker is used, conditional probabilities must be calculated for all of the reference speakers, which increases the computational cost. Therefore, a set of speakers, known as 'cohort speakers,' has been chosen for calculating the normalization term of Eq. (9.8). Higgins et al. proposed using speakers that are representative of the population near the claimed speaker.

An experiment in which the size of the cohort speaker set was varied from 1 to 5 showed that speaker verification performance increases as a function of the cohort size, and that the use of normalization significantly compensates for the degradation
obtained by comparing verification utterances recorded using an electretmicrophonewithmodelsconstructedfromtraining utterances recorded with a carbon button microphone(Rosenberg, 1992). MatsuiandFurui (1993, 1994b) proposedanormalization method based on a posteriori probability:
The difference between the normalization method based on the likelihood ratio and that based on a posteriori probability is whether or not the claimed speaker is included in the impostor speaker set for normalization; the cohort speaker set in the likelihood-ratio-based method does not include the claimed speaker, whereas the normalization term for the a posteriori-probability-based method is calculated by using a set of speakers including the claimed speaker. Experimental results indicate that both normalization methods almost equally improve speaker separability and reduce the need for speaker-dependent or text-dependent thresholding, compared with scoring using only the model of the claimed speaker (Matsui and Furui, 1994b; Rosenberg, 1992).

The normalization method using cohort speakers that are representative of the population near the claimed speaker is expected to increase the selectivity of the algorithm against voices similar to the claimed speaker's. However, this method is seriously problematic in that it is vulnerable to illegal access by impostors of the opposite gender. Since the cohorts generally model only same-gender speakers, the probability of opposite-gender impostor speech is not well modeled, and the likelihood ratio is based on the tails of distributions, which gives rise to unreliable values. Another way of choosing the cohort speaker set is to use speakers who are typical of the general population. Reynolds (1994) reported that a randomly selected, gender-balanced background speaker population outperformed a population near the claimed speaker.
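A minimal sketch of such score normalization, assuming the recognizer supplies per-model log-likelihoods. The function name and the use of the log of the average cohort likelihood as the normalization term are our choices; the cited methods differ in how the term is computed and in whether the claimed speaker belongs to the set:

```python
import numpy as np

def normalized_score(log_p_claimed, log_p_background):
    """Subtract a normalization term -- the log of the average likelihood
    over a background/cohort speaker set -- from the claimed speaker's
    log-likelihood. Positive values favor the identity claim."""
    lp = np.asarray(log_p_background, dtype=float)
    m = lp.max()
    norm_term = m + np.log(np.mean(np.exp(lp - m)))  # stable log-mean-exp
    return float(log_p_claimed - norm_term)
```

Because the score is relative, a channel that uniformly lowers every model's log-likelihood by the same amount leaves the normalized score unchanged, which is the intuition behind its robustness to microphone mismatch.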
Carey et al. (1992) proposed a method in which the normalization term is approximated by the likelihood for a world model representing the population in general. This method has the advantage that the computational cost for calculating the normalization term is much smaller than in the original method, since it does not need to sum the likelihood values for cohort speakers. Matsui and Furui (1994b) proposed a method based on tied-mixture HMMs in which the world model is formulated as a pooled mixture model representing the parameter distribution for all of the registered speakers. This model is created by averaging together the mixture-weighting factors of each reference speaker calculated using speaker-independent mixture distributions. Therefore the pooled model can easily be updated when a new speaker is added as a reference speaker. In addition, this method has been confirmed to give much better results than either of the original normalization methods.

Since these normalization methods do not take into account the absolute deviation between the claimed speaker's model and the input speech, they cannot differentiate highly dissimilar speakers. Higgins et al. (1991) reported that a multilayer network decision algorithm makes effective use of the relative and absolute scores obtained from the matching algorithm.
9.3 EXAMPLES OF SPEAKER RECOGNITION SYSTEMS
9.3.1 Text-Dependent Speaker Recognition Systems
Large-scale experiments have been performed for some time for text-dependent speaker recognition, which is more realistic than text-independent speaker recognition (Furui, 1981b; Zheng and Yuan, 1988; Naik et al., 1989; Rosenberg et al., 1991). They include experiments on a speaker verification system for telephone speech which was tested at Bell Laboratories using roughly 100 male and female speakers (Furui, 1981b). Figure 9.6 is a block diagram of the principal method. With this method, not only is the time series
FIG. 9.6 Block diagram indicating principal operation of the speaker recognition method using time series of cepstral coefficients and their orthogonal polynomial coefficients. (Processing blocks: speech wave → LPC cepstrum → long-time average → normalization by average cepstrum → expansion by polynomial function → feature selection → decision → speaker identity.)
brought into time registration with the stored reference functions, but a set of dynamic features (see Subsection 8.3.6) is also explicitly extracted and used for the recognition.

Initially, 10 LPC cepstral coefficients are extracted every 10 ms from a short speech sentence. These cepstral coefficients are then averaged over the duration of the entire utterance. The averaged values are next subtracted from the cepstral coefficients of every frame (the CMS method) to compensate for the frequency-response distortion introduced by the transmission system and to reduce long-term intra-speaker spectral variability. Time functions for the cepstral coefficients are subsequently expanded by an orthogonal polynomial representation over 90-ms intervals which are shifted every 10 ms. The first- and second-order polynomial coefficients (delta and delta-delta cepstral coefficients) are thus obtained as the representations of dynamic characteristics. From the normalized cepstral and polynomial coefficients, a set of 18 elements is
selected which are the most effective in separating the speakers' overall distance distributions. The time function of the set is brought into time registration with the reference template in order to calculate the distance between them. The overall distance is then compared with a threshold for the verification decision. The threshold and reference template are updated every two weeks by using the distribution of interspeaker distances.

Experimental results indicate that a high degree of verification accuracy can be obtained even if the reference and input utterances are transmitted over different telephone systems, such as those using ADPCM coders and LPC vocoders. An online experiment performed over a period of six months, using dialed-up telephone speech uttered by 60 male and 60 female speakers, also supports the effectiveness of this system.

An HMM can efficiently model the statistical variation in spectral features. Therefore, HMM-based methods can achieve significantly better recognition accuracies than DTW-based methods if enough training utterances for each speaker are available.
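For the first-order term, the polynomial expansion of cepstral time functions reduces to the familiar regression ("delta") formula. A sketch, assuming a 10-ms frame shift so that `window=4` spans roughly the 90-ms interval mentioned above; edge frames are handled by repetition, one common convention (the function name is ours):

```python
import numpy as np

def delta_coefficients(cepstra, window=4):
    """First-order orthogonal-polynomial (regression) coefficient of each
    cepstral time function:
        delta[t] = sum_k k * (c[t+k] - c[t-k]) / (2 * sum_k k^2)."""
    c = np.asarray(cepstra, dtype=float)
    padded = np.pad(c, ((window, window), (0, 0)), mode="edge")
    denom = 2.0 * sum(k * k for k in range(1, window + 1))
    delta = np.zeros_like(c)
    for k in range(1, window + 1):
        delta += k * (padded[window + k: window + k + len(c)]
                      - padded[window - k: window - k + len(c)])
    return delta / denom
```

The slope is measured over a window rather than between adjacent frames, which is what makes the delta cepstrum robust to frame-level noise and, being a difference, invariant to the constant channel offset that CMS also removes.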
9.3.2 Text-Independent Speaker Recognition Systems
In text-independent speaker recognition, the words or sentences used in recognition trials generally cannot be predicted. Since it is impossible to model or match speech events at the word or sentence level, the following three kinds of methods, shown in Fig. 9.7, have been actively investigated (Furui, 1986).

(a) Long-term-statistics-based methods

As text-independent features, long-term sample statistics of various spectral features, such as the mean and variance of spectral features over a series of utterances, have been used (Furui et al., 1972; Markel et al., 1977; Markel and Davis, 1979) (Fig. 9.7(a)). However, long-term spectral averages are extreme condensations of the spectral characteristics of a speaker's utterances and, as such, lack the discriminating power included
[Fig. 9.7 (schematic comparison of the three kinds of text-independent speaker recognition methods) is not reproduced here.]
in the sequences of short-term spectral features used as models in text-dependent methods. In one of the trials using the long-term averaged spectrum (Furui et al., 1972), the effect of session-to-session variability was reduced by introducing a weighted cepstral distance measure.

Studies on using statistical dynamic features have also been reported. Montacie et al. (1992) applied a multivariate autoregression (MAR) model to the time series of cepstral vectors to characterize speakers, and reported good speaker recognition results. Griffin et al. (1994) studied distance measures for the MAR-based method, and reported that when 10 sentences were used for training and one sentence was used for testing, identification and verification rates were almost the same as those obtained by an HMM-based method. It was also reported that the optimum order of the MAR model was 2 or 3, and that distance normalization using a posteriori probability was essential to obtain good results in speaker verification.

(b) VQ-based methods

A set of short-term training feature vectors of a speaker can be used directly to represent the essential characteristics of that speaker. However, such a direct representation is impractical when the number of training vectors is large, since the memory and amount of computation required become prohibitively large. Therefore, attempts have been made to find efficient ways of compressing the training data using vector quantization (VQ) techniques.

In this method (Fig. 9.7(b)), VQ codebooks, consisting of a small number of representative feature vectors, are used as an efficient means of characterizing speaker-specific features (Li and Wrench, 1983; Matsui and Furui, 1990, 1991; Rosenberg and Soong, 1987; Shikano, 1985; Soong et al., 1987). A speaker-specific codebook is generated by clustering the training feature vectors of each speaker.
In the recognition stage, an input utterance is vector-quantized using the codebook of each reference speaker; the VQ distortion accumulated over the entire input utterance is used for making the recognition determination.
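The VQ-based procedure can be sketched end to end, with plain k-means standing in for the LBG algorithm usually cited; the function names and the synthetic two-speaker data are ours:

```python
import numpy as np

def train_codebook(train_vectors, codebook_size=8, iters=20, seed=0):
    """Cluster a speaker's training vectors into a small codebook
    (plain k-means, a stand-in for the LBG algorithm)."""
    X = np.asarray(train_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    code = X[rng.choice(len(X), size=codebook_size, replace=False)]
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - code[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        for j in range(codebook_size):
            if np.any(assign == j):
                code[j] = X[assign == j].mean(axis=0)
    return code

def vq_distortion(test_vectors, codebook):
    """Average distance from each test frame to its nearest codeword; the
    reference speaker whose codebook yields the smallest accumulated
    distortion is selected."""
    X = np.asarray(test_vectors, dtype=float)
    d = np.linalg.norm(X[:, None, :] - codebook[None, :, :], axis=2)
    return float(d.min(axis=1).mean())

# Two synthetic "speakers" occupying different regions of feature space.
rng = np.random.default_rng(1)
speaker_a = rng.normal(loc=0.0, scale=0.5, size=(200, 6))
speaker_b = rng.normal(loc=3.0, scale=0.5, size=(200, 6))
book_a, book_b = train_codebook(speaker_a), train_codebook(speaker_b)
test_a = rng.normal(loc=0.0, scale=0.5, size=(50, 6))
```

Note that no temporal alignment is performed: each frame is scored independently against the codebook, which is exactly why the method is text-independent.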
(c) Ergodic-HMM-based methods

The basic structure is the same as in the VQ-based method (Fig. 9.7(b)), but in this method an ergodic HMM is used instead of a VQ codebook. Over a long timescale, the temporal variation in speech signal parameters is represented by stochastic Markovian transitions between states. Poritz (1982) proposed using a five-state ergodic HMM (i.e., one in which all possible transitions between states are allowed) to classify speech segments into one of the broad phonetic categories corresponding to the HMM states. A linear predictive HMM was adopted to characterize the output probability function. He characterized the automatically obtained categories as strong voicing, silence, nasal/liquid, stop burst/post silence, and frication. Tishby (1991) extended Poritz's work to the richer class of mixture autoregressive (AR) HMMs. In these models, the states are described as a linear combination (mixture) of AR sources.

It was shown that speaker recognition rates are strongly correlated with the total number of mixtures, irrespective of the number of states (Matsui and Furui, 1992). This means that the information on transitions between different states is ineffective for text-independent speaker recognition. The case of a single-state continuous ergodic HMM corresponds to the technique based on the maximum likelihood estimation of a Gaussian-mixture model representation investigated by Rose et al. (1990). Furthermore, the VQ-based method can be regarded as a special (degenerate) case of a single-state HMM with a distortion measure being used as the observation probability.

(d) Speech-recognition-based methods

The VQ- and HMM-based methods can be regarded as methods that use phoneme-class-dependent speaker characteristics in short-term spectral features through implicit phoneme-class recognition. In other words, phoneme classes and speakers are simultaneously recognized in these methods. On the other hand, in the speech-recognition-based methods (Fig.
9.7(c)), phonemes or phoneme classes are explicitly recognized, and then each phoneme(-class) segment in the input speech is compared with speaker models or templates corresponding to that phoneme (-class).
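The single-state continuous ergodic HMM, i.e., the Gaussian-mixture case noted in (c) above, reduces to scoring frames against a weighted sum of Gaussians. A sketch with diagonal covariances (the function name and parameterization are ours; training of the mixture itself, e.g., by EM, is omitted):

```python
import numpy as np

def gmm_log_likelihood(x_frames, weights, means, variances):
    """Per-utterance average log-likelihood under a diagonal-covariance
    Gaussian mixture. Each speaker has one such model; the speaker whose
    model scores highest is selected (identification) or the score is
    thresholded (verification)."""
    X = np.asarray(x_frames, dtype=float)    # (T, D) feature frames
    w = np.asarray(weights, dtype=float)     # (M,)  mixture weights
    mu = np.asarray(means, dtype=float)      # (M, D)
    var = np.asarray(variances, dtype=float) # (M, D) diagonal variances
    diff = X[:, None, :] - mu[None, :, :]    # (T, M, D)
    log_comp = (-0.5 * np.sum(diff ** 2 / var + np.log(2 * np.pi * var), axis=2)
                + np.log(w))                 # per-frame, per-component
    m = log_comp.max(axis=1, keepdims=True)  # log-sum-exp over components
    log_frame = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
    return float(log_frame.mean())
```

As with the VQ method, frames are scored independently, so no word- or sentence-level alignment is needed; the mixture components take over the role of the (implicitly recognized) phoneme classes.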
Savic et al. (1990) used a five-state ergodic linear predictive HMM for broad phonetic categorization. In their method, after frames that belong to particular phonetic categories have been identified, feature selection is performed. In the training phase, reference templates are generated and verification thresholds are computed for each phonetic category. In the verification phase, after phonetic categorization, a comparison with the reference template for each particular category provides a verification score for that category. The final verification score is a weighted linear combination of the scores for each category. The weights are chosen to reflect the effectiveness of particular categories of phonemes in discriminating between speakers and are adjusted to maximize the verification performance. Experimental results showed that verification accuracy can be considerably improved by this category-dependent weighted linear combination method. Broad phonetic categorization can also be implemented by a speaker-specific hierarchical classifier instead of by an HMM, and the effectiveness of this approach has also been confirmed (Eatock and Mason, 1990).

Rosenberg et al. have been testing a speaker verification system using 4-digit phrases under the field conditions of a banking application (Rosenberg et al., 1991; Setlur and Jacobs, 1995). In this system, input speech is segmented into individual digits using a speaker-independent HMM. The frames within the word boundaries for a digit are compared with the corresponding speaker-specific HMM digit model, and the Viterbi likelihood score is computed. This is done for each of the digits making up the input utterance. The verification score is defined to be the average normalized log-likelihood score over all the digits in the utterance.

Newman et al. (1996) used a large-vocabulary speech recognition system for speaker verification. A set of speaker-independent phoneme models was adapted to each speaker. The speaker verification consisted of two stages.
First,speakerindependent speech recognition was runon each of thetest utterances to obtain phoneme segmentation. In the second stage, the segments were scored againsttheadapted models fora
particular target speaker. The scores were normalized by those obtained with speaker-independent models. The system was evaluated using the 1995 NIST-administered speaker verification database, which consists of data taken from the Switchboard corpus. The results showed that this method could not outperform Gaussian mixture models.

9.3.3 Text-Prompted Speaker Recognition Systems

How can we prevent speaker verification systems from being defeated by a recorded voice? Another problem is that people often do not like text-dependent systems because they do not like to utter their identification number, such as their social security number, within the hearing of other people. To contend with these problems, a text-prompted speaker recognition method has been proposed. In this method, key sentences are completely changed every time (Matsui and Furui, 1993, 1994a). The system accepts the input utterance only when it determines that the registered speaker uttered the prompted sentence. Because the vocabulary is unlimited, prospective impostors cannot know in advance the sentence they will be prompted to say. This method can not only accurately recognize speakers but can also reject an utterance whose text differs from the prompted text, even if it is uttered by a registered speaker. Thus, a recorded and played-back voice can be correctly rejected.

This method uses speaker-specific phoneme models as basic acoustic units. One of the major issues in this method is how to properly create these speaker-specific phoneme models when using training utterances of a limited size. The phoneme models are represented by Gaussian-mixture continuous HMMs or tied-mixture HMMs, and they are made by adapting speaker-independent phoneme models to each speaker's voice. Since the text of the training utterances is known, these utterances can be modeled as the concatenation of phoneme models, and these models can be automatically adapted by an iterative algorithm.
In the recognition stage, the system concatenates the phoneme models of each registered speaker to create a sentence HMM, according to the prompted text. The likelihood of the input speech against the sentence model is then calculated and used for the speaker recognition determination. If the likelihood of both speaker and text is high enough, the speaker is accepted as the claimed speaker. Notably, experimental results gave a high speaker and text verification rate when the adaptation method for tied-mixture-based phoneme models and the likelihood normalization method described in Subsection 9.2.5 were used.
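The accept/reject decision based on a normalized likelihood can be sketched as follows. This is a simplified illustration of likelihood-ratio scoring, not the exact procedure of Subsection 9.2.5; the function name, scores, frame count, and threshold are all hypothetical:

```python
def verify(claimed_loglik, background_loglik, num_frames, threshold=0.5):
    """Likelihood-normalization decision: accept if the per-frame
    log-likelihood ratio between the claimed speaker's sentence HMM
    and a speaker-independent (background) model exceeds a threshold."""
    score = (claimed_loglik - background_loglik) / num_frames
    return score >= threshold

# Hypothetical scores for a 300-frame prompted sentence.
print(verify(-4200.0, -4500.0, 300))   # per-frame ratio 1.0 -> True
print(verify(-4480.0, -4500.0, 300))   # weak ratio -> False
```

Dividing by the frame count keeps the threshold independent of the utterance length, which is one motivation for normalization of this kind.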
10 Future Directions of Speech Information Processing
10.1 OVERVIEW
For the majority of humankind, speech understanding and production are involuntary processes quickly and effectively performed throughout our daily lives. A part of these human processes has already been synthetically reproduced, owing to the recent progress in speech signal processing, linguistic processing, computers, and LSI technologies. What we are actually capable of turning into practical, beneficial tools at present using these technologies, however, can be considered very restricted at best.

Figure 10.1 attempts to simplify to a certain degree the relationships between the various types of speech recognition, understanding, synthesis, and coding technologies. Several of these remain to be investigated. It is essential to ensure that speech information processing technologies play the ever-heightening, demand-stimulated role desired in facilitating the progress of the information-communications societies toward which we are aspiring. This can only be achieved by enhancing our synthetic speech technologies to the point where they approach as closely as
FIG. 10.1 Principal speech information processing technologies and their relationships.
possible our inherent human abilities. Importantly, this necessitates our competently solving the broadest possible range of interrelated problems in the near future.

In an effort to graphically clarify the relationships between the elements of engineering and human speech information processing mechanisms, Fig. 10.2 details the variations between speech information processing technologies and the scientific and technological areas serving as the foundational roots of speech research. As is evident in the figure, and as described elsewhere in this book, speech research is fundamentally and intrinsically supported by a wide range of sciences. The intensification of speech research continues to underscore an even greater interrelationship between scientific and technological interests.
FIG. 10.2 Speech information processing "tree," consisting of present and future speech information processing technologies supported by scientific and technological areas serving as the foundations of speech research.
Although individual aspects of speech information processing research have thus far been performed independently for the most
part, they will encounter increased interaction until commonly shared problems become simultaneously investigated and solved. Only then can we expect to witness tremendous speech research progress, and hence the fruition of widely applicable, beneficial techniques. Along these lines, this chapter summarizes what are considered to be the nine most important research topics, in particular those which intertwine a multiplicity of speech research areas. These topics must be rigorously pursued and investigated if we are to realize our information-communications societies fully incorporating enhanced speech technologies.
10.2 ANALYSIS AND DESCRIPTION OF DYNAMIC FEATURES
Psychological and physiological research into the human speech perception mechanisms overwhelmingly reports that the dynamic features of the speech spectrum and the speech wave over time intervals between 2 to 3 ms and 20 to 50 ms play crucially important roles in phoneme perception. This holds true not only for consonants such as plosives but also for vowels in continuous speech. On the other hand, almost all speech analysis methods developed thus far, including Fourier spectral analysis and LPC analysis, assume the stationarity of the signals. Only a few methods have been investigated for representing transitional or dynamic features. Although such representation constitutes one of the most difficult problems facing the speech researcher today, the discovery of a good method is expected to produce a substantial impact on the course of speech research.

Coarticulation phenomena have usually been studied as variations or modifications of spectra resulting from the influence of adjacent phonemes or syllables. It is considered essential, however, that these phenomena be examined from the viewpoint of phonemic information existing in the dynamic characteristics. Additionally important to these relatively 'microdynamic' phonemic information-related features are the relatively
'macrodynamic' features covering the interval between 200 to 300 ms and 2 to 3 s. The latter dynamic features bear prosodic features of speech such as intonation and stress. And although they seem to be easily extracted using the time functions of pitch and energy, they are actually extremely difficult to correctly extract from a speech wave by automatic methods. Even if they were to be effectively extracted, it is still very difficult to relate these features to the perceptual prosodic information. Therefore, even in the speech recognition area, in which prosodic features are expected to play a substantial role, only a few trials utilizing them have succeeded to any notable degree.

In speech synthesis, control rules for prosodic features largely affect the intelligibility and naturalness of the synthesized voice. Here also, although the significance of prosodic features is clearly as great as that of phonemic features, the perception and control mechanisms of prosodic features have not yet been clarified.
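One simple engineering representation of such short-term spectral dynamics, widely used in recognition front ends, is the delta (regression) coefficient: a least-squares slope of each feature trajectory over a window of a few frames, i.e., several tens of milliseconds. The sketch below is illustrative only; the window length and the toy trajectory are hypothetical choices:

```python
def delta(features, window=2):
    """Least-squares slope of each feature trajectory over
    +/- `window` frames (about 50 ms for a 10-ms frame shift)."""
    num_frames = len(features)
    dim = len(features[0])
    denom = 2 * sum(k * k for k in range(1, window + 1))
    deltas = []
    for t in range(num_frames):
        row = []
        for d in range(dim):
            num = 0.0
            for k in range(1, window + 1):
                right = features[min(t + k, num_frames - 1)][d]
                left = features[max(t - k, 0)][d]
                num += k * (right - left)
            row.append(num / denom)
        deltas.append(row)
    return deltas

# A rising 1-dimensional trajectory: interior slope is 0.1 per frame.
traj = [[0.1 * t] for t in range(10)]
print(delta(traj)[5])   # interior frames give approximately [0.1]
```

In practice these delta coefficients are simply appended to the static feature vector of each frame, giving the recognizer direct access to the transitional information the text describes.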
10.3 EXTRACTION AND NORMALIZATION OF VOICE INDIVIDUALITY
Although many kinds of speaker-independent speech recognizers have already been commercialized, a small fraction of people occasionally produce exceptionally low recognition rates with these systems. A similar phenomenon, which is called the 'sheep and goats phenomenon,' also occurs in speaker recognition.

The voice individuality problem in speech recognition has been handled to a certain extent through studies into automatic adaptation algorithms using a small number of training utterances and unsupervised adaptation algorithms. In the unsupervised algorithms, utterances for recognition are also used for training. These algorithms are currently capable of only restricted application, however, since the mechanism of producing voice individuality has not yet been sufficiently delineated. Accordingly, becoming increasingly more important will be research on speaker-independent speech recognition systems having an automatic
speaker adaptation mechanism based on unsupervised training algorithms requiring no additional training utterances.

Speaker adaptation or normalization algorithms in speech recognition as well as speaker recognition algorithms should be investigated using a common approach. This is because they are two sides of the same problem: how best to separate the speaker's information and the phonemic information in speech waves. This approach is essential to effectively formulate the unsupervised adaptation and text-independent speaker recognition algorithms.

In the speech synthesis area, several speech synthesizers have been commercialized in which the voice quality can be selected from male, female, and infant voices. No system has been constructed, however, that can precisely select or control the synthesized voice quality. Research into the mechanism underlying voice quality, inclusive of voice individuality, is thus necessary to ensure that a synthetic voice is capable of imitating a desired speaker's voice or to select any voice quality such as a hard or soft voice.

Even in speech coding (analysis-synthesis and waveform coding), the dependency of the coded speech quality on the individuality of the original speech increases with the advanced, high-compression-rate methods. Put another way, in these advanced methods, the perceptual speech quality degradation of coded speech clearly depends on the original voice. It is of obvious importance, then, to elucidate the mechanism of voice dependency and to develop a method which decreases this dependency.
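A very crude but common first step toward the kind of normalization discussed above is cepstral mean subtraction, which removes the long-term average of each cepstral coefficient and with it part of the speaker- and channel-dependent bias. This is only a minimal sketch (real unsupervised adaptation schemes are far more elaborate), and the frame values are hypothetical:

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-coefficient long-term mean from every frame,
    removing a fixed bias in the cepstral domain."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]

frames = [[1.0, 2.0], [3.0, 4.0]]
print(cepstral_mean_subtraction(frames))   # -> [[-1.0, -1.0], [1.0, 1.0]]
```

Because the same operation discards speaker-specific bias for recognition and isolates it for speaker characterization, it illustrates the "two sides of the same problem" view in the text.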
10.4 ADAPTATION TO ENVIRONMENTAL VARIATION
For speech recognition and speaker recognition systems to bring their capabilities into full play during actual speech situations, they must be able to minimize effectively, or hopefully eliminate, the influence of overlapped stationary noise as well as nonstationary noise such as other speakers' voices. Present speech recognition systems have gone a long way toward resolving these problems by using a close-talking microphone and by instituting training
(adaptation) for each speaker's voice under the same noise-characteristic environment. The environment naturally tends to vary, however, and the transmission characteristics of telephone sets and transmission lines also are not precisely controllable. This situation is becoming even more difficult because of the wide use of both cellular and cordless phones. Research is therefore necessary to ascertain mechanisms that will facilitate automatic adaptation to these variations. Also important for practical use is the development of a method capable of accurately recognizing a voice picked up by a microphone placed at a distance from the speaker.

Since the intersession (temporal) variability of the physical properties of an individual voice decreases the recognition accuracy of speaker recognition, a set of feature parameters must be extracted that remain stable over long periods, even if, for example, the speaker should be suffering from a cold or bronchial congestion. Furthermore, these parameters must be set up in such a way that they are extremely difficult to imitate.
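As one concrete technique for coping with stationary additive noise, spectral subtraction estimates the noise power spectrum from non-speech frames and subtracts it from each frame of the noisy speech spectrum. The sketch below is a minimal illustration (not from this book); the spectra and the flooring constant are hypothetical:

```python
def spectral_subtraction(noisy_power, noise_power, floor=0.01):
    """Subtract an estimated noise power spectrum from each frame's
    power spectrum, flooring the result to avoid negative power."""
    cleaned = []
    for frame in noisy_power:
        cleaned.append([max(p - n, floor * p)
                        for p, n in zip(frame, noise_power)])
    return cleaned

# One hypothetical 4-bin frame and a noise estimate.
noisy = [[10.0, 5.0, 2.0, 1.0]]
noise = [1.0, 1.0, 1.0, 1.5]
print(spectral_subtraction(noisy, noise))   # -> [[9.0, 4.0, 1.0, 0.01]]
```

The flooring step is what distinguishes practical implementations from the naive difference: bins where the noise estimate exceeds the observation are clipped rather than allowed to go negative.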
10.5 BASIC UNITS FOR SPEECH PROCESSING

Recognizing continuous speech featuring an extensive vocabulary necessitates the exploration of a recognition algorithm that utilizes basic units smaller than words. This establishment of basic speech units represents one of the principal research foci fundamental not only to speech recognition but also to speaker recognition, text-to-speech conversion, and very-low-bit-rate speech coding. These basic speech units considered intrinsic to speech information processing should be studied from several perspectives:

1. Linguistic units (e.g., phonemes and syllables),
2. Articulatory units (e.g., positions and moving targets for the jaw and tongue),
3. Perceptual units (e.g., distinctive features, and targets and loci of formant movement),
4. Visual units (features used in spectrogram reading), and
5. Physical units (e.g., centroids in VQ and MQ).

These units do not necessarily correspond. Furthermore, although conventional units have usually been produced from the linguistic point of view, future units will be established based on the combination of physical and linguistic units. This establishment will take the visual, articulatory, and perceptual viewpoints into consideration.
10.6 ADVANCED KNOWLEDGE PROCESSING
One of the critical problems in speech understanding and text-to-speech conversion is how best to utilize and efficiently combine various kinds of knowledge sources, including our common sense concerning language usage. There is ample evidence that human speech understanding involves the integration of a great variety of knowledge sources, including knowledge of the world or context, knowledge of the speaker and/or topic, lexical frequency, previous uses of a word or a semantically related topic, facial expressions (in face-to-face communication), and prosody, as well as the acoustic attributes of the words. Our future systems could do much better by integrating these knowledge sources.

The technological realization of these processes encompasses the use of the merits of artificial intelligence, particularly knowledge engineering systems, which provide methods capable of representing knowledge sources, including syntax and semantics, parallel and distributed processing methods for managing the knowledge sources, and tree search methods. A key factor in actualizing high-performance speech understanding concerns the most potent way to combine the obscurely quantified acoustical information with the different types of symbolized knowledge sources.

The use of statistical language modeling is especially convenient in the linguistic processing stage in speech understanding. The methods produced from the results garnered from natural language
processing research, such as phrase-structured grammar and case-structured grammar, are not always useful in speech processing, however, since there is a vast difference between written and spoken language. An entirely new linguistic science must therefore be invented for speech processing based on the presently available technologies for natural language processing. Clearly, this novel science must also take the specific characteristics of conversational speech into consideration.
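The statistical language modeling mentioned above is most often realized with n-gram models, which estimate the probability of a word from its immediate predecessors. The bigram sketch below is a toy illustration on a hypothetical three-sentence corpus, with no smoothing (a real model would need it for unseen histories):

```python
from collections import Counter

def train_bigram(corpus):
    """Estimate bigram probabilities P(w2 | w1) from raw counts."""
    unigrams, bigrams = Counter(), Counter()
    for sentence in corpus:
        words = ["<s>"] + sentence.split()
        for w1, w2 in zip(words, words[1:]):
            unigrams[w1] += 1
            bigrams[(w1, w2)] += 1
    # No smoothing: unseen history words would raise ZeroDivisionError.
    return lambda w1, w2: bigrams[(w1, w2)] / unigrams[w1]

prob = train_bigram(["please call home", "please call john", "call me"])
print(prob("please", "call"))   # -> 1.0
print(prob("call", "home"))     # -> 0.3333... (1 of 3 continuations)
```

Even this trivial model captures a distributional regularity of spoken commands that a hand-written grammar would have to encode explicitly, which is why such models are "especially convenient" in the linguistic processing stage.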
10.7 CLARIFICATION OF SPEECH PRODUCTION MECHANISM
A careful look into the dynamics of the diverse articulatory organs functioning during speech production, coupled with trials for elucidating the relationship between the articulatory mechanism and the acoustic characteristics of speech waves, exhibits considerable potential for producing key ideas fundamental to developing the new speech information processing technologies needed.

Recent investigation has shown that the actual sound source of speech production is neither a simple pulse train nor white noise, nor is it necessarily linearly separable from the vocal tract articulatory filter. This finding runs contrary to the production model now widely used. Furthermore, it is quite possible that simplification of the model is one of the primary factors causing the degradation of synthesized voice. Therefore, development of a new sound source model that precisely represents the actual source characteristics, as well as research on the mutual interaction between the sound source and the articulatory filter, would seem to be necessary to enhance the progress of speech synthesis.

Well-suited formulation of the rules governing movement of the articulatory organs holds the promise of producing a clear representation of the coarticulation phenomena, which are very difficult to properly delineate at the acoustic level. Consequently, a dynamic model of coarticulation is in the process of being established based on these rules. This research is also expected to lead to a solution of the problem of not being able to clearly discern
voice individuality and to produce techniques for segmenting the acoustic feature sequence into basic speech units.

The actual direction this research is assuming is divided into a physiological approach and an engineering approach. The former approach involves the direct observation of the speech production processes. For example, vocal cord movement is observed using a fiberscope, an ultrasonic pulse method, or an optoelectronic method. On the other hand, articulatory movement in the vocal tract can be observed by a scanning-type x-ray microbeam device, ultrasonic tomography, dynamic palatography, electromyography (EMG), or an electromagnetic articulograph (EMA) system. Although each of these methods has its own specially applicable features, none of them is capable of precisely observing the dynamics of the vocal organs. Accordingly, there will be a continuous need to improve on such devices and observation methods.

The engineering approach concerns the estimation of source and vocal tract information from the acoustic features based on speech production models. This approach, founded on the results of the physiological approach, is expected to produce key ideas for developing new speech processing technologies.
10.8 CLARIFICATION OF SPEECH PERCEPTION MECHANISM

As is well known, a mutual relationship exists between the speech production and speech perception mechanisms. Psychological and physiological research into human speech perception is anticipated to give rise to new principles for guiding more broad-ranging progress in speech information processing. Although observation and modeling of the movement of vocal systems, along with the physiological modeling of auditory peripheral systems, have recently made great progress, the mechanism of speech information processing in our own brain has hardly been investigated.

As described earlier, one of the most significant factors toward which speech perception research is being directed is the
mechanism involved in perceiving dynamic signals. Psychological experiments on human memory have clearly shown that speech plays a far more important and essential role than vision in the human memory and thinking processes. Whereas models for separating acoustic sources have been researched in 'auditory scene analysis,' the mechanisms of how the meanings of speech are understood and how speech is produced have not yet been elucidated.

It will be necessary to clarify the process by which human beings understand and produce spoken language, in order to obtain hints for constructing language models for our spoken language, which is very different from written language. It is necessary to be able to analyze context and accept ungrammatical sentences. Now is the time to start active research on clarifying the mechanism of speech information processing in the human brain so that epoch-making technological progress can be made based on the human model.
10.9 EVALUATION METHODS FOR SPEECH PROCESSING TECHNOLOGIES
Objective evaluation methods ensuring quantitative comparison between a broad range of techniques are essential to technological developments in the speech processing field. Establishing methods for evaluating the multifarious processes and systems employed here is, however, very difficult for a number of important reasons. One is that natural speech varies considerably in its linguistic properties, voice qualities, and other aspects as well. Another is that the efficiency of speech processing techniques often depends to a large extent on the characteristics of the input speech. Therefore, the following three principal problems must be solved before effectual evaluation methods can be established:
1. Task evaluation: creating a measure fully capable of evaluating the complexity and difficulty of the task (synthesis, recognition, or coding task) being processed;
2. Technique evaluation: formulating a method for evaluating the techniques both subjectively and objectively;
3. Database for evaluation: preparing a large-scale universal database for evaluating an extensive array of systems.
Crucial future problems include how to evaluate the performance of speech understanding and spoken dialogue systems, and how best to measure the individuality and naturalness of coded and synthesized speech.
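For recognition tasks, technique evaluation is commonly quantified by the word error rate, computed by aligning the recognized word string against a reference transcription with edit distance. The sketch below uses hypothetical word strings; it is a standard measure in the field, not a procedure taken from this book:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions)
    divided by the number of reference words, via edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dist[i][j] = edit distance between ref[:i] and hyp[:j]
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dist[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dist[i][j] = min(sub, dist[i - 1][j] + 1, dist[i][j - 1] + 1)
    return dist[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("call home now", "call her home"))
# -> 0.6666... (2 errors against 3 reference words)
```

Note that the rate can exceed 1.0 when the hypothesis contains many insertions, which is one reason task difficulty (item 1) must be reported alongside the raw score.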
10.10 LSI FOR SPEECH PROCESSING USE
Development and utilization of LSIs are indispensable to the actualization of diverse, well-suited speech processing devices and systems. LSI technology has, on occasion, had considerable impact on the speech technology trend. Those algorithms that are easily packaged in LSIs, for example, tend to become mainstream tools even if they require a relatively large number of elements and computation.

Speech processing algorithms can be realized through special-purpose LSIs and digital signal processors (DSPs). Although both avenues have advantages and disadvantages, the DSP approach generally seems to be more beneficial, because speech processing algorithms are becoming substantially more diversified and continue to incorporate rapid advancements. The actual production of fully functioning speech processing hardware necessitates the fabrication of DSP-LSIs, which include high-speed circuits and large memories capable of processing and storing sufficiently long word-length data in their design. Furthermore, the provision of appropriate developmental tools for constructing DSP-based systems using high-level computer languages is essential. It would be particularly beneficial if speech researchers were to assist in proposing the design policies behind the production of these LSIs and developmental devices.
Appendix A Convolution and z-Transform
A.1 CONVOLUTION

The convolution of x(n) and h(n), usually written x(n) * h(n), is defined as

   x(n) * h(n) = Σ_{m=-∞}^{∞} x(m) h(n - m)    (A.1)

If h(n) and x(n) are the impulse response of a linear system and its input, respectively, the system response can be expressed by the convolution

   y(n) = Σ_{m=-∞}^{∞} x(m) h(n - m) = x(n) * h(n)    (A.2)

The convolution operation features the following properties:

1. Commutativity: For any h and x,

   x(n) * h(n) = h(n) * x(n)    (A.3)

2. Linearity: If parameters a and b are constants, then

   h(n) * [a x1(n) + b x2(n)] = a[h(n) * x1(n)] + b[h(n) * x2(n)]    (A.4)

3. Generally,

   h(n) * Σ_i xi(n) = Σ_i [h(n) * xi(n)]    (A.5)

   which means that the convolution and summing operations are interchangeable.

4. Time reversal: If y(n) = x(n) * h(n), then

   y(-n) = x(-n) * h(-n)    (A.6)

5. Cascade: If two systems, h1 and h2, are cascaded, then the overall impulse response of the combined system is the convolution of the individual impulse responses,

   h(n) = h1(n) * h2(n)    (A.7)

   and the overall impulse response is independent of the order in which the systems are connected.
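For finite-length sequences, the convolution sum can be evaluated directly. The sketch below (plain Python, giving the full-length output) also checks the commutativity property numerically:

```python
def convolve(x, h):
    """Direct evaluation of y(n) = sum_m x(m) h(n - m) for
    finite-length sequences (full-length output)."""
    y = [0.0] * (len(x) + len(h) - 1)
    for m, xm in enumerate(x):
        for k, hk in enumerate(h):
            y[m + k] += xm * hk
    return y

x = [1.0, 2.0, 3.0]
h = [1.0, -1.0]           # first-difference impulse response
print(convolve(x, h))     # -> [1.0, 1.0, 1.0, -3.0]
print(convolve(x, h) == convolve(h, x))   # commutativity -> True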
A.2 Z-TRANSFORM

The direct z-transform of a time sequence x(n) is defined as

   X(z) = Σ_{n=-∞}^{∞} x(n) z^(-n)    (A.8)

where z is a complex variable and X(z) is a complex function. The inverse transform is given by

   x(n) = (1/2πj) ∮_C X(z) z^(n-1) dz    (A.9)

where the contour C must be in the convergence region of X(z). The z-transform has the following elementary properties, in which Z[x(n)] represents the z-transform of x(n):

1. Linearity: Let x(n) and y(n) be any two functions and let X(z) and Y(z) be their respective z-transforms. Then, for any constants a and b,

   Z[ax(n) + by(n)] = aX(z) + bY(z)    (A.10)

2. Convolution: If w(n) = x(n) * y(n), then

   W(z) = X(z)Y(z)    (A.11)

3. Shifting:

   Z[x(n - k)] = z^(-k) X(z)    (A.12)

4. Differences:

   Z[x(n) - x(n - 1)] = (1 - z^(-1)) X(z)    (A.13)

   Z[x(n + 1) - x(n)] = (z - 1) X(z)    (A.14)

5. Exponential weighting:

   Z[a^n x(n)] = X(a^(-1) z)    (A.15)

6. Linear weighting:

   Z[n x(n)] = -z dX(z)/dz    (A.16)

7. Time reversal:

   Z[x(-n)] = X(z^(-1))    (A.17)

The z-transforms for elementary functions are as follows.

1. Unit impulse: The unit impulse is defined as

   δ(n) = 1 for n = 0; 0 otherwise    (A.18)

   If x(n) = δ(n), then

   X(z) = Σ_{n=-∞}^{∞} δ(n) z^(-n) = 1    (A.19)

2. Delayed unit impulse: If x(n) = δ(n - k),

   X(z) = Σ_{n=-∞}^{∞} δ(n - k) z^(-n) = z^(-k)    (A.20)

3. Unit step: The unit step function is defined as

   u(n) = 1 for n ≥ 0; 0 otherwise    (A.21)

   If x(n) = u(n), then

   X(z) = Σ_{n=0}^{∞} z^(-n) = 1/(1 - z^(-1)), |z| > 1    (A.22)

4. Exponential: If x(n) = a^n u(n),

   X(z) = Σ_{n=0}^{∞} a^n z^(-n) = 1/(1 - a z^(-1)), |z| > |a|    (A.23)
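For finite causal sequences, X(z) is simply a polynomial in z^(-1), so the convolution property (A.11) can be checked numerically by evaluating both sides at a sample point. This is an illustrative check with arbitrary sequences, not part of the original text:

```python
def z_transform(x, z):
    """Evaluate X(z) = sum_n x(n) z^(-n) for a finite causal sequence."""
    return sum(xn * z ** (-n) for n, xn in enumerate(x))

def convolve(x, y):
    w = [0.0] * (len(x) + len(y) - 1)
    for m, xm in enumerate(x):
        for k, yk in enumerate(y):
            w[m + k] += xm * yk
    return w

x, y, z = [1.0, 2.0], [3.0, -1.0, 0.5], 2.0
lhs = z_transform(convolve(x, y), z)          # W(z)
rhs = z_transform(x, z) * z_transform(y, z)   # X(z) Y(z)
print(abs(lhs - rhs) < 1e-12)   # -> True
```

The same evaluation trick verifies the shifting property (A.12): delaying a sequence by k samples multiplies its transform by z^(-k).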
A.3 STABILITY

A system is stable if a bounded (finite-amplitude) input x(n) always produces a bounded output y(n). That is, if

   |x(n)| < M for all n    (A.24)

and if

   |y(n)| < M for all n    (A.25)

where M is a finite constant, the system is stable. Hence, the necessary and sufficient condition for the stability of the system can be written as

   Σ_{n=-∞}^{∞} |h(n)| < ∞    (A.26)

An equivalent requirement is that all poles of H(z) lie within the unit circle.
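When the poles of a rational H(z) are known (e.g., from a factored denominator), the pole criterion can be checked directly. The coefficients below are hypothetical:

```python
def is_stable(poles):
    """A causal LTI system is stable iff all poles of H(z) lie
    strictly inside the unit circle."""
    return all(abs(p) < 1.0 for p in poles)

# H(z) = 1 / ((1 - 0.5 z^-1)(1 - 0.9 z^-1)) has poles at 0.5 and 0.9.
print(is_stable([0.5, 0.9]))       # -> True
# A complex pole pair with |z| = 1.1 makes the system unstable.
print(is_stable([1.1j, -1.1j]))    # -> False
```

This is exactly the check performed (analytically) on LPC synthesis filters, whose poles must stay inside the unit circle for the synthesized speech to remain bounded.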
Appendix B Vector Quantization Algorithm
B.1 VQ (VECTOR QUANTIZATION) TECHNIQUE FORMULATION

The VQ technique, which is one of the most important and widely used methods in speech processing, is formulated as follows (Gersho and Gray, 1992; Makhoul et al., 1985). It is assumed that x is a k-dimensional vector whose components are real-valued random variables. In vector quantization, a vector x is mapped onto another k-dimensional vector y. x is thus quantized as y, and this is written as

   y = q(x)    (B.1)

y takes on one of a finite set of values, Y = {yi} (1 ≤ i ≤ K). The set Y is referred to as the codebook, and {yi} are code vectors or templates. The size K of the codebook is referred to as the number of levels.

To design such a codebook, the k-dimensional space of vector x is partitioned into K regions {Ci} (1 ≤ i ≤ K), with a vector yi being associated with each region Ci. The quantizer then assigns the code vector yi if x is in Ci. This is represented by

   q(x) = yi, if x ∈ Ci    (B.2)
When x is quantized as y, a quantization distortion measure or distance measure d(x, y) can be defined between x and y. The overall average distortion over M sample vectors x(n) is then represented by

   D = (1/M) Σ_{n=1}^{M} d(x(n), q(x(n)))    (B.3)

A quantizer is said to be an optimal (minimum-distortion) quantizer if the overall distortion is minimized over all K-level quantizers. Two conditions are necessary for optimality. The first is that the quantizer be realized by using a minimum-distortion or nearest-neighbor selection rule,

   q(x) = yi only if d(x, yi) ≤ d(x, yj), for all j ≠ i    (B.4)

The second is that each code vector yi be chosen to minimize the average distortion in region Ci. Such a vector is called the centroid of region Ci. The centroid for a particular region depends on the definition of the distortion measure.
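The nearest-neighbor selection rule with a squared-error distortion measure can be sketched as follows (the codebook values are hypothetical):

```python
def squared_error(x, y):
    """Squared-error distortion d(x, y) between two k-dimensional vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def quantize(x, codebook):
    """Nearest-neighbor rule: map x to the code vector with
    minimum distortion."""
    return min(codebook, key=lambda y: squared_error(x, y))

codebook = [[0.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
print(quantize([0.9, 0.8], codebook))   # -> [1.0, 1.0]
```

With a codebook of K levels, transmitting the index of the chosen code vector requires only log2(K) bits per vector, which is the source of VQ's compression efficiency.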
B.2 LLOYD'S ALGORITHM (K-MEANS ALGORITHM)

Lloyd's algorithm, or the K-means algorithm, is an iterative clustering (refining) algorithm for codebook design. The algorithm divides the set of training vectors {x(n)} into K clusters {Ci} in such a way that the two previously described conditions necessary for optimality are satisfied. The four steps of the algorithm are as follows.

Step 1: Initialization. Set m = 0 (m: iterative index). Choose a set of initial code vectors, {yi(0)} (1 ≤ i ≤ K), using an adequate method.

Step 2: Classification. Classify the set of training vectors {x(n)} (1 ≤ n ≤ M) into clusters {Ci(m)} based on the nearest-neighbor rule.
Step 3: Code vector updating. Set m ← m + 1. Update the code vector of every cluster by computing the centroid of the training vectors in each cluster. Calculate the overall distortion D(m) for all training vectors.

Step 4: Termination. If the decrease in the overall distortion D(m) at iteration m relative to D(m - 1) is below a certain threshold, stop; otherwise, go to Step 2. (Any other reasonable termination criterion may be substituted.)

This algorithm systematically decreases the overall distortion by updating the codebook. The distortion sometimes converges, however, to a local optimum which may be significantly worse than the global optimum. Specifically, the algorithm tends to gravitate toward the local optimum nearest the initial codebook. A global optimum may be approximately achieved by repeating this algorithm for several types of initializations and choosing the codebook having the minimum overall distortion.
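The four steps above can be sketched for the squared-error measure as follows. This is a compact illustration with hypothetical one-dimensional data; real codebook training uses large training sets, a distortion-based termination test, and more careful initialization:

```python
def kmeans(data, codebook, iterations=10):
    """Lloyd's algorithm with squared-error distortion: alternate
    nearest-neighbor classification and centroid updating."""
    dim = len(data[0])
    for _ in range(iterations):
        # Step 2: classify each training vector to its nearest code vector.
        clusters = [[] for _ in codebook]
        for x in data:
            i = min(range(len(codebook)),
                    key=lambda i: sum((a - b) ** 2
                                      for a, b in zip(x, codebook[i])))
            clusters[i].append(x)
        # Step 3: move each code vector to the centroid of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                codebook[i] = [sum(x[d] for x in cluster) / len(cluster)
                               for d in range(dim)]
    return codebook

data = [[0.0], [0.2], [0.4], [9.6], [9.8], [10.0]]
print(kmeans(data, [[0.0], [1.0]]))   # -> approximately [[0.2], [9.8]]
```

For the squared-error measure, the centroid of a cluster is simply the sample mean of its training vectors, which is what the update step computes.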
B.3 LBG ALGORITHM
Lloyd's algorithm assumes that the codebook has a fixed size. A codebook can begin small and be gradually expanded, however, until it reaches its final size. One alternative is to split an existing cluster into two smaller clusters and assign a codebook entry to each. The following steps describe this method for building an entire codebook (Gersho and Gray, 1992; Parsons, 1986).
Step 1: Create an initial cluster consisting of the entire training set. The initial codebook thus contains a single entry corresponding to the centroid of the entire set, as is depicted in Fig. B.1(a) for a two-dimensional input.
FIG. B.1 Splitting procedure. (a) Rate 0: the centroid of the entire training sequence. (b) Initial Rate 1: the single codeword is split to form an initial estimate of a two-word code. (c) Final Rate 1: the algorithm produces a good code with two words. The dotted line indicates the cluster boundary. (d) Initial Rate 2: the two words are split to form an initial estimate of a four-word code. (e) Final Rate 2: the algorithm is run to produce a final four-word code.
Step 2: Split this cluster into two subclusters, resulting in a codebook of twice the size (Fig. B.1(b), (c)).

Step 3: Repeat this cluster-splitting process until the codebook reaches the desired size (Fig. B.1(d), (e)).

Splitting can be done in a number of ways. Ideally, each cluster should be divided by a hyperplane which is normal (perpendicular) to the direction of maximum distortion. This ensures that the maximum distortions of the two new clusters will be smaller than that of the original. As the number of codebook entries increases, however, the computational expense rapidly becomes prohibitive. Some authors perturb the centroid to generate two different points. If the centroid is x, then an initial estimate of two new codes can be created by forming x + Δ and x - Δ, where Δ is a small perturbation vector. The algorithm will then produce good codes (centroids).
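The perturbation-based splitting just described can be sketched as follows (the perturbation size is a hypothetical choice; in the full LBG algorithm each doubling is followed by Lloyd refinement, omitted here):

```python
def split_codebook(codebook, delta=0.01):
    """LBG splitting: replace each centroid x by the pair
    x + delta and x - delta, doubling the codebook size."""
    doubled = []
    for x in codebook:
        doubled.append([c + delta for c in x])
        doubled.append([c - delta for c in x])
    return doubled

# Start from the global centroid and double twice (rates 0 -> 1 -> 2).
codebook = [[5.0, 5.0]]
codebook = split_codebook(codebook)
codebook = split_codebook(codebook)
print(len(codebook))   # -> 4
```

Because the two perturbed points start very close together, the subsequent Lloyd iterations decide how the cluster is actually divided, which avoids the expensive hyperplane computation mentioned above.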
Appendix C Neural Nets
Neural net models are composed of many simple nonlinear computational nodes (elements) operating in parallel and arranged in patterns simulating biological neural nets (Lippmann, 1987). Each node sums N weighted inputs and passes the result through a nonlinearity, as shown in Fig. C.1. The node is characterized by an internal threshold or offset θ and by the type of nonlinearity (nonlinear transformation). Shown in Fig. C.1 are three types of nonlinearities: hard limiters, threshold logic elements, and sigmoidal nonlinearities.

Among the various kinds of neural nets, multilayer perceptrons have been proven successful in dealing with many types of problems. The multilayer perceptrons are feedforward nets with one or more layers of nodes between the input and output nodes. These additional layers contain hidden nodes that are not directly connected to either the input or output nodes. A three-layer perceptron with two layers of hidden nodes is shown in Fig. C.2. The nonlinearity can be any of the three types shown in Fig. C.1. The decision rule involves selection of the class which corresponds to the output node having the largest output. In the formulas, xj' and xk'' are the outputs of nodes in the first and second hidden layers, θj' and θk'' are the internal thresholds in those nodes, wij is the connection strength from the input to the first hidden layer, wjk' is the connection strength between the first and the second layers,
FIG. C.1 Computational element or node which forms a weighted sum of N inputs and passes the result through a nonlinearity. Three representative nonlinearities (hard limiter, threshold logic, sigmoid) are shown below.
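The three nonlinearities of Fig. C.1 can be realized as follows (a sketch only; conventions for the output ranges of the hard limiter and the slope of the threshold logic element vary, and the choices below are illustrative):

```python
import math

def hard_limiter(a):
    """Binary output: +1 for nonnegative activation, -1 otherwise."""
    return 1.0 if a >= 0 else -1.0

def threshold_logic(a):
    """Linear ramp through (0, 0.5), clipped to the interval [0, 1]."""
    return min(1.0, max(0.0, a + 0.5))

def sigmoid(a):
    """Smooth, differentiable squashing function with outputs in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))
```

The sigmoid is the one used for back-propagation training, since the hard limiter is not differentiable.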
FIG. C.2 A three-layer perceptron with N continuous-valued inputs, M outputs, and two layers of hidden units.
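The forward computation of such a perceptron can be sketched as follows (a sketch under the assumptions that each weight matrix has shape (n_out, n_in) and that each node's activation is the weighted sum minus its threshold; the function names are illustrative):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def mlp_forward(x, weights, thresholds):
    """Forward pass through a multilayer perceptron.
    weights[l] has shape (n_out, n_in); thresholds[l] has shape (n_out,).
    Each node forms a weighted sum of its inputs, subtracts its internal
    threshold, and passes the result through the sigmoid nonlinearity."""
    h = x
    for W, theta in zip(weights, thresholds):
        h = sigmoid(W @ h - theta)
    return h

def classify(x, weights, thresholds):
    """Decision rule: select the class whose output node is largest."""
    return int(np.argmax(mlp_forward(x, weights, thresholds)))
```

With two entries in `weights`, this corresponds to one hidden layer; adding a third entry gives the three-layer (two hidden layer) net of Fig. C.2.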
[Figure C.3: decision regions formed by perceptrons. Single layer: a half plane bounded by a hyperplane. Two layers: convex open or closed regions. Three layers: arbitrary regions, with complexity limited by the number of nodes. The columns illustrate the exclusive-OR problem, classes with meshed regions, and the most general region shapes.]

FIG. C.3 Types of decision regions that can be formed by single- and multilayer perceptrons with one and two layers of hidden units and two inputs. Shading indicates decision regions for class A. Smooth, closed contours bound input distributions for classes A and B. Nodes in all nets use hard-limiting nonlinearities.
and w_ij'' is the connection strength between the second and output layers.

The capabilities of multilayer perceptrons stem from the nonlinearities used within nodes. By way of example, the capabilities of perceptrons having one, two, and three layers that use hard-limiting nonlinearities are illustrated in Fig. C.3. A three-layer perceptron can form arbitrarily complex decision regions and can separate the meshed classes as shown at the bottom of Fig. C.3. Generally, the decision regions required by any classification algorithm can be generated by three-layer feedforward nets. Multilayer feedforward perceptrons can be automatically trained to improve classification performance with the back-propagation training algorithm. This algorithm is an iterative
gradient algorithm designed to minimize the mean square error between the actual output of the net and the desired output. If the net is used as a classifier, all desired outputs are set to zero except for the one corresponding to the class from which the input originates; that desired output is 1. The algorithm propagates the error terms required to adapt weights backward from nodes in the output layer to nodes in lower layers. The following outlines a back-propagation training algorithm which assumes a sigmoidal logistic nonlinearity for the function f(a) in Fig. C.1.
Step 1: Weight and threshold initialization
Set all weights and node thresholds to small random values.
Step 2: Input and desired output presentation
Present an input vector x_0, x_1, ..., x_{N-1} (continuous values) and specify the desired outputs d_0, d_1, ..., d_{M-1}. Present samples from a training set cyclically until the weights stabilize.
Step 3: Actual output calculation
Use the sigmoidal nonlinearity and formulas as in Fig. C.2 to calculate the outputs y_0, y_1, ..., y_{M-1}.
Step 4: Weight adaptation
Use a recursive algorithm starting at the output nodes and working back to the first hidden layer. Adjust weights using

    w_ij(t+1) = w_ij(t) + η ε_j x_i'                              (C.2)

where w_ij(t) is the weight from hidden node i or from an input to node j at time t, x_i' is the output of node i or an input, η is the gain term, and ε_j is an error term for node j. If node j is an output node,

    ε_j = y_j (1 - y_j)(d_j - y_j)
If node j is an internal hidden node,

    ε_j = x_j' (1 - x_j') Σ_k ε_k w_jk
where k indicates all nodes in the layers above node j. Internal node thresholds are adapted in a similar manner by assuming they are connection weights on links from imaginary inputs having a constant value of 1. Convergence is sometimes faster, and weight changes are smoothed, if a momentum term is added to Eq. (C.2) as

    w_ij(t+1) = w_ij(t) + η ε_j x_i' + α (w_ij(t) - w_ij(t-1))

where 0 < α < 1.
Step 5: Repetition by returning to step 2
Repeat steps 2 to 4 until the weights and thresholds converge.
Neural nets typically provide a greater degree of robustness or fault tolerance than do conventional sequential computers. One difficulty noted with the back-propagation algorithm is that in many cases the number of training-data presentations required for convergence is large (more than 100 passes through all the training data).
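The five steps above can be sketched concretely for a one-hidden-layer net on the exclusive-OR problem of Fig. C.3 (a NumPy sketch; the layer sizes, gain η = 0.5, momentum α = 0.9, and random seed are illustrative choices, and convergence from an arbitrary initialization is not guaranteed):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Exclusive-OR problem: the classic task a single-layer perceptron cannot solve.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0.0], [1.0], [1.0], [0.0]])

# Step 1: initialize weights and thresholds to small random values.
rng = np.random.default_rng(0)
W1 = rng.uniform(-0.5, 0.5, (3, 2)); th1 = rng.uniform(-0.5, 0.5, 3)
W2 = rng.uniform(-0.5, 0.5, (1, 3)); th2 = rng.uniform(-0.5, 0.5, 1)

eta, alpha = 0.5, 0.9                       # gain term and momentum coefficient
dW1 = np.zeros_like(W1); dW2 = np.zeros_like(W2)

for _ in range(20000):                      # Step 5: cycle until convergence
    for x, d in zip(X, D):                  # Step 2: present samples cyclically
        h = sigmoid(W1 @ x - th1)           # Step 3: actual outputs
        y = sigmoid(W2 @ h - th2)
        # Step 4: error terms for output and hidden nodes, then adapt weights
        # (the previous weight change enters through the momentum term).
        eps_out = y * (1 - y) * (d - y)
        eps_hid = h * (1 - h) * (W2.T @ eps_out)
        dW2 = eta * np.outer(eps_out, h) + alpha * dW2
        dW1 = eta * np.outer(eps_hid, x) + alpha * dW1
        W2 += dW2; W1 += dW1
        # Thresholds adapt as weights on links from an imaginary input of -1.
        th2 -= eta * eps_out; th1 -= eta * eps_hid

y_final = [float(sigmoid(W2 @ sigmoid(W1 @ x - th1) - th2)[0]) for x in X]
```

With this seed the trained outputs approach the targets 0, 1, 1, 0, separating the classes that no single-layer perceptron can.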
Bibliography
CHAPTER 1
Fagen, M. D. Ed. (1975) A History of Engineering and Science in the Bell System, Bell Telephone Laboratories, Inc., New Jersey, p. 6.
Flanagan, J. L. (1972) Speech Analysis Synthesis and Perception, 2nd Ed., Springer-Verlag, New York.
Furui, S. and Sondhi, M. Ed. (1992) Advances in Speech Signal Processing, Marcel Dekker, New York.
Markel, J. D. and Gray, Jr., A. H. (1976) Linear Prediction of Speech, Springer-Verlag, New York.
Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.
Saito, S. and Nakata, K. (1985) Fundamentals of Speech Signal Processing, Academic Press Japan, Tokyo.
Schroeder, M. R. (1999) Computer Speech, Springer-Verlag, Berlin.
CHAPTER 2
Denes, P. B. and Pinson, E. N. (1963) The Speech Chain, Bell Telephone Laboratories, Inc., New Jersey.
Furui, S., Itakura, F., and Saito, S. (1972) ‘Talker recognition by the long-time averaged speech spectrum,’ Trans. IECEJ, 55-A, 10, pp. 549-556.
Furui, S. (1986) ‘On the role of spectral transition for speech perception,’ J. Acoust. Soc. Amer., 80, 4, pp. 1016-1025.
Irii, H., Itoh, K., and Kitawaki, N. (1987) ‘Multi-lingual speech database for speech quality measurements and its statistical characteristics,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S87-69.
Jakobson, R., Fant, G., and Halle, M. (1963) Preliminaries to Speech Analysis: The Distinctive Features and Their Correlates, MIT Press, Boston.
Peterson, G. E. and Barney, H. L. (1952) ‘Control methods used in a study of the vowels,’ J. Acoust. Soc. Amer., 24, 2, pp. 175-184.
Saito, S., Kato, K., and Teranishi, N. (1958) ‘Statistical properties of fundamental frequencies of Japanese speech voices,’ J. Acoust. Soc. Jap., 14, 2, pp. 111-116.
Saito, S. (1961) Fundamental Research on Transmission Quality of Japanese Phonemes, Ph.D Thesis, Nagoya Univ.
Sato, H. (1975) ‘Acoustic cues of male and female voice quality,’ Elec. Commun. Labs Tech. J., 24, 5, pp. 977-993.
Stevens, K. N., Keyser, S. J., and Kawasaki, H. (1986) ‘Toward a phonetic and phonological theory of redundant features,’ in Invariance and Variability in Speech Processes (eds. J. S. Perkel and D. H. Klatt), Lawrence Erlbaum Associates, New Jersey, pp. 426-449.
CHAPTER 3
Fant, G. (1959) ‘The acoustics of speech,’ Proc. 3rd Int. Cong. Acoust., Sec. 3, pp. 188-201.
Fant, G. (1960) Acoustic Theory of Speech Production, Mouton’s Co., Hague.
Flanagan, J. L. (1972) Speech Analysis Synthesis and Perception, 2nd Ed., Springer-Verlag, New York.
Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1975) ‘Synthesis of speech from a dynamic model of the vocal cords and vocal tract,’ Bell Systems Tech. J., 54, 3, pp. 485-506.
Flanagan, J. L., Ishizaka, K., and Shipley, K. L. (1980) ‘Signal models for low bit-rate coding of speech,’ J. Acoust. Soc. Amer., 68, 3, pp. 780-791.
Ishizaka, K. and Flanagan, J. L. (1972) ‘Synthesis of voiced sounds from a two-mass model of the vocal cords,’ Bell Systems Tech. J., 51, 6, pp. 1233-1268.
Kelly, Jr., J. L. and Lochbaum, C. (1962) ‘Speech synthesis,’ Proc. 4th Int. Cong. Acoust., G42, pp. 1-4.
Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.
Stevens, K. N. (1971) ‘Airflow and turbulence noise for fricative and stop consonants: static considerations,’ J. Acoust. Soc. Amer., 50, 4(Part 2), pp. 1180-1192.
Stevens, K. N. (1977) ‘Physics of laryngeal behavior and larynx models,’ Phonetica, 34, pp. 264-279.

CHAPTER 4
Atal, B. S. (1974) ‘Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,’ J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Atal, B. S. and Rabiner, L. R. (1976) ‘A pattern recognition approach to voiced-unvoiced-silence classification with applications to speech recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 3, pp. 201-212.
Bell, C. G., Fujisaki, H., Heinz, J. M., Stevens, K. N., and House, A. S. (1961) ‘Reduction of speech spectra by analysis-by-synthesis techniques,’ J. Acoust. Soc. Amer., 33, 12, pp. 1725-1736.
Bogert, B. P., Healy, M. J. R., and Tukey, J. W. (1963) ‘The frequency analysis of time-series for echoes,’ Proc. Symp. Time Series Analysis, Chap. 15, pp. 209-243.
Dudley, H. (1939) ‘The vocoder,’ Bell Labs Record, 18, 4, pp. 122-126.
Furui, S. (1981) ‘Cepstral analysis technique for automatic speaker verification,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.
Gold, B. and Rader, C. M. (1967) ‘The channel vocoder,’ IEEE Trans. Audio, Electroacoust., AU-15, 4, pp. 148-161.
Imai, S. and Kitamura, T. (1978) ‘Speech analysis synthesis system using the log magnitude approximation filter,’ Trans. IECEJ, J61-A, 6, pp. 527-534.
Itakura, F. and Tohkura, Y. (1978) ‘Feature extraction of speech signal and its application to data compression,’ Joho-shori, 19, 7, pp. 644-656.
Itakura, F. (1981) ‘Speech analysis-synthesis based on spectrum encoding,’ J. Acoust. Soc. Jap., 37, 5, pp. 197-203.
Markel, J. D. (1972) ‘The SIFT algorithm for fundamental frequency estimation,’ IEEE Trans. Audio, Electroacoust., AU-20, 5, pp. 367-377.
Noll, A. M. (1964) ‘Short-time spectrum and ‘cepstrum’ techniques for vocal-pitch detection,’ J. Acoust. Soc. Amer., 36, 2, pp. 296-302.
Noll, A. M. (1967) ‘Cepstrum pitch determination,’ J. Acoust. Soc. Amer., 41, 2, pp. 293-309.
Oppenheim, A. V. and Schafer, R. W. (1968) ‘Homomorphic analysis of speech,’ IEEE Trans. Audio, Electroacoust., AU-16, 2, pp. 221-226.
Oppenheim, A. V. (1969) ‘Speech analysis-synthesis system based on homomorphic filtering,’ J. Acoust. Soc. Amer., 45, 2, pp. 458-465.
Oppenheim, A. V. and Schafer, R. W. (1975) Digital Signal Processing, Prentice-Hall, New Jersey.
Rabiner, L. R. and Schafer, R. W. (1975) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.
Schroeder, M. R. (1966) ‘Vocoders: analysis and synthesis of speech,’ Proc. IEEE, 54, 5, pp. 720-734.
Shannon, C. E. and Weaver, W. (1949) The Mathematical Theory of Communication, University of Illinois Press.
Smith, C. P. (1969) ‘Perception of vocoder speech processed by pattern matching,’ J. Acoust. Soc. Amer., 46, 6(Part 2), pp. 1562-1571.
Tohkura, Y. (1980) Speech Quality Improvement in PARCOR Speech Analysis-Synthesis Systems, Ph.D Thesis, Tokyo Univ.

CHAPTER 5
Atal, B. S. and Schroeder, M. R. (1968) ‘Predictive coding of speech signals,’ Proc. 6th Int. Cong. Acoust., C-5-4.
Atal, B. S. (1970) ‘Determination of the vocal-tract shape directly from the speech wave,’ J. Acoust. Soc. Amer., 47, 1(Part 1), 4K1, p. 64.
Atal, B. S. and Hanauer, S. L. (1971) ‘Speech analysis and synthesis by linear prediction of the speech wave,’ J. Acoust. Soc. Amer., 50, 2(Part 2), pp. 637-655.
Fukabayashi, T. and Suzuki, H. (1975) ‘Speech analysis by linear pole-zero model,’ Trans. IECEJ, J58-A, 5, pp. 270-277.
Ishizaki, S. (1977) ‘Pole-zero model order identification in speech analysis,’ Trans. IECEJ, J60-A, 4, pp. 423-424.
Itakura, F. and Saito, S. (1968) ‘Analysis synthesis telephony based on the maximum likelihood method,’ Proc. 6th Int. Cong. Acoust., C-5-5.
Itakura, F. and Saito, S. (1971) ‘Digital filter techniques for speech analysis and synthesis,’ Proc. 7th Int. Cong. Acoust., Budapest, 25-C-1.
Itakura, F. (1975) ‘Line spectrum representation of linear predictor coefficients of speech signal,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S75-34.
Itakura, F. and Sugamura, N. (1979) ‘LSP speech synthesizer, its principle and implementation,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S79-46.
Itakura, F. (1981) ‘Speech analysis-synthesis based on spectrum encoding,’ J. Acoust. Soc. Jap., 37, 5, pp. 197-203.
Markel, J. D. (1972) ‘Digital inverse filtering - A new tool for formant trajectory estimation,’ IEEE Trans. Audio, Electroacoust., AU-20, 2, pp. 129-137.
Markel, J. D. and Gray, Jr., A. H. (1976) Linear Prediction of Speech, Springer-Verlag, New York.
Matsuda, R. (1966) ‘Effects of the fluctuation characteristics of input signal on the tonal differential limen of speech transmission system containing single dip in frequency response,’ Trans. IECEJ, 49, 10, pp. 1865-1871.
Morikawa, H. and Fujisaki, H. (1984) ‘System identification of the speech production process based on a state-space representation,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32, 2, pp. 252-262.
Nakajima, T., Suzuki, T., Ohmura, H., Ishizaki, S., and Tanaka, K. (1978) ‘Estimation of vocal tract area function by adaptive deconvolution and adaptive speech analysis system,’ J. Acoust. Soc. Jap., 34, 3, pp. 157-166.
Oppenheim, A. V., Kopec, G. E., and Tribolet, J. M. (1976) ‘Speech analysis by homomorphic prediction,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 4, pp. 327-332.
Sagayama, S. and Furui, S. (1977) ‘Maximum likelihood estimation of speech spectrum by pole-zero modeling,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S76-56.
Sagayama, S. and Itakura, F. (1981) ‘Composite sinusoidal modeling applied to spectral analysis of speech,’ Trans. IECEJ, J64-A, 2, pp. 105-112.
Sugamura, N. and Itakura, F. (1981) ‘Speech data compression by LSP speech analysis-synthesis technique,’ Trans. IECEJ, J64-A, 8, pp. 599-606.
Tohkura, Y. (1980) Speech Quality Improvement in PARCOR Speech Analysis-Synthesis Systems, Ph.D Thesis, Tokyo Univ.
Wakita, H. (1973) ‘Direct estimation of the vocal tract shape by inverse filtering of acoustic speech waveforms,’ IEEE Trans. Audio, Electroacoust., AU-21, 5, pp. 417-427.
Wiener, N. (1966) Extrapolation, Interpolation and Smoothing of Stationary Time Series, MIT Press, Cambridge, Massachusetts.

CHAPTER 6
Abut, H., Gray, R. M., and Rebolledo, G. (1982) ‘Vector quantization of speech and speech-like waveforms,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 3, pp. 423-435.
Anderson, J. B. and Bodie, J. B. (1975) ‘Tree encoding of speech,’ IEEE Trans. Information Theory, IT-21, 4, pp. 379-387.
Atal, B. S. and Schroeder, M. R. (1970) ‘Adaptive predictive coding of speech signals,’ Bell Systems Tech. J., 49, 8, pp. 1973-1986.
Atal, B. S. and Schroeder, M. R. (1979) ‘Predictive coding of speech signals and subjective error criteria,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 3, pp. 247-254.
Atal, B. S. and Remde, J. R. (1982) ‘A new model of LPC excitation for producing natural-sounding speech at low bit rates,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 614-617.
Atal, B. S. and Schroeder, M. R. (1984) ‘Stochastic coding of speech signals at very low bit rates,’ Proc. Int. Conf. Commun., Pt. 2, pp. 1610-1613.
Atal, B. S. and Rabiner, L. R. (1986) ‘Speech research directions,’ AT&T Tech. J., 65, 5, pp. 75-88.
Buzo, A., Gray, Jr., A. H., Gray, R. M., and Markel, J. D. (1980) ‘Speech coding based upon vector quantization,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 5, pp. 562-574.
Chen, J.-H., Melchner, M. J., Cox, R. V. and Bowker, D. O. (1990) ‘Real-time implementation and performance of a 16 kb/s low-delay CELP speech coder,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 181-184.
Childers, D., Cox, R. V., DeMori, R., Furui, S., Juang, B.-H., Mariani, J. J., Price, P., Sagayama, S., Sondhi, M. M. and Weischedel, R. (1998) ‘The past, present, and future of speech processing,’ IEEE Signal Processing Magazine, May, pp. 24-48.
Crochiere, R. E., Webber, S. A., and Flanagan, J. L. (1976) ‘Digital coding of speech in sub-bands,’ Bell Systems Tech. J., 55, 8, pp. 1069-1085.
Crochiere, R. E., Cox, R. V., and Johnston, J. D. (1982) ‘Real-time speech coding,’ IEEE Trans. Commun., COM-30, 4, pp. 621-634.
Crochiere, R. E. and Flanagan, J. L. (1983) ‘Current perspectives in digital speech,’ IEEE Commun. Magazine, January, pp. 32-40.
Cummiskey, P., Jayant, N. S., and Flanagan, J. L. (1973) ‘Adaptive quantization in differential PCM coding of speech,’ Bell Systems Tech. J., 52, 7, pp. 1105-1118.
Cuperman, V. and Gersho, A. (1982) ‘Adaptive differential vector coding of speech,’ Conf. Rec., 1982 IEEE Global Commun. Conf., Miami, FL, pp. E6.6.1-E6.6.5.
David, Jr., E. E., Schroeder, M. R., Logan, B. F., and Prestigiacomo, A. J. (1962) ‘Voice-excited vocoders for practical speech bandwidth reduction,’ IRE Trans. Information Theory, IT-8, 5, pp. S101-S105.
Elder, B. (1997) ‘Overview on the current development of MPEG-4 audio coding,’ in Proc. 4th Int. Workshop on Systems, Signals and Image Processing, Poznan.
Esteban, D. and Galand, C. (1977) ‘Application of quadrature mirror filters to split band voice coding schemes,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Hartford, CT, pp. 191-195.
Farges, E. P. and Clements, M. A. (1986) ‘Hidden Markov models applied to very low bit rate speech coding,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 433-436.
Fehn, H. G. and Noll, P. (1982) ‘Multipath search coding of stationary signals with applications to speech,’ IEEE Trans. Commun., COM-30, 4, pp. 687-701.
Flanagan, J. L., Schroeder, M. R., Atal, B. S., Crochiere, R. E., Jayant, N. S., and Tribolet, J. M. (1979) ‘Speech coding,’ IEEE Trans. Commun., COM-27, 4, pp. 710-737.
Foster, J., Gray, R. M., and Dunham, M. O. (1985) ‘Finite-state vector quantization for waveform coding,’ IEEE Trans. Information Theory, IT-31, 3, pp. 348-359.
Gersho, A. and Cuperman, V. (1983) ‘Vector quantization: A pattern-matching technique for speech coding,’ IEEE Commun. Magazine, December, pp. 15-21.
Gersho, A. and Gray, R. M. (1992) Vector Quantization and Signal Compression, Kluwer, Boston.
Gerson, I. A. and Jasiuk, M. A. (1990) ‘Vector sum excited linear prediction (VSELP) speech coding at 8 kb/s,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 461-464.
Griffin, D. and Lim, J. S. (1988) ‘Multiband excitation vocoder,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-36, 8, pp. 1223-1235.
Honda, M. and Itakura, F. (1984) ‘Bit allocation in time and frequency domains for predictive coding of speech,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-32, 3, pp. 465-473.
Jayant, N. S. (1970) ‘Adaptive delta modulation with a one-bit memory,’ Bell Systems Tech. J., 49, 3, pp. 321-342.
Jayant, N. S. (1973) ‘Adaptive quantization with a one-word memory,’ Bell Systems Tech. J., 52, 7, pp. 1119-1144.
Jayant, N. S. (1974) ‘Digital coding of speech waveforms: PCM, DPCM, and DM quantizers,’ Proc. IEEE, 62, 5, pp. 611-632.
Jayant, N. S. and Noll, P. (1984) Digital Coding of Waveforms, Prentice-Hall, New Jersey.
Jayant, N. S. and Ramamoorthy, V. (1986) ‘Adaptive postfiltering of 16 kb/s-ADPCM speech,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 16.4, pp. 829-832.
Jelinek, F. and Anderson, J. B. (1971) ‘Instrumentable tree encoding of information sources,’ IEEE Trans. Information Theory, IT-17, 1, pp. 118-119.
Juang, B. H. and Gray, Jr., A. H. (1982) ‘Multiple stage vector quantization for speech coding,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 597-600.
Juang, B. H. (1986) ‘Design and performance of trellis vector quantizers for speech signals,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 437-440.
Kataoka, A., Moriya, T. and Hayashi, S. (1993) ‘An 8-kbit/s speech coder based on conjugate structure CELP,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 592-595.
Kitawaki, N., Itoh, K., Honda, M., and Kakeki, K. (1982) ‘Comparison of objective speech quality measures for voiceband codecs,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1000-1003.
Kleijn, W. B. and Haagen, J. (1994) ‘Transformation and decomposition of the speech signal for coding,’ IEEE Signal Processing Lett., 1, 9, pp. 136-138.
Krasner, M. A. (1979) ‘Digital encoding of speech and audio signals based on the perceptual requirement of the auditory system,’ Lincoln Lab. Tech. Rep., 535.
Linde, Y., Buzo, A., and Gray, R. M. (1980) ‘An algorithm for vector quantizer design,’ IEEE Trans. Commun., COM-28, 1, pp. 84-95.
Lloyd, S. P. (1957) ‘Least squares quantization in PCM,’ Institute of Mathematical Statistics Meeting, Atlantic City, NJ, September; also (1982) IEEE Trans. Information Theory, IT-28, 2(Part I), pp. 129-136.
Makhoul, J. and Berouti, M. (1979) ‘Adaptive noise spectral shaping and entropy coding in predictive coding of speech,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 1, pp. 63-73.
Malah, D., Crochiere, R. E., and Cox, R. V. (1981) ‘Performance of transform and subband coding systems combined with harmonic scaling of speech,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 273-283.
Max, J. (1960) ‘Quantizing for minimum distortion,’ IRE Trans. Information Theory, IT-6, 1, pp. 7-12.
McAulay, R. J. and Quatieri, T. F. (1986) ‘Speech analysis/synthesis based on a sinusoidal representation,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34, pp. 744-754.
Miki, S., Mano, K., Ohmuro, H. and Moriya, T. (1993) ‘Pitch synchronous innovation CELP (PSI-CELP),’ Proc. Eurospeech, pp. 261-264.
Moriya, T. and Honda, M. (1986) ‘Speech coder using phase equalization and vector quantization,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 1701-1704.
Noll, P. (1975) ‘A comparative study of various schemes for speech encoding,’ Bell Systems Tech. J., 54, 9, pp. 1597-1614.
Ozawa, K., Araseki, T., and Ono, S. (1982) ‘Speech coding based on multi-pulse excitation method,’ Trans. Committee on Communication Systems, IECEJ, CS82-161.
Ozawa, K. and Araseki, T. (1986) ‘High quality multi-pulse speech coder with pitch prediction,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, pp. 1689-1692.
Rabiner, L. R. and Schafer, R. W. (1978) Digital Processing of Speech Signals, Prentice-Hall, New Jersey.
Richards, D. L. (1973) Telecommunication by Speech, Butterworths, London.
Roucos, S., Schwartz, R., and Makhoul, J. (1982a) ‘Vector quantization for very-low-rate coding of speech,’ Conf. Rec. 1982 IEEE Global Commun. Conf., Miami, FL, pp. E6.2.1-E6.2.5.
Roucos, S., Schwartz, R., and Makhoul, J. (1982b) ‘Segment quantization for very-low-rate speech coding,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1565-1568.
Schafer, R. W. and Rabiner, L. R. (1975) ‘Digital representation of speech signals,’ Proc. IEEE, 63, 1, pp. 662-677.
Schroeder, M. R. and Atal, B. S. (1982) ‘Speech coding using efficient block codes,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, pp. 1668-1671.
Schroeder, M. R. and Atal, B. S. (1985) ‘Code-excited linear prediction (CELP): high-quality speech at very low bit rates,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tampa, FL, pp. 937-940.
Shiraki, Y. and Honda, M. (1986) ‘Very low bit rate speech coding based on joint segmentation and variable length segment quantizer,’ Proc. Acoust. Soc. Amer. Meeting, J. Acoust. Soc. Amer., Supple. 1, 79, p. S94.
Smith, C. D. (1969) ‘Perception of vocoder speech processed by pattern matching,’ J. Acoust. Soc. Amer., 46, 6(Part 2), pp. 1562-1571.
Stewart, L. C., Gray, R. M., and Linde, Y. (1982) ‘The design of trellis waveform coders,’ IEEE Trans. Commun., COM-30, 4, pp. 702-710.
Supplee, L., Cohn, R., Collura, J. and McCree, A. (1997) ‘MELP: The new Federal Standard at 2400 bps,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1591-1594.
Tribolet, J. M. and Crochiere, R. E. (1978) ‘A vocoder-driven adaptation strategy for low bit-rate adaptive transform coding of speech,’ Proc. Int. Conf. Digital Signal Processing, Florence, Italy, pp. 638-642.
Tribolet, J. M. and Crochiere, R. E. (1979) ‘Frequency domain coding of speech,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 5, pp. 512-530.
Tribolet, J. M. and Crochiere, R. E. (1980) ‘A modified adaptive transform coding scheme with post-processing enhancement,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Denver, Colorado, pp. 336-339.
Wong, D. Y., Juang, B. H., and Gray, Jr., A. H. (1982) ‘An 800 bit/s vector quantization LPC vocoder,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 5, pp. 770-780.
Wong, D. Y., Juang, B. H., and Cheng, D. Y. (1983) ‘Very low data rate speech compression with LPC vector and matrix quantization,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 65-68.
Zelinski, R. and Noll, P. (1977) ‘Adaptive transform coding of speech signals,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-25, 4, pp. 299-309.

CHAPTER 7
Allen, J., Carlson, R., Granstrom, B., Hunnicutt, S., Klatt, D., and Pisoni, D. (1979) MITalk-79: Conversion of Unrestricted English Text to Speech, MIT.
Black, A. W. and Campbell, N. (1995) ‘Optimizing selection of units from speech databases for concatenative synthesis,’ Proc. Eurospeech, pp. 581-584.
Coker, C. H., Umeda, N., and Browman, C. P. (1978) ‘Automatic synthesis from ordinary English text,’ IEEE Trans. Audio, Electroacoust., AU-21, 3, pp. 293-298.
Crochiere, R. E. and Flanagan, J. L. (1986) ‘Speech processing: An evolving technology,’ AT&T Tech. J., 65, 5, pp. 2-11.
Ding, W. and Campbell, N. (1997) ‘Optimizing unit selection with voice source and formants in the CHATR speech synthesis system,’ Proc. Eurospeech, pp. 537-540.
Dixon, N. R. and Maxey, H. D. (1968) ‘Terminal analog synthesis of continuous speech using the diphone method of segment assembly,’ IEEE Trans. Audio, Electroacoust., AU-16, 1, pp. 40-50.
Donovan, R. E. and Woodland, P. C. (1999) ‘A hidden-Markov-model-based trainable speech synthesizer,’ Computer Speech and Language, 13, pp. 223-241.
Flanagan, J. L. (1972) ‘Voices of men and machines,’ J. Acoust. Soc. Amer., 51, 5(Part 1), pp. 1375-1387.
Hirokawa, T., Itoh, K. and Sato, H. (1992) ‘High quality speech synthesis based on wavelet compilation of phoneme segments,’ Proc. Int. Conf. Spoken Language Processing, pp. 567-570.
Hirose, K., Fujisaki, H., and Kawai, H. (1986) ‘Generation of prosodic symbols for rule-synthesis of connected speech of Japanese,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, 45.4, pp. 2415-2418.
Huang, X., Acero, A., Adcock, J., Hon, H.-W., Goldsmith, J., Liu, J. and Plumpe, M. (1996) ‘WHISTLER: A trainable text-to-speech system,’ Proc. Int. Conf. Spoken Language Processing, pp. 2387-2390.
Klatt, D. H. (1980) ‘Software for a cascade/parallel formant synthesizer,’ J. Acoust. Soc. Amer., 67, 3, pp. 971-995.
Klatt, D. H. (1987) ‘Review of text-to-speech conversion for English,’ J. Acoust. Soc. Amer., 82, 3, pp. 737-793.
Laroche, L., Stylianou, Y. and Moulines, E. (1993) ‘HNS: Speech modification based on a harmonic + noise model,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 550-553.
Lovins, J. B., Macchi, M. J., and Fujimura, O. (1979) ‘A demisyllable inventory for speech synthesis,’ 97th Meeting of Acoust. Soc. Amer., YY4.
Moulines, E. and Charpentier, F. (1990) ‘Pitch-synchronous waveform processing techniques for text-to-speech synthesis using diphones,’ Speech Communication, 9, pp. 453-467.
Nakajima, S. and Hamada, H. (1988) ‘Automatic generation of synthesis units based on context oriented clustering,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 659-662.
Nakajima, S. (1993) ‘English speech synthesis based on multi-layered context oriented clustering,’ Proc. Eurospeech, pp. 1709-1712.
Sagisaka, Y. and Tohkura, Y. (1984) ‘Phoneme duration control for speech synthesis by rule,’ Trans. IECEJ, J67-A, 7, pp. 629-636.
Sagisaka, Y. (1988) ‘Speech synthesis by rule using an optimal selection of non-uniform synthesis units,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 679-682.
Sagisaka, Y. (1998) ‘Corpus-based speech synthesis,’ J. Signal Processing, 2, 6, pp. 407-414.
Sato, H. (1978) ‘Speech synthesis on the basis of PARCOR-VCV concatenation units,’ Trans. IECEJ, J61-D, 11, pp. 858-865.
Sato, H. (1984a) ‘Speech synthesis using CVC concatenation units and excitation waveform elements,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S83-69.
Sato, H. (1984b) ‘Japanese text-to-speech conversion system,’ Rev. of the Elec. Commun. Labs., 32, 2, pp. 179-187.
Tokuda, K., Masuko, T., Yamada, T., Kobayashi, T. and Imai, S. (1995) ‘An algorithm for speech parameter generation from continuous mixture HMMs with dynamic features,’ Proc. Eurospeech, pp. 757-760.

CHAPTER 8
Acero, A. and Stern, R. M. (1990) ‘Environmental robustness in automatic speech recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 849-852.
Atal, B. (1974) ‘Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,’ J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Bahl, L. R. and Jelinek, F. (1975) ‘Decoding for channels with insertions, deletions, and substitutions, with applications to speech recognition,’ IEEE Trans. Information Theory, IT-21, pp. 404-411.
Bahl, L. R., Brown, P. F., de Souza, P. V. and Mercer, L. R. (1986) ‘Maximum mutual information estimation of hidden Markov model parameters for speech recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 49-52.
Baker, J. K. (1975) ‘Stochastic modeling for automatic speech understanding,’ in Speech Recognition (ed. D. R. Reddy), pp. 521-542.
Baum, L. E. (1972) ‘An inequality and associated maximization technique in statistical estimation for probabilistic functions of a Markov process,’ Inequalities, 3, pp. 1-8.
Bellman, R. (1957) Dynamic Programming, Princeton Univ. Press, New Jersey.
Bridle, J. S. (1973) ‘An efficient elastic template method for detecting keywords in running speech,’ Brit. Acoust. Soc. Meeting, pp. 1-4.
Bridle, J. S. and Brown, M. D. (1979) ‘Connected word recognition using whole word templates,’ Proc. Inst. Acoust. Autumn Conf., pp. 25-28.
Brown, P. F., Della Pietra, V. J., de Souza, P. V., Lai, J. C. and Mercer, R. L. (1992) ‘Class-based n-gram models of natural language,’ Computational Linguistics, 18, 4, pp. 467-479.
Chen, S. S., Eide, E. M., Gales, M. J. F., Gopinath, R. A., Kanevsky, D. and Olsen, P. (1999) ‘Recent improvements to IBM’s speech recognition system for automatic transcription of broadcast news,’ Proc. DARPA Broadcast News Workshop, pp. 89-94.
Childers, D., Cox, R. V., DeMori, R., Furui, S., Juang, B.-H., Mariani, J. J., Price, P., Sagayama, S., Sondhi, M. M. and Weischedel, R. (1998) ‘The past, present, and future of speech processing,’ IEEE Signal Processing Magazine, May, pp. 24-48.
Cox, S. J. and Bridle, J. S. (1989) ‘Unsupervised speaker adaptation by probabilistic fitting,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 294-297.
Cox, S. J. (1995) ‘Predictive speaker adaptation in speech recognition,’ Computer Speech and Language, 9, pp. 1-17.
Davis, K. H., Biddulph, R., and Balashek, S. (1952) ‘Automatic recognition of spoken digits,’ J. Acoust. Soc. Amer., 24, 6, pp. 637-642.
Digalakis, V. and Neumeyer, L. (1995) ‘Speaker adaptation using combined transformation and Bayesian methods,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 680-683.
Furui, S. (1975) ‘Learning and normalization of the talker differences in the recognition of spoken words,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S75-25.
Furui, S. (1978) Research on Individual Information in Speech Waves, Ph.D Thesis, Tokyo University.
Furui, S. (1980) ‘A training procedure for isolated word recognition systems,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-28, 2, pp. 129-136.
Furui, S. (1981) ‘Cepstral analysis technique for automatic speaker verification,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.
Furui, S. (1986a) ‘Speaker-independent isolated word recognition using dynamic features of speech spectrum,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-34, 1, pp. 52-59.
Furui, S. (1986b) ‘On the role of spectral transition for speech perception,’ J. Acoust. Soc. Amer., 80, 4, pp. 1016-1025.
Bibliography
423
Furui, S. (1987) ‘A VQ-based preprocessor using cepstral dynamic features for large vocabulary word recognition,’ Proc. IEEE Int.Conf. Acoust., Speech, Signal Processing, Dallas, TX, 27.2, pp. 1127-1 130. Furui, S. (1989a) ‘Unsupervised speaker adaptation method based on hierarchicalspectral clustering,’ Proc.IEEEInt.Conf. Acoust., Speech, Signal Processing, pp. 286-289. Furui, S. (1989b) ‘Unsupervisedspeaker adaptation based on hierarchical spectral clustering,’ IEEE Trans. Acoust.,Speech, Signal Processing, ASSP-37, 12, pp. 1923-1930. Furui, S. (1992) ‘Toward robust speech recognition under adverse conditions,’ Proc. ESCA Workshop on Speech Processing in Adverse Conditions, Cannes-Mandelieu, pp. 3 1-42. Furui, S. (1995) ‘Flexible speech recognition,’Proc.Eurospeech, pp. 1595-1603. Furui, S. (1997) ‘Recent advances in robust speech recognition,’ Proc. ESCA-NATO Workshopon Robust Speech Recognition forUnknownCommunication Channels, Pont-a-Mousson, pp. 11-20. Gales, M. J. F. and Young, S. J. (1992) ‘An improved approach to thehidden Markov modeldecomposition of speech and noise,’ Proc.IEEEInt.Conf.Acoust.,Speech, Signal Processing, pp. 233-236. Gales, M. J. F. andYoung, S. J. (1993) ‘Parallel model combinationfor speech recognitionin noise,’ Technical Report CUED/F-INFENG/TRl35, Cambridge Univ. Gauvain, J.-L., Lamel, L., Adda, G. and Jardino, M. (1999) ‘The LIMSI 1998 Hub-4Etranscription system,’ Proc. DARPA Broadcast News Workshop, pp. 99-104. Goodman, R. G. (1976) Analysis of Languages for Man-Machine Voice Communication,Ph.DThesis,Carnegie-Mellon University.
Gray, Jr., A. H. and Markel, J. D. (1976) ‘Distance measures for speech processing,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-24, 5, pp. 380-391.
Huang, X.-D. and Jack, M. A. (1989) ‘Semi-continuous hidden Markov source models for speech signals,’ Computer Speech and Language, 3, pp. 239-251.
Huang, X.-D., Ariki, Y. and Jack, M. A. (1990) Hidden Markov Models for Speech Recognition, Edinburgh Univ. Press, Edinburgh.
Itakura, F. (1975) ‘Minimum prediction residual principle applied to speech recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 1, pp. 67-72.
Jelinek, F. (1976) ‘Continuous speech recognition by statistical methods,’ Proc. IEEE, 64, 4, pp. 532-556.
Jelinek, F. (1997) Statistical Methods for Speech Recognition, MIT Press, Cambridge.
Juang, B.-H. (1991) ‘Speech recognition in adverse environments,’ Computer Speech and Language, 5, pp. 275-294.
Juang, B.-H. and Katagiri, S. (1992) ‘Discriminative learning for minimum error classification,’ IEEE Trans. Signal Processing, 40, 12, pp. 3043-3054.
Juang, B.-H., Chou, W. and Lee, C.-H. (1996) ‘Statistical and discriminative methods for speech recognition,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 109-132.
Kato, K. and Kawahara, H. (1984) ‘Adaptability to individual talkers in monosyllabic speech perception,’ Trans. Committee on Hearing Research, Acoust. Soc. Jap., H84-3.
Katz, S. M. (1987) ‘Estimation of probabilities from sparse data for the language model component of a speech recognizer,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-35, 3, pp. 400-401.
Kawahara, T., Lee, C.-H. and Juang, B.-H. (1997) ‘Combining key-phrase detection and subword-based verification for flexible speech understanding,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 1303-1306.
Klatt, D. H. (1982) ‘Prediction of perceived phonetic distance from critical-band spectra: A first step,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Paris, France, S11.1, pp. 1278-1281.
Knill, K. and Young, S. (1997) ‘Hidden Markov models in speech and language processing,’ in Corpus-Based Methods in Language and Speech Processing (eds. S. Young and G. Bloothooft), Kluwer, Dordrecht, pp. 27-68.
Kohda, M., Hashimoto, S., and Saito, S. (1972) ‘Spoken digit mechanical recognition system,’ Trans. IECEJ, 55-D, 3, pp. 186-193.
Lee, C.-H. and Gauvain, J.-L. (1996) ‘Bayesian adaptive learning and MAP estimation of HMM,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 83-107.
Leggetter, C. J. and Woodland, P. C. (1995) ‘Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models,’ Computer Speech and Language, 9, pp. 171-185.
Lesser, V. R., Fennell, R. D., Erman, L. D., and Reddy, D. R. (1975) ‘Organization of the Hearsay II speech understanding system,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 1, pp. 11-24.
Lin, C.-H., Chang, P.-C. and Wu, C.-H. (1994) ‘An initial study on speaker adaptation for Mandarin syllable recognition with minimum error discriminative training,’ Proc. Int. Conf. Spoken Language Processing, pp. 307-310.
Lowerre, B. T. (1976) The Harpy Speech Recognition System, Ph.D. Thesis, Computer Science Department, Carnegie-Mellon University.
Martin, F., Shikano, K. and Minami, Y. (1993) ‘Recognition of noisy speech by composition of hidden Markov models,’ Proc. Eurospeech, pp. 1031-1034.
Matsui, T. and Furui, S. (1995) ‘A study of speaker adaptation based on minimum classification error training,’ Proc. Eurospeech, pp. 81-84.
Matsui, T. and Furui, S. (1996) ‘N-best-based instantaneous speaker adaptation method for speech recognition,’ Proc. Int. Conf. Spoken Language Processing, pp. 973-976.
Matsumoto, H. and Wakita, H. (1986) ‘Vowel normalization by frequency warped spectral matching,’ Speech Communication, 5, 2, pp. 239-251.
Matsuoka, T. and Lee, C.-H. (1993) ‘A study of on-line Bayesian adaptation for HMM-based speech recognition,’ Proc. Eurospeech, pp. 815-818.
Minami, Y. and Furui, S. (1995) ‘Universal adaptation method based on HMM composition,’ Proc. ICA, pp. 105-108.
Myers, C. S. and Rabiner, L. R. (1981) ‘Connected digit recognition using a level-building DTW algorithm,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 3, pp. 351-363.
Nakagawa, S. (1983) ‘A connected spoken word or syllable recognition algorithm by pattern matching,’ Trans. IECEJ, J66-D, 6, pp. 637-644.
Nakatsu, R., Nagashima, H., Kojima, J., and Ishii, N. (1983) ‘A speech recognition method for telephone voice,’ Trans. IECEJ, J66-D, 4, pp. 377-384.
Ney, H. and Aubert, X. (1996) ‘Dynamic programming search strategies: From digit strings to large vocabulary word graphs,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 385-411.
Ney, H., Martin, S. and Wessel, F. (1997) ‘Statistical language modeling using leaving-one-out,’ in Corpus-Based Methods in Language and Speech Processing (eds. S. Young and G. Bloothooft), Kluwer, Dordrecht, pp. 174-207.
Normandin, Y. (1996) ‘Maximum mutual information estimation of hidden Markov models,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 57-81.
Ohkura, K., Sugiyama, M. and Sagayama, S. (1992) ‘Speaker adaptation based on transfer vector field smoothing with continuous mixture density HMMs,’ Proc. Int. Conf. Spoken Language Processing, pp. 369-372.
Ohtsuki, K., Furui, S., Sakurai, N., Iwasaki, A. and Zhang, Z.-P. (1999) ‘Recent advances in Japanese broadcast news transcription,’ Proc. Eurospeech, pp. 671-674.
Paliwal, K. K. (1982) ‘On the performance of the quefrency-weighted cepstral coefficients in vowel recognition,’ Speech Communication, 1, 2, pp. 151-154.
Paul, D. (1991) ‘Algorithms for an optimal A* search and linearizing the search in the stack decoder,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 693-696.
Rabiner, L. R., Levinson, S. E., Rosenberg, A. E., and Wilpon, J. G. (1979a) ‘Speaker-independent recognition of isolated words using clustering techniques,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 4, pp. 336-349.
Rabiner, L. R. and Wilpon, J. G. (1979b) ‘Speaker-independent isolated word recognition for a moderate size (54 word) vocabulary,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 6, pp. 583-587.
Rabiner, L. R., Levinson, S. E., and Sondhi, M. M. (1983) ‘On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition,’ Bell Systems Tech. J., 62, 4, pp. 1075-1105.
Rabiner, L. R. and Levinson, S. E. (1985) ‘A speaker-independent, syntax-directed, connected word recognition system based on hidden Markov models and level building,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-33, 3, pp. 561-573.
Rabiner, L. R., Juang, B.-H., Levinson, S. E. and Sondhi, M. M. (1985) ‘Recognition of isolated digits using hidden Markov models with continuous mixture densities,’ AT&T Tech. J., 64, 6, pp. 1211-1234.
Rabiner, L. and Juang, B.-H. (1993) Fundamentals of Speech Recognition, Prentice Hall, New Jersey.
Rissanen, J. (1984) ‘Universal coding, information, prediction and estimation,’ IEEE Trans. Information Theory, 30, 4, pp. 629-636.
Rohlicek, J. R. (1995) ‘Word spotting,’ in Modern Methods of Speech Processing (eds. R. P. Ramachandran and R. Mammone), Kluwer, Boston, pp. 123-157.
Rose, R. C. (1996) ‘Word spotting from continuous speech utterances,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 303-329.
Sakoe, H. and Chiba, S. (1971) ‘Recognition of continuously spoken words based on time-normalization by dynamic programming,’ J. Acoust. Soc. Jap., 27, 9, pp. 483-500.
Sakoe, H. and Chiba, S. (1978) ‘Dynamic programming algorithm optimization for spoken word recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-26, 1, pp. 43-49.
Sakoe, H. (1979) ‘Two-level DP-matching - A dynamic programming-based pattern matching algorithm for connected word recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 6, pp. 588-595.
Sakoe, H. and Watari, M. (1981) ‘Clockwise propagating DP-matching algorithm for word recognition,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S81-65.
Sankar, A. and Lee, C.-H. (1996) ‘A maximum-likelihood approach to stochastic matching for robust speech recognition,’ IEEE Trans. Speech and Audio Processing, 4, 3, pp. 190-202.
Schwartz, R., Chow, Y.-L. and Kubala, F. (1987) ‘Rapid speaker adaptation using a probabilistic spectral mapping,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 633-636.
Schwarz, G. (1978) ‘Estimating the dimension of a model,’ The Annals of Statistics, 6, pp. 461-464.
Shikano, K. (1982) ‘Spoken word recognition based upon vector quantization of input speech,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S82-60.
Shikano, K. and Aikawa, K. (1982) ‘Staggered array DP matching,’ Trans. Committee on Speech Research, Acoust. Soc. Jap., S82-15.
Shikano, K., Lee, K.-F. and Reddy, R. (1986) ‘Speaker adaptation through vector quantization,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 49.5, pp. 2643-2646.
Shiraki, Y. and Honda, M. (1990) ‘Speaker adaptation algorithms based on piece-wise moving adaptive segment quantization method,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 657-660.
Slutsker, G. (1968) ‘Non-linear method of analysis of speech signal,’ Trudy N. I. I. R.
Soong, F. K. and Huang, E. F. (1991) ‘A tree-trellis fast search for finding N-best sentence hypotheses,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 705-708.
Stern, R. M., Acero, A., Liu, F.-H. and Ohshima, Y. (1996) ‘Signal processing for robust speech recognition,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 357-384.
Sugamura, N. and Furui, S. (1982) ‘Large vocabulary word recognition using pseudo-phoneme templates,’ Trans. IECEJ, J65-D, 8, pp. 1041-1048.
Sugamura, N., Shikano, K., and Furui, S. (1983) ‘Isolated word recognition using phoneme-like templates,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Boston, MA, 16.3, pp. 723-726.
Sugamura, N. and Furui, S. (1984) ‘Isolated word recognition using strings of phoneme-like templates (SPLIT),’ J. Acoust. Soc. Japan, (E)5, 4, pp. 243-252.
Sugiyama, M. and Shikano, K. (1981) ‘LPC peak weighted spectral matching measures,’ Trans. IECEJ, J64-A, 5, pp. 409-416.
Sugiyama, M. and Shikano, K. (1982) ‘Frequency weighted LPC spectral matching measures,’ Trans. IECEJ, J65-A, 9, pp. 965-972.
Tohkura, Y. (1986) ‘A weighted cepstral distance measure for speech recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Tokyo, Japan, 14.17, pp. 761-764.
Varga, A. P. and Moore, R. K. (1990) ‘Hidden Markov model decomposition of speech and noise,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 845-848.
Varga, A. P. and Moore, R. K. (1991) ‘Simultaneous recognition of concurrent speech signals using hidden Markov model decomposition,’ Proc. Eurospeech, pp. 1175-1178.
Velichko, V. and Zagoruyko, N. (1970) ‘Automatic recognition of 200 words,’ Int. J. Man-Machine Studies, 2, pp. 223-234.
Vintsyuk, T. K. (1968) ‘Speech recognition by dynamic programming,’ Kibernetika, 4, 1, pp. 81-88.
Vintsyuk, T. K. (1971) ‘Element-wise recognition of continuous speech composed of words from a specified dictionary,’ Kibernetika, 2, pp. 133-143.
Viterbi, A. J. (1967) ‘Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,’ IEEE Trans. Information Theory, IT-13, pp. 260-269.
Young, S. (1996) ‘A review of large-vocabulary continuous-speech recognition,’ IEEE Signal Processing Magazine, September, pp. 45-57.

CHAPTER 9

Atal, B. S. (1972) ‘Automatic speaker recognition based on pitch contours,’ J. Acoust. Soc. Amer., 52, 6 (Part 2), pp. 1687-1697.
Atal, B. S. (1974) ‘Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,’ J. Acoust. Soc. Amer., 55, 6, pp. 1304-1312.
Carey, M. and Parris, E. (1992) ‘Speaker verification using connected words,’ Proc. Institute of Acoustics, 14, 6, pp. 95-100.
Doddington, G. R. (1974) ‘Speaker verification,’ Rome Air Development Center, Tech. Rep., RADC 74-179.
Doddington, G. (1985) ‘Speaker recognition-Identifying people by their voices,’ Proc. IEEE, 73, 11, pp. 1651-1664.
Eatock, J. and Mason, J. (1990) ‘Automatically focusing on good discriminating speech segments in speaker recognition,’ Proc. Int. Conf. Spoken Language Processing, 5.2, pp. 133-136.
Furui, S., Itakura, F., and Saito, S. (1972) ‘Talker recognition by longtime averaged speech spectrum,’ Trans. IECEJ, 55-A, 10, pp. 549-556.
Furui, S. (1978) Research on Individuality Information in Speech Waves, Ph.D. Thesis, Tokyo University.
Furui, S. (1981a) ‘Comparison of speaker recognition methods using statistical features and dynamic features,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 3, pp. 342-350.
Furui, S. (1981b) ‘Cepstral analysis technique for automatic speaker verification,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-29, 2, pp. 254-272.
Furui, S. (1986) ‘Research on individuality features in speech waves and automatic speaker recognition techniques,’ Speech Communication, 5, 2, pp. 183-197.
Furui, S. (1996) ‘An overview of speaker recognition technology,’ in Automatic Speech and Speaker Recognition (eds. C.-H. Lee, F. K. Soong and K. K. Paliwal), Kluwer, Boston, pp. 31-56.
Furui, S. (1997) ‘Recent advances in speaker recognition,’ Pattern Recognition Letters, 18, pp. 859-872.
Griffin, C., Matsui, T. and Furui, S. (1994) ‘Distance measures for text-independent speaker recognition based on MAR model,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Adelaide, 23.6, pp. 309-312.
Higgins, A., Bahler, L. and Porter, J. (1991) ‘Speaker verification using randomized phrase prompting,’ Digital Signal Processing, 1, pp. 89-106.
Kersta, L. G. (1962) ‘Voiceprint identification,’ Nature, 196, pp. 1253-1257.
Li, K. P. and Wrench, Jr., E. H. (1983) ‘An approach to text-independent speaker recognition with short utterances,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Boston, MA, 12.9, pp. 555-558.
Kunzel, H. (1994) ‘Current approaches to forensic speaker recognition,’ ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 135-141.
Markel, J., Oshika, B. and Gray, A. (1977) ‘Long-term feature averaging for speaker recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-25, 4, pp. 330-337.
Markel, J. and Davis, S. (1979) ‘Text-independent speaker recognition from a large linguistically unconstrained time-spaced data base,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-27, 1, pp. 74-82.
Matsui, T. and Furui, S. (1990) ‘Text-independent speaker recognition using vocal tract and pitch information,’ Proc. Int. Conf. Spoken Language Processing, Kobe, 5.3, pp. 137-140.
Matsui, T. and Furui, S. (1991) ‘A text-independent speaker recognition method robust against utterance variations,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S6.3, pp. 377-380.
Matsui, T. and Furui, S. (1992) ‘Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMMs,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, pp. II-157-160.
Matsui, T. and Furui, S. (1993) ‘Concatenated phoneme models for text-variable speaker recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Minneapolis, pp. II-391-394.
Matsui, T. and Furui, S. (1994a) ‘Speaker adaptation of tied-mixture-based phoneme models for text-prompted speaker recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Adelaide, 13.1.
Matsui, T. and Furui, S. (1994b) ‘Similarity normalization method for speaker verification based on a posteriori probability,’ ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 59-62.
Montacie, C., Deleglise, P., Bimbot, F. and Caraty, M.-J. (1992) ‘Cinematic techniques for speech processing: Temporal decomposition and multivariate linear prediction,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, San Francisco, pp. I-153-156.
Naik, J., Netsch, M. and Doddington, G. (1989) ‘Speaker verification over long distance telephone lines,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S10b.3, pp. 524-527.
National Research Council (1979) On the Theory and Practice of Voice Identification, Washington, D. C.
Newman, M., Gillick, L., Ito, Y., McAllaster, D. and Peskin, B. (1996) ‘Speaker verification through large vocabulary continuous speech recognition,’ Proc. Int. Conf. Spoken Language Processing, Philadelphia, pp. 2419-2422.
O’Shaughnessy, D. (1986) ‘Speaker recognition,’ IEEE ASSP Magazine, 3, 4, pp. 4-17.
Poritz, A. (1982) ‘Linear predictive hidden Markov models and the speech signal,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S11.5, pp. 1291-1294.
Reynolds, D. (1994) ‘Speaker identification and verification using Gaussian mixture speaker models,’ ESCA Workshop on Automatic Speaker Recognition, Identification and Verification, pp. 27-30.
Rose, R. and Reynolds, D. (1990) ‘Text independent speaker identification using automatic acoustic segmentation,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S51.10, pp. 293-296.
Rosenberg, A. E. and Sambur, M. R. (1975) ‘New techniques for automatic speaker verification,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 2, pp. 169-176.
Rosenberg, A. and Soong, F. (1987) ‘Evaluation of a vector quantization talker recognition system in text independent and text dependent modes,’ Computer Speech and Language, 22, pp. 143-157.
Rosenberg, A., Lee, C. and Gokcen, S. (1991) ‘Connected word talker verification using whole word hidden Markov models,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, Toronto, S6.4, pp. 381-384.
Rosenberg, A. and Soong, F. (1991) ‘Recent research in automatic speaker recognition,’ in Advances in Speech Signal Processing (eds. S. Furui and M. M. Sondhi), Marcel Dekker, New York, pp. 701-737.
Rosenberg, A. (1992) ‘The use of cohort normalized scores for speaker verification,’ Proc. Int. Conf. Spoken Language Processing, Banff, Th.sAM.4.2, pp. 599-602.
Sambur, M. R. (1975) ‘Selection of acoustic features for speaker identification,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-23, 2, pp. 176-182.
Savic, M. and Gupta, S. (1990) ‘Variable parameter speaker verification system based on hidden Markov modeling,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S5.7, pp. 281-284.
Setlur, A. and Jacobs, T. (1995) ‘Results of a speaker verification service trial using HMM models,’ EUROSPEECH’95, Madrid, pp. 639-642.
Shikano, K. (1985) ‘Text-independent speaker recognition experiments using codebooks in vector quantization,’ J. Acoust. Soc. Am. (abstract), Suppl. 1, 77, S11.
Soong, F. K. and Rosenberg, A. E. (1986) ‘On the use of instantaneous and transitional spectral information in speaker recognition,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 877-880.
Soong, F., Rosenberg, A. and Juang, B. (1987) ‘A vector quantization approach to speaker recognition,’ AT&T Technical Journal, 66, pp. 14-26.
Tishby, N. (1991) ‘On the application of mixture AR hidden Markov models to text independent speaker recognition,’ IEEE Trans. Acoust., Speech, Signal Processing, ASSP-30, 3, pp. 563-570.
Tosi, O., Oyer, H., Lashbrook, W., Pedrey, C., Nicol, J., and Nash, E. (1972) ‘Experiment on voice identification,’ J. Acoust. Soc. Amer., 51, 6 (Part 2), pp. 2030-2043.
Zheng, Y. and Yuan, B. (1988) ‘Text-dependent speaker identification using circular hidden Markov models,’ Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, S13.3, pp. 580-582.
APPENDICES

Gersho, A. and Gray, R. M. (1992) Vector Quantization and Signal Compression, Kluwer, Boston.
Lippmann, R. P. (1987) ‘An introduction to computing with neural nets,’ IEEE ASSP Magazine, 4, 2, pp. 4-22.
Makhoul, J., Roucos, S., and Gish, H. (1985) ‘Vector quantization,’ Proc. IEEE, 73, 11, pp. 1551-1588.
Parsons, T. W. (1986) Voice and Speech Processing, McGraw-Hill, New York, pp. 274-275.
Index
A
%correct, 322
A* search, 312
Abdominal muscles, 10
Accent, 10
  component, 230
Accuracy, 322
Acoustic background, 303
Acoustic model, 314
Adaptation:
  backward (feedback), 143, 151
  to environmental variation, 380
  forward (feedforward), 143
  on line, 336
  instantaneous, 336
Adaptive bit allocation, 163
Adaptive delta modulation (ADM), 151
Adaptive differential PCM (ADPCM), 143, 148, 151, 158
Adaptive inverse filtering, 114
Adaptive PCM (APCM), 143
Adaptive prediction, 147
  backward, 149, 151
  forward, 149
Adaptive predictive coding (APC), 143, 149, 153
  with adaptive bit allocation (APC-AB), 166
Adaptive predictive DPCM (AP-DPCM), 149
Adaptive quantization, 138, 143
  backward, 151
Adaptive transform coding (ATC), 163
  with VQ (ATC-VQ), 179
Adaptive vector predictive coding (AVPC), 180
Adjustment window condition, 269
AEN (articulation equivalent loss), 201
Affine transformation, 336
Affricate sound, 11
Air Travel Information System (ATIS), 323
A-law, 142
Alexander Graham Bell, 1
Aliasing distortion, 47
Allophones, 219, 229
All-pole:
  model, 89
  polynomial spectral density function, 90
  spectrum, 68
  speech production system, 68
Allophonic variations, 320
Amplitude:
  density distribution function, 21
  level, 20
Analog-to-digital (A/D) conversion, 45, 51
Analysis-by-synthesis coder, 196
Analysis-by-synthesis (A-b-S) method, 42, 71, 190
Analysis-synthesis, 73, 135
Antiformant, 30, 127
Antiresonance, 30
  circuit, 27, 224
Anti-model, 347
A posteriori probability, 363, 365
Area function, 33, 111
AR process, 91
Arithmetic coding, 134
Articulation, 9, 11, 27, 30
  manner of, 12
  place of, 12
Articulation equivalent transmission loss (AEN), 201
Articulators, 11
Articulatory model, 223
Articulatory movement, 11
Articulatory organs, 11, 246
Articulatory units, 381
Artificial intelligence (AI), 382
Aspiration, 11
Auditory critical bandwidth, 251
Auditory nerve system, 7
Auditory scene analysis, 385
Audrey, 243
Augmented transition network (ATN), 312
Autocorrelation:
  function, 52, 53, 251, 252
  method, 87, 252
Automation control, 301
Autoregressive (AR) process, 89
Average branching factor, 322

B
Back-propagation training algorithm, 401
Backward prediction error, 102
Backward propagation wave, 33
Backward variable, 285
Bakis model, 279
Band-pass:
  filter (BPF), 82, 70, 250
  bank, 76, 159, 251
  lifters, 252
Bark-scale frequency axis, 251
Basilar membrane, 251
Baum-Welch algorithm, 282, 288
Bayes’ rule, 313
Bayes’ sense, 364
Bayesian learning, 335
Beam search method, 311
Bernoulli effect, 10
Best-first method, 311
BIC (Bayesian Information Criterion), 325, 328
Bigram, 316
Binary tree coding (BTC), 178
Blackboard model, 310
Bottom-up, 308
Boundary condition:
  at the lips and glottis, 115
  for the time warping function, 268
Breadth-first method, 311

C
Cascade connection, 225
Case frame, 312
Centroid, 176, 337, 394
Cepstral analysis, 79
Cepstral coefficient, 62, 77, 251
Cepstral distance (CD), 202
Cepstrum, 62
  method, 252
Cepstral mean:
  normalization (CMN), 325, 341, 361
  subtraction (CMS), 341, 361
CHATR, 238
Cholesky decomposition method, 89
City block distance, 269
Claimed speaker, 364
Class N-gram, 317
Clustering, 332
Clustering-based methods, 176
Cluster-splitting method (LBG algorithm), 176, 395
Coarticulation, 16, 245, 378
  dynamic model of, 383
Code:
  vectors, 176, 393
Codebook, 176, 281, 393
Codeword, 279
Code-excited linear predictive coding (CELP), 193
Coding, 45, 47, 199
  bit rate, 200
  delay, 200
  in frequency domain, 159
  methods, evaluation of, 199
  in time domain, 141
Cohort speakers, 364
Complexity of coder and decoder, 200
Composite sinusoidal model (CSM), 126
Concatenation synthesizer, 238
Connected word recognition, 295
Connection strength, 399
Consonant, 6
Context, 308
Context-dependent phoneme units, 229, 247
Context-free grammar (CFG), 312
Context-oriented-clustering (COC) method, 237
Continuous speech recognition, 246
Conversational speech recognition, 246
Convolution, 387
Convolutional (multiplicative) distortion, 344
Corpus, 314
Corpus-based speech synthesis, 237
Cosh measure, 256
Covariance method, 87
CS-ACELP, 205
Customer (registered speaker), 354
CVC syllable, 228, 247
CV syllable, 228, 247

D
DARPA speech recognition projects, 323
Database for evaluation, 386
Decision criterion (threshold), 356
DECtalk system, 236
Deemphasis, 51
Delayed decision encoding, 173
Delayed feedback effect, 8
Deleted interpolation method, 316
Delta-cepstrum, 262, 363
Delta-delta-cepstrum, 263
Delta modulation (DM or ΔM), 149
Demisyllable, 229, 297
Depth-first method, 311
Detection-based approach, 344
Devocalization, 266
Diaphragm, 10
Differential coding, 148
Differential PCM (DPCM), 145, 148
Differential quantization, 149
Digital filter bank, 70
Digital processing of speech,
Digital signal processors (DSPs), 386
Digital-to-analog (D/A) conversion, 51
Digitization, 45
Diphone, 229, 247
Diphthong, 13
Discounting ratio, 317
Discourse, 264
Discrete cosine transform (DCT), 163
Discrete Fourier transform (DFT), 57, 163
Discriminant analysis, 364
Discriminative training, 293, 347
Distance (similarity) measure, 176, 249
  based on LPC, 252
  based on nonparametric spectral analysis, 251
Distance normalization, 364
Distinctive features, 20
Divergence, 363
Double-SPLIT method, 278
Dual z-transform, 68
Duration, 230, 234, 264
Durbin’s recursive solution method, 89, 105, 108
Dyad, 229, 247
Dynamic characteristics, 367
Dynamic spectral features (spectral transition), 262, 367, 378
Dynamic programming (DP):
  matching, 266, 277, 297
    asymmetrical, 270
    staggered array, 272, 249
    symmetrical, 270
    unconstrained endpoint, 270
    variations in, 270
  method, 287
    CW (clockwise), 300
    O(n) (order n), 301
    OS (one-stage), 301
  path, 270
Dynamic time warping (DTW), 260, 266

E
Ears, 7
EM algorithm, 290
Energy level, 248
Entropy, 322
  coding, 133
Equivalent vocabulary size, 322
Error:
  deletion, 323
  insertion, 323
  rate, 323
  substitution, 323
Euclidean distance, 250
Evaluation:
  factors for speech coding systems, 199
  methods:
    objective, 200
    subjective, 200
  for speech processing technologies, 385

F
False acceptance (FA), 354
False rejection (FR), 354
Fast Fourier transform (FFT), 57, 251
Feedforward nets, 399
FFT cepstrum, 69
Filler, 305
  speech model, 303
Filter bank, 70
Fine structure, 64
Finite state VQ (FSVQ), 182
First-order differential processing, 114
F1-F2 plane, 16
Formant, 14, 127
  bandwidth, 19
  frequency, 14, 39
    extraction, 71
Formant-type speech synthesis method, 224
Forward-backward algorithm, 282, 283
Forward and backward waves, 223
Forward prediction error, 102
Forward propagation wave, 33
Forward-type AP-DPCM, 153
Forward variable, 283
Fourier transform, 53
  pair (Wiener-Khintchine theorem), 54
Frame, 60
  interval, 60
  length, 60
F-ratio (inter- to intravariance ratio), 363
Frequency resolution, 60
Frequency spectrum, 52
Fricative, 10
Full search coding (FSC), 178
Fundamental equations, 35
Fundamental frequency (pitch), 10, 24, 79, 230, 351
Fundamental period, 10

G
Gaussian, 291
  mixture, 305
  mixture model (GMM), 325, 371
Generation rules (rewriting rules), 312
Glottal area, 42
Glottal source, 10
Glottal volume velocity, 42
Glottis, 10
Good-Turing estimation theory, 317
Grammar, 314
Granular noise, 150
H
Hamming window, 58
Hanning window, 58
Hard limiters, 399
Harmonic plus noise model (HNM), 220
Harpy system, 311
Hearsay II system, 310
Hidden layers, 399
Hidden Markov model (HMM), 278
  coding, 184
  composition, 344, 363
  continuous, 279, 290
  decomposition, 344
  discrete, 279
  ergodic, 279, 305
    based method, 371
  evaluation problem, 282
  hidden state sequence uncovering problem, 283
  left-to-right, 279
  linear predictive, 371
[Hidden Markov model (HMM)]
  mixture autoregressive (AR), 371
  MMI training of, 292
  MCE/GPD training of, 292, 335
  problems,
  procedures,
  semicontinuous, 292
  system for word recognition, 293
  theory and implementation of, 278
  three basic algorithms for, 282
  tied mixture, 292
  training problem, 283
Hidden nodes, 399
Hierarchy model, 308
High-emphasis filter, 102
Homomorphic analysis, 66
Homomorphic filtering, 66
Homomorphic prediction, 129
Huffman coding, 133
Human-computer dialog systems, 323
Human-computer interaction, 243
Hybrid coding, 135, 187

I
IBM, 325
Impostor, 354
Individual characteristics, 349, 351
Individual differences:
  acquired, 351
  hereditary, 351
Individuality, 246
Information:
  rate distortion theory, 134, 177
  transmission theory, 313
Initial state distribution, 281
Input and output nodes, 399
Integer band sampling, 162
Intelligibility test, 200
Internal thresholds, 399
Interpolation characteristics, 126
Inter-session (temporal) variability, 360
Intonation, 7, 10
  component, basic, 230
Intraspeaker variation, 360, 364
Inverse filter, 85, 255
  first- or second-order critical damping, 361
Inverse filtering method, 93, 114
Irreversible coding, 133
Island-driven method, 311
Isolated word recognition, 246
Itakura-Saito distance (distortion), 254
J
Jaw, 9

K
Karhunen-Loeve transform (KLT), 163
Katz’s backoff smoothing, 317
Kelly’s speech synthesis (production) model, 37, 110
K-means algorithm (Lloyd’s algorithm), 176, 394
K-nearest neighbor (KNN) method, 332
Knockout method, 363
Knowledge processing, advanced, 382
Knowledge source, 308, 382

L
Lag window, 252
Language model, 314, 344
Large-vocabulary continuous speech recognition, 306
Larynx, 9
Lattice, 248
  filter, 109
  diagram, 285
LBG algorithm (cluster-splitting method), 176, 395
Left-to-right method, 311
Level building (LB) method, 298
Lexicon, 306
Lifter, 77, 261
Liftering, 65
Likelihood, 248, 282
  normalization, 364
  ratio, 347, 363, 364
LIMSI, 324
Linear delta modulation (LDM), 149
Linearly separable equivalent circuit, 30, 64, 73, 85
Linear PCM, 142
Linear prediction, 2, 83, 145
Linear predictive coding (LPC), 2, 78
  analysis, 68, 83, 250, 252
    procedure, 86
  methods:
    code-excited, 138
    multi-pulse-excited, 138
[Linear predictive coding (LPC)]
    residual-excited, 138, 187
    speech-excited, 138, 187
  parameters, mutual relationships between, 127
  speech synthesizer, 228
Linear predictor:
  coefficients, 84
  filter, 84
Linear transformation, 335
  based on multiple regression analysis, 336
Line spectrum pair (LSP), 116
  analysis, 116
    principle of, 116
    solution of, 119
  parameters, 121
    coding of, 126
  synthesis filter, 122
Linguistic constraints, 246
Linguistic information, 5, 243
Linguistic knowledge, 246
Linguistic science, new, 383
Linguistic units, 381
Lip rounding, 12
Lips, 9
Lloyd’s algorithm (K-means algorithm), 176, 394
Local decoder, 145
Locus theory, 229
Log likelihood ratio distance, 255
Log PCM, 142
Lombard effect, 341
Long-term (pitch) prediction, 148, 153
Long-term (time) averaged speech spectrum (LAS), 23, 370
Long-term-statistics-based method, 368
Loss:
  heat conduction, 32
  leaky, 32
  viscous, 32
Loudness, 230
LPC:
  cepstral coefficients, 257
  cepstral distance, 257
  cepstrum, 69
  correlation coefficients, 260
  correlation function, 127
LSI for speech processing use, 386
Lungs, 89
M
Markov:
  chains, 279
  sources, 279
Mass conservation equation, 32
Matched filter principle, 197
Matrix quantization (MQ), 138, 182, 337
Maximum a posteriori (MAP), 330
  decoding rule, 314
  estimates, 335
  probability, 313
Maximum likelihood (ML):
  estimation, 293
  method, 70, 254
  spectral distance, 254
  spectral estimation, 89
    formulation of, 89
    physical meaning of, 93
MDL (Minimum Description Length) criterion, 325
Mean opinion score (MOS), 200
Mel frequency cepstral coefficient (MFCC), 252
Mel-scale frequency axis, 251
Mimicked voice, 352
Minimum phase impulse response, 77
Minimum residual energy, 256
Mismatches:
  acoustic, 341
  linguistic, 341
MITalk-79 system, 234
Mixed excitation LPC (MELP), 196
Mixture, 290
M-L method, 173
MLLR (maximum likelihood linear regression) method, 325, 330
Models, 244
Modified autocorrelation function, 14, 98, 107
Modified correlation method, 79
Momentum equation, 32
Morph, 234
Morphemes, 317
Morphological analysis, 317
μ-law, 142
Multiband excitation (MBE), 196
Multilayer perceptrons, 399
Multipath search coding, 173
Multiple regression analysis, 336
Multi-pulse-excited LPC (MPC), 189
Multistage processing, 178
Multistage VQ, 179
Multitemplate method, 332
Multivariate autoregression (MAR), 370
Mutual information, 292
N
N-best:
  based adaptation, 339
  hypotheses, 339
  results, 312
N-gram language model, 316
Nasal, 11
  cavity, 9
Nasalization, 11
Nasalized vowel, 11
Nearest-neighbor selection rule, 394
Network model, 310
Neural net, 399
Neutral vowel, 13
Neyman-Pearson:
  hypothesis testing formulation, 305
  lemma, 347
Noise:
  additive, 341
  shaping, 138, 156
  source, 44
  threshold, 135
Nonlinear quantization, 138
Nonlinear warping of the spectrum, 335
Nonparametric analysis (NPA), 52
Nonuniform sampling, 266
Nonspeech sounds, 249
Normal equation, 89
Normalized residual energy, 256
Nyquist rate, 47
O
Objective evaluation, 200
Observation probability, 281
  distribution, 281
Opinion-equivalent SNR (SNRq), 200
Opinion tests, 200
Optimal (minimum-distortion) quantizer, 394
Oral cavity, 9
Orthogonal polynomial representation, 367
Out-of-vocabulary, 305, 344
P
Pair comparison (A-B test), 200
Parallel connection, 225
Parallel model combination (PMC), 344, 363
Parametric analysis (PA), 52
PARCOR (partial autocorrelation):
  analysis, 102
    formulation of, 102
  analysis-synthesis system, 110
  coefficient, 102
    extraction process, 89
    and LPC coefficients, relationship between, 108
  synthesis filter, 109
Partial correlator, 107
Peak factor, 21
Peak-weighted distance, 258
Perceiving dynamic signals, 385
Perceptually-based weighting, 192
Perceptual units, 381
Periodogram, 92
Perplexity, 322
  log, 322
  test-set, 322
Pharynx, 9
Phase equalization, 195
Phone, 6
Phoneme, 6, 247
  reference template, 275
Phoneme-based algorithm, 247
Phoneme-based system, 229
Phoneme-based word recognition, 275
Phoneme-like templates, 277
Phoneme context, 238
Phonemic symbol, 6
Phonetic decision tree, 320
Phonetic information, 246
Phonetic invariants, 331
Phonetic symbol, 6
Phonocode method, 184
Phrase component, 230
Physical units, 382
Pitch, 10, 264
  error:
    double-, 79
    half-, 79
  extraction, 78
    by correlation processing, 79
    by spectral processing, 79
    by waveform processing, 79
Pitch-synchronous waveform concatenation, 220
Pitch (long-term) prediction, 148, 153
π-type four-terminal circuits, 223
Plosive, 10
Pole-zero analysis, 127
  by maximum likelihood estimation, 130
Polynomial coefficients, 367
Polynomial expansion coefficients, lower order, 262
Positive definiteness, 250
Postfilter, adaptive noise-shaping, 158
Postfiltering, 158
Pragmatics, 264, 308
Preemphasis, 51
Predicate logic, 312
Prediction, 145
  error, 102
    operators, forward and backward, 106
  gain, 147
  residual, 141, 145, 256
Predictive coding, 141, 143
Procedural knowledge representation, 312
Production:
  model, 383
  system, 312
Progressing wave model, 32
Prosodic features, 379
  control of, 230
Prosodics, 308
Prosody, 264
Pseudophoneme, 277
PSI-CELP, 205
Pulse code modulation (PCM), 138, 141
Pulse generator, 27
Q
Quadrature mirror filter (QMF), 162
Quantization, 47
  distortion, 49, 177
  error, 49
  noise, 49
  step size, 47
Quantizing, 45
Quefrency, 64
Quefrency-weighted cepstral distance measure, 262
R
Radiation, 9, 27
Random learning, 176
Rate distortion function, 135
Receiver operating characteristic (ROC) curve, 354
Recognition:
  speaker, 349
  speech, 243
Rectangular window, 58
Reduction, 245
Reference template, 244, 264
Reflection coefficient, 35, 111, 223
Registered speaker (customer), 354
Regression coefficients, 262
Residual:
  energy, 255
  error, 84
  signal, 99, 107
Residual-excited LPC vocoder (RELP), 187
Resonance (formant), 30
  characteristics, 12
  circuit, 27, 224
  model, 38
Reversible coding, 133
Rewriting rules (generation rules), 312
Robust algorithms, 339
Robust and flexible speech coding, 211
S
Sampling, 45, 46
  frequency, 46
  period, 46
Scalar quantization, 177
Search:
  one-pass, 320
  multi-pass, 320
Segment quantization, 138
Segmental k-means training procedure, 295
Segmental SNR (SNRseg), 201
Segmentation, 245
Selective listening, 8
Semantic class, 312
Semantic information, 312
Semantic markers, 312
Semantic net, 312
Semantics, 264, 308
Semivowel, 11
Sentence, 6
  hypothesis, 248
Shannon-Fano coding, 133
Shannon's information source coding theory, 133
Shannon-Someya's sampling theorem, 46
Sheep and goats phenomenon, 334, 379
Short-term (spectral envelope) prediction, 148
Short-term spectrum, 52
Side information, 143, 156
Sigmoidal nonlinearities, 399
Signal-to-amplitude-correlated noise ratio, 200
Signal-to-quantization noise ratio (SNR), 50
  of a PCM signal, 142
Similarity matrix, 277
Similarity (distance) measure, 249
Simplified inverse filter tracking (SIFT) algorithm, 79
Single-path search coding, 175
Sinusoidal transform coder (STC), 196
Slope:
  constraint, 270
  overload distortion, 149
Smaller-than-word units, 248
Soft palate (velum), 10
Sound:
  pressure, 33
  source model, 383
  production, 27
  spectrogram (voice print), 14, 60, 70, 349
  spectrograph, 60
Source, 30
  generation, 9
  parameter, 78
    estimation, 98
      from residual signals, 98
Speaker:
  adaptation, 331, 335
    unsupervised, 336
  cluster selection, 335
  identification, 352
  normalization, 331, 334
  recognition, 349
    algorithms, text-independent, 380
    human and computer, 349
    methods, 352
    principles of, 349
    systems:
      examples of, 366
      structure of, 354
      text-dependent, 366
      text-independent, 368
      text-prompted, 373
    text-dependent, 352
    text-independent, 352
[Speaker:]
  text-prompted, 353
  verification, 352
Special-purpose LSIs, 386
Spectral analysis, 52
Spectral clustering, hierarchical, 337
Spectral distance measure, 249
Spectral distortion, 126
Spectral envelope, 52, 64, 351
  prediction, 148
Spectral equalization, 114, 361
Spectral equalizer, 102
Spectral fine structure, 52
Spectral parameters, statistical features of, 362
Spectral mapping, 335
Spectral similarity, 249
Speech:
  acoustic characteristics of, 14
  analysis-synthesis system by LPC, 99
  chain, 8
  coding, 133
    principal techniques for, 133
    voice dependency in, 380
  communication, 1
  corpus, 237
  database, 237
  information processing:
    future directions of, 375
    technologies, 375
  perception mechanism, clarification of, 384
  period detection, 248
  principal characteristics of, 5
  processing:
    basic units for, 381
    technologies, evaluation methods for, 385
[Speech:]
  production, 5, 27, 383
    mechanism, 9
    clarification of, 383
  ratio, 26
  recognition, 243
    advantages of, 243
    based method, 371
    classification of, 246
    continuous, 245
    conversational, 246
    difficulties in, 245
    principles of, 243
    speaker-adaptive, 330
    speaker-dependent, 246
    speaker-independent, 246, 330
  spectral structure of, 52
  statistical characteristics of, 20
  synthesis, 213
    based on analysis-synthesis method, 216, 221
    based on speech production mechanism, 222
    based on waveform coding, 216, 217
    by HMM, 222
    principles of, 213
  synthesizer:
    by J. Q. Stewart, 216
    by von Kempelen, 214
  understanding, 246
SPLIT method, 277, 333
Spoken language, 385
Spontaneous speech recognition, 344
Stability, 101, 107, 121, 391
Stack algorithm, 311
Standardization of speech coding methods, 199, 203
State transition probability, 281
  distribution, 281
State-tying, 320
Stationary Gaussian process, 90
Statistical characteristics, 351
Statistical features, 359
Statistical language modeling, 312, 314
Stochastically excited LPC, 193
Stop consonant, 10
Stress, 7, 264
Sturm-Liouville derivative equation, 38
Subband coding (SBC), 143, 159
Subglottal air pressure, 10
Subjective evaluation, 200
Subword units, 248, 264
Supra-segmental attributes, 264
Syllable, 6
Symmetry, 250
Syntactic information, 312
Syntax, 264, 308
Synthesis by rule, 216, 226
  principles of, 226
Synthesized voice quality, 380
T
Talker recognition, 349
Task evaluation, 385
Technique evaluation, 386
Telephone, 1
Templates, 176
Temporal characteristics, 351
Temporal (inter-session) variability, 360, 381
Terminal analog method, 222, 224
Text-to-speech conversion, 231, 234
Threshold logic elements, 399
Tied-mixture models, 318
Tied-state Gaussian-mixture triphone models, 320
Time:
  and frequency division, 141
  resolution, 60
  warping function, 267
Time-averaged spectrum, 361
Time-domain harmonic scaling (TDHS) algorithm, 168
Time domain pitch synchronous overlap add (TD-PSOLA) method, 220
Toeplitz matrix, 89
Tokyo Institute of Technology, 328
Tongue, 9
Top-down, 248, 308
Trachea, 9
Training mechanism, 331
Transcription, 243, 246, 323
Transform coding, 141
Transitional cepstral coefficient, 252
Transitional cepstral distance, 262
Transitional distance measure, 263
Transitional features, 378
Transitional logarithmic energy, 263
Tree coding:
  variable rate (VTRC), 196
Tree search, 178, 311
  coding, 173
Tree-trellis algorithm, 312
Trellis:
  coding, 173, 184
  diagram, 285
Trigram, 316
Triphone, 318
Two-level DP matching, 295
Two-mass model, 40
U
Unigram, 316
Units of reference templates/models, 247
Universal coding, 134
Unsupervised (online) adaptation, 331
Unvoiced consonant, 11
Unvoiced sound, 11
V
Variable length:
  coding, 133
VCV syllable, 247
VCV units, 228
Vector PCM (VPCM), 176
Vector quantization (VQ), 141, 173, 278, 279
  algorithm, 393
  based method, 370
  based word recognition, 337
  codebook, 337, 370
  for linear predictor parameters, 180
  principles of, 175
Vector-scalar quantization, 179
Velum (soft palate), 10
VFS (vector-field smoothing), 330
Visual units, 382
Viterbi algorithm, 282, 286
Vocal cord, 10
  model, 40
  spectrum, 334, 363
  vibration waveform, 42
Vocal organ, 7
Vocal tract, 9
  analog method, 222, 223
  area, estimation based on PARCOR analysis, 110
  characteristics, 363
  length, 334
  transmission function, 38
  model, 32
Vocal vibration, 10
Vocoder, 73
  baseband, 187
  channel, 76
  correlation, 77
  formant, 77
  homomorphic, 77
  linear predictive, 78
  LSP, 78
  maximum likelihood, 78
  PARCOR, 78
  pattern matching, 77
  voice-excited, 187
Vocoder-driven ATC, 166, 188
Voder by H. Dudley, 216
Voiced consonant, 11
Voiced sound, 11
Voiced/unvoiced decision, 77, 81, 249
Voice-excited LPC vocoder (VELP), 187
Voice individuality, extraction and normalization of, 379
Voice print, 349
Volume velocity, 33
Vowel, 6, 10
  triangle, 16
VQ-based preprocessor, 333
VQ-based word recognition, 337
VSELP, 205
W
Waveform coding, 135
Waveform interpolation (WI), 196
Waveform-based method, 228
Webster's horn equation, 38
Weighted cepstral distance, 260, 370
Weighted distances based on auditory sensitivity, 250
Weighted likelihood ratio (WLR), 258
Weighted slope metric, 262
Whispering, 11
White noise generator, 27
Wiener-Khintchine theorem, 54
Window function, 57
Word, 6, 247
  dictionary, 264
  lattice, 320
  model, 264
  recognition, 247
    systems, structure of, 264
    using phoneme units, 275
  spotting, 249, 303
  template, 264
World model, 366
Y
Yule-Walker equation, 89
Z
Zero-crossing:
  analysis, 70
  number, 248
  rate, 71
Zero-phase impulse response, 77
Z-transform, 68, 387, 388