Signals and Communication Technology
For further volumes: http://www.springer.com/series/4748
Qi (Peter) Li
Speaker Authentication
Dr. Qi (Peter) Li
Li Creative Technologies (LcT), Inc.
Vreeland Road 30 A, Suite 130
Florham Park, NJ 07932
USA
ISSN 1860-4862 ISBN 978-3-642-23730-0 DOI 10.1007/978-3-642-23731-7
e-ISBN 978-3-642-23731-7
Springer Heidelberg Dordrecht London New York
Library of Congress Control Number: 2011939406
© Springer-Verlag Berlin Heidelberg 2012
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law.
The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.
Cover design: eStudio Calamar, Berlin/Figueres
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
To my parents YuanLin Shen and YanBin Li
Preface
My research on speaker authentication started in 1995 when I was an intern at Bell Laboratories, Murray Hill, New Jersey, USA, while working on my Ph.D. dissertation. Later, I was hired by Bell Labs as a Member of Technical Staff, which gave me the opportunity to continue my research on speaker authentication with my Bell Labs colleagues. In 2002, I established Li Creative Technologies, Inc. (LcT), located in Florham Park, New Jersey. At LcT, I am continuing my research in speaker authentication with my LcT colleagues. Recently, when I looked at my publications during the last fifteen years, I found that my research has covered all the major research topics in speaker authentication: from front-end to back-end; from endpoint detection to decoding; from feature extraction to discriminative training; from speaker recognition to verbal information verification. This has motivated me to put my research results together into a book in order to share my experience with my colleagues in the field. This book is organized by research topic. Each chapter focuses on a major topic and can be read independently. Each chapter contains advanced algorithms along with real speech examples and evaluation results to validate the usefulness of the selected topics. Special attention has been given to the topics related to improving overall system robustness and performance, such as robust endpoint detection, fast discriminative training theory and algorithms, detection-based decoding, and sequential authentication. I have also given attention to those novel approaches that may lead to new research directions, such as a recently developed auditory transform (AT) to replace the fast Fourier transform (FFT) and auditory-based feature extraction algorithms. For real applications, a good speaker authentication system must first have an acceptable authentication accuracy and then be robust to background noise, channel distortion, and speaker variability. 
A number of speaker authentication systems can be designed based on the methods and techniques presented in this book. A particular system can be designed to meet required specifications by selecting an authentication method or combining several authentication and decision methods introduced in the book.
Speaker authentication is a subject that relies on the research efforts of many different fields, including, but not limited to, physics, acoustics, psychology, physiology, hearing, auditory nerve, brain, auditory perception, parametric and nonparametric statistics, signal processing, pattern recognition, acoustic phonetics, linguistics, natural language processing, linear and nonlinear programming, optimization, communications, etc. This book only covers a subset of these topics. Due to my limited time and experience, this book only focuses on the topics in my published research. I encourage people with the above backgrounds to consider contributing their knowledge to speech recognition and speaker authentication research. I also encourage colleagues in the field of speech recognition and speaker authentication to extend their knowledge to the above fields in order to achieve breakthrough research results. This book does not include those fundamental topics which have been very well introduced in other textbooks. This author assumes the reader has a basic understanding of linear systems, signal processing, statistics, and pattern recognition. This book can also be used as a reference book for government and company officers and researchers working in information technology, homeland security, law enforcement, and information security, as well as for researchers and developers in the areas of speaker recognition, speech recognition, pattern recognition, and audio and signal processing. It can also be used as a reference or textbook for senior undergraduate and graduate students in electrical engineering, computer science, biomedical engineering, and information management.
Acknowledgments
The author would like to thank the many people who helped the author in his career and in the fields of speaker and speech recognition. I am particularly indebted to Dr. Donald W. Tufts at the Department of Electrical, Computer, and Biomedical Engineering, University of Rhode Island, for his role in guiding and training me in pattern recognition and speech signal processing. Special thanks are due to Dr. S. Parthasarathy and Dr. Aaron Rosenberg, who served as mentors when I first joined Bell Laboratories. They led me into the field of speaker verification research. I am particularly grateful to Dr. Biing-Hwang (Fred) Juang for his guidance in verbal information verification research. The work extended speaker recognition to speaker authentication, which has broader applications. Most topics in this book were prepared based on previously published peer-reviewed journal and conference papers where I served as the first author. I would like to thank all the coauthors of those publications, namely Dr. Donald Tufts, Dr. Peter Swaszek, Dr. S. Parthasarathy, Dr. Aaron Rosenberg, Dr. Biing-Hwang Juang, Dr. Frank Soong, Dr. Chin-Hui Lee, Qiru Zhou, Jinsong Zheng, Dr. Augustine Tsai, and Yan Huang. Also, I would like to
thank the many anonymous reviewers and editors for their helpful comments and suggestions. The author also would like to thank Dr. Bishnu Atal, Dr. Joe Olive, Dr. Wu Chou, Dr. Oliver Siohan, Dr. Mohan Sondhi, Dr. Oded Ghitza, Dr. Jingdong Chen, Dr. Rafid Sukkar, Dr. Larry O’Gorman, Dr. Richard Rose, and Dr. David Roe, all former Bell Laboratories colleagues, for their useful discussions and their kind help and support on my research there. Also, I would like to thank Dr. Ivan Selesnick for our recent collaborations. Within Li Creative Technologies, the author would like to thank Yan Huang and Yan Yin for our recent collaborations in speaker identification research. I also would like to thank my colleagues Dr. Manli Zhu, Dr. Bozhao Tan, Uday Jain, and Joshua Hajicek for useful discussions on biometrics, acoustic, speech, and hearing systems. From 2008 to 2010, the author’s research on speaker identification was supported by the U.S. AFRL under the contract number FA8750-08-C-0028. I would like to thank program managers Michelle Grieco, John Parker, and Dr. Stanly Wenndt for their help and support. Some of the research results have been included in Chapter 7 and Chapter 8 of this book. Other results will be published later. I would like to thank my colleague Craig B. Adams and my daughter Joy Y Li for their work in editing this book. I would like to thank Uday Jain, Dr. Manli Zhu and Dr. Bozhao Tan for their proofreading. Also, this book could not have been finished without the support of my wife Vivian for the many weekends which I spent working on it. The author also would like to thank the IEEE Intellectual Property Rights Office for permissions to use the IEEE copyright materials which I previously published in IEEE publications in the book. Finally, I would like to thank Dr. Christoph Baumann, Engineering Editor at Springer, for his kind invitation to prepare and publish this book.
Florham Park, NJ, July 2011
Qi (Peter) Li
Contents
1 Introduction
  1.1 Authentication
  1.2 Biometric-Based Authentication
  1.3 Information-Based Authentication
  1.4 Speaker Authentication
    1.4.1 Speaker Recognition
    1.4.2 Verbal Information Verification
  1.5 Historical Perspective and Further Reading
  1.6 Book Organization
  References

2 Multivariate Statistical Analysis and One-Pass Vector Quantization
  2.1 Multivariate Gaussian Distribution
  2.2 Principal Component Analysis
  2.3 Vector Quantization
  2.4 One-Pass VQ
    2.4.1 The One-Pass VQ Algorithm
    2.4.2 Steps of the One-Pass VQ Algorithm
    2.4.3 Complexity Analysis
    2.4.4 Codebook Design Examples
    2.4.5 Robustness Analysis
  2.5 Segmental K-Means
  2.6 Conclusions
  References

3 Principal Feature Networks for Pattern Recognition
  3.1 Overview of the Design Concept
  3.2 Implementations of Principal Feature Networks
  3.3 Hidden Node Design
    3.3.1 Gaussian Discriminant Node
    3.3.2 Fisher’s Node Design
  3.4 Principal Component Hidden Node Design
    3.4.1 Principal Component Discriminant Analysis
  3.5 Relation between PC Node and the Optimal Gaussian Classifier
  3.6 Maximum Signal-to-Noise-Ratio (SNR) Hidden Node Design
  3.7 Determining the Thresholds from Design Specifications
  3.8 Simplification of the Hidden Nodes
  3.9 Application 1 – Data Recognition
  3.10 Application 2 – Multispectral Pattern Recognition
  3.11 Conclusions
  References

4 Non-Stationary Pattern Recognition
  4.1 Introduction
  4.2 Gaussian Mixture Models (GMM) for Stationary Process
    4.2.1 An Illustrative Example
  4.3 Hidden Markov Model (HMM) for Non-Stationary Process
  4.4 Speech Segmentation
  4.5 Bayesian Decision Theory
  4.6 Statistical Verification
  4.7 Conclusions
  References

5 Robust Endpoint Detection
  5.1 Introduction
  5.2 A Filter for Endpoint Detection
  5.3 Real-Time Endpoint Detection and Energy Normalization
    5.3.1 A Filter for Both Beginning- and Ending-Edge Detection
    5.3.2 Decision Diagram
    5.3.3 Real-Time Energy Normalization
    5.3.4 Database Evaluation
  5.4 Conclusions
  References

6 Detection-Based Decoder
  6.1 Introduction
  6.2 Change-Point Detection
  6.3 HMM State Change-Point Detection
  6.4 HMM Search-Space Reduction
    6.4.1 Concept of Search-Space Reduction
    6.4.2 Algorithm Summary and Complexity Analysis
  6.5 Experiments
    6.5.1 An Example of State Change-Point Detection
    6.5.2 Application to Speaker Verification
  6.6 Conclusions
  References

7 Auditory-Based Time Frequency Transform
  7.1 Introduction
    7.1.1 Observing Problems with the Fourier Transform
    7.1.2 Brief Introduction of the Ear
    7.1.3 Time-Frequency Analyses
  7.2 Definition of the Auditory-Based Transform
  7.3 The Inverse Auditory Transform
  7.4 The Discrete-Time and Fast Transform
  7.5 Experiments and Discussions
    7.5.1 Verifying the Inverse Auditory Transform
    7.5.2 Applications
  7.6 Comparisons to Other Transforms
  7.7 Conclusions
  References

8 Auditory-Based Feature Extraction and Robust Speaker Identification
  8.1 Introduction
  8.2 Auditory-Based Feature Extraction Algorithm
    8.2.1 Forward Auditory Transform and Cochlea Filter Bank
    8.2.2 Cochlear Filter Cepstral Coefficients (CFCC)
    8.2.3 Analysis and Comparison
  8.3 Speaker Identification and Experimental Evaluation
    8.3.1 Experimental Datasets
    8.3.2 The Baseline Speaker Identification System
    8.3.3 Experiments
    8.3.4 Further Comparison with PLP and RASTA-PLP
  8.4 Conclusions
  References

9 Fixed-Phrase Speaker Verification
  9.1 Introduction
  9.2 A Fixed-Phrase System
  9.3 An Evaluation Database and Model Parameters
  9.4 Adaptation and Reference Results
  9.5 Conclusions
  References

10 Robust Speaker Verification with Stochastic Matching
  10.1 Introduction
  10.2 A Fast Stochastic Matching Algorithm
  10.3 Fast Estimation for a General Linear Transform
  10.4 Speaker Verification with Stochastic Matching
  10.5 Database and Experiments
  10.6 Conclusions
  References

11 Randomly Prompted Speaker Verification
  11.1 Introduction
  11.2 Normalized Discriminant Analysis
  11.3 Applying NDA in the Hybrid Speaker-Verification System
    11.3.1 Training of the NDA System
    11.3.2 Training of the HMM System
    11.3.3 Training of the Data Fusion Layer
  11.4 Speaker Verification Experiments
    11.4.1 Experimental Database
    11.4.2 NDA System Results
    11.4.3 Hybrid Speaker-Verification System Results
  11.5 Conclusions
  References

12 Objectives for Discriminative Training
  12.1 Introduction
  12.2 Error Rates vs. Posterior Probability
  12.3 Minimum Classification Error vs. Posterior Probability
  12.4 Maximum Mutual Information vs. Minimum Classification Error
  12.5 Generalized Minimum Error Rate vs. Other Objectives
  12.6 Experimental Comparisons
  12.7 Discussion
  12.8 Relations between Objectives and Optimization Algorithms
  12.9 Conclusions
  References

13 Fast Discriminative Training
  13.1 Introduction
  13.2 Objective for Fast Discriminative Training
  13.3 Derivation of Fast Estimation Formulas
    13.3.1 Estimation of Covariance Matrices
    13.3.2 Determination of Weighting Scalar
    13.3.3 Estimation of Mean Vectors
    13.3.4 Estimation of Mixture Parameters
    13.3.5 Discussions
  13.4 Summary of Practical Training Procedure
  13.5 Experiments
    13.5.1 Continuing the Illustrative Example
    13.5.2 Application to Speaker Identification
  13.6 Conclusions
  References

14 Verbal Information Verification
  14.1 Introduction
  14.2 Single Utterance Verification
    14.2.1 Normalized Confidence Measures
  14.3 Sequential Utterance Verification
    14.3.1 Examples in Sequential-Test Design
  14.4 VIV Experimental Results
  14.5 Conclusions
  References

15 Speaker Authentication System Design
  15.1 Introduction
  15.2 Automatic Enrollment by VIV
  15.3 Fixed-Phrase Speaker Verification
  15.4 Experiments
    15.4.1 Features and Database
    15.4.2 Experimental Results on Using VIV for SV Enrollment
  15.5 Conclusions
  References

Index
List of Tables
1.1 List of Biometric Authentication Error Rates
2.1 Quantizer MSE Performance
2.2 Comparison of One-Pass and LBG Algorithms
2.3 Comparison of Different VQ Design Approaches
2.4 Comparison for the Correlated Gaussian Source
2.5 Comparison on the Laplace Source
3.1 Comparison of Three Algorithms in the Land Cover Recognition
5.1 Database Evaluation Results (%)
7.1 Correlation Coefficients, $\sigma_{12}^2$, for Different Sizes of Filter Bank in AT/inverse AT
8.1 Summary of the Training, Development, and Testing Set
8.2 Comparison of MFCC, MGFCC, and CFCC Features Tested on the Development Set
9.1 Experimental Results in Average Equal-Error Rates of All Tested Speakers
10.1 Experimental Results in Average Equal-Error Rates (%)
11.1 Segmentation of the Database
11.2 Results on Discriminant Analysis
11.3 Major Results
12.1 Comparisons on Training Algorithms
13.1 Three-Class Classification Results of the Illustration Example
13.2 Comparison on Speaker Identification Error Rates
14.1 False Acceptance Rates when Using Two Thresholds and Maintaining False Rejection Rates to Be 0.0%
14.2 Comparison on Two and Single Threshold Tests
14.3 Summary of the Experimental Results on Verbal Information Verification
15.1 Experimental Results without Adaptation in Average Equal-Error Rates
15.2 Experimental Results with Adaptation in Average Equal-Error Rates
List of Figures
1.1 1.2 1.3
Speaker authentication approaches. . . . . . . . . . . . . . . . . . . . . . . . . . 6 A speaker verification system. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7 An example of verbal information verification by asking sequential questions. Similar sequential tests can also be applied in speaker recognition and other biometric or multi-modality verification. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
2.1
An example of bivariate Gaussian distribution: ρ11 = 1.23, ρ12 = ρ21 = 0.45, and ρ22 = 0.89. . . . . . . . . . . . . . . . . . . . . . . . . . . The contour of the Gaussian distribution in Fig. 2.1. . . . . . . . . . . An illustration of a constant density ellipse and the principal components for a normal random vector X. The largest eigenvalue associates with the long axis of the ellipse and the second eigenvalue associates with the short axis. The eigenvectors associate with the axes. . . . . . . . . . . . . . . . . . . . . . . . . The method to determine a code vector: (a) select the highest density cell; (b) examine a group of cells around the selected one; (c) estimate the center of the data subset; (d) cut a “hole” in the training data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . . The Principal Component (PC) method to determine a centroid. Left: Uncorrelated Gaussian source training data. Right: The residual data after four code vectors have been located. . . . . . . . Left: The residual data after all 16 code vectors have been located. Right: The “+” and “◦” are the centroids after one and three iterations of the LBG algorithm, respectively. . . . . . . . Left: The Laplace source training data. Right: The residual data, one-pass designed centroids “+”, and one-pass+2LBG centroids “◦”. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
2.2 2.3
2.4
2.5 2.6 2.7
2.8
24 25
26
29 31 35
36
38
xix
xx
List of Figures
3.1
3.2 3.3 3.4 3.5 3.6
3.7 3.8
4.1
An illustrative example to demonstrate the concept of the PFN: (a) The original training data of two labeled classes which are not linearly separable. (b) The hyperplanes of the first hidden node (LDA node). (c) The residual data set and the hyperplanes of the second hidden node (SNR node). (d) The input space partitioned by two hidden nodes and four thresholds designed by the principal feature classification (PFC) method. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A parallel implementation of PFC by a Principal Feature Network (PFN). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . A sequential implementation of PFC by a Principal Feature Tree (PFT). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (a) Partitioned input space for parallel implementation. (b) Parallel implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . (a) Partitioned input space for sequential implementation. (b) Sequential (tree) implementation. . . . . . . . . . . . . . . . . . . . . . . . . . . (a) A single Gaussian discriminant node. (b) A Fisher’s node. (c) A quadratic node. (d) An approximation of the quadratic node. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . When using only Fisher’s nodes, three hidden nodes and six thresholds are needed to finish the design. . . . . . . . . . . . . . . . . . . . Application 1: (a) (bottom) The sorted contribution of each threshold in the order of its contribution to the class separated by the threshold. (b) (top) Accumulated network performance in the order of the sorted thresholds. . . . . . . . . . . . . . . . . . . . . . . .
45 47 47 48 49
50 54
57
4.1 Class 1: a bivariate Gaussian distribution with m1 = [0 5], m2 = [−3 3], and m3 = [−5 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [1.22 0.09; 0.09 1.22], and Σ3 = [1.37 0.37; 0.27 1.37].
4.2 Class 2: a bivariate Gaussian distribution with m1 = [2 5], m2 = [−1 3], and m3 = [0 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.77 1.11; 1.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41].
4.3 Class 3: a bivariate Gaussian distribution with m1 = [−3 −1], m2 = [−2 −2], and m3 = [−5 −2]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.76 0.11; 0.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41].
4.4 Contours of the pdf’s of 3-mixture GMM’s: the models are used to generate 3 classes of training data.
4.5 Contours of the pdf’s of 2-mixture GMM’s: the models are trained from ML estimation using 4 iterations.
4.6 Enlarged decision boundaries for the ideal 3-mixture models (solid line) and 2-mixture ML models (dashed line).
4.7 Left-to-right hidden Markov model.
5.1 Shape of the designed optimal filter.
5.2 Endpoint detection and energy normalization for real-time ASR.
5.3 State transition diagram for endpoint decision.
5.4 Example: (A) Energy contour of digit “4”. (B) Filter outputs and state transitions.
5.5 (A) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (B) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (C) Detected endpoints and normalized energy for the 20 dB SNR case, and (D) for the 5 dB SNR case.
5.6 Comparisons on real-time connected digit recognition with various signal-to-noise ratios (SNR’s). From 5 to 20 dB SNR’s, the introduced real-time algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively.
5.7 (A) Energy contour of the 523rd utterance in DB5: “1 Z 4 O 5 8 2”. (B) Endpoints and normalized energy from the baseline system. The utterance was recognized as “1 Z 4 O 5 8”. (C) Endpoints and normalized energy from the real-time, endpoint-detection system. The utterance was recognized correctly as “1 Z 4 O 5 8 2”. (D) The filter output.
6.1 The scheme of the change-point detection algorithm with tδ = 2: (a) the endpoint detection for state 1; (b) the endpoint detection for state 2; and (c) the grid points involved in p1, p2, and p3 computations (dots).
6.2 Left-to-right hidden Markov model.
6.3 All the grid points construct a full search space Ψ. The grid points involved in the change-point detection are marked as black points. A single path (solid line) is detected from the forward and backward change-point detection.
6.4 A “hole” is detected from the forward and backward state change-point detection. A search is needed only among four grid points, (8,3), (8,4), (9,3) and (9,4). The solid line indicates the path with the maximum likelihood score.
6.5 A search is needed in the reduced search space Ω which includes all the black points in between the two dashed lines. The points along the dashed lines are involved in change-point detection, but they do not belong to the reduced search space.
6.6 A special case is located between (11,4) and (18,6), where the forward boundary is under the backward one. A full search can be done in the subspace {(t, s_t) | 11 ≤ t ≤ 18; 4 < s_t < 6}.
6.7 The procedure of sequential state change-point detection from state 1 (top) to state 7 (bottom), where the vertical dashed lines are the detected endpoints of each state.
6.8 The procedure of sequential state change-point detection from state 8 (top) to state 13 (bottom).
6.9 (a) Comparison of average individual equal-error rates (EER’s); (b) Comparison of average speedups.
7.1 Male’s voice: “2 0 5” recorded simultaneously by close-talking (top) and hands-free microphones in a moving car (bottom).
7.2 The speech waveforms in Fig. 7.1 were converted to spectrograms by FFT and displayed in Bark scale from 0 to 16.4 Barks (0 to 3500 Hz). The background noise and the pitch harmonics were generated mainly by FFT.
7.3 The spectrum of FFT at the 1.15 second time frame from Fig. 7.2: The solid line represents the speech from a close-talking microphone. The dashed line is from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously. The FFT spectrum shows 30 dB distortion at low frequency bands due to background noise and pitch harmonics as noise.
7.4 Illustration of human ear and cochlea.
7.5 Illustration of a stretched out cochlea and a traveling wave exciting a portion of the basilar membrane.
7.6 Impulse responses of the BM in the AT when α = 3 and β = 0.2. They are very similar to the research results reported in hearing research.
7.7 The frequency responses of the cochlear filters when α = 3: (A) β = 0.2; and (B) β = 0.035.
7.8 The traveling wave generated by the auditory transform from the speech data in Fig. 7.1.
7.9 A section of the traveling wave generated by the auditory transform.
7.10 Spectrograms from the output of the cochlear transform for the speech data in Fig. 7.1. The spectrogram at top is from the data recorded by the close-talking microphone, while the spectrogram at bottom is from the hands-free microphone.
7.11 The spectrum of AT at the 1.15 second time frame from Fig. 7.10: The solid line represents the speech from a close-talking microphone. The dashed line is from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously.
7.12 Comparison of speech waveforms: (A) The original waveform of a male voice speaking the words “two, zero, five.” (B) The synthesized waveform by inverse AT with the bandwidth of 80 Hz to 5 kHz. When the filter numbers are 8, 16, 32, and 64, the correlation coefficients σ₁₂² for the two speech data sets are 0.74, 0.96, 0.99, and 0.99, respectively.
7.13 (A) and (B) are speech waveforms simultaneously recorded in a moving car. The microphones are located on the car visor (A) and speaker’s lapel (B), respectively. (C) is after noise reduction using the AT from the waveform in (A), where results are very similar to (B).
7.14 Comparison of FT and AT spectrums: (A) The FFT spectrogram of a male voice “2 0 5”, warped into the Bark scale from 0 to 16.4 Barks (0 to 3500 Hz). (B) The spectrogram from the cochlear filter output for the same male voice. The AT is harmonic free and has less computational noise.
7.15 Comparison of AT (top) and FFT (bottom) spectrums at the 1.15 second time frame for robustness: The solid line represents speech from a close-talking microphone. The dashed line represents speech from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously. The FFT spectrum shows 30 dB distortion at low-frequency bands due to background noise compared to the AT. Compared to the FFT spectrum, the AT spectrum has no pitch harmonics and much less distortion at low frequency bands due to background noise.
7.16 The Gammatone filter bank: (A) The frequency responses of the Gammatone filter bank generated by (7.19). (B) The frequency responses of the Gammatone filter bank generated by (7.19) plus an equal-loudness function.
8.1 Schematic diagram of the auditory-based feature extraction algorithm named cochlear filter cepstral coefficients (CFCC).
8.2 Comparison of MFCC, MGFCC, and the CFCC features tested on noisy speech with white noise.
8.3 Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with car noise.
8.4 Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with babble noise.
8.5 Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with white noise.
8.6 Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with car noise.
8.7 Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with babble noise.
9.1 A fixed-phrase speaker verification system.
10.1 A geometric interpretation of the fast stochastic matching. (a) The dashed line is the contour of training data. (b) The solid line is the contour of test data. The crosses are the means of the two data sets. (c) The test data were scaled and rotated toward the training data. (d) The test data were translated to the same location as the training data. Both contours overlap each other.
10.2 A phrase-based speaker verification system with stochastic matching.
11.1 The structure of a hybrid speaker verification (HSV) system.
11.2 The NDA feature extraction.
11.3 The Type 2 classifier (NDA system) for one speaker.
13.1 Contours of the pdf’s of 3-mixture GMM’s: the models are used to generate three classes of training data.
13.2 Contours of the pdf’s of 2-mixture GMM’s: the models are from ML estimation using four iterations.
13.3 Contours of the pdf’s of 2-mixture GMM’s: The models are from the fast GMER estimation with two iterations on top of the ML estimation results. The overlaps among the three classes are significantly reduced.
13.4 Enlarged decision boundaries for the ideal 3-mixture models (solid line), 2-mixture ML models (dashed line), and 2-mixture GMER models (dash-dotted line): After GMER training, the boundary of ML estimation shifted toward the decision boundary of the ideal models. This illustrates how GMER training improves decision accuracies.
13.5 Performance improvement versus iterations using the GMER estimation: The initial performances were from the ML estimation with four iterations.
14.1 An example of verbal information verification by asking sequential questions. (Similar sequential tests can also be applied in speaker verification and other biometric or multi-modality verification.)
14.2 Utterance verification in VIV.
14.3 False acceptance rate as a function of robust interval with SD threshold for a 0% false rejection rate. The horizontal axis indicates the shifts of the values of the robust interval τ.
14.4 An enlarged graph of the system performances using two and three questions.
15.1 A conventional speaker verification system.
15.2 An example of speaker authentication system design: Combining verbal information verification with speaker verification.
15.3 A fixed-phrase speaker verification system.
Chapter 1 Introduction
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_1, © Springer-Verlag Berlin Heidelberg 2012

1.1 Authentication

Authentication is the process of positively verifying the identity of a user, device, or any entity in a computer system, often as a prerequisite to allowing access to resources in the system [49]. Authentication has been used by humans for thousands of years to recognize each other, to identify friends and enemies, and to protect their information and assets. In the computer era, the purpose of identification is not only to identify people in our presence, but also to identify people in remote locations, computers on a network, or any entity in computer networks. As such, authentication has been extended from a manual identification process to an automatic one. People are paying more and more attention to security and privacy; thus authentication processes are everywhere in our daily lives. Automatic authentication technology is now necessary for all computer and network access, and it plays an important role in security.

In general, an automatic authentication process includes two sessions: enrollment/registration and testing/verification. During an enrollment/registration session, the identity of a user or entity is verified and an authentication method is initialized by both parties. During a testing/verification session, the user must follow the same authentication method to prove their identity. If the user passes the authentication procedure, the user is accepted and allowed to access the protected territory, networks, or systems; if not, the user is rejected and no access is allowed.

One example is the authentication procedure in banks. When a customer opens an account, the bank asks the customer to show a passport, a driver’s license, or other documents to verify the customer’s identity. An authentication method, such as an account number plus a PIN (personal identification number), is then initialized during the registration session. When the customer wants to access the account, the customer must provide both the account number and the PIN for verification. We will see similar procedures in automatic authentication systems using speech processing throughout this book.

Following the discussions in [32], authentication can be further differentiated into human-human, machine-machine, human-machine, and machine-human authentication. Human-human authentication is the traditional method. This can be done by visually verifying a human face or signature or by identifying a speaker’s voice on the phone. As the most fundamental method, human-human authentication continues to play an important role in our daily lives. Machine-machine authentication is the process by which the authentication of machines is done by machines automatically. Since the process on both sides can be pre-designed for the best performance in terms of accuracy and speed, this kind of authentication usually provides very high performance. Examples are encryption and decryption procedures, well-established protocols, and secured interface designs. Machine-human authentication is the process by which a person verifies a machine-generated identity or password. For example, people can identify a host ID, an account name, or the manufacturing code of a machine or device. Finally, human-machine authentication, also called user authentication, is the process of verifying the validity of a claimed user by a machine automatically [32]. Because of its wide range of potential applications, this is a very active research area. We will focus our discussions on user authentication.

We can further label the user authentication process as information based, token based, or biometric based [32]. The information-based approach is characterized by the use of secret or private information. A typical example of the information-based approach is to ask private questions, such as one’s mother’s maiden name or the last four digits of one’s social security number. The information can be updated or changed at various times. For example, the private questions can concern the date and amount of the last deposit.
The token-based approach is characterized by the use of physical objects, such as metal keys, magnetic keys, electronic keys, or photo IDs. The biometric-based approach is characterized by biometric matching, such as using voice, fingerprint, iris, signature, or DNA characteristics. In real applications, an authentication system may combine two or all three of the above approaches. For example, a bank or credit card authentication system may include an information-based approach (a PIN), a token-based approach (a bank card), and a biometric-based approach (a signature).

In terms of decision procedure, authentication can be sequential or parallel. A sequential procedure makes its decision using “AND” logic, and a parallel procedure makes its decision using “OR” logic. Using the above bank example, if an authentication decision is made by investigating the bank card, the PIN, and the signature one by one, it is a sequential process. If the decision is made by investigating only one of the three, it is a parallel decision process. Details of the sequential decision procedure will be discussed in Chapter 14.

We now discuss the biometric-based and information-based approaches in more detail in the following sections. A speaker authentication system can be implemented by both approaches.
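The difference between the sequential (“AND”) and parallel (“OR”) decision procedures can be illustrated with a minimal Python sketch. The function names and the representation of each test as a zero-argument callable are assumptions made for illustration only; they are not taken from the systems described in this book.

```python
from typing import Callable, Sequence

# Hypothetical representation: each authentication test is a callable
# that returns True when the user passes that test.
Check = Callable[[], bool]


def sequential_decision(checks: Sequence[Check]) -> bool:
    """AND logic: accept only if every test passes, evaluated one by one."""
    if not checks:
        raise ValueError("at least one check is required")
    # all() stops at the first failed test, like a sequential procedure.
    return all(check() for check in checks)


def parallel_decision(checks: Sequence[Check]) -> bool:
    """OR logic: accept as soon as any single test passes."""
    if not checks:
        raise ValueError("at least one check is required")
    return any(check() for check in checks)


# Assumed outcomes for the bank example: card passes, PIN fails, signature passes.
bank_checks = [lambda: True, lambda: False, lambda: True]
print(sequential_decision(bank_checks))  # False
print(parallel_decision(bank_checks))    # True
```

With the same three test outcomes, the sequential procedure rejects the user because one test failed, while the parallel procedure accepts because at least one test passed.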
1.2 Biometric-Based Authentication

A biometric is a physical or biomedical characteristic or personal trait of a person that can be measured with or without contact and used to recognize the person by an authentication process. For practical applications, useful biometrics must have the following properties [51, 15]:

• Measurability: A characteristic or trait can be captured by a sensor and extracted by an algorithm.
• Distinctiveness: A measure of a characteristic is significantly different from subject to subject.
• Repeatability: A measured characteristic can be repeated through multiple measurements in different environments, at different locations, and at different times.
• Robustness: A measure of a characteristic does not change significantly when the same subject is measured over a period of time, in different environments, or even by different sensors.
• Accessibility: A subject is willing to cooperate for feature extraction. The biometric authentication process can be finished within an acceptable time and at an acceptable cost.
Well-known and popular biometrics include voice, fingerprint, iris, retina, face, hand, signature, and DNA. The measurement or feature extraction can be based on different kinds of data, such as optical image, video, audio, thermal video, or DNA sequence. Some biometric features are easy to capture, such as voice and face image; others may be difficult, such as DNA sequence. Some features can have very high authentication distinctiveness, such as DNA and iris; others may not. Also, some of them can be captured at very low cost, such as voice; others can be too expensive in terms of cost and time, such as DNA, given the existing technology. When developing a biometric system, all the above factors need to be considered. For the use of speech as a biometric, its measurability and accessibility are advantages, but distinctiveness, repeatability, and robustness are the challenges to speaker authentication. Most of the topics in this book address these challenges.

In terms of statistical properties, biometric signals can be divided into two categories, stationary and non-stationary. A biometric signal is stationary if its probability density function does not vary over any time shift during feature extraction; otherwise, it is non-stationary. Based on this definition, fingerprint, iris, hand, and still face can be considered stationary biometric signals, while speech and video are non-stationary biometric signals. Special statistical methods are then needed to handle the authentication process on a non-stationary biometric signal.

An automatic biometric authentication system is essentially a pattern recognition system that operates by enrollment and testing sessions, as discussed above. During the enrollment session, the biometric data from an individual is acquired, features are extracted from the data, and statistical models are trained based on the extracted features. During the testing session, the acquired data is compared with the statistical models for an authentication decision.

A biometric authentication process can be used to recognize a person by identification or verification. Identification is the process of searching for a match through a group of previously enrolled or characterized biometric information. It is often called “one-to-many” matching. Verification is the process of comparing the claimed identity with one previously enrolled or one set of characterized information. It is often called “one-to-one” matching.

To give our readers an understanding of the accuracy of biometric authentication, we collected some reported results in Table 1.1. We note that it is unfair to use a table to compare different biometrics because the evaluations were designed for different purposes. Data collections and experiments are based on different conditions and objectives. Also, different biometric modalities have different advantages and disadvantages. The purpose is to give readers a general level of understanding of biometric approaches.
Since the numbers in the table may not represent the latest experimental and evaluation results, readers are encouraged to look at the references and find the detailed evaluation methods and approaches for each of the evaluations. In the table, we added our experimental results on speaker verification and verbal information verification as reported in several chapters of this book. We note that in the speaker verification experiments all the tested speakers use the same pass-phrase, and the pass-phrase is less than two seconds long on average. If different speakers used different pass-phrases, the speaker verification equal-error rate could be below 0.5%. We also note that the NIST 2006 error rates were extracted from the detection error tradeoff (DET) curves in [28]. For detailed results about the evaluation, readers should read [37].

Table 1.1. List of Biometric Authentication Error Rates

| Biometric | Task | Description | False Rejection | False Acceptance |
|---|---|---|---|---|
| Iris | MBGC2009 [34] | Still, HD-NIR portal video | 10% | 0.1% |
| Iris | ICE2006 [35] | Left/right eye | 1.0% / 1.5% | 0.1% |
| Fingerprint | FVC2006 [13] | Open category, 4 databases | 0.021%-5.56% | 0.021%-5.56% |
| Fingerprint | FVC2006 [13] | Light category, 4 databases | 0.148%-5.4% | 0.148%-5.4% |
| Fingerprint | FVC2004 [12] | Average 20 years old | 2% | 2% |
| Fingerprint | FpVTE2003 [10] | US Gov. Ops. Data | 0.1% | 1.0% |
| Face | MBGC2009 [34] | Still/portable (controlled-uncontrolled) | 1%-5% | 0.1% |
| Face | FRVT2006 [13] | Varied resolution | 1%-5% | 0.1% |
| Face | FRVT2006 [13] | Varied lighting | 1% | 0.1% |
| Speech | NIST2006 [37] | Speaker recognition, text independent | 3% (from DET) | 1.2% (from DET) |
| Speech | See Table 10.1 | Speaker verification, text dependent | 1.8% | 1.8% |
| Speech | See Table 14.3 | Verbal information verification | 0% | 0% |

As we discussed above, biometric tests can also be combined, and a biometric-based decision procedure can be parallel or sequential. A decision can be made based on one or multiple kinds of biometric tests. Furthermore, biometric-based authentication can be combined with other authentication methods. For example, in [30, 31] speech was used to generate the key for encryption, which is a combination of biometric-based and information-based authentication; therefore, understanding authentication technology and methods can help readers design specific authentication procedures to meet application requirements.

We note that among all biometric modalities, a person’s voice is one of the most convenient biometrics for user authentication purposes because it is easy to produce, capture, and transmit over the ubiquitous telephone and Internet networks. It also can be supported with existing telephone and wireless communication services without requiring special or additional hardware.
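The false rejection and false acceptance rates listed in Table 1.1, and the equal-error rate mentioned with them, can be made concrete with a small Python sketch. It assumes that higher scores mean a better match and that a trial is accepted when its score is at or above the threshold; the function names and the toy scores are hypothetical and are not data from any of the cited evaluations.

```python
from typing import Sequence, Tuple


def error_rates(genuine: Sequence[float], impostor: Sequence[float],
                threshold: float) -> Tuple[float, float]:
    """Return (false rejection rate, false acceptance rate) at a threshold."""
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    # A genuine trial scoring below the threshold is falsely rejected.
    frr = sum(s < threshold for s in genuine) / len(genuine)
    # An impostor trial scoring at or above the threshold is falsely accepted.
    far = sum(s >= threshold for s in impostor) / len(impostor)
    return frr, far


def approximate_eer(genuine: Sequence[float],
                    impostor: Sequence[float]) -> Tuple[float, float, float]:
    """Sweep observed scores as thresholds and return (threshold, FRR, FAR)
    at the point where the two error rates are closest to each other."""
    best = None
    for t in sorted(set(genuine) | set(impostor)):
        frr, far = error_rates(genuine, impostor, t)
        if best is None or abs(frr - far) < abs(best[1] - best[2]):
            best = (t, frr, far)
    return best


# Toy scores, invented for illustration only.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(error_rates(genuine_scores, impostor_scores, 0.5))  # (0.25, 0.25)
print(approximate_eer(genuine_scores, impostor_scores))   # (0.6, 0.25, 0.25)
```

Raising the threshold lowers the false acceptance rate at the cost of more false rejections, which is why evaluations report either a pair of rates at one operating point or the rate at which the two are equal.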
1.3 Information-Based Authentication

Information-based authentication is the process of securing a transaction by pre-registered information and knowledge. The most popular examples are a password, a PIN, one’s mother’s maiden name, or other private information. The private information is pre-registered with a bank, trust agent, or computer server, and a user needs to provide exactly the same information to gain access. More complex authentication systems use cryptographic protocols, where the key is the pre-registered information for encryption and decryption [47].

Although the private information can be provided by typing, this is inconvenient for applications where keyboards are unavailable or difficult to access, such as handheld devices and telephone communications. When speech is used to provide the information, operators are then needed for the process. Verifying the verbal information automatically is called verbal information verification, which will be discussed in detail in this book.
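The pre-registration and exact-match check described at the start of this section can be sketched in Python as follows. Storing a salted hash instead of the secret itself is a common engineering practice that is assumed here; it is not described in this chapter, and the function names and iteration count are illustrative choices only.

```python
import hashlib
import hmac
import os
from typing import Tuple

ITERATIONS = 100_000  # assumed work factor, chosen for illustration


def enroll_secret(secret: str) -> Tuple[bytes, bytes]:
    """Enrollment: derive and return (salt, digest) to store for the user."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", secret.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def verify_secret(candidate: str, salt: bytes, digest: bytes) -> bool:
    """Verification: accept only if the candidate reproduces the stored digest."""
    attempt = hashlib.pbkdf2_hmac("sha256", candidate.encode("utf-8"),
                                  salt, ITERATIONS)
    # Constant-time comparison avoids leaking how many bytes matched.
    return hmac.compare_digest(attempt, digest)


salt, digest = enroll_secret("1234")          # hypothetical PIN
print(verify_secret("1234", salt, digest))    # True
print(verify_secret("4321", salt, digest))    # False
```

The same two-step structure, registration followed by an exact comparison, applies whether the information is typed or, as in verbal information verification, recovered from speech.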
1.4 Speaker Authentication

As discussed above, speaker authentication is concerned with authenticating a user’s identity via voice characteristics, a biometric-based authentication, or by verbal content, an information-based authentication. There are two major approaches to speaker authentication: speaker recognition (SR) and verbal information verification (VIV). The SR approach attempts to verify a speaker’s identity based on his/her voice characteristics while the VIV approach verifies a speaker’s identity through verification of the content of his/her utterance(s).
[Figure: tree diagram. Speaker Authentication divides into Speaker Recognition (authentication by speech characteristics) and Verbal Information Verification (authentication by verbal content). Speaker Recognition divides into Speaker Verification and Speaker Identification.]

Fig. 1.1. Speaker authentication approaches.
As shown in Fig. 1.1, the approaches to speaker authentication can be categorized into two groups: one uses a speaker’s voice characteristics, which leads to speaker recognition, and the other focuses on the verbal content of the spoken utterance, which leads to verbal information verification (VIV). Based on our above definitions, speaker recognition, recognizing who is talking, is a biometric-based authentication, while VIV, verifying a user based on what is being said, is an information-based authentication. These two techniques can be further combined to provide an enhanced system as indicated by the dashed line.
[Figure: block diagram. Enrollment session: training utterances (“Open Sesame”, repeated three times) feed Model Training, which produces a speaker-dependent model stored in a database. Test session: an identity claim and a test utterance (“Open Sesame”) feed the Speaker Verifier, which outputs scores.]

Fig. 1.2. A speaker verification system.
1.4.1 Speaker Recognition

Speaker recognition, as one of the voice authentication techniques, has been studied for several decades [2, 1, 39, 5, 4, 23]. As shown in Fig. 1.1, speaker recognition (SR) can be formulated in two operating modes, speaker verification and speaker identification. Speaker verification (SV) is the process of verifying whether an unknown speaker is the person as claimed, i.e., a yes-no, or one-to-one, hypothesis testing problem. Speaker identification (SID) is the process of associating an unknown speaker with a member of a pre-registered, known population, i.e., a one-to-many matching problem.

A typical SV system is shown in Fig. 1.2, which has two operating scenarios: enrollment session and test session. A speaker needs to enroll before she or he can use the system. In the enrollment session, the user’s identity, such as an account number, together with a pass-phrase, such as a digit string or a key phrase like “open sesame” shown in the figure, is assigned to the speaker. The system then prompts the speaker to say the pass-phrase several times to allow training or constructing of a speaker-dependent (SD) model that registers the speaker’s speech characteristics. The digit string can be the same as the account number, and the key phrase can be selected by the user so that it is easy to remember. An enrolled speaker can use the verification system in a future test. Similar procedures apply in the case of SID. These schemes are sometimes referred to as direct methods, as they use the speaker’s speech characteristics to infer or verify the speaker’s identity directly.

In a test session, the user first claims his/her identity by entering or speaking the identity information. The system then prompts the speaker to say the pass-phrase. The pass-phrase utterance is compared against the stored SD model. The speaker is accepted if the verification score exceeds a preset threshold; otherwise, the speaker is rejected.
Note that the pass-phrase may or may not be kept secret.
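The one-to-one decision of speaker verification and the one-to-many decision of speaker identification can be contrasted in a short Python sketch. The scoring function here is a toy stand-in (negative squared distance to a stored mean vector) assumed purely for illustration; it is not the statistical speaker model used in this book, and all names and values are hypothetical.

```python
from typing import Dict, Sequence

Vector = Sequence[float]


def score(features: Vector, model: Vector) -> float:
    """Toy match score: negative squared distance, so higher is better."""
    if len(features) != len(model):
        raise ValueError("feature and model dimensions differ")
    return -sum((f - m) ** 2 for f, m in zip(features, model))


def verify(features: Vector, claimed_id: str,
           models: Dict[str, Vector], threshold: float) -> bool:
    """Verification: compare only against the claimed speaker's model and
    accept when the score exceeds the preset threshold."""
    return score(features, models[claimed_id]) > threshold


def identify(features: Vector, models: Dict[str, Vector]) -> str:
    """Identification: return the enrolled speaker whose model scores best."""
    return max(models, key=lambda speaker: score(features, models[speaker]))


# Hypothetical enrolled models and a test feature vector.
enrolled = {"alice": [1.0, 2.0], "bob": [4.0, 0.0]}
test_features = [1.0, 2.5]
print(verify(test_features, "alice", enrolled, threshold=-1.0))  # True
print(verify(test_features, "bob", enrolled, threshold=-1.0))    # False
print(identify(test_features, enrolled))                         # alice
```

Verification needs a threshold because it answers a yes-no question about one claimed identity, whereas identification simply ranks all enrolled models.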
When the pass-phrases are the same in training and testing, the system is called a fixed pass-phrase system or fixed phrase system. Frequently, a short phrase or a connected-digit sequence, such as a telephone or account number, is chosen as the fixed pass-phrase. Using a digit string for a pass-phrase has a distinctive difference from other non-digit choices. The high performance of current, connected-digit speech recognition systems and embedded errorcorrecting possibilities of digit strings make it feasible that the identity claim can be made via spoken, rather than key-in input [40, 43]. If such an option is installed, the spoken digit string is first recognized by automatic speech recognition (ASR) and the standard verification procedure then follows using the same digit string. Obviously, successful verification of a speaker relies upon a correct recognition of the input digit string. A security concern may be raised about using fixed pass-phrases since a spoken pass-phrase can be tape-recorded by impostors and used in later trials to get access to the system. A text-prompted SV system has been proposed to circumvent such a problem. A text-prompted system uses a set of speakerdependent word or subword models, possibly for a small vocabulary such as the digits. These models are employed as the building blocks for constructing the models for the prompted utterance, which may or may not be part of the training material. When the user tries to access the system, the system prompts the user to utter a randomly picked sequence of words in the vocabulary. The word sequence is aligned with the pre-trained word models and a verification decision is made based upon the evaluated likelihood score. Compared to a fixed-phrase system, such a text-prompted system normally needs longer enrollment time in order to collect enough data to train the SD word or subword models. The performance of a text-prompted system is, in general, not as high as that of a fixed-phrase system. 
This is due to the fact that the phrase model constructed from concatenating elementary word or subword models is usually not as accurate as that directly trained from the phrase utterance in a fixed-phrase system. Details on a text-prompted system will be discussed later in this book. A typical SID system also has two operation scenarios – training section and testing section. During training, we need to train multiple speakerdependent acoustic models, one model associated with one speaker. During testing, when we receive testing speech data, we evaluate the data using all the trained speaker-dependent models. The identified speaker is the speaker who’s acoustic model has the best match to the given testing data compared to others. The above systems are called text-dependent, or text-constrained speaker recognition (SR) systems because the input utterance is constrained, either by a fixed phrase or by a fixed vocabulary. A verification system can also be textindependent. In a text-independent SR system, a speaker’s model is trained on the general speech characteristics of the person’s voice [38, 14]. Once such a model is trained, the speaker can be verified regardless of the underlying text of the spoken input. Such a system has wide applications for monitoring and
verifying a speaker on a continuous basis. In order to characterize a speaker’s general voice pattern without a text constraint, we normally need a large amount of phonetically or acoustically rich training data in the enrollment procedure. Also, without the text or lexical constraint, longer test utterances are usually needed to maintain satisfactory SR performance. Without a large training set and long test utterances, the performance of a text-independent system is usually inferior to that of a text-dependent system.

When an SR system is evaluated, if it is both trained and tested on the same set of speakers, the evaluation is called a closed test; otherwise, it is an open test. In a closed test, data from all the potential impostors (i.e., all except the true speaker) in the population can be used to train a set of high-performance, discriminative speaker models. However, as most SR applications are of an open-test nature, training the discriminative model against all possible impostors is not possible. As an alternative, a set of speakers whose speech characteristics are close to those of the true speaker can be used to train the speaker-dependent discriminative model, or speaker-independent models can be used to model impostors.

1.4.2 Verbal Information Verification

When current speaker recognition technology is applied to real-world applications, several problems are encountered. One such problem is the need for a voice enrollment session to collect data for training the speaker-dependent (SD) model. Voice enrollment is an inconvenience to the user as well as to the system operators, who often have to supervise the process and check the quality of the collected data to ensure system performance. The quality of the collected training data has a critical effect on the performance of an SV system. A speaker may make a mistake when repeating the training utterances or pass-phrases several times.
Furthermore, as we have discussed in [26], since the enrollment and testing voice may come from different telephone handsets and networks, acoustic mismatch between the training and testing environments may occur. The SD models trained on the data collected in an enrollment session may not perform well when the test session is in a different environment or via a different transmission channel. The mismatch significantly affects SV performance. This is a significant drawback of the direct method, in which robustness in comparative evaluation is difficult to ensure.

Alternatively, in light of the progress in modeling for speech recognition, the concept and algorithm of verbal information verification (VIV) were proposed [24, 22] to take advantage of speaker-registered information. VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions. An example of a VIV system is shown in Fig. 1.3. It is similar to a typical tele-banking procedure: after an account number is provided, the operator verifies the user by asking for some personal information, such as mother’s maiden name, birth date, address, home telephone number, etc. The user must provide answers to the questions correctly in order to gain access to his or her account and services. In this manner, a speaker’s identity is embedded in the knowledge she or he has about particular questions, and thus VIV is often considered an indirect method. To automate the whole procedure, the questions can be prompted by a text-to-speech (TTS) system or by pre-recorded messages.

[Figure 1.3: flowchart. The system asks three questions in sequence: “In which year were you born?”, “In which city/state did you grow up?”, and “May I have your telephone number, please?”. After each question it gets and verifies the answer utterance; a wrong answer leads to rejection, a correct answer leads to the next question, and three correct answers lead to acceptance.]

Fig. 1.3. An example of verbal information verification by asking sequential questions. Similar sequential tests can also be applied in speaker recognition and other biometric or multi-modality verification.

The difference between SR (the direct method) and VIV (the indirect method) can be further addressed in the following three aspects. First, in a speaker recognition system, either for SID or for SV, we need to train speaker-dependent (SD) models, while in VIV we usually use speaker-independent statistical models with associated acoustic-phonetic identities. Second, a speaker recognition system needs to enroll a new user and to train the SD model, while a VIV system does not require voice enrollment. Instead, a user’s personal data profile is created when the user’s account is set up. Finally, in speaker recognition, the system has the ability to reject an impostor even when the input utterance contains a legitimate pass-phrase, if the utterance fails to match the pre-trained SD model. In VIV, it is solely the user’s responsibility to protect his or her personal information because no speaker-specific
voice characteristics are used in the verification process. In real applications, there are several ways to counter the situation in which an impostor uses a speaker’s personal information obtained by eavesdropping on a particular session. A VIV system can ask for information that may not be constant from one session to another, e.g., the amount or date of the last deposit, or for a subset of the registered personal information, i.e., a number of randomly selected information fields in the personal data profile.

To improve user convenience and system performance, we can further combine VIV and SV to construct a progressive integrated speaker authentication system. In the combined system, VIV is used to verify a user during the first few accesses. Simultaneously, the system collects verified training data for constructing speaker-dependent models. Later, the system migrates to an SV system for authentication. The combined system is convenient for users since they can start to use the system without going through a formal enrollment session and waiting for model training. Furthermore, since the training data may be collected from different channels in different VIV sessions, the acoustic mismatch problem is mitigated, potentially leading to better system performance in test sessions. The SD statistical models can be updated to cover different acoustic environments while the system is in use to further improve system performance. VIV can also be used to validate the training data for SV. Details of this approach will be discussed in the following chapters.
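The sequential question procedure of Fig. 1.3, combined with the random selection of profile fields described above, can be illustrated with a minimal sketch. It is not the implementation used in this book: the function name `sequential_viv`, the profile layout, and the `verify_answer` callback (which stands in for prompting the user and verifying the spoken answer) are all hypothetical.

```python
import random
from typing import Callable, Dict, Optional


def sequential_viv(
    profile: Dict[str, str],
    verify_answer: Callable[[str, str], bool],
    num_questions: int = 3,
    rng: Optional[random.Random] = None,
) -> bool:
    """Accept only if every one of `num_questions` randomly chosen fields verifies.

    `verify_answer(field, expected)` is a hypothetical stand-in for asking the
    question for `field` and verifying the spoken answer against `expected`.
    """
    rng = rng or random.Random()
    # Ask about a random subset of the registered fields, so that an
    # eavesdropper on one session cannot simply replay the same answers.
    fields = rng.sample(sorted(profile), k=num_questions)
    for name in fields:
        if not verify_answer(name, profile[name]):
            return False  # any wrong answer leads to rejection
    return True  # all answers correct: acceptance
```

Rejecting at the first wrong answer mirrors the flow in Fig. 1.3; a deployed system might instead tolerate one recognition error, which is a design choice outside this sketch.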
1.5 Historical Perspective and Further Reading

Speaker authentication has been studied for several decades, and it is difficult to provide a complete review of its research history. This section is not intended to review the entire history of speaker authentication, but rather to briefly summarize the important progress in the field based on the author’s limited experience and understanding of technical approaches. This section also tries to link previous research and development to the chapters in this book.

From 1976 to 1997, Proceedings of the IEEE published review or tutorial papers on speaker authentication about once a decade [1, 39, 8, 5]. A general overview of speaker recognition was published in 1992 [44]. Before 1998, speaker authentication mainly focused on speaker recognition, including SV and SID, for text-dependent and text-independent applications. In 1999, a review paper by the author and his colleagues extended speaker recognition to speaker authentication [23]. The above review papers and the references therein are good starting points for reviewing the history of speaker authentication.

A speaker recognition system consists of two major subsystems: feature extraction and pattern recognition. Feature extraction is used to extract acoustic features from speech waveforms, while pattern recognition is used to recognize
true speakers and impostors from their acoustic features. We briefly review these two areas in turn.

In the 1960s, pitch contour [3], linear prediction analysis, and cepstrum [2] were applied to speaker recognition to extract features. In [11], cepstral normalization was introduced. It is still a useful technique in today’s speaker recognition systems. In [18, 9], filter banks were applied to feature extraction, where the filter banks were implemented by analogue filters followed by wave rectifiers. In [38], Mel frequency cepstral coefficients (MFCCs) were used in speaker recognition. As analyzed in Chapter 8, linear predictive cepstrum and Mel cepstrum may have similar performance for narrow-band signals, but for wideband signals, the Mel cepstrum can provide better performance. In [19], an auditory-based transform and an auditory filter bank were introduced; furthermore, an auditory-based feature was presented in [21, 20] to replace the FFT. We note that compared with an FFT spectrum-based filter bank [7], such as that used in MFCCs, the filter bank in the analogue implementation and the auditory transform have many advantages. The details will be discussed in Chapter 7.

Also beginning in the 1960s, basic statistical pattern recognition techniques were used, including mean, variance, and Gaussian distribution [36, 3, 17]. In [36], a ratio of the variance of speaker means to the average intra-speaker variance was introduced. The concept is similar to that of linear discriminant analysis (LDA). At that time, the statistical methods were based on the assumption that a speaker’s data follow a single Gaussian distribution. In [48], vector quantization (VQ) was applied to speaker recognition. The VQ approach is more general than previous approaches because it can handle data of any distribution. Later, the Gaussian mixture model (GMM) and hidden Markov model (HMM) [50, 42, 45, 38, 14, 29] were applied to speaker recognition.
The GMM can model any data distribution, and the HMM can characterize non-stationary signals with multiple states, where the data in each HMM state can be modeled by a GMM. The HMM is still the popular approach to text-dependent speaker verification [33], and the GMM is popular in text-independent speaker recognition [38]. In [41, 43, 33], cohort and background models were used to improve robustness. Instead of computing only the absolute scores of speaker-dependent models, the cohort and background models provide relative scores between a speaker-dependent model and a background model; therefore, the approach makes the speaker recognition system more robust.

Beginning with [27], discriminative training was applied to speaker recognition. Training speed was a concern for real applications in speaker recognition, and the research in Chapter 13 was then conducted to speed up discriminative training. In 1996 [46], the support vector machine (SVM) was applied to speaker recognition. More recently, as SVM software has become available to the public, more papers have been written about applying SVMs to speaker recognition [6]. Soft margin estimation has also recently been applied to speaker identification as a discriminative objective, with convex optimization used to solve the parameter estimation problem [52].
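The relative scoring with cohort and background models mentioned above can be illustrated with a small sketch. It is only an illustration under simplifying assumptions: each model is a single diagonal-covariance Gaussian given as a `(mean, variance)` pair, standing in for the GMMs or HMMs used in practice, and the names `gaussian_loglik` and `llr_score` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Model = Tuple[Sequence[float], Sequence[float]]  # (mean, variance) per dimension


def gaussian_loglik(x: Sequence[float], mean: Sequence[float], var: Sequence[float]) -> float:
    """Log-likelihood of one feature frame under a diagonal-covariance Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def llr_score(frames: Sequence[Sequence[float]], speaker: Model, background: Model) -> float:
    """Average per-frame log-likelihood ratio of speaker model to background model."""
    total = 0.0
    for x in frames:
        # Relative score: how much better the claimed speaker's model explains
        # the frame than the background (impostor) model does.
        total += gaussian_loglik(x, *speaker) - gaussian_loglik(x, *background)
    return total / len(frames)
```

A claim would be accepted when the score exceeds a threshold. Because a condition that lowers both likelihoods tends to cancel in the difference, the ratio is less sensitive to such conditions than the absolute speaker score.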
In 1976 [39], approaches and challenges to speaker verification were reviewed, along with two speaker verification systems developed by Texas Instruments and Bell Labs. Some of the challenges introduced then still confront today’s speaker authentication systems, such as recording environments and transmission conditions on dial-up telephone lines. Those problems are even more severe in today’s wireless and VoIP networks. To address the mismatch between training and testing conditions, a linear transform approach was developed in [26] and is described in Chapter 10. The factor analysis approach was developed in [16]. Both attempt to address the mismatch problem in the model domain. In [21, 20], an auditory-based feature was developed to address the problem in the feature domain. The details are available in Chapter 8.

Regarding system evaluations, in text-independent speaker recognition, NIST has been coordinating annual or biennial public evaluations since 1996 [37]. Readers can find recent approaches in the related publications. In text-dependent speaker verification, in 1997 a bank invited companies that had speaker verification systems to take part in a performance evaluation using the bank’s proprietary database. The author took part in the evaluation with his speaker verification system, developed in the Bell Labs software environment.

The term speaker authentication was introduced in 1997 while the author and his Bell Labs colleagues were developing the VIV technique [25, 24]. Since VIV goes beyond the traditional speaker recognition system, we introduced the term speaker authentication. As will be discussed in Chapters 14 and 15, VIV opens up a new research and application area in speaker authentication.

In addition to the areas discussed above, another research area in speaker authentication is the generation of cryptographic keys from spoken pass-phrases or passwords.
The goal of this research is to enable a device to generate a key from voice biometrics to encrypt data saved on the device. An attacker who captures the device and extracts all the information it contains should nevertheless be unable to determine the key and retrieve information from the device. This research was reported in [30, 31].
1.6 Book Organization

The chapters of this book cover three areas: pattern recognition, speech signal processing, and speaker authentication. As the foundation of this book, pattern recognition theory and algorithms are introduced in Chapters 2–4, 12, and 13. They are general enough to be applied to any pattern recognition task, including speech and speaker recognition. Signal processing algorithms and techniques are discussed in Chapters 5–8. They can be applied to audio signal processing, communications, speech recognition, and speaker authentication. Finally, algorithms, techniques, and methods for speaker authentication are introduced in Chapters 9, 10, 11, 14, and 15.
The book focuses on novel research ideas and effective and useful approaches. For each traditional topic, the author presents a newer and effective approach. For example, in vector quantization (VQ), in addition to the traditional K-means algorithm, the author presents a fast one-pass VQ algorithm. In neural networks (NN), instead of the back-propagation algorithm, the author presents a sequential and fast NN design method with data pruning. In decoding, instead of the Viterbi algorithm, the author presents a detection-based decoding algorithm. In signal processing, instead of the FFT, the author presents the auditory transform. In discriminative training, the author presents a fast, closed-form solution, and so on.

Each chapter discusses one major topic and can be read independently of the other chapters. We now provide a brief synopsis of each chapter and how it relates to the other chapters in the book.

Chapter 2 – Multivariate Statistical Analysis and One-Pass Vector Quantization: Since speaker authentication technology is basically a statistical approach, multivariate statistical analysis is foundational to this book. This chapter introduces the popular multivariate Gaussian distribution and principal component analysis (PCA) with illustrations. The Gaussian distribution function is the core of the GMM and HMM, which are used in speaker authentication. PCA is important for understanding the geometrical properties of datasets. Following that, we introduce the traditional vector quantization (VQ) algorithm, where the initial centroids are selected randomly. Furthermore, the one-pass VQ algorithm is presented; it initializes the centroids in a more efficient way. The segmental K-means algorithm extends the VQ algorithm to non-stationary, time-variant signals.
The traditional VQ and one-pass algorithms can be applied to train the Gaussian models and large background models for speaker recognition, while the segmental K-means can be used to train the HMM for speaker verification.

Chapter 3 – Principal Feature Networks for Pattern Recognition: Pattern recognition is one of the fundamental technologies in speaker authentication. During the last several decades, many pattern recognition algorithms have been developed, from linear discriminant analysis to decision trees, Bayesian decision theory, multilayer neural networks, and support vector machines. Many books already introduce the techniques and software toolboxes for most of these pattern recognition methods. Instead of reviewing each of the techniques, in this chapter we introduce a novel neural network training and construction approach. It is a sequential design procedure. Given the required recognition performance or error rates, the algorithm can construct a recognizer or neural network step by step based on discriminative pattern recognition techniques. The algorithm can determine the structure of the recognizer or neural network based on the required performance. The introduced technique can help readers understand the relationship between multivariate statistical analysis and neural networks. The principal feature networks have been used in real-world applications.
Chapter 4 – Non-Stationary Pattern Recognition: Traditional pattern recognition techniques assume stationary patterns, i.e., patterns that do not change over time; however, the patterns of speech signals in a feature domain do change with time, so non-stationary pattern recognition techniques are necessary for speaker authentication. In fact, the current approach to non-stationary pattern recognition is based on the stationary approach. It divides speech utterances into a sequence of short time segments, called states. Within each state, the speech pattern is assumed to be stationary; thus, stationary pattern recognition techniques can be applied to solve the non-stationary pattern recognition problem. This chapter provides the foundation for further descriptions of the algorithms used in speaker authentication.

Chapter 5 – Robust Endpoint Detection: Recorded speech signals for speaker authentication normally contain a combination of speech and silence. To achieve the best performance in terms of recognition speed and accuracy, one usually removes the silence in a front-end process. The detection of the presence of speech embedded in various types of non-speech events and background noise is called endpoint detection, speech detection, or speech activity detection. When the signal-to-noise ratio (SNR) is high, the task is not very difficult, but when the SNR is low, such as in wireless communications, VoIP, or strong background noise, the task can be very difficult. In this chapter, we present a robust endpoint detection algorithm which is invariant to different background noise levels. The algorithm has been used in real applications and has significantly improved speech recognition performance. For any real application, a robust endpoint detection algorithm is necessary. The technique can be used not only for speaker authentication, but for audio signal processing and speech recognition as well.
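To make the endpoint detection task concrete, the following is a minimal sketch of a simple energy-based baseline. It is not the robust algorithm presented in Chapter 5. It assumes that the first few frames of the recording contain only background noise, and the function name `detect_endpoints` and all default parameter values are illustrative assumptions.

```python
from typing import Optional, Sequence, Tuple


def detect_endpoints(
    samples: Sequence[float],
    frame_len: int = 160,
    noise_frames: int = 5,
    ratio: float = 4.0,
) -> Optional[Tuple[int, int]]:
    """Return (start, end) sample indices of the speech region, or None."""
    # Short-time energy of non-overlapping frames.
    energies = [
        sum(s * s for s in samples[i:i + frame_len]) / frame_len
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]
    if len(energies) <= noise_frames:
        return None  # too short to estimate the noise floor
    # Assumption: the leading frames contain background noise only.
    noise_floor = sum(energies[:noise_frames]) / noise_frames
    # The threshold is relative to the noise floor, so scaling the whole
    # signal by a constant gain does not move the detected endpoints.
    threshold = ratio * max(noise_floor, 1e-12)
    speech = [i for i, e in enumerate(energies) if e > threshold]
    if not speech:
        return None
    return speech[0] * frame_len, (speech[-1] + 1) * frame_len
```

A baseline of this kind degrades quickly at low SNR or with non-stationary noise, which is the situation the chapter is concerned with.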
Chapter 6 – Detection-Based Decoder: When the HMM is used in speech and speaker authentication, a decoding algorithm, such as the Viterbi decoding algorithm, is needed to search for the best state sequence for non-stationary pattern recognition. However, the search space is usually large, and one has to reduce it for practical applications. A popular approach is to pre-assign or guess a beam width as the search space. Obviously, this is not the best way to reduce the search space. In this chapter, we introduce detection theory to the decoding task and present a detection-based decoding algorithm that reduces the search space based on change-point detection theory. Our experimental results show that the algorithm can significantly speed up the decoding procedure. The algorithm can be used in speech recognition as well.

Chapter 7 – Auditory-Based Time-Frequency Transform: The time-frequency transform plays an important role in signal processing. The Fourier transform (FT) has been used for decades, but, as analyzed in this chapter, the FT is not robust to background noise and also generates significant computational noise. In this chapter, we present an auditory-based time-frequency transform based on our study of the hearing periphery system. The auditory transform (AT) is a pair of forward and inverse transforms whose invertibility has been proved in theory and validated in experiments. The AT has much less computational noise than the FT and can be free from pitch harmonics. The AT provides a solution to robust signal processing and can be used as an alternative to the FT. We also compare the AT with the FFT, the wavelet transform, and the Gammatone filter bank.

Chapter 8 – Auditory-Based Feature Extraction and Robust Speaker Identification: In this chapter, we present an auditory-based feature extraction algorithm. The features are based on the robust time-frequency transform introduced in Chapter 7, plus a set of modules to mimic the signal processing functions in the cochlea. The purpose is to address the acoustic mismatch problem between training and testing in speaker recognition. The new auditory-based algorithm has been shown in our experiments to be more robust than the traditional MFCC (Mel frequency cepstral coefficients), PLP, and RASTA-PLP features.

Chapter 9 – Fixed-Phrase Speaker Verification: In this chapter, we focus on a fixed-phrase SV system for open-set applications. Here, fixed-phrase means that the same pass-phrase is used for one speaker in both training and testing sessions and that the text of the pass-phrase is known to the system through registration. A short, user-selected phrase, also called a pass-phrase, is easy to remember and use. For example, it is easier to remember “open sesame” as a pass-phrase than a 10-digit phone number. Based on our experiments, the selected pass-phrase can be short, less than two seconds in duration, and still yield good performance.

Chapter 10 – Robust Speaker Verification with Stochastic Matching: In this chapter, we address the acoustic mismatch between training and testing environments with a different approach: transforming the feature space.
Speaker authentication performance is degraded when a model trained under one set of conditions is used to evaluate data collected from different telephone channels, microphones, etc. The mismatch can be approximated as a linear transform in the cepstral domain. We present a fast, efficient algorithm to estimate the parameters of the linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scale, and translation without destroying the detailed characteristics of speech. As a result, the pre-trained SD models can be used to evaluate the details under the same conditions as training. Compared to cepstral mean subtraction (CMS) and other bias-removal techniques, the presented linear transform is more general, since CMS and the others consider only translation; compared to maximum-likelihood approaches for stochastic matching, the presented algorithm is simpler and faster, since iterative techniques are not required.

Chapter 11 – Randomly Prompted Speaker Verification: In this chapter, we first introduce the randomly prompted SV system and then present a robust algorithm for randomly prompted SV. The algorithm is referred to here as normalized discriminant analysis (NDA). Using this technique, it is
possible to design an efficient linear classifier with very limited training data and to generate normalized discriminant scores with comparable magnitudes for different classifiers. The NDA technique is applied to a recognizer for randomly prompted speaker verification, where speaker-specific information is obtained when utterances are processed with speaker-independent models. In experiments conducted on a network-based telephone database, the NDA technique shows a significant improvement over Fisher linear discriminant analysis. Furthermore, when NDA is used in a hybrid SV system combining information from speaker-dependent and speaker-independent models, verification performance is better than that of the HMM with cohort normalization.

Chapter 12 – Objectives for Discriminative Training: Discriminative training has shown advantages over maximum likelihood training in speech and speaker recognition. To apply discriminative training, a discriminative objective needs to be defined first. In this chapter, the relations among several popular discriminative training objectives for speech and speaker recognition, language processing, and pattern recognition are derived through theoretical analysis. Those objectives are the minimum classification error (MCE), maximum mutual information (MMI), minimum error rate (MER), and a recently proposed generalized minimum error rate (GMER). The results show that all the objectives can be related to minimum error rates and maximum a posteriori probability. The results and the analytical methods used in this chapter can help in judging and evaluating discriminative objectives, and in defining new objectives for different tasks and better performance.

Chapter 13 – Fast Discriminative Training: Currently, most discriminative training algorithms for nonlinear classifier designs are based on gradient-descent (GD) methods for objective minimization.
These algorithms are easy to derive and effective in practice, but they are slow to train and their learning rates are difficult to select. To address the problem, we present our study on a fast discriminative training algorithm. The algorithm initializes the parameters by the expectation maximization (EM) algorithm, and then uses a set of closed-form formulas derived in this chapter to further optimize a proposed objective of minimizing error rate. Experiments in speech applications show that the algorithm provides better recognition accuracy in fewer iterations than the EM algorithm and a neural network trained by hundreds of gradient-descent iterations. Our contribution in this chapter is to present a new way to formulate the objective minimization process, and thereby introduce a process that can be efficiently implemented with the desired result as promised by discriminative training.

Chapter 14 – Verbal Information Verification: Traditional speaker authentication focuses on SV and SID, which are both accomplished by matching the speaker’s voice with his or her registered speech patterns. In this chapter, we introduce a technique named verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker’s registered profile to decide whether the claimed identity should be accepted or rejected. Using
the proposed sequential procedure involving three question-response turns, VIV achieves an error-free result in a telephone speaker authentication experiment with 100 speakers. VIV opens up a new research direction and application in speaker authentication.

Chapter 15 – Speaker Authentication System Design: Throughout this book, we introduce various speaker authentication techniques. In real-world applications, a speaker authentication system can be designed by combining two or more techniques to construct a useful and convenient system that meets the requirements of a particular application. In this chapter, we provide an example of a speaker authentication system design. The design requirement comes from an on-line banking system which requires speaker verification but does not want to inconvenience customers with an enrollment procedure. The designed system is a combination of VIV and SV. Following this example, readers can design their own systems for any particular application.
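The control flow of such a combined system, in which VIV verifies the user during the first few accesses while verified utterances are collected and the system later migrates to SV, can be sketched as follows. This is only an outline under stated assumptions, not the design of Chapter 15: the class `ProgressiveAuthenticator`, the callbacks `viv_verify` and `train_sv`, and the enrollment count are hypothetical, and utterances are represented by plain strings.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

Verifier = Callable[[str], bool]


@dataclass
class ProgressiveAuthenticator:
    viv_verify: Verifier                        # hypothetical VIV back end
    train_sv: Callable[[List[str]], Verifier]   # hypothetical SD model trainer
    min_enrollment_utterances: int = 3
    collected: List[str] = field(default_factory=list)
    sv_verify: Optional[Verifier] = None

    def authenticate(self, utterance: str) -> bool:
        # Once a speaker-dependent model exists, authentication uses SV.
        if self.sv_verify is not None:
            return self.sv_verify(utterance)
        # Until then, verify by VIV and keep only utterances that passed,
        # so the training data for the SD model are already verified.
        accepted = self.viv_verify(utterance)
        if accepted:
            self.collected.append(utterance)
            if len(self.collected) >= self.min_enrollment_utterances:
                self.sv_verify = self.train_sv(self.collected)
        return accepted
```

Keeping the two back ends behind callbacks leaves the migration policy (how many verified utterances to collect, and whether to keep updating the SD model afterwards) as a separate decision.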
References

1. Atal, B. S., “Automatic recognition of speakers from their voices,” Proceedings of the IEEE, vol. 64, pp. 460–475, 1976.
2. Atal, B. S., “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
3. Atal, B., Automatic speaker recognition based on pitch contours. PhD thesis, Polytech. Inst., Brooklyn, NY, June 1968.
4. Campbell, J. P., “Forensic speaker recognition,” IEEE Signal Processing Magazine, pp. 95–103, March 2009.
5. Campbell, J. P., “Speaker recognition: A tutorial,” Proceedings of the IEEE, vol. 85, pp. 1437–1462, Sept. 1997.
6. Campbell, W. M., “Support vector machines using GMM supervectors for speaker verification,” IEEE Signal Processing Letters, pp. 308–311, May 2006.
7. Davis, S. B. and Mermelstein, P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
8. Doddington, G., “Speaker recognition – identifying people by their voices,” Proceedings of the IEEE, vol. 73, pp. 1651–1664, Nov. 1985.
9. Doddington, G., “Speaker verification – final report,” Tech. Rep. RADC 74–179, Rome Air Development Center, Griffiss AFB, NY, Apr. 1974.
10. FpVTE2003, “http://www.nist.gov/itl/iad/ig/fpvte03.cfm,” in Proceedings of The Fingerprint Vendor Technology Evaluation (FpVTE), 2003.
11. Furui, S., “Cepstral analysis techniques for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
12. FVC2004, “http://bias.csr.unibo.it/fvc2004/,” in Proceedings of The Third International Fingerprint Verification Competition, 2004.
13. FVC2006, “http://bias.csr.unibo.it/fvc2006/,” in Proceedings of The Fourth International Fingerprint Verification Competition, 2006.
14. Gish, H. and Schmidt, M., “Text-independent speaker identification,” IEEE Signal Processing Magazine, pp. 18–32, Oct. 1994.
15. Jain, A. K., Ross, A., and Prabhakar, S., “An introduction to biometric recognition,” IEEE Trans. on Circuits and Systems for Video Tech., vol. 14, pp. 4–20, January 2004.
16. Kenny, P. and Dumouchel, P., “Experiments in speaker verification using factor analysis likelihood ratios,” in Proceedings of Odyssey, pp. 219–226, 2004.
17. Li, K. and Hughes, G., “Talker differences as they appear in correlation matrices of continuous speech spectra,” J. Acoust. Soc. Amer., vol. 55, pp. 833–837, Apr. 1974.
18. Li, K. and J.E. Dammann, W. C., “Experimental studies in speaker verification using an adaptive system,” J. Acoust. Soc. Amer., vol. 40, pp. 966–978, Nov. 1966.
19. Li, Q., “An auditory-based transform for audio signal processing,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), Oct. 2009.
20. Li, Q. and Huang, Y., “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions,” IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
21. Li, Q. and Huang, Y., “Robust speaker identification using an auditory-based feature,” in ICASSP 2010, 2010.
22. Li, Q. and Juang, B.-H., “Speaker verification using verbal information verification for automatic enrollment,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Seattle), May 1998.
23. Li, Q., Juang, B.-H., Lee, C.-H., Zhou, Q., and Soong, F. K., “Recent advancements in automatic speaker authentication,” IEEE Robotics & Automation Magazine, vol. 6, pp. 24–34, March 1999.
24. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., “Automatic verbal information verification for user authentication,” IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 585–596, Sept. 2000.
25. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., “Verbal information verification,” in Proceedings of EUROSPEECH, (Rhodes, Greece), pp. 839–842, Sept. 22-25 1997.
26. Li, Q., Parthasarathy, S., and Rosenberg, A. E., “A fast algorithm for stochastic matching with application to robust speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
27. Liu, C.-S., Lee, C.-H., Juang, B.-H., and Rosenberg, A., “Speaker recognition based on minimum error discriminative training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, 1994.
28. Martin, A., Doddington, G., Kamm, T., Ordowski, M., and Przybocki, M., “The DET curve in assessment of detection task performance,” in Proceedings of Eurospeech, (Rhodes, Greece), pp. 1899–1903, Sept. 1997.
29. Matsui, T. and Furui, S., “Comparison of text-independent speaker recognition methods using VQ-distortion and discrete/continuous HMM’s,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 456–459, 1994.
30. Monrose, F., Reiter, M. K., Li, Q., and Wetzel, S., “Using voice to generate cryptographic keys: a position paper,” in Proceedings of Speaker Odyssey, June 2001.
31. Monrose, F., Reiter, M. K., Q. Li, D. L., and Shih, C., “Toward speech generated cryptographic keys on resource constrained devices,” in Proceedings of the 11th USENIX Security Symposium, August 2002.
32. O’Gorman, L., “Comparing passwords, tokens, and biometrics for user authentication,” Proceedings of the IEEE, vol. 91, pp. 2021–2040, December 2003.
33. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker verification using sub-word background models and likelihood-ratio scoring,” in Proceedings of ICSLP-96, (Philadelphia), October 1996.
34. Phillips, P. J., “MBGC portal challenge version 2 preliminary results,” in Proceedings of MBGC Third Workshop, 2009.
35. Phillips, P. J., Scruggs, W. T., O’Toole, A. J., Flynn, P. J., Bowyer, K. W., Schott, C. L., and Sharpe, M., “FRVT 2006 and ICE 2006 large-scale results,” in NISTIR, March 2007.
36. Pruzansky, S., “Pattern-matching procedure for automatic talker recognition,” J. Acoust. Soc. Amer., vol. 35, pp. 354–358, Mar. 1963.
37. Przybocki, M., Martin, A., and Le, A., “NIST speaker recognition evaluations utilizing the mixer corpora – 2004, 2005, 2006,” IEEE Trans. Audio, Speech and Language Processing, vol. 15, pp. 1951–1959, Sept. 2007.
38. Reynolds, D. and Rose, R. C., “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.
39. Rosenberg, A. E., “Automatic speaker verification: a review,” Proceedings of the IEEE, vol. 64, pp. 475–487, April 1976.
40. Rosenberg, A. E. and DeLong, J., “HMM-based speaker verification using a telephone network database of connected digital utterances,” Technical Memorandum BL01126-931206-23TM, AT&T Bell Laboratories, December 1993.
41. Rosenberg, A. E., DeLong, J., Lee, C.-H., Juang, B.-H., and Soong, F. K., “The use of cohort normalized scores for speaker verification,” in Proceedings of the International Conference on Spoken Language Processing, (Banff, Alberta, Canada), pp. 599–602, October 1992.
42. Rosenberg, A. E., Lee, C.-H., and Juang, B.-H., “Subword unit talker verification using hidden Markov models,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 269–272, 1990.
43. Rosenberg, A. E. and Parthasarathy, S., “Speaker background models for connected digit password speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.
44. Rosenberg, A. and Soong, F., “Recent research in automatic speaker recognition,” in Advances in Speech Signal Processing, (Furui, S. and Sondhi, M., eds.), pp. 701–738, NY: Marcel Dekker, 1992.
45. Savic, M. and Gupta, S., “Variable parameter speaker verification system based on hidden Markov modeling,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 281–284, 1990.
46. Schmidt, M. and Gish, H., “Speaker identification via support vector machine,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 105–108, 1996.
47. Simmons, G. J., “A survey of information authentication,” Proceedings of the IEEE, vol. 76, pp. 603–620, May 1988.
48. Soong, F. K. and Rosenberg, A. E., “On the use of instantaneous and transitional spectral information in speaker recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-36, pp. 871–879, June 1988.
49. Stocksdale, G., “Glossary of terms used in security and intrusion detection,” Online, NSA, 2009. http:/www.sans.org/newlook/resources/glossary.htm.
50. Tishby, N., “Information theoretic factorization of speaker and language in hidden Markov models, with application to speaker recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. v1, 97–90, April 1988.
51. Woodward, J. D., Webb, K. W., Newton, E. M., Bradley, M. A., Rubenson, D., Larson, K., Lilly, J., Smythe, K., Houghton, B., Pincus, H. A., Schachter, J., and Steinberg, P., Army Biometric Applications: Identifying and Addressing Sociocultural Concerns. RAND Arroyo, 2001.
52. Yin, Y. and Li, Q., “Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data,” in ICASSP 2011, 2011.
Chapter 2 Multivariate Statistical Analysis and One-Pass Vector Quantization
Current speaker authentication algorithms are largely based on multivariate statistical theory. In this chapter, we introduce the most important technical components and concepts of multivariate analysis as they apply to speaker authentication: the multivariate Gaussian (also called normal) distribution, principal component analysis (PCA), vector quantization (VQ), and segmental K-means. These fundamental techniques have been used for statistical pattern recognition and will be used in our further discussions throughout this book. Understanding the basic concepts of these techniques is essential for understanding and developing speaker authentication algorithms. Those readers who are already familiar with multivariate analysis can skip most of the sections in this chapter; however, the one-pass VQ algorithm presented in Section 2.4 is from the author and Swaszek’s research [8] and may be unknown to the reader. The algorithm is useful in processing very large datasets. For example, when training a background model in speaker recognition, the dataset can be huge; the one-pass VQ algorithm can speed up the initialization of the training procedure. It can be used to initialize the centroids during Gaussian mixture model (GMM) or hidden Markov model (HMM) training for large datasets. Also, the concept will be applied to Chapter 3 in sequential design of a classifier and to Chapter 14 in sequential design of speaker authentication systems.
2.1 Multivariate Gaussian Distribution
The multivariate Gaussian distribution plays a fundamental role in multivariate analysis, and many real-world problems fall naturally within the framework of Gaussian theory. It is also important and popular in speaker recognition. The importance of the Gaussian distribution in speaker authentication rests on its extension to the mixture Gaussian distribution, or Gaussian mixture model (GMM). A mixture Gaussian distribution with enough Gaussian components
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_2, © Springer-Verlag Berlin Heidelberg 2012
Fig. 2.1. An example of bivariate Gaussian distribution: ρ11 = 1.23, ρ12 = ρ21 = 0.45, and ρ22 = 0.89.
can approximate the “true” population distribution of speech data. In addition, the EM (Expectation-Maximization) algorithm provides a convenient training algorithm for the GMM. The algorithm is fast and guarantees that the likelihood does not decrease at any iteration. Finally, based on the GMM, we can build a hidden Markov model (HMM) for speaker verification and verbal information verification. In this section we discuss the single Gaussian distribution, which is the basic component of a mixture Gaussian distribution. The mixture Gaussian distribution will be discussed with models for speaker recognition in the following sections. A p-dimensional Gaussian density for the random vector $X = [x_1, x_2, \ldots, x_p]$ can be written as
$$f(x) = \frac{1}{(2\pi)^{p/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(x-\mu)^{T}\Sigma^{-1}(x-\mu)\right] \tag{2.1}$$
where $-\infty < x_i < \infty$, $i = 1, 2, \ldots, p$. The p-dimensional normal density is denoted $N_p(\mu, \Sigma)$ [6]. For example, when p = 2, the bivariate Gaussian density has the following individual parameters: $\mu_1 = E(X_1)$, $\mu_2 = E(X_2)$, $\sigma_{11} = \mathrm{Var}(X_1)$, $\sigma_{22} = \mathrm{Var}(X_2)$, and the covariance matrix is
$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \tag{2.2}$$
Fig. 2.2. The contour of the Gaussian distribution in Fig. 2.1.
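and the inverse of the covariance matrix is
$$\Sigma^{-1} = \frac{1}{\sigma_{11}\sigma_{22} - \sigma_{12}^{2}} \begin{bmatrix} \sigma_{22} & -\sigma_{12} \\ -\sigma_{12} & \sigma_{11} \end{bmatrix}. \tag{2.3}$$
Introducing the correlation coefficient $\rho_{12} = \sigma_{12}/(\sqrt{\sigma_{11}}\sqrt{\sigma_{22}})$, we have the expression for the bivariate (p = 2) Gaussian density as:
$$f(x_1, x_2) = \frac{1}{2\pi\sqrt{\sigma_{11}\sigma_{22}(1-\rho_{12}^{2})}} \exp\left\{-\frac{1}{2(1-\rho_{12}^{2})}\left[\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)^{2} + \left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)^{2} - 2\rho_{12}\left(\frac{x_1-\mu_1}{\sqrt{\sigma_{11}}}\right)\left(\frac{x_2-\mu_2}{\sqrt{\sigma_{22}}}\right)\right]\right\} \tag{2.4}$$
For illustration, we plot a bivariate Gaussian distribution and its contours in Figs. 2.1 and 2.2.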
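As a numerical check on the two forms of the density, the following minimal sketch evaluates the general expression (2.1) and the bivariate expression (2.4) at the same point. The function names and the test point are illustrative assumptions; the covariance values are borrowed from the caption of Fig. 2.1 and are interpreted here as covariance entries, which is also an assumption.

```python
import numpy as np

def gaussian_density(x, mu, sigma):
    """Evaluate the p-dimensional Gaussian density of Eq. (2.1) at point x."""
    x = np.asarray(x, dtype=float)
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    p = mu.size
    diff = x - mu
    # Quadratic form (x - mu)^T Sigma^{-1} (x - mu), without forming the inverse.
    quad = diff @ np.linalg.solve(sigma, diff)
    norm = (2.0 * np.pi) ** (p / 2.0) * np.sqrt(np.linalg.det(sigma))
    return float(np.exp(-0.5 * quad) / norm)

def bivariate_density(x1, x2, mu1, mu2, s11, s22, s12):
    """Evaluate the bivariate form of Eq. (2.4) using the correlation coefficient."""
    rho = s12 / np.sqrt(s11 * s22)
    z1 = (x1 - mu1) / np.sqrt(s11)
    z2 = (x2 - mu2) / np.sqrt(s22)
    quad = (z1 ** 2 + z2 ** 2 - 2.0 * rho * z1 * z2) / (1.0 - rho ** 2)
    norm = 2.0 * np.pi * np.sqrt(s11 * s22 * (1.0 - rho ** 2))
    return float(np.exp(-0.5 * quad) / norm)

# Illustrative parameters (assumed): zero mean, covariance entries 1.23, 0.45, 0.89.
mu = np.array([0.0, 0.0])
sigma = np.array([[1.23, 0.45], [0.45, 0.89]])
general = gaussian_density([0.5, -0.3], mu, sigma)
special = bivariate_density(0.5, -0.3, 0.0, 0.0, 1.23, 0.89, 0.45)
```

Both calls should return the same value up to rounding, since (2.4) is (2.1) written out for p = 2.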
2.2 Principal Component Analysis
Principal component analysis (PCA) is concerned with explaining the variance-covariance structure through the most important linear combinations of the original variables [6]. In speaker authentication, PCA is a useful tool for data interpretation, reduction, analysis, and visualization. When a feature or data set is represented in a p-dimensional data space, the actual data variability may lie largely in a small number of dimensions k, where k < p; the k dimensions can then represent the data in p dimensions, and the original data set can be reduced from an n × p matrix to an n × k matrix, where k is the number of principal components.
Fig. 2.3. An illustration of a constant density ellipse and the principal components for a normal random vector X. The largest eigenvalue associates with the long axis of the ellipse and the second eigenvalue associates with the short axis. The eigenvectors associate with the axes.
Let the random vector $X = [X_1, X_2, \ldots, X_p]$ have a covariance matrix $\Sigma$ with eigenvalue-eigenvector pairs $(\lambda_1, e_1), (\lambda_2, e_2), \ldots, (\lambda_p, e_p)$, where $\lambda_1 \ge \lambda_2 \ge \ldots \ge \lambda_p \ge 0$. The ith principal component is
$$Y_i = e_i^{T} X = e_{1i}X_1 + e_{2i}X_2 + \ldots + e_{pi}X_p, \quad i = 1, 2, \ldots, p \tag{2.5}$$
and its variance is
$$\mathrm{Var}(Y_i) = e_i^{T}\Sigma e_i, \quad i = 1, 2, \ldots, p \tag{2.6}$$
Because $e_i$ is a unit-length eigenvector of $\Sigma$, this variance equals $\lambda_i$, and different principal components are uncorrelated.
A proof of this property can be found in [6]. An illustration of the concept of the principal components is given in Fig. 2.3. The principal components coincide with the axes of the constant density ellipse. The largest eigenvalue is associated with the long axis and the second eigenvalue is associated with the short axis. The eigenvectors are associated with directions of the axes. In speaker authentication, the PCA can be used to reduce the feature space for efficient computation without much affecting the recognition or classification accuracy. Also, it can be used to project multidimensional data samples
onto a selected two-dimensional space, so the data is visible for analysis. Readers can refer to [6] to gain more knowledge on multivariate statistical analysis.
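The reduction from an n × p data matrix to an n × k matrix can be sketched as follows. This is a minimal illustration rather than the implementation used in this book: the function name `pca`, the synthetic data, and the choice to center the data before projection are all assumptions made for the example.

```python
import numpy as np

def pca(data, k):
    """Return (components, eigenvalues, projected data) for the top-k principal components."""
    centered = data - data.mean(axis=0)
    cov = np.cov(centered, rowvar=False)          # p x p sample covariance
    eigvals, eigvecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]             # reorder so lambda_1 >= lambda_2 >= ...
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :k]                   # p x k matrix [e_1, ..., e_k]
    projected = centered @ components             # n x k matrix of Y_1, ..., Y_k
    return components, eigvals, projected

rng = np.random.default_rng(0)
# Synthetic 3-dimensional data whose variability lies mostly in one direction (assumed example).
latent = rng.normal(size=(500, 1))
data = latent @ np.array([[2.0, 1.0, 0.5]]) + 0.1 * rng.normal(size=(500, 3))
components, eigvals, projected = pca(data, k=2)
explained = eigvals[:2].sum() / eigvals.sum()     # fraction of total variance retained
```

The sample variance of each projected column equals the corresponding eigenvalue, which is the sample counterpart of (2.6).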
2.3 Vector Quantization
Vector quantization (VQ) is a fundamental tool for pattern recognition, classification, and data compression. It has been applied widely to speech and image processing. VQ has become the first step in training the GMM and the HMM, the most popular models used in speaker authentication. The task of VQ is to partition an m-dimensional space into multiple cells, each represented by a quantized vector. These vectors are variously called centroids, codebook vectors, or codewords. The VQ training criterion is to minimize the overall average distortion over all cells when the centroids are used to represent the data in the cells. The most popular algorithm for VQ training is the K-means algorithm. In brief, the K-means algorithm is an iterative process with the following steps. First, initialize the centroids by an adequate method, such as the one discussed in the next section. Second, assign each training data vector to one of the cells by finding its nearest centroid. Third, update each centroid from the data grouped into the corresponding cell. Last, repeat the second and third steps until the changes in the centroids fall within the required range. We note that the K-means algorithm can only converge to a local optimum [12]. Different initial centroids may lead to different local optima. This also affects the HMM training results when VQ is used to initialize HMM training: when an HMM is used for speaker recognition, the recognition accuracy may differ slightly each time the recognizer is retrained. The LBG (Linde, Buzo, and Gray) algorithm [12] is another popular algorithm for VQ. Instead of selecting all initial centroids at one time, the LBG algorithm determines the initial centroids through a splitting process. The VQ method can also be used directly for speaker identification.
In [17], Soong, Rosenberg and Juang used a speaker-dependent codebook to represent characteristics of a speaker’s voice. The codebook is generated by a clustering procedure based upon a predefined objective distortion measure, which computes the dissimilarity between any two given vectors [17]. The codebook can also be considered an implicit representation of a mixture distribution used to describe the statistical properties of the source, i.e., the particular talker. In the test session, input vectors from the unknown talker are compared with the nearest codebook entry and the corresponding distortions are accumulated to form the basis for a classification decision.
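The K-means steps and the codebook-based decision described above can be sketched as follows. This is a minimal illustration under stated assumptions: random initialization, squared Euclidean distortion, and synthetic Gaussian vectors standing in for two speakers' features. The names `kmeans`, `identify`, `speaker_a`, and `speaker_b` are hypothetical and are not taken from [17].

```python
import numpy as np

def nearest_centroid(data, centroids):
    """Return the index of the nearest centroid and the squared distance for every row."""
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    return labels, dists[np.arange(len(data)), labels]

def kmeans(data, k, rng, max_iter=50, tol=1e-6):
    """Plain K-means with random initial centroids drawn from the data."""
    centroids = data[rng.choice(len(data), size=k, replace=False)].copy()
    for _ in range(max_iter):
        labels, _ = nearest_centroid(data, centroids)
        updated = centroids.copy()
        for j in range(k):
            members = data[labels == j]
            if len(members) > 0:               # keep the old centroid if a cell is empty
                updated[j] = members.mean(axis=0)
        shift = np.abs(updated - centroids).max()
        centroids = updated
        if shift < tol:                        # centroids have stopped moving
            break
    return centroids

def identify(test_vectors, codebooks):
    """Pick the speaker whose codebook gives the lowest average distortion."""
    scores = {name: nearest_centroid(test_vectors, cb)[1].mean()
              for name, cb in codebooks.items()}
    return min(scores, key=scores.get)

rng = np.random.default_rng(1)
# Synthetic stand-ins for two speakers' feature vectors (hypothetical data, not real speech).
train = {"speaker_a": rng.normal(0.0, 1.0, size=(400, 4)),
         "speaker_b": rng.normal(3.0, 1.0, size=(400, 4))}
codebooks = {name: kmeans(x, k=8, rng=rng) for name, x in train.items()}
decision = identify(rng.normal(3.0, 1.0, size=(50, 4)), codebooks)
```

Because the initial centroids are drawn at random, rerunning `kmeans` with a different seed may give a different local optimum, which is the sensitivity noted above.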
2.4 One-Pass VQ
The initial centroids, or initial codebook, are critical to the final VQ results. To achieve high performance, high computational complexity in both VQ encoding and codebook design is usually required. The methods most often employed for designing the codebook, such as the K-means and LBG algorithms, are iterative and require a large amount of CPU time to reach a locally optimal solution. In speaker recognition, it is often necessary to pool all available data to train a background model or cohort model. The first step in the training is to initialize the centroids for VQ. The traditional initialization algorithm simply selects the centroids at random, which is inefficient and wastes a great deal of computation time when the dataset is huge. To speed up the model training procedure, an algorithm is needed that provides better initial centroids and requires fewer iterations. In [8], we proposed a one-pass VQ algorithm for this purpose. We introduce it in detail in this section.
2.4.1 The One-Pass VQ Algorithm
The one-pass VQ algorithm is based on a sequential statistical analysis of the local characteristics of the training data and a sequential pruning technique. The work was originally proposed by the author and Swaszek in [8]. It was inspired by a constructive neural network design concept for classification and pattern recognition, as described in Chapter 3 [10, 11, 9, 22]. The one-pass VQ algorithm sequentially selects a subset of the training vectors in a high-density area of the training data space and computes a VQ codebook vector. Next, a sphere is constructed about the code vector, with a radius chosen so that the encoding error for points within the sphere is acceptable. In the next stage, the data within the sphere is pruned (deleted) from the training data set. This procedure continues on the remainder of the training set until the desired number of centroids is located.
To achieve a local optimum, one or a few iterations of the K-means algorithm can then be applied. The basic steps of the one-pass VQ algorithm are illustrated in Figure 2.4, which shows how a code vector is determined for a two-dimensional training data set (the algorithm is developed and implemented in N dimensions). We refer to this figure in the following discussion. The one-pass VQ algorithm is compared with several benchmark results for uncorrelated Gaussian, correlated Gaussian, and Laplace sources; its performance is seen to be as good as or better than the benchmark results with much less computation. Furthermore, the one-pass initialization needs only slightly more computation than a single iteration of the K-means algorithm when the data set is large and of high dimension. The algorithm is also robust and can be invariant to outliers. The algorithm can be applied
Fig. 2.4. The method to determine a code vector: (a) select the highest density cell; (b) examine a group of cells around the selected one; (c) estimate the center of the data subset; (d) cut a “hole” in the training data set.
directly in classification, pattern recognition and data compression, or indirectly as the first step in training Gaussian mixture models or hidden Markov models.
Design Specifications
The training data set X is assumed to consist of M N-dimensional training vectors. The total number of regions R is determined by the desired bit-rate. Let D represent the maximum allowable distortion,
$$\|x_m - y_c\|_2 \le D \tag{2.7}$$
where $x_m \in X$ is a data vector and $y_c$ is the nearest code vector to $x_m$. The value for D can be determined either from the application (such as from human vision experiments, etc.) or estimated from the required data compression ratio.
Source Histogram
The one-pass design first partitions the input space by a cubic lattice to compute a histogram. Each histogram cell is an N-dimensional hypercube.
(In general, we could allow different side lengths for the cells.) The number of cells in each dimension is calculated from the range of the data in that dimension and the maximum allowable distortion D. The number of cells in the jth dimension, $L_j$, is calculated as
$$L_j = \frac{x_{j,\max} - x_{j,\min}}{2D/3} \tag{2.8}$$
where $x_{j,\max}$ and $x_{j,\min}$ are, respectively, the maximal and minimal values of X in dimension j. The term 2D/3 is the recommended side length of a cell in that dimension. The probability mass function $f(k_1, k_2, \ldots, k_N)$, with $k_i \in \{1, 2, \ldots, L_i\}$, is determined by counting the number of training data vectors within each cell. The sequential design algorithm starts from the cell that currently has the largest number of training vectors (the maximum of $f(\cdots)$ over the $k_i$). In Figure 2.4(a) we assume that the highlighted region is that cell. Then, as shown in Figure 2.4(b), a group of contiguous cells around the selected one is chosen. This region $X_s \subset X$ (highlighted) will be used to locate a VQ code vector.
Locating a Code Vector
Two algorithms are employed in locating a code vector: a principal component method and a direct median method. For the results presented here, the medians of $X_s$ in each of the N dimensions are computed as the components of the code vector. Medians are employed to provide some robustness with respect to variation in the data subset $X_s$. The median is marked in Figure 2.4(c) by the “+” symbol.
Principal Component (PC) Method: As shown in Fig. 2.5, this method first solves an eigenvector problem on the data set $X_s$,
$$R_{X_s} E = E\lambda, \tag{2.9}$$
where $R_{X_s}$ is the covariance matrix of $X_s$, $\lambda$ is a diagonal matrix of eigenvalues, and $E = [e_1, e_2, \ldots, e_N]$ are the eigenvectors. Then $X_s$ is projected onto the direction of each of the eigenvectors,
$$Y_j = X_s e_j. \tag{2.10}$$
Medians are then determined from each of the $Y_j$. The centroid is obtained by taking the inverse transform of the medians of the $Y_j$.
Direct-Median Method: This method is straightforward. It simply uses the medians of $X_s$ in each of the original dimensions as the estimated centroid. Because the second method avoids the eigenvector calculation, it is faster than the first one. For evaluation, both methods have been tested on the codebook design of Gaussian and Laplace sources. The mean square errors from both methods are very close; the PC method is slightly better for the Laplace source and the direct median method is better for the Gaussian source.
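The histogram step and the two centroid estimators can be sketched as follows. This is a minimal illustration, not the implementation evaluated in this chapter: the function names, the 3-cells-per-dimension window around the peak cell, and the synthetic Gaussian data are assumptions made for the example.

```python
import numpy as np

def densest_cell_subset(data, cell_size):
    """Bin the data on a cubic lattice and return the vectors in and around the fullest cell."""
    cells = np.floor((data - data.min(axis=0)) / cell_size).astype(int)
    uniq, counts = np.unique(cells, axis=0, return_counts=True)
    peak = uniq[counts.argmax()]                       # cell with the most training vectors
    # Assumed window: the peak cell plus its immediate neighbours in every dimension.
    in_window = (np.abs(cells - peak) <= 1).all(axis=1)
    return data[in_window]

def direct_median_centroid(subset):
    """Direct-median method: median of the subset in each original dimension."""
    return np.median(subset, axis=0)

def pc_median_centroid(subset):
    """PC method: medians along the eigenvector directions, mapped back (Eqs. 2.9 and 2.10)."""
    cov = np.cov(subset, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)                   # columns are e_1, ..., e_N
    projected = subset @ eigvecs                       # Y_j = X_s e_j for every j
    medians = np.median(projected, axis=0)
    return eigvecs @ medians                           # inverse transform (E is orthogonal)

rng = np.random.default_rng(2)
data = rng.normal(size=(4000, 2))
D = 0.65                                               # example value of the allowed distortion
subset = densest_cell_subset(data, cell_size=2.0 * D / 3.0)
c_direct = direct_median_centroid(subset)
c_pc = pc_median_centroid(subset)
```

On a roughly symmetric local subset such as this one, the two estimates should lie close together; the direct-median version skips the eigendecomposition.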
Fig. 2.5. The Principal Component (PC) method to determine a centroid.
Pruning the Training Data
Once the code vector is located, the next step is to determine a sphere around it, as shown in Figure 2.4(c). (We note that for classification applications it could be an ellipse.) One way to determine the size of the sphere is to estimate the maximal number of data vectors that will be included inside it. The set of vectors within the sphere, $X_c$, is a subset of $X_s$: $X_c \subset X_s$. To determine a sphere for the cth code vector, the total number of data vectors $T^c$ within the sphere is estimated by
$$T^c = W^c \frac{M^c}{R + 1 - c} \tag{2.11}$$
where $M^c$ is the total number of data vectors in the current training data set ($M^1 = M$ and $M^c < M$ when $c > 1$) and $R + 1 - c$ is the number of code vectors that have yet to be located. The term $W^c$ is a variable weight,
$$W^c = W^{c-1} - \Delta W \tag{2.12}$$
where $\Delta W$ is the change of the weight variable between each of the algorithm’s R cycles. The weight is employed to keep the individual spheres from growing too large. From our experience with the design examples in this chapter, the resulting performance is not very sensitive to either $W^1$ or $\Delta W$. After the number of vectors in the sphere is determined, the data subset $X_c$ is pruned from the current training data set. As shown in Figure 2.4(d), a “hole” is cut and the data vectors within the hole are pruned. The entries of the mass function $f(\cdots)$ associated with the highlighted cells are then updated. The design for the next code vector starts by selecting the cell that currently contains the largest number of training vectors.
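The pruning step can be sketched as follows. This is a minimal illustration under stated assumptions: the function names are hypothetical, the local window is taken as a simple square around an assumed code vector at the origin, the hole is formed from the vectors of the window closest to the code vector, and the radius cap reflects the diameter limit of 2D mentioned in the text.

```python
import numpy as np

def hole_size(weight, remaining_vectors, total_regions, c):
    """Eq. (2.11): number of vectors T^c assigned to the c-th hole (c starts at 1)."""
    return int(round(weight * remaining_vectors / (total_regions + 1 - c)))

def prune(data, subset_mask, centroid, t_c, max_radius):
    """Delete up to t_c vectors of the local subset that lie closest to the centroid.

    Vectors farther than max_radius from the centroid are never removed, which
    limits the sphere diameter to 2 * max_radius.
    """
    idx = np.flatnonzero(subset_mask)
    dist = np.linalg.norm(data[idx] - centroid, axis=1)
    order = np.argsort(dist)[:t_c]                 # closest t_c vectors of the subset
    order = order[dist[order] <= max_radius]
    keep = np.ones(len(data), dtype=bool)
    keep[idx[order]] = False
    return data[keep], idx[order].size

rng = np.random.default_rng(3)
data = rng.normal(size=(4000, 2))
R, weight, delta_w, D = 16, 1.4, 0.005, 0.65       # example values quoted in Section 2.4.4
centroid = np.zeros(2)                             # assumed code vector for this demo
subset_mask = np.abs(data).max(axis=1) <= 0.65     # assumed local window around it
t_1 = hole_size(weight, len(data), R, c=1)         # T^1 = 1.4 * 4000 / 16
remaining, removed = prune(data, subset_mask, centroid, t_1, max_radius=D)
weight = weight - delta_w                          # Eq. (2.12), ready for the next hole
```

Sorting the distances inside the window is what determines the cut-off point of the hole; the histogram counts of the affected cells would then be recomputed from `remaining`.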
Due to the nature of the design procedure, it can be imagined that the diameters of the spheres might get larger and larger while the design is in progress. To alleviate this, each sphere’s diameter is limited to 2D. This value is chosen to avoid the situation in which one sphere entirely becomes a subset of another sphere (some overlap is common).
Updating the Designed Code Vectors Once
After locating all R code vectors and cutting the R “holes” in X, there often remains a residual data set $X_l \subset X$ from the training set. The last step of our one-pass design is to update the designed code vectors by calculating the centroids of the nearest vectors around each code vector. This is equivalent to one iteration of the K-means algorithm.
2.4.2 Steps of the One-Pass VQ Algorithm
The one-pass VQ algorithm is summarized below:
Step 1. Initialization
1.1 Give the training data set X (an M × N data matrix), the number of designed centroids C (or the number of regions, since R = C), and the maximum allowable distortion D, which can be either given or estimated from X and C.
1.2 Set the weight values $W^1$ and $\Delta W$.
Step 2. Computing the Rectangular Histogram
2.1 Determine the number of cells in each dimension n, n = 1, 2, ..., N, as in (2.8).
2.2 Compute the probability mass function f by counting the training vectors within each of the cells of the histogram.
Step 3. Sequential Codebook Design
3.1 From $f(\cdot)$, select the histogram cell which currently contains the most training vectors.
3.2 Group histogram cells adjacent to the selected one to obtain the data set $X_s$.
3.3 Determine a centroid for $X_s$ by the PC or direct median method.
3.4 Calculate the maximal number of vectors $T^c$ for the cth “hole” as in (2.11).
3.5 Determine and prune (delete) a total of $T^c$ vectors from $X_s$. Update the entries of $f(\cdot)$ associated with the cells of $X_s$ only.
3.6 Update the parameters $W^{c+1}$ as in (2.12) and $R^{c+1} = R^c - 1$.
3.7 If $R^{c+1} > 0$, set c = c + 1 and go to Step 3.1.
Step 4. Improve the Designed Centroids
4.1 Update the VQ’s centroids once by re-calculating the means of those vectors from X nearest to each centroid (one iteration of the K-means algorithm).
Step 5. Stop
2.4.3 Complexity Analysis
This section is concerned with the complexity analysis of the sequential selection of codebook vectors (principal features). The one-pass VQ includes one sequential selection and one LBG iteration. Complexity is measured by counting the number of real-number operations required by the sequential selection. We define the following notation for the analysis.
• R: the number of VQ regions
• M: the number of training vectors
• N: the data dimension
• L: the number of histogram bins per dimension
• k: the number of data vectors in a highlighted window is k times larger than the average number, M/R
• $B_{\max}$: the maximum number of histogram cells (boxes) with a nonzero count of training data vectors
The VQ design requires some initialization:
1) Computing the initial histogram requires
$$N M \log_2(L) \ \text{ops}$$
This is followed by R repetitions of finding a centroid of the local window about the histogram maximum and pruning the data set:
2) Sorting the histogram counts (to find a high-density cell) requires
$$B_{\max} \log_2(B_{\max}) \ \text{ops}$$
3) Finding a centroid by calculating medians requires
$$kN\frac{M}{R} \log_2\left(k\frac{M}{R}\right) \ \text{ops}$$
4) Computing the distance of each training vector in the window to the centroid (N multiplications plus 2N − 1 additions/subtractions for each vector) requires
$$k\frac{M}{R}(3N - 1) \ \text{ops}$$
5) Sorting the distances to determine the cut-off point requires
$$k\frac{M}{R} \log_2\left(k\frac{M}{R}\right) \ \text{ops}$$
6) Updating the histogram requires
$$k\frac{M}{R} \ \text{ops}$$
Summing all this yields a total of
$$N M \log_2(L) + R\left[B_{\max}\log_2(B_{\max}) + kN\frac{M}{R}\log_2\left(k\frac{M}{R}\right) + k\frac{M}{R}(3N - 1) + k\frac{M}{R}\log_2\left(k\frac{M}{R}\right) + k\frac{M}{R}\right] \tag{2.13}$$
operations. It is equal to
$$N M \log_2(L) + R\left[B_{\max}\log_2(B_{\max}) + 3Nk\frac{M}{R} + (N + 1)k\frac{M}{R}\log_2\left(k\frac{M}{R}\right)\right] \tag{2.14}$$
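The operation count (2.14) is easy to evaluate numerically. The following minimal sketch does so for the parameter values of the image example used in the text (M = 16384, R = 64, N = 16, k = 4, L = 8, and $B_{\max} = M/2$) and compares the result with the 3NMR operations of one LBG iteration. The function names are illustrative assumptions.

```python
import math

def sequential_selection_ops(M, N, R, L, k, B_max):
    """Operation count of Eq. (2.14) for the sequential selection."""
    window = k * M / R                                  # vectors in one local window
    per_code_vector = (B_max * math.log2(B_max)
                       + 3 * N * window
                       + (N + 1) * window * math.log2(window))
    return N * M * math.log2(L) + R * per_code_vector

def lbg_iteration_ops(M, N, R):
    """Operation count 3NMR quoted in the text for one LBG iteration."""
    return 3 * N * M * R

# Image example from the text: M = 16384, R = 64, N = 16, k = 4, L = 8, B_max = M / 2.
M, N, R, L, k = 16384, 16, 64, 8, 4
seq_mops = sequential_selection_ops(M, N, R, L, k, B_max=M // 2) / 1e6
lbg_mops = lbg_iteration_ops(M, N, R) / 1e6
print(f"sequential selection: {seq_mops:.2f} Mops, one LBG iteration: {lbg_mops:.2f} Mops")
```

With these values the script reports about 21.89 Mops for the sequential selection and 50.33 Mops for one LBG iteration.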
For comparison, the LBG algorithm needs 3NMR operations per iteration. Suppose the data source is a 512 × 512 image and the codebook size is 64 (M = 16384, R = 64, and N = 16). If we use worst-case parameters for the one-pass algorithm, k = 4, L = 8, and $B_{\max} = M/2$, the sequential selection needs 21.89 Mops, whereas one LBG iteration needs 50.33 Mops.
2.4.4 Codebook Design Examples
In this section the one-pass VQ algorithm is tested by designing codebooks for uncorrelated Gaussian, correlated Gaussian, and independent Laplace sources, since many benchmark results on these sources have been reported [2, 3, 14, 15, 21, 19, 20, 18, 24]. To facilitate the comparison of CPU time and flops (floating-point operations) across different systems, we use the well-known LBG algorithm as a common reference and define the following two kinds of speed-up:
$$\text{Speed-up-in-time} = \frac{\text{CPU time for LBG}}{\text{CPU time for algorithm}} \tag{2.15}$$
$$\text{Speed-up-in-flops} = \frac{\text{Flops for LBG}}{\text{Flops for algorithm}}. \tag{2.16}$$
Since we use a high-level, interpreted language for simulation on a multi-user system, we prefer the speed-up-in-flops as the comparison measure.
Two-Dimensional Uncorrelated Gaussian Source
In this example, the data vectors are from a zero-mean, unit-variance, independent Gaussian distribution. The joint density is given by
$$f(x_1, x_2) = \frac{1}{2\pi}\exp\left[-(x_1^{2} + x_2^{2})/2\right], \quad -\infty < x_1, x_2 < \infty. \tag{2.17}$$
Fig. 2.6. Left: Uncorrelated Gaussian source training data. Right: The residual data after four code vectors have been located.
The training set X consisted of 4,000 two-dimensional vectors, as shown in Figure 2.6. The goal is to design a size R = 16 codebook that makes the MSE (mean-squared error) as small as possible. To set the design parameters, we estimate D by D = 5.2/(2 × log2(16)) = 0.65, since most of the data vectors lie within a circular region of diameter 5.2. We used the weights $W^1 = 1.4$ and $\Delta W = 0.005$. The simulation showed that the first “hole” was selected and cut in the center of the Gaussian distribution; the 2nd to the 4th holes were selected around the first one, as shown in Figure 2.6. The algorithm continued until all 16 code vectors had been located and 16 holes had been cut. Figure 2.7 shows these code vectors and the residual data. The code vectors were then updated by the nearest vectors of X. The “+” signs in Figure 2.7 are the final centroids generated by the one-pass algorithm.

Table 2.1. Quantizer MSE Performance

    Quantizer                                     MSE
1   One-pass VQ (introduced method)               0.218
2   One-pass VQ + 2 LBG (introduced)              0.211
3   Linde-Buzo-Gray (LBG, 20 iterations)          0.211
4   Strictly polar quantizer (from [24])          0.240
5   Unrestricted polar quantizer (from [24])      0.218
6   Scalar (Max) quantizer (from [14])            0.236
7   Dirichlet polar VQ (from [21])                0.239
8   Dirichlet rotated polar VQ (from [21])        0.222
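To make the MSE figures concrete, the following minimal sketch measures the distortion of a 16-point codebook on synthetic unit-variance Gaussian data and evaluates the estimate of D used above. Several assumptions are made here: the MSE is taken per two-dimensional vector (summed over both dimensions), the reference codebook is a product of standard 4-level Lloyd-Max scalar levels for a unit Gaussian (textbook values, not taken from this chapter), and the sample size and function name are illustrative.

```python
import numpy as np

def vq_mse(data, codebook):
    """Mean squared error per vector when each vector is replaced by its nearest code vector."""
    dists = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).mean())

# Distortion estimate used in the text: region diameter 5.2 and R = 16 code vectors.
D = 5.2 / (2 * np.log2(16))

# 16-point product codebook built from 4-level Lloyd-Max scalar levels for a unit Gaussian
# (standard textbook values, used here only as an assumed reference codebook).
levels = np.array([-1.510, -0.4528, 0.4528, 1.510])
codebook = np.array([[a, b] for a in levels for b in levels])

rng = np.random.default_rng(4)
data = rng.normal(size=(100_000, 2))
mse = vq_mse(data, codebook)
```

Under these assumptions the measured value should fall near the scalar (Max) quantizer entry of Table 2.1, which suggests that the tabulated MSE values are per vector rather than per dimension.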
In order to further compare the one-pass algorithm with the LBG algorithm, we used the centroids designed by the one-pass algorithm as the initial centroids for the LBG algorithm and then ran two iterations of LBG (called “one-pass+2LBG”). The centroids designed by the one-pass+2LBG are shown in Figure 2.7 (Right), denoted by the “◦”s. They are very close to the one-pass centroids (“+”s). This suggests another application of the one-pass algorithm: it can be used to determine the initial centroids for the LBG algorithm for a fast and high-performance design. The MSE of the one-pass design is compared in Table 2.1 with the MSE from other methods on an uncorrelated Gaussian source with 16 centroids. The MSE of the one-pass algorithm is equal to or better than the benchmark results. Table 2.2 shows the CPU time and Mflops used by the one-pass and LBG algorithms. For the same MSE, 0.218, the one-pass algorithm has a speed-up-in-flops of 7.2/1.5 = 4.8.

Fig. 2.7. Left: The residual data after all 16 code vectors have been located. Right: The “+” and “◦” are the centroids after one and three iterations of the LBG algorithm, respectively.

Table 2.2. Comparison of One-Pass and LBG Algorithms

Type               Iterations   MFLOPS   CPU Time (seconds)   MSE
One-Pass           1            1.5      35                   0.218
One-Pass + 2LBG    1+2          3.0      67                   0.211
LBG                9            7.2      166                  0.218
LBG                20           15.1     333                  0.211

Multidimensional Uncorrelated Gaussian Source
As shown in Table 2.3, the one-pass VQ algorithm was compared with five other algorithms on an N = 4 i.i.d. Gaussian source of 1,600 training vectors and a codebook size of R = 16. The speed-up-in-flops values in Table 2.3, items 1 and 2, are from our simulations. The speed-up-in-time values in items 3 to 7 were calculated from the training times provided by Huang and Harris in [4]. Again, the one-pass algorithm shows a higher speed-up.

Table 2.3. Comparison of Different VQ Design Approaches

    VQ Design Approaches                     MSE per dimension   Speed-up
1   One-pass VQ (introduced)                 0.35054             2.12
2   LBG with 5 iterations                    0.34919             1.00
3   LBG (from [4])                           0.41791             1.00
4   Directed-search binary-splitting [4]     0.42202             1.50
5   Pairwise nearest neighbor [4]            0.42975             2.00
6   Simulated annealing [4]                  0.41166             0.0023
7   Fuzzy c-mean clustering [4]              0.51628             0.22
The speed-ups in items 1 and 2 are in-flops. All others are in-time.
Correlated Gaussian Source
The training set consists of 4,000 two-dimensional Gaussian-distributed vectors with zero means, unit variances, and a correlation coefficient of 0.8. The codebook size is R = 16. The results are compared with a benchmark result in Table 2.4. The one-pass VQ algorithm yields a lower MSE.

Table 2.4. Comparison for the Correlated Gaussian Source

Type                   Iterations  Mflops  CPU (seconds)  MSE
One-Pass               1           1.51    58             0.1351
One-Pass + LBG         1+2         2.95    110            0.1279
Block VQ (from [20])   -           -       -              0.1478
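The refinement step used in these experiments (a few LBG iterations started from given centroids) can be sketched as follows. This is a minimal illustration, not the code behind the tables: the function names are hypothetical, the initial centroids are a random stand-in for the one-pass centroids, and the MSE is normalized per dimension as an assumption.

```python
import numpy as np

def lbg_iterations(data, centroids, n_iter=2):
    """Run n_iter LBG (Lloyd) iterations starting from the given centroids."""
    centroids = centroids.copy()
    for _ in range(n_iter):
        # Squared distances, shape (n_vectors, n_centroids).
        d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for k in range(len(centroids)):
            members = data[nearest == k]
            if len(members) > 0:          # keep an empty cell's centroid unchanged
                centroids[k] = members.mean(axis=0)
    return centroids

def mse_per_dimension(data, centroids):
    """Mean squared quantization error, divided by the vector dimension."""
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).mean() / data.shape[1]

# Correlated Gaussian source: zero means, unit variances, correlation 0.8.
rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
train = rng.multivariate_normal(np.zeros(2), cov, size=4000)

# Stand-in initial codebook (assumption): 16 random training vectors.
init = train[rng.choice(len(train), size=16, replace=False)]
refined = lbg_iterations(train, init, n_iter=2)

mse_init = mse_per_dimension(train, init)
mse_refined = mse_per_dimension(train, refined)
```

Each LBG iteration cannot increase the training MSE, so the quality of the starting codebook largely determines how few iterations are needed.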
Laplace Source
In this example, the 4,000 two-dimensional training vectors have an independent Laplace distribution
\[ f(x_1, x_2) = \frac{1}{2}\, e^{-\sqrt{2}\,|x_1|}\, e^{-\sqrt{2}\,|x_2|} \tag{2.18} \]
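A training set of this kind can be generated as sketched below. The density in (2.18) corresponds to two independent unit-variance Laplace components, that is, a Laplace scale parameter of 1/√2; the seed and variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# A Laplace variable with scale b has variance 2*b**2, so b = 1/sqrt(2)
# gives unit variance and the exponent -sqrt(2)*|x| of Eq. (2.18).
scale = 1.0 / np.sqrt(2.0)
train = rng.laplace(loc=0.0, scale=scale, size=(4000, 2))

def laplace_density(x1, x2):
    """Joint density of Eq. (2.18) for independent components."""
    return 0.5 * np.exp(-np.sqrt(2.0) * abs(x1)) * np.exp(-np.sqrt(2.0) * abs(x2))
```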
The training data is shown in Figure 2.8(Left); the one-pass VQ’s centroids (“+”s) as well as the residual data are shown in Figure 2.8(Right). The improved centroids by one-pass+2LBG are denoted as “◦”s in Figure 2.8(Right) also. Table 2.5 contains a comparison of the one-pass algorithm with several other algorithms on independent Laplace sources. The one-pass algorithm has a speed-up-in-flops of 5.8/1.5 = 3.9 and lower MSE than the benchmark results.
Fig. 2.8. Left: The Laplace source training data. Right: The residual data, one-pass designed centroids “+”, and one-pass+2LBG centroids “◦”.
Table 2.5. Comparison on the Laplace Source

Type                  Iterations  Mflops  CPU (sec.)  MSE
One-Pass              1           1.5     60          0.262
One-Pass + LBG        1+2         3.0     111         0.259
LBG (this chapter)    7           5.8     208         0.260
UDQ (from [18])       -           -       -           0.302
MAX (from [14])       -           -       -           0.352
LBG (from [18])       -           -       -           0.264
2.4.5 Robustness Analysis
For robust speaker and speech recognition, the selected VQ algorithm needs to be robust; that is, outliers in the training data should have little effect on the VQ training results. The one-pass VQ algorithm has this property because its sequential process can be made invariant to outliers. Let us assume that the residual data vectors Xo in Figure 2.8(Right) are outliers. The entire training data set X is the union of the removed set Xr and the outlier set Xo, X = Xr ∪ Xo. Because the density of the outlier area is low, the one-pass algorithm does not assign codebook vectors to the outliers. To keep the outliers out of the training and thereby improve the robustness of the designed codebook, we can use only the data set Xr in the last step, the centroid update (Step 4 in the List of the Algorithm). Thus, the outliers Xo are not included in the VQ design, and the designed codebook is robust with regard to these outliers.

In summary, the experimental results for different data sources demonstrate that the one-pass algorithm achieves near-optimal MSE while its CPU time and Mflops are only slightly higher than those of a single iteration of the LBG algorithm. High performance, fast training, and robustness are the advantages of the one-pass algorithm.
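The effect of restricting the final centroid update to Xr can be illustrated with a small sketch. The function name, the boolean mask that marks the retained vectors, and the toy data are assumptions made for this example; the earlier steps of the one-pass algorithm are not reproduced here.

```python
import numpy as np

def update_centroids(data, centroids, keep_mask):
    """Recompute each centroid as the mean of its nearest retained vectors."""
    kept = data[keep_mask]                       # Xr: vectors kept for training
    d2 = ((kept[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d2.argmin(axis=1)
    updated = centroids.copy()
    for k in range(len(centroids)):
        members = kept[nearest == k]
        if len(members) > 0:
            updated[k] = members.mean(axis=0)
    return updated

# Toy data: two tight clusters plus one far outlier (last row).
data = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0], [40.0, -40.0]])
init = np.array([[0.0, 0.0], [5.0, 5.0]])

keep = np.array([True, True, True, True, False])   # outlier excluded (Xo)
robust = update_centroids(data, init, keep)
naive = update_centroids(data, init, np.ones(len(data), dtype=bool))
```

With the outlier excluded, the centroids stay at the cluster means; including it drags the first centroid far away from its cluster.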
2.5 Segmental K-Means
In the above discussions, we addressed the segmentation problem for a stationary process, where the joint probability distribution does not change when shifted in time. However, a speech signal is a non-stationary process, where the joint probability distribution of the observed data changes when shifted in time. The current approach to segmenting a speech signal is to divide the non-stationary speech sequence into a sequence of short segments. Within each short segment, we can assume that the data is stationary. The segmental K-means algorithm was developed for this purpose. It has been applied to Markov chain modeling and hidden Markov model (HMM) parameter estimation [13, 16, 7]. It is one of the fundamental algorithms for HMM training.

The segmental K-means algorithm iterates between two kinds of computation: segmentation and optimization. In the beginning, the model parameters, such as the centroids or means, are initialized with random numbers in meaningful ranges. In the segmentation stage, the sequentially observed non-stationary data are partitioned into multiple sequential states. There is no overlap among the states. Within each state, the joint probability distribution is assumed to be stationary; thus, the optimization stage can then estimate the model parameters from the data within each state. The segmentation process is equivalent to a sequential decoding procedure and can be performed by using the Viterbi algorithm [23]. The iteration between optimization and segmentation may need to be repeated several times. The details of the segmental K-means algorithm are available in [13, 16, 7], while the details of the decoding process will be discussed in Chapter 6.
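The alternation between segmentation and optimization can be sketched on a toy problem. The following simplified illustration assumes scalar observations, left-to-right states that cannot be skipped, one mean per state, and squared error in place of a likelihood; it is not the full HMM training procedure of [13, 16, 7]. The function names are hypothetical, and the sequence must be at least as long as the number of states.

```python
def segment(obs, means):
    """Viterbi-style dynamic programming over left-to-right states."""
    T, K = len(obs), len(means)
    INF = float("inf")
    cost = [[INF] * K for _ in range(T)]
    back = [[0] * K for _ in range(T)]
    cost[0][0] = (obs[0] - means[0]) ** 2          # must start in state 0
    for t in range(1, T):
        for k in range(K):
            stay = cost[t - 1][k]
            move = cost[t - 1][k - 1] if k > 0 else INF
            if move < stay:
                best, back[t][k] = move, k - 1
            else:
                best, back[t][k] = stay, k
            if best < INF:
                cost[t][k] = best + (obs[t] - means[k]) ** 2
    states = [K - 1] * T                           # must end in the last state
    for t in range(T - 1, 0, -1):
        states[t - 1] = back[t][states[t]]
    return states

def segmental_kmeans(obs, init_means, n_iter=10):
    """Alternate segmentation and mean re-estimation until the states settle."""
    means = list(init_means)
    states = segment(obs, means)
    for _ in range(n_iter):
        for k in range(len(means)):                # optimization stage
            members = [o for o, s in zip(obs, states) if s == k]
            if members:
                means[k] = sum(members) / len(members)
        new_states = segment(obs, means)           # segmentation stage
        if new_states == states:
            break
        states = new_states
    return means, states

obs = [0.1, -0.2, 0.0, 5.1, 4.9, 5.0, 5.2, 9.8, 10.1]
means, states = segmental_kmeans(obs, init_means=[1.0, 4.0, 8.0])
```

On this sequence the states converge after one re-estimation, with the three state means moving to the averages of the three plateaus.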
2.6 Conclusions
In this chapter, we introduced the multivariate Gaussian distribution because it will be used often in the rest of this book. We also introduced the concept of principal component analysis because it is helpful for reading the rest of this book and for thinking intuitively about data representations when developing new algorithms. We briefly reviewed the popular K-means and LBG algorithms for VQ. Readers can find more detailed treatments of these traditional algorithms in other textbooks (e.g. [6, 1, 5]). We presented the one-pass VQ algorithm in detail. The one-pass VQ algorithm is useful in training background models with very large datasets, such as those used in speaker recognition. The concept and method of sequential data processing and pruning used in the one-pass VQ will be used in Chapter 3 and Chapter 14 for pattern recognition and speaker authentication. Finally, we discussed the segmental K-means algorithm, which will be used in HMM training throughout the book.
References
1. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
2. Fischer, T. R. and Dicharry, R. M., “Vector quantizer design for Gaussian, gamma, and Laplacian sources,” IEEE Transactions on Communications, vol. COM-32, pp. 1065–1069, September 1984.
3. Gray, R. M. and Linde, Y., “Vector quantizers and predictive quantizers for Gauss-Markov sources,” IEEE Transactions on Communications, vol. COM-30, pp. 381–389, September 1982.
4. Huang, C. M. and Harris, R. W., “A comparison of several vector quantization codebook generation approaches,” IEEE Transactions on Image Processing, vol. 2, pp. 108–112, January 1993.
5. Huang, X., Acero, A., and Hon, H.-W., Spoken Language Processing. NJ: Prentice Hall PTR, 2001.
6. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
7. Juang, B.-H. and Rabiner, L. R., “The segmental k-means algorithm for estimating parameters of hidden Markov models,” IEEE Trans. Acoustics, Speech and Signal Processing, vol. 38, pp. 1639–1641, Sept. 1990.
8. Li, Q. and Swaszek, P. F., “One-pass vector quantizer design by sequential pruning of the training data,” in Proceedings of International Conference on Image Processing, (Washington DC), October 1995.
9. Li, Q. and Tufts, D. W., “Improving discriminant neural network (DNN) design by the use of principal component analysis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit MI), pp. 3375–3379, May 1995.
10. Li, Q. and Tufts, D. W., “Synthesizing neural networks by sequential addition of hidden nodes,” in Proceedings of the IEEE International Conference on Neural Networks, (Orlando FL), pp. 708–713, June 1994.
11. Li, Q., Tufts, D. W., Duhaime, R., and August, P., “Fast training algorithms for large data sets with application to classification of multispectral images,” in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove), October 1994.
12. Linde, Y., Buzo, A., and Gray, R. M., “An algorithm for vector quantizer design,” IEEE Transactions on Communications, vol. COM-28, pp. 84–95, 1980.
13. MacQueen, J., “Some methods for classification and analysis of multivariate observations,” in Proc. 5th Berkeley Symp. Math. Stat., Prob., pp. 281–296, 1967.
14. Max, J., “Quantizing for minimum distortion,” IEEE Transactions on Information Theory, vol. IT-6, pp. 7–12, March 1960.
15. Paez, M. D. and Glisson, T. H., “Minimum mean-square-error quantization in speech PCM and DPCM systems,” IEEE Transactions on Communications, vol. COM-20, pp. 225–230, April 1972.
16. Rabiner, L. R., Wilpon, J. G., and Juang, B.-H., “A segmental k-means training procedure for connected word recognition,” AT&T Technical Journal, vol. 65, pp. 21–31, May/June 1986.
17. Soong, F. K., Rosenberg, A. E., and Juang, B.-H., “A vector quantization approach to speaker recognition,” AT&T Technical Journal, vol. 66, pp. 14–26, March/April 1987.
18. Swaszek, P. F., “Low dimension / moderate bitrate vector quantizers for the Laplace source,” in Abstracts of IEEE International Symposium on Information Theory, p. 74, 1990.
19. Swaszek, P. F., “Vector quantization for image compression,” in Proceedings of Princeton Conference on Information Sciences and Systems, (Princeton NJ), pp. 254–259, March 1986.
20. Swaszek, P. F. and Narasimhan, A., “Quantization of the correlated Gaussian source,” in Proceedings of Princeton Conference on Information Sciences and Systems, (Princeton NJ), pp. 784–789, March 1988.
21. Swaszek, P. F. and Thomas, J. B., “Optimal circularly symmetric quantizers,” Journal of Franklin Institute, vol. 313, pp. 373–384, June 1982.
22. Tufts, D. W. and Li, Q., “Principal feature classification,” in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge MA), August 1995.
23. Viterbi, A. J., “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,” IEEE Transactions on Information Theory, vol. IT-13, pp. 260–269, April 1967.
24. Wilson, S. G., “Magnitude/phase quantization of independent Gaussian variates,” IEEE Transactions on Communications, vol. COM-28, pp. 1924–1929, November 1980.
Chapter 3 Principal Feature Networks for Pattern Recognition
Pattern recognition is one of the fundamental technologies in speaker authentication. Understanding the concept of pattern recognition is important in developing speaker authentication algorithms and applications. There are already many books and tutorial papers on pattern recognition and neural networks (e.g. [6, 1]). Instead of repeating a similar introduction to the fundamental pattern recognition and neural network techniques, we introduce a different approach to neural network training and construction that was developed by the author and Tufts and named the principal feature network (PFN) [13, 14, 15, 12, 20]. The PFN is an analytical method for constructing a classifier or recognizer. Through this chapter, readers will gain a better understanding of pattern recognition methods and neural networks and their relation to multivariate statistical analysis. The PFN uses the fundamental methods of multivariate statistics as its core techniques and applies them sequentially in order to construct a neural network for classification or pattern recognition. The PFN can be considered a fast neural network design algorithm for pattern recognition, speaker authentication, and speech recognition. Due to its data pruning method, the PFN algorithm is efficient in processing large databases. This chapter also discusses the relationship among different hidden node design methods as well as the relationship between neural networks and decision trees. The PFN algorithm has been used in real-world pattern recognition applications.
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_3, © Springer-Verlag Berlin Heidelberg 2012

3.1 Overview of the Design Concept
A neural network consists of input, hidden, and output nodes and one or more hidden layers. Each hidden layer can have multiple hidden nodes. Popular approaches, like the backpropagation algorithm [21], first define a network structure in terms of the number of hidden layers and the number of hidden nodes at each layer, and then train the hidden node parameters. Conversely, the PFN is designed to construct a neural network sequentially. It starts by training the first hidden nodes based on multivariate statistical theory, then adds more hidden nodes until the design specification is reached; finally, it combines the hidden nodes to construct the output layer and the entire network. Such an approach provides very fast training due to a data pruning method. This chapter is intended to help readers understand neural networks for pattern recognition and the functions of layers and hidden nodes.

We define a principal feature as a discriminant function that is intended to provide the maximum contribution to correct classification using the current training data set. Our principal feature classification (PFC) algorithm is a sequential procedure of finding principal features and pruning classified data. When a new principal feature is found, the correctly classified data of the current training data set is pruned so that the next principal feature can be constructed from the subset of data that has not yet been classified, rather than redundantly reclassifying or reconsidering training vectors that are already well classified. The PFC can also be considered a nonparametric statistical procedure for classification, while the hidden node design uses multivariate statistics. The network constructed by the PFC algorithm is called the principal feature network (PFN).

The PFC does not need gradient-descent-based training, which requires a very long training time in backpropagation and many similar training algorithms. Nor does the PFC need backward node pruning, as in CART [2] and the neural tree network [17], or node “pruning” by retraining (see [16] for a survey). A designed PFN can be implemented in the structure of a decision tree or a neural network. For these reasons, the PFC can be considered a fast and efficient algorithm for both neural network and decision tree design for classification or pattern recognition.
We use the following example to illustrate the concept of the PFC algorithm and PFN design. Example 1 – An Illustrative Example for the PFN Design Procedure We use the two labeled classes of artificial training data illustrated in Fig. 3.1 (a) to better specify the procedure of finding principal features and pruning the training data. We sequentially find principal features and associated hidden nodes at each stage by selecting the best of the following two methods for choosing the next feature: (1) Fisher’s linear discriminant analysis (LDA) [8, 10, 5], or (2) maximal signal-to-noise-ratio (SNR) discriminant analysis (see Section 3.4). A method for determining multiple thresholds associated with these features is used in evaluating the effectiveness of candidate features. Although more complicated features can be used, such as those from multivariate Gaussian [10], multivariate Gaussian mixture [19], or radial basis functions [3], the above two features are simple, complementary, and efficient. The details of designing the nodes will be introduced in the following sections.
Fig. 3.1. An illustrative example to demonstrate the concept of the PFN: (a) The original training data of two labeled classes which are not linearly separable. (b) The hyperplanes of the first hidden node (LDA node). (c) The residual data set and the hyperplanes of the second hidden node (SNR node). (d) The input space partitioned by two hidden nodes and four thresholds designed by the principal feature classification (PFC) method.
In the first step, we use all the training data in the input space of Fig. 3.1(a) to find a principal feature. In this step, Fisher’s LDA provides a good result. The corresponding feature can be calculated by an inner product of the data vector with the LDA weight vector. The hyperplanes perpendicular to the vector are shown in Fig. 3.1(b). It is important to note that multiple threshold values can be used with each feature. Then, the data vectors which have been classified at this step of the design procedure are pruned off. Here two threshold values have been used. Thus the unclassified data between the two corresponding hyperplanes is used to train the second hidden node.

The residual training data set for the next design stage is shown in Fig. 3.1(c). This is used to determine the second feature and second hidden node. Since the mean vectors of the two classes are now very close, Fisher’s LDA does not give a satisfactory principal feature. In the second hidden node design, maximum SNR analysis provides a better candidate for a principal feature, and the threshold-setting procedure in [11] gives us two associated hyperplanes, which are also shown in Fig. 3.1(c). The overall partitioned regions are shown in Fig. 3.1(d). All of the training data vectors have now been correctly classified. The size of the training data, the performance specifications, and the need to generalize to new test data influence the threshold settings and the stopping point of the design.

For this simple classification problem, the backpropagation (BP) training method [4] takes hundreds of seconds to hours, and one still does not get satisfactory classification using a multilayer perceptron (MLP) network with 5 hidden nodes and one output node: sum-squared error (SSE) = 4.35. The radial basis function (RBF) network with common kernel functions [4] can converge to an acceptable performance in 35 seconds, but it needs 56 nodes: SSE = 0.13. On the same problem, the principal feature classification takes only 0.2 seconds on the same machine and needs only two hidden nodes in a sequential implementation, Fig. 3.5(b). The performance of the PFN is better than that of both BP and RBF: SSE = 0.00.
3.2 Implementations of Principal Feature Networks
A principal feature network (PFN) is a decision network in which each hidden node computes the value of a principal feature and compares this value with one or more threshold values. A PFN can be implemented as a neural network through a parallel implementation or as a decision tree through a sequential implementation.

A parallel implementation of the PFN is shown in Fig. 3.2. The outputs of the hidden layer are binary words. Each word represents one partitioned region in the input data space. Each class may have more than one hidden-layer word. The outputs of the output layer are binary words too, which are logic functions of the hidden-layer words, but each class is represented by only one unique output binary word. Each hidden node threshold is labeled with one class, and the associated hyperplane partitions the input space into classified and unclassified regions for that class. All or nearly all of the training vectors within a classified region belong to the labeled class. The corresponding unclassified region includes the training vectors which have not been correctly classified yet. Generally speaking, each hidden node is a sub-classifier in the PFN. It classifies part of the training vectors and leaves the unclassified vectors to other hidden nodes.

Fig. 3.2. A parallel implementation of PFC by a Principal Feature Network (PFN).

A decision-tree implementation is shown in Fig. 3.3. We note that each hidden node can have multiple thresholds; thus more than one classification decision can be made in one node of the tree, e.g. Fig. 3.5(b). In a sequential implementation, the hidden nodes are evaluated in the order in which they were trained. Since each hidden node binary threshold output has been associated with one class in training, the sequential calculation stops as soon as a decision can be made. Since the first few hidden nodes are designed using the highest density regions in the input training space, it is very likely that a decision can be made early in the procedure.

Fig. 3.3. A sequential implementation of PFC by a Principal Feature Tree (PFT).

Example 1 (continued) – Parallel and Sequential Implementations
Parallel and sequential implementations of the designed PFC are shown in Fig. 3.4(b) and Fig. 3.5(b). The corresponding partitioned input spaces are shown in Fig. 3.4(a) and Fig. 3.5(a). For details on parallel and processor-array implementations, please refer to [11, 14].
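The sequential (tree) evaluation described in this section can be sketched as a short loop that stops at the first node able to decide. The node representation used here, a weight vector with one lower and one upper threshold and a class label for each, and all numeric values are hypothetical choices for illustration, not the trained network of Example 1.

```python
def classify_sequential(x, nodes, default=None):
    """Evaluate hidden nodes in training order; stop at the first decision.

    Each node is (w, low, low_label, high, high_label): the feature
    p = x^t w is compared with a lower and an upper threshold.
    """
    for w, low, low_label, high, high_label in nodes:
        p = sum(xi * wi for xi, wi in zip(x, w))   # principal feature value
        if p < low:
            return low_label                       # decided by lower threshold
        if p > high:
            return high_label                      # decided by upper threshold
        # Otherwise x lies in the unclassified region; try the next node.
    return default

# Two hypothetical nodes, each with two thresholds.
nodes = [
    ([1.0, 0.0], 2.0, "x", 6.0, "o"),
    ([0.0, 1.0], 1.0, "o", 3.0, "x"),
]

early = classify_sequential([1.0, 9.0], nodes)     # decided at node 1
late = classify_sequential([4.0, 0.5], nodes)      # decided at node 2
undecided = classify_sequential([4.0, 2.0], nodes)
```

Inputs that fall in a classified region of an early node never reach the later nodes, which is why the sequential form tends to be cheap on average.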
Fig. 3.4. (a) Partitioned input space for parallel implementation. (b) Parallel implementation.
Fig. 3.5. (a) Partitioned input space for sequential implementation. (b) Sequential (tree) implementation.

3.3 Hidden Node Design
With the PFN construction procedure explained, we can now introduce the hidden node design. The single hidden node design algorithms are inspired by the optimal multivariate Gaussian classification rule [10]. The rule is implemented as a Gaussian discriminant node [12] to facilitate theoretical analysis. Two practical hidden nodes are further defined. A Fisher’s node is a linear node trained by Fisher’s linear discriminant analysis [8, 10, 14, 13] and used for training classes with separable mean vectors. A principal component (PC) node is trained based on a criterion that maximizes a discriminant signal-to-noise ratio [12]. It is for training classes with common mean vectors. Both nodes are designed for non-Gaussian and nonlinearly separable cases in multi-dimensional and multi-class classification. When the training vectors are from more than two classes, the node design
algorithms can figure out which class to separate first. This class normally has more separable data vectors in the input data space. Connections between a Fisher’s node, a PC node, and a Gaussian node are also discussed below. It should be noted that there are many other algorithms for single hidden node training [9, 22, 17]. Most of them use gradient-descent or iterative methods. We prefer the statistical approaches because, in a single node design, they are optimal and much faster.

3.3.1 Gaussian Discriminant Node
When two training data populations, Class 1 and Class 2, are described as multivariate Gaussian distributions with sample mean vectors and covariance matrices $\mu_1, \Sigma_1$ and $\mu_2, \Sigma_2$, respectively, the minimum-cost classification rule is defined as [10]:
\[ \text{Class 1}: L(x) \ge \theta; \qquad \text{Class 2}: L(x) < \theta; \tag{3.1} \]
where $x$ is an observed data vector or feature vector of $N$ components, $\theta$ is a threshold determined by the cost ratio, the prior probability ratio, and the determinants of the covariance matrices, and
\begin{align}
L(x) &= x^t (\Sigma_1^{-1} - \Sigma_2^{-1}) x - 2 (\mu_1^t \Sigma_1^{-1} - \mu_2^t \Sigma_2^{-1}) x \tag{3.2} \\
     &= \sum_{i=1}^{N} \lambda_i |x^t w_i|^2 - 2 w_0 x, \tag{3.3}
\end{align}
where $w_0 = (\mu_1^t \Sigma_1^{-1} - \mu_2^t \Sigma_2^{-1})$ and, for $i > 0$, $\lambda_i$ and $w_i$ are the $i$-th eigenvalue and eigenvector of the matrix $\Sigma_1^{-1} - \Sigma_2^{-1}$. We define the above equation as a Gaussian Discriminant Node [12]. Its implementation is shown in Fig. 3.6(a).
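The equivalence between the quadratic form (3.2) and its eigen-expansion (3.3) can be checked numerically. The sketch below uses NumPy and arbitrary illustrative parameter values; the function names are hypothetical, and the threshold θ is left out because it depends on costs and priors that are not specified here.

```python
import numpy as np

def discriminant_quadratic(x, mu1, cov1, mu2, cov2):
    """L(x) written as the quadratic form of Eq. (3.2)."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    w0 = mu1 @ inv1 - mu2 @ inv2
    return x @ (inv1 - inv2) @ x - 2.0 * (w0 @ x)

def discriminant_eigen(x, mu1, cov1, mu2, cov2):
    """L(x) written as the eigen-expansion of Eq. (3.3)."""
    inv1, inv2 = np.linalg.inv(cov1), np.linalg.inv(cov2)
    lam, W = np.linalg.eigh(inv1 - inv2)     # symmetric matrix: real eigenpairs
    w0 = mu1 @ inv1 - mu2 @ inv2
    return np.sum(lam * (x @ W) ** 2) - 2.0 * (w0 @ x)

# Arbitrary illustrative parameters (assumptions, not data from the text).
mu1, cov1 = np.array([0.0, 0.0]), np.array([[1.0, 0.3], [0.3, 2.0]])
mu2, cov2 = np.array([1.0, -1.0]), np.array([[2.0, -0.5], [-0.5, 1.0]])
x = np.array([0.7, -1.2])

L_quad = discriminant_quadratic(x, mu1, cov1, mu2, cov2)
L_eig = discriminant_eigen(x, mu1, cov1, mu2, cov2)
```

The expansion shows why the node can be drawn as a bank of linear projections followed by squaring units: each eigenvector contributes one weighted squared projection.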
Fig. 3.6. (a) A single Gaussian discriminant node. (b) A Fisher’s node. (c) A quadratic node. (d) An approximation of the quadratic node.
When the covariance matrices in (3.3) are the same, the first quadratic term is zero, and the above classifier computes Fisher’s linear discriminant. The general node becomes a Fisher’s node, as shown in Fig. 3.6(b). When the second term in the equation is ignored, the above formulas only have the first quadratic term. Due to the sequential design procedure, in each PFN design step we only use the eigenvector associated with the largest eigenvalue (or a small number of principal eigenvectors). Thus, we have the quadratic node as shown in Fig. 3.6(c). The threshold squaring function can be further approximated by two thresholds as shown in Fig. 3.6(d). This gives us a theoretical reason to allow more than one threshold on each hidden node as shown in Fig. 3.2.
3.3.2 Fisher’s Node Design
Suppose that we have two classes of vectors in the n-dimensional input space and wish to use the current hidden node to separate these two classes. We need to find a direction for $w$ so that the projected data from the two classes are separated as far as possible. This problem has been studied by Fisher [8, 5, 10]. The solution is called linear discriminant analysis (LDA). For a two-class classification problem, the criterion function $J$ in Fisher’s linear discriminant [5] can be written as:
\[ J_{\max}(w) = \frac{w^t S_B w}{w^t S_W w}, \tag{3.4} \]
where
\[ S_B = (m_1 - m_2)(m_1 - m_2)^t, \tag{3.5} \]
and
\[ S_W = \sum_{x \in X_1} (x - m_1)(x - m_1)^t + \sum_{x \in X_2} (x - m_2)(x - m_2)^t, \tag{3.6} \]
where $X_1$ and $X_2$ are the data matrices of the two classes, in which each row represents one training data vector, and $m_1$ and $m_2$ are the sample means of the two classes. $S_W$ is a linear combination of the sample covariance matrices of the two classes. We assume that $(m_1 - m_2)$ is not a zero vector. If it is zero or close to zero, then a different training method (principal component discriminant analysis) [12] should be applied to design the hidden nodes.

Maximizing (3.4) is the well-known generalized Rayleigh quotient problem. The weight vector $w$ that maximizes $J$ is the eigenvector associated with the largest eigenvalue of the following generalized eigenvalue problem:
\[ S_B w = \lambda S_W w. \tag{3.7} \]
When $S_W$ is nonsingular, the above equation can be written as the following conventional eigenvalue problem:
\[ S_W^{-1} S_B w = \lambda w. \tag{3.8} \]
As pointed out in [5], since $S_B$ has rank one and $S_B w$ is always in the direction of $m_1 - m_2$, there is only one nonzero eigenvalue, and the weight vector $w$ can be solved directly as:
\[ w = \alpha S_W^{-1} (m_1 - m_2), \tag{3.9} \]
in which $\alpha$ is a constant that can be chosen for normalization or to make inner products with $w$ computationally simple in implementation.

For multiple-class problems, equation (3.5) becomes
\[ S_B = \sum_{i=1}^{c} r_i (m_i - m)(m_i - m)^t, \tag{3.10} \]
where $m_i$ is the mean of class $i$, $r_i$ is the number of training data vectors in class $i$, and $m$ is the mean vector of all classes, and equation (3.6) becomes
\[ S_W = \sum_{i=1}^{c} \sum_{x \in X_i} (x - m_i)(x - m_i)^t. \tag{3.11} \]
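For the two-class case, the closed-form solution (3.9) can be sketched with NumPy as follows. The synthetic data, the choice α = 1 followed by normalization to unit length, and the function names are assumptions made for illustration.

```python
import numpy as np

def fisher_weight(X1, X2):
    """Return the unit-norm Fisher weight vector and the scatter matrices."""
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    D1, D2 = X1 - m1, X2 - m2
    SW = D1.T @ D1 + D2.T @ D2            # within-class scatter, Eq. (3.6)
    SB = np.outer(m1 - m2, m1 - m2)       # between-class scatter, Eq. (3.5)
    w = np.linalg.solve(SW, m1 - m2)      # Eq. (3.9) with alpha = 1
    return w / np.linalg.norm(w), SB, SW

def rayleigh(w, SB, SW):
    """Criterion of Eq. (3.4)."""
    return (w @ SB @ w) / (w @ SW @ w)

# Synthetic two-class data with different means (rows are training vectors).
rng = np.random.default_rng(0)
X1 = rng.normal(size=(200, 2)) * [2.0, 0.5]
X2 = rng.normal(size=(200, 2)) * [2.0, 0.5] + [1.0, 2.0]

w, SB, SW = fisher_weight(X1, X2)
J_w = rayleigh(w, SB, SW)
```

Solving the linear system instead of forming the inverse of S_W explicitly is a common numerical choice; the resulting w satisfies (3.7) with eigenvalue J(w).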
We then solve the generalized eigenvalue problem in the same way as in equation (3.7). To save floating point operations (Flops), the problem in (3.4) can also be converted to a conventional eigenvalue problem by the change of variable
\[ w' = S_W^{1/2} w. \tag{3.12} \]
Then,
\[ J_{\max}(w') = \frac{w'^t S' w'}{w'^t w'}, \tag{3.13} \]
where
\[ S' = S_W^{-1/2} S_B S_W^{-1/2}. \tag{3.14} \]
The vector $w'$ is the eigenvector associated with the largest eigenvalue of a standard eigenvalue problem. Finally, the weight vector $w$ can be obtained by:
\[ w = S_W^{-1/2} w'. \tag{3.15} \]
Example 1 (continued)
For the two classes of data X1 and X2 shown in Fig. 3.1(a), use (3.6) to calculate SW and (3.9) to obtain the weight vector w. Then project X1 and X2 onto the weight vector. Two hyperplanes perpendicular to the weight vector w (as in Fig. 3.1(b)) can then be determined. The details of determining the thresholds will be discussed in Section 3.7.
3.4 Principal Component Hidden Node Design When the mean vectors of training classes are far enough apart, Fisher’s node is effective. However, when the mean vectors of the training classes are too close, Fisher’s LDA does not provide good classification. Here we should use the quadratic node, or approximate it by using a PC node [12].
3.4.1 Principal Component Discriminant Analysis
To design a quadratic node directly for non-Gaussian data, our criterion is to choose a weight vector $w$ that maximizes a discriminant signal-to-noise ratio $J$:
\[ J = \frac{E\{(X_l w)^t (X_l w)\}}{E\{(X_c w)^t (X_c w)\}} = \frac{w^t \Sigma_l w}{w^t \Sigma_c w}, \tag{3.16} \]
where $X_l$ is a matrix of row vectors of training data from class $l$, and $X_c$ is the matrix of training data from all classes except class $l$. Class $l$ is the class which has the largest eigenvalue among the eigenvalues calculated from the data matrices of each class. $\Sigma_l$ and $\Sigma_c$ are the estimated covariance matrices, and $w$ is the weight vector. The criterion can still be used when the mean vectors are the same and not zero. The weight vector $w$ can be determined by solving the following generalized eigenvalue and eigenvector problem:
\[ \Sigma_l w = \lambda \Sigma_c w. \tag{3.17} \]
The eigenvector associated with the largest eigenvalue provides the maximum value of $J$. However, more than one weight vector can be selected to improve the discriminant analysis. In other words, more than one quadratic hidden node can be trained by solving the eigenvalue problem once.

Example 1 (continued)
For the residual data set in Fig. 3.1(c), the mean vectors of the two classes are so close that the Fisher’s node is not efficient. A principal component node was designed by solving the eigenvalue problem in (3.17), where the covariance matrices, $\Sigma_l$ and $\Sigma_c$, are calculated from the residual data vectors in Fig. 3.1(c). The determined hyperplanes perpendicular to the weight vector $w$ are shown in Fig. 3.1(c). The input space partitioned by one Fisher’s node and one principal component node is shown in Fig. 3.1(d). If we used Fisher’s method to design the second node, the residual data set would not be totally partitioned by the second Fisher’s node, and one more node would be needed to totally partition the input space, as shown in Fig. 3.7. Even though that inefficient extra node can be removed by lossless simplification (see Section 3.8 [14, 13]) at the end of the design, it would slow down the training procedure.
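The generalized eigenproblem (3.17) can be solved numerically as sketched below. The synthetic common-mean data, the use of np.cov for the covariance estimates, and the function name are assumptions for illustration; class l is simply taken to be the class with the larger spread.

```python
import numpy as np

def pc_node_weight(Xl, Xc):
    """Weight vector maximizing J = (w^t cov_l w) / (w^t cov_c w)."""
    cov_l = np.cov(Xl, rowvar=False)       # rows are training vectors
    cov_c = np.cov(Xc, rowvar=False)
    # cov_l w = lambda cov_c w  is equivalent to  cov_c^{-1} cov_l w = lambda w.
    vals, vecs = np.linalg.eig(np.linalg.solve(cov_c, cov_l))
    vals, vecs = np.real(vals), np.real(vecs)
    i = int(np.argmax(vals))               # largest eigenvalue maximizes J
    w = vecs[:, i]
    return w / np.linalg.norm(w), vals[i], cov_l, cov_c

# Two classes with the same (zero) mean but different spread along axis 0.
rng = np.random.default_rng(0)
Xl = rng.normal(size=(500, 2)) * [3.0, 1.0]
Xc = rng.normal(size=(500, 2))

w, lam, cov_l, cov_c = pc_node_weight(Xl, Xc)
J = (w @ cov_l @ w) / (w @ cov_c @ w)
```

The selected direction is close to the first coordinate axis, where the spread of class l most exceeds that of the other class, and the eigenvalue equals the attained value of J.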
3.5 Relation between PC Node and the Optimal Gaussian Classifier The principal component node of the previous section is intended for nonGaussian common-mean-vector classes. We can prove that it approximates the discriminant capability of a Gaussian classifier (Gaussian node) when the
54
3 Principal Feature Networks for Pattern Recognition 3.5
3
2.5
2
1.5
1
0.5
0
-0.5 4
4.5
5
5.5
6
6.5
7
Fig. 3.7. When using only Fisher’s nodes, three hidden nodes and six thresholds are needed to finish the design.
class data are from Gaussian distributions with zero mean vectors, as follows. Let $w = \Sigma_c^{-1/2} V$, so that $V = \Sigma_c^{1/2} w$. Equation (3.16) can then be written as
$$J = \frac{V^t \Sigma_c^{-t/2} \Sigma_l \Sigma_c^{-1/2} V}{V^t V} = \frac{V^t S V}{V^t V}, \tag{3.18}$$
where $S = \Sigma_c^{-t/2} \Sigma_l \Sigma_c^{-1/2}$. Its singular value decomposition is $S = U \Lambda U^t$, where $\Lambda$ is a diagonal matrix of eigenvalues. The maximum occurs when $V = U_1$, where $U_1$ is the eigenvector associated with the largest eigenvalue $\lambda_1$ in $\Lambda$. Thus, the weight vector for which $J$ of (3.16) is a maximum is
$$w = \Sigma_c^{-1/2} U_1. \tag{3.19}$$
The classification functions of this quadratic node are
$$\text{Class } l: \quad |x^t \Sigma_c^{-1/2} U_1|^2 > \theta \tag{3.20}$$
$$\text{Class } c: \quad |x^t \Sigma_c^{-1/2} U_1|^2 \le \theta, \tag{3.21}$$
where $\theta$ is a classification threshold. This can provide a good approximation to the performance of the Gaussian classifier. The classification rule in (3.3) can now be written as [18]
\begin{align}
L(x) &= x^t (\Sigma_c^{-1} - \Sigma_l^{-1}) x \tag{3.22}\\
&= x^t \Sigma_c^{-t/2} (I - \Sigma_c^{t/2} \Sigma_l^{-1} \Sigma_c^{1/2}) \Sigma_c^{-1/2} x \tag{3.23}\\
&= x^t \Sigma_c^{-t/2} (I - S^{-1}) \Sigma_c^{-1/2} x \tag{3.24}\\
&= x^t \Sigma_c^{-t/2} (I - U \Lambda^{-1} U^t) \Sigma_c^{-1/2} x \tag{3.25}\\
&= x^t \Sigma_c^{-t/2} U (I - \Lambda^{-1}) U^t \Sigma_c^{-1/2} x \tag{3.26}\\
&= \sum_{i=1}^{N} (1 - 1/\lambda_i) \, |x^t \Sigma_c^{-1/2} U_i|^2. \tag{3.27}
\end{align}
Taking the first principal component ($i = 1$) from the above formula, the discriminant function becomes the same as the quadratic node in (3.20) and (3.21). Furthermore, if we use two thresholds to approximate the square function in Fig. 3.6(c), the classification rules of (3.20) and (3.21) become
$$\text{Class } l: \quad \theta_1 \le x^t \Sigma_c^{-1/2} U_1 \le \theta_2 \tag{3.28}$$
$$\text{Class } c: \quad x^t \Sigma_c^{-1/2} U_1 > \theta_2, \ \text{or} \ x^t \Sigma_c^{-1/2} U_1 < \theta_1. \tag{3.29}$$
The implementation of (3.28) and (3.29) is called the principal component node, as shown in Fig. 3.6(d). The structure is the same as the normal PFN hidden node defined in Fig. 3.2.
3.6 Maximum Signal-to-Noise-Ratio (SNR) Hidden Node Design
When the mean vectors of training classes are far apart, the Fisher’s node is effective. However, when the mean vectors of the training classes are too close, the Fisher’s LDA node does not provide good classification. Then one can use the quadratic Gaussian discriminant [10], or the following simple discriminant, which can be used for non-Gaussian data and often has almost the same discriminant capability as the quadratic Gaussian discriminant when the data vector is multivariate Gaussian. Proof is given in [12]. To design a robust quadratic node for possibly non-Gaussian data, we choose a weight vector $w$ to maximize a discriminant signal-to-noise ratio:
$$J = \frac{w^t \Sigma_l w}{w^t \Sigma_c w}, \tag{3.30}$$
where $\Sigma_l$ is the covariance matrix calculated from Class $l$, which has the largest eigenvalue among the eigenvalues calculated from the sample covariance matrices of each class, and $\Sigma_c$ is the covariance matrix calculated from data pooled from all other classes. The weight vector $w$ can be determined by solving a generalized eigenvalue and eigenvector problem, i.e. $\Sigma_l w = \lambda \Sigma_c w$. The maximum SNR node has been used to classify overlapped Gaussian data [12].
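To make the generalized eigenvalue problem concrete, the following is a minimal sketch for the two-dimensional case only, written with the Python standard library. The function names (`inv2`, `matmul2`, `max_snr_weight`, `snr`) are illustrative assumptions, not part of the original design software; the sketch assumes that the pooled covariance matrix is invertible and simply takes the dominant eigenvector of $\Sigma_c^{-1}\Sigma_l$ in closed form.

```python
import math


def inv2(m):
    """Inverse of a 2x2 matrix given as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return ((d / det, -b / det), (-c / det, a / det))


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return tuple(
        tuple(sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2))
        for i in range(2)
    )


def max_snr_weight(sigma_l, sigma_c):
    """Solve sigma_l w = lambda sigma_c w for the largest lambda (2-D case).

    Returns (lambda_max, w) with w normalized to unit length.
    """
    (a, b), (c, d) = matmul2(inv2(sigma_c), sigma_l)
    trace, det = a + d, a * d - b * c
    # Eigenvalues are real for covariance inputs; clamp tiny negative round-off.
    disc = max(trace * trace - 4.0 * det, 0.0)
    lam = 0.5 * (trace + math.sqrt(disc))
    # Any nonzero vector in the null space of (M - lam * I) is an eigenvector.
    if abs(b) > 1e-12 or abs(lam - a) > 1e-12:
        w = (b, lam - a)
    elif abs(c) > 1e-12 or abs(lam - d) > 1e-12:
        w = (lam - d, c)
    else:
        w = (1.0, 0.0)  # M is a multiple of the identity
    norm = math.hypot(w[0], w[1])
    return lam, (w[0] / norm, w[1] / norm)


def snr(w, sigma_l, sigma_c):
    """Discriminant ratio J of Eq. (3.30) for a given weight vector."""
    def quad(s):
        return sum(w[i] * s[i][j] * w[j] for i in range(2) for j in range(2))
    return quad(sigma_l) / quad(sigma_c)
```

For the returned weight vector, `snr(w, sigma_l, sigma_c)` should equal the returned eigenvalue, which is one quick way to check a design. For higher dimensions a numerical eigenvalue solver would be needed in place of the closed form used here.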
3.7 Determining the Thresholds from Design Specifications
After a weight vector is obtained by the LDA or by maximum SNR analysis, all current training vectors are projected onto the weight vector. The histograms of the projected data of all classes can then be evaluated. The classes on the far right and far left of the projection can be separated by determining thresholds. Each threshold and the region it separates are then labeled with the separated class. A technique for determining thresholds from design specifications, i.e. performance requirements for every class, was developed by the author and Dr. Tufts and has been applied successfully in all of the examples and applications of this chapter. Interested readers are referred to [11] for the details of this procedure.
3.8 Simplification of the Hidden Nodes
Due to the PFN architecture and the training procedure, pruning of the PFN hidden nodes is simpler than the pruning algorithms for multilayer perceptron (MLP) networks. We developed two kinds of pruning algorithms for different applications: lossless and lossy simplification. Lossless simplification is for a minimal implementation; lossy simplification is for improving the ability of the network to generalize, that is, to perform well on new data sets. To avoid confusion with the data pruning described in the above sections, we use the term simplification. Generally speaking, lossy simplification is needed for most applications. Readers are referred to [11] for lossless simplification. During PFN training, each threshold is labeled with the class it separates. We recall that each hidden node can have more than one threshold associated with separated classes. Also, the percentage of the training vectors of each class classified by each threshold in the sequential design can be saved in an array. This array, called the contribution array, can then be used for simplification analysis. We use the following example to illustrate the details.
3.9 Application 1 – Data Recognition
In a real signal recognition application, a large set of multi-dimensional training vectors of 10 classes was completely classified by a PFN using 49 hidden nodes and 98 thresholds. The contribution of each threshold to its labeled class, in terms of percentage of classification rate, is saved in a contribution array. The array was sorted and plotted in Fig. 3.8(a). From Fig. 3.8(a), we can see that only a few of the thresholds contribute significantly to full recognition of their classes. The accumulated network performance in the order of the sorted thresholds is shown in Fig. 3.8(b). The more thresholds
we keep, the higher the network accuracy we can obtain on the training data set, but keeping those thresholds which provide little contribution can affect the ability of the designed network to generalize to new or testing data.
Fig. 3.8. Application 1: (a) (bottom) The sorted contribution of each threshold in the order of its contribution to the class separated by the threshold. (b) (top) Accumulated network performance in the order of the sorted thresholds.
In the simplification procedure we seek to attain a desired network performance, which comes from the design specifications. This value is used to prune thresholds. In this example, the desired network performance is 92% correct decisions. A horizontal dash-dot line in Fig. 3.8(b) marks the desired 92% accuracy. The line intersects the curve of the accumulated network performance. By projecting the intersection onto Fig. 3.8(a), shown as the vertical broken line in both Fig. 3.8(a) and (b), the number of thresholds needed to meet the desired network performance can be determined. For this example, the first 38 thresholds in Fig. 3.8(a) can meet the 92% network accuracy requested in the design specifications. Thus thresholds 39 to 98 in the sorted contribution array can be deleted. If all of the thresholds associated with a hidden node have been deleted, then that hidden node should also be deleted. After this lossy simplification, the designed PFN has a performance of 91% on the training set and 88% on the test set using 31 hidden nodes and 38 thresholds. Thus, the performance on the test set is close to the performance on the training set.
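The lossy simplification described above can be sketched as follows. This is an illustrative sketch rather than the procedure of [11]: it assumes that each threshold's contribution is expressed as the fraction of all training vectors it classifies correctly, so that the contributions accumulate to the network accuracy. The names `select_thresholds`, `surviving_nodes`, and `threshold_to_node` are hypothetical.

```python
def select_thresholds(contribution, target):
    """Keep the largest-contribution thresholds until `target` is reached.

    contribution[k] is the (assumed) fraction of training vectors that
    threshold k classifies correctly. Returns (kept_indices, accumulated).
    """
    order = sorted(range(len(contribution)),
                   key=lambda k: contribution[k], reverse=True)
    kept, accumulated = [], 0.0
    for k in order:
        if accumulated >= target:
            break  # the desired network performance is already met
        kept.append(k)
        accumulated += contribution[k]
    return kept, accumulated


def surviving_nodes(threshold_to_node, kept):
    """A hidden node survives only if at least one of its thresholds is kept."""
    return sorted({threshold_to_node[k] for k in kept})
```

Thresholds that are not returned by `select_thresholds` are deleted, and any hidden node that no longer appears in `surviving_nodes` is deleted with them.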
3.10 Application 2 – Multispectral Pattern Recognition
We applied the principal feature classification to recognize categories of land cover from three images of Block Island, Rhode Island, corresponding to three spectral bands: two visible and one infrared. Each complete image has 4591 × 7754 pixels. Each pixel has a resolution of 1.27 m and belongs to one of 14 categories of land cover. The training data set is a matrix formed from a subset of pixels which have been labeled. Each row is one training vector which has 9 feature elements associated with one pixel [7], and each of these vectors was labeled with one of the 14 land cover categories. The 9 features of the data vectors consist of pixel intensity in the three color bands, three local standard deviations of intensity in a floating window of 10 m diameter around the row-designated pixel (one for each color), and three additional features from the side information of a soil database. These features are the degree of local slope at the designated pixel, the aspect of the slope at the designated pixel, and the drainage class of the soil. In [7] Duhaime identified the 14 categories of land cover for supervised training. The computer experiments on the multispectral image features started by using backpropagation and RBF algorithms [4]. However, neither of them obtained the needed classification results in a reasonable amount of time, as estimated from their convergence speeds. Then the PFN and a modified radial basis function (MRBF) algorithm [15] were applied to solve the problem. The experimental results are listed in Table 3.1 and compared with one another.

Table 3.1. Comparison of Three Algorithms in the Land Cover Recognition

Algorithm        Mflops   CPU Time (sec.)   No. of Nodes   Accuracy on Test Sets
PFN (proposed)   37.64    58                77             72%
MRBF [15]        221.93   518               490            60%
LDA [7]          –        –                 –              55%

PFN: Principal feature network; MRBF: Modified radial basis function network; and LDA: Linear discriminant analysis.
The MRBF method used a training data set of 140 sample vectors (limited by memory space), 10 from each category, and tested on a test data set of 700 samples, 50 samples from each category. The method results in an average accuracy of 60% on the test set for all of the 14 categories defined above. The training took 518 seconds CPU time on a Sun Sparc IPX workstation. The PFN is trained by 700 training vectors since the PFN can be trained with much less memory space. It took only 58 seconds CPU time and reached an average performance of 72% on the same test set and on all 14 categories. (The performance is 65% if using the 140 sample set for training.) The performance of LDA, as reported in [7], was 55% on an average of 11 categories out of the
all 14 categories based on different training and test data sets. The simulation software was written in an interpretive language for both PFN and MRBF.
3.11 Conclusions
The principal feature network (PFN) has been compared in experiments with popular neural networks, such as BP and RBF. It has also been compared with many constructive algorithms [11], such as the cascade-correlation architecture and decision tree algorithms. Generally speaking, the PFN possesses the advantages of the constructive algorithms. By applying multivariate statistical analysis in defining and training hidden nodes, the classifier can be trained much faster than by gradient-descent or other iterative algorithms. The overfitting problem results from requiring a higher classification accuracy than the system can actually achieve. This is solved by appropriately pruning thresholds using the design specifications, and thus generalization to new test data can be realized by lossy simplification. Compared with other algorithms, the PFN needs much less computation time in training and uses simpler structures for implementation while achieving the same or better classification performance than the traditional neural network approach. Due to these advantages, the PFN has been selected and implemented in important real-world applications. We hope that, through reading this chapter, readers have gained a better understanding of the concepts of multivariate statistics and neural networks.
References
1. Bishop, C., Neural networks for pattern recognition. NY: Oxford Univ. Press, 1995.
2. Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J., Classification and Regression Trees. Belmont, CA: Wadsworth International Group, 1984.
3. Chen, S., Cowan, C. F. N., and Grant, P. M., “Orthogonal least squares learning algorithm for radial basis function networks,” IEEE Transactions on Neural Networks, vol. 2, March 1991.
4. Demuth, H. and Beale, M., Neural network toolbox user’s guide. Natick, MA: The MathWorks Inc., 1994.
5. Duda, R. O. and Hart, P. E., Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
6. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
7. Duhaime, R. J., The Use of Color Infrared Digital Orthophotography to Map Vegetation on Block Island, Rhode Island. Master’s thesis, University of Rhode Island, Kingston RI, May 1994.
8. Fisher, R. A., “The statistical utilization of multiple measurements,” Annals of Eugenics, vol. 8, pp. 376–386, 1938.
9. Gallant, S. I., Neural Network Learning and Expert Systems. Cambridge, MA: The MIT Press, 1993.
10. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
11. Li, Q., Classification using principal features with application to speaker verification. PhD thesis, University of Rhode Island, Kingston, RI, October 1995.
12. Li, Q. and Tufts, D. W., “Improving discriminant neural network (DNN) design by the use of principal component analysis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit MI), pp. 3375–3379, May 1995.
13. Li, Q. and Tufts, D. W., “Principal feature classification,” IEEE Trans. Neural Networks, vol. 8, pp. 155–160, Jan. 1997.
14. Li, Q. and Tufts, D. W., “Synthesizing neural networks by sequential addition of hidden nodes,” in Proceedings of the IEEE International Conference on Neural Networks, (Orlando FL), pp. 708–713, June 1994.
15. Li, Q., Tufts, D. W., Duhaime, R., and August, P., “Fast training algorithms for large data sets with application to classification of multispectral images,” in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove), October 1994.
16. Reed, R., “Pruning algorithms - a survey,” IEEE Transactions on Neural Networks, vol. 4, pp. 740–747, September 1993.
17. Sankar, A. and Mammone, R. J., “Growing and pruning neural tree networks,” IEEE Transactions on Computers, vol. C-42, pp. 291–299, March 1993.
18. Scharf, L. L., Statistical Signal Processing. Reading MA: Addison-Wesley, 1990.
19. Streit, R. L. and Luginbuhl, T. E., “Maximum likelihood training of probabilistic neural networks,” IEEE Transactions on Neural Networks, vol. 5, September 1994.
20. Tufts, D. W. and Li, Q., “Principal feature classification,” in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge MA), August 1995.
21. Werbos, P. J., The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. New York: J. Wiley & Sons, 1994.
22. Zurada, J. M., Introduction to Artificial Neural Systems. New York: West Publishing Company, 1992.
Chapter 4 Non-Stationary Pattern Recognition
So far, we have discussed pattern recognition for stationary signals. In this chapter, we will discuss pattern recognition for both stationary and nonstationary signals. In speaker authentication, some tasks, such as speaker identification, are treated as stationary pattern recognition while others, such as speaker verification, are treated as non-stationary pattern recognition. We will introduce the stochastic modeling approach for both stationary and nonstationary pattern recognition. We will also introduce the Gaussian mixture model (GMM) and the hidden Markov model (HMM), two popular models that will be used throughout the book.
4.1 Introduction Signal or feature vectors extracted from speech signals are recognized as a stochastic process, which can be either stationary or non-stationary. A stationary process is a stochastic process whose joint probability distribution does not change when shifted in time. Stationary pattern recognition is used to recognize patterns which can be characterized as a stationary process, such as still image recognition. The model used for recognition can be a feedforward neural network, the principal feature network, or a Gaussian mixture model (GMM). A non-stationary process, on the other hand, is a stochastic process whose joint probability distribution does change when shifted in time. Nonstationary pattern recognition is used to recognize those patterns which change over time, such as video images or speech signals. In this case, the model used for recognition can be a hidden Markov model (HMM) or a recurrent neural network. In this chapter, we introduce the GMM and the HMM. In speaker authentication, the GMM is used for context-independent speaker identification, while the HMM is used for speaker verification and verbal information verification. Here, we focus on the basic GMM and HMM concepts and Bayesian decision
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_4, © Springer-Verlag Berlin Heidelberg 2012
theory. In subsequent chapters, we present real applications of the HMM and GMM in speaker authentication.
4.2 Gaussian Mixture Models (GMM) for Stationary Process
The GMM is defined to represent stochastic data distributions:
$$p(o_t|C_j) = p(o_t|\lambda_j) = \sum_{i=1}^{I} c_i N(o_t; \mu_i, \Sigma_i), \tag{4.1}$$
where $\lambda_j$ is the GMM for class $C_j$, $c_i$ is a mixture weight which must satisfy the constraint $\sum_{i=1}^{I} c_i = 1$, $I$ is the total number of mixture components, and $N(\cdot)$ is a Gaussian density function:
$$N(o_t; \mu_i, \Sigma_i) = \frac{1}{(2\pi)^{d/2} |\Sigma_i|^{1/2}} \exp\left( -\frac{1}{2} (o_t - \mu_i)^T \Sigma_i^{-1} (o_t - \mu_i) \right), \tag{4.2}$$
where $\mu_i$ and $\Sigma_i$ are the $d$-dimensional mean vector and covariance matrix of the $i$th component, and $o_t$ is the $t$th sample or observation, $t = 1, \ldots, T$. Given observed feature vectors, the GMM parameters can be estimated iteratively using a hill-climbing algorithm or the expectation-maximization (EM) algorithm [3]. As has been proved, the EM algorithm ensures a monotonic increase in the log-likelihood during the iterative procedure until a fixed-point solution is reached [21, 7]. In most applications, model parameter estimation can be accomplished in just a few iterations. At each step of the iteration, the parameter estimation formulas for mixture $i$ are:
$$\hat{c}_i = \frac{1}{T} \sum_{t=1}^{T} p(i|o_t, \lambda) \tag{4.3}$$
$$\hat{\mu}_i = \frac{\sum_{t=1}^{T} p(i|o_t, \lambda) \, o_t}{\sum_{t=1}^{T} p(i|o_t, \lambda)} \tag{4.4}$$
$$\hat{\Sigma}_i = \frac{\sum_{t=1}^{T} p(i|o_t, \lambda) (o_t - \hat{\mu}_i)(o_t - \hat{\mu}_i)^T}{\sum_{t=1}^{T} p(i|o_t, \lambda)} \tag{4.5}$$
where
$$p(i|o_t, \lambda) = \frac{c_i N(o_t; \mu_i, \Sigma_i)}{\sum_{j=1}^{I} c_j N(o_t; \mu_j, \Sigma_j)}. \tag{4.6}$$
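The re-estimation formulas (4.3) to (4.6) can be illustrated with a short sketch. To stay within the Python standard library, the sketch below is restricted to scalar observations, so each covariance matrix reduces to a variance; the function names and the variance floor `var_floor` are assumptions introduced for illustration and are not taken from the text.

```python
import math


def gauss(x, mean, var):
    """Univariate Gaussian density, the scalar case of Eq. (4.2)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_likelihood(data, weights, means, variances):
    """Total log-likelihood of the data under the mixture of Eq. (4.1)."""
    return sum(
        math.log(sum(c * gauss(x, m, v)
                     for c, m, v in zip(weights, means, variances)))
        for x in data
    )


def em_step(data, weights, means, variances, var_floor=1e-6):
    """One EM iteration; returns the re-estimated (weights, means, variances)."""
    # E-step: posterior p(i | o_t, lambda) of each component, Eq. (4.6).
    post = []
    for x in data:
        joint = [c * gauss(x, m, v) for c, m, v in zip(weights, means, variances)]
        total = sum(joint)
        post.append([j / total for j in joint])
    # M-step: Eqs. (4.3)-(4.5), with a small floor to keep variances positive.
    new_w, new_m, new_v = [], [], []
    for i in range(len(weights)):
        occupancy = sum(p[i] for p in post)
        mean = sum(p[i] * x for p, x in zip(post, data)) / occupancy
        var = sum(p[i] * (x - mean) ** 2 for p, x in zip(post, data)) / occupancy
        new_w.append(occupancy / len(data))
        new_m.append(mean)
        new_v.append(max(var, var_floor))
    return new_w, new_m, new_v
```

Calling `em_step` repeatedly and tracking `log_likelihood` after each call gives a simple way to observe the monotonic increase mentioned above.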
One application of the above model is for context-independent speaker identification, where we assume that each speaker’s speech characteristics manifest only acoustically and are represented by one (model) class. When a
spoken utterance is long enough, it is reasonable to assume that the acoustic characteristic is independent of its content. For a group of $M$ speakers, in the enrollment phase, we train $M$ GMMs, $\lambda_1, \lambda_2, \ldots, \lambda_M$, using the re-estimation algorithm. In the test phase, given an observation sequence $O$, the objective is to find, in the prescribed speaker population, the speaker model that achieves the maximum posterior probability. From Eq. (4.19) and assuming the prior probabilities are the same for all speakers, the decision rule is
$$\text{Take action } \alpha_k, \text{ where } k = \arg\max_{1 \le i \le M} \sum_{t=1}^{T} \log p(o_t|\lambda_i), \tag{4.7}$$
where $\alpha_k$ is the action of deciding that the observation $O$ is from speaker $k$. In summary, the decision on authentication is made by computing the likelihood based on the probability density functions (pdf’s) of the feature vectors. Parameters that define these pdf’s have to be estimated a priori.

4.2.1 An Illustrative Example
In this example, we artificially generated three classes of two-dimensional data. The distributions were of Gaussian-mixture types, with three components in each class. Each token was two-dimensional. For each class, 1,500 tokens were drawn from each of the three components; therefore, there were 4,500 tokens in total. The data density distributions of the three classes are shown in Fig. 4.1 to Fig. 4.3. The contours of the distributions of classes 1, 2 and 3 are shown in Fig. 4.4, where the means are represented as +, ∗, and boxes, respectively. In real applications, the number of mixture components is unknown. Therefore, we assumed that the GMMs to be trained have two mixture components with full covariance matrices for each class. Maximum likelihood (ML) estimation was applied to train the GMMs with four iterations based on the training data drawn from the ideal models. The contours that represent the pdf’s of each of the GMMs after ML estimation are plotted in Fig. 4.5. The testing data, with 4,500 tokens for each class, were obtained using the same methods as the training data. The ML classifier provided an accuracy of 76.07% and 75.97% for the training and testing data sets, respectively. The decision boundary is shown in Fig. 4.6 as the dashed line. For comparison purposes, suppose we know the real models and therefore use the same models that generated the training data to do the testing, i.e. we use three mixtures in each model and the ML classifier. The performances are 77.19% and 77.02% for the training and testing data sets. This is the optimal performance in the sense of minimizing Bayes error.
The ideal boundary is plotted in Fig. 4.6 as the solid line. This example will be continued when we discuss discriminative training in Chapter 13.
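As a companion to the decision rule in Eq. (4.7) and the ML classifier used in this example, the following sketch scores an observation sequence against several GMMs and picks the model with the highest accumulated log-likelihood. It is a simplified illustration with scalar observations; the function names and the representation of a GMM as a list of `(weight, mean, variance)` tuples are assumptions made here, not the setup used to produce the figures.

```python
import math


def gmm_log_pdf(x, gmm):
    """log p(x | lambda) for a 1-D GMM given as [(weight, mean, variance), ...]."""
    density = sum(
        c * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
        for c, m, v in gmm
    )
    return math.log(density)


def identify(observations, models):
    """Eq. (4.7): choose the model with the largest summed log-likelihood.

    `models` maps a speaker (or class) label to its GMM; equal priors are assumed.
    Returns (best_label, scores), where scores holds one total per label.
    """
    scores = {
        label: sum(gmm_log_pdf(o, gmm) for o in observations)
        for label, gmm in models.items()
    }
    best = max(scores, key=scores.get)
    return best, scores
```

Summing log-densities instead of multiplying densities avoids numerical underflow when the observation sequence is long.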
Fig. 4.1. Class 1: a bivariate Gaussian distribution with m1 = [0 5], m2 = [−3 3], and m3 = [−5 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [1.22 0.09; 0.09 1.22], and Σ3 = [1.37 0.37; 0.27 1.37]
Fig. 4.2. Class 2: a bivariate Gaussian distribution with m1 = [2 5], m2 = [−1 3], and m3 = [0 0]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.77 1.11; 1.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41]
Fig. 4.3. Class 3: a bivariate Gaussian distribution with m1 = [−3 −1], m2 = [−2 −2], and m3 = [−5 −2]. Σ1 = [1.41 0; 0 1.41], Σ2 = [0.76 0.11; 0.11 1.09], and Σ3 = [1.41 0.04; 0.04 1.41]
Fig. 4.4. Contours of the pdf ’s of 3-mixture GMM’s: the models are used to generate 3 classes of training data.
Fig. 4.5. Contours of the pdf’s of 2-mixture GMM’s: the models are trained from ML estimation using 4 iterations.
Fig. 4.6. Enlarged decision boundaries for the ideal 3-mixture models (solid line) and 2-mixture ML models (dashed line).
4.3 Hidden Markov Model (HMM) for Non-Stationary Process A speech signal is a non-stationary signal. For many applications, such as speaker verification and speech recognition, we have to consider the temporal information of the non-stationary speech signal; therefore, a more powerful model, the HMM, is then applied to characterize both the temporal structure and the corresponding statistical variations along a sequence of feature vectors or observations of an utterance.
In speech and speaker verification, an HMM is trained to represent the acoustic pattern of a subword, a word, or a whole pass-phrase. There are many variants of HMM’s. The simplest and most popular one is an N -state, left-toright model without a state-skip as shown in Figure 4.7. This is widely used in speaker authentication. The figure shows a Markov chain with a sequence of states, representing the evolution of speech signals. Within each state, a GMM is used to characterize the observed speech feature vector as a multivariate distribution.
Fig. 4.7. Left-to-right hidden Markov model.
An HMM, denoted as $\lambda$, can be completely characterized by three sets of parameters: the state transition probabilities $A$, the observation densities $B$, and the initial state probabilities $\Pi$, as shown in the following notation:
$$\lambda = \{A, B, \Pi\} = \{a_{i,j}, b_i, \pi_i\}, \quad i, j = 1, \ldots, N, \tag{4.8}$$
where $N$ is the total number of states. Given an observation sequence $O = \{o_t\}_{t=1}^{T}$, the model parameters $\{A, B, \Pi\}$ of $\lambda$ can be trained by an iterative method to optimize a prescribed performance criterion, e.g., ML estimation. In practice, the segmental K-means algorithm [16] with ML estimation has been widely used. Following model initialization, the observation sequence is segmented into states based on the current model parameter set $\lambda$. Then, within each state, a new GMM is trained by the EM algorithm to maximize the likelihood. The new HMM $\hat{\lambda}$ is then used to re-segment the observation sequence by the Viterbi algorithm (see Chapter 6), and the model parameters are re-estimated. With the EM algorithm, the iterative procedure usually converges in a few iterations.

In addition to the ML criterion, the model can also be trained by optimizing a discriminative objective. For example, the minimum classification error (MCE) criterion [9] was proposed along with a corresponding generalized probabilistic descent (GPD) training algorithm [8, 2] to minimize an objective function that closely approximates the error rate. Other criteria, such as maximum mutual information (MMI) [1, 15], have also been attempted. Instead of modeling only the distribution of the data set of the target class, these criteria also incorporate data of other classes. A discriminative model is thus constructed to implicitly model the underlying distribution of the target class, but with an explicit emphasis on minimizing the classification error or maximizing the mutual information between the target class and others. The discriminative training algorithms have been applied successfully to speech recognition. The MCE/GPD algorithm has also been applied to speaker recognition [12, 10, 17, 18]. Generally speaking, models trained by discriminative objective functions yield better recognition and verification performance, but the long training time makes them less attractive for real applications. We will study the objectives of discriminative training in Chapter 12 and a new discriminative training algorithm for speaker recognition in Chapter 13.
4.4 Speech Segmentation
Given an HMM, $\lambda$, and a sequence of observations, $O = \{o_t\}_{t=1}^{T}$, the optimal state segmentation can be determined by evaluating the maximum joint state-observation probability, $\max_s P(O, s|\lambda)$, conventionally called maximum likelihood decoding. One popular algorithm that accomplishes this objective efficiently is the Viterbi algorithm [19, 5]. When fast decoding and forced alignment are desired, a new reduced search space algorithm [11] can be employed. Details about our detection-based decoding algorithm are presented in Chapter 6.
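A minimal sketch of Viterbi segmentation for the left-to-right model without state skips (Fig. 4.7) is given below. It works in the log domain and assumes that the path starts in the first state and ends in the last one, which requires at least as many frames as states. The function name and the argument layout (`log_b`, `log_a_stay`, `log_a_next`) are assumptions for illustration; this is the standard Viterbi recursion, not the detection-based decoder of Chapter 6.

```python
NEG_INF = float("-inf")


def viterbi_segment(log_b, log_a_stay, log_a_next):
    """Best state sequence for a left-to-right HMM without skips.

    log_b[t][s]   : log observation density of frame t in state s
    log_a_stay[s] : log a_{s,s}   (self transition)
    log_a_next[s] : log a_{s,s+1} (transition to the next state)
    Returns (best log joint probability, state index per frame).
    """
    n_frames, n_states = len(log_b), len(log_b[0])
    delta = [[NEG_INF] * n_states for _ in range(n_frames)]
    back = [[0] * n_states for _ in range(n_frames)]
    delta[0][0] = log_b[0][0]  # the path is forced to start in state 0
    for t in range(1, n_frames):
        for s in range(n_states):
            stay = delta[t - 1][s] + log_a_stay[s]
            move = delta[t - 1][s - 1] + log_a_next[s - 1] if s > 0 else NEG_INF
            if move > stay:
                delta[t][s], back[t][s] = move + log_b[t][s], s - 1
            else:
                delta[t][s], back[t][s] = stay + log_b[t][s], s
    # Backtrack from the final state to recover the segmentation.
    path = [n_states - 1]
    for t in range(n_frames - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return delta[n_frames - 1][n_states - 1], path
```

The returned state sequence gives the segmentation directly: consecutive frames with the same state index form one segment.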
4.5 Bayesian Decision Theory
In an $M$-class recognition problem, we are 1) given an observation (or a feature vector) $o$ in a $d$-dimensional Euclidean space $R^d$, and a set of classes designated as $\{C_1, C_2, \ldots, C_M\}$, and 2) asked to make a decision, for example, to classify $o$ into, say, class $C_i$, where one class can be one speaker or one acoustic unit. We denote this as an action $\alpha_i$. By Bayes’ formula, the probability of being class $C_i$ given $o$ is the posterior (or a posteriori) probability:
$$P(C_i|o) = \frac{p(o|C_i) P(C_i)}{p(o)}, \tag{4.9}$$
where $p(o|C_i)$ is the conditional probability, $P(C_i)$ is the prior probability, and
$$p(o) = \sum_{j=1}^{M} p(o|C_j) P(C_j) \tag{4.10}$$
can be viewed as a scale factor that guarantees that the posterior probabilities sum to one.
Let $L(\alpha_i|C_j)$ be the loss function describing the loss incurred for taking action $\alpha_i$ when the true class is $C_j$. The expected loss (or risk) associated with taking action $\alpha_i$ is
$$R(\alpha_i|o) = \sum_{j=1}^{M} L(\alpha_i|C_j) P(C_j|o). \tag{4.11}$$
This leads to the Bayes decision rule: to minimize the overall risk, compute the above risk for $i = 1, \ldots, M$ and then select the action $\alpha_i$ such that $R(\alpha_i|o)$ is minimum. For speaker authentication, we are interested in the zero-one loss function:
$$L(\alpha_i|C_j) = \begin{cases} 0 & i = j \\ 1 & i \ne j \end{cases} \qquad i, j = 1, \ldots, M. \tag{4.12}$$
It assigns no loss to a correct decision and a unit loss to an error, which is equivalent to counting the errors. The risk for this specific loss function is
\begin{align}
R(\alpha_i|o) &= \sum_{j=1}^{M} L(\alpha_i|C_j) P(C_j|o) \tag{4.13}\\
&= \sum_{j \ne i} P(C_j|o) = 1 - P(C_i|o). \tag{4.14}
\end{align}
Thus, to minimize the risk or error rate, we take the action $\alpha_k$ that maximizes the posterior probability $P(C_i|o)$:
$$\text{Take action } \alpha_k, \text{ where } k = \arg\max_{1 \le i \le M} P(C_i|o). \tag{4.15}$$
When the expected value of the loss function is equivalent to the error rate, it is called the minimum-error-rate classification [4]. Recalling the Bayes formula in Eq. (4.9), when the density $p(o|C_i)$ has been estimated for all classes and the prior probabilities are known, we can rewrite the above decision rule as:
$$\text{Take action } \alpha_k, \text{ where } k = \arg\max_{1 \le i \le M} p(o|C_i) P(C_i). \tag{4.16}$$
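The minimum-risk rule and its zero-one special case can be sketched in a few lines. The function names below are illustrative assumptions; `loss[i][j]` stands for $L(\alpha_i|C_j)$, and when no loss matrix is supplied the zero-one loss of Eq. (4.12) is used, so the selected action coincides with the maximum-posterior rule of Eq. (4.15).

```python
def posteriors(likelihoods, priors):
    """Eq. (4.9): posterior P(C_i | o) from p(o | C_i) and P(C_i)."""
    joint = [lik * pri for lik, pri in zip(likelihoods, priors)]
    evidence = sum(joint)  # p(o) of Eq. (4.10)
    return [j / evidence for j in joint]


def bayes_action(post, loss=None):
    """Return (index of the minimum-risk action, list of conditional risks)."""
    m = len(post)
    if loss is None:
        # Zero-one loss, Eq. (4.12).
        loss = [[0.0 if i == j else 1.0 for j in range(m)] for i in range(m)]
    # Conditional risk of each action, Eq. (4.11).
    risks = [sum(loss[i][j] * post[j] for j in range(m)) for i in range(m)]
    best = min(range(m), key=risks.__getitem__)
    return best, risks
```

With the zero-one loss, each risk returned by `bayes_action` equals one minus the corresponding posterior, as in Eq. (4.14).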
So far, we have only considered the case of a single observation (or feature vector) $o$. In speaker authentication, we always encounter or employ a sequence of observations $O = \{o_t\}_{t=1}^{T}$, where $T$ is the total number of observations. After speech segmentation (which will be discussed later), we assume that during a short time period these sequential observations are produced by the same speaker and that they belong to the same acoustic class or unit, say $C_i$. Furthermore, if we assume that the observations are independent and identically distributed (i.i.d.), the joint posterior probability $P(C_i|O)$ is merely the product of the component probabilities:
$$P(C_i|O) = \prod_{t=1}^{T} P(C_i|o_t). \tag{4.17}$$
From Eq. (4.16), the decision rule for the compound decision problem is
$$\alpha_k = \arg\max_{1 \le i \le M} \prod_{t=1}^{T} p(o_t|C_i) P(C_i). \tag{4.18}$$
In practice, the decision is usually based on the log-likelihood score:
$$\alpha_k = \arg\max_{1 \le i \le M} \sum_{t=1}^{T} \log p(o_t|C_i) P(C_i). \tag{4.19}$$
4.6 Statistical Verification
Statistical verification as applied to speaker verification and utterance verification can be considered a two-class classification problem: deciding whether a spoken utterance is from the true speaker (the target source) or from an impostor (the alternative source). Given an observation $o$, a decision $\alpha_i$ is taken based on the following conditional risks derived from Eq. (4.11):
$$R(\alpha_1|o) = L(\alpha_1|C_1) P(C_1|o) + L(\alpha_1|C_2) P(C_2|o) \tag{4.20}$$
$$R(\alpha_2|o) = L(\alpha_2|C_1) P(C_1|o) + L(\alpha_2|C_2) P(C_2|o) \tag{4.21}$$
The action $\alpha_1$ corresponds to the decision of positive verification if
$$R(\alpha_1|o) < R(\alpha_2|o). \tag{4.22}$$
Substituting (4.20) and (4.21) into (4.22) and rearranging the terms, we take action $\alpha_1$ if
$$\frac{P(C_1|o)}{P(C_2|o)} > \frac{L(\alpha_1|C_2) - L(\alpha_2|C_2)}{L(\alpha_2|C_1) - L(\alpha_1|C_1)} = T_1, \tag{4.23}$$
where $T_1 > 1$ is a prescribed threshold. Furthermore, by applying the Bayes formula, we have
$$\frac{p(o|C_1)}{p(o|C_2)} > T_1 \frac{P(C_2)}{P(C_1)} = T_2. \tag{4.24}$$
For a sequence of observations $O = \{o_t\}_{t=1}^{T}$ which are assumed to be independent and identically distributed (i.i.d.), we have the likelihood-ratio test:
$$r(O) = \frac{P(O|C_1)}{P(O|C_2)} = \frac{\prod_{t=1}^{T} p(o_t|C_1)}{\prod_{t=1}^{T} p(o_t|C_2)} > T_3. \tag{4.25}$$
The same result can also be derived from the Neyman-Pearson decision formulation, hence the name Neyman-Pearson test [14, 13, 20]. It can be shown that the likelihood-ratio test minimizes the verification error for one class while keeping the verification error for the other class constant [6, 13]. In practice, we compute a log-likelihood ratio for verification:
$$R(O) = \log P(O|C_1) - \log P(O|C_2). \tag{4.26}$$
A decision is made according to the rule:
$$\text{Acceptance: } R(O) \ge T; \qquad \text{Rejection: } R(O) < T, \tag{4.27}$$
where T is a threshold value, which can be determined theoretically or experimentally. There are two types of error in a test: false rejection, i.e., rejecting the hypothesis when it is actually true, and false acceptance, i.e., accepting it when it is actually false. The equal error rate (EER) is defined as the error rate when the operating point is so chosen as to achieve equal error probabilities for the two types of error. EER has been widely used as a verification performance indicator. In utterance verification, we assume that the expected word or subword sequence is known, and so the task is to verify whether the input spoken utterance matches it. Similarly, in speaker verification, the text of the passphrase is known. The task in speaker verification is to verify whether the input spoken utterance matches the given sequence, using the model trained by the speaker’s voice.
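The verification rule in (4.26) and (4.27) and the equal error rate can be illustrated with the sketch below. The function names are hypothetical, and the EER estimate is a simple empirical one: it scans the observed scores as candidate thresholds and reports the point where false rejection and false acceptance rates are closest, which is only an approximation when the two rates never coincide exactly.

```python
def log_likelihood_ratio(observations, log_pdf_target, log_pdf_impostor):
    """Eq. (4.26) for i.i.d. observations: sum of per-frame log ratios."""
    return sum(log_pdf_target(o) - log_pdf_impostor(o) for o in observations)


def error_rates(target_scores, impostor_scores, threshold):
    """False rejection and false acceptance rates for the rule in Eq. (4.27)."""
    false_rej = sum(s < threshold for s in target_scores) / len(target_scores)
    false_acc = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
    return false_rej, false_acc


def equal_error_rate(target_scores, impostor_scores):
    """Empirical EER: returns (error rate, threshold) at the closest crossing."""
    best = None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        false_rej, false_acc = error_rates(target_scores, impostor_scores, thr)
        gap = abs(false_rej - false_acc)
        if best is None or gap < best[0]:
            best = (gap, (false_rej + false_acc) / 2.0, thr)
    return best[1], best[2]
```

Here `target_scores` would hold the values of $R(O)$ for utterances from the true speaker and `impostor_scores` the values for impostor utterances.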
4.7 Conclusions In this chapter, we introduced the basic techniques in modeling non-stationary speech signals for speaker authentication. For speaker identification, GMM is often used as the model for classification. For speaker verification, we often use HMM as the model for recognition. The above models and methods introduced in this chapter provide the foundation for developing baseline speaker authentication systems. In the following chapters, we will introduce advanced algorithms where we still use the baseline systems as the benchmark to compare with the new algorithms. Although we will keep the GMM and HMM models, our training method will be extended from maximum likelihood to discriminative training and our decision methods will be extended from Bayesian decision to detection-based decision. This chapter completes the first part of this book and its goal of introducing the basic theory and models for pattern recognition and multivariate statistical analysis. In the following chapters, we will introduce advanced speaker authentication systems. Most of this work is from the author’s previous research in collaboration with his colleagues.
References
1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Chou, W., “Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition,” Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
3. Dempster, A. P., Laird, N. M., and Rubin, D. B., “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
4. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
5. Forney, G. D., “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, pp. 268–278, March 1973.
6. Fukunaga, K., Introduction to Statistical Pattern Recognition, Second Edition. New York: Academic Press, Inc., 1990.
7. Juang, B.-H., “Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains,” AT&T Technical Journal, vol. 64, pp. 1235–1249, July–August 1985.
8. Juang, B.-H., Chou, W., and Lee, C.-H., “Minimum classification error rate methods for speech recognition,” IEEE Trans. on Speech and Audio Process., vol. 5, pp. 257–265, May 1997.
9. Juang, B.-H. and Katagiri, S., “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
10. Korkmazskiy, F. and Juang, B.-H., “Discriminative adaptation for speaker verification,” in Proceedings of Int. Conf. on Spoken Language Processing, (Philadelphia), pp. 28–31, 1996.
11. Li, Q., “A detection approach to search-space reduction for HMM state alignment in speaker verification,” IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 569–578, July 2001.
12. Liu, C. S., Lee, C.-H., Chou, W., Juang, B.-H., and Rosenberg, A. E., “A study on minimum error discriminative training for speaker recognition,” Journal of the Acoustical Society of America, vol. 97, pp. 637–648, January 1995.
13. Neyman, J. and Pearson, E. S., “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. Roy. Soc. A, vol. 231, pp. 289–337, 1933.
14. Neyman, J. and Pearson, E. S., “On the use and interpretation of certain test criteria for purpose of statistical inference,” Biometrika, vol. 20A, pp. Pt I, 175–240; Pt II, 1928.
15. Normandin, Y., Cardin, R., and Mori, R. D., “High-performance connected digit recognition using maximum mutual information estimation,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
16. Rabiner, L. R., Wilpon, J. G., and Juang, B.-H., “A segmental k-means training procedure for connected word recognition,” AT&T Technical Journal, vol. 65, pp. 21–31, May/June 1986.
17. Rosenberg, A. E., Siohan, O., and Parthasarathy, S., “Speaker verification using minimum verification error training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 105–108, May 1998.
18. Siohan, O., Rosenberg, A. E., and Parthasarathy, S., “Speaker identification using minimum verification error training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 109–112, May 1998.
19. Viterbi, A. J., “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,” IEEE Transactions on Information Theory, vol. IT-13, pp. 260–269, April 1967.
20. Wald, A., Sequential Analysis. NY: Chapman & Hall, 1947.
21. Wu, C. F. J., “On the convergence properties of the EM algorithm,” The Annals of Statistics, vol. 11, pp. 95–103, 1983.
Chapter 5 Robust Endpoint Detection
Often the first step in speech signal processing is the use of endpoint detection to separate speech and silence signals for further processing. This topic has been studied for several decades; however, as wireless communications and VoIP phones are becoming more and more popular, more background and system noises are affecting communication channels, which poses a challenge to the existing algorithms; therefore new and robust algorithms are needed. When speaker authentication is applied to adverse acoustic environments, endpoint detection and energy normalization can be crucial to the functioning of real systems. In low signal-to-noise ratio (SNR) and non-stationary environments, conventional approaches to endpoint detection and energy normalization often fail, and speaker and speech recognition performances usually degrade dramatically. The purpose of this chapter is to address the above endpoint problem. The goal is to develop endpoint detection algorithms which are invariant to different SNR levels. For different types of applications, we developed two approaches: a real-time approach and a batch-mode approach. We focus on the real-time approach in this chapter. The batch-mode approach is available in [18]. The real-time approach uses an optimal filter plus a three-state decision diagram for endpoint detection. The filter is designed utilizing several criteria to ensure accuracy and robustness. It has almost invariant response at various background noise levels. The detected endpoints are then applied to energy normalization sequentially. Evaluation results show that the real-time algorithm significantly reduces the string error rates in low SNR situations. The error reduction rates even exceed 50% in several evaluated databases. The algorithms presented in this chapter can also be applied to speech recognition and voice communication systems, networks, and devices. They can be implemented in either hardware or software. 
This work was originally reported by the author, Zheng, Tsai, and Zhou in [18].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_5, © Springer-Verlag Berlin Heidelberg 2012
5.1 Introduction
In speaker authentication and many other voice applications, we need to process the signal in utterances consisting of speech, silence, and other background noise. The detection of the presence of speech embedded in various types of non-speech events and background noise is called endpoint detection, speech detection, or speech activity detection. In this chapter, we address endpoint detection by sequential processes to support real-time recognition (in which the recognition response is as fast as or faster than the recording of an utterance). The sequential process is often used in automatic speech recognition (ASR) [21], while the batch-mode process is often allowed in speaker recognition [17], name dialing [16], command control, and embedded systems, where utterances are usually as short as a few seconds and the delay in response is usually small. Endpoint detection has been studied for several decades. The first application was in a telephone transmission and switching system developed at Bell Labs for time assignment of communication channels [5]. The principle was to use the free channel time to interpolate additional speakers by speech activity detection. Since then, various speech-detection algorithms have been developed for ASR, speaker verification, echo cancellation, speech coding, and other applications. In general, different applications need different algorithms to meet their specific requirements in terms of computational accuracy, complexity, robustness, sensitivity, response time, etc. The approaches include those based on energy threshold (e.g., [26]), pitch detection (e.g., [8]), spectrum analysis, cepstral analysis [11], zero-crossing rate [22, 12], periodicity measure, hybrid detection [13], fusion [24], and many other methods. Furthermore, similar issues have also been studied in other research areas, such as edge detection in image processing [6, 20] and change-point detection in theoretical statistics [7, 3, 25, 15, 4].
As is well known, endpoint detection is crucial to both ASR and speaker recognition because it often affects a system’s performance in terms of accuracy and speed for several reasons. First, cepstral mean subtraction (CMS) [2, 1, 10], a popular algorithm for robust speaker and speech recognition, needs accurate endpoints to compute the mean of speech frames precisely in order to improve recognition accuracy. Second, if silence frames can be removed prior to recognition, the accumulated utterance likelihood scores will focus more on the speech portion of an utterance instead of on both noise and speech. Therefore, it has the potential to increase recognition accuracy. Third, it is hard to model noise and silence accurately in changing environments. This effect can be limited by removing background noise frames in advance. Fourth, removing non-speech frames when the number of non-speech frames is large can significantly reduce the computation time. Finally, for open speech recognition systems, such as open-microphone desktop applications and audio transcription of broadcast news, it is necessary to segment utterances from continuing audio input.
In applications of speech and speaker recognition, non-speech events and background noise complicate the endpoint detection problem considerably. For example, the endpoints of speech are often obscured by speaker-generated artifacts such as clicks, pops, and heavy breathing, or by dial tones. Long-distance telephone transmission channels also introduce similar types of artifacts and background noise. In recent years, as wireless, hands-free, and Voice over Internet Protocol (VoIP) phones have become more and more popular, endpoint detection has become even more difficult, since the signal-to-noise ratios (SNRs) of these kinds of communication devices and channels are usually lower than those of traditional landline telephones. Also, the noise in wireless and VoIP phones is more non-stationary than that in traditional telephones. The noise in today’s telecommunications may come from the background, such as car noise, room reflection, street noise, and background talking, or from communication systems, such as coding, transmission, and packet loss. In these cases, the ASR or speaker authentication performance often degrades dramatically due to unreliable endpoint detection. Another problem related to endpoint detection is real-time energy normalization. In both ASR and speaker recognition, we usually normalize the energy feature such that the largest energy level in a given utterance is close to or slightly below a constant of zero or one. This is not a problem in batch-mode processing, but it can be a crucial problem in real-time processing since it is difficult to estimate the maximal energy in an utterance with just a short-time data buffer while the acoustic environment is changing. It becomes especially hard in adverse acoustic environments. A look-ahead approach to energy normalization can be found in [8]. As we will point out later in this study, real-time energy normalization and endpoint detection are two related problems.
The more accurately we can detect endpoints, the better we can do on real-time energy normalization. We note that endpoint detection as a front module is normally used before ASR or speaker recognition. When a real-world application allows ASR for the entire utterance, the ASR decoder may provide more accurate endpoints if silence models in the ASR system are trained properly. In general, the ASR-based approach to endpoint detection will consume more computational resources and take longer than the detection-based approach described in this chapter. A good detection-based system must meet the following requirements: accurate location of detected endpoints; robust detection at various noise levels; low computational complexity; fast response time; and simple implementation. The real-time energy normalization problem is addressed together with endpoint detection. The rest of the chapter is organized as follows: In Section 5.2, we introduce a filter for endpoint detection. In Section 5.3, we present a sequential algorithm of combined endpoint detection and energy normalization for speech recognition in adverse environments and provide experimental results in large database evaluations.
5.2 A Filter for Endpoint Detection
The feature vector in ASR can be used for endpoint detection directly; however, to ensure the low-complexity requirement, we only use the one-dimensional (1-D) short-term energy in the cepstral feature to be the feature for endpoint detection:

$$ g(t) = 10 \log_{10} \sum_{j=n_t}^{n_t+I-1} o(j)^2 \qquad (5.1) $$

where
$o(j)$ is a data sample;
$t$ is a frame number;
$g(t)$ is the frame energy in dB;
$I$ is the window length;
$n_t$ is the number of the first data sample in the window.
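As a minimal sketch of (5.1), the frame energy can be computed as below. The frame shift (so that n_t = t × frame_shift) and the small floor that avoids taking the logarithm of zero on silent frames are assumptions made for illustration; neither is specified in the text.

```python
import math


def frame_energy_db(samples, frame_shift, window_length, floor=1e-10):
    """Return g(t) of (5.1) for every complete frame.

    Assumptions: frame t starts at sample n_t = t * frame_shift, and the
    summed energy is floored at `floor` so that log10 is always defined.
    """
    energies = []
    t = 0
    while t * frame_shift + window_length <= len(samples):
        n_t = t * frame_shift
        total = sum(x * x for x in samples[n_t:n_t + window_length])
        energies.append(10.0 * math.log10(max(total, floor)))
        t += 1
    return energies
```

The output has one value per frame, which is why the later filtering and decision steps run at the frame rate rather than the sampling rate.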
Because the energy is computed once per frame, the detected endpoints are aligned to the ASR feature vectors automatically, and the computation is reduced from the speech-sampling rate to the frame rate. For accurate and robust endpoint detection, we need a detector that can detect all possible endpoints from the energy feature. Since the output of the detector contains false acceptances, a decision module is then needed to make final decisions based on the detector’s output. Here, we assume that one utterance may have several speech segments separated by possible pauses. Each of the segments can be determined by detecting a pair of endpoints named segment beginning and ending points. On the energy contours of utterances, there is always a rising edge following a beginning point and a descending edge preceding an ending point. We call them beginning and ending edges, respectively, as shown in Fig. 5.4 (A). Since endpoints always come with the edges, our approach is first to detect the edges and then to find the corresponding endpoints. The foundation of the theory of the optimal edge detector was first established by Canny [6], who developed an optimal step-edge detector. Spacek [23], on the other hand, formed a performance measure combining all three quantities derived by Canny and provided the solution of the optimal filter for the step edge. Petrou and Kittler then extended the work to ramp-edge detection [20]. Since the edges corresponding to endpoints in the energy feature are closer to the ramp edge than to the ideal step edge, the author and Tsai applied Petrou and Kittler’s filter to endpoint detection for speaker verification in [17]. In summary, we need a detector that meets the following general requirements:
1) invariant outputs at various background energy levels;
2) capability of detecting both beginning and ending points;
3) short time delay or look-ahead;
4) limited response level;
5) maximum output SNR at endpoints;
6) accurate location of detected endpoints; and
7) maximum suppression of false detection.
We then need to convert the above criteria into a mathematical representation. As we have discussed, it is reasonable to assume that the beginning edge in the energy contour is a ramp edge that can be modeled by the following function:

$$ c(x) = \begin{cases} 1 - \tfrac{1}{2} e^{-sx} & \text{for } x \ge 0 \\ \tfrac{1}{2} e^{sx} & \text{for } x \le 0 \end{cases} \qquad (5.2) $$

where $x$ represents the frame number of the feature, and $s$ is some positive constant which can be adjusted for different kinds of edges, such as beginning or ending edges, and for different sampling rates. The detector is a one-dimensional filter $f(x)$ which can be operated as a moving-average filter on the energy feature. From the above requirements, the filter should have the following properties, which are similar to those in [20]:
1. It must be antisymmetric, i.e., $f(x) = -f(-x)$, and thus $f(0) = 0$. This follows from the fact that we want it to detect antisymmetric features [6], i.e., to be sensitive to both beginning and ending edges according to requirement 2), and to have a near-zero response to background noise at any level, i.e., to be invariant to background noise according to requirement 1).
2. According to requirement 3), it must be of finite extent, going smoothly to zero at its ends: $f(\pm w) = 0$, $f'(\pm w) = 0$, and $f(x) = 0$ for $|x| \ge w$, where $w$ is the half width of the filter.
3. According to requirement 4), it must have a given maximum amplitude $|k|$: $f(x_m) = k$, where $x_m$ is defined by $f'(x_m) = 0$ and $x_m$ is in the interval $(-w, 0)$.

If we represent requirements 5), 6), and 7) as $S(f(x))$, $L(f(x))$, and $C(f(x))$, respectively, the combined objective function has the following form:

$$ J = \max_{f(x)} F\{S(f(x)), L(f(x)), C(f(x))\} \qquad (5.3) $$

subject to properties 1, 2, and 3. It aims at finding the filter function $f(x)$ such that the value of the objective function $F$ is maximal subject to properties 1 to 3. Fortunately, the objective function is very similar to that of optimal edge detection in image processing, and the details of the objective function have been derived by Petrou and Kittler [20] following Canny [6] as below. Assume that the beginning or ending edge in log energy is a ramp edge as defined in (5.2), and assume that the edges are immersed in white Gaussian
noise. Following Canny’s criteria, Petrou and Kittler [20] derived the SNR for this filter $f(x)$ as being proportional to

$$ S = \frac{\int_{-w}^{0} f(x)\,(1 - e^{sx})\,dx}{\sqrt{\int_{-w}^{0} |f(x)|^2\,dx}} \qquad (5.4) $$

where $w$ is the half width of the actual filter. They consider a good locality measure to be inversely proportional to the standard deviation of the distribution of endpoints where the edge is supposed to be. It was defined as

$$ L = \frac{s^2 \int_{-w}^{0} f(x)\,e^{sx}\,dx}{\sqrt{\int_{-w}^{0} |f'(x)|^2\,dx}}. \qquad (5.5) $$

Finally, the measure for the suppression of false edges is proportional to the mean distance between the neighboring maxima of the response of the filter to white Gaussian noise,

$$ C = \frac{1}{w} \sqrt{\frac{\int_{-w}^{0} |f'(x)|^2\,dx}{\int_{-w}^{0} |f''(x)|^2\,dx}}. \qquad (5.6) $$

Therefore, the combined objective function of the filter is:

$$ J = \max_{f(x)} \{(S \cdot L \cdot C)^2\} = \max_{f(x)} \frac{s^4}{w^2} \, \frac{\left[\int_{-w}^{0} f(x)\,(1 - e^{sx})\,dx \int_{-w}^{0} f(x)\,e^{sx}\,dx\right]^2}{\int_{-w}^{0} |f(x)|^2\,dx \int_{-w}^{0} |f''(x)|^2\,dx}. \qquad (5.7) $$

After applying the method of Lagrange multipliers, the solution for the filter function is [20]:

$$ f(x) = e^{Ax}[K_1 \sin(Ax) + K_2 \cos(Ax)] + e^{-Ax}[K_3 \sin(Ax) + K_4 \cos(Ax)] + K_5 + K_6 e^{sx} \qquad (5.8) $$

where $A$ and $K_i$ are filter parameters. Since $f(x)$ is only half of the filter, when $w = W$, the actual filter coefficients are:

$$ h(i) = \{-f(-W \le i \le 0),\; f(1 \le i \le W)\} \qquad (5.9) $$

where $i$ is an integer. The filter can then be operated as a moving-average filter in:

$$ F(t) = \sum_{i=-W}^{W} h(i)\, g(t + i) \qquad (5.10) $$
where g(·) is the energy feature and t is the current frame number. An example of the designed optimal filter is shown in Fig. 5.1. Intuitively, the shape of the filter indicates that the filter must have a positive response to a beginning edge, a negative response to an ending edge, and a near-zero response to silence. Its response is basically invariant to different background noise levels since they all have near-zero responses.

Fig. 5.1. Shape of the designed optimal filter. [Plot of the filter coefficients, within about ±0.1, over frame offsets from −15 to 15.]
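The filter of (5.8) to (5.10) can be sketched as follows, using the parameter values quoted from [20] in Section 5.3.1 together with the rescaling and the h/W normalization described there. The sign convention is an assumption: (5.9) is read here so that the taps before the center are negative and those after it are positive, which gives the positive response to a beginning edge that the text describes. The function names are illustrative.

```python
import math

# Parameters quoted in Section 5.3.1 from [20] for W = 7 and s = 1.
A_REF = 0.41
K = (1.583, 1.468, -0.078, -0.036, -0.872, -0.56)


def half_filter(x, s, a):
    """f(x) of (5.8), intended for -w <= x <= 0."""
    k1, k2, k3, k4, k5, k6 = K
    return (math.exp(a * x) * (k1 * math.sin(a * x) + k2 * math.cos(a * x))
            + math.exp(-a * x) * (k3 * math.sin(a * x) + k4 * math.cos(a * x))
            + k5 + k6 * math.exp(s * x))


def edge_filter(W=13):
    """Return h(-W), ..., h(W) as a list of 2W + 1 taps.

    Assumed reading of (5.9): h(i) = f(i) / W for i < 0, h(0) = 0, and
    h(i) = -h(-i) for i > 0, so a rising edge gives a positive output.
    """
    s = 7.0 / W          # rescaling from Section 5.3.1
    a = A_REF * s
    left = [half_filter(float(i), s, a) / W for i in range(-W, 0)]
    right = [-v for v in reversed(left)]
    return left + [0.0] + right


def filter_output(g, h, t):
    """F(t) of (5.10); requires W <= t < len(g) - W."""
    W = (len(h) - 1) // 2
    return sum(h[i + W] * g[t + i] for i in range(-W, W + 1))
```

Because the taps are antisymmetric they sum to zero, so a constant background energy of any level produces a near-zero output, which is the invariance property discussed above.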
5.3 Real-Time Endpoint Detection and Energy Normalization
The approach of using endpoint detection for real-time ASR or speaker authentication is illustrated in Fig. 5.2 [19]. We use an optimal filter, as discussed in the last section, to detect all possible endpoints, followed by a three-state logic as a decision module to decide real endpoints. The information of detected endpoints is also utilized for real-time energy normalization. Finally, all silence frames are removed and only the feature vectors including cepstrum and the normalized energy corresponding to speech frames are sent to the recognizer.

Fig. 5.2. Endpoint detection and energy normalization for real-time ASR. [Block diagram with the blocks Optimal Filter, Decision Logic, Energy Norm., and Silence Removal; the signals are labeled Energy, Endpoints, Cepstrum, and ASR Feature.]
5.3.1 A Filter for Both Beginning- and Ending-Edge Detection
After evaluating the shapes of both beginning and ending edges, we choose the filter size to be $W = 13$ to meet requirements 2) and 3). For $W = 7$ and $s = 1$, the filter parameters have been provided in [20] as: $A = 0.41$ and $[K_1 \ldots K_6] = [1.583, 1.468, -0.078, -0.036, -0.872, -0.56]$. For $W = 13$ in our application, we just need to rescale: $s = 7/W = 0.5385$ and $A = 0.41s = 0.2208$, while the $K_i$’s are as above. The shape of the designed filter is shown in Fig. 5.1 with a simple normalization, $h/13$. For real-time detection, let $H(i) = h(i - 13)$; then the filter has 25 points in total with a 24-frame look-ahead since both $H(1)$ and $H(25)$ are zeros. The filter operates as a moving-average filter:

$$ F(t) = \sum_{i=2}^{24} H(i)\, g(t + i - 2) \qquad (5.11) $$

where $g(\cdot)$ is the energy feature and $t$ is the current frame number. The output $F(t)$ is then evaluated in a 3-state transition diagram for final endpoint decisions.

5.3.2 Decision Diagram
Endpoint decisions need to be made by comparing the value of $F(t)$ with some pre-determined thresholds. Due to the sequential nature of the detector and the complexity of the decision procedure, we use a 3-state transition diagram to make final decisions. As shown in Fig. 5.3, the three states are: silence, in-speech, and leaving-speech. Either the silence or the in-speech state can be a starting state, and any state can be a final state. In the following discussion, we assume that the silence state is the starting state. The input is $F(t)$, and the output is the detected frame numbers of beginning and ending points. The transition conditions are labeled on the edges between states, and the actions are listed in parentheses. “Count” is a frame counter, $T_U$ and $T_L$ are two thresholds with $T_U > T_L$, and “Gap” is an integer indicating the required number of frames from a detected endpoint to the actual end of speech.

Fig. 5.3. State transition diagram for endpoint decision. [Transition labels recoverable from the diagram: Silence stays while $F < T_U$; Silence to In Speech on $F > T_U$ (output a beginning point); In Speech stays while $F > T_L$; In Speech to Leaving Speech on $F < T_L$ (Count = 0); Leaving Speech stays while $T_L < F < T_U$ and Count < Gap (Count++); Leaving Speech to In Speech on $F > T_U$; Leaving Speech to Silence on $T_L < F < T_U$ and Count > Gap (output an ending point).]

We use Fig. 5.4 as an example to illustrate the state transition. The energy for a spoken digit “4” is plotted in Fig. 5.4 (A) and the filter output is shown in Fig. 5.4 (B). The state diagram stays in the silence state until $F(t)$ reaches point A in Fig. 5.4 (B), where $F(t) \ge T_U$ means that a beginning point is detected. The actions are to output a beginning point (corresponding to the left vertical solid line in Fig. 5.4 (A)) and to move to the in-speech state. It stays in the in-speech state until reaching point B in Fig. 5.4 (B), where $F(t) < T_L$. The diagram then moves to the leaving-speech state and sets Count = 0. The counter resets several times until reaching point B’. At point C, Count = Gap = 30. An actual endpoint is detected as the left vertical dashed line in Fig. 5.4 (B). The diagram then moves back to the silence state. During the stay in the leaving-speech state, if $F(t) > T_U$, this means that a beginning edge is coming, and we should move back to the in-speech state. The 30-frame gap corresponds to the period of descending energy before reaching a real ending point. We note that the thresholds, such as $T_U$ and $T_L$, are set on the filter outputs instead of on absolute energy. Since the filter output is stable across noise levels, the detected endpoints are more reliable. The constants Gap, $T_U$, and $T_L$ can be determined empirically by plotting several utterances and the corresponding filter outputs. As we will show in the database evaluation, the algorithm is not very sensitive to the values of $T_U$ and $T_L$, since the same values were used in different databases. Also, in some applications, two separate filters can be designed for beginning and ending point detection. The size of the beginning filter can be smaller than 25 points while the ending filter can be larger than 25 points. This approach may further improve accuracy; however, it will have a longer delay and use more computation. The 25-point filter used in this section was designed for both beginning and ending point detection at an 8 kHz sampling rate. Also, in the case that an utterance starts with an unvoiced phoneme, it is practical to back up about ten frames from the detected beginning points.

5.3.3 Real-Time Energy Normalization
Suppose that the maximal energy value in an utterance is $g_{\max}$. The purpose of energy normalization is to normalize the utterance energy $g(t)$, such that the
largest value of energy is close to zero by performing $\tilde{g}(t) = g(t) - g_{\max}$. In real-time mode, we have to estimate the maximal energy $g_{\max}$ sequentially while the data are being collected. Here, the estimated maximum energy becomes a variable and is denoted as $\hat{g}_{\max}(t)$. Nevertheless, we can use the detected endpoints to obtain a better estimate. We first initialize the maximal energy to a constant $g_0$, which is selected empirically, and use it for normalization until we detect the first beginning point at $M$ as in Fig. 5.4, i.e., $\hat{g}_{\max}(t) = g_0, \forall t < M$. If the average energy

$$ \bar{g}(t) = E\{g(t);\; M \le t \le M + 2W\} \ge g_m \qquad (5.12) $$

where $g_m$ is a pre-selected threshold to ensure that the new $\hat{g}_{\max}$ is not from a single click, we then estimate the maximal energy as:

$$ \hat{g}_{\max}(t) = \max\{g(t);\; M \le t \le M + 2W\} \qquad (5.13) $$

where $2W + 1 = 25$ is the length of the filter and $2W$ the length of the look-ahead window. At point $M$, the look-ahead window is from $M$ to $N$ as shown in Fig. 5.4. From now on, we update $\hat{g}_{\max}(t)$ as:

$$ \hat{g}_{\max}(t) = \max\{g(t + 2W),\; \hat{g}_{\max}(t - 1);\; \forall t > M\}. \qquad (5.14) $$

Parameter $g_0$ may need to be adjusted for different systems. For example, the value of $g_0$ could be different between telephone and desktop systems. Parameter $g_m$ is relatively easy to determine.

Fig. 5.4. Example: (A) Energy contour of digit “4”. (B) Filter outputs and state transitions. [Panel (A) marks the beginning point and beginning edge, the ending point and ending edge, and frames M and N; panel (B) marks the thresholds $T_U$ and $T_L$, the silence, in-speech, and leaving-speech states, and points A, B, B’, and C.]
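The sequential estimate in (5.12) to (5.14) can be sketched as below. The default values of g0 and gm are the ones reported later in the telephone database evaluation. The handling of the last frames of an utterance, where the look-ahead frame does not exist, and the behavior when the test in (5.12) fails are not specified in the text, so the choices here (clamping the look-ahead index and keeping the current estimate) are assumptions.

```python
def normalize_energy(g, begin_frame, lookahead=24, g0=80.0, gm=60.0):
    """Return g(t) - g_max_hat(t) for every frame, computed sequentially.

    g           frame energies in dB
    begin_frame M, the first detected beginning point (None if none found)
    lookahead   2W, the look-ahead length in frames
    """
    normalized = []
    g_max = g0                                   # estimate used for t < M
    for t in range(len(g)):
        if begin_frame is not None and t == begin_frame:
            window = g[t:t + lookahead + 1]      # M <= t <= M + 2W
            if sum(window) / len(window) >= gm:  # (5.12): not a single click
                g_max = max(window)              # (5.13)
        elif begin_frame is not None and t > begin_frame:
            # Assumption: clamp the look-ahead index at the last frame.
            ahead = min(t + lookahead, len(g) - 1)
            g_max = max(g[ahead], g_max)         # (5.14)
        normalized.append(g[t] - g_max)
    return normalized
```

In a streaming system the same update would run inside the frame loop, with the beginning point supplied by the endpoint detector as soon as it is found.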
For the example in Fig. 5.5, the energy features of two utterances with 20 dB SNR (bottom) and 5 dB SNR (top) are plotted in Fig. 5.5 (A). The 5 dB utterance is generated by artificially adding car noise to the 20 dB one. The filter outputs are shown in Fig. 5.5 (B) for 20 dB (solid line) and 5 dB (dashed line) SNRs, respectively. The detected endpoints and normalized energy for 20 and 5 dB SNRs are plotted in Fig. 5.5 (C) and Fig. 5.5 (D), respectively. We note that the filter outputs for 20 and 5 dB cases are almost invariant around TL and TU , although their background energy levels have a difference of 15 dB. This ensures the robustness in endpoint detection. We also note that the normalized energy profiles are almost the same as the originals, although the normalization is done in real-time mode.
Fig. 5.5. (A) Energy contours of “4-327-631-Z214” from original utterance (bottom, 20 dB SNR) and after adding car noise (top, 5 dB SNR). (B) Filter outputs for 5 dB (dashed line) and 20 dB (solid line) SNR cases. (C) Detected endpoints and normalized energy for the 20 dB SNR case, and (D) for the 5 dB SNR case. [Four panels plotted against frame number; the thresholds TU and TL are marked in panel (B).]
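The three-state decision logic of Section 5.3.2 can be sketched as a small state machine over the filter output. The default thresholds and gap are the values reported in the telephone database evaluation. Two details are readings of the text rather than confirmed facts: the counter is reset whenever F(t) falls below TL again in the leaving-speech state (following "the counter resets several times"), and the ending point is reported at the frame where Count reaches Gap.

```python
SILENCE, IN_SPEECH, LEAVING_SPEECH = "silence", "in-speech", "leaving-speech"


def detect_endpoints(filter_outputs, t_upper=3.6, t_lower=-3.0, gap=30):
    """Return a list of ("begin", frame) and ("end", frame) events.

    Assumption: the silence state is the starting state.
    """
    state = SILENCE
    count = 0
    events = []
    for t, value in enumerate(filter_outputs):
        if state == SILENCE:
            if value >= t_upper:              # beginning edge detected
                events.append(("begin", t))
                state = IN_SPEECH
        elif state == IN_SPEECH:
            if value < t_lower:               # ending edge starts
                state = LEAVING_SPEECH
                count = 0
        else:                                 # LEAVING_SPEECH
            if value >= t_upper:              # a new beginning edge arrives
                state = IN_SPEECH
            elif value < t_lower:             # still descending: restart count
                count = 0
            else:
                count += 1
                if count >= gap:              # quiet for `gap` frames
                    events.append(("end", t))
                    state = SILENCE
    return events
```

Backing up about ten frames from each reported beginning point, as suggested for utterances that start with an unvoiced phoneme, could be applied to the returned events afterwards.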
5.3.4 Database Evaluation
The introduced real-time algorithm was compared with a baseline endpoint detection algorithm on one noisy database and several telephone databases.
Baseline Endpoint Detection
The baseline system is a real-time, energy-contour-based adaptive detector developed from the algorithm introduced in [21, 26]. It is used in research and commercial speech recognizers. In the baseline system, a 6-state decision diagram is used to detect endpoints. The states are named initializing, silence, rising, energy, fell-rising, and fell. In total, eight counters and 24 hard-limit thresholds are used for the decisions of state transition. Two adaptive threshold values were used in most of the thresholds. We note that all the thresholds are compared with raw energy values directly. Energy normalization in the baseline system is done separately by estimating the maximal and minimal energy values, then comparing their difference to a fixed threshold for decision. Since the energy values change with acoustic environments, the baseline approach leads to unreliable endpoint detection and energy normalization, especially in low SNR and non-stationary environments.

Noisy Database Evaluation
In this experiment, a database was first recorded from a desktop computer at a 16 kHz sampling rate, then down-sampled to an 8 kHz sampling rate. Later, car and other background noises were artificially added to the original database at the SNR levels of 5, 10, 15, and 20 dB. The original database has 39 utterances and 1738 digits in total. Each utterance has 3, 7, or 11 digits. Linear predictive coding (LPC) features and the short-term energy were used, and the hidden Markov model (HMM) in a head-body-tail (HBT) structure was employed to model each of the digits [9, 14]. The HBT structure assumes that context-dependent digit models can be built by concatenating a left-context-dependent unit (head) with a context-independent unit (body) followed by a right-context-dependent unit (tail). We used three HMM states to represent each “head” and “tail”, and four HMM states to represent each “body”.
Sixteen mixtures were used for each body state, and four mixtures were used for each head or tail state. The real-time recognition performances at various SNRs are shown in Fig. 5.6. Compared to the baseline algorithm, the introduced real-time algorithm significantly reduced word error rates. The baseline algorithm failed to work in low SNR cases because it uses raw energy values directly to detect endpoints and to perform energy normalization. The real-time algorithm makes decisions on the filter output instead of raw energy values; therefore, it provided more robust results. An example of error analysis is shown in Fig. 5.7.

Fig. 5.6. Comparisons on real-time connected digit recognition with various signal-to-noise ratios (SNRs). From 5 to 20 dB SNR, the introduced real-time algorithm provided word error rate reductions of 90.2%, 93.4%, 57.1%, and 57.1%, respectively. [Bar chart of word error rate (%) versus SNR; baseline / proposed: 5 dB, 56.3 / 5.5; 10 dB, 22.6 / 1.5; 15 dB, 3.5 / 1.5; 20 dB, 3.5 / 1.5.]

Telephone Database Evaluation
The introduced real-time algorithm was further evaluated on 11 databases collected from telephone networks at an 8 kHz sampling rate in various acoustic environments. LPC parameters and short-term energy were used. The acoustic model consists of one silence model, 41 mono-phone models, and 275 head-body-tail units for digit recognition. It has a total of 79 phoneme symbols, 33 of which are for digit units. Eleven databases, DB1 to DB11, were used for the evaluation. DB1 to DB5 contain digit, alphabet, and word strings. Finite-state grammars were used to specify the valid forms of recognized strings. DB6 to DB11 contain pure digit strings. In all the evaluations, both endpoint detection and energy normalization were performed in real-time mode, and only the detected speech portions of an utterance were sent to the recognition back-end. In the real-time endpoint-detection system, we set the parameters as g0 = 80.0, gm = 60.0, TU = 3.6, TL = −3.0, and Gap = 30. These parameters were unchanged throughout the evaluation in all 11 databases to show the robustness of the algorithm, although the parameters can be adjusted according to signal conditions in different applications. The evaluation results are listed in Table 5.1. They show that the real-time algorithm works very well on regular telephone data as well. It provides word-error reduction in most of the databases. The word-error reductions reach 30% or more in DB2, DB6, and DB9. To analyze the improvement, the original energy feature of an utterance, “1 Z 4 O 5 8 2”, in DB6 is plotted in Fig. 5.7 (A). The detected endpoints and normalized energy using the conventional approach are shown in Fig. 5.7 (B) while the results of the real-time algorithm are shown in Fig. 5.7 (C). The filter output is plotted in Fig. 5.7 (D). From Fig. 5.7 (B), we can observe that
88
5 Robust Endpoint Detection Table 5.1. Database Evaluation Results (%) Database IDs (Number of strings, Number of words) DB1 (232, 1393) DB2 (671, 1341) DB3 (1957,1957) DB4 (272, 1379) DB5 (259, 2632) DB6 (576, 1738) DB7 (583, 1743) DB8 (664, 2087) DB9 (619, 8194) DB10 (651, 8452) DB11 (707, 9426)
Word Error Rate Word BaseProError line posed Reduction 13.7 11.8 13.9 14.6 7.9 45.9 4.5 4.4 2.2 10.0 9.6 4.0 15.8 15.7 0.6 2.8 1.1 60.7 1.7 1.5 11.8 0.9 0.7 22.2 1.0 0.7 30.0 5.7 5.6 1.8 1.6 1.4 12.5
the normalized maximal energy of the conventional approach is about 10 dB below zero, which causes a wrong recognition result: “1 Z 4 O 5 8”. On the other hand, the introduced algorithm normalized the maximal energy to zero (approximately), and the utterance was recognized correctly as “1 Z 4 O 5 8 2”.
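The reduction column of Table 5.1 appears to be the relative reduction of the word error rate, (baseline − proposed) / baseline. This is inferred from the listed numbers rather than stated in the text, so the following is only a minimal sketch of that reading; the function name is our own.

```python
def word_error_reduction(baseline_wer, proposed_wer):
    """Relative word error reduction in percent (assumed definition)."""
    return 100.0 * (baseline_wer - proposed_wer) / baseline_wer


# (baseline WER %, proposed WER %) copied from three rows of Table 5.1
rows = {"DB2": (14.6, 7.9), "DB6": (2.8, 1.1), "DB9": (1.0, 0.7)}
for name, (base, prop) in rows.items():
    print(name, round(word_error_reduction(base, prop), 1))
```

Under this reading, the three rows reproduce the tabulated reductions of 45.9, 60.7, and 30.0.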
Fig. 5.7. (A) Energy contour of the 523rd utterance in DB6: “1 Z 4 O 5 8 2”. (B) Endpoints and normalized energy from the baseline system. The utterance was recognized as “1 Z 4 O 5 8”. (C) Endpoints and normalized energy from the real-time endpoint-detection system. The utterance was recognized correctly as “1 Z 4 O 5 8 2”. (D) The filter output.

5.4 Conclusions

In this chapter, we presented a robust, real-time endpoint detection algorithm. In the algorithm, a filter with a 24-frame look-ahead detects all possible endpoints. A three-state transition diagram then evaluates the output from the filter for final decisions. The detected endpoints are then applied to real-time energy normalization. Since the entire algorithm only uses a 1-D energy feature, it has low complexity and is very fast in computation. The evaluation on a noisy database has shown significant word error reduction, over 50% in all 5 to 20 dB SNR situations. The evaluations on telephone databases have shown reductions of 30% or more in 3 of the 11 databases. The real-time algorithm has been implemented in real-time ASR systems. Its contributions are to improve not only the recognition accuracy but also the robustness of the entire system in low signal-to-noise ratio environments. The presented algorithm can be applied to both speaker recognition and speech recognition.

Since the presented endpoint detection algorithm is fast and has very low computational complexity, it can be used in communication and speech processing applications directly. For example, it can be implemented in embedded systems, such as wireless phones or portable devices, to save cost and speed up processing. Another example is a web or computer server supporting multiple users, such as a speaker verification server for millions of users. Such a server normally requires low computational complexity to reduce cost and increase response speed. For these cases, a solution is to use the above endpoint detector to remove all silence; therefore, we can reduce the number of frames for decoding or recognition significantly.
Chapter 6 Detection-Based Decoder
Decoding or searching is an important task in both speaker and speech recognition. In speaker verification (SV), given a spoken password and a speaker-dependent hidden Markov model (HMM), the task of decoding or searching is to find optimal state alignments in the sense of maximum likelihood score of the entire utterance. Currently, the most popular decoding algorithm is the Viterbi algorithm with a pre-defined beam width to reduce the search space; however, it is difficult to determine a suitable beam width beforehand. A small beam width may miss the optimal path while a large one may slow down the process. To address the problem, the author has developed a non-heuristic algorithm to reduce the search space [12, 14]. The details are presented in this chapter.

Following the definition of the left-to-right HMM, we first detect the possible change-points between HMM states in a forward-and-backward scheme, then use the change-points to enclose a subspace for searching. The Viterbi algorithm or any other search algorithm can then be applied to the subspace to find the optimal state alignment. In SV tasks, compared to a full-search algorithm, the proposed algorithm is about four times faster, while the accuracy is still slightly better; compared to the beam-search algorithm, the search-space reduction algorithm provides better accuracy with even lower complexity. In short, for an HMM with S states, the computational complexity can be reduced by up to a factor of S/3 with slightly better accuracy than in a full-search approach.

The applications of the search-space reduction algorithm are significant. As wireless phones and portable devices become more and more popular, SV is needed for portable devices where the computational resources and power supply are limited. Simply stated, a fast algorithm equates to longer battery life.
On the other hand, a network or web server may need to support millions of users of telephone lines, wireless channels, and network computers when using SV for access control. In both cases, a fast decoding algorithm is necessary to achieve robust, accurate, and fast speaker authentication while at the same time using limited computational resources and power in order to minimize system and hardware cost and extend the battery life.

Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_6, © Springer-Verlag Berlin Heidelberg 2012
6.1 Introduction

The hidden Markov model (HMM) has been widely used in speech and speaker recognition where the non-stationary speech signal is represented as a sequence of states. In automatic speech recognition (ASR), given an utterance and a set of HMM’s, a decoding algorithm is needed to search for the optimal state and word path, such that the overall likelihood score of the utterance is maximum. In speaker verification (SV), given a spoken password and a speaker-dependent HMM, the task is to find the optimal state alignment in the sense of maximum likelihood. This is called HMM state alignment; we refer to it simply as alignment in this chapter.

As the technology of SV is ready for use in real applications, a fast and accurate alignment algorithm with low complexity is needed to support both large-scale and portable applications. For example, a portable device, such as a wireless phone handset or a smart phone, usually has limited computational resources and power. A fast algorithm with low complexity will allow SV to be implemented in such a device at lower cost and lower power consumption. On the other hand, a telephone or web server for SV may need to support millions of users. A fast algorithm will allow the same hardware to support more telephone lines and reduce the cost per service. Our research on alignment can also benefit the similar decoding problem in ASR.

Generally speaking, there are two basic requirements for an alignment algorithm: accuracy and speed. In the last few decades, several search algorithms have been developed based on dynamic programming [4] or heuristic search to meet these requirements, such as the Viterbi algorithm [26], stack decoders [1], multi-pass search (e.g. [6, 20]), forward-backward search [6, 20], state-detection search [13, 14], etc., which have been applied to both speech and speaker recognition (e.g. [11, 19, 8]).
Fortunately, in ASR, language information can be applied to prune the search path and to reduce the search space, such as language model pruning (or word-end pruning), language model look-ahead, etc. (See [18] and [19] for a survey.) However, in alignment, since the whole model is just one word or one phrase, no word or language information can be applied to pruning. The current technique for alignment is the Viterbi algorithm with a state-level beam search.

The Viterbi algorithm is optimal in the sense of maximum likelihood [26, 9]; therefore, it meets the first requirement for accuracy. However, a full Viterbi search is impractical due to the large search space. There are two major approaches to address the search speed problem. One approach changes the optimal algorithm to a near-optimal one in order to increase the alignment speed (e.g. [13]), but it may lose some accuracy. Another approach keeps the optimal alignment algorithm while trying to reduce the search space. The most popular approach is the beam-search algorithm (e.g. [17, 19, 8]) applied to the state level. It reduces the search space by pruning the search paths with low likelihood scores using a pre-determined beam width. Obviously, it improves alignment speed due to the search-space reduction, but it is difficult to determine the beam width beforehand. When the beam width is too large, alignment can provide better accuracy, but it is slower; when the beam width is too small, alignment is faster, but it may give poor accuracy. Therefore, we present a search-space reduction algorithm which can detect a subspace from the constraints of the left-to-right HMM states without using a beam width for alignment.

As is well known, the HMM is a parametric, statistical model with a set of states characterizing the evolution of a non-stationary process in speech through a set of short-time stationary events. Within each state, the distribution of the stochastic process is usually modeled by Gaussian mixture models (GMM). Given a sequence of observations, it moves from one state to another sequentially. Between every pair of connected states, there is a change-point. The purpose of the search-space reduction algorithm is to detect the possible change-points in a forward-and-backward scheme, then use the change-points to enclose a subspace for searching. In the case that an utterance matches the HMM’s, the algorithm would not miss the optimal path; in an impostor case, the algorithm limits the search space and therefore has the potential to decrease the impostor’s likelihood scores. Once a subspace is detected, a search algorithm, such as the Viterbi algorithm, can be applied to find the optimal path in the subspace.

The problem of detecting a change in the characteristics of stochastic processes, random sequences, and fields is referred to as the change-point problem [5].
There are two kinds of approaches to the problem: parametric methods and nonparametric methods. The parametric method is based on full a priori information, i.e. the probabilistic model. The model is constructed from training data where the change points (e.g. state segments) are given to guarantee statistical homogeneity. If the segmentation is not available, the nonparametric method can be applied [5] for preliminary change-point detection in each interval of homogeneity; then, a parametric method might be applied for more accurate change-point detection. In our application, since the HMM has been trained with a priori information, we only consider the parametric method of change-point detection.

In order to detect change-points quickly and reliably, we need a sequential detection approach. Sequential testing was first studied by Wald [27] and is known as the sequential probability ratio test (SPRT). The SPRT was designed to decide between two simple hypotheses sequentially. Using the SPRT to detect the change-points in distributions was first proposed by Page for memoryless processes [21, 22]. Its asymptotic properties were studied by Lorden [16]. The general form of the test was proposed by Bansal [2], and Bansal and Papantoni-Kazakos [3]. They also studied its asymptotic properties for stationary and ergodic processes under some general regularity conditions.
It has been proven that the SPRT is asymptotically optimal in the sense that it requires the minimum expected sample size for a decision, subject to a false-alarm constraint [16, 3, 10]; however, the Page algorithm needs a pre-determined threshold value for a decision. This is not critical if only one change-point between two density functions needs to be determined, but, for the alignment, we have to detect the changes between many different density functions, and the threshold values are usually not available. Our solution is to extend the Page algorithm to a general test that removes the need for a specific likelihood threshold [12].

The rest of the chapter is organized as follows: In Section 6.2, we discuss how to extend the Page algorithm to a sequential test which can detect the change-points in data distributions without using likelihood thresholds. In Section 6.3, we apply the change-point detection algorithm to HMM state detection. In Section 6.4, we present the search-space reduction algorithm, then we compare it to the Viterbi beam-search algorithms on an SV database in Section 6.5.
6.2 Change-Point Detection

Let $o_t$ denote an observation of the d-dimensional feature vector at time t, and $p_1(o_t)$ and $p_2(o_t)$ be the d-dimensional density functions of two well-known, distinct, discrete-time, and mutually-independent stochastic processes, i.e. the two stochastic processes are not the same, and their statistical distributions are known, respectively. (See [23] for detailed definitions.) Given a sequence of observation vectors, $O = \{o_t;\ t \ge 1\}$, and the density functions, $p_1(o_t)$ and $p_2(o_t)$, the objective is to detect a possible $p_1$ to $p_2$ change as reliably and quickly as possible. Since the change can happen at any time, Page proposed a sequential detection scheme [21, 22, 10] as follows: Given a pre-determined threshold $\delta > 0$, observe data sequentially, and decide that the change from $p_1$ to $p_2$ has occurred at the first time t if

\[ T(o_t) = \sum_{i=1}^{t} R(o_i) - \min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i) \ge \delta, \tag{6.1} \]

where

\[ R(o_i) = \log \frac{p_2(o_i)}{p_1(o_i)}. \tag{6.2} \]

When $T(o_t) \ge \delta$, the endpoint of $p_1$ is

\[ \ell = \arg\min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i). \tag{6.3} \]
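As an illustration, the test in (6.1) to (6.3) can be written with a running sum and a running minimum, so that each new observation costs a constant amount of work. The sketch below is not taken from the cited papers; it assumes the log-likelihood ratios R(o_i) have already been computed, uses 1-based frame numbers, and breaks ties in the minimum in favor of the earliest frame. The function name is hypothetical.

```python
def page_test(llr, delta):
    """Page's sequential test on a sequence of log-likelihood ratios R(o_i).

    Returns (t, ell), both 1-based: t is the first frame at which
    T(o_t) >= delta, and ell is the estimated last frame of p1.
    Returns None if no change is declared.
    """
    cum = 0.0        # running sum of R(o_i) for i = 1..t
    min_cum = None   # smallest running sum seen so far
    ell = None       # frame at which that minimum occurred
    for t, r in enumerate(llr, start=1):
        cum += r
        if min_cum is None or cum < min_cum:
            min_cum, ell = cum, t
        if cum - min_cum >= delta:
            return t, ell
    return None


# Three frames that favor p1 followed by three frames that favor p2.
print(page_test([-1.0, -1.0, -1.0, 2.0, 2.0, 2.0], delta=3.0))
```

In this toy sequence the minimum of the running sum is reached at frame 3, and the statistic first reaches the threshold at frame 5, so the call returns (5, 3).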
It is straightforward to implement the above test in a recursive form. As pointed out by Page [21], the above test breaks up into a repeated Wald sequential test [27] with boundaries at (0, δ) and a zero initial score. It is asymptotically optimal in the sense that it requires the minimum expected sample size for decisions, subject to a false alarm constraint. The related theorems and proofs can be found in [16, 3, 10]. However, in many applications, it is impractical to determine the threshold value δ. For example, in speech segmentation, we may have over one thousand subword HMM’s and each HMM has several states. Due to different speakers and different spoken contents, it is impractical to pre-determine all of the threshold values for every possible combination of connected states or every possible speaker.

To apply the sequential scheme to speech applications, we modify the above detection scheme as follows [13]: Select a time threshold $t_\delta > 0$. Observe data sequentially, and decide that the change from $p_1$ to $p_2$ occurs if

\[ t - \ell \ge t_\delta, \tag{6.4} \]

and

\[ T(o_t) = \sum_{i=1}^{t} R(o_i) - \min_{1 \le k \le t} \sum_{i=1}^{k} R(o_i) > \varepsilon, \tag{6.5} \]

where $\varepsilon \ge 0$ is a small number or can be just zero, and $R(o_i)$ is defined as in (6.2). The endpoint $\ell$ of $p_1$ can be calculated by (6.3). Here, we assume that the duration of $p_2$ is not less than $t_\delta$. Fig. 6.1 (a) illustrates the scheme, where $t_\delta$ as in (6.4) is a time threshold representing a time duration, and δ as in (6.1) represents a threshold value of the accumulated log likelihood ratio. It is much easier to determine $t_\delta$ than δ in speech and speaker recognition. A common $t_\delta$ can be applied to different HMM’s and different states. Generally speaking, a larger $t_\delta$ can give a more reliable change-point with less false acceptance, but it may increase false rejection, delay the decision, and cost more in computation. To avoid false rejection, we can just let $t_\delta = 1$ as discussed in the next section.
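The effect of the time threshold can be seen in a small sketch of the modified test in (6.4) and (6.5). As before, this is only an illustration under our own assumptions: the log-likelihood ratios are given as a list, frames are numbered from 1, ties in the minimum go to the earliest frame, and the function name is hypothetical.

```python
def detect_change(llr, t_delta=1, eps=0.0):
    """Modified test: declare a change when t - ell >= t_delta and T(o_t) > eps.

    Returns (t, ell) with 1-based frame numbers, or None if nothing is detected.
    """
    cum = 0.0        # running sum of R(o_i)
    min_cum = None   # running minimum of that sum
    ell = None       # candidate endpoint of p1, as in (6.3)
    for t, r in enumerate(llr, start=1):
        cum += r
        if min_cum is None or cum < min_cum:
            min_cum, ell = cum, t
        if t - ell >= t_delta and cum - min_cum > eps:
            return t, ell
    return None


# One isolated positive ratio at frame 3, then a real change after frame 5.
llr = [-1.0, -1.0, 0.5, -1.0, -1.0, 2.0, 2.0, 2.0]
print(detect_change(llr, t_delta=1))  # reacts to the isolated frame
print(detect_change(llr, t_delta=2))  # waits for two frames of evidence
```

With t_delta = 1 the isolated positive ratio at frame 3 already triggers a decision, returning (3, 2), whereas t_delta = 2 returns (7, 5): a more reliable change-point obtained at the price of a later decision, which is the trade-off described above.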
6.3 HMM State Change-Point Detection

We have introduced the scheme of detecting the change-point between two stochastic processes. Now, we can apply the scheme to HMM state change-point detection. Since the left-to-right HMM is the most popular HMM in speech and speaker recognition, we focus our discussions on it. Nevertheless, the change-point detection algorithm can also be extended to other HMM configurations.

Fig. 6.1. The scheme of the change-point detection algorithm with tδ = 2: (a) the endpoint detection for state 1; (b) the endpoint detection for state 2; and (c) the grid points involved in p1, p2 and p3 computations (dots).

Fig. 6.2. Left-to-right hidden Markov model.

A left-to-right HMM without state skip is shown in Fig. 6.2. It is a Markov chain with a sequence of states which characterizes the evolution of a non-stationary process in speech through a set of short-time stationary events (states). Within each state, the probability density functions (pdf’s) of speech data are modeled by Gaussian mixtures. An HMM can be completely characterized by a matrix of state-transition probabilities, $A = \{a_{i,j}\}$; observation densities, $B = \{b_j\}$; and initial state probabilities, $\Pi = \{\pi_i\}$, as

\[ \lambda = \{A, B, \Pi\} = \{a_{i,j}, b_j, \pi_i;\ i, j = 1, \ldots, S\}, \tag{6.6} \]
where S is the total number of states. Given an observation vector $o_t$, the continuous observation density for state j is

\[ b_j(o_t) = \sum_{m=1}^{M} c_{jm}\, N(o_t, \mu_{jm}, \Sigma_{jm}), \tag{6.7} \]
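For reference, the state density in (6.7) is usually evaluated in the log domain. The sketch below makes two assumptions that are not stated in the text: the covariance matrices are diagonal (each is passed as a list of variances), and the mixture sum is computed with the log-sum-exp trick to avoid underflow. The function names are our own.

```python
import math


def log_gaussian_diag(o, mean, var):
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(o, mean, var)
    )


def log_state_density(o, weights, means, variances):
    """log b_j(o) for one state, following the mixture form of (6.7)."""
    terms = [
        math.log(c) + log_gaussian_diag(o, m, v)
        for c, m, v in zip(weights, means, variances)
    ]
    peak = max(terms)  # log-sum-exp for numerical stability
    return peak + math.log(sum(math.exp(x - peak) for x in terms))


# Two-component, one-dimensional toy state evaluated at o = [0.0].
print(log_state_density([0.0], [0.5, 0.5], [[0.0], [3.0]], [[1.0], [1.0]]))
```

Working with log b_j(o_t) is convenient here because the detection statistics in this chapter are sums of log-likelihood ratios.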
where M is the total number of Gaussian components $N(\cdot)$; $c_{jm}$, $\mu_{jm}$ and $\Sigma_{jm}$ are the mixture coefficient, mean vector, and covariance matrix of the mth mixture at state j, respectively.

As we presented in [13], detecting the change-point between states is similar to detecting the change-point between two data distributions. For a left-to-right HMM, it can be implemented by repeating the following procedure until obtaining the last change-point between state S − 1 and S: Select a time threshold $t_\delta > 0$, observe data sequentially at time t, and decide that the change from state s to s + 1 occurs if

\[ t - \ell_s \ge t_\delta, \tag{6.8} \]

and

\[ T(o_t) = \sum_{i=\ell_{s-1}+1}^{t} R(o_i) - \min_{\ell_{s-1} < k \le t} \Bigl\{ \sum_{i=\ell_{s-1}+1}^{k} R(o_i) \Bigr\} > \varepsilon, \tag{6.9} \]

where $\varepsilon \ge 0$,

\[ \ell_s = \arg\min_{\ell_{s-1} < k \le t} \Bigl\{ \sum_{i=\ell_{s-1}+1}^{k} R(o_i) \Bigr\}, \tag{6.10} \]

\[ R(o_i) = \log \frac{b_{s+1}(o_i)}{b_s(o_i)}, \tag{6.11} \]

and $b_j(\cdot)$ is defined in (6.7); $\ell_{s-1}$ and $\ell_s$ are the endpoints of the last and current states, respectively (with $\ell_0 = 0$ for the first state). The procedure should be run recursively from $\ell_1$ to $\ell_{S-1}$.
Fig. 6.1 illustrates the above scheme: the endpoint detection for state 1 is shown in Fig. 6.1 (a), and that for state 2 in Fig. 6.1 (b). Fig. 6.1 (c) illustrates the grid points involved in the detection. It is straightforward to implement the above procedure in a recursive form. For the task of search-space detection, we let $t_\delta = 1$ and $\varepsilon = 0$; thus, the detector makes no false rejections, although it may make false acceptances. The false-acceptance issue is resolved by the algorithm introduced in the next section.

The above algorithm assumes that there is no state skip in the alignment. When allowing one state skip, we need to consider an additional test with

\[ R'(o_i) = \log \frac{b_{s+2}(o_i)}{b_s(o_i)}, \]

then compare the result to the above test to decide both the number of the next state and the location of the corresponding change-point.
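The no-skip procedure of this section can be sketched as a loop over states, restarting the running sum after each detected endpoint. This is a minimal illustration under our own assumptions, not the author's implementation: the state log densities are precomputed in a frame-by-state table, frames are numbered from 1, the first state starts at frame 1, and ties in the minimum go to the earliest frame. The function and variable names are hypothetical.

```python
def detect_state_endpoints(log_b, t_delta=1, eps=0.0):
    """Forward change-point detection for a left-to-right HMM without skips.

    log_b[t][s] holds the log density of frame t + 1 under state s + 1.
    Returns the detected endpoints of states 1..S-1 as 1-based frame numbers;
    the list is shorter if the data end before every change is found.
    """
    num_frames = len(log_b)
    num_states = len(log_b[0])
    endpoints = []
    prev_end = 0                          # endpoint before the first state
    for s in range(num_states - 1):       # test state s against state s + 1
        cum, min_cum, ell = 0.0, None, None
        found = False
        for t in range(prev_end + 1, num_frames + 1):
            frame = log_b[t - 1]
            cum += frame[s + 1] - frame[s]        # log ratio, as in (6.11)
            if min_cum is None or cum < min_cum:
                min_cum, ell = cum, t             # candidate endpoint, (6.10)
            if t - ell >= t_delta and cum - min_cum > eps:   # (6.8), (6.9)
                found = True
                break
        if not found:
            break
        endpoints.append(ell)
        prev_end = ell
    return endpoints


# Toy data: 9 frames, 3 states, three frames per state.
true_states = [0, 0, 0, 1, 1, 1, 2, 2, 2]
toy = [[0.0 if s == k else -5.0 for s in range(3)] for k in true_states]
print(detect_state_endpoints(toy))
```

For this idealized input the detected endpoints are frames 3 and 6. A backward pass can be obtained in the same spirit by running the detector on the time-reversed frames with the state order reversed, although the exact bookkeeping is left out of this sketch.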
6.4 HMM Search-Space Reduction

We define the entire search space, in terms of grid points, as

\[ \Psi = \{(t, s_t) \mid 1 \le t \le T;\ 1 \le s_t \le S\}, \tag{6.12} \]

where $(t, s_t) \in \Psi$ is a grid point for probability computation, t is the frame number of feature vectors, $s_t$ is the state index at time t, and T and S are the total numbers of frames and states, respectively. The probability density at each grid point is computed by (6.7). The goal is to detect a subspace $\Omega \subseteq \Psi$ which includes the path with the maximum likelihood score under the constraint of the left-to-right HMM.

6.4.1 Concept of Search-Space Reduction

When applying the above state change-point detection algorithm with $t_\delta = 1$ and $\varepsilon = 0$ in a forward time scheme, i.e. from t = 1 to t = T, we can detect a sequence of state change-points. The grid points along the sequence form a boundary in the search space, called the forward boundary:

\[ B^+ = \{(t, s_t^+) \mid s_t^+ \le s_{t+1}^+,\ t = 1, \ldots, T\} \subset \Psi, \tag{6.13} \]

where $s_t^+$ is the state index at time t along the boundary. An example of the forward boundary is shown in Fig. 6.3 as the solid line. The grid points along the forward dashed line and the solid line were involved in the forward detection.

On the other hand, if we detect the state change-points in a backward scheme, i.e. from t = T to t = 1, we can detect another sequence of state change-points. The grid points along the sequence form another boundary, called the backward boundary:

\[ B^- = \{(t, s_t^-) \mid s_t^- \le s_{t+1}^-,\ t = 1, \ldots, T\} \subset \Psi, \tag{6.14} \]

where $s_t^-$ is the state index at time t along the boundary. An example of the backward boundary is again the grid points along the solid line in Fig. 6.3. The dashed line from right to left indicates the direction of the backward sequential detection.

Generally speaking, neither one of the boundaries guarantees the optimal path since both of them may include false acceptances; however, the two boundaries enclose a subspace consisting of the grid points inside and along the boundaries. If the grid points of the forward boundary are on or above the backward boundary at every time frame, the enclosed subspace is

\[ \Omega = \{(t, s_t) \mid 1 \le t \le T;\ s_t^- \le s_t \le s_t^+\} \subset \Psi, \tag{6.15} \]

where $(t, s_t^+) \in B^+$ and $(t, s_t^-) \in B^-$. In this case, subspace Ω must include the optimal path under the constraint of the left-to-right HMM since neither one of the boundaries, $B^+$ or $B^-$, allows false rejection.
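Given the two boundaries as per-frame state indices, the subspace in (6.15) is simply the set of grid points between them at each frame. The following sketch assumes that representation (two equally long lists of 1-based state indices); the function name and the choice of returning None when the ordering of the boundaries is violated are our own.

```python
def reduced_search_space(s_plus, s_minus):
    """Build the subspace of (6.15) from per-frame boundary states.

    s_plus[t] and s_minus[t] are the forward and backward boundary states at
    frame t + 1. Returns a list of 1-based (frame, state) grid points, or None
    if the forward boundary falls below the backward boundary at some frame.
    """
    omega = []
    for t, (hi, lo) in enumerate(zip(s_plus, s_minus), start=1):
        if hi < lo:
            return None   # the ordering assumed by (6.15) does not hold
        omega.extend((t, s) for s in range(lo, hi + 1))
    return omega


# Four frames, three states: the boundaries differ only at frame 2.
print(reduced_search_space([1, 2, 2, 3], [1, 1, 2, 3]))
```

Here only frame 2 offers a choice between two states, so the reduced space holds five grid points instead of the twelve in the full space.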
However, the constraint in (6.15) may not always hold true, i.e. the grid points of $B^+$ may be located below $B^-$ at some time frames. This may be due to a skipped state, a mismatch between the model and data, or an impostor’s utterance. In these special cases, depending on the application, we can reject an utterance, perform a full search in Ψ, or construct a search space as follows:

\[ \Omega' = \Bigl( \bigcup_{m=1}^{M} \phi_m \Bigr) \cup \Bigl( \bigcup_{n=1}^{N} \psi_n \Bigr), \tag{6.16} \]

where $\phi_m$ is an enclosed, regular subspace in which $B^-$ is under $B^+$,

\[ \phi_m = \{(t, s_t) \mid t_i \le t \le t_j;\ s_{t_i}^- \le s_t \le s_{t_j}^+\}, \tag{6.17} \]

and $\psi_n$ is a rectangular subspace in which $B^+$ is under $B^-$,

\[ \psi_n = \{(t, s_t) \mid t_i \le t \le t_j;\ s_{t_i}^+ \le s_t \le s_{t_j}^-\}. \tag{6.18} \]
Once a search space is enclosed, an optimal search algorithm can then be employed in either Ω or Ω′ to find the optimal path.

There are three typical cases in search-space reduction. A real search space can be a combination of these cases.

Case 1: Single Path in the Reduced Search Space

If the forward and backward boundaries are identical, there exists only a single path in the reduced search space, i.e. $B^+ = B^- = \Omega$. In this case, a further search is not necessary, and a maximum likelihood score can be computed from the path directly, such as the solid line in Fig. 6.3. We note that the change-point detection involves the points along the dashed lines, but they are not in the subspace Ω.

Case 2: Multiple Paths in a Local Area

When multiple paths exist in a search space, the forward and backward boundaries do not meet in some local areas. In this case, a search is needed only for those local areas. An example is shown in Fig. 6.4 as the “hole” of four grid points, (8,3), (8,4), (9,3) and (9,4). We only need to search those four grid points. Another example of a larger, reduced search space is shown in Fig. 6.5.

Case 3: Special Cases

In some special cases, the forward boundary may be under the backward boundary in some areas, as shown in Fig. 6.6 between (11,4) and (18,6). This could be caused by the data skipping one HMM state or not going through all HMM states. Most likely, the data are from an impostor, i.e. there is no match between the testing data and the HMM. Depending on the application, we can either reject the data or perform a search on either Ψ or Ω′. For the example in Fig. 6.6, $\Omega' = \phi \cup \psi$, where $\phi = \{(t, s_t) \mid 1 \le t \le 11;\ s_t^- \le s_t \le s_t^+\}$ and $\psi = \{(t, s_t) \mid 11 \le t \le 18;\ 4 \le s_t \le 6\}$; ψ is a rectangular subspace.

Fig. 6.3. All the grid points construct a full search space Ψ. The grid points involved in the change-point detection are marked as black points. A single path (solid line) is detected from the forward and backward change-point detection. (Axes: frames t, 1 to 18; states s, 1 to 6.)

Fig. 6.4. A “hole” is detected from the forward and backward state change-point detection. A search is needed only among four grid points, (8,3), (8,4), (9,3) and (9,4). The solid line indicates the path with the maximum likelihood score. (Axes: frames t, 1 to 18; states s, 1 to 6.)

6.4.2 Algorithm Summary and Complexity Analysis

We summarize the search-space reduction algorithm as follows:

1. Perform a forward state change-point detection to obtain a forward boundary, $B^+$.
2. Perform a backward state change-point detection to obtain a backward boundary, $B^-$.
3. If $B^+$ is above $B^-$ at all time frames, search for the optimal path in subspace Ω as defined in (6.15), if necessary; otherwise, search for the optimal path in Ω′ as defined in (6.16).
4. Return the accumulated log-likelihood score. Return the state segmentation if required.

Fig. 6.5. A search is needed in the reduced search space Ω which includes all the black points in between the two dashed lines. The points along the dashed lines are involved in change-point detection, but they do not belong to the reduced search space. (Axes: frames t, 1 to 18; states s, 1 to 6.)

Fig. 6.6. A special case is located between (11,4) and (18,6), where the forward boundary is under the backward one. A full search can be done in the subspace $\{(t, s_t) \mid 11 \le t \le 18;\ 4 \le s_t \le 6\}$. (Axes: frames t, 1 to 18; states s, 1 to 6.)

Assuming each state has T/S frames, the search-space reduction algorithm needs approximately

\[ 4\,\frac{T}{S} + 3(S - 2)\,\frac{T}{S} = \frac{T}{S}\,(3S - 2) \]

grid-point computations, where T and S are the total numbers of frames and states, respectively. When $S \gg 2/3$, it needs about 3T grid-point computations while a full-search algorithm needs approximately S × T grid-point computations. Supposing that the overhead computation can be ignored, the upper bound of the speedup of the algorithm is approximately S/3. In other words, the computational complexity can be reduced by up to a factor of S/3, in the ideal case, compared to a full Viterbi search algorithm.
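Step 3 of the summary, searching only inside the detected subspace, can be sketched as a Viterbi recursion that skips every grid point outside the allowed set. This is an illustrative sketch rather than the implementation evaluated in this chapter. It assumes a left-to-right HMM without skips, a path that starts in state 1 and ends in state S, log-domain transition scores supplied as two lists, and a set of 1-based (frame, state) pairs describing the subspace; all names are hypothetical.

```python
NEG_INF = float("-inf")


def viterbi_in_subspace(log_b, log_stay, log_next, allowed):
    """Best log-likelihood of a left-to-right, no-skip path inside `allowed`.

    log_b[t][s] : log density of frame t + 1 under state s + 1
    log_stay[s] : log probability of staying in state s + 1
    log_next[s] : log probability of moving from state s + 1 to state s + 2
    allowed     : set of 1-based (frame, state) grid points to be searched
    """
    num_frames, num_states = len(log_b), len(log_b[0])
    prev = [NEG_INF] * num_states
    if (1, 1) in allowed:
        prev[0] = log_b[0][0]
    for t in range(1, num_frames):
        cur = [NEG_INF] * num_states
        for s in range(num_states):
            if (t + 1, s + 1) not in allowed:
                continue   # outside the subspace: no density evaluation needed
            best = prev[s] + log_stay[s]
            if s > 0:
                best = max(best, prev[s - 1] + log_next[s - 1])
            if best > NEG_INF:
                cur[s] = best + log_b[t][s]
        prev = cur
    return prev[num_states - 1]


# Toy problem: 3 frames, 2 states, transition scores set to zero for clarity.
toy_log_b = [[0.0, -9.0], [-1.0, -2.0], [-9.0, 0.0]]
full = {(t, s) for t in (1, 2, 3) for s in (1, 2)}
single_path = {(1, 1), (2, 1), (3, 2)}
print(viterbi_in_subspace(toy_log_b, [0.0, 0.0], [0.0], full))
print(viterbi_in_subspace(toy_log_b, [0.0, 0.0], [0.0], single_path))
```

In this toy case the single-path subspace contains the best path of the full space, so both calls return the same score of -1.0 while the second one visits half as many grid points.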
6.5 Experiments

In this section, we apply the search-space reduction algorithm to a fixed-phrase SV task as discussed in Chapter 9. We compare the proposed algorithm with full-search and beam-search algorithms.

6.5.1 An Example of State Change-Point Detection

In an SV system, a speaker-dependent (SD) left-to-right HMM is trained with 14 states, and each state has four Gaussian mixtures, for the pass-phrase “open sesame”. The speech feature is a sequence of 24-dimensional cepstrum and delta-cepstrum coefficients derived from a 10th-order linear predictive coding (LPC) analysis over a 30 ms window updated at 10 ms intervals. For a test utterance of 100 feature frames, Fig. 6.7 shows the procedure of change-point detection from state 1 to 7, and Fig. 6.8 shows it from state 8 to 13. In state s, we plot $\log \frac{b_{s+1}(o_t)}{b_s(o_t)}$ for each time step t. The vertical dashed lines are the detected endpoints for each state. When $t_\delta = 2$ and $\varepsilon = 0$, the detected endpoints are exactly the same as those from a full Viterbi search.

6.5.2 Application to Speaker Verification

An SV system includes two kinds of sessions, enrollment and test. In an enrollment session, an identity, such as an account number, is assigned to a speaker, and the speaker is asked to select a spoken pass-phrase. The system collects and verifies five training utterances. The detailed procedure is presented in Chapter 9. A speaker-dependent (SD) HMM, called the target model, is then constructed for the whole phrase. In a test session, the speaker’s test utterance is compared against the pre-trained target model. The speaker is accepted if the likelihood-ratio score exceeds a preset threshold; otherwise the speaker is rejected.
In this experiment, a typical beam-search algorithm was used for comparison: if $(t, s_t^*)$ is the grid point with the best path at time t, and its log-likelihood score is $L(t, s_t^*)$, the path to any $(t, s_t)$ will be a candidate for extension at frame t+1 only if $L(t, s_t) \ge L(t, s_t^*) - \theta$, where θ is the beam width [17, 7, 19] and L is an accumulated log-likelihood score from the beginning of the utterance.
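The pruning rule above can be illustrated with a minimal sketch. It is not the implementation used in the experiments: the function name `beam_viterbi`, the uniform transition scores `log_self` and `log_next`, and the count of "visited" grid points are assumptions made here for illustration, and the HMM is taken to be strictly left-to-right (stay in a state or move to the next one).

```python
import numpy as np

def beam_viterbi(log_b, log_self, log_next, beam=np.inf):
    """Viterbi alignment for a left-to-right HMM with beam pruning.

    log_b    : (T, S) array of frame scores log b_s(o_t)
    log_self : log probability of staying in a state (assumed equal for all states)
    log_next : log probability of moving to the next state (assumed equal)
    beam     : beam width theta; np.inf gives the full search
    Returns (score of the best path ending in the last state, grid points evaluated).
    """
    T, S = log_b.shape
    score = np.full(S, -np.inf)
    score[0] = log_b[0, 0]              # every path starts in the first state
    visited = 1
    for t in range(1, T):
        # Keep only paths with L(t, s) >= L(t, s*) - theta.
        active = score >= score.max() - beam
        pruned = np.where(active, score, -np.inf)
        stay = pruned + log_self
        move = np.full(S, -np.inf)
        move[1:] = pruned[:-1] + log_next
        best = np.maximum(stay, move)
        reachable = np.isfinite(best)
        visited += int(reachable.sum())  # grid points that are actually scored
        score = np.where(reachable, best + log_b[t], -np.inf)
    return score[-1], visited
```

With `beam=np.inf` the function reduces to a full search, so the two can be compared on the same frame scores; a narrow beam evaluates fewer grid points but may return a lower score or prune the final state altogether.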
Fig. 6.7. The procedure of sequential state change-point detection from state 1 (top) to state 7 (bottom), where the vertical dashed lines are the detected endpoints of each state.
To facilitate the comparison, in this test we only apply different alignment algorithms in the target score L(O, Λt) computation. We fixed the endpoints and background scores L(O, Λb ) during the comparison. The feature vector for SV is composed of 12 cepstral and 12 delta-cepstral coefficients since it is not necessary to use the 39 features for SV. The cepstrum is derived from a 10th order LPC analysis over a 30 ms window and the feature vectors are updated at 10 ms intervals [24]. The experimental database consists of fixed-phrase utterances recorded over the long-distance telephone network by 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is “I pledge allegiance to the flag” with an average length of 2 seconds. Five verified utterances are used to train the target model. For testing, we use 40 utterances recorded from a true speaker in different sessions, and 192 utterances recorded from 50 impostors of the same gender in different sessions. The speaker-dependent (SD) target models for the phrases are left-to-right HMM’s. The number of states depends on the total number of phonemes in the phrases. There are four Gaussian components associated with each state [24]. The background models are concatenated SI phoneme HMM’s trained
Fig. 6.8. The procedure of sequential state change-point detection from state 8 (top) to state 13 (bottom).
on a telephone speech database from different speakers and texts [25]. There are 43 phoneme HMM’s, and each model has three states with 32 Gaussian components associated with each state. Due to unreliable variance estimates from a limited amount of speaker-specific training data, a global variance estimate was used as the common variance for all Gaussian components in the target models [24]. The search-space reduction algorithm was compared to both the beam-search and full-search algorithms on the SV system with a total of 3,970 utterances from true speakers and 19,608 utterances from impostors in the database. For accuracy analysis, we list the experimental results as verification equal-error rates (EER’s) and summarize them in Fig. 6.9 (a). The results indicate that the accuracy of the search-space reduction algorithm is almost the same as that of the full-search algorithm and is much better than that of the beam-search algorithm for beam widths of 200, 300, and 500. For complexity analysis, we evaluated the experimental results in three aspects: grid points involved in the alignment, speedup, and overhead computation. On average, the search-space reduction algorithm visited only about 27% of the grid points, while the beam-search algorithm visited more grid points to obtain a similar accuracy. Furthermore, the complexity is compared in terms of speedup defined as follows:
$$\text{Speedup} = \frac{\text{Flops of the full-search algorithm}}{\text{Flops of a compared algorithm}}, \tag{6.19}$$
where each arithmetic computation counts as one flop (floating-point operation). As shown in Fig. 6.9 (b), compared to the full-search algorithm, the search-space reduction algorithm is about four times faster. Compared to the beam-search algorithm, the search-space reduction algorithm is either much faster when the accuracy is about the same, or more accurate when the alignment speed is about the same.
[Figure 6.9: two bar charts over Full-Search, Beam 500, Beam 300, Beam 200, and Proposed. Panel (a), Equal-Error Rates (%), bar labels 2.29, 2.24, 2.18, 2.07, 2.09. Panel (b), Average Speedups, bar labels 3.92, 3.08, 2.47, 1.94, 1.00.]
Fig. 6.9. (a) Comparison of average individual equal-error rates (EER’s); (b) Comparison on average speedups.
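The equal-error rate used in Fig. 6.9 (a) is the operating point at which the false-acceptance and false-rejection rates coincide. The following is a minimal sketch of how such a rate can be estimated from one speaker's true-speaker and impostor scores; the function name and the simple threshold sweep are assumptions for illustration, not the evaluation code used for the reported results.

```python
import numpy as np

def equal_error_rate(true_scores, impostor_scores):
    """Estimate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The EER is taken
    as the average of the two error rates at the threshold where they are closest.
    """
    true_scores = np.asarray(true_scores, dtype=float)
    impostor_scores = np.asarray(impostor_scores, dtype=float)
    thresholds = np.sort(np.concatenate([true_scores, impostor_scores]))
    best_gap, eer = np.inf, 1.0
    for th in thresholds:
        frr = np.mean(true_scores < th)        # true speakers falsely rejected
        far = np.mean(impostor_scores >= th)   # impostors falsely accepted
        gap = abs(far - frr)
        if gap < best_gap:
            best_gap, eer = gap, 0.5 * (far + frr)
    return eer
```

An average individual EER, as in the caption above, would then be the mean of this value over all speakers.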
The slightly better accuracy of the search-space reduction algorithm over that of the full-search algorithm is due to the search-space reduction. The average target scores of the true speakers are almost the same as the full-search
one, while the average impostor score of the search-space reduction algorithm is reduced. In the true speaker’s case, since there is no false rejection in change-point detection, the optimal path in the reduced search space is usually the same one as in the full search space. On the other hand, in the impostor’s case, since there is a mismatch between the model and utterance, the algorithm may skip one state or may not go through all the states in detecting the boundaries (e.g., Fig. 6.6). The optimal path in the reduced search space may not be the same one as in the full space; therefore, the impostor’s likelihood score can be lower than a full-search one while the true speaker’s scores are almost the same. Since the search space of the search-space reduction algorithm is a subspace of that of the full-search algorithm, its likelihood score is usually smaller than the full-search one; however, the search-space reduction algorithm can increase the difference between the true speaker’s score and the impostors’ scores. This reduced the EER’s slightly in the test.
We note that in the above experiments, the endpoints were given from an SI-HMM alignment. If the target model alignment includes silence, the speedups of the above experiments will be larger and the accuracy improved due to more accurate endpoints. The accuracy can be further improved if target model adaptation is allowed, as discussed in Chapters 9 and 10 and in [24, 15]. Also, if we further apply the search-space reduction algorithm to the background model alignment, it will show much greater speedup since there are a large number of concatenated states in the background model.
6.6 Conclusions
In this chapter, we first introduced an algorithm for sequential HMM state change-point detection, then presented an algorithm for search-space reduction. In the search-space reduction algorithm, we use the detected state change-points to form the boundaries of a search space. When two boundaries are detected by using a forward-and-backward scheme, a subspace is detected, and a search algorithm can then be applied to find the optimal path. Compared to the beam-search algorithm, the search-space reduction algorithm is capable of detecting a subspace without using a beam width, which is difficult to determine beforehand. The experiment in a speaker verification task shows that the search-space reduction algorithm can provide much better accuracy than the beam-search algorithm at a faster speed. Compared to a full-search algorithm, the search-space reduction algorithm is about four times faster with slightly better accuracy in the experiment. Given a total of S states in an HMM, the computational complexity can be reduced by up to a factor of S/3. Decoding and searching always require substantial computational resources, especially when speech applications are implemented in handheld devices such as wireless phones and smartphones. The limited computational resources may affect application performance significantly. The proposed algorithm can lead
to better performance or less power consumption while using the same device. Furthermore, the proposed algorithm can be implemented in speech recognition. In addition to speaker authentication, the search-space reduction algorithm may also be applied to the decoding problem in digital communications and in network routing, where the Viterbi and other search algorithms are usually applied.
References
1. Bahl, L. R. et al., “Large vocabulary natural language continuous speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 465–467, May 1989.
2. Bansal, R. K., “An algorithm for detecting a change in stochastic process,” Master Thesis, University of Connecticut, EECS Dept., 1983.
3. Bansal, R. K. and Papantoni-Kazakos, P., “An algorithm for detecting a change in stochastic process,” IEEE Trans. Information Theory, vol. IT-32, pp. 227–235, March 1986.
4. Bellman, R. E., Dynamic Programming. Princeton, NJ: Princeton University Press, 1957.
5. Brodsky, B. and Darkhovsky, B. S., Nonparametric methods in change-point problems. Boston: Kluwer Academic, 1993.
6. Chen, J. K. and Soong, F. K., “An n-best candidates-based discriminative training for speech-recognition applications,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 206–216, Jan. 1994.
7. Deller, J. R., Proakis, J. G., and Hansen, J. H. L., Discrete-time processing of speech signals. NY: Macmillan Publishing, 1993.
8. Deshmukh, N., Ganapathiraju, A., and Picone, J., “Hierarchical search for large-vocabulary conversational speech recognition,” IEEE Signal Processing Magazine, vol. 16, pp. 84–107, Sept. 1999.
9. Forney, G. D., “The Viterbi algorithm,” Proceedings of the IEEE, vol. 61, pp. 268–278, March 1973.
10. Kazakos, D. and Papantoni-Kazakos, P., Detection and Estimation. NY: Computer Science Press, 1990.
11. Lee, C.-H. and Rabiner, L. R., “A frame-synchronous network search algorithm for connected word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1649–1658, November 1989.
12. Li, Q., “A detection approach to search-space reduction for HMM state alignment in speaker verification,” IEEE Trans. on Speech and Audio Processing, vol. 9, pp. 569–578, July 2001.
13. Li, Q., “A fast decoding algorithm based on sequential detection of the changes in distribution,” in Proc. Int’l Conf. on Spoken Language Processing, (Sydney), Nov. 1998.
14. Li, Q., “A fast, sequential decoding algorithm with application to speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Phoenix), March 1999.
15. Li, Q. and Juang, B.-H., “Speaker verification using verbal information verification for automatic enrollment,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Seattle), May 1998.
16. Lorden, G., “Procedures for reacting to a change in distribution,” The Annals of Mathematical Statistics, vol. 42, pp. 1897–1908, 1971.
17. Lowerre, B. and Reddy, R., “The HARPY speech understanding system,” in W. A. Lea, ed., Trends in Speech Recognition. NJ: Prentice Hall, 1980.
18. Ney, H., Haeb-Umbach, R., Tran, B.-H., and Oerder, M., “Improvements in beam search for 10000-word continuous speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (San Francisco, CA), pp. I-9–I-12, March 1992.
19. Ney, H. and Ortmanns, S., “Dynamic programming search for continuous speech recognition,” IEEE Signal Processing Magazine, vol. 16, pp. 64–83, Sept. 1999.
20. Nguyen, L., Schwartz, R., Kubala, F., and Placeway, P., “Search algorithms for software-only real-time recognition with very large vocabularies,” in Proceedings of DARPA Human Language Technology Workshop, pp. 91–95, March 1993.
21. Page, E. S., “Continuous inspection schemes,” Biometrika, vol. 41, pp. 100–115, 1954.
22. Page, E. S., “A test for a change in a parameter occurring at an unknown point,” Biometrika, vol. 42, pp. 523–527, 1955.
23. Papoulis, A., Probability, Random variables, and stochastic processes. NY: McGraw-Hill, 1984.
24. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker verification using sub-word background models and likelihood-ratio scoring,” in Proceedings of ICSLP-96, (Philadelphia), October 1996.
25. Rosenberg, A. E. and Parthasarathy, S., “Speaker background models for connected digit password speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.
26. Viterbi, A. J., “Error bounds for convolutional codes and an asymptotically optimal decoding algorithm,” IEEE Transactions on Information Theory, vol. IT-13, pp. 260–269, April 1967.
27. Wald, A., Sequential analysis. NY: Chapman & Hall, 1947.
Chapter 7 Auditory-Based Time Frequency Transform
Time-frequency transforms play an important role in signal processing. Many speech processing algorithms need to convert the time-domain speech signal to the frequency domain. The Fourier transform (FT) and the fast Fourier transform (FFT) have been used for decades, but they are not robust to background noise. As shown in this chapter, the FFT generates computation noise and pitch harmonics during its computation. In a different approach, the traveling wave in the cochlea was modeled as a Gammatone function. A bank of these functions has been used as the forward transform to decompose the input signal into different frequency bands, but there is no proven inverse transform for the Gammatone filter bank, and the filter bandwidths are fixed and cannot be adjusted for different kinds of applications. To address the above issues, the author presented a robust, invertible, and auditory-based time-frequency transform named the auditory-based transform or auditory transform (AT) in [23, 22]. In this chapter, we provide a detailed introduction to the AT. The AT is a pair of forward and inverse transforms which has been proven in theory and validated in experiments. In order to derive the inverse transform and provide flexible filter bandwidth, we modified the traditional Gammatone function. We also present the fast AT algorithm for discrete-time signals. Compared to the FT, the AT has the following advantages: First, it is more robust to both background and computational noise. Second, the AT can be free from pitch harmonics when processing speech data. Finally, the AT only needs real-number computation. We have applied the presented transform to noise reduction, speech feature extraction, and other applications. Our experiments have shown significant advantages over the FT. Furthermore, the derived inverse formula can also be used to compute the inverse continuous wavelet transform numerically.
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_7, © Springer-Verlag Berlin Heidelberg 2012
Fig. 7.1. Male’s voice: “2 0 5” recorded simultaneously by close-talking (top) and hands-free microphones in a moving car (bottom).
7.1 Introduction
The author defined the auditory-based time-frequency transform (AT) by simulating the peripheral auditory system, especially the traveling waves in the cochlea [22]. Before presenting the AT, we first discuss the problems with the Fourier transform (FT), and then we introduce the basic signal processing functions of the ear.
7.1.1 Observing Problems with the Fourier Transform
As is well known, most of the information in speech can be presented in the frequency domain. Therefore, most speech signal processing algorithms convert speech waveforms from the time domain to the frequency domain and present the information as a spectrum or spectrogram. For many existing speech processing systems, the fast Fourier transform (FFT) is the major tool for the time-frequency transform. Over the last several decades, a significant amount of effort has been put into the research of the signals after the FFT [10], i.e. in the frequency domain, but little effort has been put into the research of new time-frequency transform algorithms to challenge the FFT. From the examples in Fig. 7.1 and Fig. 7.2, we can easily observe the problems: much of the noise in the frequency domain is, in fact, created by the FFT process.
Fig. 7.2. The speech waveforms in Fig. 7.1 were converted to spectrograms by FFT and displayed in Bark scale from 0 to 16.4 Barks (0 to 3500 Hz). The background noise and the pitch harmonics were generated mainly by FFT.
Fig. 7.1 depicts the waveforms of a male’s voice: “2 0 5”, which was recorded simultaneously in two channels by a close-talking microphone (top) and a hands-free microphone (bottom) in a moving car with a sampling rate of 8 kHz. The close-talking microphone was located on the speaker’s lapel, while the hands-free microphone was located on the sun visor. Due to the distance between the speaker’s mouth and both microphones, background car noise was substantially involved in the voice waveform in Fig. 7.1. If we compute the spectrograms for the corresponding waveforms using FFT, we will obtain the spectrograms as shown in Fig. 7.2. The top spectrogram was obtained from the waveform of the close-talking (clean) speech in Fig. 7.1 (top), while the bottom spectrogram was obtained from the waveform of the hands-free (noisy) speech in Fig. 7.1 (bottom). The spectrograms were computed in the standard way, i.e. as in traditional feature extraction for speaker or speech recognition: a window of length 30 ms shifted every 10 ms with an overlap of 20 ms. A Hamming window was applied before the FFT. To facilitate further analysis, we warped the frequency from the linear scale to the Bark scale as described in [27, 26], which is a frequency distribution scale similar to that of the human hearing system. Fig. 7.3 represents a cross section at the 1.15 second time frame from Fig. 7.2. The solid line represents the speech from the close-talking microphone,
Fig. 7.3. The spectrum of FFT at the 1.15 second time frame from Fig. 7.2: The solid line represents the speech from a close-talking microphone. The dashed line is from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously. The FFT spectrum shows 30 dB distortion at low frequency bands due to background noise and pitch harmonics as noise.
and the dashed line represents the speech from the hands-free microphone. From Figs. 7.2 and 7.3, we can observe the following:
• Background noise distortion: The FFT spectrum shows a 30 dB distortion between the close-talking and hands-free speech at low frequency bands. It is due to the car background noise.
• Pitch harmonics: The FFT spectrum shows significant pitch harmonics as periodic waves along the frequency axis, which are not desired by speaker and speech recognition. This is due to the fixed length of the FFT window.
• Computation noise: The noise displayed as “snow” in Fig. 7.2 is not the recorded environmental noise. It was generated by the FFT computation. It has the potential to degrade the speaker and speech recognition performance.
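For reference, the standard FFT spectrogram computation described above (30 ms Hamming window shifted every 10 ms) can be sketched as follows. This is a minimal illustration and not the code used to produce Fig. 7.2: the function name, the FFT length of 256 points, and the small floor added before the logarithm are assumptions, and the Bark-scale warping is left out.

```python
import numpy as np

def fft_spectrogram_db(signal, fs=8000, win_ms=30, hop_ms=10, n_fft=256):
    """Short-time FFT power spectrogram in dB, one row per frame."""
    signal = np.asarray(signal, dtype=float)
    win = int(fs * win_ms / 1000)       # 240 samples at 8 kHz
    hop = int(fs * hop_ms / 1000)       # 80 samples at 8 kHz
    if len(signal) < win:
        raise ValueError("signal is shorter than one analysis window")
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2
    return 10.0 * np.log10(power + 1e-12)   # floor avoids log(0)
```

Each row holds `n_fft // 2 + 1` linearly spaced frequency bins from 0 Hz to `fs / 2`; the fixed window length is the same for all bins, which is the property the observations above attribute the pitch harmonics to.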
For robust speaker recognition, we need a more robust time-frequency transform as the foundation of feature extraction. The transform should generate less distortion from background noise and less computation noise while retaining the useful information, such as pitch and formants.
7.1.2 Brief Introduction of the Ear
In order to obtain a more robust solution for speaker and speech recognition, we need a more robust way to conduct the time-frequency transform than the FFT. Our approach has been inspired by a study of the human hearing system.
Fig. 7.4. Illustration of human ear and cochlea.
Fig. 7.5. Illustration of a stretched out cochlea and a traveling wave exciting a portion of the basilar membrane.
As shown in Fig. 7.4, the human hearing system includes the outer (external) ear, middle ear, and inner ear. The outer ear consists of the pinna, the concha, and a canal leading to the eardrum. The changing sound pressure of any source is collected by the outer ear. The canal and the eardrum provide
the main elements of a complex acoustic cavity, such a cavity being expected to change the sound pressure at the eardrum at different frequencies. The frequency transfer function from the concha to the eardrum has been measured and reported in [44, 54]. Once an acoustic stimulus reaches the eardrum, it is transmitted from the middle ear to the inner ear. The middle ear couples sound energy from the external auditory meatus to the cochlea, and helps to match the impedance of the auditory meatus to the much higher impedance of the cochlear fluid. The sound is transmitted from the eardrum to the oval window of the cochlea by three small bones: the malleus, the incus, and the stapes. The middle ear transfer function was obtained by measuring the pressure at the eardrum and behind the oval window of a cat [38]. A combined outer ear and middle ear transfer function has been plotted in [27]. When sound enters the human ear, the waves go through the outer ear and push on the eardrum. The eardrum and three middle ear bones work together as a mechanical amplifier and transfer the sound waves through mechanical movement of the bones. When the last bone in the middle ear, the stapes, moves, it sets the fluid inside the cochlea in motion creating traveling waves on the basilar membrane (BM) as illustrated in Fig. 7.5. Every audible frequency has a specific response location on the basilar membrane. This is how the ear decomposes the frequency components of a time domain signal. The frequency distribution on the basilar membrane is in a nonlinear scale. It was reported and modeled as the Bark scale [58], ERB scale [36], or mel scale [6]. The above three scales are similar to the log scale, and they represent measurements of the human auditory system. Furthermore, the hair cells connected to the basilar membrane convert the movement of the traveling wave into electrical signals and transfer the signals to the human brain. 
The traveling wave and its impulse response have been measured and reported in the literature, such as [48, 21, 53, 7, 52, 34, 43]. The basilar membrane tuning and auditory filter shapes have been studied in the literature, such as [39, 2, 20, 35, 55, 50]. Many electronic and mathematical models have been defined to mimic the traveling wave, the auditory filters, and the frequency responses of the basilar membrane, such as [19, 18, 36, 8, 57, 1, 31, 30, 13, 29, 42]. Furthermore, based on an understanding of the basic function of the cochlea, many feature extraction algorithms have been defined for speech recognition, for example [6, 14, 12, 32, 27, 26], but all these approaches used the FFT as the time-frequency transform, which is obviously different from the frequency decomposition process in the cochlea. We note that this is only a brief introduction on how sound waves are converted to frequency band responses by the hearing system from a signal processing point of view. More details can be found in hearing and psychoacoustic texts [9, 37, 40, 54, 11]. Recent research on hearing periphery models can be found in [56, 3]. Further references on auditory modeling can be found therein. Instead of modeling the hearing system precisely, our research is more interested in developing feature extraction algorithms based on our
understanding of the signal processing mechanism inside the hearing system. The goal of our research is to improve audio and speech signal processing algorithms and their performance.
7.1.3 Time-Frequency Analyses
The Fourier transform (FT) has fixed time-frequency resolution and the frequency distribution is restricted to be linear. As we have discussed, these limitations generate problems in audio and speech processing. On the other hand, the wavelet transform (WT) provides flexible time-frequency resolution which is similar in concept to the auditory periphery [13, 45], but the WT also has notable problems. First, no existing wavelet has the capability of mimicking the impulse responses of the basilar membrane, so it cannot be used directly for modeling the cochlea or related processing. Additionally, even though forward and inverse continuous wavelet transforms are defined, to the best of our knowledge, there is no numerical computational formula to approximate inverse continuous wavelet transforms. No such function exists even in a commercial wavelet package [33]. In previous experiments, the solutions proposed in [46] could not be validated for speech signals. The discrete wavelet transform can also be applied in speech processing [4], but the frequency distribution is limited to the dyadic scale, which is different from the distribution in the cochlea. From hearing research, the Gammatone function [16] has been proposed to model the impulse responses of the human basilar membrane, and Gammatone filter banks have been used to decompose time-domain signals into different frequency bands. A tutorial on the Gammatone filter bank is available in [49] and further references are available therein.
The Gammatone filter bank has been used to decompose the speech signal into multiple frequency bands; however, there is no mathematical formula or proof on how to synthesize the decomposed multichannel signals back into a time-domain signal, only suggestions on resynthesis in plain language [51]. Another problem with the Gammatone function is that the Q-factors of its band-pass filters are not adjustable, which affects the performance in speech and speaker recognition as we will discuss in the next chapter. Other auditory-based models, such as [47, 5], are more interested in modeling and analysis and do not provide an inverse transform, which is necessary for many applications in noise reduction, hearing aids, coding, and other real applications. Furthermore, a Gammatone-based transform with analysis and synthesis is presented in [15], but the filter bank produces a complex-valued output, which is different from the real cochlea and makes implementation complicated. Also, there is no fast algorithm for that approach. To address the above problems and to provide an alternative time-frequency transform in addition to the FFT and WT, we present an auditory-based transform as follows.
7.2 Definition of the Auditory-Based Transform
Our research was motivated by modeling the measured impulse responses of the BM in [48, 53]. In fact, the author modeled the impulse response independently before knowing the work on the Gammatone function. We will provide a comparison with the Gammatone function in Section 7.6 since many people are familiar with it. The author’s approach and research results on the auditory transform (AT) were first presented in [23] and then published in [22]. We define the AT to meet the following requirements: first, the central frequencies of the decomposed sub-bands can be selected without any constraints. In other words, the frequency distribution of the sub-bands can be in any nonlinear scale or similar to the scale in the cochlea. Second, the computation must be simple and in real numbers. Many of the models of the hearing system are too complex to be used directly for real applications. Third, its inverse transform exists and can be proved; thus, the decomposed signal can be synthesized back to the original signal. Fourth, the filter bandwidth, i.e. the Q-factor of the filters in its filter bank, can be adjusted. Fifth, it is robust to background noise. Last, fast numerical computation algorithms can be derived to support real-time applications. In a simplified approximation, the impulse response of the basilar membrane (BM) in the cochlea can be represented by the function $\psi(t) \in L^2(\mathbb{R})$. The function satisfies the following conditions:
1. It integrates to zero:
$$\int_{-\infty}^{\infty} \psi(t)\,dt = 0. \tag{7.1}$$
2. It is square integrable or has finite energy:
$$\int_{-\infty}^{\infty} |\psi(t)|^2\,dt < \infty. \tag{7.2}$$
3. It satisfies:
$$\int_{-\infty}^{\infty} \frac{|\Psi(\omega)|^2}{\omega}\,d\omega = C, \tag{7.3}$$
where $0 < C < \infty$ and
$$\Psi(\omega) = \int_{-\infty}^{\infty} \psi(t)\,e^{-j\omega t}\,dt. \tag{7.4}$$
4. It tapers off to zero on both ends just as it is observed in psychoacoustic experiments with the BM [48].
5. It has one major modulation frequency and its frequency response is a triangle-like, band-pass filter.
The first three conditions are required by mathematics for further derivation. The last two are required in order to match previous psychoacoustic and physiological experimental results and to approximate numerical computations presented later. Let f(t) be any square integrable function. The forward transform of f(t) with respect to a function representing the BM impulse response ψ(t) is defined as:
$$T(a,b) = f(t) * \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right) \tag{7.5}$$
where ∗ denotes a convolution process, a and b are real, both f(t) and ψ(t) belong to $L^2(\mathbb{R})$, and T(a, b) represents the traveling waves in the BM. The above equation can also be written as:
$$T(a,b) = \int_{-\infty}^{\infty} f(t)\,\psi_{a,b}(t)\,dt \tag{7.6}$$
where
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\,\psi\!\left(\frac{t-b}{a}\right). \tag{7.7}$$
Note that $1/\sqrt{|a|}$ is an energy normalizing factor. It ensures that the energy stays the same for all a and b; therefore, we have:
$$\int_{-\infty}^{\infty} |\psi_{a,b}(t)|^2\,dt = \int_{-\infty}^{\infty} |\psi(t)|^2\,dt \tag{7.8}$$
Factor a is a scale or dilation variable. By changing a, we can shift the central frequency of an impulse response function. Factor b is a time shift or translation variable. For a given value of a, factor b shifts the function $\psi_{a,0}(t)$ by an amount b along the time axis. To compute the convolution in (7.5), the cochlear impulse response function or cochlear filter is needed. The author studied the cochlear impulse response plotted in [48, 21, 53, 7, 52, 34, 43], and then defined the following function to represent the cochlear impulse response in 2003 [23]. It was then published in [22].
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}} \left(\frac{t-b}{a}\right)^{\alpha} \exp\!\left[-2\pi f_L \beta \left(\frac{t-b}{a}\right)\right] \cos\!\left[2\pi f_L \left(\frac{t-b}{a}\right) + \theta\right] u(t-b) \tag{7.9}$$
where α > 0 and β > 0, u(t) is the unit step function, u(t) = 1 for t ≥ 0 and 0 otherwise. The value of θ should be selected such that (7.1) is satisfied. b is the time shift variable, and a is the scale variable. The value of a can be determined by the current filter central frequency $f_c$ and the lowest central frequency $f_L$ in the cochlear filter bank:
$$a = f_L / f_c. \tag{7.10}$$
Since we contract $\psi_{a,b}(t)$ from its lowest frequency representation along the time axis, the value of a is in the range 0 < a ≤ 1. If we stretch ψ, then a > 1. The frequency distribution of the cochlear filter and $f_c$ can be in the form of linear or nonlinear scales such as ERB (equivalent rectangular bandwidth) [36], Bark [58], Mel [6], log, etc. Note that the value of a needs to be pre-calculated for the required central frequency of the cochlear filter. Fig. 7.6 shows the impulse responses of 5 cochlear filters and Fig. 7.7 (A) shows their corresponding frequency responses. We note that the impulse responses of the BM in the AT are very similar to the results reported in hearing research, such as the figures in [48], [21], [37] (Fig. 1.12), [53], etc. Normally, we use α = 3. The value of β controls the filter bandwidth, i.e. the Q-factor. We used β = 0.2 for noise reduction and β = 0.035 or a nearby value for feature extraction [25], where higher frequency resolution is needed. In most applications, parameters α and β may need to be determined by experiments. A study of the relation between β and speaker recognition is shown in [24]. The author derived the function in (7.9) directly from psychoacoustic experiment results, such as the impulse responses plotted in [48, 53]. In fact, Eq. (7.9) defined by the author is different from the Gammatone function. In the standard Gammatone function, the Q-factor is fixed. In our definition, the Q-factor can be adjusted by changing parameter β. Further comparison is given in Section 7.6. Using the speech data in Fig. 7.1 as input, the AT can output a bank of decomposed traveling waves in different frequency bands as shown in Fig. 7.8. An enlarged plot obtained by expanding the time frame from 0.314 to 0.324 seconds is shown in Fig. 7.9. We note that, unlike the FFT, all the numbers in the AT are real, which is a significant advantage in real-time applications.
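The cochlear filter of (7.9) and the forward transform of (7.6) can be sketched numerically as follows. This is a direct, unoptimized illustration under stated assumptions rather than the fast AT algorithm: the function names are hypothetical, the energy-normalizing factor is taken as $1/\sqrt{|a|}$ so that (7.8) holds, the filter is truncated once its exponential envelope has decayed by exp(−30), and the phase $\theta = \pi/2 - (\alpha + 1)\arctan(1/\beta)$ is one closed-form choice that makes the continuous-time filter integrate to zero as required by (7.1); it is derived here from the standard integral $\int_0^\infty t^{\alpha} e^{-st}\,dt = \Gamma(\alpha+1)/s^{\alpha+1}$ and is not taken from the text.

```python
import numpy as np

def zero_mean_phase(alpha, beta):
    """Phase theta for which the continuous-time filter integrates to zero (7.1)."""
    phi = np.arctan2(1.0, beta)
    return np.pi / 2.0 - (alpha + 1.0) * phi

def cochlear_filter(fc, fL, fs, alpha=3.0, beta=0.2):
    """Sampled psi_{a,0}(t) of (7.9) for centre frequency fc (Hz), with a = fL / fc."""
    a = fL / fc
    theta = zero_mean_phase(alpha, beta)
    # Truncate once the exponential envelope has decayed by exp(-30).
    n_taps = int(np.ceil(fs * 30.0 / (2.0 * np.pi * fc * beta)))
    x = (np.arange(n_taps) / fs) / a      # (t - b) / a with b = 0 and t >= 0
    envelope = x ** alpha * np.exp(-2.0 * np.pi * fL * beta * x)
    return envelope * np.cos(2.0 * np.pi * fL * x + theta) / np.sqrt(abs(a))

def forward_at(signal, centre_freqs, fs, alpha=3.0, beta=0.2):
    """Discrete approximation of (7.6): T[i, b] = sum_n f[n] * psi_i[n - b] / fs."""
    signal = np.asarray(signal, dtype=float)
    fL = min(centre_freqs)                # lowest centre frequency in the bank
    out = np.empty((len(centre_freqs), len(signal)))
    for i, fc in enumerate(centre_freqs):
        psi = cochlear_filter(fc, fL, fs, alpha, beta)
        full = np.correlate(signal, psi, mode="full")
        # Lag b = 0 sits at index len(psi) - 1 of the full correlation.
        out[i] = full[len(psi) - 1:len(psi) - 1 + len(signal)] / fs
    return out
```

All quantities are real, in line with the remark above. A pure 1000 Hz tone passed through a bank centred at 500, 1000, and 2000 Hz should concentrate its energy in the 1000 Hz band, and decreasing `beta` narrows each band at the cost of longer filters.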
From the output of the forward AT, we can construct a spectrogram similar to the FFT spectrogram. There are many ways to compute and plot an AT spectrogram. One way, which resembles the processing in the cochlea, is to first apply a rectifier function to the output of the AT. This removes the negative part of the AT output. Then, we can use a sliding window and compute the energy of each band in the window. Fig. 7.10 shows the spectrograms computed from the AT using the same speech data as in Fig. 7.2. The window size is 30 ms and shifts every 20 ms with 10 ms overlap. For different applications, the spectrogram computation can be different. The spectrum of the AT at the 1.15 second time frame is shown in Fig. 7.11. We will compare it with the FFT spectrum in Section 7.6.
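The rectify-then-window procedure described above can be sketched as follows. This is only an illustration under stated assumptions: `bands` is taken to be an array of AT band outputs with shape (number of bands, number of samples), half-wave rectification is used as the rectifier, and the energy is the sum of squares in each window. The function name `at_spectrogram` is hypothetical.

```python
import numpy as np

def at_spectrogram(bands, fs, win_s=0.030, hop_s=0.020):
    """Band energies per frame from AT outputs (rows = bands).

    win_s and hop_s default to the 30 ms window and 20 ms shift in the text.
    """
    bands = np.asarray(bands, dtype=float)
    rectified = np.maximum(bands, 0.0)          # drop the negative part
    win = int(round(win_s * fs))
    hop = int(round(hop_s * fs))
    n_frames = 1 + max(0, (rectified.shape[1] - win) // hop)
    spec = np.empty((rectified.shape[0], n_frames))
    for m in range(n_frames):
        seg = rectified[:, m * hop : m * hop + win]
        spec[:, m] = np.sum(seg ** 2, axis=1)   # energy of each band in the window
    return spec

# Toy input: one all-positive band and one all-negative band, 0.1 s at 16 kHz.
toy = np.vstack([np.ones(1600), -np.ones(1600)])
S = at_spectrogram(toy, fs=16000)
```

For plotting, the energies would typically be converted to dB; that step is omitted here because the text leaves the exact display scaling open.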
7.3 The Inverse Auditory Transform

Just as the Fourier transform requires an inverse transform, a similar inverse transform is also needed for the auditory transform. The need arises when
[Figure 7.6: five stacked impulse-response panels for center frequencies 507, 1000, 2014, 3046, and 4050 Hz; horizontal axis: Time (Second), 0 to 0.016.]
Fig. 7.6. Impulse responses of the BM in the AT when α = 3 and β = 0.2. They are very similar to the results reported in hearing research.
the processed frequency decomposed signals need to be converted back to real signals, such as in the application of speech and music synthesis or noise reduction. If we can prove the inverse transform, it also means that no information is lost through the proposed forward transforms. This is important when we use the transform for feature extraction and other applications where we must make sure that the transform does not lose any information. If (7.3) is satisfied, the inverse transform exists:

\[ f(t) = \frac{1}{C} \int_{a=0}^{\infty} \int_{b=0}^{\infty} \frac{1}{|a|^{2}} \, T(a,b) \, \psi_{a,b}(t) \, da \, db \tag{7.11} \]

The derivation of the above transform is similar to the inverse continuous wavelet transform, such as in [41] and others. Equation (7.6) can be written in the form of convolution with f(b):

\[ T(a,b) = f(b) * \psi_{a,0}(-b) \tag{7.12} \]
Taking the Fourier transform of both sides, the convolution becomes multiplication:

\[ \int_{b=-\infty}^{\infty} T(a,b) \, e^{-j\omega b} \, db = \sqrt{|a|} \, F(\omega) \, \Psi^{*}(a\omega) \tag{7.13} \]
[Figure 7.7: two panels, (A) and (B); horizontal axis: Frequency (Hz), 500 to 5000; vertical axis: Magnitude (dB).]
Fig. 7.7. The frequency responses of the cochlear filters when α = 3: (A) β = 0.2; and (B) β = 0.035.
where F(ω) and Ψ(ω) represent the Fourier transforms of f(t) and ψ(t), respectively. We now multiply both sides of the above equation by Ψ(aω)/|a|^{3/2} and integrate over a:

\[ \int_{a=-\infty}^{\infty} \int_{b=-\infty}^{\infty} \frac{1}{|a|^{3/2}} \, T(a,b) \, \Psi(a\omega) \, e^{-j\omega b} \, da \, db = F(\omega) \int_{a=-\infty}^{\infty} \frac{|\Psi(a\omega)|^{2}}{|a|} \, da. \tag{7.14} \]

The integration on the right-hand side can be further written as:

\[ \int_{a=-\infty}^{\infty} \frac{|\Psi(a\omega)|^{2}}{|a|} \, da = \int_{\omega=-\infty}^{\infty} \frac{|\Psi(\omega)|^{2}}{|\omega|} \, d\omega = C \tag{7.15} \]

to meet the admissibility condition in (7.3), where C is a constant. Rearranging (7.14), we then have

\[ F(\omega) = \frac{1}{C} \int_{a=-\infty}^{\infty} \int_{b=-\infty}^{\infty} \frac{1}{|a|^{3/2}} \, T(a,b) \, \Psi(a\omega) \, e^{-j\omega b} \, da \, db. \tag{7.16} \]

We can now take the inverse Fourier transform of both sides of the above equation to obtain (7.11). The above procedure is similar to the derivation of the inverse wavelet transform.
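The step in (7.15) from an integral over a to an integral over ω is a change of variable. As a short worked step, for a fixed ω ≠ 0 substitute ξ = aω (the symbol ξ is introduced here only for illustration); for ω < 0 the swap of the integration limits is absorbed by the absolute values:

```latex
\begin{align*}
\xi = a\omega
  \;\Longrightarrow\;
  \frac{da}{|a|} = \frac{d\xi}{|\xi|},
\qquad
\int_{a=-\infty}^{\infty} \frac{|\Psi(a\omega)|^{2}}{|a|}\,da
  = \int_{\xi=-\infty}^{\infty} \frac{|\Psi(\xi)|^{2}}{|\xi|}\,d\xi
  = C .
\end{align*}
```

The result does not depend on ω, which is why C can be treated as a constant and divided out when rearranging (7.14).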
[Figure 7.8: two panels; horizontal axis: Time (Seconds), 0 to 1.6; vertical axis: Critical Band Rate (Bark), 2 to 16.]
Fig. 7.8. The traveling wave generated by the auditory transform from the speech data in Fig. 7.1.
7.4 The Discrete-Time and Fast Transform

In practical applications, a discrete-time auditory transform is necessary. The forward discrete auditory transform can be written as:

\[ T[a_i, b] = \sum_{n=0}^{N} f[n] \, \frac{1}{\sqrt{|a_i|}} \, \psi\left(\frac{b-n}{a_i}\right) \tag{7.17} \]
where a_i = f_L / f_{c_i} is the scaling factor for the ith frequency band f_{c_i} and N is the length of signal f[n]. The scaling factor a_i can follow a linear or nonlinear scale. For the discrete transform, a_i can also be in ERB, Bark, log, or other nonlinear scales. We derived the corresponding discrete-time inverse transform:

\[ \tilde{f}[n] = \frac{1}{C} \sum_{a_i=a_1}^{a_k} \sum_{b=1}^{N} \frac{1}{|a_i|} \, T[a_i, b] \, \psi\left(\frac{b-n}{a_i}\right) \tag{7.18} \]

where 0 ≤ n ≤ N, a_1 ≤ a_i ≤ a_k, and 1 ≤ b ≤ N. We note that f̃[n] approximates f[n] given the limited number of decomposed frequency bands. The above formulas have been verified by the following experiments. Also note that (7.18) can be applied to compute the inverse continuous WT, where only ψ needs to be replaced.
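A direct, unoptimized reading of (7.17) and (7.18) might look like the sketch below. Several choices are assumptions of this example rather than details from the text: the filter is sampled as ψ(k / (fs · a_i)) for k ≥ 0, θ is set to 0, the center frequencies are log-spaced, and the constant C is left as a user-supplied gain (default 1), so the reconstruction is only defined up to scale. All function names are hypothetical.

```python
import numpy as np

def psi(x, f_L, alpha=3.0, beta=0.2, theta=0.0):
    """Cochlear filter of Eq. (7.9) at a = 1, b = 0, for x >= 0 (seconds)."""
    x = np.maximum(x, 0.0)
    return x ** alpha * np.exp(-2 * np.pi * f_L * beta * x) \
        * np.cos(2 * np.pi * f_L * x + theta)

def filter_bank(center_freqs, f_L, fs, length):
    """Sampled filters psi(k / (fs * a_i)), k = 0..length-1, and scales a_i."""
    scales = f_L / np.asarray(center_freqs, dtype=float)   # Eq. (7.10)
    k = np.arange(length) / fs
    bank = np.stack([psi(k / a, f_L) for a in scales])
    return bank, scales

def forward_at(f, bank, scales):
    """Eq. (7.17): T[a_i, b] = sum_n f[n] psi((b - n) / a_i) / sqrt(|a_i|)."""
    N = len(f)
    return np.stack([np.convolve(f, h)[:N] / np.sqrt(a)
                     for h, a in zip(bank, scales)])

def inverse_at(T, bank, scales, C=1.0):
    """Eq. (7.18): sum over bands and b of T[a_i, b] psi((b - n) / a_i) / |a_i|."""
    N = T.shape[1]
    out = np.zeros(N)
    for Ti, h, a in zip(T, bank, scales):
        # sum_b T[b] h[b - n] is a correlation; compute it as a
        # convolution of the time-reversed band signal, then reverse back.
        out += np.convolve(Ti[::-1], h)[:N][::-1] / a
    return out / C

# Example: decompose and resynthesize a windowed 1 kHz tone with 32 bands.
fs, N = 16000, 1600
t = np.arange(N) / fs
f = np.hanning(N) * np.sin(2 * np.pi * 1000.0 * t)
bank, scales = filter_bank(np.geomspace(80.0, 5000.0, 32), 80.0, fs, N)
T = forward_at(f, bank, scales)
f_rec = inverse_at(T, bank, scales)
```

With C left at 1, `f_rec` matches `f` only in shape, not in amplitude, so a scale-invariant measure such as a correlation coefficient is the natural way to compare them.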
[Figure 7.9: band outputs for center frequencies from 80 Hz to 4245.02 Hz; horizontal axis: Time (Second), 0.314 to 0.324; band axis: Frequency (Hz).]
Fig. 7.9. A section of the traveling wave generated by the auditory transform.
Just as the Fourier transform has a fast algorithm, the FFT, a fast algorithm for the auditory transform also exists. The most computationally intensive components of the forward and inverse transforms in (7.17) and (7.18), the convolutions, can be implemented with the FFT and inverse FFT. Also, depending on the application, the data resolution of the lower frequency bands can be reduced to save computation. In our recent research, the AT and inverse AT have been implemented as a real-time system for audio signal processing, speaker recognition, and speech recognition.
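The idea of replacing the time-domain convolution with FFT-based multiplication can be sketched as below. This is a generic fast-convolution illustration, not the author's real-time implementation; the function names are hypothetical, and the 1/√|a| factor follows (7.17).

```python
import numpy as np

def fft_convolve(f, h):
    """Linear convolution of two real sequences via zero-padded FFTs."""
    n_out = len(f) + len(h) - 1
    n_fft = 1 << (n_out - 1).bit_length()      # next power of two >= n_out
    Y = np.fft.rfft(f, n_fft) * np.fft.rfft(h, n_fft)
    return np.fft.irfft(Y, n_fft)[:n_out]

def forward_band_fft(f, h, a):
    """One band of Eq. (7.17), keeping the first len(f) output samples."""
    return fft_convolve(f, h)[:len(f)] / np.sqrt(abs(a))

# Check against the direct method on random data.
rng = np.random.default_rng(0)
x = rng.standard_normal(100)
g = rng.standard_normal(37)
y_fast = fft_convolve(x, g)
y_direct = np.convolve(x, g)
```

In a full filter bank, the FFT of the input would normally be computed once and reused for every band, which is where most of the saving comes from.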
7.5 Experiments and Discussions

In this section, we present experimental results that compare the AT with the FFT. We also verify the inverse AT by experiments to validate the above mathematical derivations.

7.5.1 Verifying the Inverse Auditory Transform

In addition to the theoretical proof, we also validated the AT using real speech data. Fig. 7.12 (A) shows the speech waveform of a male voice speaking the words “two, zero, five”, recorded at a sampling rate of 16 kHz. The inverse AT result is shown in Fig. 7.12 (B); it closely matches the original waveform. This result verifies the inverse AT defined in (7.11). To further verify the inverse AT, we computed the correlation coefficients, σ₁₂² [17], between the original speech signals as shown in Fig. 7.1 (top) and
[Figure 7.10: two spectrogram panels; horizontal axis: Time (Seconds), 0 to 1.6; vertical axis: Critical Band Rate (Bark), 2 to 16.]
Fig. 7.10. Spectrograms from the output of the cochlear transform for the speech data in Fig. 7.1. The spectrogram at the top is from the data recorded by the close-talking microphone, while the spectrogram at the bottom is from the hands-free microphone.

[Figure 7.11: horizontal axis: Critical Band Rate (Bark), 2 to 16; vertical axis: dB, 20 to 100.]
Fig. 7.11. The spectrum of the AT at the 1.15 second time frame from Fig. 7.10: The solid line represents the speech from a close-talking microphone. The dashed line is from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously.
the synthesized speech signals from the inverse AT. The original signals were first decomposed into different numbers of channels using the AT and then synthesized back to speech signals using the inverse AT. The more cochlear filters are used, the higher the correlation coefficient, which means that the synthesized speech has less distortion. The relation is shown in Table 7.1.
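A comparison of this kind can be sketched with the standard sample correlation coefficient. The text does not spell out the exact formula behind σ₁₂², so the following is an assumption: it returns the Pearson coefficient r, which the caller can square if the squared form is intended. The function name is hypothetical.

```python
import numpy as np

def correlation_coefficient(x, y):
    """Sample (Pearson) correlation between two equal-length signals."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return np.dot(xc, yc) / np.sqrt(np.dot(xc, xc) * np.dot(yc, yc))

# The measure ignores gain, so a rescaled copy still scores 1.
n = np.arange(1000)
s = np.sin(2 * np.pi * n / 50.0)
r_same = correlation_coefficient(s, 0.3 * s)
r_orth = correlation_coefficient(s, np.cos(2 * np.pi * n / 50.0))
```

Because the coefficient is insensitive to overall gain, it can be applied even when the constant C in the inverse transform has not been calibrated.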
[Figure 7.12: two waveform panels, (A) and (B); horizontal axis: Time (Second), 0 to 1.6.]
Fig. 7.12. Comparison of speech waveforms: (A) The original waveform of a male voice speaking the words “two, zero, five.” (B) The synthesized waveform by inverse AT with the bandwidth of 80 Hz to 5 kHz. When the filter numbers are 8, 16, 32, and 64, the correlation coefficients σ₁₂² for the two speech data sets are 0.74, 0.96, 0.99, and 0.99, respectively.

Table 7.1. Correlation Coefficients, σ₁₂², for Different Sizes of Filter Bank in AT/inverse AT

Number of Filters in AT                  8      16     32     64     128
Clean speech as in Fig. 7.1 (top)        0.74   0.96   0.99   0.99   0.99
Noisy speech as in Fig. 7.1 (bottom)     0.76   0.94   0.97   0.97   0.97
The experimental results indicate that 32 filters are good enough for most applications needing the inverse AT.

7.5.2 Applications

The auditory transform can be applied to any application where the FT or WT has been used, especially audio and speech signal processing. We briefly describe two applications.

Noise Reduction: Fig. 7.13 is an example of applying the AT to noise reduction. The original speech, as shown in Fig. 7.13 (A), was first decomposed into 32 frequency bands using (7.17). An endpoint detector was used to distinguish noise from speech [28]. A denoising function was then applied to each of the decomposed frequency bands. The function suppresses more on
[Figure 7.13: three waveform panels, (A), (B), and (C); horizontal axis: sample index, 0 to 3 × 10^4.]
Fig. 7.13. (A) and (B) are speech waveforms simultaneously recorded in a moving car. The microphones are located on the car visor (A) and the speaker’s lapel (B), respectively. (C) is the waveform in (A) after noise reduction using the AT; the result is very similar to (B).
the noise waveforms and less on the speech waveforms. The inverse transform in (7.18) is then applied to convert the processed signals back to clean speech signals, as shown in Fig. 7.13 (C), which is similar to the original close-talking data in Fig. 7.13 (B).

New feature for speaker and speech recognition: Recently, based on the AT, we developed a new feature extraction algorithm for speech signal processing, named cochlear filter cepstral coefficients (CFCC). Our experiments show that the new CFCC features are significantly more robust to noise than the traditional MFCC features in a speaker recognition task [25, 24]. The details are available in Chapter 8.
7.6 Comparisons to Other Transforms

In this section, we compare the AT with the FFT and the Gammatone filter bank.
Fig. 7.14. Comparison of FT and AT spectrograms: (A) The FFT spectrogram of a male voice “2 0 5”, warped into the Bark scale from 0 to 16.4 Barks (0 to 3500 Hz). (B) The spectrogram from the cochlear filter output for the same male voice. The AT is harmonic free and has less computational noise.
Comparison between AT and FFT

The FFT is the major tool for the time-frequency transform used in speech signal processing. We use Fig. 7.14 to illustrate the differences between the spectrograms generated from the Fourier transform and our auditory transform [22]. The original speech wave file was recorded from a male voice saying “2 0 5”. We calculated the FFT spectrogram shown in Fig. 7.14 (A) with a 30 ms Hamming window shifting every 10 ms. To facilitate the comparison, we then warped the frequency distribution from the linear scale to the Bark scale using the method in [26]. The spectrogram of the AT on the same speech data is shown in Fig. 7.14 (B). It was generated from the output of the cochlear filter bank as defined in (8.5) plus a window to compute the average densities for each band. Comparing the two spectrograms in Fig. 7.14, we observe that the spectra generated from the AT have no pitch harmonics and less computational noise, while keeping all formant information. This could be due to the variable length of the cochlear filters and the selection of parameter β in (8.5). Furthermore, we compared the spectra shown in Fig. 7.15. A male voice was recorded in a moving car using two different microphones. A close-talking microphone was placed on the speaker’s lapel, and a hands-free microphone was placed on the car visor. Fig. 7.15 is associated with a cross section
[Figure 7.15: two panels, top and bottom; horizontal axis: Critical Band Rate (Bark), 2 to 16; vertical axis: dB, 20 to 100.]
Fig. 7.15. Comparison of AT (top) and FFT (bottom) spectra at the 1.15 second time frame for robustness: The solid line represents speech from a close-talking microphone. The dashed line represents speech from a hands-free microphone mounted on the visor of a moving car. Both speech files were recorded simultaneously. The FFT spectrum shows 30 dB distortion at low-frequency bands due to background noise. Compared to the FFT spectrum, the AT spectrum has no pitch harmonics and much less distortion at low-frequency bands due to background noise.
of Fig. 7.14 at the 1.15 second mark. The solid line represents speech recorded by the close-talking microphone, while the dashed line corresponds to speech recorded by the hands-free microphone. Fig. 7.15 (top) is the spectrum from our AT [22] and Fig. 7.15 (bottom) is from the FFT. From Fig. 7.15, we can observe the following effects in the FFT spectrum, which are not as significant in the AT spectrum. First, the FFT spectra show a 30 dB distortion at low-frequency bands due to the car background noise. Second, the FFT spectra show significant pitch harmonics, which are due to the fixed length of the FFT window. Last, the noise displayed as “snow” in Fig. 7.14 (A) was generated by the FFT computation. All of these may affect the performance of a feature extraction algorithm. For robust speaker identification, we need a time-frequency transform more robust than the FFT as the foundation for feature extraction. The transform should generate less distortion from background noise and less computational noise from the selected algorithms, such as pitch harmonics, while retaining the useful information. Here, the AT provides a robust solution to replace the FFT.
[Figure 7.16: two panels, (A) and (B); horizontal axis: Frequency (Hz), 500 to 5000; vertical axis: Magnitude (dB).]
Fig. 7.16. The Gammatone filter bank: (A) The frequency responses of the Gammatone filter bank generated by (7.19). (B) The frequency responses of the Gammatone filter bank generated by (7.19) plus an equal loudness function.
Comparison between AT and Gammatone

The filter bank constructed by the AT in (7.9) is different from the Gammatone filter bank. The Gammatone function is presented as:

\[ G_{f_c}(t) = t^{N-1} \exp\left[-2\pi b(f_c) t\right] \cos(2\pi f_c t + \phi) \, u(t) \tag{7.19} \]
where N is the filter order, f_c is the filter center frequency, φ is the phase, u(t) is the unit step function, and b(f_c) = 1.019 ERB(f_c) [49]. The bandwidth is tied to f_c, as shown in Fig. 7.16, where (A) is plotted from (7.19) directly and (B) is after applying a loudness function to the filters plotted in (A) [49]. Comparing Fig. 7.7 and Fig. 7.16 and ignoring the filter gains, we find that in the AT the filter bandwidths can be easily adjusted by β, independently of f_c, while in the Gammatone function the filter bandwidth is fixed and locked to f_c, i.e., the Q-factor is fixed. The flexible filter bandwidth of the AT provides greater freedom in developing applications: for different applications, we can select different values of β for the best performance. As we will discuss in Chapter 8, a feature extraction algorithm developed based on the AT can achieve its best performance by adjusting the filter bandwidth parameter β. Compared to the FT, the AT uses real-number computations and is more robust to background noise. The frequency distribution of the AT can be in Bark, ERB, or any nonlinear scale. Also, its time-frequency resolution is
adjustable. Our experiments have shown that the AT is more robust to background noise and free of pitch harmonics. Compared to the WT, the AT filter bank is significantly closer to the impulse responses of the basilar membrane than any existing wavelet. The filter in (7.9) differs from any existing wavelet, and the frequency distribution is in an auditory-based scale. Compared to the discrete-time WT, the frequency response of the AT is not limited to the dyadic scale; it can be in any linear or nonlinear scale. The derived formula in (7.18) can also be used to compute the discrete-time inverse continuous WT.
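The bandwidth contrast between (7.9) and (7.19) can be made concrete by comparing the envelope decay rates of the two filters. In the sketch below, the ERB expression is the commonly used Glasberg and Moore formula, which is an assumption here since the text only cites [49] for b(f_c); the function names and the default filter order of 4 are likewise illustrative.

```python
import numpy as np

def erb(fc):
    """ERB in Hz (Glasberg-Moore form; assumed, not given in the text)."""
    return 24.7 * (4.37 * fc / 1000.0 + 1.0)

def gammatone(t, fc, order=4, phi=0.0):
    """Eq. (7.19) with b(fc) = 1.019 * ERB(fc); t in seconds."""
    t = np.asarray(t, dtype=float)
    tp = np.maximum(t, 0.0)
    g = tp ** (order - 1) * np.exp(-2 * np.pi * 1.019 * erb(fc) * tp) \
        * np.cos(2 * np.pi * fc * tp + phi)
    return np.where(t >= 0.0, g, 0.0)          # unit step u(t)

def gammatone_decay_rate(fc):
    """Envelope decay (1/s) of Eq. (7.19): fixed once fc is chosen."""
    return 2 * np.pi * 1.019 * erb(fc)

def at_decay_rate(fc, beta):
    """Envelope decay (1/s) of Eq. (7.9): 2*pi*f_L*beta/a = 2*pi*fc*beta."""
    return 2 * np.pi * fc * beta

# At 1 kHz the Gammatone decay is fixed; the AT decay follows beta.
rates = {beta: at_decay_rate(1000.0, beta) for beta in (0.2, 0.035)}
fixed = gammatone_decay_rate(1000.0)
```

A slower envelope decay corresponds to a narrower band, so moving β from 0.2 to 0.035 narrows the AT filter at the same center frequency, whereas the Gammatone bandwidth at that frequency cannot be changed without altering the ERB relation.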
7.7 Conclusions

A robust and invertible time-frequency transform named the auditory transform (AT) is presented in this chapter. The forward and inverse transforms were proven in theory and validated in experiments. The AT is an alternative to the FFT and WT for robust audio and speech signal processing. Inspired by the study of the human hearing system, the AT was designed to mimic the impulse responses of the basilar membrane and its nonlinear frequency distribution. The AT is well suited to decomposing input signals into frequency bands for audio and speech signal processing. As demonstrated, the AT has significant advantages due to its noise robustness and its freedom from harmonic distortion and computational noise. Also, compared to the FFT, the AT uses only real-number computation, which saves computational resources in frequency-domain signal processing. These advantages can lead to many new applications, such as robust feature extraction algorithms for speech and speaker recognition, new algorithms or devices for noise reduction and denoising, speech and music synthesis, audio coding, new hearing aids, new cochlear implants, speech enhancement, and audio signal processing. In Chapter 8, based on the AT, we will present a new feature extraction algorithm for robust speaker recognition.
References

1. Allen, J., “Cochlear modeling,” IEEE ASSP Magazine, pp. 3–29, Jan. 1985.
2. Barbour, D. L. and Wang, X., “Contrast tuning in auditory cortex,” Science, vol. 299, pp. 1073–1075, Feb. 2003.
3. Bruce, I., Sachs, M., and Young, E., “An auditory-periphery model of the effects of acoustic trauma on auditory nerve responses,” J. Acoust. Soc. Am., vol. 113, pp. 369–388, 2003.
4. Choueiter, G. F. and Glass, J. R., “An implementation of rational wavelets and filter design for phonetic classification,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 15, pp. 939–948, March 2007.
5. Daubechies, I. and Maes, S., “A nonlinear squeezing of the continuous wavelet transform based on auditory nerve models,” in Wavelets in Medicine and Biology (A. Aldroubi and M. Unser, eds.), (CRC Press), pp. 527–546, 1996.
6. Davis, S. B. and Mermelstein, P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
7. Evans, E. F., “Frequency selectivity at high signal levels of single units in cochlear nerve and cochlear nucleus,” in Psychophysics and Physiology of Hearing, pp. 195–192, 1977. Edited by E. F. Evans and J. P. Wilson. London UK: Academic Press.
8. Flanagan, J. L., Speech analysis synthesis and perception. New York: Springer-Verlag, 1972.
9. Fletcher, H., Speech and hearing in communication. Acoustical Society of America, 1995.
10. Furui, S., “Cepstral analysis techniques for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
11. Gelfand, S. A., Hearing, an introduction to psychological and physiological acoustics, 3rd edition. New York: Marcel Dekker, 1998.
12. Ghitza, O., “Auditory models and human performance in tasks related to speech coding and speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 115–132, January 1994.
13. Goldstein, J. L., “Modeling rapid waveform compression on the basilar membrane as a multiple-bandpass-nonlinear filtering,” Hearing Res., vol. 49, pp. 39–60, 1990.
14. Hermansky, H. and Morgan, N., “Rasta processing of speech,” IEEE Trans. Speech and Audio Proc., vol. 2, pp. 578–589, Oct. 1994.
15. Hohmann, V., “Frequency analysis and synthesis using a Gammatone filterbank,” Acta Acustica united with Acustica, vol. 88, pp. 433–442, 2002.
16. Johannesma, P. I. M., “The pre-response stimulus ensemble of neurons in the cochlear nucleus,” The proceeding of the symposium on hearing Theory, vol. IPO, pp. 58–69, June 1972.
17. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
18. Kates, J. M., “Accurate tuning curves in cochlea model,” IEEE Trans. on Speech and Audio Processing, vol. 1, pp. 453–462, Oct. 1993.
19. Kates, J. M., “A time-domain digital cochlea model,” IEEE Trans. on Signal Processing, vol. 39, pp. 2573–2592, December 1991.
20. Khanna, S. M. and Leonard, D. G. B., “Basilar membrane tuning in the cat cochlea,” Science, vol. 215, pp. 305–306, Jan. 1982.
21. Kiang, N. Y.-S., Discharge patterns of single fibers in the cat’s auditory nerve. MA: MIT, 1965.
22. Li, Q., “An auditory-based transform for audio signal processing,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), Oct. 2009.
23. Li, Q., “Solution for pervasive speaker recognition,” SBIR Phase I Proposal, Submitted to NSF IT.F4, Li Creative Technologies, Inc., NJ, June 2003.
24. Li, Q. and Huang, Y., “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions,” IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
25. Li, Q. and Huang, Y., “Robust speaker identification using an auditory-based feature,” in ICASSP 2010, 2010.
26. Li, Q., Soong, F. K., and Siohan, O., “An auditory system-based feature for robust speech recognition,” in Proc. 7th European Conf. on Speech Communication and Technology, (Denmark), pp. 619–622, Sept. 2001.
27. Li, Q., Soong, F. K., and Siohan, O., “A high-performance auditory feature for robust speech recognition,” in Proceedings of 6th Int’l Conf. on Spoken Language Processing, (Beijing), pp. III 51–54, Oct. 2000.
28. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., “Robust endpoint detection and energy normalization for real-time speech and speaker recognition,” IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
29. Lin, J., Ki, W.-H., Edwards, T., and Shamma, S., “Analog VLSI implementations of auditory wavelet transforms using switched-capacitor circuits,” IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 41, pp. 572–583, Sept. 1994.
30. Liu, W., Andreou, A. G., and Goldstein, M. H., Jr., “Voiced-speech representation by an analog silicon model of the auditory periphery,” IEEE Trans. on Neural Networks, vol. 3, pp. 477–487, May 1992.
31. Lyon, R. F. and Mead, C., “An analog electronic cochlea,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. 36, pp. 1119–1134, July 1988.
32. Mak, B., Tam, Y.-C., and Li, Q., “Discriminative auditory features for robust speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 12, pp. 27–36, Jan. 2004.
33. Misiti, M., Misiti, Y., Oppenheim, G., and Poggi, J.-M., Wavelet Toolbox User’s Guide. MA: MathWorks, 2006.
34. Møller, A. R., “Frequency selectivity of single auditory-nerve fibers in response to broadband noise stimuli,” J. Acoust. Soc. Am., vol. 62, pp. 135–142, July 1977.
35. Moore, B., Peters, R. W., and Glasberg, B. R., “Auditory filter shapes at low center frequencies,” J. Acoust. Soc. Am., vol. 88, pp. 132–148, July 1990.
36. Moore, B. C. J. and Glasberg, B. R., “Suggested formula for calculating auditory-filter bandwidth and excitation patterns,” J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.
37. Moore, B. C., An introduction to the psychology of hearing. NY: Academic Press, 1997.
38. Nedzelnitsky, V., “Sound pressures in the basal turn of the cat cochlea,” J. Acoust. Soc. Am., vol. 68, pp. 1676–1680, 1980.
39. Patterson, R. D., “Auditory filter shapes derived with noise stimuli,” J. Acoust. Soc. Am., vol. 59, pp. 640–654, 1976.
40. Pickles, J. O., An introduction to the physiology of hearing, 2nd Edition. New York: Academic Press, 1988.
41. Rao, R. and Bopardikar, A., Wavelet Transforms. MA: Addison-Wesley, 1998.
42. Sellami, L. and Newcomb, R. W., “A digital scattering model of the cochlea,” IEEE Trans. on Circuits and Systems I: Fundamental Theory and Applications, vol. 44, pp. 174–180, Feb. 1997.
43. Sellick, P. M., Patuzzi, R., and Johnstone, B. M., “Measurement of basilar membrane motion in the guinea pig using the Mossbauer technique,” J. Acoust. Soc. Am., vol. 72, pp. 131–141, July 1982.
44. Shaw, E. A. G., The external ear, in Handbook of Sensory Physiology. New York: Springer-Verlag, 1974. W. D. Keidel and W. D. Neff eds.
45. Teich, M. C., Heneghan, C., and Khanna, S. M., “Analysis of cellular vibrations in the living cochlea using the continuous wavelet transform and the short-time Fourier transform,” in Time frequency and wavelets in biomedical signal processing, pp. 243–269, 1998. Edited by M. Akay.
46. Torrence, C. and Compo, G. P., “A practical guide to wavelet analysis,” Bulletin of the American Meteorological Society, vol. 79, pp. 61–78, January 1998.
47. Volkmer, M., “Theoretical analysis of a time-frequency-PCNN auditory cortex model,” International J. of Neural Systems, vol. 15, pp. 339–347, 2005.
48. von Békésy, G., Experiments in hearing. New York: McGraw-Hill, 1960.
49. Wang, D. and Brown, G. J., Fundamentals of computational auditory scene analysis, in Computational Auditory Scene Analysis. Edited by D. Wang and G. J. Brown. NJ: IEEE Press, 2006.
50. Wang, K. and Shamma, S. A., “Spectral shape analysis in the central auditory system,” IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 382–395, Sept. 1995.
51. Weintraub, M., A theory and computational model of auditory monaural sound separation. PhD thesis, Stanford University, CA, August 1985.
52. Wilson, J. P. and Johnstone, J., “Basilar membrane and middle-ear vibration in guinea pig measured by capacitive probe,” J. Acoust. Soc. Am., vol. 57, pp. 705–723, 1975.
53. Wilson, J. P. and Johnstone, J., “Capacitive probe measures of basilar membrane vibrations in,” Hearing Theory, 1972.
54. Yost, W., Fundamentals of Hearing: An Introduction, 3rd Edition. New York: Academic Press, 1994.
55. Zhou, B., “Auditory filter shapes at high frequencies,” J. Acoust. Soc. Am., vol. 98, pp. 1935–1942, October 1995.
56. Zilany, M. and Bruce, I., “Modeling auditory-nerve response for high sound pressure levels in the normal and impaired auditory periphery,” J. Acoust. Soc. Am., vol. 120, pp. 1447–1466, Sept. 2006.
57. Zweig, G., Lipes, R., and Pierce, J. R., “The cochlear compromise,” J. Acoust. Soc. Am., vol. 59, pp. 975–982, April 1976.
58. Zwicker, E. and Terhardt, E., “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” J. Acoust. Soc. Am., vol. 68, pp. 1523–1525, 1980.
Chapter 8 Auditory-Based Feature Extraction and Robust Speaker Identification
In the previous chapter, we introduced a robust auditory transform (AT). In this chapter, we present an auditory-based feature extraction algorithm based on the AT and apply it to robust speaker identification. Usually, the performance of acoustic models trained on clean speech drops significantly when they are tested on noisy speech. The presented features, however, have shown strong robustness in this kind of situation. We present a typical text-independent speaker identification system in the experiment section. Under all three mismatched testing conditions, with white noise, car noise, or babble noise, the auditory features consistently perform better than the baseline mel frequency cepstral coefficient (MFCC) features. The auditory features are also compared with perceptual linear predictive (PLP) and RASTA-PLP features. The auditory features consistently perform much better than PLP. Under white noise, the auditory features are much better than RASTA-PLP. Under car and babble noise, the performances are similar.
8.1 Introduction

In automatic speaker authentication, feature extraction is the first crucial component. To ensure the performance of a speaker authentication system, successful front-end features should carry enough discriminative information for classification or recognition, fit well with the back-end modeling, and be robust with respect to changes in the acoustic environment. After decades of research and development, speaker authentication under various operating modes is still a challenging problem, especially when acoustic training and testing environments are mismatched. Since the human hearing system is robust to mismatched conditions, we developed an auditory-based feature extraction algorithm based on the auditory transform introduced in Chapter 7. We name the new features cochlear filter cepstral coefficients (CFCC). The auditory-based feature extraction algorithm was originally proposed by the author and Huang in [11, 10].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_8, © Springer-Verlag Berlin Heidelberg 2012
Speaker authentication uses the same features as speech recognition. In general, most speech feature extraction methods fall into one of two categories: modeling the human voice production system or modeling the peripheral auditory system.

For the first approach, one of the most popular features is a group of cepstral coefficients derived from linear prediction, known as the linear prediction cepstral coefficients (LPCC) [3, 14]. LPCC feature extraction uses an all-pole filter to model the human vocal tract, with the speech formants captured by the poles of the filter. Narrow-band (e.g., up to 4 kHz) LPCC features work well in a clean environment. However, in our previous experiments, the linear predictive spectral envelope showed large spectral distortion in noisy environments [13, 12], which results in significant performance degradation.

For the second approach, there are two groups of features, based on either Fourier transforms (FT) or auditory-based transforms. The representative of the first group is the MFCC (Mel frequency cepstral coefficients), where the fast Fourier transform (FFT) is applied to generate the spectrum in the linear scale, and then a bank of band-pass filters is placed along a Mel frequency scale on top of the FFT output [4]. Alternatively, the FFT output is warped to a Mel or Bark scale and then a bank of band-pass filters is placed linearly on top of the warped FFT output [13, 12]. The algorithm presented in this chapter belongs to the second group, where the auditory-based transform (AT) is defined as an invertible time-frequency transform to replace the FFT. The AT output can be in any kind of frequency scale (e.g., linear, Bark, or ERB). Therefore, there is no need to place the band-pass filters on a Mel scale as in the MFCC or to warp the frequency distribution as in [13, 12]. In MFCC, two steps are needed to obtain the filter bank output: the FFT and then the Mel filter bank. With the AT, only one step is needed: the outputs of the auditory filters are the filter bank outputs, and no FFT is required.

MFCC features [4] in the first group are among the most popular features for speech and speaker recognition. Like the LPCC features, the MFCC features perform well in clean environments but not in adverse environments or under mismatched training and testing conditions. Perceptual linear predictive (PLP) analysis is another peripheral auditory-based approach. Based on the FFT output, it uses several perceptually motivated transforms, including Bark frequency, equal-loudness pre-emphasis, and a masking curve [6]. The relative spectra technique, known as RASTA, was further developed to filter the time trajectory in order to suppress constant factors in the spectral component [7]. It is often cascaded with PLP feature extraction to form the RASTA-PLP features. Comparisons between MFCC and RASTA-PLP have been reported in [5]. Further experimental comparisons among the PLP, RASTA-PLP, and CFCC features are given at the end of this chapter.

Both MFCC and RASTA-PLP features are based on the FFT. As mentioned above, the FFT has a fixed time-frequency resolution and a well-defined inverse transform. Fast algorithms exist for both the forward transform and
the inverse transform. Despite its simplicity and efficient computation algorithms, we believe that, when applied to speech processing, the time-frequency decomposition mechanism of the FT differs from the mechanism in the hearing system. First, it uses fixed-length windows, which generate pitch harmonics across the entire speech band. Second, its individual frequency bands are distributed linearly, which is different from the distribution in the human cochlea. Further warping is needed to convert to the Bark, Mel, or other scales. Finally, in our recent study [9, 8], as shown in Chapter 7, we found that the FFT spectrogram has more noise distortion and computation noise than an auditory-based transform which we recently developed. Thus, it is promising to develop a new feature extraction algorithm based on the new auditory-based time-frequency transform [8] to replace the FFT in speech feature extraction. Based on the study of the human hearing system, the author proposed an auditory-based time-frequency transform (AT) [9, 8], described in Chapter 7. The new transform consists of a forward transform and an inverse transform. Through the forward transform, the speech signal can be decomposed into a number of frequency bands using a bank of cochlear filters. The frequency distribution of the cochlear filters is similar to the distribution in the cochlea, and the impulse response of the filters is similar to that of the traveling wave. Through the inverse transform, the original speech signal can be reconstructed from the decomposed band-pass signals. The AT has been proven in theory and validated in experiments [8], as presented in detail in Chapter 7. Compared to the FFT, the AT has flexible time-frequency resolution and its frequency distribution can take on any linear or nonlinear form.
Therefore, it is easy to define the distribution to be similar to that of the Bark, Mel, or ERB scale, which is similar to the frequency distribution function of the basilar membrane. Most importantly, the AT has significant advantages in noise robustness and can be free of pitch harmonic distortion, as plotted in [8] and Fig. 7.14. Therefore, the AT provides a new platform for feature extraction research and forms the foundation for our robust feature extraction. The ultimate goal of this study is to develop a practical speech front-end feature extraction algorithm that conceptually emulates the human peripheral hearing system and thus achieves superior noise robustness under mismatched training and testing conditions. The remainder of the chapter is organized as follows: Section 8.2 presents the auditory feature extraction algorithm and provides an analytic study and discussion, and Section 8.3 studies the feature parameters using a development dataset and presents the experimental results of the CFCC in comparison to other front-end features on a testing dataset.
8 Auditory-Based Feature Extraction and Robust Speaker Identification
8.2 Auditory-Based Feature Extraction Algorithm
In this section, we will describe the structure of the auditory-based feature extraction algorithm and provide details of its computation. The human hearing system is very complex. Although we would like to emulate the human peripheral hearing system in detail, our computers may not be able to meet the requirements of real-time applications; therefore, we will simulate only the most important features of the human peripheral hearing system. An illustrative block diagram of the algorithm is shown in Fig. 8.1. The algorithm is intended to conceptually replicate the hearing system at a high level and consists of the following modules: cochlear filter bank, hair-cell function with windowing, cubic-root nonlinearity, and discrete cosine transform (DCT). A detailed description of each module follows.
Fig. 8.1. Schematic diagram of the auditory-based feature extraction algorithm named cochlear filter cepstral coefficients (CFCC).
8.2.1 Forward Auditory Transform and Cochlea Filter Bank
The cochlear filter bank in Fig. 8.1 is an implementation of the invertible auditory transform (AT) as defined and described in Chapter 7. For feature extraction, we only use the forward transform which is the foundation of the auditory-based feature. The forward transform models the traveling wave in the cochlea where the sound waveform is decomposed into a set of subband signals. Let $f(t)$ be speech signals. A transform of $f(t)$ with respect to a cochlear filter $\psi(t)$, representing the basilar membrane (BM) impulse response in the cochlea, is defined as:
$$T(a, b) = \int_{-\infty}^{\infty} f(t)\, \frac{1}{\sqrt{|a|}}\, \psi\left(\frac{b - t}{a}\right) dt, \tag{8.1}$$
where $a$ and $b$ are real, both $f(t)$ and $\psi(t)$ belong to $L^2(R)$, and $T(a, b)$, representing the traveling waves in the BM, is the decomposed signal and filter output. The above equation can also be written as:
$$T(a, b) = \int f(t) * \psi_{a,b}(t)\, dt, \tag{8.2}$$
where
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\left(\frac{t - b}{a}\right). \tag{8.3}$$
Like in the wavelet transform, factor $a$ is a scale or dilation variable. By changing $a$, we can shift the central frequency of an impulse response function to receive a band of decomposed signals. Factor $b$ is a time shift or translation variable. For a given value of $a$, factor $b$ shifts the function $\psi_{a,0}(t)$ by an amount $b$ along the time axis. Note that $1/\sqrt{|a|}$ is an energy normalizing factor. It ensures that the energy stays the same for all $a$ and $b$; therefore, we have:
$$\int_{-\infty}^{\infty} |\psi_{a,b}(t)|^2\, dt = \int_{-\infty}^{\infty} |\psi(t)|^2\, dt. \tag{8.4}$$
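As a brief check of (8.4), the following sketch substitutes $x = (t - b)/a$, so that $dt = |a|\,dx$ once the integration limits are put back in increasing order. It assumes $a \neq 0$ and the $1/\sqrt{|a|}$ normalization of (8.3); the variable $x$ is introduced here only for illustration.

```latex
\begin{align*}
\int_{-\infty}^{\infty} |\psi_{a,b}(t)|^2 \, dt
  &= \frac{1}{|a|} \int_{-\infty}^{\infty}
     \left| \psi\!\left(\frac{t-b}{a}\right) \right|^2 dt \\
  &= \frac{1}{|a|} \int_{-\infty}^{\infty} |\psi(x)|^2 \, |a| \, dx
   = \int_{-\infty}^{\infty} |\psi(x)|^2 \, dx .
\end{align*}
```

The factor $|a|$ from the change of variable cancels the $1/|a|$ that comes from squaring the normalizing factor, which is why the energy does not depend on $a$ or $b$.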
The cochlear filter, as the most important part of the transform, is defined by the author as:
$$\psi_{a,b}(t) = \frac{1}{\sqrt{|a|}}\, \psi\left(\frac{t-b}{a}\right) = \frac{1}{\sqrt{|a|}} \left(\frac{t-b}{a}\right)^{\alpha} \exp\left[-2\pi f_L \beta \left(\frac{t-b}{a}\right)\right] \cos\left[2\pi f_L \left(\frac{t-b}{a}\right) + \theta\right] u(t-b), \tag{8.5}$$
where $\alpha > 0$ and $\beta > 0$, and $u(t)$ is the unit step function, i.e., $u(t) = 1$ for $t \ge 0$ and 0 otherwise. Parameters $\alpha$ and $\beta$ determine the shape and width of the cochlear filter in the frequency domain. They can be empirically optimized as shown in our experiments in Section 8.3. The value of $\theta$ should be selected such that (8.6) is satisfied:
$$\int_{-\infty}^{\infty} \psi(t)\, dt = 0. \tag{8.6}$$
This is required by the transform theory to ensure no information is lost during the transform [8]. The value of $a$ can be determined by the current filter central frequency, $f_c$, and the lowest central frequency, $f_L$, in the cochlear filter bank:
$$a = f_L / f_c. \tag{8.7}$$
Since we contract $\psi_{a,b}(t)$ with the lowest frequency along the time axis, the value of $a$ lies in $0 < a \le 1$. If we stretch $\psi$ instead, $a > 1$. The frequency distribution of the cochlear filters can follow a linear or nonlinear scale such as ERB (equivalent rectangular bandwidth) [15], Bark [21], Mel [4], or log. For a particular band number $i$, the corresponding value of $a$ is denoted $a_i$, which needs to be pre-calculated from the required central frequency of the cochlear filter at band number $i$.
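The filter definition in (8.5) and the scale in (8.7) can be sketched in a few lines of Python. This is a minimal illustration rather than the implementation used in our experiments: the function names, the default values of `alpha`, `beta`, and `theta`, and the sampling helper are assumptions made for this example, and `theta` is not tuned here to satisfy (8.6).

```python
import math


def scale_factor(fc, fL):
    """Scale a = fL / fc from (8.7); fc and fL are in Hz and positive."""
    return fL / fc


def cochlear_filter(t, fc, fL, alpha=3.0, beta=0.2, theta=0.0, b=0.0):
    """Evaluate psi_{a,b}(t) of (8.5) at time t (seconds).

    alpha, beta, and theta are illustrative defaults (assumptions);
    theta is NOT selected to satisfy the zero-mean condition (8.6).
    """
    if t < b:                       # unit step u(t - b)
        return 0.0
    a = scale_factor(fc, fL)
    x = (t - b) / a                 # dilated and shifted time
    envelope = (x ** alpha) * math.exp(-2.0 * math.pi * fL * beta * x)
    carrier = math.cos(2.0 * math.pi * fL * x + theta)
    return envelope * carrier / math.sqrt(abs(a))


def impulse_response(fc, fL, fs, duration, **params):
    """Sample the filter at rate fs (Hz) for `duration` seconds."""
    n = int(round(duration * fs))
    return [cochlear_filter(k / fs, fc, fL, **params) for k in range(n)]
```

Because $x = (t - b)/a$ and $a = f_L/f_c$, the cosine term oscillates at $f_c$, so changing $a$ moves the central frequency of the band as described above.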
8.2.2 Cochlear Filter Cepstral Coefficients (CFCC)
The cochlear filter bank is intended to emulate the impulse response in the cochlea. However, there are other operations in the ear. The inner hair cells act as a transducer that converts mechanical movements of the BM into neural activities. When the BM moves up and down, a shearing motion is created between the BM and the tectorial membrane [16]. This causes the displacement of the uppermost hair cells, which generates neural signals; however, the hair cells only generate neural signals in one direction of the BM movement. When the BM moves in the opposite direction, there is neither excitation nor neuron output. We studied different implementations of the hair cell function. The following function of the hair cell output provides the best performance in our evaluated task:
$$h(a, b) = T(a, b)^2; \quad \forall\, T(a, b), \tag{8.8}$$
where $T(a, b)$ is the filter-bank output. Here, all other detailed functions in the outer ear, the middle ear, and the feedback control of the auditory system and brain to the cochlea are either ignored or assumed to be included in the auditory filter responses. In the next step, the hair cell output for each band is converted into a representation of nerve spike count density in a duration associated with the current band central frequency. We use the following equation to mimic the concept:
$$S(i, j) = \frac{1}{d} \sum_{b=\ell}^{\ell+d-1} h(i, b), \quad \ell = 1, L, 2L, \cdots; \ \forall\, i, j, \tag{8.9}$$
where $d = \max\{3.5\tau_i, 20\ \text{ms}\}$ is the window length, $\tau_i$ is the period of the $i$th band, and $L = 10$ ms is the window shift duration. We set the computations and the parameters empirically, and they may need to be adjusted for different datasets. Instead of using a fixed-length window as in computing FFT spectrograms, we use a variable-length window for different frequency bands. The higher the frequency, the shorter the window. This prevents the high-frequency information from being smoothed out by a long window duration. The output of the above equation and the spectrogram of the cochlear filter bank can be used for both feature extraction and analysis. Furthermore, we apply the scales of loudness function suggested by Stevens [19, 20] to the hair cell output as:
$$y(i, j) = S(i, j)^{1/3}. \tag{8.10}$$
In the last step, the discrete cosine transform (DCT) is applied to decorrelate the feature dimensions and generate the cochlear filter cepstral coefficients (CFCC), so the features can work with the existing back-end.
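The steps in (8.8) to (8.10), followed by the DCT, can be sketched as below. This is a hedged illustration under stated assumptions, not our production code: the band signals are assumed to be equal-length lists already produced by a cochlear filter bank, frames start at sample 0 rather than 1, the number of frames is set by the longest window so that all bands share the same frame index, and a plain unnormalized DCT-II is used because the normalization is not specified here. The name `cfcc_from_bands` is hypothetical.

```python
import math


def cfcc_from_bands(bands, fcs, fs, shift_s=0.010, n_ceps=None):
    """Turn filter-bank outputs into cepstral features.

    bands : list of equal-length sample lists, one per band (T(a_i, b))
    fcs   : central frequency in Hz of each band
    fs    : sampling rate in Hz
    """
    shift = int(round(shift_s * fs))                      # L = 10 ms
    # d = max{3.5 * tau_i, 20 ms}, with tau_i = 1 / fc_i, as in (8.9)
    lengths = [int(round(max(3.5 / fc, 0.020) * fs)) for fc in fcs]
    n_samples = len(bands[0])
    n_frames = (n_samples - max(lengths)) // shift + 1    # assumption
    if n_frames < 1:
        raise ValueError("signal shorter than the longest window")
    n_bands = len(bands)
    n_ceps = n_bands if n_ceps is None else n_ceps

    features = []
    for j in range(n_frames):
        start = j * shift
        y = []
        for band, d in zip(bands, lengths):
            # hair-cell square (8.8) averaged over the window (8.9)
            s = sum(v * v for v in band[start:start + d]) / d
            y.append(s ** (1.0 / 3.0))                    # cubic root (8.10)
        # unnormalized DCT-II across bands to decorrelate the dimensions
        ceps = [
            sum(y[i] * math.cos(math.pi * k * (i + 0.5) / n_bands)
                for i in range(n_bands))
            for k in range(n_ceps)
        ]
        features.append(ceps)
    return features
```

The sketch returns every DCT coefficient, including the 0th; which coefficients to keep is left to the caller.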
8.2.3 Analysis and Comparison
The FFT is the major tool for the time-frequency transform used in speech signal processing, and most features were developed based on it. A comparison between the AT used in CFCC and the FFT used in MFCC, PLP, and other features is provided in Chapter 7; the reader can refer to that chapter for details. We now focus our comparison on the feature level. The analysis and discussion are intended to help the reader understand the CFCCs. Further comparisons in experiments will be made in the next section.
Comparison between CFCC and MFCC
Since the MFCC features are popular in both speaker and speech recognition, we compare the CFCCs with the MFCCs in this section. The MFCC features use the FFT to convert the time-domain speech signal to the frequency-domain spectrum, represented by complex numbers, and then apply triangular filters on top of the spectrum. The triangular filters are distributed on the Mel scale. The CFCC features use a bank of cochlear filters to decompose the speech signal into multiple bands. The frequency response of a cochlear filter has a bell-like shape rather than a triangular shape. The shape and bandwidth of the filter in the frequency domain can be adjusted by parameters α and β in (8.5). In each of the bands, the decomposed signal is still a time-domain signal, represented by real numbers. The central frequencies of the cochlear filters can follow any distribution, including Mel, ERB, Bark, and log. When using the FFT to compute a spectrogram, the window size must be the same for all frequency bands because the FFT uses a fixed number of points. When we compute a spectrogram from the decomposed signals generated by the cochlear filters, the window size can be different for different frequency bands. For example, we use a longer window for a lower frequency band to average out the background noise and a shorter window for a higher frequency band to protect high-frequency information.
Furthermore, the MFCCs use a logarithm as the nonlinearity while the CFCCs use a cubic root. In addition, linear and/or nonlinear operations can be applied to the decomposed multiband signals to mimic signal processing in the peripheral hearing system or to tune for the best performance in a particular application. The operations in this section are just one of the feasible configurations. To achieve the best performance, different applications may require different configurations or adaptations. To this end, the auditory transform can be considered a new platform for future feature extraction research. While the MFCC has been used for several decades, the CFCC is new, and further improvement may be required to finalize all the details. We introduce the CFCC as a platform with the hope that the CFCC features will be further improved through research on various tasks and databases.
Comparison between CFCCs and Gammatone-Based Features
There are Gammatone-based features named Gammatone frequency cepstral coefficients (GFCC) [18], which are also auditory-based speech features. In our implementation, exactly following the description in [18] did not give us reasonable experimental results; therefore, we replaced the “downsampling” procedure in [18] by computing an average of the absolute values of the Gammatone filter-bank output using a 20 ms window shifted every 10 ms, followed by a cubic root function and DCT. This procedure gave us the best results in our experiments, but because the resulting features differ from the original GFCCs, we have named them the modified GFCC (MGFCC) features. Comparisons in experiments are available in the next section.
8.3 Speaker Identification and Experimental Evaluation
In this section, we use a closed-set, text-independent speaker identification task to evaluate the new auditory feature extraction algorithm. In a training session, after feature extraction, a set of Gaussian mixture models (GMMs) is trained. Each model is associated with one speaker. In a testing session, given a testing utterance, the log-likelihood scores are calculated for each of the GMMs. The GMM with the largest score is selected and the associated speaker is the estimated speaker of the given utterance. Since we are addressing the robustness problem, the system with the CFCC front-end and GMM back-end (CFCC/GMM) was evaluated in a task where the acoustic conditions of training and testing are mismatched, i.e., the training data set was recorded under clean conditions while the testing data sets were mixed with different types of background noise at various noise levels.
8.3.1 Experimental Datasets
The Speech Separation Challenge database contains speech recorded from a closed set of 34 speakers (18 male and 16 female speakers). All speech files are single-channel data sampled at 25 kHz and all material is end-pointed (i.e., there is little or no initial or final silence) [1]. The training data was recorded under clean conditions. The testing sets were obtained by mixing clean testing utterances with white noise at different SNR levels; in total there are five testing conditions provided in the database (i.e., noisy speech at -12 dB, -6 dB, 0 dB, and 6 dB SNR, and clean speech). We find this database ideal for the study of noise robustness when training and testing conditions do not match. In particular, since all the noisy testing data is generated from the same speech with only the noise level changing, performance fluctuations due to variations other than noise types and mixing levels are largely reduced.
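Mixing a clean utterance with noise at a target SNR can be sketched as follows. This is a generic illustration that defines SNR as the ratio of average powers over the whole utterance; how the database itself measured signal and noise levels is not described here, so that definition, and the names `mean_power` and `mix_at_snr`, are assumptions.

```python
import math


def mean_power(x):
    """Average power of a sample sequence."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` and add it to `speech` so that the mixture has the
    requested SNR in dB (average-power definition, an assumption)."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    # Solve 10*log10(P_speech / (gain^2 * P_noise)) = snr_db for gain.
    gain = math.sqrt(mean_power(speech) /
                     (mean_power(noise) * 10.0 ** (snr_db / 10.0)))
    return [s + gain * n for s, n in zip(speech, noise)]
```

Lowering `snr_db` only raises the noise gain while the speech samples stay unchanged, which mirrors the property noted above that the same speech appears at every noise level.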
Table 8.1. Summary of the training, development, and testing sets.
Data Set    # of Spks.    # of Utters / Spk.    Dur. (sec) / Spk.
Training    34            20                    36.8
Develop.    34            10                    18.3
Testing     34            10 ∼ 20               29.6
As shown in Table 8.1, the Speech Separation Challenge database was partitioned into three disjoint subsets, the training, development, and testing sets, with no overlap among them. In our experiments, speaker models were first trained using the clean training set and then tested on noisy speech at four SNR levels. Each set has 34 speakers. The training set has 20 utterances per speaker and 680 utterances in total. The average duration of training data per speaker is 36.8 seconds of speech. The development set has 1700 utterances in total. There are five testing conditions (i.e., noisy speech at -12 dB, -6 dB, 0 dB, and 6 dB SNR, and clean speech). Each condition has 10 utterances per speaker. The average duration of each utterance is 1.8 seconds. The development set contains only white noise. The testing set has the same five testing conditions. Each condition has 10 to 20 utterances per speaker. The duration of each testing utterance is about 2 to 3 seconds of speech. The testing set has about 2500 utterances for each noise type. For the three types of noise (white, car, and babble), we have about 7500 utterances in total for testing. Note that the training set consists of only clean speech, while both the development set and the testing set consist of clean speech and noisy speech at the four SNR levels listed above. We mainly focused on the 0 dB and 6 dB conditions in our feature analysis and comparisons because, when conditions are under -6 dB, the performance of all features is close to random chance. In addition to the white noise testing conditions provided in the Speech Separation Challenge database, we also generated two more sets of testing conditions with car noise or babble noise at -6 dB, 0 dB, and 6 dB SNR. The car noise and babble noise were recorded under real-world conditions and mixed with the clean test speech from the database.
These test sets were used as additional material to further test the robustness of the auditory features. The sizes of the testing sets with different types of noise are the same.
8.3.2 The Baseline Speaker Identification System
Our baseline system uses the standard MFCC front-end features and Gaussian mixture models (GMMs). Twenty-dimensional MFCC features (c1 ∼ c20) were extracted from the speech audio based on a 25 ms window with a frame shift of 10 ms; the frequency analysis range was set to 50 Hz ∼ 8000 Hz. Note that the delta and double delta of the MFCCs were not used here since they were not found to be helpful in discerning between speakers in
our experiments. We also found that cepstral mean subtraction was not helpful; therefore, it was not used in our baseline system. The back-end of the baseline system consists of standard GMMs trained using maximum likelihood estimation (MLE) [17]. Let $M_i$ represent the GMM for the $i$-th speaker, where $i$ is the speaker index. During testing, a testing utterance $u$ is matched against all hypothesized speaker models $M_i$, and the speaker identification decision $J$ is made by:
$$J = \arg\max_{i} \sum_{k} \log p(u_k | M_i), \tag{8.11}$$
where $u_k$ is the $k$-th frame of utterance $u$ and $p(\cdot | M_i)$ is the probability density function. Thirty-two Gaussian mixture components were used in each speaker GMM. To obtain a fair comparison of the different front-end features, only the front-end feature extraction was varied; the configuration of the back-end remained the same in all the experiments throughout this chapter. When we developed the CFCC features, we analyzed the effects of the following items on speaker ID performance using a development dataset: filter bandwidth (β), equal-loudness function, various windowing schemes, and nonlinearity. The results were reported in [10]. Here, we only introduce the parameters and operations which provide the best results. Based on our analytic study, the CFCC feature extraction can be summarized as follows: First, the speech audio file is passed through the band-pass filter bank. The filter width parameter β was set to 0.035. The Bark scale is used for the filter bank distribution and equal-loudness weighting is applied at different frequency bands. Second, the traveling waves generated from the cochlear filters are windowed and averaged by the hair cell function. The window length is 3.5 epochs of the band central frequency or 20 ms, whichever is the shortest. Third, a cubic root is applied. Finally, since most back-end systems adopt diagonal Gaussians, the discrete cosine transform (DCT) is used to decorrelate the features. The 0th component, corresponding to the energy, is removed from the DCT output for the speaker ID task, although it is needed for speech recognition. Table 8.2 shows a comparison of the speaker identification accuracy of the optimized CFCC features with the MGFCCs and MFCCs tested on the development set.
Table 8.2. Comparison of MFCC, MGFCC, and CFCC features tested on the development set.
Testing SNR    MFCC     MGFCC    CFCC
-6 dB          6.8%     9.1%     12.6%
0 dB           15.9%    45.0%    57.9%
6 dB           42.1%    88.8%    90.3%
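The decision rule in (8.11) can be sketched with diagonal-covariance GMMs as follows. This is a minimal illustration, not the trained back-end used in the experiments: representing each model as a `(weights, means, variances)` tuple, the function names, and the use of the log-sum-exp trick for numerical stability are assumptions made for this example.

```python
import math


def gmm_log_pdf(x, weights, means, variances):
    """log p(x | M) for a diagonal-covariance GMM.

    weights   : mixture weights, one per component
    means     : per-component mean vectors
    variances : per-component diagonal variances
    """
    logs = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xd, m, v in zip(x, mu, var):
            ll += -0.5 * (math.log(2.0 * math.pi * v) + (xd - m) ** 2 / v)
        logs.append(ll)
    top = max(logs)                       # log-sum-exp over components
    return top + math.log(sum(math.exp(l - top) for l in logs))


def identify(frames, models):
    """Return (J, scores): J maximizes the summed frame log-likelihood."""
    scores = [sum(gmm_log_pdf(x, *model) for x in frames)
              for model in models]
    best = max(range(len(models)), key=scores.__getitem__)
    return best, scores
```

Summing log-likelihoods over frames treats the frames of an utterance as independent given the speaker model, which is the assumption implicit in (8.11).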
8.3.3 Experiments
Using the optimized CFCC feature parameters selected from the development set, we conducted speaker identification experiments on the testing set, with the results depicted in Fig. 8.2. As we can see from Fig. 8.2, in clean testing conditions, the CFCC features generate results comparable to the MFCCs, with over 96% accuracy. As white noise is added to the clean testing data at increasing intensity, the performance of the CFCCs is significantly better than that of both the MGFCCs and MFCCs. For example, when the SNR of the testing condition drops to 6 dB, the accuracy of the MFCC system drops to 41.2%. In comparison, the parallel system using the CFCC features still achieves 88.3% accuracy, more than twice that of the MFCC features. Similarly, the MGFCC features have an accuracy of 85.1%, which is better than the MFCC features but not as good as the CFCC features. The CFCC performance on the testing set is similar to its performance on the development set. Overall, we see that the CFCC features significantly outperform both the widely used MFCC features and the related auditory-based MGFCC features in this speaker identification task.
Fig. 8.2. Comparison of MFCC, MGFCC, and the CFCC features tested on noisy speech with white noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
To further test the noise robustness of the CFCCs, we conducted more experiments on noisy speech data with two kinds of real-world noise (car noise and babble noise) as described in Section 8.3.1 using the same experimental setup. Figure 8.3 and Figure 8.4 present the experimental results for the car noise and the babble noise at -6 dB, 0 dB and 6 dB levels, respectively. The auditory features consistently outperform the baseline MFCC system and the MGFCC system under both real-world car noise and babble noise testing conditions.
Fig. 8.3. Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with car noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
Fig. 8.4. Comparison of MFCC, MGFCC, and CFCC features tested on noisy speech with babble noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
8.3.4 Further Comparison with PLP and RASTA-PLP
We conducted further experiments with PLP and RASTA-PLP features using the same experimental setup as described before. The comparative results on white noise, car noise, and babble noise are depicted in Fig. 8.5, Fig. 8.6, and Fig. 8.7, respectively. We are not surprised to observe that the CFCC features outperform the PLP features in all three testing conditions. The PLP features minimize the differences between speakers while preserving important speech information via the spectral warping technique [6]; as a consequence, they have never been chosen as preferred features for speaker recognition. It is interesting to observe that the CFCCs perform significantly better than RASTA-PLP on white noise testing conditions at all levels; however, for car noise and babble noise, the performance of the CFCCs and RASTA-PLP is fairly close. As previously stated, RASTA is a technique that uses a band-pass filter to smooth out the variations of short-term noise and remove the constant offset due to static spectral coloration in the speech channel [7]. It is typically used in combination with PLP, which is referred to as RASTA-PLP [2]. Our experiments show that RASTA filtering largely improves the performance of PLP features in speaker identification under mismatched training and testing conditions. It is particularly helpful when tested under car noise and babble noise, but it is not as effective for flatly distributed white noise. In comparison, the CFCCs consistently generate superior performance in all three conditions.
Fig. 8.5. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with white noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
Fig. 8.6. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with car noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
8.4 Conclusions
In this chapter, we presented a new auditory-based feature extraction algorithm named CFCC and applied it to robust speaker identification in mismatched conditions. Our research was motivated by studies of the signal processing functions in the human peripheral auditory system. The CFCC features are based on the flexible time-frequency transform (AT) presented in the previous chapter in combination with several components to emulate the human peripheral hearing system. The analytic study for feature optimization was conducted on a separate development set. The optimized CFCC features were then tested under a variety of mismatched testing conditions, which included white noise, car noise, and babble noise. Our experiments in speaker identification tasks show that under mismatched conditions, the new CFCCs consistently perform better than both the MFCC and MGFCC features. Further comparison with PLP and RASTA-PLP features shows that although RASTA-PLP can generate comparable results when tested on car noise or babble noise, it does not perform as well when tested on flatly distributed white noise. In comparison, the CFCCs generate superior results under all three noise conditions. The presented feature can be applied to other speaker authentication tasks. For the best performance, some of the parameters, such as β in the auditory transform, can be adjusted. Also, for a different sampling rate, the auditory filter distribution should be adjusted accordingly.
Fig. 8.7. Comparison of PLP, RASTA-PLP, and the CFCC features tested on noisy speech with babble noise. (Vertical axis: accuracy; horizontal axis: test conditions at −6 dB, 0 dB, 6 dB, and clean.)
We note that this chapter is just an example of using the auditory transform as the platform for feature extraction. Different versions of CFCC can be developed for different tasks and noise environments. The presented parameters and operations may not be the best for all applications. The reader can modify and tune the configurations to achieve the best results on specific tasks. Currently, the author is extending the auditory feature extraction algorithm to speech recognition.
References
1. http://www.dcs.shef.ac.uk/~martin/SpeechSeparationChallenge/.
2. http://www.icsi.berkeley.edu/~dpwe/projects/sprach/sprachcore.html.
3. Atal, B. S., “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
4. Davis, S. B. and Mermelstein, P., “Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences,” IEEE Trans. on Acoustics, Speech, and Signal Processing, vol. ASSP-28, pp. 357–366, August 1980.
5. Grimaldi, M. and Cummins, F., “Speaker identification using instantaneous frequencies,” IEEE Trans. on Audio, Speech, and Language Processing, vol. 16, pp. 1097–1111, August 2008.
6. Hermansky, H., “Perceptual linear predictive (PLP) analysis of speech,” J. Acoust. Soc. Am., vol. 87, pp. 1738–1752, 1990.
7. Hermansky, H. and Morgan, N., “RASTA processing of speech,” IEEE Trans. Speech and Audio Proc., vol. 2, pp. 578–589, Oct. 1994.
8. Li, Q., “An auditory-based transform for audio signal processing,” in Proceedings of IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, (New Paltz, NY), Oct. 2009.
9. Li, Q., “Solution for pervasive speaker recognition,” SBIR Phase I Proposal, Submitted to NSF IT.F4, Li Creative Technologies, Inc., NJ, June 2003.
10. Li, Q. and Huang, Y., “An auditory-based feature extraction algorithm for robust speaker identification under mismatched conditions,” IEEE Trans. on Audio, Speech and Language Processing, Sept. 2011.
11. Li, Q. and Huang, Y., “Robust speaker identification using an auditory-based feature,” in ICASSP 2010, 2010.
12. Li, Q., Soong, F. K., and Siohan, O., “An auditory system-based feature for robust speech recognition,” in Proc. 7th European Conf. on Speech Communication and Technology, (Denmark), pp. 619–622, Sept. 2001.
13. Li, Q., Soong, F. K., and Siohan, O., “A high-performance auditory feature for robust speech recognition,” in Proceedings of 6th Int’l Conf. on Spoken Language Processing, (Beijing), pp. III 51–54, Oct. 2000.
14. Makhoul, J., “Linear prediction: a tutorial review,” Proceedings of the IEEE, vol. 63, pp. 561–580, April 1975.
15. Moore, B. C. J. and Glasberg, B. R., “Suggested formula for calculating auditory-filter bandwidth and excitation patterns,” J. Acoust. Soc. Am., vol. 74, pp. 750–753, 1983.
16. Moore, B. C., An introduction to the psychology of hearing. NY: Academic Press, 1997.
17. Reynolds, D. and Rose, R. C., “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.
18. Shao, Y. and Wang, D., “Robust speaker identification using auditory features and computational auditory scene analysis,” in Proceedings of IEEE ICASSP, pp. 1589–1592, 2008.
19. Stevens, S. S., “On the psychophysical law,” Psychol. Rev., vol. 64, pp. 153–181, 1957.
20. Stevens, S. S., “Perceived level of noise by Mark VII and decibels (E),” J. Acoust. Soc. Am., vol. 51, pp. 575–601, 1972.
21. Zwicker, E. and Terhardt, E., “Analytical expressions for critical-band rate and critical bandwidth as a function of frequency,” J. Acoust. Soc. Am., vol. 68, pp. 1523–1525, 1980.
Chapter 9 Fixed-Phrase Speaker Verification
As we introduced in Chapter 1, speaker recognition includes speaker identification and speaker verification. In the previous chapters, we discussed speaker identification. In this chapter, we will introduce speaker verification. From an application point of view, speaker verification has more commercial applications than speaker identification.
9.1 Introduction
Among the different speaker authentication systems introduced in Chapter 1, we focus here on the fixed-phrase speaker verification system for open-set applications. Here, fixed-phrase means that the same pass-phrase is used for one speaker in both training and testing sessions and that the text of the pass-phrase is known by the system through registration. The system introduced in this chapter was first proposed by Parthasarathy and Rosenberg in [6], and a system with stochastic matching using the same database was then reported in [2] by the author together with the above authors. A fixed-phrase speaker verification system for speaker authentication has the following advantages. First, a short, user-selected phrase, also called a pass-phrase, is easy to remember and use. For example, it is easier to remember “open sesame” as a pass-phrase than a 10-digit phone number. Based on our experiments, the selected pass-phrase can be shorter than two seconds in duration and still give good performance. Second, a fixed-phrase system usually performs better than a text-prompted system [3]. Last, the fixed-phrase system can easily be implemented as a language-independent system, i.e., a user can create a pass-phrase in a selected language, which has the potential to increase the security level.
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_9, © Springer-Verlag Berlin Heidelberg 2012
9.2 A Fixed-Phrase System
In this chapter, we introduce a fixed-phrase system which has performed well in several evaluations using different databases. As shown in Fig. 1.2, a fixed-phrase system has two phases, enrollment and test. In the enrollment phase, a speaker-dependent model is trained. In the test phase, the trained model is used to verify given utterances. The final decision is made based on a likelihood score of the speaker-dependent model or a likelihood ratio score calculated from the speaker-dependent model and a speaker-independent background model.
For feature extraction, the speech signal is sampled at 8 kHz and pre-emphasized using a first-order filter with a coefficient of 0.97. The samples are blocked into overlapping frames of 30 ms in duration, updated at 10 ms intervals. Each frame is windowed with a Hamming window, followed by a 10th-order linear predictive coding (LPC) analysis. The LPC coefficients are then converted to cepstral coefficients, of which only the first 12 are retained for computing the feature vector. The feature vector consists of 24 features: the 12 cepstral coefficients and 12 delta cepstral coefficients [6].
A simple and useful technique for robustness in feature extraction is cepstral mean subtraction (CMS) [1]. The algorithm calculates the mean of the cepstral feature vectors of an utterance and then subtracts the mean from each cepstral vector.
During enrollment, LPC cepstral feature vectors corresponding to the non-silence portion of the enrollment pass-phrases are used to train a speaker-dependent (SD), context-dependent, left-to-right HMM to represent the voice pattern in the utterance. The model is called a whole-phrase model [6]. In addition to model training, the text of the pass-phrase collected from the enrollment session is transcribed into a sequence of phonemes, $\{S_k\}_{k=1}^{K}$, where $S_k$ is the $k$th phoneme and $K$ is the total number of phonemes in the pass-phrase.
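The front-end steps described above (pre-emphasis, 30 ms frames at a 10 ms shift, Hamming windowing, and cepstral mean subtraction) can be sketched as follows. This is a minimal illustration rather than the system's actual code: the function names are hypothetical, the LPC analysis and LPC-to-cepstrum conversion are omitted, a trailing partial frame is simply dropped, and the synthetic sine wave only stands in for speech.

```python
import math


def preemphasize(samples, coeff=0.97):
    """First-order pre-emphasis: y[n] = x[n] - coeff * x[n-1]."""
    if not samples:
        return []
    out = [samples[0]]
    for n in range(1, len(samples)):
        out.append(samples[n] - coeff * samples[n - 1])
    return out


def frame_and_window(samples, rate_hz=8000, frame_ms=30, shift_ms=10):
    """Split a signal into overlapping frames and apply a Hamming window."""
    frame_len = rate_hz * frame_ms // 1000   # 240 samples at 8 kHz
    shift = rate_hz * shift_ms // 1000       # 80 samples at 8 kHz
    window = [0.54 - 0.46 * math.cos(2.0 * math.pi * n / (frame_len - 1))
              for n in range(frame_len)]
    frames = []
    # Only full frames are kept; a trailing partial frame is dropped.
    for start in range(0, len(samples) - frame_len + 1, shift):
        frames.append([samples[start + n] * window[n]
                       for n in range(frame_len)])
    return frames


def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean from every cepstral vector."""
    dim = len(cepstra[0])
    mean = [sum(vec[d] for vec in cepstra) / len(cepstra) for d in range(dim)]
    return [[vec[d] - mean[d] for d in range(dim)] for vec in cepstra]


# One second of a synthetic 8 kHz tone (a placeholder for real speech).
signal = [math.sin(2.0 * math.pi * 440.0 * n / 8000.0) for n in range(8000)]
frames = frame_and_window(preemphasize(signal))
```

At 8 kHz, a 30 ms frame holds 240 samples and a 10 ms shift is 80 samples, so one second of signal yields 98 full frames.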
The models and the transcription are used later for training a background model and for computing model scores.
A detailed block diagram of a test session is shown in Fig. 9.1. After a speaker claims his or her identity, the system expects the user to utter the same phrase as in the enrollment session. The voice waveform is first converted to the prescribed feature representation. In the forced alignment block, a sequence of speaker-independent phoneme models is constructed according to the phonemic transcription of the pass-phrase. The sequence of phonemic models is then used to segment and align the feature vector sequence by means of the Viterbi algorithm. In the cepstral mean subtraction block, silence frames are removed, and a mean vector is computed from the remaining speech frames. The mean is then subtracted from all speech frames [1]. This is an important step for channel compensation. It makes the system more robust to changes in the operating environment as well as in the transmission channel. We note that the forced alignment block is also used for accurate endpoint detection.
Fig. 9.1. A fixed-phrase speaker verification system. [Block diagram: the identity claim retrieves the phoneme transcription and the speaker-dependent model from the database; the feature vectors pass through forced alignment (using speaker-independent phoneme models) and cepstral mean subtraction; the target score L(O, Λt) and the background score L(O, Λb) are computed, their difference is compared with a threshold, and a decision is made.]
For a fast response, the real-time endpoint detection algorithm
introduced in the previous chapter can be used, which also improves performance [4, 5].
In the target score computation block of Fig. 9.1, speech feature vectors are decoded into states by the Viterbi algorithm, using the trained whole-phrase model. A log-likelihood score for the target model, i.e., the target score, is calculated as
$$L(O, \Lambda_t) = \frac{1}{N_f} \log P(O \mid \Lambda_t), \tag{9.1}$$
where $O$ is the feature vector sequence, $N_f$ is the total number of vectors in the sequence, $\Lambda_t$ is the target model, and $P(O \mid \Lambda_t)$ is the likelihood score resulting from Viterbi decoding.
Using background models is usually a useful approach to improving the robustness of a speaker verification or identification system. Intuitively, channel distortion or background noise will affect the scores of both the speaker-dependent model and the background models, due to the mismatch between the training and current acoustic conditions. However, since the scores of both the speaker-dependent model and the background models change, their ratio changes relatively less; therefore, the likelihood ratio is a more robust score for the decision.
The speaker-independent background models can be phoneme dependent or phoneme independent. In the first case, the background model can be a hidden Markov model (HMM). In the second case, it can be a Gaussian mixture model (GMM). As reported in [6], a phoneme-dependent background model can provide better results. This is because a phoneme-dependent model can model the speech more accurately than a phoneme-independent model and can be more sensitive to background noise and channel distortion.
In the background (non-target) score computation block, a set of speaker-independent HMM’s in the order of the transcribed phoneme sequence, $\Lambda_b = \{\lambda_1, \ldots, \lambda_K\}$, is applied to align the input utterance with the expected transcription using the Viterbi decoding algorithm. The segmented
utterance is $O = \{O_1, \ldots, O_K\}$, where $O_i$ is the set of feature vectors corresponding to the $i$th phoneme, $S_i$, in the phoneme sequence. The background or non-target likelihood score is then computed by
$$L(O, \Lambda_b) = \frac{1}{N_f} \sum_{i=1}^{K} \log P(O_i \mid \lambda_i), \tag{9.2}$$
where $\Lambda_b = \{\lambda_i\}_{i=1}^{K}$ is the set of SI phoneme models in the order of the transcribed phoneme sequence, $P(O_i \mid \lambda_i)$ is the corresponding phoneme likelihood score, and $K$ is the total number of phonemes. The target and background scores [7] are then used in the likelihood-ratio test:
$$R(O; \Lambda_t, \Lambda_b) = L(O, \Lambda_t) - L(O, \Lambda_b), \tag{9.3}$$
where $L(O, \Lambda_t)$ and $L(O, \Lambda_b)$ are defined in (9.1) and (9.2), respectively.
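The scoring in (9.1) to (9.3) reduces to a few lines once the Viterbi log-likelihoods are available. The sketch below assumes those log-likelihoods have already been computed by a decoder that is not shown; the function names, the example numbers, and the convention of accepting when the ratio is at or above the threshold are illustrative assumptions.

```python
def target_score(log_p_target, num_frames):
    """Eq. (9.1): frame-normalized log-likelihood of the whole-phrase model."""
    return log_p_target / num_frames


def background_score(phoneme_log_ps, num_frames):
    """Eq. (9.2): summed per-phoneme log-likelihoods, normalized by frames."""
    return sum(phoneme_log_ps) / num_frames


def verify(log_p_target, phoneme_log_ps, num_frames, threshold):
    """Eq. (9.3): log-likelihood ratio and the resulting accept/reject flag."""
    ratio = (target_score(log_p_target, num_frames)
             - background_score(phoneme_log_ps, num_frames))
    # Assumed convention: accept when the ratio reaches the threshold.
    return ratio, ratio >= threshold


# Hypothetical values: a 200-frame utterance segmented into 3 phonemes.
ratio, accepted = verify(-9000.0, [-3200.0, -3100.0, -3000.0], 200,
                         threshold=0.0)
```

Here the target score is -45.0 and the background score is -46.5, so the ratio is 1.5 and the claim is accepted at a threshold of 0. Because both scores are divided by the same frame count, utterances of different lengths can share one threshold.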
9.3 An Evaluation Database and Model Parameters
The above system was tested on a database of fixed-phrase utterances [6, 2]. The database was recorded over a long-distance telephone network and consists of 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is “I pledge allegiance to the flag”, with an average utterance length of two seconds.
Five repetitions of the pass-phrase from each speaker, recorded in one enrollment session (one telephone call), are used to construct an SD target HMM. For testing, the data is divided into true-speaker and impostor groups. The true-speaker group has two repetitions from each of 25 test sessions, recorded over different telephone channels and handsets, at different times, and with different background noise; this gives 50 test utterances per true speaker. The impostor group consists of four utterances, recorded in different sessions, from each impostor speaker of the same gender as the true speaker, which gives about 200 impostor utterances per true speaker.
An open-set evaluation is closer to real applications. For example, a large-scale tele-banking system usually involves a large user population, and the population changes on a daily basis, so the system has to be tested as an open-set problem.
The SD target models for the phrases are left-to-right HMM’s. The number of states depends on the total number of phonemes in the phrase: the more phonemes, the more states. There are four Gaussian mixture components associated with each state [6]. The background models are concatenated speaker-independent (SI) phone HMM’s trained on a telephone speech database from different speakers and texts [7]. There are 43 HMM’s,
corresponding to 43 phonemes respectively, and each model has three states with 32 Gaussian components per state. Because variance estimates from a limited amount of speaker-specific training data are unreliable, a global variance estimate was used as the common variance of all Gaussian components in the target models [6].
9.4 Adaptation and Reference Results
In order to further improve the SD HMM, a model adaptation/re-estimation procedure is employed. The second, fourth, sixth, and eighth test utterances from the true speaker, which were recorded at different times, are used to update the means and mixture weights of the SD HMM for verifying successive test utterances.
For the above database, the average individual equal-error rate over 100 speakers is 3.03% without adaptation and 1.96% with adaptation [2], as shown in Table 9.1. With the stochastic matching technique introduced in the next chapter, the equal-error rates are further reduced to 2.61% and 1.80%, respectively.
In general, the longer the pass-phrase, the higher the accuracy. The response time depends on the hardware/software configuration. For most applications, a real-time response can be achieved using a personal computer.
Table 9.1. Experimental Results in Average Equal-Error Rates of All Tested Speakers
System                                        Without Adaptation   With Adaptation
Fixed pass-phrase                             3.03%                1.96%
Fixed pass-phrase with stochastic matching    2.61%                1.80%
Note: Tested on 100 speakers. All speakers used a common pass-phrase and all impostors were the same gender as the true speaker.
We note that the same pass-phrase was used for all speakers in our evaluation and all speakers were tested against impostors of the same gender. The above results therefore represent a worst case, chosen for research purposes. The actual system equal error rate (EER) would be better when users choose their own, different pass-phrases. Also, to preserve the open-set nature of the test, none of the impostors’ data was used for discriminatively training the SD target model in the above experiments.
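For readers who want to reproduce an equal-error rate from verification scores, the sketch below shows one simple way to estimate it for a single speaker. It is not the evaluation code behind Table 9.1: the function name, the rule of accepting scores at or above the threshold, the choice of the threshold where the two error rates are closest, and the example scores are all assumptions made for illustration.

```python
def equal_error_rate(true_scores, impostor_scores):
    """Return (eer, threshold) at the point where the false-rejection and
    false-acceptance rates are closest to each other."""
    best = None
    for threshold in sorted(set(true_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < threshold for s in true_scores) / len(true_scores)
        far = sum(s >= threshold for s in impostor_scores) / len(impostor_scores)
        gap = abs(frr - far)
        if best is None or gap < best[0]:
            best = (gap, (frr + far) / 2.0, threshold)
    return best[1], best[2]


# Hypothetical likelihood-ratio scores for one claimed identity.
true_scores = [2.1, 1.7, 0.9, 1.4, 0.2]
impostor_scores = [-1.0, 0.3, -0.4, 1.0, -2.2]
eer, threshold = equal_error_rate(true_scores, impostor_scores)
```

With these scores, a threshold of 0.9 rejects one of five true-speaker trials and accepts one of five impostor trials, so the estimated EER is 20%. An average individual EER, as reported in this chapter, would repeat this per speaker and average the results.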
9.5 Conclusions
In this chapter, we introduced a fixed-phrase speaker verification system. In many research papers, the performance of speaker verification systems is reported as an EER measured while all the speakers in the database selected the same
pass-phrase. We note that this is a worst-case scenario, which should not happen in real applications. When different speakers use different pass-phrases, the EER can be less than or equal to 0.5%.
To further improve system performance, other advanced techniques introduced in this book can be applied to a real system design to achieve lower EER, faster response time, a user-friendly interface, and noise robustness. These techniques include endpoint detection, auditory features, fast decoding, discriminative training, and combined system design with verbal information verification (VIV). Readers may refer to the corresponding chapters in this book when developing a real speaker authentication application.
The algorithm presented in this chapter can be applied to any language, and a speaker verification system can be developed as a language-independent system. In that case, the background model can be a language-independent universal phoneme model. In summary, the pass-phrase can be in any language or accent, and the SV system can be language dependent or language independent.
References
1. Furui, S., “Cepstral analysis techniques for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
2. Li, Q., Parthasarathy, S., and Rosenberg, A. E., “A fast algorithm for stochastic matching with application to robust speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
3. Li, Q., Parthasarathy, S., Rosenberg, A. E., and Tufts, D. W., “Normalized discriminant analysis with application to a hybrid speaker-verification system,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), May 1996.
4. Li, Q. and Tsai, A., “A matched filter approach to endpoint detection for robust speaker verification,” in Proceedings of IEEE Workshop on Automatic Identification, (Summit, NJ), Oct. 1999.
5. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., “Robust endpoint detection and energy normalization for real-time speech and speaker recognition,” IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
6. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker verification using sub-word background models and likelihood-ratio scoring,” in Proceedings of ICSLP-96, (Philadelphia), October 1996.
7. Rosenberg, A. E. and Parthasarathy, S., “Speaker background models for connected digit password speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.
Chapter 10 Robust Speaker Verification with Stochastic Matching
In today’s telecommunications environment, which includes wireless, landline, VoIP, and computer networks, the mismatch between training and testing environments poses a big challenge to speaker authentication systems. In Chapter 8, we addressed the mismatch problem from a feature extraction point of view. In this chapter, we address the problem from an acoustic modeling point of view. These two approaches can be used independently or jointly.
Speaker recognition performance is degraded when a hidden Markov model (HMM) trained under one set of conditions is used to evaluate data collected from different wireless or landline channels, microphones, networks, etc. The mismatch can be approximated as a linear transform in the cepstral domain, whatever feature extraction algorithm is used. In this chapter, we present a fast, efficient algorithm to estimate the parameters of the linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scale, and translation without destroying the detailed speaker-dependent characteristics of speech. Speaker-dependent HMM’s can then be used to evaluate the details under the transformed condition, which is similar to the original training condition. Compared to cepstral mean subtraction (CMS) and other bias removal techniques, the linear transform is more general, since CMS and the others consider only translation; compared to maximum-likelihood approaches to stochastic matching, the algorithm is simpler and faster, since iterative techniques are not required. The fast stochastic matching algorithm improves the performance of a speaker verification system in the experiments reported in this chapter. This approach was originally reported by the author, Parthasarathy, and Rosenberg in [4].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_10, © Springer-Verlag Berlin Heidelberg 2012
10.1 Introduction
For speaker recognition, a speaker-dependent hidden Markov model (HMM) for a true speaker is usually trained on data collected in an enrollment session. The HMM therefore closely matches the probability density functions (pdf’s) of the training data in the acoustic environment of the training data. In a verification session, test data are very often collected through a different communication channel and handset. Since the acoustic condition is different from that of the enrollment session, it usually causes a mismatch between the test data and the trained HMM. Speaker recognition performance is degraded by the mismatch.
The mismatch can be represented as a linear transform in the cepstral domain:
$$y = Ax + b, \tag{10.1}$$
where $x$ is the cepstral vector of a frame of a test utterance; $A$ and $b$ are the matrix and vector which need to be estimated for every test utterance; and $y$ is the transformed vector. Geometrically, $b$ represents a translation and $A$ represents both scale and rotation. When $A$ is diagonal, it is only a scaling operation.
Cepstral mean subtraction (CMS) [2, 1, 3] is a fast, efficient technique for handling mismatches in both speaker and speech recognition. It estimates $b$ and assumes $A$ to be an identity matrix. In [9], the vector $b$ was estimated by a long-term average, a short-term average, and a maximum likelihood approach. In [10, 11], maximum likelihood (ML) approaches were used to estimate $b$, a diagonal $A$, and model parameters for HMM’s for stochastic matching. A least-squares solution for the linear transform parameters was briefly introduced in [5].
In this chapter we consider the mismatch modeled with a general linear transform, i.e., $A$ is a full matrix and $b$ is a vector. The approach is to have the overall distribution of the test data match the overall distribution of the training data. Then a speaker-dependent (SD) HMM trained on the training data is applied to evaluate the details of the test data. This is based on the assumption that differences between speakers lie mainly in the details, which are characterized by the HMM’s. A fast algorithm with stochastic matching for fixed-phrase speaker verification is presented in this chapter.
Compared to CMS and other bias removal techniques [9, 8], the introduced linear transform approach is more general since CMS and others only consider the translation; compared to the ML approaches [9, 8, 10, 11], the algorithm is simpler and faster since iterative techniques are not required and the estimation of the linear transform parameters is separated from HMM training and testing.
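As a compact restatement of the comparison above, the following lists the form of (10.1) that each family of techniques estimates. The symbols $\bar{x}$ (the mean of the $N_f$ test frames), $d$ (the feature dimension), and $a_k$ (the diagonal entries) are notation introduced here for illustration only, and the CMS line assumes the mean is taken over the frames of the test utterance.

```latex
\begin{align*}
\text{CMS:} \quad & A = I,\; b = -\bar{x}, & y &= x - \bar{x},
  \qquad \bar{x} = \frac{1}{N_f}\sum_{j=1}^{N_f} x_j, \\
\text{diagonal } A\text{:} \quad & A = \operatorname{diag}(a_1, \ldots, a_d), & y_k &= a_k x_k + b_k,
  \qquad k = 1, \ldots, d, \\
\text{full } A\text{:} \quad & A \in \mathbb{R}^{d \times d}, & y &= Ax + b.
\end{align*}
```

Only the last form allows rotation, because a diagonal $A$ scales each cepstral coefficient independently of the others.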
10.2 A Fast Stochastic Matching Algorithm
Fig. 10.1. A geometric interpretation of the fast stochastic matching. (a) The dashed line is the contour of training data. (b) The solid line is the contour of test data. The crosses are the means of the two data sets. (c) The test data were scaled and rotated toward the training data. (d) The test data were translated to the same location as the training data. Both contours overlap each other. [Four panels showing the contours $R_{\text{train}}$ and $R_{\text{test}}$ and the transformed contour $A R_{\text{test}} A^T$.]
We use Fig. 10.1 as a geometric interpretation of the fast stochastic matching algorithm. In Fig. 10.1 (a), the dashed line is a contour of the training data. In Fig. 10.1 (b), the solid line is a contour of the test data. Due to different channels, noise levels, and telephone transducers, the mean of the test data is translated from the training data; the distribution is shrunk [6] and rotated
from the HMM training condition. The mismatch may cause a wrong decision when using the trained HMM to score the mismatched test data. In the algorithm, we first find a covariance matrix, $R_{\text{train}}$, from the training data, which approximately characterizes the overall distribution. Then, we find a covariance matrix, $R_{\text{test}}$, from the test data and estimate the parameters of the $A$ matrix for the linear transform in (10.1). After applying the first transform, the overall distribution of the test data is scaled and rotated, $A R_{\text{test}} A^T$, to be the same as that of the training data except for the difference of the means, as shown in Fig. 10.1 (c). Next, we find the difference between the means and translate the test data to the same location as the training data, as shown in Fig. 10.1 (d), where the contour of the transformed test data overlaps with the contour of the training data.
We note that, as a linear transform, the proposed algorithm does not destroy the details of the pdf of the test data. The details will be measured and evaluated by the trained SD HMM. If the test data from a true speaker mismatch the HMM training condition, the data will be transformed to match the trained HMM approximately. If the
test data from a true speaker match the training condition, the calculated $A$ and $b$ are close to an identity matrix and a zero vector, respectively, so the transform will not have much effect on the HMM scores.
This technique attempts to reduce mismatch whether the mismatch occurs because test and training conditions differ or because the test and training data originate from different speakers. It is reasonable to suppose that speaker characteristics are found mainly in the details of the representation. However, to the extent that they are also found in global features, this technique would increase the matching scores between true speaker models and impostor test utterances. Performance, then, could possibly degrade, particularly when other sources of mismatch are absent, that is, when test and training conditions are actually matched. However, experiments in this chapter show that performance overall does improve.
If the test data from an impostor do not match the training condition, the overall distribution of the data will be transformed to match it, but the details of the distribution still do not match a true speaker’s HMM, because the transform criterion does not address the details and no non-linear transform is involved.
10.3 Fast Estimation for a General Linear Transform
In a speaker verification training session, we collect multiple utterances with the same content, and use a covariance matrix $R_{\text{train}}$ and a mean vector $m_{\text{train}}$ to represent the overall distribution of the training data of all the training utterances in the cepstral domain. They are defined as follows:
$$R_{\text{train}} = \frac{1}{U} \sum_{i=1}^{U} \frac{1}{N_i} \sum_{j=1}^{N_i} (x_{i,j} - m_i)(x_{i,j} - m_i)^T, \tag{10.2}$$
and
$$m_{\text{train}} = \frac{1}{U} \sum_{i=1}^{U} m_i, \tag{10.3}$$
where $x_{i,j}$ is the $j$th non-silence frame in the $i$th training utterance, $U$ is the total number of training utterances, $N_i$ and $m_i$ are the total number of non-silence frames and the mean vector of the $i$th training utterance, respectively, and $m_{\text{train}}$ is the average mean vector of the non-silence frames of all training utterances.
In a test session, only one utterance will be collected and verified at a time. The covariance matrix for the test data is
$$R_{\text{test}} = \frac{1}{N_f} \sum_{j=1}^{N_f} (y_j - m_{\text{test}})(y_j - m_{\text{test}})^T, \tag{10.4}$$
where $y_j$ and $m_{\text{test}}$ are a non-silence frame and the mean vector of the test data, and $N_f$ is the total number of non-silence frames.
The criterion for parameter estimation is to have $R_{\text{test}}$ match $R_{\text{train}}$ through a rotation, scale, and translation (RST) of the test data. For rotation and scale, we have the following equation:
$$R_{\text{train}} - A R_{\text{test}} A^T = 0, \tag{10.5}$$
where $A$ is defined as in (10.1), and $R_{\text{train}}$ and $R_{\text{test}}$ are defined as in (10.2) and (10.4). By solving (10.5), we have the $A$ matrix for (10.1),
$$A = R_{\text{train}}^{1/2} R_{\text{test}}^{-1/2}. \tag{10.6}$$
Then, the translation term $b$ of (10.1) can be obtained by
$$b = m_{\text{train}} - m_{rs} = m_{\text{train}} - \frac{1}{N_f} \sum_{j=1}^{N_f} A x_j, \tag{10.7}$$
where $m_{\text{train}}$ is defined as in (10.3); $m_{rs}$ is the mean vector of the rotated and scaled frames; $N_f$ is the total number of non-silence frames of a test utterance; and $x_j$ is the $j$th cepstral vector frame.
To verify a given test utterance against a set of true speaker’s models (consisting of an SD HMM plus $R_{\text{train}}$ and $m_{\text{train}}$), first $R_{\text{test}}$, $A$, and $b$ are calculated by using (10.4), (10.6), and (10.7); then all test frames are transformed by (10.1) to reduce the mismatch.
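The estimation procedure in (10.2) to (10.7) can be sketched with NumPy as below. This is an illustration under stated assumptions, not the original implementation: the function names are hypothetical, frames are stored as rows (so (10.1) becomes `X @ A.T + b`), the matrix powers in (10.6) are taken to be symmetric principal roots computed by eigendecomposition, both covariance matrices are assumed positive definite, and the random data merely stand in for non-silence cepstral frames.

```python
import numpy as np


def sym_power(matrix, power):
    """Real power of a symmetric positive-definite matrix (eigendecomposition)."""
    eigvals, eigvecs = np.linalg.eigh(matrix)
    return (eigvecs * eigvals ** power) @ eigvecs.T


def train_statistics(utterances):
    """Eqs. (10.2)-(10.3): average the per-utterance covariances and means."""
    covs, means = [], []
    for frames in utterances:            # frames: (N_i, dim) array
        m_i = frames.mean(axis=0)
        centered = frames - m_i
        covs.append(centered.T @ centered / len(frames))
        means.append(m_i)
    return np.mean(covs, axis=0), np.mean(means, axis=0)


def stochastic_match(test_frames, r_train, m_train):
    """Eqs. (10.4), (10.6), (10.7), (10.1) applied to one test utterance."""
    m_test = test_frames.mean(axis=0)
    centered = test_frames - m_test
    r_test = centered.T @ centered / len(test_frames)        # (10.4)
    a = sym_power(r_train, 0.5) @ sym_power(r_test, -0.5)    # (10.6)
    b = m_train - (test_frames @ a.T).mean(axis=0)           # (10.7)
    return test_frames @ a.T + b, a, b                       # (10.1)


# Synthetic stand-in data: 5 training utterances and 1 distorted test utterance.
rng = np.random.default_rng(0)
train = [rng.normal(size=(300, 4)) for _ in range(5)]
distortion = np.array([[0.6, 0.2, 0.0, 0.0],
                       [0.0, 0.8, 0.1, 0.0],
                       [0.0, 0.0, 0.7, 0.0],
                       [0.1, 0.0, 0.0, 0.9]])
test = rng.normal(size=(300, 4)) @ distortion.T + 1.5   # simulated mismatch
r_train, m_train = train_statistics(train)
matched, a, b = stochastic_match(test, r_train, m_train)
```

With symmetric roots, substituting (10.6) into (10.5) gives $R_{\text{train}}^{1/2} R_{\text{test}}^{-1/2} R_{\text{test}} R_{\text{test}}^{-1/2} R_{\text{train}}^{1/2} = R_{\text{train}}$, so the transformed frames have exactly the training covariance and mean. Only one covariance estimate and two eigendecompositions are needed per test utterance, with no iteration.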
10.4 Speaker Verification with Stochastic Matching
The above stochastic matching algorithm has been applied to a text-dependent speaker verification system using general phrase passwords. The system has been discussed in the previous chapter [7]. Stochastic matching is included in the front-end processing to further improve the system robustness and performance. The system block diagram with stochastic matching is shown in Fig. 10.2.
Fig. 10.2. A phrase-based speaker verification system with stochastic matching. [Block diagram: the identity claim retrieves the speaker information (transcription, SD target HMM’s, $R_{\text{train}}$, and $m_{\text{train}}$); SI phone alignment, using SI phone HMM’s, produces the phone string and boundaries from the cepstrum; stochastic matching produces the transformed cepstrum; the speaker verifier uses the SD target HMM’s and the SI background HMM’s to produce scores.]
After a speaker claims an identity (ID), the system expects the same phrase obtained in the associated training session. First, a speaker-independent (SI) phone recognizer segments the input utterance into a sequence of phones by forced decoding, using the transcription saved from the enrollment session. Since the SD models are trained on a small amount of data from a single session, they cannot be used to provide reliable and consistent phone segmentations, so the SI phone models are used. In parallel, the cepstral coefficients of the utterance from the test speaker are transformed to match the training data distribution by computing Eqs. (10.4), (10.6), (10.7), and (10.1). Then, the transformed cepstral coefficients, decoded phone sequence,
and associated phone boundaries are transmitted to a verifier. In the verifier, a log-likelihood-ratio score is calculated based on the log-likelihood scores of the target and background models:
$$L_R(O; \Lambda_t; \Lambda_b) = L(O, \Lambda_t) - L(O, \Lambda_b), \tag{10.8}$$
where $O$ is the observation sequence over the whole phrase, and $\Lambda_t$ and $\Lambda_b$ are the target and background models, respectively. The background model is a set of HMM’s for phones. The target model is one HMM with multiple states for the whole phrase. As reported in [7], this configuration provides the best results in experiments. Furthermore,
$$L(O, \Lambda_t) = \frac{1}{N_f} P(O \mid \Lambda_t), \tag{10.9}$$
where $P(O \mid \Lambda_t)$ is the log-likelihood of the phrase evaluated by one HMM, $\Lambda_t$, using Viterbi decoding, and $N_f$ is the total number of non-silence frames in the phrase. Similarly,
$$L(O, \Lambda_b) = \frac{1}{N_f} \sum_{i=1}^{N_p} P(O_i \mid \Lambda_{b_i}), \tag{10.10}$$
where $P(O_i \mid \Lambda_{b_i})$ is the log-likelihood of the $i$th phone, $O_i$ is the segmented observation sequence over the $i$th phone, $\Lambda_{b_i}$ is an HMM for the $i$th phone, $N_p$ is the total number of decoded non-silence phones, and $N_f$ is the same as above. A final decision on rejection or acceptance is made by comparing the $L_R$ score with a threshold. If a significantly different phrase is given, the phrase could be rejected by the SI phone recognizer before the verifier is used.
10.5 Database and Experiments
The feature vector in this chapter is composed of 12 cepstrum and 12 delta cepstrum coefficients. The cepstrum is derived from a 10th-order linear predictive coding (LPC) analysis over a 30 ms window. The feature vectors are updated at 10 ms intervals.
The experimental database consists of fixed-phrase utterances recorded over long-distance telephone networks by 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is “I pledge allegiance to the flag”, with an average length of 2 seconds. Five utterances from each speaker, recorded in one session, are used to train an SD HMM plus $R_{\text{train}}$ and $m_{\text{train}}$ for the linear transform. For testing, we used 50 utterances recorded from a true speaker in different sessions (different telephone channels at different times), and 200 utterances recorded from 50 impostors of the same gender in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker are used to update the associated HMM plus $R_{\text{train}}$ and $m_{\text{train}}$ for verifying succeeding test utterances.
The target models for the phrases are left-to-right HMM’s. The number of states is 1.5 times the total number of phones in the phrase. There are four Gaussian components associated with each state. The background models are concatenated phone HMM’s trained on a telephone speech database from different speakers and texts. Each phone HMM has three states with 32 Gaussian components associated with each state. Because variance estimates from a limited amount of training data are unreliable, a global variance estimate is used as the common variance of all Gaussian components [7] in the target models.
The experimental results are listed in Table 10.1. These are the averages of individual equal-error rates (EERs) over the 100 evaluation speakers. The baseline results are obtained with log-likelihood-ratio scores using a phrase-based target model and phone-based speaker background models.
The EERs without and with adaptation are 5.98% and 3.94%, respectively. When using CMS, the EERs are 3.03% and 1.96%. When using the algorithm introduced in this chapter, the EERs are 2.61% and 1.80%.
Table 10.1. Experimental Results in Average Equal-Error Rates (%)
Algorithms         No Adaptation   With Adaptation
Baseline           5.98            3.94
CMS                3.03            1.96
RST (Presented)    2.61            1.80
10.6 Conclusions
A simple, fast, and efficient algorithm for robust speaker verification with stochastic matching was presented. The algorithm was applied to a general phrase speaker verification system. In the experiments, when there is no model adaptation, the algorithm reduces the EER by a relative 56% compared with a baseline system without any stochastic matching, and by 14% compared with a system using CMS. When model adaptation is applied, the relative reductions are 54% and 8%. The improvement is smaller in this case because the SD models have already been updated to fit different acoustic conditions. The presented algorithm can also be applied to speaker identification and other applications to improve system robustness.
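The relative improvements quoted above follow directly from the EERs in Table 10.1. The short check below recomputes them; the function and variable names are illustrative only.

```python
def relative_reduction(reference_eer, new_eer):
    """Relative EER reduction, in percent, of new_eer with respect to reference_eer."""
    return 100.0 * (reference_eer - new_eer) / reference_eer


# EERs (%) from Table 10.1 as (no adaptation, with adaptation).
table = {"baseline": (5.98, 3.94), "cms": (3.03, 1.96), "rst": (2.61, 1.80)}

for column, label in enumerate(("no adaptation", "with adaptation")):
    vs_baseline = relative_reduction(table["baseline"][column], table["rst"][column])
    vs_cms = relative_reduction(table["cms"][column], table["rst"][column])
    print(f"{label}: {vs_baseline:.0f}% vs. baseline, {vs_cms:.0f}% vs. CMS")
```

For example, without adaptation the reduction relative to the baseline is (5.98 - 2.61) / 5.98, or about 56%, and relative to CMS it is (3.03 - 2.61) / 3.03, or about 14%.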
References
1. Atal, B. S., “Automatic recognition of speakers from their voices,” Proceedings of the IEEE, vol. 64, pp. 460–475, 1976.
2. Atal, B. S., “Effectiveness of linear prediction characteristics of the speech wave for automatic speaker identification and verification,” Journal of the Acoustical Society of America, vol. 55, pp. 1304–1312, 1974.
3. Furui, S., “Cepstral analysis techniques for automatic speaker verification,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 27, pp. 254–277, April 1981.
4. Li, Q., Parthasarathy, S., and Rosenberg, A. E., “A fast algorithm for stochastic matching with application to robust speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
5. Mammone, R. J., Zhang, X., and Ramachandran, R. P., “Robust speaker recognition,” IEEE Signal Processing Magazine, vol. 13, pp. 58–71, Sept. 1996.
6. Mansour, D. and Juang, B.-H., “A family of distortion measures based upon projection operation for robust speech recognition,” IEEE Trans. Acoust., Speech, Signal Processing, vol. 37, pp. 1659–1671, November 1989.
7. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker verification using sub-word background models and likelihood-ratio scoring,” in Proceedings of ICSLP-96, (Philadelphia), October 1996.
8. Rahim, M. G. and Juang, B.-H., “Signal bias removal by maximum likelihood estimation for robust telephone speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 19–30, January 1996.
9. Rosenberg, A. E., Lee, C.-H., and Soong, F. K., “Cepstral channel normalization techniques for HMM-based speaker verification,” in Proceedings of Int. Conf. on Spoken Language Processing, (Yokohama, Japan), pp. 1835–1838, 1994.
10. Sankar, A. and Lee, C.-H., “A maximum-likelihood approach to stochastic matching for robust speech recognition,” IEEE Transactions on Speech and Audio Processing, vol. 4, pp. 190–202, May 1996.
11. Surendran, A. C., Maximum-likelihood stochastic matching approach to nonlinear equalization for robust speech recognition. PhD thesis, Rutgers University, Busch, NJ, May 1996.
Chapter 11 Randomly Prompted Speaker Verification
In the previous chapters, we introduced algorithms for fixed-phrase speaker verification. In this chapter, we introduce an algorithm for randomly prompted speaker verification. Instead of having a user select and remember a pass-phrase, the system randomly displays a phrase and asks the user to read it; therefore, the user does not need to remember the phrase. In our approach, a modified linear discriminant analysis technique, referred to here as normalized discriminant analysis (NDA), is presented. Using this technique, it is possible to design an efficient linear classifier with very limited training data and to generate normalized discriminant scores with comparable magnitudes for different classifiers. The NDA technique is applied to a speaker verification classifier based on speaker-specific information obtained when utterances are processed with speaker-independent models. The algorithm has shown significant improvement in speaker verification performance. This research was originally reported by the author, Parthasarathy, and Rosenberg in [7].
11.1 Introduction
As we introduced in Chapter 1, speaker recognition is categorized into two major areas: speaker identification and speaker verification. Speaker identification is the process of associating an unknown speaker with a member of a known population of speakers, while speaker verification is the process of verifying whether an unknown speaker is the same as a speaker in a known population whose identity is claimed. Another distinguishing feature of speaker recognition is whether it is text-dependent or text-independent. Text-dependent recognition requires that speakers utter a specific phrase or a given password. Text-independent recognition does not require a specific utterance. Furthermore, text-dependent speaker verification systems can be divided into two categories: fixed-phrase and randomly prompted speaker verification. A voice
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_11, © Springer-Verlag Berlin Heidelberg 2012
password such as “open sesame” is a fixed phrase that the user must remember. On the other hand, a bank ATM or a speaker verification system can display a sequence of random numbers or even a random phrase and prompt the user to read it. This kind of system is called randomly prompted speaker verification, and the user does not need to remember a pass-phrase. Randomly prompted utterances can provide a higher level of security than fixed-phrase passwords or pass-phrases. The advantages of a randomly prompted speaker verification system are that users do not need to remember a pass-phrase and that the pass-phrase can be changed easily from time to time.
This chapter focuses on text-dependent, randomly prompted speaker verification. We use connected digits as the prompted phrases, with an 11-word vocabulary comprising the digits “0” through “9” plus “oh”. In training, 11 connected-digit utterances are recorded in one session for each speaker. The training utterances are designed so that each of the 11 words appears five times in different contexts, giving five tokens of each digit for each speaker. In testing, either a fixed 9-digit test utterance or randomly selected 4-digit test utterances are recorded in different sessions.
As reported in previous chapters and the references therein, conventional training algorithms for speaker verification are based on maximum likelihood estimation using only training data recorded from the speakers to be modeled. Gradient-descent-based discriminative training algorithms [12], neural tree algorithms [3, 11], and linear discriminant analysis (LDA) [16, 17, 13] have also been used for speaker verification. The discriminant approaches provide a potential modeling advantage because they account for the separation of each designated speaker from a group of other speakers. A text-dependent, connected-digit speaker verification system often consists of different classifiers for different words for each speaker.
The concept of principal feature classification introduced in Chapter 3 [9, 10, 8, 18] can be applied in the design of these classifiers, and linear discriminant analysis [2, 5, 4] can be used to find principal features. However, two problems occur when LDA is used to design these classifiers: the amount of training data is usually small, and the discriminant scores obtained from different classifiers are scaled differently, so it is hard to compare and combine them. Therefore, in this chapter, we introduce a normalized discriminant analysis (NDA) technique to address these problems.
In this chapter, NDA is applied to design a hybrid speaker verification (HSV) system. As described by Setlur et al. [16, 17], the system combines two types of word models or classifiers. (We use the term classifier when we do not use decoding.) The first type of classifier is a speaker-dependent, continuous-density hidden Markov model (HMM). This representation has been shown to provide good performance for connected-digit password speaker verification [15]. The second type of classifier is based on speaker-specific information that can be obtained when password utterances are processed with speaker-independent (SI) HMMs. The mixture components of an SI Gaussian
mixture HMM are found by clustering training data from a wide variety of speakers and recording conditions. For a particular state of a particular word, each such component is representative of some subset of the training data. When a test utterance is processed, the score for each test vector is calculated using a weighted sum of mixture component likelihoods. Because of the way the mixture component parameters are trained, it is reasonable to expect that each test speaker will have different characteristic distributions of likelihood scores across these components. The second type of classifier is based on these characteristic mixture component likelihood distributions obtained over training utterances for each speaker.
[Figure 11.1: block diagram with the blocks Feature vectors, HMM score for test speaker, HMM score for cohorts, HMM utterance score, NDA feature extraction, NDA word-level classification, NDA utterance score, and Data Fusion.]
Fig. 11.1. The structure of a hybrid speaker verification (HSV) system.
Although the second type of speaker classifier yields significantly lower performance than the first type, it has been shown in [16, 17] that, when combined, the two representations yield significantly improved verification performance over either one by itself. Our study differs from that of Setlur et al. [16, 17] in three respects. First, NDA is used instead of Fisher linear discriminant analysis; second, verification is carried out on a database recorded over a long-distance telephone network under a variety of recording and channel conditions; third, only small amounts of training data are available per speaker.
The HSV system, which is similar to the one in [16, 17], is shown in Figure 11.1. It consists of three modules: a Type 1 HMM classifier; a Type 2 discriminant analysis classifier; and a data fusion layer. We note that an HSV system could include more classifiers as long as each individual classifier provides independent information.
The structure of HSV is similar to the mixture decomposition discrimination (MDD) system in [16]. However, there are some differences between the two systems. In HSV, a modified discriminant analysis improves the performance of the NDA classifier. In addition, the database in this research is noisier, the amount of training data is much smaller, and the use of the data is restricted to long-distance telephone applications.
11.2 Normalized Discriminant Analysis
To classify segmented words between a true speaker and impostors, in general we can apply the principal feature classification (PFC) presented in Chapter 3 [8, 18] to train classifiers for each word of each speaker. However, for this training problem, the true-speaker class has only 5 training data vectors, and each vector has 96 elements. The training data sets are linearly separable. Thus, as a special case of PFC, we only need to find the first principal feature by LDA.
Word-level discrimination between a true speaker and impostors is a two-class classification problem. We found, from analysis of the training data, that the data are linearly separable into the speaker classes, so Fisher’s LDA is a simple and effective tool for the classification problem. Using LDA, a principal feature, the weight vector $w$, is found such that the projected data from the true speaker and impostors are maximally separated. In brief, for two-class LDA, $w$ can be solved directly as
$$ w = S_W^{-1} (m_T - m_I), \tag{11.1} $$
where $m_T$ and $m_I$ are the sample means of the two classes, true speaker and impostors, and $S_W$ is usually defined as
$$ S_W = \sum_{x \in X_T} (x - m_T)(x - m_T)^t + \sum_{x \in X_I} (x - m_I)(x - m_I)^t, \tag{11.2} $$
where $X_T$ and $X_I$ are the data matrices of a true speaker and impostors. $S_W$ must be non-singular. Each row in the matrices represents one training data vector. More details on LDA can be found in [2, 5, 4].
However, in practical speaker verification applications, there are usually only a few training vectors for each true speaker. For example, there are only five vectors available in our experiments. To compensate for this lack of training data, we redefine $S_W$ in (11.2) as
$$ \hat{S}_W = R_T + \gamma R_{CT} + R_I + \delta R_{CI}, \tag{11.3} $$
where $R_T$ and $R_I$ are the sample covariance matrices from the true speaker and impostors, and $R_{CI}$ and $R_{CT}$ are compensating covariance matrices from another available group of speakers (not used in the evaluation). $R_{CI}$ is the sample covariance matrix of the additional speakers, pooling their data. In fact, $R_I$ and $R_{CI}$ could be combined, except that we may want to weight the associated data sets differently. $R_{CT}$ is defined as
$$ R_{CT} = \frac{1}{T_s} \sum_{i=1}^{T_s} R_i, \tag{11.4} $$
where $R_i$ is the sample covariance matrix of Speaker $i$ in the other group, and $T_s$ is the total number of speakers in the group. $\gamma$ and $\delta$ are weight factors determined experimentally.
An LDA score $p$ of a data vector $x$ is obtained by projecting $x$ onto the weight vector $w$, $p = w^t x$. To make the scores comparable across different words and different speakers, we use the following normalization:
$$ \hat{p} = \alpha w^t x + \beta, \tag{11.5} $$
where $\alpha = 2/d$, $\beta = -1 - 2\mu_I/d$, and $d = \mu_T - \mu_I$. Here $\mu_T$ and $\mu_I$ are the means of the projected data from the true speaker and impostors. After normalization, $\mu_T$ and $\mu_I$ are located at $+1$ and $-1$, respectively. $\hat{p}$ is the NDA score.
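The training procedure of this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the function names (`train_nda`, `nda_score`, `solve`), the toy 2-dimensional data, and the choice of an identity matrix for both compensating covariances are all hypothetical. The code forms the compensated matrix of (11.3), solves (11.1) for the weight vector, and derives the normalization constants of (11.5) from the projected class means.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def mean(rows: Matrix) -> Vector:
    """Column-wise sample mean of the row vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def covariance(rows: Matrix) -> Matrix:
    """Sample covariance matrix (divisor n - 1) of the row vectors."""
    m, n = mean(rows), len(rows)
    return [[sum((x[a] - m[a]) * (x[b] - m[b]) for x in rows) / (n - 1)
             for b in range(len(m))] for a in range(len(m))]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def solve(a: Matrix, b: Vector) -> Vector:
    """Solve a z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    z = [0.0] * n
    for r in reversed(range(n)):
        tail = sum(aug[r][c] * z[c] for c in range(r + 1, n))
        z[r] = (aug[r][n] - tail) / aug[r][r]
    return z


def train_nda(x_true: Matrix, x_imp: Matrix, r_ct: Matrix, r_ci: Matrix,
              gamma: float, delta: float) -> Tuple[Vector, float, float]:
    """Return (w, alpha, beta) following (11.1), (11.3) and (11.5)."""
    m_t, m_i = mean(x_true), mean(x_imp)
    r_t, r_i = covariance(x_true), covariance(x_imp)
    dim = len(m_t)
    # Compensated within-class matrix of (11.3).
    s_hat = [[r_t[a][b] + gamma * r_ct[a][b] + r_i[a][b] + delta * r_ci[a][b]
              for b in range(dim)] for a in range(dim)]
    # Weight vector of (11.1), with S_W replaced by the compensated matrix.
    w = solve(s_hat, [t - i for t, i in zip(m_t, m_i)])
    # The projection is linear, so the projected means are w^t m_T and w^t m_I.
    mu_t, mu_i = dot(w, m_t), dot(w, m_i)
    d = mu_t - mu_i
    return w, 2.0 / d, -1.0 - 2.0 * mu_i / d


def nda_score(w: Vector, alpha: float, beta: float, x: Vector) -> float:
    """Normalized discriminant score of (11.5)."""
    return alpha * dot(w, x) + beta


# Hypothetical toy data: 2-dimensional vectors instead of 96-dimensional ones.
X_TRUE = [[2.0, 1.0], [2.5, 1.5], [3.0, 1.0]]
X_IMP = [[0.0, 0.0], [0.5, -0.5], [-0.5, 0.5], [0.0, 1.0]]
IDENTITY = [[1.0, 0.0], [0.0, 1.0]]  # stand-in for R_CT and R_CI

w, alpha, beta = train_nda(X_TRUE, X_IMP, IDENTITY, IDENTITY, 0.1, 0.1)
print(nda_score(w, alpha, beta, X_TRUE[0]), nda_score(w, alpha, beta, X_IMP[0]))
```

By construction, the class means map to +1 and -1 regardless of the scale of the weight vector, which is what makes scores from different word classifiers comparable.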
11.3 Applying NDA in the Hybrid Speaker-Verification System
NDA can be used to design Type 2 classifiers for speaker verification. The classifiers can be used separately or as a module in the HSV system [16, 17].
11.3.1 Training of the NDA System
As described earlier, the Type 2 classifier features are determined from speaker-independent (SI) HMMs. Each training or test utterance is first segmented into words and states. As shown in Figure 11.2, we use the averaged outputs of the Gaussian components on the HMM states as one fixed-length feature vector for NDA training. The elements of the feature vector are defined as follows:
$$ x_{jm} = \frac{1}{T_j} \sum_{t=1}^{T_j} \log N(o_t, \mu_{jm}, R_{jm}), \quad j = 1, \ldots, J; \; m = 1, \ldots, M_j, \tag{11.6} $$
where $o_t$ is the cepstral feature vector at time frame $t$, $\mu_{jm}$ and $R_{jm}$ are the mean and covariance of the $m$th mixture component for state $j$, $N(\cdot)$ is a Gaussian function, and $T_j$ is the total number of frames segmented into state $j$.
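The mapping in (11.6) from a variable-length frame sequence to a fixed-length vector can be sketched as follows. This is an illustrative sketch only: it assumes diagonal covariance matrices (the text does not state the covariance structure), it assumes the state segmentation is already available, and the names `log_gaussian_diag` and `nda_features` are hypothetical.

```python
import math
from typing import List, Sequence

Frame = Sequence[float]


def log_gaussian_diag(o: Frame, mean: Frame, var: Frame) -> float:
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))


def nda_features(frames_by_state: Sequence[Sequence[Frame]],
                 means: Sequence[Sequence[Frame]],
                 variances: Sequence[Sequence[Frame]]) -> List[float]:
    """Compute x_jm of (11.6) for every state j and mixture component m.

    frames_by_state[j] holds the frames segmented into state j;
    means[j][m] and variances[j][m] describe component m of state j.
    """
    features = []
    for j, frames in enumerate(frames_by_state):
        for mu, var in zip(means[j], variances[j]):
            total = sum(log_gaussian_diag(o, mu, var) for o in frames)
            features.append(total / len(frames))  # average over T_j frames
    return features
```

With 6 states and 16 components per state, the returned list has 96 elements whatever the number of frames in the word.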
[Figure 11.2: for each HMM state (State j - 1, State j, State j + 1), the log outputs of the Gaussian components m = 1, 2, ..., M_j are averaged over the frames t = 1, ..., T, where O_t is the cepstral feature vector at time frame t; the resulting state features feed Word-Level Classification, which produces the Decision Output.]
Fig. 11.2. The NDA feature extraction.
Thus, a sequence of cepstral feature vectors associated with one segmented word is mapped onto a fixed-length feature vector. The length of the feature vector is equal to the total number of Gaussian components of the word HMM ($J \times M$). For example, if a word HMM has 6 states and each state has 16 components, the length of the NDA feature vector is 6 × 16 = 96. Feature extraction is almost identical to the technique in [16, 17], except that the HMM mixture weights are omitted since they are absorbed in the NDA calculation.
The structure of the word and utterance verification for one speaker is shown in Figure 11.3. There is an NDA classifier for each word.
[Figure 11.3: the feature vectors of each word feed a per-word classifier (NDA for Word "1" vs. all others, NDA for Word "2" vs. all others, ..., NDA for Word "oh" vs. all others); the classifier outputs are weighted by u_1, u_2, ..., u_11, summed, and scaled by 1/L to give S_NDA.]
Fig. 11.3. The Type 2 classifier (NDA system) for one speaker.
An utterance score $S_{NDA}(O)$ is a weighted sum of the NDA scores of all words in the utterance:
$$ S_{NDA}(O) = \frac{1}{L} \sum_{i=1}^{L} u_{k_i} \hat{p}_{k_i}, \quad k_i \in \{1, \ldots, 11\}, \tag{11.7} $$
where $L$ is the length of the utterance $O$ (the total number of words in the utterance), and $\hat{p}_{k_i}$ is the NDA score for the $i$th word. Equation (11.7) specifies a linear node with associated weights $u_{k_i}$ that can be determined by optimal data fusion [1] to equalize the performance across words if sufficient training data are available.
11.3.2 Training of the HMM System
For the Type 1 classifier, the HMM scores are calculated from speaker-dependent (SD) HMMs. Cohort normalization is applied by selecting five scores from a group of speakers not in the evaluation group. Cepstral mean normalization is applied in both the HMM and NDA classifiers. Usually, the verification score $S$ for word $W$ for Speaker $I$ is calculated by
$$ S\{O|W, I\} = \frac{1}{T} \max \Big\{ \sum_{t=1}^{T} \log b_j(o_t) \Big\}, \quad j = 1, \ldots, J, \tag{11.8} $$
where $O = \{o_1, o_2, \ldots, o_T\}$ is the sequence of $T$ raw data vectors segmented for word $W$, $J$ is the total number of states, and $b_j$ is the mixture Gaussian likelihood for state $j$,
$$ b_j(o_t) = Pr(o_t|j) = \sum_{m=1}^{M} c_{jm} N(o_t, \mu_{jm}, R_{jm}), \tag{11.9} $$
which is a weighted sum of all Gaussian components at state $j$, and $M$ is the total number of components. $\mu_{jm}$ and $R_{jm}$ are the mean vector and covariance matrix of the $m$th component at state $j$.
Scores for Classification
The HMM recognition is based on the log-likelihood scores calculated from the speech portion of the recording [14, 15]. We use SD HMMs to compute the scores, with word segmentations and labels provided by SI HMMs. For a sequence of $T$ feature vectors for a word $W$, $O = \{o_1, o_2, \ldots, o_T\}$, the likelihood of the sequence given the model for word $W$ and speaker $I$, $\lambda_{WI}$, is
$$ Pr\{O|\lambda_{WI}\} = \max_{\{s_t(W)\}} \Big\{ \prod_{t=1}^{T} a_{s_t s_{t+1}} b_{s_t}(o_t) \Big\}, \tag{11.10} $$
where $\max_{\{s_t(W)\}}$ implies a Viterbi search to obtain the optimal segmentation of the vectors into states [6], and $a_{s_t s_{t+1}}$ is the state-transition probability from state $s_t$ to state $s_{t+1}$. In this implementation, all state-transition probabilities are set to 0.5 for verification, so they play no role in classification. $b_j$ is the mixture Gaussian likelihood for state $j$. It is defined as
$$ b_j(o_t) = Pr(o_t|j) = \sum_{m=1}^{M} c_{jm} N(o_t, \mu_{jm}, R_{jm}), \tag{11.11} $$
which is a weighted sum of all Gaussian components for state $j$. $M$ is the total number of components. $\mu_{jm}$ and $R_{jm}$ are the mean vector and covariance matrix of the $m$th component at state $j$. Subscripts for $W$ and $I$ are omitted from $a$ and $b$ for clarity. Thus, the verification score using the model of word $W$ for Speaker $I$ is calculated by
$$ S\{O|\lambda_{WI}\} = \frac{1}{T} \max_{s_t(W)} \Big\{ \sum_{t=1}^{T} \log b_{s_t}(o_t) \Big\}, \quad j = 1, \ldots, J. \tag{11.12} $$
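The word-level score of (11.11) and (11.12) can be illustrated with a small Viterbi search in plain Python. The sketch below rests on several assumptions that are not stated in the text: diagonal covariance matrices, a left-to-right topology in which each frame either stays in the current state or advances by one state, a path that starts in the first state and ends in the last, and at least as many frames as states. Transition probabilities are left out because they are constant in this implementation. The names `log_mixture` and `word_score` are hypothetical.

```python
import math
from typing import Sequence, Tuple

Frame = Sequence[float]
# One HMM state: (mixture weights c_jm, component means, component variances).
State = Tuple[Sequence[float], Sequence[Frame], Sequence[Frame]]


def log_gaussian_diag(o: Frame, mean: Frame, var: Frame) -> float:
    """Log density of a Gaussian with diagonal covariance (an assumption)."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(o, mean, var))


def log_mixture(o: Frame, state: State) -> float:
    """log b_j(o_t) of (11.11), computed with the log-sum-exp trick."""
    weights, means, variances = state
    terms = [math.log(c) + log_gaussian_diag(o, mu, var)
             for c, mu, var in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def word_score(frames: Sequence[Frame], states: Sequence[State]) -> float:
    """Average log likelihood along the best state path, as in (11.12)."""
    n_states = len(states)
    neg_inf = float("-inf")
    best = [neg_inf] * n_states
    best[0] = log_mixture(frames[0], states[0])  # paths start in the first state
    for o in frames[1:]:
        prev = best
        best = [neg_inf] * n_states
        for j in range(n_states):
            # Either stay in state j or arrive from state j - 1.
            reach = max(prev[j], prev[j - 1] if j > 0 else neg_inf)
            if reach > neg_inf:
                best[j] = reach + log_mixture(o, states[j])
    return best[-1] / len(frames)  # paths end in the last state
```

A usage example with two hypothetical one-dimensional, single-component states: `word_score([[0.0], [0.0], [5.0], [5.0]], [([1.0], [[0.0]], [[1.0]]), ([1.0], [[5.0]], [[1.0]])])` scores frames that match the state order, and the score drops when the frame order is reversed.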
Cohort Normalization Scores
The concept of cohort normalization comes from the Neyman-Pearson classification rule. It brings discriminant analysis into the HMM classification:
$$ \frac{Pr\{O|\lambda_I\}}{Pr\{O|\lambda_{\bar{I}}\}} > 1, \tag{11.13} $$
where $Pr\{O|\lambda_I\}$ is the likelihood score associated with the utterance $O$ compared with the model for speaker $I$, and $Pr\{O|\lambda_{\bar{I}}\}$ is the likelihood from speaker models other than $I$.
Applying the rule to log-likelihood scores, we have the cohort-normalized score for an utterance $O$ as
$$ S_{HMM}\{O|I\} = S\{O|\lambda_I\} - \frac{1}{K} \sum_{k=1, k \neq I}^{K} S\{O|\lambda_k\}, \tag{11.14} $$
where we associate $S\{O|\lambda_I\}$ with $\log Pr\{O|\lambda_I\}$ and $\frac{1}{K} \sum_{k=1, k \neq I}^{K} S\{O|\lambda_k\}$ with $\log Pr\{O|\lambda_{\bar{I}}\}$. The models $\lambda_1$ to $\lambda_K$ are a group of selected cohort models. They are similar in some sense to $\lambda_I$. The similarity measurement was defined in [14].
11.3.3 Training of the Data Fusion Layer
The final decision on a given utterance $O$, $d(O)$, is made by combining the NDA score $S_{NDA}(O)$ and the HMM score $S_{HMM}(O)$ and applying a hard-limiter threshold:
$$ S = v_1 S_{NDA}(O) + v_2 S_{HMM}(O), \tag{11.15} $$
$$ d(O) = \begin{cases} 1, & S > \theta, \text{ true speaker;} \\ 0, & S \le \theta, \text{ impostor,} \end{cases} \tag{11.16} $$
where $v_1$ and $v_2$ are weight values trained by LDA in the same way that $w$ in (11.1) and (11.2) is determined, with $X_T$, $m_T$ and $X_I$, $m_I$ replaced by the HMM and NDA scores and their associated means. The scores are obtained from a group of speakers not used in the evaluation. This is an SI output node of the HSV system.
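The scoring chain of this section, from word-level NDA scores to the final accept or reject decision, can be sketched as below. It is only an illustration of (11.7), (11.14), (11.15) and (11.16): the function names are hypothetical, unit word weights are used by default, and the fusion weights and threshold in the usage lines are made-up values, not trained ones.

```python
from typing import Optional, Sequence


def nda_utterance_score(word_scores: Sequence[float],
                        word_weights: Optional[Sequence[float]] = None) -> float:
    """Weighted average of word-level NDA scores, as in (11.7)."""
    if word_weights is None:
        word_weights = [1.0] * len(word_scores)  # unit weights u_k = 1
    return sum(u * p for u, p in zip(word_weights, word_scores)) / len(word_scores)


def cohort_normalized_score(target_score: float,
                            cohort_scores: Sequence[float]) -> float:
    """Target log-likelihood score minus the mean cohort score, as in (11.14)."""
    return target_score - sum(cohort_scores) / len(cohort_scores)


def fuse_and_decide(s_nda: float, s_hmm: float,
                    v1: float, v2: float, theta: float) -> int:
    """Linear fusion (11.15) followed by the hard-limiter decision (11.16)."""
    s = v1 * s_nda + v2 * s_hmm
    return 1 if s > theta else 0  # 1: true speaker, 0: impostor


# Hypothetical scores for a 4-digit utterance and five cohort models.
s_nda = nda_utterance_score([1.0, 0.5, -0.5, 1.0])
s_hmm = cohort_normalized_score(-40.0, [-45.0, -44.0, -46.0, -43.0, -47.0])
decision = fuse_and_decide(s_nda, s_hmm, v1=0.5, v2=0.5, theta=1.0)
```

Keeping the three steps as separate functions mirrors the module structure of the HSV system, so either classifier can also be evaluated on its own.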
11.4 Speaker Verification Experiments
In this section, we introduce the database and our experimental results for a randomly prompted speaker verification task.
11.4.1 Experimental Database
The database consists of approximately 6000 connected-digit utterances recorded over dialed-up telephone lines. The vocabulary includes eleven words: the digits “0” through “9” plus “oh”. The database is partitioned into four subsets as shown in Table 11.1. There are 43 speakers in Roster A and 42 in Roster B. For each speaker, there are eleven 5-digit utterances designated for training, recorded in a single session from a single channel, in A_s and B_s. These utterances are designed to have each digit appear five times in different contexts. Each speaker has a group of test utterances in A_m and B_m. These utterances are recorded over a series of sessions with a variety of handsets and channels. The test utterances in A_m and B_m are either fixed 9-digit utterances or randomly selected 4-digit utterances.

Table 11.1. Segmentation of the Database
                      Roster A   Roster B
Training utterances   A_s        B_s
Test utterances       A_m        B_m

An SI HMM-based digit recognizer [15] is used to segment each utterance into words (digits) and to generate raw feature vectors. In the digit recognizer, 10th-order autocorrelation vectors are analyzed over a 45 ms window shifted every 15 ms through the utterance. Each set of autocorrelation coefficients is converted, via linear predictive coding (LPC) coefficients, to a set of 12 cepstral coefficients. These cepstral coefficients are further augmented by a set of 12 delta cepstral coefficients calculated over a five-frame window of cepstral coefficients. Each “raw” data vector has 24 elements consisting of the 12 cepstral coefficients and the 12 delta cepstral coefficients [15].
11.4.2 NDA System Results
Experiments were conducted first to test the NDA classifier. The SI HMMs used to obtain the NDA features as in (11.6) were trained from a distinct database of connected-digit utterances. These HMMs have six states for the words “0” through “9” and five states for the word “oh”. Each state has 16 Gaussian components, so for a six-state HMM, the NDA features have 6 × 16 = 96 elements. For each true speaker in Roster A, $R_T$ in (11.3) was calculated using utterances from A_s; $R_I$ was obtained from B_s, $R_{CT}$ from both B_s and B_m, and $R_{CI}$ from B_m. The results are not very sensitive to the $\gamma$ and $\delta$ parameters for these data sets. To calculate (11.7), we use $u_{k_i} = 1$ due to a lack of training data. The results in terms of averaged individual equal-error rates (EERs) are listed in Table 11.2. An EER of 6.13% was obtained with NDA using both score normalization (11.5) and pooled covariance matrices (11.3). With only score normalization (11.5), the EER is 10.12%. Without score normalization and compensating covariance matrices (as in [16, 17]), the EER was 18.18%. The NDA techniques provided an 82.78% improvement.
Table 11.2. Results on Discriminant Analysis
Algorithms             Scores         Cov. Matrices   EER %
NDA                    Normalized     Pooled           6.13
NDA                    Normalized     Unpooled        10.12
LDA (as in [16, 17])   Unnormalized   Unpooled        18.18
1,514 true speaker utterances; 23,730 impostor utterances

11.4.3 Hybrid Speaker-Verification System Results
For the Type 1 classifier, SD HMMs were trained using the utterances in A_s. Five cohort models were constructed from utterances in Roster B. The utterances in A_m were used for testing. The HMM scores in Table 11.3 were obtained from the experiments in [15]. The NDA scores were obtained from the current experiments. To obtain the common weight values $v_1$ and $v_2$ in (11.15) for all speakers, both Type 1 and Type 2 classifiers were trained using the data set B_s. Then $v_1$ and $v_2$ were formed by LDA using the output scores from the data set B_m. The major results are listed in Table 11.3.

Table 11.3. Major Results
                Equal-error rates (%)
Systems         Mean    Median
HSV with NDA    4.32    3.14
HMM-cohort      5.30    4.35
HMM             9.41    7.42
NDA             8.68    8.15
1,514 true speaker utterances; 11,620 impostor utterances

The HSV system reduced the verification EERs by 18.43% (mean) and 27.88% (median) relative to the HMM classifiers with cohort normalization. With respect to storage requirements, the HMM classifier needs 51.56 Kb of space per speaker for model parameters. The NDA classifier needs 4 [(96 − 1) × 10 + (80 − 1)] = 4.116 Kb of storage space per speaker, so the HSV system needs only slightly more storage than the HMM system.
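The storage figure for the NDA classifier can be checked with a few lines of Python. The sketch reads the leading factor 4 in the expression as the number of bytes per stored parameter, which is an assumption; the vector lengths (ten words with 96-element features and one word with 80-element features) follow from the six-state and five-state HMMs with 16 components per state. The names `BYTES_PER_PARAMETER` and `nda_storage_bytes` are hypothetical.

```python
from typing import Sequence

BYTES_PER_PARAMETER = 4  # assumption: the leading factor 4 in the text


def nda_storage_bytes(feature_lengths: Sequence[int]) -> int:
    """Storage for one speaker: (length - 1) parameters per word classifier."""
    return BYTES_PER_PARAMETER * sum(n - 1 for n in feature_lengths)


# Ten six-state words (6 x 16 = 96) and one five-state word (5 x 16 = 80).
FEATURE_LENGTHS = [6 * 16] * 10 + [5 * 16]
total_bytes = nda_storage_bytes(FEATURE_LENGTHS)
print(total_bytes / 1000.0)  # storage per speaker in thousands of bytes
```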
11.5 Conclusions
In this chapter, we introduced randomly prompted speaker verification through the design of a real system. In our experiments, the NDA technique showed an 82.78% relative improvement in performance over the classifier using Fisher’s LDA. Furthermore, when NDA is used in a hybrid speaker verification system combining information from speaker-dependent and speaker-independent models, speaker verification performance improved by a relative 18% compared to the HMM classifier with cohort normalization.
From the author’s experience, whenever we demonstrate a speaker verification system, someone in the audience always asks the following question: if an impostor pre-records a user’s pass-phrase and plays it back in front of a speaker verification system, will the impostor be accepted by the system? We now have two answers to the question. First, a fixed-phrase speaker verification system with special techniques can prevent impostors from using pre-recorded speech to break the system. Second, the randomly prompted speaker verification system introduced in this chapter can fully address this security concern.
We note that although the system introduced here uses connected digits as the pass-phrases, the proposed algorithm can be extended to connected words or sentences; however, the training utterances need to be very well designed to cover all phonemes that will appear in the randomly prompted pass-phrases.
The LDA used in this chapter is a discriminative training algorithm; however, its objective is to separate the classes as much as possible. The LDA objective does not directly optimize system performance, for example by minimizing error rates. In the next two chapters, we will discuss discriminative training objectives and algorithms that can optimize speaker recognition performance.
References
1. Chair, Z. and Varshney, P. K., “Optimal data fusion in multiple sensor detection systems,” IEEE Transactions on Aerospace and Electronic Systems, vol. AES-22, pp. 98–101, January 1986.
2. Duda, R. O. and Hart, P. E., Pattern Classification and Scene Analysis. New York: John Wiley & Sons, 1973.
3. Farrell, K. R., Mammone, R. J., and Assaleh, K. T., “Speaker recognition using neural networks and conventional classifiers,” IEEE Transactions on Speech and Audio Processing, vol. 2, Part II, January 1994.
4. Fisher, R. A., “The statistical utilization of multiple measurements,” Annals of Eugenics, vol. 8, pp. 376–386, 1938.
5. Johnson, R. A. and Wichern, D. W., Applied Multivariate Statistical Analysis. New Jersey: Prentice Hall, 1988.
6. Lee, C.-H. and Rabiner, L. R., “A frame-synchronous network search algorithm for connected word recognition,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 37, pp. 1649–1658, November 1989.
7. Li, Q., Parthasarathy, S., Rosenberg, A. E., and Tufts, D. W., “Normalized discriminant analysis with application to a hybrid speaker-verification system,” in IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), May 1996.
8. Li, Q. and Tufts, D. W., “Improving discriminant neural network (DNN) design by the use of principal component analysis,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit MI), pp. 3375–3379, May 1995.
9. Li, Q. and Tufts, D. W., “Synthesizing neural networks by sequential addition of hidden nodes,” in Proceedings of the IEEE International Conference on Neural Networks, (Orlando FL), pp. 708–713, June 1994.
10. Li, Q., Tufts, D. W., Duhaime, R., and August, P., “Fast training algorithms for large data sets with application to classification of multispectral images,” in Proceedings of the IEEE 28th Asilomar Conference, (Pacific Grove), October 1994.
11. Liou, H. S. and Mammone, R. J., “A subword neural tree network approach to text-dependent speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Detroit MI), pp. 357–360, May 1995.
12. Liu, C. S., Lee, C.-H., Chou, W., Juang, B.-H., and Rosenberg, A. E., “A study on minimum error discriminative training for speaker recognition,” Journal of the Acoustical Society of America, vol. 97, pp. 637–648, January 1995.
13. Netsch, L. P. and Doddington, G. R., “Speaker verification using temporal decorrelation post-processing,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, 1992.
14. Rosenberg, A. E. and DeLong, J., “HMM-based speaker verification using a telephone network database of connected digit utterances,” Technical Memorandum BL01126-931206-23TM, AT&T Bell Laboratories, December 1993.
15. Rosenberg, A. E., DeLong, J., Lee, C.-H., Juang, B.-H., and Soong, F. K., “The use of cohort normalized scores for speaker verification,” in Proceedings of the International Conference on Spoken Language Processing, (Banff, Alberta, Canada), pp. 599–602, October 1992.
16. Setlur, A. R., Sukkar, R. A., and Gandhi, M. B., “Speaker verification using mixture likelihood profiles extracted from speaker independent hidden Markov models,” submitted to the International Conference on Acoustics, Speech, and Signal Processing, 1996.
17. Sukkar, R. A., Gandhi, M. B., and Setlur, A. R., “Speaker verification using mixture decomposition discrimination,” Technical Memorandum NQ8320300950130-01TM, AT&T Bell Laboratories, January 1995.
18. Tufts, D. W. and Li, Q., “Principal feature classification,” in Neural Networks for Signal Processing V, Proceedings of the 1995 IEEE Workshop, (Cambridge MA), August 1995.
Chapter 12 Objectives for Discriminative Training
The first step in discriminative training is to define an objective function. In this chapter, the relations among a class of discriminative training objectives are derived through theoretical analysis. The objectives selected for our discussion are the minimum classification error (MCE), maximum mutual information (MMI), minimum error rate (MER), and generalized minimum error rate (GMER). The author’s analysis shows that all these objectives can be related to both minimum error rates and maximum a posteriori probability [10]. In theory, the MCE and GMER objectives are more general and flexible than the MMI and MER objectives, and MCE and GMER go beyond Bayesian decision theory. The results and the analytical methods used in this chapter can help in judging and evaluating discriminative objectives, and in defining new objectives for different tasks and better performance. We note that although our discussion is based on applications in speaker recognition, the analysis can be extended to speech recognition tasks.
12.1 Introduction
In previous chapters, we have applied the expectation-maximization (EM) and linear discriminant analysis (LDA) algorithms in training acoustic models. It has been reported that discriminative training techniques provide significant improvements in recognition performance compared to the traditional maximum-likelihood (ML) objective in speech and speaker recognition as well as in language processing. Those discriminative objectives include the minimum classification error (MCE) [6, 7], maximum mutual information (MMI) [1, 15, 16], minimum error rate (MER) [3], and a recently proposed generalized minimum error rate (GMER) objective [13, 12, 11], as well as other related and new versions. The most important task in these discriminative training algorithms is to define the objective function. Understanding the existing discriminative
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_12, © Springer-Verlag Berlin Heidelberg 2012
objectives will help in defining new objectives for different applications and in achieving better performance. Among those objectives, the MCE and MMI objectives have been used for years, and both have shown better performance than the ML objective in speech and speaker recognition experiments (e.g. [5, 16, 8, 19]). Consequently, research has been conducted to compare their performance through experiments or theoretical analysis (e.g. [17, 18, 14]). However, the experimental comparisons are limited to particular tasks, and the results do not help in understanding the theory; on the other hand, the previous theoretical analyses are not conclusive or adequate to show the relations among those objectives. In this chapter, we intend to derive the relations among the discriminative objectives theoretically and conclusively, without bias toward any particular task. The analytical method used here can also be applied to define, judge, evaluate, and compare other discriminative objectives. In the following sections, we will first review the relations between error rates and the a posteriori probability, and then derive the relations between the discriminative objectives and either the a posteriori probability or the error rates; finally, we will distinguish the relations among the different objectives.
12.2 Error Rates vs. Posterior Probability
In an $M$-class classification problem, we are asked to make a decision in order to identify a sequence of observations, $x$, as a member of a class, say, $C_i$. The true identity of $x$, say, $C_j$, is unknown, except in the design or training phase, in which observations of known identity are used as references for parameter optimization. We denote by $a_i$ the action of identifying an observation as class $C_i$. The decision is correct if $i = j$; otherwise, it is incorrect. It is natural to seek a decision rule that minimizes the probability of error, or, empirically, the error rate, which entails a zero-one loss function:
$$ L(a_i|C_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, M. \tag{12.1} $$
It assigns no loss to a correct decision and a unit loss to an error. The probabilistic risk of $a_i$ corresponding to this loss function is
\begin{align}
R(a_i|x) &= \sum_{j=1}^{M} L(a_i|C_j) P(C_j|x) \nonumber \\
&= \sum_{j \neq i} P(C_j|x) \tag{12.2} \\
&= 1 - P(C_i|x), \tag{12.3}
\end{align}
where $P(C_i|x)$ is the a posteriori probability that $x$ belongs to $C_i$. Thus, the zero-one loss function links the error rates to the a posteriori probability. To minimize the probability of error, one should therefore maximize the a posteriori probability $P(C_i|x)$. This is the basis of Bayes’ maximum a posteriori (MAP) decision theory, and is also referred to as minimum error rate (MER) [3] in an ideal setup.
We note that the a posteriori probability $P(C_i|x)$ is often modeled as $P_{\lambda_i}(C_i|x)$, a function defined by a set of parameters $\lambda_i$. Since the parameter set $\lambda_i$ has a one-to-one correspondence with $C_i$, we write $P_{\lambda_i}(C_i|x) = P(\lambda_i|x)$ and other similar expressions without ambiguity. If we consider all $M$ classes and all data samples, an objective for MER can be defined as:
$$ \max J(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k|x_{k,i}), \tag{12.4} $$
where $N_k$ is the total number of training data of class $k$, $N = \sum_{k=1}^{M} N_k$, and $x_{k,i}$ is the $i$th observation (one or a sequence of feature vectors) of class $k$. $\Lambda$ is a set of model parameters, $\Lambda = \{\lambda_k\}_{k=1}^{M}$.
Multilayer neural networks address a similar pattern classification task, and it has been shown that neural networks trained by backpropagation on a sum-squared error objective can approximate the true a posteriori probability in a least-squares sense [3]. In this chapter, we focus on the objectives that have been defined for speech and speaker recognition or other dynamic pattern-recognition problems.
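The link between the zero-one loss and the a posteriori probability can be made concrete with a short sketch. The code below evaluates the risk of (12.2) and (12.3), picks the MAP decision, and averages the posteriors of the correct classes as in (12.4). The function names and the posterior values are hypothetical and serve only as an illustration.

```python
from typing import Sequence


def zero_one_risk(posteriors: Sequence[float], i: int) -> float:
    """Risk of deciding class i under the zero-one loss, as in (12.2)."""
    return sum(p for j, p in enumerate(posteriors) if j != i)


def map_decision(posteriors: Sequence[float]) -> int:
    """Index of the class with the largest a posteriori probability."""
    return max(range(len(posteriors)), key=lambda j: posteriors[j])


def mer_objective(correct_posteriors: Sequence[Sequence[float]]) -> float:
    """J of (12.4); correct_posteriors[k][i] stands for P(lambda_k | x_{k,i})."""
    n = sum(len(per_class) for per_class in correct_posteriors)
    return sum(sum(per_class) for per_class in correct_posteriors) / n


# Hypothetical posteriors of one observation over M = 3 classes.
posteriors = [0.2, 0.5, 0.3]
best = map_decision(posteriors)
risk = zero_one_risk(posteriors, best)  # equals 1 - posteriors[best]
```

Because the risk of each decision is one minus its posterior, the decision with the largest posterior is also the one with the smallest risk.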
12.3 Minimum Classification Error vs. Posterior Probability

The MCE objective was derived through a systematic analysis of classification errors. It introduced a misclassification measure to embed the decision process in an overall MCE formulation. The derivation also required the misclassification measure to be continuous with respect to the classifier parameters. The empirical average cost, the typical objective in the MCE algorithm, was defined as [6]:

$$\min L(\Lambda) = \frac{1}{N} \sum_{i=1}^{N} \sum_{k=1}^{M} \ell_k(d_k(x_i); \Lambda) \, \mathbf{1}(x_i \in C_k) \tag{12.5}$$

where $M$ and $N$ are the total numbers of classes and training data, and $\Lambda = \{\lambda_k\}_{k=1}^{M}$. It can be rewritten as:

$$\min L(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \ell_k(d_k(x_{k,i}); \Lambda) \tag{12.6}$$
where $N_k$ is the total number of training data of class $k$, $N = \sum_{k=1}^{M} N_k$, and $x_{k,i}$ is the $i$th observation of class $k$. $\ell_k$ is a loss function, and a sigmoid function is often used for it:

$$\ell_k(d_k) = \frac{1}{1 + e^{-\zeta d_k + \alpha}}, \qquad \zeta > 0 \tag{12.7}$$

where $d_k$ is a class misclassification measure defined as:

$$d_k(x) = -g_k(x; \Lambda) + \left[ \frac{1}{M-1} \sum_{j, j \neq k} g_j(x; \Lambda)^{\eta} \right]^{1/\eta} \tag{12.8}$$

where $\eta > 0$, and $g_j(x; \Lambda)$, $j = 1, 2, \ldots, M$ and $j \neq k$, is a set of class conditional-likelihood functions. The second term is also called the Hölder norm. A practical class misclassification measure for hidden Markov model (HMM) training was defined as [5]:

$$d_k(x) = -g_k(x; \Lambda) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} \exp[\eta \, g_j(x; \Lambda)] \right]^{1/\eta} \tag{12.9}$$

where $x = x_{k,i}$, and the function $g(\cdot)$ is defined as [5]:

$$g_k(x; \Lambda) = \log p(x | \lambda_k). \tag{12.10}$$

Thus, we can rewrite the class misclassification measure in (12.9) as:

$$d_k(x) = -\log p(x | \lambda_k) + \log \left[ \frac{1}{M-1} \sum_{j \neq k} p(x | \lambda_j)^{\eta} \right]^{1/\eta}. \tag{12.11}$$

When $\eta = 1$, we have

$$d_k(x) = -\log \frac{p(x | \lambda_k)}{\sum_{j \neq k} \frac{1}{M-1} \, p(x | \lambda_j)}. \tag{12.12}$$

It can be further presented as:

$$d_k(x) = -\log \frac{p(x | \lambda_k) P_k}{\sum_{j \neq k} p(x | \lambda_j) P_j} \tag{12.13}$$

where $P_k = 1$ and $P_j = \frac{1}{M-1}$; they are similar to the a priori probability if we conduct a normalization. To facilitate our further comparison, we convert the minimization problem to a maximization problem. Let

$$\tilde{d}_k(x) = -d_k(x) \tag{12.14}$$

$$= \log \frac{p(x | \lambda_k) P_k}{\sum_{j, j \neq k} p(x | \lambda_j) P_j}, \tag{12.15}$$
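and substitute it into the sigmoid function in (12.7). Assuming $\zeta = 1$ and $\alpha = 0$, we have

$$\ell_k(\tilde{d}_k) = \frac{1}{1 + e^{-\tilde{d}_k}} \tag{12.16}$$

$$= \frac{p(x | \lambda_k) P_k}{p(x | \lambda_k) P_k + \sum_{j \neq k} p(x | \lambda_j) P_j} \tag{12.17}$$

$$= \frac{p(x | \lambda_k) P_k}{\sum_{j=1}^{M} p(x | \lambda_j) P_j}. \tag{12.18}$$

Thus, the MCE objective in (12.6) is simplified to:

$$\max \tilde{L}(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} \frac{p(x_{k,i} | \lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i} | \lambda_j) P_j} \tag{12.19}$$

$$= \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k | x_{k,i}). \tag{12.20}$$

This demonstrates that the MCE objective can be simplified to MER as defined in (12.4) and linked to the a posteriori probability if we make the following assumptions:

$$P_k = 1 \tag{12.21}$$
$$P_j = \frac{1}{M-1} \tag{12.22}$$
$$\eta = 1 \tag{12.23}$$
$$\zeta = 1 \tag{12.24}$$
$$\alpha = 0. \tag{12.25}$$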
Among the parameters, $P_k = 1$ and $P_j \leq 1$ imply that the MCE objective weights the true class higher than or equal to the competing classes. The parameter $\eta$ plays the role of the Hölder norm in (12.9). By changing $\eta$, the weights between the true class and competing classes can be further adjusted. The remaining parameters, $\zeta$ and $\alpha$, are related to the sigmoid function. $\alpha$ represents the shift of the sigmoid function. Since other parameters can play a similar role, $\alpha$ is usually set to zero. $\zeta$ is related to the slope of the sigmoid function. For different tasks and data distributions, different values of $\zeta$ can be selected to achieve the best performance. $\zeta$ is one of the most important parameters in the MCE objective, and it makes the MCE objective flexible and adjustable to different tasks and different data distributions.
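The reduction from the MCE loss to a posterior probability in (12.16)-(12.18) can be checked numerically. The following is a minimal sketch, not the MCE training algorithm itself: the function names and the toy likelihood values are hypothetical, and the likelihoods $p(x | \lambda_j)$ are assumed to be given.

```python
import math


def misclassification_measure(likelihoods, k, eta=1.0):
    """d_k(x) of Eq. (12.11), computed from class likelihoods p(x | lambda_j)."""
    m = len(likelihoods)
    competitors = sum(p ** eta for j, p in enumerate(likelihoods) if j != k) / (m - 1)
    return -math.log(likelihoods[k]) + math.log(competitors) / eta


def sigmoid_loss(d, zeta=1.0, alpha=0.0):
    """Sigmoid loss of Eq. (12.7)."""
    return 1.0 / (1.0 + math.exp(-zeta * d + alpha))


def posterior(likelihoods, k):
    """Right-hand side of Eq. (12.18) with P_k = 1 and P_j = 1/(M-1) for j != k."""
    m = len(likelihoods)
    weighted = [p if j == k else p / (m - 1) for j, p in enumerate(likelihoods)]
    return weighted[k] / sum(weighted)


# Hypothetical likelihoods of one token under three class models; the true class is 0.
likelihoods = [0.02, 0.005, 0.001]
d = misclassification_measure(likelihoods, 0)   # eta = 1
simplified = sigmoid_loss(-d)                   # zeta = 1, alpha = 0, applied to -d_k
reference = posterior(likelihoods, 0)
flatter = sigmoid_loss(-d, zeta=0.5)            # a smaller slope gives a softer score
```

Under the assumptions (12.21)-(12.25), `simplified` and `reference` agree up to rounding; changing `zeta` or `eta` moves the score away from the plain posterior, which is the flexibility discussed above.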
12.4 Maximum Mutual Information vs. Minimum Classification Error

The objective of MMI was defined in [1] as:

$$I(k) = \log \frac{p(x_{k,i} | \lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i} | \lambda_j) P_j}. \tag{12.26}$$
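In [16], the MMI objective was presented in the form:

$$r(\Lambda) = \prod_{n=1}^{N} p(\lambda_n | x_n) \tag{12.27}$$

$$= \prod_{i=1}^{N_k} \frac{p(x_{k,i} | \lambda_k) \, p(\lambda_k)}{\sum_{j=1}^{M} p(x_{k,i} | \lambda_j) \, p(\lambda_j)}. \tag{12.28}$$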
As discussed in [16], MMI increases the a posteriori probability of the model corresponding to the training data; therefore, MMI relates to MER. If we consider all $M$ models and all data as in the above discussions, the complete objective for MMI training is:

$$\max I(\Lambda) = \sum_{k=1}^{M} \sum_{i=1}^{N_k} \log \frac{p(x_{k,i} | \lambda_k) P_k}{\sum_{j=1}^{M} p(x_{k,i} | \lambda_j) P_j} \tag{12.29}$$

$$= \sum_{k=1}^{M} \sum_{i=1}^{N_k} \log P(\lambda_k | x_{k,i}). \tag{12.30}$$

By applying a power series expansion and keeping the first-order term, we have

$$\max I(\Lambda) \approx \sum_{k=1}^{M} \sum_{i=1}^{N_k} \left( P(\lambda_k | x_{k,i}) - 1 \right). \tag{12.31}$$

Furthermore, since a constant does not affect the objective, the objective can be written as

$$\max \hat{I}(\Lambda) = \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k | x_{k,i}). \tag{12.32}$$

Since $N$ is a constant, the MMI objective in (12.30) is, under this first-order approximation, equivalent to:

$$\max \tilde{I}(\Lambda) = \frac{1}{N} \sum_{k=1}^{M} \sum_{i=1}^{N_k} P(\lambda_k | x_{k,i}) \tag{12.33}$$

which is equivalent to the simplified version of the MCE objective in (12.20) or the MER objective in (12.4). In other words, a procedure for optimizing MMI is equivalent to a procedure for optimizing MER, or the simplified version of MCE based on the above assumptions.
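The step from (12.30) to (12.31) uses the first-order expansion $\log P \approx P - 1$, which is tight only when the posterior is close to 1. A minimal numerical sketch, with hypothetical function names and posterior values, shows how the two objectives compare:

```python
import math


def mmi_objective(posteriors):
    """Sum of log posteriors of the true class, as in Eq. (12.30)."""
    return sum(math.log(p) for p in posteriors)


def mmi_first_order(posteriors):
    """First-order approximation of the same sum, as in Eq. (12.31)."""
    return sum(p - 1.0 for p in posteriors)


# Hypothetical true-class posteriors: well-classified tokens vs. ambiguous tokens.
confident = [0.99, 0.95, 0.90]
ambiguous = [0.50, 0.20]

gap_confident = abs(mmi_objective(confident) - mmi_first_order(confident))
gap_ambiguous = abs(mmi_objective(ambiguous) - mmi_first_order(ambiguous))
```

In this sketch the gap is small for the confident tokens and much larger for the ambiguous ones, so the equivalence between MMI and MER should be read as holding under the stated approximation.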
12.5 Generalized Minimum Error Rate vs. Other Objectives

In order to derive a set of closed-form formulas for fast discriminative training, in [12, 11], we defined a GMER objective as:

$$\max \tilde{J}(\Lambda) = \frac{1}{N} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n}) \tag{12.34}$$

where $\ell(d_{m,n}) = \frac{1}{1 + e^{-\zeta d_{m,n}}}$ is a sigmoid function, and

$$d_{m,n} = \log p(x_{m,n} | \lambda_m) P_m - L_m \log \sum_{j \neq m} p(x_{m,n} | \lambda_j) P_j, \tag{12.35}$$
where $0 < L_m \leq 1$ is a weighting scalar. Intuitively, $L_m$ represents the weighting between the true class $m$ and the competing classes $j \neq m$. When $L_m < 1$, the true class $m$ is more important than the competing classes. When $L_m = 1$, the true class and competing classes are equally important. The exact value of $L_m$ can be determined during estimation, based on the constraint that the estimated covariance matrices must be positive definite.

The sigmoid function plays a role similar to its role in the MCE objective. It provides different weights to different training data. For data that is hardly ambiguous in its classification, the weight is close to 0 (i.e., decisively wrong) or 1 (i.e., decisively correct); for data near the classification boundary, the weight is in between. The slope of the sigmoid function is controlled by the parameter $\zeta > 0$. Its value can be adjusted based upon the data distributions in specific tasks.

When $L_m = 1$ and $\zeta = 1$, $\tilde{J}$ in (12.34) is equivalent to MER in (12.4) and MMI in (12.33). The GMER objective in (12.34) is equivalent to the simplified version of the MCE objective in (12.20) if we have:

$$L_m = 1 \tag{12.36}$$
$$P_k = 1 \tag{12.37}$$
$$P_j = \frac{1}{M-1} \tag{12.38}$$
$$\zeta = 1 \tag{12.39}$$
$$\alpha = 0 \tag{12.40}$$
The new GMER objective is more general and flexible than both the MMI and MER objectives. It also retains the most important parameter, $\zeta$, from the MCE objective, and the function of the weighting parameter $\eta$ in MCE is replaced by $L_m$ in the GMER objective; thus, the GMER objective is as flexible as the MCE objective. The most important feature is that the GMER objective is concise; thus, we can derive a new set of closed-form formulas for fast parameter estimation for discriminative training [12, 11].
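As a minimal sketch of how the GMER objective in (12.34)-(12.35) is evaluated, the code below computes it for a few tokens and compares it with the average posterior when $L_m = 1$ and $\zeta = 1$. The function names, priors, and likelihood values are hypothetical, and the class likelihoods are assumed to be given.

```python
import math


def gmer_measure(likelihoods, priors, m, weight=1.0):
    """d_{m,n} of Eq. (12.35) for one token whose true class is m."""
    competing = sum(p * q for j, (p, q) in enumerate(zip(likelihoods, priors)) if j != m)
    return math.log(likelihoods[m] * priors[m]) - weight * math.log(competing)


def gmer_objective(tokens, priors, weight=1.0, zeta=1.0):
    """Eq. (12.34): mean sigmoid of d_{m,n}; tokens is a list of (true_class, likelihoods)."""
    scores = [1.0 / (1.0 + math.exp(-zeta * gmer_measure(lik, priors, m, weight)))
              for m, lik in tokens]
    return sum(scores) / len(scores)


def average_posterior(tokens, priors):
    """MER objective of Eq. (12.4) computed directly from Bayes' rule."""
    total = 0.0
    for m, lik in tokens:
        joint = [p * q for p, q in zip(lik, priors)]
        total += joint[m] / sum(joint)
    return total / len(tokens)


# Hypothetical priors and per-token likelihoods p(x | lambda_j) for three classes.
priors = [0.5, 0.3, 0.2]
tokens = [(0, [0.02, 0.005, 0.001]), (1, [0.004, 0.01, 0.002])]

gmer_default = gmer_objective(tokens, priors)              # L_m = 1, zeta = 1
mer_value = average_posterior(tokens, priors)
gmer_tuned = gmer_objective(tokens, priors, weight=0.8, zeta=0.8)
```

With the default settings the two values coincide up to rounding; other choices of `weight` and `zeta` give a different objective, which is the added flexibility of GMER.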
12.6 Experimental Comparisons

The above theoretical analysis is sufficient to prove relations among the different objectives. Unsurprisingly, the experimental results that we could find from different research sites also support the theoretical analysis. For a fair comparison, we cite the results from other sites in addition to our experiments.

In speaker verification, ML (maximum likelihood), MMI, and MCE objectives were compared using the NIST 1996 evaluation dataset by Ma and Chang [14]. There were 21 male target speakers and 204 male impostors. The reported relative equal-error-rate reductions compared to the ML objective were 3.2% and 7.0% for MMI and MCE, respectively.

In speech recognition, ML, MMI, and MCE objectives were compared using a common database by Reichl and Ruske [17]. It was found that both the MMI and MCE objectives can improve speech-recognition performance over the ML objective. The absolute error-rate reduction was 2.5% for the MMI objective versus 5.3% for the MCE objective.

In speaker identification, we compared the ML and GMER objectives using an 11-speaker group from the NIST 2000 dataset [11]. For the testing durations of 1, 5, and 10 seconds, the ML objective had error rates of 31.4%, 6.59%, and 1.39%, while the GMER objective had error rates of 26.81%, 2.21%, and 0.00%. The relative error-rate reductions were 14.7%, 66.5%, and 100%, respectively. For the best results, the weighting scalar $L_m$ was determined by the optimization algorithm during the iterations, and its value changed from iteration to iteration, but in every iteration $L_m \neq 1.0$. The slope of the sigmoid function for the best results was $\zeta = 0.8 \neq 1.0$. Based on our above analysis, these values imply that the GMER objective outperforms the MMI and MER objectives.
12.7 Discussion

The results from this chapter can also help in defining new objectives for parameter optimization. In general, in order to optimize a desired performance requirement, the defined objective must be related to the requirement as closely as possible, either mathematically or at least intuitively. For example, in order to minimize a recognition error rate, the objective should be defined to be related to the recognition error rate. If the requirement is to minimize an equal error rate, the objective should be defined to be related to the equal error rate. From this point of view, representing the desired objective, such as a particular error rate, is a necessary and sufficient condition in defining a useful objective, while the discriminative property is just a necessary condition because a discriminative objective may not necessarily relate to error rates. This concept can help in evaluating and judging objectives and predicting the corresponding performances. For example, if a likelihood ratio is used as an objective, it is discriminative but is not related to error rates because the ratio cannot be presented as the a posteriori probability; thus, the likelihood ratio cannot be used as a training objective.

Furthermore, optimization algorithms should also be considered when defining an objective. When an objective is simple, it may not be possible to represent the expected training objective exactly, but it may be easy to derive a fast and efficient training algorithm. On the other hand, when an objective is complicated, it can represent a training objective well, but one cannot easily derive a fast training algorithm. From this point of view, a parameter optimization algorithm should be considered when defining an objective.

Table 12.1. Comparisons on Training Algorithms

Objective   Optimization Algorithm   Learning Parameters   Determining Learning Parameters   Relation to Posterior Probability
ML          EM (closed form)         None                  -                                 Not same
MMI         Closed form              Yes                   Experiments                       Same
MCE         GPD/Gradient descent     Yes                   Experiments                       Extended
GMER        Closed form              Yes                   Automatic                         Extended
12.8 Relations between Objectives and Optimization Algorithms

For pattern recognition or classification, objectives and optimization algorithms for parameter estimation are related to each other, and they both play important roles in solving real-world problems, in terms of recognition accuracy, speed of convergence, and time spent adjusting learning rates or other parameters. We summarize the factors of the discussed objectives in Table 12.1.

Regarding optimization methods, in general, closed-form re-estimation formulas, like those of the expectation-maximization (EM) algorithm in maximum likelihood estimation, are more efficient than gradient-descent approaches. However, not every objective has closed-form formulas. When an objective is complicated, such as the MCE objective, there is less chance of deriving closed-form formulas. Thus, many algorithms have to rely on gradient-descent methods. For the MMI objective, a closed-form parameter estimation algorithm was derived in [4] through an inequality; however, there is a constant D in the algorithm and the value of the constant needs to be pre-determined for parameter estimation. Like the learning rate in gradient-descent methods, it is difficult
to determine the value of D as reported in the literature [16]. The GMER algorithm was developed under our belief that, for the best performance in terms of recognition accuracy and training speed, the objective and optimization method should be developed jointly [12, 11]. The GMER's recognition accuracy is similar to that of MCE, while its training speed is close to that of the EM algorithm used in ML estimation. We will discuss the GMER algorithm in Chapter 13.

If we want to further investigate the differences between the MCE objective in (12.6) and the MMI objective in (12.30), the differences are mainly in the parameter set listed from (12.21) to (12.25). In theory, those parameters provide the flexibility to adjust the MCE objective for different recognition tasks and data distributions; therefore, the MCE objective is more general compared to the MMI and MER objectives. In practice, from many reported experiments, we know that some of the parameters can play an important role in recognition or classification performance. For example, $\eta$ and $\zeta$ can be adjusted to achieve better performance.
12.9 Conclusions

In recent years, the objectives of discriminative training algorithms have been extended from statically separating different data classes as in LDA to more specific or detailed tasks, such as minimizing classification errors (MCE) [6, 2, 20], generalized minimum error rate (GMER) [13], maximum mutual information (MMI) [1], maximizing the decision margins [21], and soft margin [9, 22].

In this chapter, we demonstrated that all four objectives which we have discussed for discriminative training in speaker recognition can be related to both minimum error rates and maximum a posteriori probability under some assumptions. While MMI and MER were directly defined for maximum a posteriori probability, the MCE and GMER objectives can be equivalent to maximum a posteriori probability under some assumptions and simplifications. While MCE was directly defined for minimum error rates, MMI, MER, and GMER can also be related to error rates through the zero-one loss function and some assumptions.

In real applications, the distribution of testing data is not exactly the same as the distribution of training data. MCE and GMER are more general and flexible: by adjusting the slope of the sigmoid function, MCE and GMER can weight the data near the decision boundary differently, and this property is not available in MMI and MER. Furthermore, by adjusting the weighting parameters for classes, MCE and GMER can weight classes differently instead of just weighting them by their a priori probabilities as in MMI and MER. From these points of view, MCE and GMER may have the potential to provide more robust recognition or classification performance on testing data and in real applications. Actually, the MCE and GMER
objectives go beyond traditional Bayes decision theory. The GMER objective will be discussed in detail in Chapter 13.
References

1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Chou, W., “Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition,” Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
3. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
4. Gopalakrishnan, P. S., Kanevsky, D., Nadas, A., and Nahamoo, D., “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Trans. on Information Theory, vol. 37, pp. 107–113, Jan. 1991.
5. Juang, B.-H., Chou, W., and Lee, C.-H., “Minimum classification error rate methods for speech recognition,” IEEE Trans. on Speech and Audio Process., vol. 5, pp. 257–265, May 1997.
6. Juang, B.-H. and Katagiri, S., “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
7. Katagiri, S., Lee, C.-H., and Juang, B.-H., “New discriminative algorithm based on the generalized probabilistic descent method,” in Proceedings of IEEE Workshop on Neural Network for Signal Processing, (Princeton), pp. 299–309, September 1991.
8. Korkmazskiy, F. and Juang, B.-H., “Discriminative adaptation for speaker verification,” in Proceedings of Int. Conf. on Spoken Language Processing, (Philadelphia), pp. 28–31, 1996.
9. Li, J., Yuan, M., and Lee, C. H., “Soft margin estimation of hidden Markov model parameters,” in Proc. ICSLP, pp. 2422–2425, 2007.
10. Li, Q., “Discovering relations among discriminative training objectives,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Montreal), p. 2004, May 2004.
11. Li, Q. and Juang, B.-H., “Fast discriminative training for sequential observations with application to speaker identification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Hong Kong), April 2003.
12. Li, Q. and Juang, B.-H., “A new algorithm for fast discriminative training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Orlando, FL), May 2002.
13. Li, Q. and Juang, B.-H., “Study of a fast discriminative training algorithm for pattern recognition,” IEEE Trans. on Neural Networks, vol. 17, pp. 1212–1221, Sept. 2006.
14. Ma, C. and Chang, E., “Comparison of discriminative training methods for speaker verification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. I–192 – I–195, 2003.
15. Nadas, A., Nahamoo, D., and Picheny, M. A., “On a model-robust training method for speech recognition,” IEEE Transactions on Acoust., Speech, Signal Processing, vol. 36, pp. 1432–1436, Sept. 1988.
16. Normandin, Y., Cardin, R., and Mori, R. D., “High-performance connected digit recognition using maximum mutual information estimation,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
17. Reichl, W. and Ruske, G., “Discriminant training for continuous speech recognition,” in Proceedings of Eurospeech, 1995.
18. Schluter, R. and Macherey, W., “Comparison of discriminative training criteria,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 493–497, 1998.
19. Siohan, O., Rosenberg, A. E., and Parthasarathy, S., “Speaker identification using minimum verification error training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Seattle), pp. 109–112, May 1998.
20. Siohan, O., Rosenberg, A., and Parthasarathy, S., “Speaker identification using minimum classification error training,” in Proc. IEEE Int. Conf. on Acoustic, Speech, and Signal Process, pp. 109–112, 1998.
21. Vapnik, V. N., The Nature of Statistical Learning Theory. NY: Springer, 1995.
22. Yin, Y. and Li, Q., “Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data,” in ICASSP 2011, 2011.
Chapter 13 Fast Discriminative Training
A good training algorithm for pattern recognition needs to satisfy two criteria. First, the objective function is associated with the desired performance, and second, the parameter estimation process derived from the objective is easy to compute using available computation resources and can converge in the required time. For example, the expectation-maximization (EM) algorithm guarantees convergence, but its objective is not to minimize the error rate, which is what most applications desire. On the other hand, many new objective functions are well defined and directly associated with the desired performance, but are often too computationally complicated and may not deliver the desired results in a reasonable amount of time. Therefore, for real applications, defining an objective and deriving an estimation algorithm is a joint design process. This chapter presents an example where a discriminative objective was defined together with its fast training algorithm.

Many discriminative training algorithms for nonlinear classifier designs are based on gradient-descent (GD) methods for parameter optimization. These algorithms are easy to derive and effective in practice, but are slow in training and make it difficult to select the learning rates. These drawbacks prevent them from meeting the needs of many speaker recognition applications. To address the problem, we present a fast discriminative training algorithm. The algorithm initializes the parameters with the EM algorithm, and then uses a set of closed-form formulas to further optimize an objective of minimizing the error rate. Experiments in speech applications show that the algorithm provides better recognition accuracy than the EM algorithm and much faster training speed than GD approaches. This work was originally reported by the author and Juang in [11].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_13, © Springer-Verlag Berlin Heidelberg 2012

13.1 Introduction

As we have discussed in Chapter 3 and Chapter 4, the construct of a pattern classifier can be linear, such as a single-layer perceptron, or nonlinear,
such as a multi-layer perceptron (MLP), a Gaussian mixture model (GMM), or a hidden Markov model (HMM) if the event to be recognized involves non-stationary signals. A linear classifier uses a hyperplane to partition the data space. A nonlinear classifier uses nonlinear kernels to model the data distribution or the a posteriori probability, and may be better matched to the statistical behavior of the data than a linear classifier [6]. Another approach is to use a set of hyperplanes to partition the data space as described in Chapter 3.

The parameters in the classifier or recognizer need to be optimized based on given and labeled data. General methodology for optimizing the classifier parameters falls into two broad classes: distribution estimation and discriminative training. The distribution-estimation approach to classifier training is based on Bayes decision theory, which suggests estimation of the data distribution as the first and most imperative step in the design of a classifier. The most commonly used criterion for distribution estimation is maximum likelihood (ML) estimation [6]. For the complex distributions used in many nonlinear classifiers, the EM algorithm for ML estimation [5] is, in general, very efficient because, while it is a hill-climbing algorithm, it guarantees a net gain in the optimization objective at every iteration, leading to uniform convergence to a solution.

The concept of discriminative training is simple. When training one classifier for one class, discriminative training also considers separating the trained class from other classes as much as possible. One example of discriminative training is the principal feature networks introduced in Chapter 3, which solve the discriminative training problem in a sequential procedure with data pruning.
In speech and speaker recognition, one often defines a cost function or objective commensurate with the performance of the pattern classification or recognition system, and then minimizes the cost function. As discussed in Chapter 12, the cost functions proposed for discriminative training include the conventional squared error used in the backpropagation algorithm [20, 23, 21], the minimum classification error (MCE) criterion in the generalized probabilistic descent (GPD) algorithm [8, 4], the maximum mutual information (MMI) criterion [1, 7, 17, 2], and other versions [15, 13, 14, 16]. Variations of the gradient-descent (GD) algorithm [3, 8, 19], as well as simulated and deterministic annealing algorithms [9], are used for optimizing these objectives. In general, these optimization algorithms converge slowly, particularly for large-scale or real-time problems such as automatic speech and speaker recognition, which employ HMMs with hundreds of GMMs and tens of thousands of parameters, and this hinders their application.

In this chapter, we show an approach of jointly defining a discriminative objective and deriving a parameter estimation algorithm. We define a generalized minimum error rate (GMER) objective for discriminative training, and thus name the algorithm fast GMER estimation. It is a batch-mode approach using an approximate closed-form solution for optimization.
13.2 Objective for Fast Discriminative Training

As the first step, we define an objective for generalized minimum error rate (GMER) estimation. When defining the objective, we required that, first, it is discriminative and, second, a fast algorithm can be derived from it.

In an $M$-class classification problem, we are asked to make a decision, to identify an observation $x$ as a member of a class, say, $C_i$. The true identity of $x$, say $C_j$, is not known, except in the design or training phase, in which observations of known identity are used as references for parameter optimization. We denote by $\alpha_i$ the action of identifying an observation as class $i$. The decision is correct if $i = j$; otherwise it is incorrect. It is natural to seek a decision rule that minimizes the probability of error, or, empirically, the error rate, which entails a zero-one loss function:

$$L(\alpha_i | C_j) = \begin{cases} 0, & i = j \\ 1, & i \neq j \end{cases} \qquad i, j = 1, \ldots, M. \tag{13.1}$$

The function assigns no loss to a correct decision, and assigns a unit loss to an error. The probabilistic risk of $\alpha_i$ corresponding to this loss function is

$$R(\alpha_i | x) = \sum_{j=1}^{M} L(\alpha_i | C_j) P(C_j | x) = 1 - P(C_i | x) \tag{13.2}$$
where $P(C_i | x)$ is the a posteriori probability that $x$ belongs to $C_i$. To minimize the probability of error, one should therefore maximize the a posteriori probability $P(C_i | x)$. This is the basis of Bayes' maximum a posteriori (MAP) decision theory, and is also referred to as minimum error rate (MER) [6]. The a posteriori probability $P(C_i | x)$ is often modeled as $P_{\lambda_i}(C_i | x)$, a function defined by a set of parameters $\lambda_i$. Since the parameter set $\lambda_i$ has a one-to-one correspondence with $C_i$, we write $P_{\lambda_i}(C_i | x) = P(\lambda_i | x)$ and other similar expressions without ambiguity. We further define an aggregate a posteriori (AAP) probability for the set of design samples $\{x_{m,n};\ n = 1, 2, \ldots, N_m,\ m = 1, 2, \ldots, M\}$:

$$J = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} P(\lambda_m | x_{m,n}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \frac{p(x_{m,n} | \lambda_m) P_m}{p(x_{m,n})} \tag{13.3}$$
where $x_{m,n}$ is the $n$th training token from class $m$, $N_m$ is the total number of tokens for class $m$, and $P_m$ is the corresponding prior probability. The above AAP objective for maximum a posteriori probability or minimum error rates can be further extended to a more general and flexible objective. We name it the GMER objective:
$$\max \tilde{J} = \max \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n}) = \max \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell_{m,n} \tag{13.4}$$

where

$$\ell(d_{m,n}) = \frac{1}{1 + e^{-\alpha d_{m,n}}} \tag{13.5}$$

is a sigmoid function, and

$$d_{m,n} = \log p(x_{m,n} | \lambda_m) P_m - \log \sum_{j \neq m} p(x_{m,n} | \lambda_j) P_j \tag{13.6}$$

represents a log probability ratio between the true class $m$ and the competing classes $j \neq m$. The sigmoid function can provide different weighting effects to different training data. For the data that has been well classified, the weighting is close to 1 or 0; for the data near the classification boundary, the weighting is near 0.5. The slope of the sigmoid function is controlled by the parameter $\alpha$, where $\alpha > 0$. Thus, the value of $\alpha$ can affect the training performance and convergence. By adjusting the value for different recognition tasks, the GMER objective can provide better performance than the MER objective.

We introduce a weighting scalar $L_m$ into (13.6) to ensure that the estimated covariance is positive definite, and let $0 < L_m \leq 1$:

$$d_{m,n} = \log p(x_{m,n} | \lambda_m) P_m - L_m \log \sum_{j \neq m} p(x_{m,n} | \lambda_j) P_j \tag{13.7}$$
For simplicity, we denote $L_m = L$. Intuitively, $L$ represents the weighting between the true class $m$ and the competing classes $j \neq m$. When $L < 1$, the true class is more important than the competing classes. When $L = 1$, the true class and competing classes are equally important. The range of the values of $L$ can be determined during estimation. When $L = 1$ and $\alpha = 1$, we have $\tilde{J} = J$.

For testing, based on the Bayes decision rule to minimize risk and average probability of error, for a given observation $x$, we should select the action or class $i$ that maximizes the posterior probability:

$$i = \arg\max_{1 \leq m \leq M} \{ P(\lambda_m | x) \}. \tag{13.8}$$

Since $p(x)$ in (13.3) is the same for all classes, we have

$$i = \arg\max_{1 \leq m \leq M} \{ p(x | \lambda_m) P_m \}. \tag{13.9}$$

Thus, although the training procedures can be different, the decision procedure is the same for both ML estimation and discriminative estimation. The decision boundary between classes $i$ and $j$ is the set of points that satisfies the condition $p(x | \lambda_i) P_i = p(x | \lambda_j) P_j$.
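The decision rule in (13.9) can be sketched in a few lines. The version below works in the log domain, which is a common numerical choice rather than something stated in the text; the function name and the log-likelihood values are hypothetical.

```python
import math


def classify(log_likelihoods, priors):
    """Pick the class m that maximizes log p(x | lambda_m) + log P_m, as in Eq. (13.9)."""
    scores = [ll + math.log(prior) for ll, prior in zip(log_likelihoods, priors)]
    return max(range(len(scores)), key=lambda m: scores[m])


# Hypothetical log-likelihoods of one test observation under three trained models.
log_likelihoods = [-12.0, -11.5, -14.0]

decision_with_priors = classify(log_likelihoods, [0.6, 0.2, 0.2])
decision_uniform = classify(log_likelihoods, [1 / 3, 1 / 3, 1 / 3])
```

The same function applies whether the models were trained by ML or by discriminative estimation; only the model parameters behind the log-likelihoods differ.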
13.3 Derivation of Fast Estimation Formulas

We now derive the fast estimation formulas for GMM parameter estimation based on the objective.

$$p(x_{m,n} | \lambda_m) = \sum_{i=1}^{I} c_{m,i} \, p(x_{m,n} | \lambda_{m,i}) \tag{13.10}$$

where $p(x_{m,n} | \lambda_m)$ is the mixture density, $p(x_{m,n} | \lambda_{m,i})$ is a component density, $c_{m,i}$ are the mixing parameters subject to $\sum_{i=1}^{I} c_{m,i} = 1$, and $I$ is the number of mixture components that constitute the conditional probability density. The parameters for the component density are a subset of the parameters of the mixture density, i.e., $\lambda_{m,i} \subset \lambda_m$. In most applications, the component density is defined as a Gaussian kernel:

$$p(x_{m,n} | \lambda_{m,i}) = \frac{1}{(2\pi)^{d/2} |\Sigma_{m,i}|^{1/2}} \exp\left( -\frac{1}{2} (x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i}) \right) \tag{13.11}$$
where $\mu_{m,i}$ and $\Sigma_{m,i}$ are respectively the mean vector and covariance matrix of the $i$th component of the $m$th GMM, $d$ is the dimension of the observation vectors, and $T$ represents the vector or matrix transpose.

Let $\nabla_{\theta_{m,i}} \tilde{J}$ be the gradient of $\tilde{J}$ with respect to $\theta_{m,i} \subset \lambda_{m,i}$. Making the gradient vanish for maximizing $\tilde{J}$, we have:

$$\nabla_{\theta_{m,i}} \tilde{J} = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n}) \nabla_{\theta_{m,i}} \log p(x_{m,n} | \lambda_{m,i}) - L \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}}) \nabla_{\theta_{m,i}} \log p(x_{j,\bar{n}} | \lambda_{m,i}) = 0 \tag{13.12}$$

where

$$\omega_{m,i}(x_{m,n}) = \ell_{m,n} (1 - \ell_{m,n}) \frac{c_{m,i} \, p(x_{m,n} | \lambda_{m,i}) P_m}{p(x_{m,n} | \lambda_m) P_m} \tag{13.13}$$

$$\bar{\omega}_{j,i}(x_{j,\bar{n}}) = \ell_{j,\bar{n}} (1 - \ell_{j,\bar{n}}) \frac{c_{m,i} \, p(x_{j,\bar{n}} | \lambda_{m,i}) P_m}{\sum_{k \neq j} p(x_{j,\bar{n}} | \lambda_k) P_k} \tag{13.14}$$

where $\ell$ is computed using (13.6) to represent the (unregulated) error rate. (This is deliberately set to separate, in concept, the influence of a token from the relative importance of various parameters of the classifier upon the performance of the classifier.) To find the solution to (13.12), we assume that $\omega_{m,i}$ and $\bar{\omega}_{j,i}$ can be approximated as constants around $\theta_{m,i}$. A discussion of this assumption can be found in [11].
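To make the token weights in (13.13) concrete, the following minimal sketch evaluates them for a one-dimensional GMM. Restricting to scalar observations, the function names, and all model parameters are assumptions made for illustration; the weights are evaluated at fixed parameters, in line with the constant-weight assumption above.

```python
import math


def gaussian_pdf(x, mean, var):
    """One-dimensional Gaussian component density, the d = 1 case of Eq. (13.11)."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def gmm_pdf(x, weights, means, variances):
    """Mixture density of Eq. (13.10)."""
    return sum(c * gaussian_pdf(x, mu, v) for c, mu, v in zip(weights, means, variances))


def true_class_weights(x, m, models, priors, L=1.0, alpha=1.0):
    """omega_{m,i}(x) of Eq. (13.13) for every component i of the true-class model m."""
    joint = [gmm_pdf(x, *model) * prior for model, prior in zip(models, priors)]
    competing = sum(q for j, q in enumerate(joint) if j != m)
    d = math.log(joint[m]) - L * math.log(competing)      # Eq. (13.7)
    ell = 1.0 / (1.0 + math.exp(-alpha * d))               # Eq. (13.5)
    weights, means, variances = models[m]
    mixture = gmm_pdf(x, weights, means, variances)
    return [ell * (1.0 - ell) * c * gaussian_pdf(x, mu, v) / mixture
            for c, mu, v in zip(weights, means, variances)]


# Hypothetical two-class problem; each model is (mixing weights, means, variances).
models = [([0.6, 0.4], [0.0, 2.0], [1.0, 0.5]),
          ([0.5, 0.5], [4.0, 6.0], [1.0, 1.0])]
priors = [0.5, 0.5]

far_from_boundary = true_class_weights(0.5, 0, models, priors)
near_boundary = true_class_weights(3.0, 0, models, priors)
```

Because of the factor $\ell(1 - \ell)$, the weights of a token sum to at most 0.25, and in this sketch the token near the class boundary receives a much larger total weight than the well-classified one.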
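13.3.1 Estimation of Covariance Matrices

From the Gaussian component in (13.11), we have

$$\log p(x_{m,n} | \lambda_{m,i}) = -\log\left[ (2\pi)^{d/2} |\Sigma_{m,i}|^{1/2} \right] - \frac{1}{2} (x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i}).$$

For optimization of the covariance matrix, we take the derivative with respect to the matrix $\Sigma_{m,i}$:

$$\nabla_{\Sigma_{m,i}} \log p(x_{m,n} | \lambda_{m,i}) = -\frac{1}{2} \Sigma_{m,i}^{-1} + \frac{1}{2} \Sigma_{m,i}^{-1} (x_{m,n} - \mu_{m,i})(x_{m,n} - \mu_{m,i})^T \Sigma_{m,i}^{-1} \tag{13.15}$$

where $\nabla_{\Sigma}$ is defined as a matrix operator

$$\nabla_{\Sigma} \equiv \left[ \frac{\partial}{\partial s_{i,j}} \right]_{i,j=1}^{d}, \tag{13.16}$$

where $s_{i,j}$ is an entry of the matrix $\Sigma$, and $d$ is the dimension of the observation vectors. Bringing (13.15) into (13.12) and rearranging the terms, we have:

$$\Sigma_{m,i} = \frac{\mathbf{A} - L \mathbf{B}}{D} \tag{13.17}$$

where

$$\mathbf{A} = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n}) (x_{m,n} - \mu_{m,i})(x_{m,n} - \mu_{m,i})^T \tag{13.18}$$

$$\mathbf{B} = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}}) (x_{j,\bar{n}} - \mu_{m,i})(x_{j,\bar{n}} - \mu_{m,i})^T \tag{13.19}$$

and

$$D = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n}) - L \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}}). \tag{13.20}$$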
Both $\mathbf{A}$ and $\mathbf{B}$ are matrices and $D$ is a scalar. For simplicity, we omit the subscripts $m, i$ for $\mathbf{A}$, $\mathbf{B}$, and $D$.

13.3.2 Determination of Weighting Scalar

The estimated covariance matrix, $\Sigma_{m,i}$, must be positive definite. We use this requirement to determine the upper bound of the weighting scalar $L$. Using the eigenvectors of $\mathbf{A}^{-1}\mathbf{B}$, we can construct an orthogonal matrix $\mathbf{U}$, such that (1) $\mathbf{A} - L\mathbf{B} = \mathbf{U}^T (\tilde{\mathbf{A}} - L\tilde{\mathbf{B}}) \mathbf{U}$, where both $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$ are diagonal, and (2) both $\mathbf{A} - L\mathbf{B}$ and $\tilde{\mathbf{A}} - L\tilde{\mathbf{B}}$ have the same eigenvalues. These claims have been proven in Theorems 1 and 2 in [11]. $L$ can then be determined as:

$$L < \min_{k=1}^{d} \left\{ \frac{\tilde{a}_k}{\tilde{b}_k} \right\}, \tag{13.21}$$

where $\tilde{a}_k > 0$ and $\tilde{b}_k > 0$ are the diagonal entries of $\tilde{\mathbf{A}}$ and $\tilde{\mathbf{B}}$, respectively. $L$ also needs to satisfy

$$D(L) > 0 \quad \text{and} \quad 0 < L \leq 1. \tag{13.22}$$
Thus, for the ith mixture component of model m, we can determine Lm,i . If model m has I mixtures, we need to determine one Lm to satisfy all mixture components in the model. Therefore, the upper bound of Lm is Lm ≤ min{Lm,1 , Lm,2 , . . . , Lm,i }.
(13.23)
In numerical computation, we need an exact number of L; therefore, we have Lm = η min{Lm,1 , Lm,2 , . . . , Lm,i }
(13.24)
where $0 < \eta \le 1$ is a pre-selected constant, which is much easier to determine than the learning rate in gradient-descent algorithms.

13.3.3 Estimation of Mean Vectors

We take the derivative of the Gaussian log-likelihood $\log p(x_{m,n}|\lambda_{m,i})$ given in Section 13.3.1 with respect to the vector $\mu_{m,i}$:

$$ \nabla_{\mu_{m,i}} \log p(x_{m,n}|\lambda_{m,i}) = \Sigma_{m,i}^{-1}(x_{m,n} - \mu_{m,i}), \tag{13.25} $$

where $\nabla_\mu$ is defined as a vector operator

$$ \nabla_\mu \equiv \left[\frac{\partial}{\partial \nu_i}\right]_{i=1}^{d}, \tag{13.26} $$

where $\nu_i$ is an entry of vector $\mu$, and $d$ is the dimension of the observation vectors. Substituting (13.25) into (13.12) and rearranging the terms, we obtain the solution for the mean vectors:

$$ \mu_{m,i} = \frac{E - LF}{D} \tag{13.27} $$

where

$$ E = \sum_{n=1}^{N_m} \omega_{m,i}(x_{m,n})\, x_{m,n} \tag{13.28} $$

$$ F = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \bar{\omega}_{j,i}(x_{j,\bar{n}})\, x_{j,\bar{n}} \tag{13.29} $$

and D is defined in (13.20). Again, for simplicity, we omit the subscripts m, i for E, F, and D. We note that both E and F are vectors, and the scalar L has already been determined when estimating $\Sigma_{m,i}$.
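The closed-form updates (13.17) and (13.27), together with the bound on L, can be sketched for a single mixture component as follows. This is a minimal illustration under stated assumptions, not the author's implementation: it assumes diagonal covariance matrices (so that A and B are already diagonal and the bound (13.21) reduces to a per-dimension ratio), it assumes the weights ω and ω̄ have already been computed from (13.13) and (13.14), it uses the current mean inside A and B, and it restricts η to values below 1 so that the strict inequalities hold. The function and argument names are hypothetical.

```python
def fast_gmer_update(true_tokens, true_w, comp_tokens, comp_w, mu, eta=0.5):
    """One closed-form update of (mean, diagonal covariance) for one component.

    true_tokens, true_w : vectors of the true class and their weights (omega)
    comp_tokens, comp_w : vectors of competing classes and their weights (omega-bar)
    mu                  : current mean vector of the component (used in A and B)
    eta                 : pre-selected constant; this sketch requires 0 < eta < 1
    """
    if not 0.0 < eta < 1.0:
        raise ValueError("this sketch requires 0 < eta < 1")
    d = len(mu)
    true_pairs = list(zip(true_tokens, true_w))
    comp_pairs = list(zip(comp_tokens, comp_w))

    # Diagonals of A (13.18) and B (13.19) under the diagonal-covariance assumption.
    A = [sum(w * (x[k] - mu[k]) ** 2 for x, w in true_pairs) for k in range(d)]
    B = [sum(w * (x[k] - mu[k]) ** 2 for x, w in comp_pairs) for k in range(d)]

    # Upper bound on L: a_k - L*b_k > 0 as in (13.21), D(L) > 0 and L <= 1 as in (13.22).
    sum_w, sum_wbar = sum(true_w), sum(comp_w)
    bounds = [1.0] + [A[k] / B[k] for k in range(d) if B[k] > 0]
    if sum_wbar > 0:
        bounds.append(sum_w / sum_wbar)
    L = eta * min(bounds)

    D = sum_w - L * sum_wbar                                        # (13.20)
    if D <= 0:
        raise ValueError("D must be positive")
    E = [sum(w * x[k] for x, w in true_pairs) for k in range(d)]    # (13.28)
    F = [sum(w * x[k] for x, w in comp_pairs) for k in range(d)]    # (13.29)

    new_var = [(A[k] - L * B[k]) / D for k in range(d)]             # (13.17)
    new_mu = [(E[k] - L * F[k]) / D for k in range(d)]              # (13.27)
    return L, new_mu, new_var
```

For a full model, (13.24) would take the minimum of the per-component bounds before scaling by η; the sketch applies the scaling to one component only to keep the example short.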
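13.3.4 Estimation of Mixture Parameters

The last step is to compute the mixture parameters $c_{m,i}$ subject to $\sum_{i=1}^{I} c_{m,i} = 1$. Introducing Lagrangian multipliers $\gamma_m$, we have

$$ \hat{J} = \tilde{J} + \sum_{m=1}^{M} \gamma_m \left( \sum_{i=1}^{I_m} c_{m,i} - 1 \right). \tag{13.30} $$

Taking the first derivative and making it vanish for maximization, we have

$$ \frac{\partial \hat{J}}{\partial c_{m,i}} = \frac{1}{c_{m,i}} D + \gamma_m = 0. \tag{13.31} $$

Rearranging the terms, we have

$$ c_{m,i} = -\frac{1}{\gamma_m} D. \tag{13.32} $$

Summing over $c_{m,i}$, for $i = 1, \ldots, I$, we can solve for $\gamma_m$ as

$$ \gamma_m = -(G - LH) \tag{13.33} $$

where

$$ G = \sum_{n=1}^{N_m} \sum_{i=1}^{I_m} \omega_{m,i}(c_i, x_{m,n}) \tag{13.34} $$

and

$$ H = \sum_{j \neq m} \sum_{\bar{n}=1}^{N_j} \sum_{i=1}^{I_j} \bar{\omega}_{j,i}(c_i, x_{j,\bar{n}}). \tag{13.35} $$

Substituting (13.33) into (13.32), we have

$$ c_{m,i} = \frac{D}{G - LH}. \tag{13.36} $$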
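The step from (13.32) to (13.33) relies on the sum-to-one constraint. As a hedged sketch of the intermediate algebra, write $D_{m,i}$ for the scalar D of component i (a notation introduced here only for illustration) and assume that $G - LH$ equals the sum of the $D_{m,i}$ over the components of model m, as the definitions (13.20), (13.34), and (13.35) suggest:

```latex
\begin{align*}
1 = \sum_{i=1}^{I} c_{m,i}
  &= -\frac{1}{\gamma_m} \sum_{i=1}^{I} D_{m,i}
   = -\frac{G - LH}{\gamma_m}
  && \text{(sum (13.32) over } i\text{)} \\
\Rightarrow\quad \gamma_m &= -(G - LH),
  \qquad
  c_{m,i} = \frac{D_{m,i}}{\sum_{i'=1}^{I} D_{m,i'}} = \frac{D_{m,i}}{G - LH}.
\end{align*}
```

Under that assumption, each mixture weight is simply its own D normalized by the total over the components, so the re-estimated weights sum to one by construction.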
13.3.5 Discussions

We have discussed the necessary conditions for optimization, i.e., $\nabla_{\theta_{m,i}} J = 0$. In theory, we also need to meet the following sufficient conditions: 1) $\nabla^2_{\theta_{m,i}} J < 0$, in order to ensure a maximum solution; and 2) $|\nabla_{\theta_{m,i}} \omega_{m,i}| \approx 0$ and $|\nabla_{\theta_{m,i}} \bar{\omega}_{j,i}| \approx 0$ around $\theta_{m,i}$. The second condition ensures that $\omega_{m,i}(\theta_{m,i})$ and $\bar{\omega}_{j,i}(\theta_{m,i})$ in (13.12) are approximately constant, so that the independence assumption is sound. Further discussion of the sufficient conditions is available in [11].

Also, in the above derivations, one observation or training token, x, represents one feature vector, and a decision is made based on the single feature vector. In speech and speaker recognition, one spoken phoneme can be represented by several feature vectors, and a short sentence can have over one hundred feature vectors. For such applications, decisions are usually made on a sequence of continually observed (extracted) feature vectors, and we have to make corresponding changes in the objective and estimation formulas. Given the nth observation of class m with a sequence of feature vectors, the observation (token) can be represented as:

$$ X_{m,n} = \{x_{m,n,q}\}_{q=1}^{Q_n} \tag{13.37} $$

where the nth token has a sequence of $Q_n$ feature vectors. To deal with this kind of problem, one usually assumes that the feature vectors are independent and identically distributed (i.i.d.); therefore, the probability or likelihood $p(X_{m,n}|\lambda_m)$ can be calculated as:

$$ p(X_{m,n}|\lambda_m) = \prod_{q=1}^{Q_n} p(x_{m,n,q}|\lambda_m). \tag{13.38} $$
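The GMER objective can be rewritten as:

$$ \max \tilde{J} = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell(d_{m,n}) = \frac{1}{M} \sum_{m=1}^{M} \sum_{n=1}^{N_m} \ell_{m,n} \tag{13.39} $$

where $\ell(d_{m,n}) = \dfrac{1}{1 + e^{-\alpha d_{m,n}}}$ is a sigmoid function, and

$$ d_{m,n} = \log p(X_{m,n}|\lambda_m) P_m - \log \sum_{j \neq m} p(X_{m,n}|\lambda_j) P_j \tag{13.40} $$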
represents a log probability ratio between the true class m and the competing classes $j \neq m$. Using the same method demonstrated above, readers can derive a set of re-estimation formulas, or refer to the results in [10]. We omit the derivations here since the procedures and results are very similar to those above.

Given the nth observation, a sequence of feature vectors $X_n$, the decision on the observation can be made as:

$$ i = \arg\max_{1 \le m \le M} \{P(X_n|\lambda_m) P_m\} \tag{13.41} $$

$$ \phantom{i} = \arg\max_{1 \le m \le M} \Big\{P_m \prod_{q=1}^{Q_n} p(x_{n,q}|\lambda_m)\Big\}. \tag{13.42} $$

In practice, the above computation is often conducted in the log-likelihood domain:

$$ i = \arg\max_{1 \le m \le M} \Big\{\log P_m + \sum_{q=1}^{Q_n} \log p(x_{n,q}|\lambda_m)\Big\}. \tag{13.43} $$
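The log-domain decision rule (13.43) can be illustrated with a short sketch. It assumes diagonal-covariance GMMs stored as lists of (weight, mean, variance) tuples, which is a representation chosen here for brevity rather than one prescribed by the text; all function names are hypothetical, and the returned class index is 0-based.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def log_gmm(x, gmm):
    """log p(x | lambda) for a GMM given as a list of (weight, mean, var)."""
    terms = [math.log(c) + log_gaussian_diag(x, mean, var) for c, mean, var in gmm]
    peak = max(terms)  # log-sum-exp trick for numerical stability
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


def classify_sequence(frames, gmms, priors):
    """Return (best class index, per-class scores) following (13.43)."""
    scores = [math.log(prior) + sum(log_gmm(x, gmm) for x in frames)
              for gmm, prior in zip(gmms, priors)]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


# Two toy one-dimensional classes with equal priors.
toy_gmms = [
    [(1.0, [0.0], [1.0])],
    [(0.5, [5.0], [1.0]), (0.5, [6.0], [1.0])],
]
toy_priors = [0.5, 0.5]
label, _ = classify_sequence([[0.1], [-0.2], [0.3]], toy_gmms, toy_priors)
```

Summing log-likelihoods instead of multiplying likelihoods avoids numerical underflow when a segment contains hundreds of feature vectors.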
13.4 Summary of Practical Training Procedure

For programming, the practical training procedure for the fast GMER estimation can be summarized as follows:

1. Initialize all model parameters for all classes by ML estimation.
2. For every mixture component i in model m, compute $\omega_{m,i}$ and $\bar{\omega}_{j,i}$ using (13.13) and (13.14), and compute A, B, and D using (13.18), (13.19), and (13.20).
3. Determine the weighting scalar L by (13.24).
4. For every mixture component i, compute $\Sigma_{m,i}$, $\mu_{m,i}$, and $c_{m,i}$ using (13.17), (13.27), and (13.36).
5. Evaluate the performance using (13.4) and (13.6) for model m. If the performance is improved, save the best model parameters.
6. Repeat Steps 4 and 5 if the performance is improved.
7. Repeat Steps 2 to 6 for the required number of iterations for model m.
8. Use the saved model for class m and repeat the above procedure for all untrained models.
9. Output the saved models for testing and applications.
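The control flow of the steps above can be sketched as a small driver function. This is only one possible reading of the procedure, not the author's code: the four callables (`compute_stats`, `choose_L`, `reestimate`, `evaluate`) are hypothetical placeholders for the formulas cited in each step, the models are assumed to be ML-initialized already (Step 1), and the cap `max_repeat` on Step 6 is an addition made here to guarantee termination.

```python
def fast_gmer_train(models, compute_stats, choose_L, reestimate, evaluate,
                    n_iter=2, max_repeat=10):
    """Control-flow sketch of Steps 2 to 9 for ML-initialized models."""
    models = list(models)                      # do not modify the caller's list
    for m in range(len(models)):               # Step 8: train one model at a time
        best, best_score = models[m], evaluate(models, m)
        for _ in range(n_iter):                # Step 7: required number of iterations
            stats = compute_stats(models, m)   # Step 2: weights and A, B, D
            L = choose_L(stats)                # Step 3: weighting scalar
            for _ in range(max_repeat):        # Step 6: repeat while improving
                candidate = reestimate(models[m], stats, L)          # Step 4
                trial = models[:m] + [candidate] + models[m + 1:]
                score = evaluate(trial, m)                           # Step 5
                if score <= best_score:
                    break
                best, best_score = candidate, score                  # save the best
                models[m] = candidate
            models[m] = best                   # continue from the saved parameters
    return models                              # Step 9


# Toy usage: each "model" is a single number pulled toward a hypothetical target.
targets = [3.0, 7.0]
trained = fast_gmer_train(
    [0.0, 10.0],
    compute_stats=lambda ms, m: targets[m],
    choose_L=lambda stats: 0.5,
    reestimate=lambda p, stats, L: p + L * (stats - p),
    evaluate=lambda ms, m: -abs(ms[m] - targets[m]),
)
```

Passing the formulas in as callables keeps the sketch independent of any particular model representation; a real implementation would fill them in with (13.13) to (13.36).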
13.5 Experiments

The fast GMER estimation has been applied to several pattern recognition, speech recognition, and speaker recognition projects. We selected two examples to present here.

13.5.1 Continuing the Illustrative Example

In Chapter 11, we used a two-dimensional data classification problem to illustrate the GMM classifier. Here, we continue the example and apply the GMER to the classification problem. The data distributions were of Gaussian-mixture types, with three components in each class. Each token was of two dimensions. For each class, 1,500 tokens were drawn from each of the three components; therefore, there were 4,500 tokens in total. The contours of the ideal distributions of classes 1, 2, and 3 are shown in Fig. 13.1, where the means are represented as +, ∗, and boxes, respectively.

In order to simulate real applications, we assumed that the number of mixture components is unknown. Therefore, we assumed that the GMM’s to be trained have two mixture components with full covariance matrices for each class. In the first step, ML estimation was applied to initialize the GMM’s with four iterations based on the training data drawn from the ideal models. The contours that represent the pdf’s of each of the GMM’s after ML estimation are plotted in Fig. 13.2.
Fig. 13.1. Contours of the pdf’s of 3-mixture GMM’s: the models are used to generate three classes of training data.

Fig. 13.2. Contours of the pdf’s of 2-mixture GMM’s: the models are from ML estimation using four iterations.
In the next step, we used the fast GMER estimation to further train new GMM’s with two iterations based on the parameters estimated by ML estimation. The contours representing the pdf’s of each of the new GMM’s after the GMER estimation are plotted in Fig. 13.3. All the contours in Figs. 13.2 and 13.3 are plotted on the same scale. From Fig. 13.2 and Fig. 13.3, we can observe that GMER training significantly reduced the overlaps among the three classes.

Fig. 13.3. Contours of the pdf’s of 2-mixture GMM’s: the models are from the fast GMER estimation with two iterations on top of the ML estimation results. The overlaps among the three classes are significantly reduced.

The decision boundaries of the three cases are plotted in Fig. 13.4. After GMER training, the boundaries from ML estimation shifted toward the decision boundaries from the ideal models. We note that both the ML and GMER models were trained from a limited set of training data drawn from the ideal model, and the shifted areas have high data density. This illustrates how GMER training improves classification accuracies.

Fig. 13.4. Enlarged decision boundaries for the ideal 3-mixture models (solid line), 2-mixture ML models (dashed line), and 2-mixture GMER models (dash-dotted line): after GMER training, the boundary of ML estimation shifted toward the decision boundary of the ideal models. This illustrates how GMER training improves decision accuracies.

We summarize the experimental results in Table 13.1. The testing data with 4,500 tokens for each class were obtained using the same methods as the
training data. The ML estimation provided an accuracy of 76.07% and 75.97% for the training and testing datasets, respectively, while the GMER estimation improved the accuracy to 76.26% and 76.70% after two iterations. The control parameters were set to η = 0.5 and α = 0.01. If we use the same model that generated the training data to do the testing, the ideal performances are 77.19% and 77.02% for training and testing. These ideal cases represent the ceiling of this example. To evaluate the behavior of the GMER estimation, we plot the training and testing accuracies at each iteration in Fig. 13.5. On testing, the relative improvement of the GMER estimation against the ceiling is significant.

[Figure 13.5: two panels plotting training accuracy (%) and testing accuracy (%) against the number of MER iterations (0 to 3), each with the ideal case marked; the plotted values are those listed in Table 13.1.]

Fig. 13.5. Performance improvement versus iterations using the GMER estimation: the initial performances were from the ML estimation with four iterations.

Table 13.1. Three-Class Classification Results of the Illustration Example

Algorithms    Iterations    Training Set    Testing Set
MLE           4             76.07%          75.97%
MER           1             76.18%          76.61%
MER           2             76.26%          76.70%
MER           3             76.48%          76.68%
Ideal Case                  77.19%          77.02%
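To put a number on the improvement relative to the ceiling, one plausible reading (an interpretation made here, not a definition given in the text) is the fraction of the gap between the ML baseline and the ideal case that GMER training recovers. The short sketch below applies that reading to the reported accuracies; the function name is hypothetical.

```python
def gap_closed(baseline, improved, ceiling):
    """Fraction of the baseline-to-ceiling accuracy gap recovered by `improved`."""
    return (improved - baseline) / (ceiling - baseline)


# Reported accuracies (%): ML baseline, GMER after two iterations, ideal case.
train_fraction = gap_closed(76.07, 76.26, 77.19)
test_fraction = gap_closed(75.97, 76.70, 77.02)
print(f"training: {100 * train_fraction:.1f}% of the gap closed")
print(f"testing:  {100 * test_fraction:.1f}% of the gap closed")
```

Under this reading, two GMER iterations recover roughly 17% of the remaining gap on the training set and roughly 70% on the testing set.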
Table 13.2. Comparison on Speaker Identification Error Rates

Algorithms                  Iterations     Test Length
                                           1 sec     5 sec     10 sec
ML Estimation               5              31.41%    6.59%     1.39%
GMER (Proposed)             MLE5 + MER1    26.81%    2.21%     0.00%
Relative Error Reduction                   14.65%    66.46%    100.00%
13.5.2 Application to Speaker Identification

We also used a text-independent speaker identification task to evaluate the GMER algorithm on sequentially observed feature vectors. The experiment included 11 speakers. Given a sequence of feature vectors extracted from a speaker’s voice, the task was to identify the true speaker from a group of 11 speakers. Each speaker had 60 seconds of training data and 30 to 40 seconds of testing data with a sampling rate of 8 kHz. These speakers were randomly picked from the 2000 NIST Speaker Recognition Evaluation Database. The speech data were first converted into 12-dimensional (12-D) Mel-frequency cepstral coefficients (MFCC’s) through a 30 ms window shifted every 10 ms [18]. Thus, for every 10 ms, we had one 12-D MFCC feature vector. The silence frames were then removed by a batch-mode endpoint detection algorithm [12]. The testing performance was evaluated based on segments of 1, 5, and 10 seconds of testing speech. The speech segment was constructed by moving a window of the length of 10, 50, or 100 vectors at every feature vector on the testing data collected sequentially. A detailed introduction to speaker identification and a typical ML estimation approach can be found in [18] and previous chapters.

We first constructed GMM’s with 8 mixture components for every speaker using the ML estimation. Each GMM was then further trained discriminatively using the sequential GMER estimation described above. During the test, for every segment, we computed the likelihood scores of all trained GMM’s in the selected test length. The speaker with the highest score was labeled as the owner of the segment.

The experimental results are listed in Table 13.2. For 1, 5, and 10 seconds of testing data, the sequential GMER algorithm had 14.65%, 66.46%, and 100.00% relative error rate reductions, respectively, compared to ML estimation, which is the most popular algorithm in speaker identification.
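The relative error reductions listed in Table 13.2 follow from the two rows of error rates. The sketch below recomputes them, assuming the usual definition (baseline error minus new error, divided by baseline error); the function and variable names are hypothetical.

```python
def relative_error_reduction(baseline_err, new_err):
    """Relative error-rate reduction in percent."""
    return 100.0 * (baseline_err - new_err) / baseline_err


# Error rates (%) from Table 13.2, keyed by test length.
ml_errors = {"1 sec": 31.41, "5 sec": 6.59, "10 sec": 1.39}
gmer_errors = {"1 sec": 26.81, "5 sec": 2.21, "10 sec": 0.00}

reductions = {length: relative_error_reduction(ml_errors[length], gmer_errors[length])
              for length in ml_errors}
for length, value in reductions.items():
    print(f"{length}: {value:.2f}% relative error reduction")
```

The recomputed values agree with the last row of the table to two decimal places.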
13.6 Conclusions

In this chapter, we show an example of jointly defining an objective function and its training algorithm, so that we can have a closed-form solution for fast estimation. Although the algorithm does not guarantee analytical convergence at each iteration, we empirically demonstrate that the GMER algorithm can train a classifier or recognizer in only a few iterations, much faster than gradient-descent-based methods, while also providing better recognition accuracy due to the generalization of the principle of error minimization. Our experimental results indicate that the fast GMER algorithm is efficient and effective.

So far, our discussion on discriminative training has focused on error-rate-related objectives. The recent work of the author and his colleague focuses on decision-margin-related objectives and uses a convex optimization approach to estimate the GMM parameters. That approach also achieved good performance. Interested readers are referred to [22] for more detailed information.
References

1. Bahl, L. R., Brown, P. F., de Souza, P. V., and Mercer, R. L., “Maximum mutual information estimation of hidden Markov model parameters for speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Tokyo), pp. 49–52, 1986.
2. Ben-Yishai, A. and Burshtein, D., “A discriminative training algorithm for hidden Markov models,” IEEE Trans. on Speech and Audio Processing, May 2004.
3. Bishop, C., Neural networks for pattern recognition. NY: Oxford Univ. Press, 1995.
4. Chou, W., “Discriminant-function-based minimum recognition error rate pattern-recognition approach to speech recognition,” Proceedings of the IEEE, vol. 88, pp. 1201–1222, August 2000.
5. Dempster, A. P., Laird, N. M., and Rubin, D. B., “Maximum likelihood from incomplete data via the EM algorithm,” Journal of the Royal Statistical Society, vol. 39, pp. 1–38, 1977.
6. Duda, R. O., Hart, P. E., and Stork, D. G., Pattern Classification, Second Edition. New York: John Wiley & Sons, 2001.
7. Gopalakrishnan, P. S., Kanevsky, D., Nadas, A., and Nahamoo, D., “An inequality for rational functions with applications to some statistical estimation problems,” IEEE Trans. on Information Theory, vol. 37, pp. 107–113, Jan. 1991.
8. Juang, B.-H. and Katagiri, S., “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, vol. 40, pp. 3043–3054, December 1992.
9. Kirkpatrick, S., Gelatt, C. D., Jr., and Vecchi, M. P., “Optimization by simulated annealing,” Science, vol. 220, pp. 671–680, 1983.
10. Li, Q. and Juang, B.-H., “Fast discriminative training for sequential observations with application to speaker identification,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Hong Kong), April 2003.
11. Li, Q. and Juang, B.-H., “Study of a fast discriminative training algorithm for pattern recognition,” IEEE Trans. on Neural Networks, vol. 17, pp. 1212–1221, Sept. 2006.
12. Li, Q., Zheng, J., Tsai, A., and Zhou, Q., “Robust endpoint detection and energy normalization for real-time speech and speaker recognition,” IEEE Trans. on Speech and Audio Processing, vol. 10, pp. 146–157, March 2002.
13. Markov, K. and Nakagawa, S., “Discriminative training of GMM using a modified EM algorithm for speaker recognition,” in Proc. ICSLP, 1998.
14. Markov, K., Nakagawa, S., and Nakamura, S., “Discriminative training of HMM using maximum normalized likelihood algorithm,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, pp. 497–500, 2001.
15. Max, B., Tam, Y.-C., and Li, Q., “Discriminative auditory features for robust speech recognition,” IEEE Trans. on Speech and Audio Processing, vol. 12, pp. 27–36, Jan. 2004.
16. Mora-Jimenez, I. and Cid-Sueiro, J., “A universal learning rule that minimizes well-formed cost functions,” IEEE Trans. on Neural Networks, vol. 16, pp. 810–820, July 2005.
17. Normandin, Y., Cardin, R., and Mori, R. D., “High-performance connected digit recognition using maximum mutual information estimation,” IEEE Trans. on Speech and Audio Processing, vol. 2, pp. 299–311, April 1994.
18. Reynolds, D. and Rose, R. C., “Robust text-independent speaker identification using Gaussian mixture speaker models,” IEEE Trans. on Speech and Audio Processing, vol. 3, pp. 72–83, 1995.
19. Robinson, M., Azimi-Sadjadi, M. R., and Salazar, J., “Multi-aspect target discrimination using hidden Markov models and neural networks,” IEEE Trans. on Neural Networks, vol. 16, pp. 447–459, March 2005.
20. Werbos, P. J., The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. New York: J. Wiley & Sons, 1994.
21. Wu, W., Feng, G., Li, Z., and Xu, Y., “Deterministic convergence of an online gradient method for BP networks,” IEEE Trans. on Neural Networks, vol. 16, pp. 533–540, May 2005.
22. Yin, Y. and Li, Q., “Soft frame margin estimation of Gaussian mixture models for speaker recognition with sparse training data,” in ICASSP 2011, 2011.
23. Yu, X., Efe, M. O., and Kaynak, O., “A general backpropagation algorithm for feedforward neural networks learning,” IEEE Trans. on Neural Networks, vol. 13, pp. 251–254, Jan. 2002.
Chapter 14 Verbal Information Verification
So far in this book we have focused on speaker recognition, which includes speaker verification (SV) and speaker identification (SID). Both tasks are accomplished by matching a speaker’s voice with his or her registered and modeled speech characteristics. In this chapter, we present another approach to speaker authentication, verbal information verification (VIV), in which spoken utterances of a claimed speaker are automatically verified against the key (usually confidential) information in the speaker’s registered profile to decide whether the claimed identity should be accepted or rejected. Using a sequential procedure for VIV involving three question-response turns, we achieved an error-free result in a telephone speaker authentication experiment with 100 speakers. This work was originally reported by the author, Juang, Zhou, and Lee in [5, 6].
14.1 Introduction

As we have discussed in previous chapters, automatic user authentication is necessary to ensure proper access to private information, personal transactions, and the security of computer and communication networks. Among various kinds of authentication methods, such as voice, password, personal identification number (PIN), signature, fingerprint, iris, hand shape, etc., voice is the most convenient one because it is easy to produce, capture, and transmit over the telephone or wireless networks. It can also be supported with existing services without requiring special devices. Speaker recognition is a voice authentication technique that has been studied for several decades. There are, however, still several problems which affect real-world applications, such as acoustic mismatch, quality of the training data, inconvenience of enrollment, and the creation of a large database to store all the enrolled speaker patterns. In this chapter we present a different approach to speaker authentication called verbal information verification (VIV) [5, 6]. VIV can be used independently or can be combined with speaker recognition to provide convenience to users while achieving a higher level of security.

As introduced in Chapter 1, VIV is the process of verifying spoken utterances against the information stored in a given personal data profile. A VIV system may use a dialogue procedure to verify a user by asking questions. An example of a VIV system is shown in Fig. 14.1. It is similar to a typical telebanking procedure: after an account number is provided, the operator verifies the user by asking for some personal information, such as mother’s maiden name, birth date, address, home telephone number, etc. The user must answer the questions correctly in order to gain access to his or her account. To automate the whole procedure, the questions can be prompted by a text-to-speech system (TTS) or by pre-recorded messages.

Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_14, © Springer-Verlag Berlin Heidelberg 2012
[Figure 14.1: flowchart. The system asks three questions in sequence: “In which year were you born?”, “In which city/state did you grow up?”, and “May I have your telephone number, please?”. After each question it gets and verifies the answer utterance. A wrong answer at any stage leads to rejection; a correct answer leads to the next question, and acceptance is granted on three correct utterances.]

Fig. 14.1. An example of verbal information verification by asking sequential questions. (Similar sequential tests can also be applied in speaker verification and other biometric or multi-modality verification.)
We note that the major difference between speaker recognition and VIV in speaker authentication is that a speaker recognition system utilizes a speaker’s voice characteristics represented by the speech feature vectors, while a VIV system mainly inspects the verbal content in the speech signal. The difference can be further addressed in the following three aspects. First, in a speaker recognition system, for either SID or SV, we need to train speaker-dependent (SD) models, while in VIV we usually use speaker-independent models with associated acoustic-phonetic identities or subwords. Second, a speaker recognition system needs to enroll a new user and to train the SD model, while a VIV system does not need such an enrollment. A user’s personal data profile is created when the user’s account is set up. Finally, in speaker recognition, the system has the ability to reject an impostor when the input utterance contains a legitimate pass-phrase but fails to match the pre-trained SD model. In VIV, it is solely the user’s responsibility to protect his or her own personal information because no speaker-specific voice characteristics are used in the verification process.

However, in real applications, there are several ways to prevent impostors from using a speaker’s personal information obtained by monitoring a particular session. A VIV system can ask for information that is not constant from one session to another, e.g., the amount or date of the last deposit, or for a subset of the registered personal information; e.g., a VIV system can require a user to register N pieces of personal information (N > 1), and each time randomly ask only n questions (1 ≤ n < N). Furthermore, as presented in Section 15.2, a VIV system can be migrated to an SV system, and VIV can be used to facilitate automatic enrollment for SV, as discussed in Chapter 15.
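The idea of asking only a random subset of the registered items in each session can be sketched in a few lines. This is an illustration only: the profile layout (a dictionary from question to expected answer) and the function name are assumptions made here, and a deployed system would need its own policy for storing and selecting questions.

```python
import random


def pick_questions(profile, n, rng=None):
    """Choose n of the N registered items (1 <= n < N) to ask in this session."""
    if not 1 <= n < len(profile):
        raise ValueError("need 1 <= n < N")
    # SystemRandom draws from the operating system's entropy source by default;
    # a seeded random.Random can be passed in for reproducible tests.
    rng = rng or random.SystemRandom()
    return rng.sample(sorted(profile), n)


# Hypothetical profile with placeholder answers.
example_profile = {
    "birth year": "<answer 1>",
    "city where you grew up": "<answer 2>",
    "telephone number": "<answer 3>",
    "mother's maiden name": "<answer 4>",
}
asked = pick_questions(example_profile, 2, random.Random(0))
```

Because the subset changes between sessions, an eavesdropper who overhears one session learns only part of the registered information.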
14.2 Single Utterance Verification

Using speech processing techniques, we have two ways to verify a single spoken utterance for VIV: by automatic speech recognition (ASR) or by utterance verification. With ASR, the spoken input is transcribed into a sequence of words. The transcribed words are then compared to the information pre-stored in the claimed speaker’s personal profile. With utterance verification, the spoken input is verified against an expected sequence of word or subword models which is taken from a personal data profile of the claimed individual. Based on our experience [6] and the analysis in Section 14.3, the utterance verification approach can give much better performance than the ASR approach. Therefore, we focus our discussion only on the utterance verification approach in this study.

The idea of utterance verification for computing confidence scores was used in keyword spotting and non-keyword rejection (e.g. [13, 14, 2, 8, 17, 18, 16]). A similar concept can also be found in fixed-phrase speaker verification [15, 11, 7] and in VIV [6, 4]. A block diagram of a typical utterance verification system for VIV is shown in Fig. 14.2. The three key modules, utterance segmentation by forced decoding, subword testing, and utterance-level confidence scoring, are described in detail in the following subsections.

[Figure 14.2: block diagram. The identity claim selects the phone/subword transcription S1, ..., Sm of the expected pass-utterance (“Murray Hill” in the example). Forced decoding of the spoken pass-utterance with the SI HMM’s for the transcription yields the phone boundaries and the target likelihoods; the anti-HMM’s for the transcription yield the anti-likelihoods; the confidence measure computation combines both into scores.]

Fig. 14.2. Utterance verification in VIV.

For an access control application, when a user opens an account, some of his or her key information is registered in a personal profile. Each piece of
the key information is represented by a sequence of words, S, which in turn is equivalently characterized by a concatenation of a sequence of phones or subwords, $\{S_n\}_{n=1}^{N}$, where $S_n$ is the nth subword and N is the total number of subwords in the key word sequence. Since the VIV system only prompts one single question at a time, the system knows the expected key information for the prompted question and the corresponding subword sequence S. We then apply the subword models $\lambda_1, \ldots, \lambda_N$ in the same order as the subword sequence S to decode the answer utterance. This process is known as forced decoding or forced alignment, in which the Viterbi algorithm is employed to determine the maximum likelihood segmentation of the subwords, i.e.

$$ P(O|S) = \max_{t_1, t_2, \ldots, t_N} P(O_1^{t_1}|S_1)\, P(O_{t_1+1}^{t_2}|S_2) \cdots P(O_{t_{N-1}+1}^{t_N}|S_N), \tag{14.1} $$

where

$$ O = \{O_1, O_2, \ldots, O_N\} = \{O_1^{t_1}, O_{t_1+1}^{t_2}, \ldots, O_{t_{N-1}+1}^{t_N}\} \tag{14.2} $$

is a set of segmented feature vectors associated with subwords, $t_1, t_2, \ldots, t_N$ are the end frame numbers of the subword segments, and $O_n = O_{t_{n-1}+1}^{t_n}$ is the segmented sequence of observations corresponding to subword $S_n$, from frame number $t_{n-1}+1$ to frame number $t_n$, where $t_1 \ge 1$ and $t_i > t_{i-1}$.

Given an observed speech segment $O_n$, we need a decision rule by which we assign the subword to either hypothesis $H_0$ or $H_1$. Following the definition in [17], $H_0$ means that the observed speech $O_n$ consists of the actual sound of subword $S_n$, and $H_1$ is the alternative hypothesis. For the binary-testing problem, one of the most useful tests for decision making is the Neyman-Pearson lemma [10, 9, 19]. For a given number of observations K, the most powerful
test, which minimizes the error for one class while maintaining the error for the other class constant, is a likelihood ratio test,

$$ r(O_n) = \frac{P(O_n|H_0)}{P(O_n|H_1)} = \frac{P(O_n|\lambda_n)}{P(O_n|\bar{\lambda}_n)}, \tag{14.3} $$

where $\lambda_n$ and $\bar{\lambda}_n$ are the target HMM and the corresponding anti-HMM for subword unit $S_n$, respectively. The target model, $\lambda_n$, is trained using the data of subword $S_n$; the corresponding anti-model, $\bar{\lambda}_n$, is trained using the data of a set of subwords $\bar{S}_n$ which are highly confusable with subword $S_n$ [17], i.e. $\bar{S}_n \subset \{S_i\}$, $i \neq n$. The log likelihood ratio (LLR) for subword $S_n$ is

$$ R(O_n) = \log P(O_n|\lambda_n) - \log P(O_n|\bar{\lambda}_n). \tag{14.4} $$

For normalization, an average frame LLR, $R_n$, is defined as

$$ R_n = \frac{1}{l_n}\left[\log P(O_n|\lambda_n) - \log P(O_n|\bar{\lambda}_n)\right], \tag{14.5} $$

where $l_n$ is the length of the speech segment. For each subword, a decision can be made by

$$ \begin{cases} \text{Acceptance:} & R_n \ge T_n; \\ \text{Rejection:} & R_n < T_n, \end{cases} \tag{14.6} $$

where either a subword-dependent threshold value $T_n$ or a common threshold T can be determined numerically or experimentally. In the above test, we assume independence among all HMM states and among all subwords. Therefore, the above test can be interpreted as applying the Neyman-Pearson lemma in every state, then combining the scores together as the final average LLR score.

14.2.1 Normalized Confidence Measures

A confidence measure M for a key utterance O can be represented as

$$ M(O) = F(R_1, R_2, \ldots, R_N), \tag{14.7} $$
where F is the function that combines the LLR’s of all subwords in the key utterance. To make a decision on the subword level, we need to determine the threshold for each of the subword tests. If we have the training data for each subword model and the corresponding anti-subword model, this is not a problem. However, in many cases, the data may not be available. Therefore, we need to define a test which allows us to determine the thresholds without using the training data. For subword $S_n$, which is characterized by a model $\lambda_n$, we define

$$ C_n = \frac{\log P(O_n|\lambda_n) - \log P(O_n|\bar{\lambda}_n)}{\log P(O_n|\lambda_n)} \tag{14.8} $$
where $\log P(O_n|\lambda_n) \neq 0$. $C_n > 0$ means the target score is larger than the anti-score and vice versa. Furthermore, we define a normalized confidence measure for an utterance with N subwords as

$$ M = \frac{1}{N} \sum_{n=1}^{N} f(C_n), \tag{14.9} $$

where

$$ f(C_n) = \begin{cases} 1, & \text{if } C_n \ge \theta; \\ 0, & \text{otherwise.} \end{cases} \tag{14.10} $$

M is in a fixed range of $0 \le M \le 1$. Due to the normalization in Eq. (14.8), $\theta$ is a subword-independent threshold which can be determined separately. A subword is accepted and counted as part of the utterance confidence measure only if its $C_n$ score is greater than or equal to the threshold value $\theta$. Thus, M can be interpreted as the percentage of acceptable subwords in an utterance; e.g. M = 0.8 implies that 80% of the subwords in the utterance are acceptable. Therefore, an utterance threshold can be determined or adjusted based on the specifications of system performance and robustness.
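The scoring in (14.5), (14.9), and (14.10) can be illustrated with a short sketch. It takes the per-subword log-likelihoods and the normalized scores $C_n$ as given inputs (in a real system they would come from forced decoding with the target and anti-models); the function names and the example numbers are hypothetical.

```python
def average_frame_llr(log_p_target, log_p_anti, n_frames):
    """Average frame log likelihood ratio R_n of (14.5)."""
    return (log_p_target - log_p_anti) / n_frames


def utterance_confidence(c_scores, theta):
    """Normalized confidence M of (14.9)-(14.10): share of subwords with C_n >= theta."""
    if not c_scores:
        raise ValueError("utterance has no subwords")
    return sum(1 for c in c_scores if c >= theta) / len(c_scores)


def accept_utterance(c_scores, theta, utterance_threshold):
    """Accept the utterance when M reaches the utterance-level threshold."""
    return utterance_confidence(c_scores, theta) >= utterance_threshold


# Hypothetical C_n scores for an utterance with five subwords.
example_scores = [0.3, 0.1, -0.2, 0.5, 0.0]
M = utterance_confidence(example_scores, theta=0.0)
```

With θ = 0, four of the five example subwords pass, giving M = 0.8, which matches the 80% interpretation in the text. Because M is bounded between 0 and 1 regardless of the subwords involved, a single utterance-level threshold can be tuned without per-subword training data.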
14.3 Sequential Utterance Verification

For a verification with multiple utterances, the above single utterance test strategy can be extended to a sequence of subtests, which is similar to the step-down procedure in statistics [1]. Each of the subtests is an independent single-utterance verification. As soon as a subtest calls for rejection, $H_1$ is chosen and the procedure is terminated; if no subtest leads to rejection, $H_0$ is accepted, i.e. the user is accepted.

We define $H_0$ to be the target hypothesis in which all the answered utterances match the key information in the profile. We have

$$ H_0 = \bigcap_{i=1}^{J} H_0(i), \tag{14.11} $$

where J is the total number of subtests, and $H_0(i)$ is a component target hypothesis in the ith subtest corresponding to the ith utterance. The alternative hypothesis is

$$ H_1 = \bigcup_{i=1}^{J} H_1(i), \tag{14.12} $$

where $H_1(i)$ is a component alternative hypothesis corresponding to the ith subtest. We assume independence among subtests. On the ith subtest, a decision can be made by

$$ \begin{cases} \text{Acceptance:} & M(i) \ge T(i); \\ \text{Rejection:} & M(i) < T(i); \end{cases} \tag{14.13} $$
where $M(i)$ and $T(i)$ are the confidence score and the corresponding threshold for utterance i, respectively. As is well known, when performing a test, one may commit one of two types of error: rejecting the hypothesis when it is true, a false rejection (FR), or accepting it when it is false, a false acceptance (FA). We denote the FR and FA error rates as $\varepsilon_r$ and $\varepsilon_a$, respectively. An equal-error rate (EER), $\varepsilon$, is defined when the two error rates are equal in a system, i.e. $\varepsilon_r = \varepsilon_a = \varepsilon$.

For a sequential test, we extend the definitions of error rates as follows. False rejection error on J utterances (J ≥ 1) is the error when the system rejects a correct response in any one of the J hypothesis subtests. False acceptance error on J utterances (J ≥ 1) is the error when the system accepts an incorrect set of responses after all J hypothesis subtests. Equal-error rate on J utterances is the rate at which the false rejection error rate and the false acceptance error rate on J utterances are equal. For convenience, we denote the above FR and FA error rates on J utterances as $E_r(J)$ and $E_a(J)$, respectively.

Let $\Omega_i = R_1(i) \cup R_0(i)$ be the region of confidence scores of the ith subtest, where $R_0(i)$ is the region of confidence scores which satisfy $M(i) \ge T(i)$, from which we accept $H_0(i)$, and $R_1(i)$ is the region of scores which satisfy $M(i) < T(i)$, from which we accept $H_1(i)$. The FR and FA errors for subtest i can be represented as the following conditional probabilities

$$ \varepsilon_r(i) = P(\, M(i) \in R_1(i) \mid H_0(i) \,), \tag{14.14} $$

and

$$ \varepsilon_a(i) = P(\, M(i) \in R_0(i) \mid H_1(i) \,), \tag{14.15} $$

respectively. Furthermore, the FR error on J utterances can be evaluated as

$$ E_r(J) = P\Big( \bigcup_{i=1}^{J} \{M(i) \in R_1(i)\} \,\Big|\, H_0 \Big) = 1 - \prod_{i=1}^{J} (1 - \varepsilon_r(i)), \tag{14.16} $$

and the FA error on J utterances is

$$ E_a(J) = P\Big( \bigcap_{i=1}^{J} \{M(i) \in R_0(i)\} \,\Big|\, H_1 \Big) = \prod_{i=1}^{J} \varepsilon_a(i). \tag{14.17} $$
Equations (14.16) and (14.17) indicate an important property of the sequential test defined above: the more subtests there are, the smaller the FA error and the larger the FR error. Therefore, we can adopt the following strategy in a
VIV system design: starting from the first subtest, we first set the threshold value such that the FR error rate for the subtest, εr, is close to zero or a small number corresponding to design specifications, then add more subtests in the same way until meeting the required system FA error rate, Ea, or reaching the maximum number of allowed subtests.

In a real application, verification time may be saved by arranging the subtests in order of descending importance and decreasing subtest error rates; thus, the system first prompts users with the most important question or with the subtest which we know has a high FR error εr(i). Therefore, if a speaker is falsely rejected, the session can be restarted right away with little inconvenience to the user.

Equation (14.16) also indicates the reason that an ASR approach would not perform very well in a sequential test. Although ASR can give us a low FR error, εr(i), on each of the individual subtests, the overall FR error on J utterances, Er(J), J > 1, can still be very high. In the proposed utterance verification approach, we make the FR on each individual subtest close to zero by adjusting the threshold value while controlling the overall FA error by adding more subtests until reaching the design specifications. We use the following examples to show the above concept.

14.3.1 Examples in Sequential-Test Design

We use two examples to show the sequential-test design based on required error rates.

Example 1: Adding Additional Subtest

A bank operator asks two kinds of personal questions while verifying a customer. When automatic VIV is applied to the procedure, the average individual error rates on these two subtests are εr(1) = 0.1%, εa(1) = 5%; and εr(2) = 0.2%, εa(2) = 6%, respectively. Then, from Eqs. (14.16) and (14.17), we know that the system FR and FA errors on a sequential test are Er(2) = 0.3% and Ea(2) = 0.3%.
If the bank wants to further reduce the FA error, one additional subtest can be added to the sequential test. Suppose the additional subtest has εr(3) = 0.3% and εa(3) = 7%. The overall system error rates will be Er(3) = 0.6% and Ea(3) = 0.021%.

Example 2: Determining the Number of Subtests

A security system requires Er(J) ≤ 0.03% and Ea(J) ≤ 0.2%. It is known that each subtest can have εr ≤ 0.01% and εa ≤ 12% by adjusting the thresholds. In this case, we need to determine the number of subtests, J, to meet the design specifications. From Eq. (14.17), we have

\[ J = \left\lceil \frac{\log E_a}{\log \varepsilon_a} \right\rceil = \left\lceil \frac{\log 0.002}{\log 0.12} \right\rceil = 3. \]

Then, the actual system FA rate on three subtests is Ea(3) = 0.17% ≤ 0.2%; the FR rate on three subtests is Er(3) = 0.03%. Therefore, three subtests can meet the required performance on both FR and FA.
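The arithmetic in the two examples can be checked with a short script. The following is a minimal sketch of Eqs. (14.16) and (14.17) under the stated independence assumption; the function names are illustrative and do not come from the original system.

```python
def sequential_error_rates(fr_rates, fa_rates):
    """Return (E_r(J), E_a(J)) for independent subtests, Eqs. (14.16)-(14.17)."""
    pass_all = 1.0     # probability that a true speaker passes every subtest
    accept_all = 1.0   # probability that an impostor is accepted by every subtest
    for eps_r, eps_a in zip(fr_rates, fa_rates):
        pass_all *= 1.0 - eps_r
        accept_all *= eps_a
    return 1.0 - pass_all, accept_all


def subtests_needed(target_fa, per_subtest_fa):
    """Smallest J with per_subtest_fa ** J <= target_fa.

    Assumes every subtest has the same FA rate, strictly between 0 and 1.
    """
    if not 0.0 < per_subtest_fa < 1.0 or target_fa <= 0.0:
        raise ValueError("rates must lie strictly between 0 and 1")
    j = 1
    while per_subtest_fa ** j > target_fa:
        j += 1
    return j


# Example 1: two subtests, then a third one added.
er2, ea2 = sequential_error_rates([0.001, 0.002], [0.05, 0.06])
er3, ea3 = sequential_error_rates([0.001, 0.002, 0.003], [0.05, 0.06, 0.07])

# Example 2: number of subtests for E_a <= 0.2% when each subtest has 12% FA.
j = subtests_needed(0.002, 0.12)
```

Rounded to the precision used in the text, `er2` and `ea2` are both about 0.3%, `er3` is about 0.6%, `ea3` is 0.021%, and `j` is 3.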
14.4 VIV Experimental Results

In the following experiments, the VIV system verifies speakers by three sequential subtests, i.e. J = 3. The system performance with various decision thresholds will be evaluated and compared.

The experimental database includes 100 speakers. Each speaker gave three utterances as the answers to the following three questions: “In which year were you born?” “In which city and state did you grow up?” “May I have your telephone number, please?”

This is a biased database. Twenty-six percent (26%) of the speakers have a birth year in the 1950s; 24% are in the 1960s. Within each decade, the birth years differ by only one digit. Regarding place of birth, 39% were born in “New Jersey”, with 5% born in the exact same city and state: “Murray Hill, New Jersey”. Thirty-eight percent (38%) of the telephone numbers provided start with “908 582 ...”, which means that at least 60% of the digits in their answer for their telephone number are identical. In addition, some of the speakers have foreign accents, and some cities and states are in foreign countries.

In the experiments, a speaker was considered a true speaker when the speaker’s utterances were verified against his or her data profile. The same speaker was considered an impostor when the utterances were verified against other speakers’ profiles. Thus, for each true speaker, we have three utterances from the speaker and 99 × 3 utterances from the other 99 speakers as impostors.

The speech signal was sampled at 8 kHz and pre-emphasized using a first-order filter with a coefficient of 0.97. The samples were blocked into overlapping frames of 30 ms in duration and updated at 10 ms intervals. Each frame was windowed with a Hamming window. The cepstrum was derived from a 10th order LPC analysis. The LPC coefficients were then converted to cepstral coefficients, where only the first 12 coefficients were kept.
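The front-end steps just described can be sketched in a few lines. This is only an illustration under stated assumptions: the text does not say which LPC method was used, so the autocorrelation method with the Levinson-Durbin recursion is assumed here, together with the standard LPC-to-cepstrum recursion for an all-pole model. The function names and the small energy floor are hypothetical, and delta features and energy terms are not computed.

```python
import math


def levinson_durbin(r, order):
    """Solve the LPC normal equations; returns predictor coefficients a_1..a_order."""
    a = [0.0] * (order + 1)              # a[0] is unused
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] - sum(a[j] * r[i - j] for j in range(1, i))
        k = acc / err                    # reflection coefficient
        prev = a[:]
        a[i] = k
        for j in range(1, i):
            a[j] = prev[j] - k * prev[i - j]
        err *= 1.0 - k * k
    return a[1:]


def lpc_to_cepstrum(a, n_ceps):
    """LPC-to-cepstrum recursion for an all-pole model (gain term c_0 omitted)."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(max(1, n - p), n):
            acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]


def lpc_cepstra(signal, fs=8000, frame_ms=30, shift_ms=10,
                preemph=0.97, order=10, n_ceps=12):
    """Pre-emphasis, framing, Hamming window, LPC analysis, cepstrum per frame.

    Assumes a non-empty signal; returns one list of n_ceps values per frame.
    """
    x = [signal[0]] + [signal[n] - preemph * signal[n - 1]
                       for n in range(1, len(signal))]
    flen, shift = fs * frame_ms // 1000, fs * shift_ms // 1000
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (flen - 1))
              for n in range(flen)]
    feats = []
    for start in range(0, len(x) - flen + 1, shift):
        frame = [x[start + n] * window[n] for n in range(flen)]
        r = [sum(frame[n] * frame[n + lag] for n in range(flen - lag))
             for lag in range(order + 1)]
        r[0] += 1e-9                     # guard against all-zero (silent) frames
        feats.append(lpc_to_cepstrum(levinson_durbin(r, order), n_ceps))
    return feats
```

With the default arguments, one second of 8 kHz speech yields 98 frames of 240 samples each, and 12 cepstral coefficients per frame.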
The feature vector consisted of 39 features including 12 cepstral coefficients, 12 delta cepstral coefficients, 12 delta-delta cepstral coefficients, energy, delta energy, and delta-delta energy [12]. The models used in evaluating the subword verification scores were a set of 1117 right context-dependent HMM’s as the target phone models [3], and a set of 41 context-independent anti-phone HMM’s as anti-models [17]. For a VIV system with multiple subtests, either one global threshold, i.e. T = T(i), or multiple thresholds, i.e. T(i) ≠ T(j), i ≠ j, can be used. The
thresholds can be either context (key information) dependent or context independent. They can also be either speaker dependent or speaker independent.

Two Speaker-Independent Thresholds

For robust sequential verification, we define the logic of using two speaker-independent and context-dependent thresholds for a multiple-question trial as follows:

\[ T(i) = \begin{cases} T_L, & \text{when } T_L \le M(i) < T_H \text{ at the first time,} \\ T_H, & \text{otherwise,} \end{cases} \tag{14.18} \]

where TL and TH are two threshold values; M(i) and T(i) are the values of the confidence measure and threshold, respectively, for the ith subtest. Eq. (14.18) means TL can be used only once during the sequential trial. Thus, if a true speaker has only one lower score in a sequential test, the speaker still has the chance to pass the overall verification trial. This is useful in noisy environments or for speakers who may not speak consistently.

When the above two thresholds were applied to VIV testing, the system performance improved over the single-threshold test, as shown in Table 14.1. The minimal FA rates in the table were obtained by adjusting the thresholds while maintaining the FR rates at 0%. As we can see from the table, the thresholds for M have limited ranges, 0.0 ≤ TL, TH ≤ 1.0, and clear physical meanings: e.g. TL = 0.69 and TH = 0.84 imply that 69% and 84% of phones are acceptable, respectively. The comparison of single and two thresholds is listed in Table 14.2.

Table 14.1. False Acceptance Rates when Using Two Thresholds and Maintaining False Rejection Rates to Be 0.0%

Confidence measure   False acceptance on three utterances   Thresholds
M                    0.57%                                  TL = 0.69; TH = 0.84
M2                   0.79%                                  TL = −0.262; TH = 0.831

Table 14.2. Comparison on Two and Single Threshold Tests

No. of SI thresholds   Error rates on three utterances   Threshold values
Two                    FA = 0.57%, FR = 0.0%             TL = 0.69; TH = 0.84
Single                 FA = 0.75%, FR = 1.0%             T = 0.89
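The two-threshold logic of Eq. (14.18) can be read as a small decision procedure: each subtest must reach TH, except that one score in the band between TL and TH is tolerated per trial. The sketch below is one plausible reading of that rule; the function name is hypothetical, and the scores in the usage lines are made up for illustration (only the threshold values come from Table 14.1).

```python
def two_threshold_trial(scores, t_low, t_high):
    """Accept a sequential trial under Eq. (14.18).

    scores: confidence scores M(i), in the order the subtests are run.
    The lower threshold t_low may be applied only once per trial.
    """
    low_used = False
    for m in scores:
        if m >= t_high:
            continue                     # passes the regular threshold
        if t_low <= m and not low_used:
            low_used = True              # first score in [t_low, t_high): tolerated
            continue
        return False                     # below t_low, or second low score
    return True


# Hypothetical scores, thresholds taken from Table 14.1 (measure M).
one_dip = two_threshold_trial([0.90, 0.70, 0.88], 0.69, 0.84)    # accepted
two_dips = two_threshold_trial([0.70, 0.72, 0.88], 0.69, 0.84)   # rejected
too_low = two_threshold_trial([0.90, 0.60, 0.88], 0.69, 0.84)    # rejected
```

Because the trial stops at the first failed subtest, a falsely rejected user can be asked to restart without answering the remaining questions.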
Robust Intervals

A speaker’s test scores may fluctuate, even for utterances of the same text, due to variations in voice characteristics, channels, and acoustic environment. We therefore need to define a robust interval, τ, to characterize the variation and the system robustness,

\[ T'(i) = T(i) - \tau, \qquad 0 \le \tau < T(i), \tag{14.19} \]

where T(i) is an original context-dependent utterance threshold as defined in Eq. (14.13), and T'(i) is the adjusted threshold value. The robust interval, τ, is equivalent to the tolerance in the test score to accommodate fluctuation due to variations in environments or a speaker’s conditions. In a system evaluation, τ can be reported with error rates as an allowed tolerance; or it can be used to determine the thresholds based on system specifications. For example, a bank authentication system may need a smaller τ to ensure a lower FA rate for a higher security level, while a voice messaging system may select a larger τ for a lower FR rate to avoid user frustration.

Speaker-Dependent Thresholds

To further improve the performance, a VIV system can start from a speaker-independent threshold, then switch to speaker- and context-dependent thresholds after the system has been used several times by a user. To ensure no false rejection, the upper bound of the threshold for subtest i of a speaker can be selected as

\[ T(i) \le \min_{j} \{ M(i, j) \}, \qquad j = 1, \ldots, I, \tag{14.20} \]
where M(i, j) is the confidence score for utterance i on the jth trial, and I is the total number of trials that the speaker has performed on the same context of utterance i. In this case, we have three thresholds associated with the three questions for each speaker.

Following the design strategy proposed in Section 14.3, the thresholds were determined by first estimating T(i) as in Eq. (14.20) to guarantee a 0% FR rate. Then, the thresholds were shifted to evaluate the FA rate on different robust intervals τ as defined in Eq. (14.19). The relation between robust interval and false acceptance rates on three questions using the normalized confidence measure is shown in Fig. 14.3, where the horizontal axis indicates the changes in the value of the robust interval τ. The three curves represent the performance of a VIV system using one to three questions for speaker authentication while maintaining an FR rate of 0%. An enlarged graph of the performance for the cases of two and three subtests is shown in Fig. 14.4. We note that the conventional ROC plot cannot be applied here since the FR is 0%. We also note that the threshold adjustment is made on a per-speaker, per-question basis, although the plot in Fig. 14.4 shows the overall performance for all speakers.
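The procedure described above (set each speaker-dependent threshold at the minimum observed score, then lower it by τ and measure the FA rate) can be sketched as follows. This is an illustration only: the function names and the toy scores are hypothetical, and τ is treated as an absolute shift in score units as in Eq. (14.19), whereas the figures report the interval in percent.

```python
def sd_thresholds(score_history, tau=0.0):
    """Per-question thresholds for one speaker, from Eqs. (14.19) and (14.20).

    score_history[i] holds the speaker's past confidence scores M(i, j)
    for question i; tau is the robust interval in score units.
    """
    return [min(scores) - tau for scores in score_history]


def false_acceptance_rate(impostor_trials, thresholds):
    """Fraction of impostor trials whose scores meet every threshold."""
    accepted = sum(
        all(m >= t for m, t in zip(trial, thresholds))
        for trial in impostor_trials
    )
    return accepted / len(impostor_trials)


# Toy data: one true speaker's history on three questions, four impostor trials.
history = [[0.90, 0.80], [0.70, 0.75], [0.95, 0.85]]
impostors = [
    [0.90, 0.90, 0.90],
    [0.50, 0.90, 0.90],
    [0.80, 0.70, 0.85],
    [0.76, 0.66, 0.81],
]
fa_tight = false_acceptance_rate(impostors, sd_thresholds(history))
fa_loose = false_acceptance_rate(impostors, sd_thresholds(history, tau=0.05))
```

On this toy data the FA rate rises from 0.5 to 0.75 when τ goes from 0 to 0.05, which is the trade-off the robust-interval curves describe.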
From the figures, we can see that when using a one-question test, we cannot obtain a 0% EER. Using two questions, we have a 0% equal-error rate but with no tolerance (i.e. robust interval τ = 0). With three questions, the VIV system gave 0% EER with a 6% robust interval, which means that when a true speaker’s utterance scores are 6% lower (e.g. due to variations in telephone quality), the speaker can still be accepted while all impostors in the database can be rejected correctly. This robust interval gives room for variation in the true speaker’s score to ensure robust performance of the system. Fig. 14.4 also implies that three questions are necessary to obtain a 0% FA in the experiment.

In real applications, a VIV system may apply SI thresholds to a new user and switch to SD thresholds after the user has accessed the system successfully a few times. The thresholds can also be updated based on recent scores to accommodate changes in a speaker’s voice and environment. An updated SD threshold can be determined as

\[ T(i) < \alpha \min_{j} \{ M(i, j) \}, \qquad 1 \le I - k \le j \le I, \quad k \ge 1, \tag{14.21} \]
where M(i, j) is the confidence score for utterance i on the jth trial, I is the total number of trials that the speaker has produced on the same context of utterance i, and k is the update duration, i.e. the updated threshold is determined based on the last k trials.

[Figure: false acceptance with utterance verification (%), from 0 to 40, plotted against the robust interval for speaker-dependent thresholds (%), from −20 to 0; curves for 1, 2, and 3 subtests.]

Fig. 14.3. False acceptance rate as a function of robust interval with SD threshold for a 0% false rejection rate. The horizontal axis indicates the shifts of the values of the robust interval τ.
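The threshold update of Eq. (14.21) amounts to a sliding-window minimum over the speaker's most recent scores. A minimal sketch follows, with two assumptions that are not taken from the text: α is treated as a caller-supplied scaling factor, and the hypothetical parameter `window` is the number of most recent trials considered.

```python
def updated_sd_bound(scores, window, alpha):
    """Upper bound alpha * min(recent scores) for one question, after Eq. (14.21).

    scores: the speaker's confidence scores for this question, oldest first.
    The working threshold should then be chosen below the returned bound.
    """
    if window < 1 or not scores:
        raise ValueError("need at least one score and window >= 1")
    recent = scores[-window:]            # most recent `window` trials
    return alpha * min(recent)


# Hypothetical score history for one question.
scores = [0.90, 0.60, 0.80, 0.85]
bound_recent = updated_sd_bound(scores, window=2, alpha=1.0)
bound_all = updated_sd_bound(scores, window=4, alpha=1.0)
```

A short window lets the bound recover after an unusually low score drops out of the history (0.80 here), while a long window keeps the more conservative value (0.60).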
A summary of VIV for speaker authentication is shown in Table 14.3. In the utterance verification approach, when SD thresholds are set for each key information field, we achieved 0% average individual EER with a 6% robust interval.

[Figure: false acceptance with utterance verification (%), from 0 to 3.5, plotted against the robust interval for speaker-dependent thresholds (%), from −20 to 0; curves for 2 and 3 subtests.]

Fig. 14.4. An enlarged graph of the system performances using two and three questions.

Table 14.3. Summary of the Experimental Results on Verbal Information Verification

Approaches                          False Rejection   False Acceptance   Accuracy   Robust Interval
Sequential Utterance Verification   0%                0%                 100%       6%

(Note: Tested on 100 speakers with 3 questions while speaker-dependent thresholds were applied.)
14.5 Conclusions

In this chapter we presented an automatic verbal information verification technique for user authentication. VIV authenticates speakers by verbal content instead of voice characteristics. We also presented a sequential utterance verification solution to VIV with a system design procedure. Given the number of test utterances (subtests), the procedure can help us to design a system with minimal overall error rate; given a limit on the error rate, the procedure can find out how many subtests are needed to obtain the expected accuracy. In a VIV experiment with three questions prompted and tested sequentially, the proposed VIV system achieved 0% equal-error rate with 6% robust interval on 100 speakers when SD utterance thresholds were applied. However, since VIV verifies the verbal content only and not a speaker’s voice characteristics, it is the user’s responsibility to protect his or her personal information.

The sequential verification technique can also be applied to other biometric verification systems or multi-modality verification systems in which more than one verification method can be employed, such as voice plus fingerprint verification, or other kinds of configurations. For real-world applications, the solution of a practical and useful speaker authentication system can be a combination of speaker recognition and verbal information verification. In the next chapter, Chapter 15, we will provide an example of real speaker authentication design using the techniques outlined in this book.
References

1. Anderson, T. W., An introduction to multivariate statistical analysis, second edition. New York: John Wiley & Sons, 1984.
2. Kawahara, T., Lee, C.-H., and Juang, B.-H., “Combining key-phrase detection and subword-based verification for flexible speech understanding,” in Proceedings of ICASSP, (Munich), pp. 1159–1162, May 1997.
3. Lee, C.-H., Juang, B.-H., Chou, W., and Molina-Perez, J. J., “A study on task-independent subword selection and modeling for speech recognition,” in Proc. of ICSLP, (Philadelphia), pp. 1816–1819, Oct. 1996.
4. Li, Q. and Juang, B.-H., “Speaker verification using verbal information verification for automatic enrollment,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Seattle), May 1998.
5. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., “Automatic verbal information verification for user authentication,” IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 585–596, Sept. 2000.
6. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., “Verbal information verification,” in Proceedings of EUROSPEECH, (Rhodes, Greece), pp. 839–842, Sept. 22-25 1997.
7. Li, Q., Parthasarathy, S., and Rosenberg, A. E., “A fast algorithm for stochastic matching with application to robust speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Munich), pp. 1543–1547, April 1997.
8. Lleida, E. and Rose, R. C., “Efficient decoding and training procedures for utterance verification in continuous speech recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Atlanta), pp. 507–510, May 1996.
9. Neyman, J. and Pearson, E. S., “On the problem of the most efficient tests of statistical hypotheses,” Phil. Trans. Roy. Soc. A, vol. 231, pp. 289–337, 1933.
10. Neyman, J. and Pearson, E. S., “On the use and interpretation of certain test criteria for purpose of statistical inference,” Biometrika, vol. 20A, pp. Pt I, 175–240; Pt II, 1928.
11. Parthasarathy, S. and Rosenberg, A. E., “General phrase speaker verification using sub-word background models and likelihood-ratio scoring,” in Proceedings of ICSLP-96, (Philadelphia), October 1996.
12. Rabiner, L. and Juang, B.-H., Fundamentals of speech recognition. Englewood Cliffs, NJ: PTR Prentice Hall, 1993.
13. Rahim, M. G., Lee, C.-H., and Juang, B.-H., “Robust utterance verification for connected digits recognition,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Detroit), pp. 285–288, May 1995.
14. Rahim, M. G., Lee, C.-H., Juang, B.-H., and Chou, W., “Discriminative utterance verification using minimum string verification error (MSVE) training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Atlanta), pp. 3585–3588, May 1996.
15. Rosenberg, A. E. and Parthasarathy, S., “Speaker background models for connected digit password speaker verification,” in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.
16. Setlur, A. R., Sukkar, R. A., and Jacob, J., “Correcting recognition errors via discriminative utterance verification,” in Proc. Int. Conf. on Spoken Language Processing, (Philadelphia), pp. 602–605, Oct. 1996.
17. Sukkar, R. A. and Lee, C.-H., “Vocabulary independent discriminative utterance verification for non-keyword rejection in subword based speech recognition,” IEEE Trans. Speech and Audio Process., vol. 4, pp. 420–429, November 1996.
18. Sukkar, R. A., Setlur, A. R., Rahim, M. G., and Lee, C.-H., “Utterance verification of keyword string using word-based minimum verification error (WBMVE) training,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Processing, (Atlanta), pp. 518–521, May 1996.
19. Wald, A., Sequential analysis. NY: Chapman & Hall, 1947.
Chapter 15 Speaker Authentication System Design
In this book, we have introduced various speaker authentication techniques. Each of the techniques can be considered as a technical component, such as speaker identification, speaker verification, verbal information verification, and so on. In real-world applications, a speaker authentication system can be designed by combining the technical components to construct a useful and convenient system to meet the requirements of a particular application. In this chapter we provide an example of a speaker authentication system design. Following this example, the author hopes readers can design their own system for their particular applications to improve the security level of the protected system. This design example was originally reported in [2].
Q. Li, Speaker Authentication, Signals and Communication Technology, DOI: 10.1007/978-3-642-23731-7_15, © Springer-Verlag Berlin Heidelberg 2012

15.1 Introduction

Consider the following real-world scenario: a bank would like to provide convenient services to its customers while retaining a high level of security. The bank would like to use a speaker verification technique to verify customers during an on-line banking service, but the bank does not want to bother customers with an enrollment procedure for speaker verification. The bank wants a customer to use the on-line banking service right after the customer opens a bank account without any acoustic enrollment. The bank also wants to use biometrics to enhance the banking system security. How do we design a speaker authentication system to meet the bank’s requirements based on the techniques introduced in this book?

A feasible solution is to combine speaker verification (SV) with verbal information verification (VIV). In such a system design, a user can be verified by VIV in the first 4 to 5 accesses to the on-line banking system, usually from different acoustic environments, e.g., different ATM locations or different telephones, land-line or wireless. The VIV system verifies the user’s personal information and collects and verifies the pass-phrase utterance for use as training data for speaker-dependent model construction. One of the user’s answers to the VIV questions can serve as the pass-phrase that will be used for SV later. After a speaker-dependent (SD) model is constructed, the system then migrates from a VIV system to an SV system. This approach avoids the inconvenience of a formal enrollment procedure, ensures the quality of the training data for SV, and mitigates the mismatch caused by different acoustic environments between training and testing. In this chapter we describe such a system in detail. Experiments in this chapter show that such a system can improve the SV performance by more than 40% in relative equal-error rate (EER) reduction compared to a conventional SV system.

[Figure: block diagram. Enrollment session: training utterances (“Open Sesame”, repeated three times), HMM training, speaker-dependent HMM, database. Test session: identity claim, test utterance (“Open Sesame”), speaker verifier, scores.]

Fig. 15.1. A conventional speaker verification system
15.2 Automatic Enrollment by VIV

A conventional SV system is shown in Fig. 15.1. It involves two kinds of sessions, enrollment and test. In an enrollment session, an identity, such as an account number, is assigned to a speaker, and the speaker is asked to select a spoken pass-phrase, e.g. a connected digit string or a phrase. The system then prompts the speaker to repeat the pass-phrase several times, and a speaker-dependent HMM is constructed based on the utterances collected in the enrollment session. In a test session, the speaker’s test utterance is compared against the pre-trained, speaker-dependent HMM. The speaker is accepted if the likelihood-ratio score exceeds a pre-set threshold; otherwise the speaker is rejected.

When applying the current speaker recognition technology to real-world applications, several problems were encountered which motivated our research on VIV [1, 2], introduced in Chapter 14. A conventional speaker recognition system needs an enrollment session to collect data for training an SD model. Enrollment is inconvenient to the user as well as to the system developer, who often has to supervise and ensure the quality of the collected data. The accuracy of the collected training data is critical to the performance of an SV system. Even a true speaker might make a mistake when repeating the training utterances/pass-phrases several times. Furthermore, since the enrollment and testing voice may come from different telephone handsets and networks, there may exist an acoustic mismatch between the training and testing environments. The SD models trained on the data collected in one enrollment session may not perform well when the test session is in a different environment or via a different transmission channel. The mismatch significantly affects SV performance.

To alleviate the above problems, we propose using VIV. The combined SV and VIV system [1, 2] is shown in Fig. 15.2, where VIV is involved in the enrollment and one of the key utterances in VIV is the pass-phrase which will be used in SV later. During the first 4 to 5 accesses, the user is verified by a VIV system. The verified pass-phrase utterances are recorded and later used to train a speaker-dependent HMM for SV. At this point, the authentication process can be switched from VIV to SV.

[Figure: block diagram. Automatic enrollment: pass-phrases of the first few accesses (“Open Sesame”), verbal information verification, verified pass-phrases saved for training, HMM training, speaker-dependent HMM, database. Speaker verification: identity claim, test pass-phrase (“Open Sesame”), speaker verifier, scores.]

Fig. 15.2. An example of speaker authentication system design: Combining verbal information verification with speaker verification

There are several advantages to the combined system. First, the approach is convenient to users since it does not need a formal enrollment session and a user can start to use the system right after his/her account is opened. Second, the acoustic mismatch problem is mitigated since the training data are from different sessions, potentially via different handsets and channels. Third, the quality of the training data is ensured since the training phrases are verified
by VIV before establishing the SD HMM for the pass-phrase. Finally, once the system switches to SV, it would be difficult for an impostor to access the account even if the impostor knows the true speaker’s pass-phrase.

15.3 Fixed-Phrase Speaker Verification

The details of a fixed-phrase SV system can be found in Chapter 9. A block diagram of the test session used in our evaluation is shown in Fig. 15.3. After the speaker claims the identity, the system expects the same pass-phrase obtained in the training session. First, a speaker-independent phone recognizer is applied to find the endpoints by forced alignment. Then, cepstral mean subtraction (CMS) is conducted to reduce the acoustic mismatch. In general, to improve SV performance and robustness, a general stochastic matching algorithm as discussed in Chapter 10 and the endpoint detection algorithm as discussed in Chapter 5 can also be applied.

[Figure: block diagram. Inputs: identity claim, phoneme transcription, feature vectors. Blocks: forced alignment, cepstral mean subtraction, target score computation L(O, Λt) using the speaker-dependent model from the database, and background score computation L(O, Λb) using the speaker-independent phoneme (background) models. The background score is subtracted from the target score and the result is compared with a threshold to give the decision.]

Fig. 15.3. A fixed-phrase speaker verification system

In the block of target score computation of Fig. 15.3, the feature vectors are decoded into states by the Viterbi algorithm, using the whole-phrase model trained by the VIV-verified utterances. A log-likelihood score for the target model, i.e. the target score, is calculated as

\[ L(O, \Lambda_t) = \frac{1}{N_f} \log P(O \mid \Lambda_t), \tag{15.1} \]
where O is a set of feature vectors, Nf is the total number of vectors, Λt is the target model, and P(O|Λt) is the likelihood score from the Viterbi decoding.

In the block of the background score computation, a set of speaker-independent (SI) HMM’s in the order of the transcribed phoneme sequence, Λb = {λ1, ..., λK}, is applied to align an input utterance with the expected transcription using the Viterbi decoding algorithm. The segmented utterance is O = {O1, ..., OK}, where Oi is the set of feature vectors corresponding to the ith phoneme, Si, in the phoneme sequence.

There are different ways to compute the likelihood score for the background (alternative) model. Here, we apply the background score proposed in [3],

\[ L(O, \Lambda_b) = \frac{1}{N_f} \sum_{i=1}^{K} \log P(O_i \mid \lambda_{b_i}), \tag{15.2} \]
where \(\Lambda_b = \{\lambda_{b_i}\}_{i=1}^{K}\) is the set of SI phoneme models in the order of the transcribed phoneme sequence, \(P(O_i \mid \lambda_{b_i})\) is the corresponding phoneme likelihood score, and K is the total number of phonemes. The SI models are trained from a different database by the EM algorithm [3]. In real implementations, the SI model can be the same one as used in VIV.

The target and background scores are then used in the following likelihood-ratio test [3]:

\[ R(O; \Lambda_t, \Lambda_b) = L(O, \Lambda_t) - L(O, \Lambda_b), \tag{15.3} \]

where L(O, Λt) and L(O, Λb) are defined in Eqs. (15.1) and (15.2), respectively. A final decision on rejection or acceptance is made by comparing R in Eq. (15.3) with a threshold. As pointed out in [3], if a significantly different phrase is given, the phrase could be rejected by the SI phoneme alignment before using the verifier.
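Once the Viterbi log-likelihoods are available, the scoring in Eqs. (15.1) to (15.3) reduces to a few arithmetic steps. The sketch below assumes those log-likelihoods are supplied by a decoder that is not shown; the function name and the numbers in the usage line are hypothetical.

```python
def likelihood_ratio_decision(target_loglik, phoneme_logliks, n_frames, threshold):
    """Frame-normalized log-likelihood-ratio test.

    target_loglik:   log P(O | target model) for the whole phrase.
    phoneme_logliks: log P(O_i | background phoneme model i), one per phoneme.
    n_frames:        total number of feature vectors N_f in the utterance.
    Returns (R, accepted).
    """
    target = target_loglik / n_frames                # Eq. (15.1)
    background = sum(phoneme_logliks) / n_frames     # Eq. (15.2)
    ratio = target - background                      # Eq. (15.3)
    return ratio, ratio > threshold


# Hypothetical decoder outputs for a 100-frame utterance with three phonemes.
ratio, accepted = likelihood_ratio_decision(
    target_loglik=-1000.0,
    phoneme_logliks=[-400.0, -350.0, -300.0],
    n_frames=100,
    threshold=0.2,
)
```

Dividing both scores by the same frame count keeps the ratio comparable across utterances of different lengths, so a single threshold can be applied to all of them.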
15.4 Experiments

In this section, we conduct experiments to verify the design of the speaker authentication system.

15.4.1 Features and Database

The feature vector for SV is composed of 12 cepstral and 12 delta-cepstral coefficients since it is not necessary to use the 39 features for SV. The cepstrum is derived from a 10th order LPC analysis over a 30 ms window and the feature vectors are updated at 10 ms intervals [3].

The experimental database consists of fixed-phrase utterances recorded over the long distance telephone network by 100 speakers, 51 male and 49 female. The fixed phrase, common to all speakers, is “I pledge allegiance to the flag” with an average length of two seconds. We assume the fixed phrase is one of the verified utterances in VIV. Five utterances of the pass-phrase recorded from five separate VIV sessions are used to train an SD HMM; thus the training data are collected from different acoustic environments and telephone channels at different times. We assume all the collected utterances have been verified by VIV to ensure the quality of the training data.
For testing, we used 40 utterances recorded from a true speaker in different sessions, and 192 utterances recorded from 50 impostors of the same gender in different sessions. For model adaptation, the second, fourth, sixth, and eighth test utterances from the tested true speaker are used to update the associated HMM for verifying subsequent test utterances incrementally [3].

The SD target models for the phrases are left-to-right HMM’s. The number of states depends on the total number of phonemes in the phrases. There are four Gaussian components associated with each state [3]. The background models are concatenated SI phone HMM’s trained on a telephone speech database from different speakers and texts [4]. There are 43 phoneme HMM’s, and each model has three states with 32 Gaussian components associated with each state. Due to unreliable variance estimates from a limited amount of speaker-specific training data, a global variance estimate was used as the common variance for all Gaussian components in the target models [3].

15.4.2 Experimental Results on Using VIV for SV Enrollment

In Chapter 14, we reported the experimental results of VIV on 100 speakers. The system had 0% error rates when three questions were tested by sequential utterance verification. Since we were using a pre-verified database, we assume that all the training utterances collected by VIV are correct. The results show improvement when reducing the acoustic mismatch by using VIV for enrollment.

The SV experimental results without and with adaptation are listed in Table 15.1 and Table 15.2, respectively, for the 100 speakers. The numbers are average individual EER’s in percent. The first data column lists the EER’s using individual thresholds and the second data column lists the EER’s using common (pooled) thresholds for all tested speakers.

The baseline system is the conventional SV system in which a single enrollment session is used. The proposed system is the combined system in which VIV is used for the automatic enrollment for SV. After the VIV system is used five times, collecting training utterances from five different sessions, it then switches over to an SV system. The test utterances for both the baseline and the proposed system are the same.

Without adaptation, the baseline system has an EER of 3.03% and 4.96% for individual and pooled thresholds respectively, while the proposed system has an EER of 1.59% and 2.89% respectively. With adaptation as defined in the last subsection, the baseline system has an EER of 2.15% and 3.12%, while the proposed system has an EER of 1.20% and 1.83%, respectively. The proposed system without adaptation has an even lower EER than the baseline system with adaptation. This is because the SD models in the proposed system were trained using the data from different sessions while the baseline system just performs an incremental adaptation without reconstructing the models after collecting more data.
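The difference between individual and pooled thresholds can be illustrated with a small equal-error-rate routine. This is a sketch under assumptions, not the evaluation code behind the reported numbers: the EER is approximated at the candidate threshold where FR and FA are closest, and "pooled" is taken to mean one common threshold chosen over the scores of all speakers together. All names and toy scores are hypothetical.

```python
def equal_error_rate(true_scores, impostor_scores):
    """Approximate EER: sweep observed scores as thresholds, accept if score >= t."""
    best_gap, best_eer = None, None
    for t in sorted(set(true_scores) | set(impostor_scores)):
        fr = sum(s < t for s in true_scores) / len(true_scores)
        fa = sum(s >= t for s in impostor_scores) / len(impostor_scores)
        gap = abs(fr - fa)
        if best_gap is None or gap < best_gap:
            best_gap, best_eer = gap, (fr + fa) / 2
    return best_eer


def average_individual_eer(per_speaker):
    """Mean of per-speaker EERs, each with its own best threshold."""
    eers = [equal_error_rate(t, i) for t, i in per_speaker]
    return sum(eers) / len(eers)


def pooled_eer(per_speaker):
    """EER with one common threshold over all speakers' scores."""
    all_true = [s for t, _ in per_speaker for s in t]
    all_imp = [s for _, i in per_speaker for s in i]
    return equal_error_rate(all_true, all_imp)


# Toy scores: each speaker is perfectly separable alone, but on a different scale.
speakers = [([2.0, 3.0], [0.0, 1.0]), ([12.0, 13.0], [10.0, 11.0])]
individual = average_individual_eer(speakers)
pooled = pooled_eer(speakers)
```

In this toy case the individual EER is 0 while the pooled EER is 0.5, which shows why a common threshold generally gives higher error rates when score ranges differ between speakers.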
Table 15.1. Experimental Results without Adaptation in Average Equal-Error Rates

Algorithms           Individual Thresholds   Pooled Thresholds
SV (baseline)        3.03%                   4.96%
VIV+SV (proposed)    1.59%                   2.89%

Table 15.2. Experimental Results with Adaptation in Average Equal-Error Rates

Algorithms           Individual Thresholds   Pooled Thresholds
SV (baseline)        2.15%                   3.12%
VIV+SV (proposed)    1.20%                   1.83%
The experimental results indicate several advantages of the proposed system design method. First, since VIV can provide training data from different sessions representing different channel environments, the system performs significantly better than one trained on a single session. Second, although we can adapt the models originally trained on the data collected in one session, the proposed system still does better. This is because a new model constructed from multi-session training data is more accurate than one obtained by incremental adaptation using the same multi-session data. Lastly, in real-world applications, all the utterances used in training and adaptation can be verified by VIV before training or adaptation. Although this advantage cannot be observed in this database evaluation, it is critical in any real-world application, since even a true speaker may make a mistake while uttering a pass-phrase. Such a mistake cannot be corrected once the utterance has been used in model training or adaptation. VIV can protect the system from using incorrect training data.

In this section, we proposed only one configuration for combining VIV with SV. For different applications, different combinations of speaker authentication techniques can be integrated to meet different specifications. For example, VIV can be employed in SV to verify a user before the user's data is used for SD model adaptation, or the VIV and SV systems can share the same set of speaker-independent models, and the decoding scores from VIV can be used in SV as the background score.
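The control flow of the combined system can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `Account`, `train_sd_model`, `authenticate`, and the two verifier callbacks are hypothetical names, and model training is replaced by a stand-in that only records how many sessions were used. The value of five sessions follows the experimental setup described above.

```python
from dataclasses import dataclass, field

REQUIRED_SESSIONS = 5  # the experiment switches to SV after five VIV sessions


@dataclass
class Account:
    training_utterances: list = field(default_factory=list)
    sd_model: object = None  # None until a speaker-dependent model exists


def train_sd_model(utterances):
    """Hypothetical stand-in for SD HMM training on multi-session data."""
    return {"num_sessions": len(utterances)}


def authenticate(account, utterance, viv_verify, sv_verify):
    """Handle one access attempt; return (mode used, accepted?).

    viv_verify(utterance) and sv_verify(model, utterance) are assumed
    callbacks that return True when the claim is accepted.
    """
    if account.sd_model is not None:
        # After migration, the pass-phrase is checked against the SD model.
        return "SV", sv_verify(account.sd_model, utterance)

    accepted = viv_verify(utterance)
    if accepted:
        # Only utterances verified by VIV are kept as training data.
        account.training_utterances.append(utterance)
        if len(account.training_utterances) >= REQUIRED_SESSIONS:
            account.sd_model = train_sd_model(account.training_utterances)
    return "VIV", accepted


# Toy run: one attempt fails VIV and must not reach the training set.
account = Account()
attempts = ["s1", "s2", "wrong", "s3", "s4", "s5", "s6"]
modes = [
    authenticate(account, u, lambda x: x != "wrong", lambda m, x: True)[0]
    for u in attempts
]
```

The rejected attempt is served by VIV but contributes nothing to training, and the first attempt after the fifth verified session is handled by SV.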
15.5 Conclusions

In this chapter, we presented a design example for a speaker authentication system. To improve both user convenience and system performance, we combined verbal information verification (VIV) and speaker verification (SV) to construct a convenient speaker authentication system. In the system, VIV is used to verify users in the first few accesses. Simultaneously, the system collects verified training data for constructing speaker-dependent models. Later, the system migrates from a VIV system to an SV system for authentication. The combined
system is convenient to users since they can start to use the system without going through a formal enrollment session and waiting for model training. However, it is still the user's responsibility to protect his or her personal information from impostors until the speaker-dependent model is trained and the system has migrated to an SV system. After the migration, an impostor would have difficulty accessing the account even if the pass-phrase is known. Since the training data can be collected from different channels in different VIV sessions, the acoustic mismatch problem is mitigated, potentially leading to better system performance in test sessions. The speaker-dependent HMM's can be updated to cover different acoustic environments while the system is in use to further improve the system performance. Our experiments have shown that the combined speaker authentication system reduces the SV equal-error rate by more than 40% relative to a conventional SV system just by mitigating the acoustic mismatch. Furthermore, VIV can be used to verify the training data for SV.

In order to design a real-world speaker authentication system, it may be necessary to combine several speaker authentication techniques because each technique has its own advantages and limitations. If we combine them through a careful design, the combined system is more likely to meet the design specifications and be more useful in real-world applications.
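The "more than 40%" figure can be checked directly against Tables 15.1 and 15.2 as a relative reduction in EER. The short sketch below only redoes that arithmetic; the function and variable names are illustrative.

```python
def relative_reduction(baseline_eer, proposed_eer):
    """Relative EER reduction of the proposed system, in percent."""
    return 100.0 * (baseline_eer - proposed_eer) / baseline_eer


# (baseline EER %, proposed EER %) taken from Tables 15.1 and 15.2
results = {
    ("without adaptation", "individual"): (3.03, 1.59),
    ("without adaptation", "pooled"): (4.96, 2.89),
    ("with adaptation", "individual"): (2.15, 1.20),
    ("with adaptation", "pooled"): (3.12, 1.83),
}

reductions = {
    key: relative_reduction(base, prop) for key, (base, prop) in results.items()
}
# Rounded to one decimal: 47.5, 41.7, 44.2, and 41.3 percent.
```

All four conditions give a relative reduction above 40%, with the smallest (about 41.3%) occurring for pooled thresholds with adaptation.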
References

1. Li, Q. and Juang, B.-H., "Speaker verification using verbal information verification for automatic enrollment," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Seattle), May 1998.
2. Li, Q., Juang, B.-H., Zhou, Q., and Lee, C.-H., "Automatic verbal information verification for user authentication," IEEE Trans. on Speech and Audio Processing, vol. 8, pp. 585–596, Sept. 2000.
3. Parthasarathy, S. and Rosenberg, A. E., "General phrase speaker verification using sub-word background models and likelihood-ratio scoring," in Proceedings of ICSLP-96, (Philadelphia), October 1996.
4. Rosenberg, A. E. and Parthasarathy, S., "Speaker background models for connected digit password speaker verification," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, (Atlanta), pp. 81–84, May 1996.